Getting Started on MaTEx Other ML Algorithms

abhinavvishnu edited this page Feb 20, 2017 · 3 revisions

MaTEx supports other ML algorithms such as Support Vector Machines (SVM), which it implements with the Sequential Minimal Optimization (SMO) algorithm. MaTEx also provides MPI-based implementations of k-means clustering and Association Rule Mining (ARM).

Download

 git clone https://github.com/abhinavvishnu/matex

Installation

 mkdir build
 cd build
 ../configure && make -j 4 && make install

Dataset Layout

The other ML algorithms in MaTEx read datasets in a sparse data format. A dataset is expected to have one sample (vector) per line, with the fields on that line separated by , or : as shown in the layouts below.

Sample Data Layout for Classification Algorithms

Datasets for classification algorithms are expected to follow the libsvm sparse data format, in which each line looks as follows:

 class col1:val1 col2:val2 col3:val3

The following format is also acceptable (equivalent to CSV format):

 class,col1,val1,col2,val2

Empty lines are ignored. Each sample may contain an arbitrary number of column/value pairs.
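To make the two classification layouts concrete, here is a minimal Python sketch of a reader for a single line. It is not part of MaTEx: the function name parse_classification_line is hypothetical, and treating the class as a number and the columns as integers is an assumption made for illustration.

```python
def parse_classification_line(line):
    """Parse one sample into (label, {column: value}).

    Returns None for an empty line, since empty lines are ignored.
    Assumption: labels are numeric and column ids are integers.
    """
    line = line.strip()
    if not line:
        return None
    if ":" in line:
        # libsvm style: class col1:val1 col2:val2
        tokens = line.split()
        label = tokens[0]
        pairs = [tok.split(":", 1) for tok in tokens[1:]]
    else:
        # CSV style: class,col1,val1,col2,val2
        tokens = [tok.strip() for tok in line.split(",")]
        label = tokens[0]
        rest = tokens[1:]
        if len(rest) % 2 != 0:
            raise ValueError("every column needs a value: " + line)
        pairs = list(zip(rest[0::2], rest[1::2]))
    features = {int(col): float(val) for col, val in pairs}
    return float(label), features


if __name__ == "__main__":
    print(parse_classification_line("1 3:0.5 7:2"))
    print(parse_classification_line("-1,3,0.5,7,2"))
```

Both calls describe the same kind of sample: a class followed by the non-zero columns and their values.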

Sample Data Layout for Clustering Algorithms

Datasets for clustering algorithms are expected to follow a sparse data format. Since datasets for clustering have no class variable, each line looks as follows:

 col1:val1 col2:val2 col3:val3

The following format is also acceptable (equivalent to CSV format):

 col1,val1,col2,val2

The dataset may contain an arbitrary number of spaces.
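If the data starts out as dense rows, it has to be rewritten into one of the sparse clustering layouts first. The sketch below shows one way to do that in Python. The helper to_sparse_line is hypothetical, and two choices in it are assumptions rather than documented MaTEx behavior: column numbers start at 1 (as in libsvm), and zero-valued entries are omitted.

```python
def to_sparse_line(row, separator=":"):
    """Format a dense row as one sparse line without a class column.

    Assumptions: columns are numbered from 1 and zeros are dropped.
    separator=":" gives "col1:val1 col2:val2";
    separator="," gives "col1,val1,col2,val2".
    """
    entries = [(i + 1, v) for i, v in enumerate(row) if v != 0]
    if separator == ":":
        return " ".join(f"{col}:{val}" for col, val in entries)
    return ",".join(f"{col},{val}" for col, val in entries)


if __name__ == "__main__":
    dense = [[0.0, 1.5, 0.0, 2.0], [3.0, 0.0, 0.0, 0.0]]
    for row in dense:
        print(to_sparse_line(row))
```

Writing one such line per sample produces a file in the first clustering layout; passing separator="," produces the CSV-style variant.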

Sample Data Layout for Association Rule Mining (ARM) Algorithms

Datasets for ARM are expected to follow a sparse data format. ARM datasets have no value associated with a column, so each line lists only the columns:

 col1 col2 col3 col4

The following format is also acceptable (equivalent to CSV format):

 col1,col2,col3,col4

The dataset may contain an arbitrary number of spaces.
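As an illustration of the ARM layouts, the following Python sketch reads a line in either form into a list of columns. The names parse_transaction and read_transactions are hypothetical and not part of MaTEx. Columns are kept as strings because the page does not state their type, and skipping empty lines here is an assumption carried over from the classification layout.

```python
import re


def parse_transaction(line):
    """Split one ARM line into its columns.

    Accepts "col1 col2 col3" and "col1,col2,col3"; runs of spaces
    between columns are tolerated.
    """
    return [tok for tok in re.split(r"[,\s]+", line.strip()) if tok]


def read_transactions(lines):
    """Parse every non-empty line; empty lines are skipped (assumption)."""
    parsed = (parse_transaction(line) for line in lines)
    return [items for items in parsed if items]


if __name__ == "__main__":
    sample = ["1  4 9", "", "2,4,7,9"]
    print(read_transactions(sample))
```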

Running MaTEx algorithms

Each algorithm requires different parameters:

SVM Example

SVM requires a training set and a testing set in the libsvm format (see above). The hyperparameters C and sigmasqr must be provided as well. For example, to run SVM with 16 processes on the adult training set (a9a) and testing set (a9a.t), with C = 32 and sigmasqr = 64:

 mpirun -np 16 ./smo a9a a9a.t 32 64