# Support Vector Machine (SVM)


This notebook provides an example of how to use SML to read in a dataset, split the dataset into training and testing data, and perform classification on the dataset. For this use case we use the publicly available [iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris) to predict the class of iris plants.

## SML Query

### Imports

We import the necessary library to use SML.

In [1]:
from sml import execute

### Query

Next we create a query statement to `READ` the iris dataset, perform an 80%/20% `SPLIT` of the dataset into training and testing sets respectively, and `CLASSIFY` with the SVM algorithm, predicting the 5th column of the dataset from columns 1-4 as the features. Lastly, we execute the statement.

In [2]:
query = 'READ "../data/iris.csv" AND \
 SPLIT (train = .8, test = 0.2) AND \
 CLASSIFY (predictors = [1,2,3,4], label = 5, algorithm = svm)'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/iris.csv
   Delimiter:      None
   Training Set Split:       80.00%
   Testing Set Split:        20.00%
   Predictiors:        ['1', '2', '3', '4']
   Label:         5
   Algorithm:     svm
   Dataset Preview:
     0    1    2    3  4
0  5.1  3.5  1.4  0.2  0
1  4.9  3.0  1.4  0.2  0
2  4.7  3.2  1.3  0.2  0
3  4.6  3.1  1.5  0.2  0
4  5.0  3.6  1.4  0.2  0




## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports

We begin by importing the libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import label_binarize
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
import sklearn.model_selection as cv
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### READ
 
Next we read the dataset into a pandas DataFrame. By default this dataset does not contain a header, so we specify the column names manually.

In [4]:
names = ['sepal length(cm)', 'sepal width(cm)', 'petal length(cm)', 'petal width(cm)', 'species']
data = pd.read_csv('../data/iris.csv', names=names)
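To illustrate why the column names are passed explicitly, here is a minimal sketch using an in-memory sample instead of `../data/iris.csv`. The two sample rows and the `raw` variable are assumptions made for illustration; they only mimic a headerless file with the same five columns.

```python
import io

import pandas as pd

# Hypothetical two-row sample in a headerless, five-column layout.
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

names = ['sepal length(cm)', 'sepal width(cm)', 'petal length(cm)',
         'petal width(cm)', 'species']

# Without names=, pandas treats the first data row as the header.
without_names = pd.read_csv(io.StringIO(raw))

# With names=, every row is kept as data and the columns are labeled.
with_names = pd.read_csv(io.StringIO(raw), names=names)

print(len(without_names), len(with_names))
print(list(with_names.columns))
```

Passing `names` keeps every row as data, which matters here because the first line of a headerless file is a real observation.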

### Preprocessing

Next we separate the features from the labels and identify all of the classes in the iris dataset. Lastly, we binarize the labels so that it is possible to generate metrics such as ROC curves.

In [5]:
iris_classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
features = data.drop(columns='species').values  # positional axis argument is rejected by recent pandas
labels = label_binarize(data['species'], classes=iris_classes)

n_classes = labels.shape[1]
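As a small illustration of what binarizing the labels produces, the sketch below applies `label_binarize` to three hand-picked species names. The `sample` list is an assumption made for the example and is not taken from the dataset file.

```python
from sklearn.preprocessing import label_binarize

iris_classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Hypothetical labels, one per class, in a shuffled order.
sample = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor']

# Each row becomes a one-hot vector; column i corresponds to iris_classes[i].
encoded = label_binarize(sample, classes=iris_classes)
print(encoded)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```

Each column can then be treated as its own binary problem, which is what a per-class ROC curve needs.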

### SPLIT

We then split the dataset, using 75% of it for the training set and 25% for the testing set. Note that this differs from the 80%/20% split used in the SML query above.

In [6]:
(x_train, x_test, y_train, y_test) = cv.train_test_split(features, labels, test_size=0.25)
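The following sketch shows how a 25% test split divides 150 rows, the size of the iris dataset. It assumes a scikit-learn version that provides `sklearn.model_selection`, uses placeholder arrays rather than the real data, and adds `random_state=0` (not part of the cell above) so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the iris features (150 rows, 4 columns).
features = np.arange(150 * 4).reshape(150, 4)
labels = np.arange(150) % 3

# random_state is an addition for reproducibility; omit it for a fresh shuffle each run.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0)

# The test set size is rounded up, so 25% of 150 rows gives 38 test rows.
print(x_train.shape, x_test.shape)  # (112, 4) (38, 4)
```

Without a fixed `random_state`, each run shuffles differently, so the reported accuracy will vary between runs.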

### CLASSIFY
#### Training

Next we create a linear SVM model, wrapped in a one-vs-rest classifier, and fit it on the training data.

In [7]:
svm = OneVsRestClassifier(SVC(kernel='linear', probability=True))
model = svm.fit(x_train, y_train)
print('Accuracy:', model.score(x_test, y_test))

Accuracy: 0.631578947368
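Because the labels were binarized, `score` here most likely reports subset accuracy: a test row counts as correct only when all three binary classifiers match its one-hot label, which can make the figure look low. A per-class ROC AUC is one complementary view. The sketch below is an illustration under assumptions, not the notebook's own evaluation: it uses scikit-learn's bundled `load_iris` copy (integer labels 0-2) in place of `../data/iris.csv`, imports from `sklearn.model_selection`, and fixes `random_state=0`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import SVC

# Stand-in for the CSV file: the bundled iris data with integer class labels.
iris = load_iris()
labels = label_binarize(iris.target, classes=[0, 1, 2])
n_classes = labels.shape[1]

x_train, x_test, y_train, y_test = train_test_split(
    iris.data, labels, test_size=0.25, random_state=0)

model = OneVsRestClassifier(SVC(kernel='linear', probability=True))
model.fit(x_train, y_train)

# decision_function returns one column of scores per class.
scores = model.decision_function(x_test)

# Compute one ROC curve, and its area, for each class column.
roc_auc = {}
for i in range(n_classes):
    fpr, tpr, _ = roc_curve(y_test[:, i], scores[:, i])
    roc_auc[i] = auc(fpr, tpr)

print(roc_auc)
```

The `fpr` and `tpr` arrays from each iteration can be passed to `plt.plot` to draw the corresponding curve.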
