# SPLIT

This notebook provides an example of how to use SML to read in a dataset and split it into training and testing data. For this use case we use the publicly available [iris dataset](http://archive.ics.uci.edu/ml/datasets/Iris), a dataset used to predict the class of iris plants.

### Imports

We import the necessary library to use SML.

In [1]:
from sml import execute

### Query

Next we create a query statement to `READ` the iris dataset and perform an 80%/20% `SPLIT` on it for the training and testing sets respectively. Lastly, we execute the statement.

In [2]:
query = 'READ "../data/iris.csv" AND \
         SPLIT (train = .8, test = 0.2)'
execute(query)





## Manually

The cells below show how the same actions as the SML query can be performed manually.

### IMPORTS

We begin by importing the libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import numpy as np

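# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize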


### READ
Next we read the dataset into a pandas dataframe. We pass `header=0` so that the first row of the file is used for the column names.

In [4]:
data = pd.read_csv('../data/iris.data', header=0)
data[:5]  # Show first 5 rows of data

|   | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
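
If the file has no header row, as is the case for the raw `iris.data` file distributed by UCI, the column names can be supplied explicitly instead. The following is a minimal sketch rather than part of the original notebook: the two rows are inlined so the example is self-contained, and the names `raw`, `columns`, and `sample` are introduced here for illustration only.

```python
import io

import pandas as pd

# Two rows in the headerless layout of the raw iris data (inlined for this sketch).
raw = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

# Column names supplied by hand; they mirror the names shown in the output above.
columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']

# header=None tells pandas that the first row is data, not column names.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample)
```

With `header=None`, pandas keeps every row as data and labels the columns with the values passed in `names`.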


### SPLIT

Next we separate the features from the labels. We then split the dataset, using 75% of it for the training set and 25% for the testing set.

#### Splitting data into Training and Testing Datasets

It's important to note here that if you want to generate ROC curves, or perhaps lattice plots with class information, you'll need to manually binarize the labeled data. You'll see this performed in subsequent tutorials.

In [8]:

# Drop the label column to get the feature matrix; keep 'species' as the labels
features = np.c_[data.drop('species', axis=1).values]
labels = data['species']

(X_train, X_test, y_train, y_test) = train_test_split(features, labels, test_size=0.25)
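
The binarization mentioned in the note above can be sketched with scikit-learn's `label_binarize`. This is only an illustration under stated assumptions: `sample_labels` and `classes` are hypothetical names, and the three labels are a hand-picked sample rather than the notebook's actual `labels` series.

```python
from sklearn.preprocessing import label_binarize

# Hypothetical sample of labels; in the notebook this would be `labels` or `y_train`.
sample_labels = ['Iris-setosa', 'Iris-virginica', 'Iris-versicolor']

# The column order of the result follows the order of `classes`.
classes = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Each row becomes a one-vs-rest indicator vector, with one column per class.
binarized = label_binarize(sample_labels, classes=classes)
print(binarized)
```

Each row contains a single 1 in the column of its class, which is the form typically needed for per-class ROC curves.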

#### Training Data: Features

In [9]:
X_train[:5]

array([[ 5.5,  4.2,  1.4,  0.2],
       [ 7.7,  3. ,  6.1,  2.3],
       [ 4.7,  3.2,  1.6,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 5.5,  2.3,  4. ,  1.3]])

#### Training Data: Labels

In [16]:
y_train[:5]

33         Iris-setosa
135     Iris-virginica
29         Iris-setosa
5          Iris-setosa
53     Iris-versicolor
Name: species, dtype: object

#### Testing Data: Features

In [17]:
X_test[:5]

array([[ 6. ,  2.2,  5. ,  1.5],
       [ 6.5,  3. ,  5.5,  1.8],
       [ 6.3,  3.3,  6. ,  2.5],
       [ 4.6,  3.2,  1.4,  0.2],
       [ 7.7,  2.8,  6.7,  2. ]])

#### Testing Data: Labels

In [18]:
y_test[:5]

119    Iris-virginica
116    Iris-virginica
100    Iris-virginica
47        Iris-setosa
122    Iris-virginica
Name: species, dtype: object
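
The manual split above uses 75%/25%, while the SML query requests 80%/20%. As a rough sketch of how the manual version could match the query, `test_size=0.2` can be passed to `train_test_split`. The `random_state` and `stratify` arguments are assumptions added here for repeatability and balanced classes; they are not part of the original notebook, and the random stand-in features only mimic the shape of the iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as iris: 150 rows, 4 features, 3 balanced classes.
rng = np.random.default_rng(0)
features = rng.random((150, 4))
labels = np.repeat(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], 50)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels,
    test_size=0.2,     # 80%/20%, as in the SML query
    random_state=42,   # assumption: fixed seed so the split is repeatable
    stratify=labels,   # assumption: keep class proportions equal in both sets
)

# Inspect the sizes of the resulting sets and the class balance of the test set.
print(X_train.shape, X_test.shape)
print(np.unique(y_test, return_counts=True))
```

Fixing the seed makes the rows shown in each cell stable across runs, which is otherwise not the case for the outputs above.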