# SPLIT

This notebook shows how to use SML to read in a dataset and split it into training and testing data. For this use case we use the publicly available [Wine Data Set](https://archive.ics.uci.edu/ml/datasets/Wine).

## SML Query
### Imports
We import the library needed to use SML.

In [2]:
from sml import execute

### Query
Next we create a query statement that uses `READ` to load the data, specifying that the file is delimited by ';' and that the header is not used. The query then uses `SPLIT` to divide the dataset, with 80% used for training and 20% for testing.

In [5]:
query = 'READ "../data/wine.csv" (separator = ";", header = 0) AND SPLIT (train = .8, test = 0.2)'

execute(query)



## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [6]:
import pandas as pd
import numpy as np

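# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split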


### Read

By default the Wine dataset does not include its headers, so we specify them manually and read the file into a pandas DataFrame.

In [7]:
# Names for the 13 feature columns of the Wine dataset.
header = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
          'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
          'OD280/OD315 of diluted wines', 'Proline']
df = pd.read_csv('../data/wine.csv', sep=',', names=header)

df.head()

| | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
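
In the output above, the class label (1, 2 or 3) appears in the index position rather than as a named column, which is what pandas does when the data has one more column than the list of names. As a minimal sketch of an alternative, the class label can be given its own name so that it is kept as a regular column. The name `Class`, the variable `df_with_class`, and the in-memory `sample` (two rows copied from the output above, standing in for `../data/wine.csv`) are assumptions made for illustration and are not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the output above, used in place of ../data/wine.csv
# so that this sketch runs on its own.
sample = io.StringIO(
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
)

feature_names = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
                 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
                 'Proanthocyanins', 'Color intensity', 'Hue',
                 'OD280/OD315 of diluted wines', 'Proline']

# 'Class' is a hypothetical name for the first column (the wine class label).
columns = ['Class'] + feature_names

# With 14 names for 14 columns, pandas keeps a default integer index
# and the class label becomes an ordinary column.
df_with_class = pd.read_csv(sample, sep=',', names=columns)

print(df_with_class.shape)
print(df_with_class['Class'].tolist())
```

Keeping the label as a named column can make it easier to separate features from labels later, at the cost of differing from the layout shown in this notebook.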


### SPLIT
Next, we use scikit-learn's `train_test_split` function to split the data, with 80% used for training and 20% for testing.

In [8]:
X_train, X_test = train_test_split(df, train_size=0.8, test_size=0.2)
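
Because `train_test_split` shuffles the rows before splitting, the call above selects different rows on each run. The following sketch shows one way to make the split repeatable by passing `random_state`, and checks the resulting sizes. The toy DataFrame `toy` and the seed value 42 are assumptions chosen for illustration. They stand in for the wine DataFrame and are not taken from the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the wine DataFrame `df`.
toy = pd.DataFrame({'feature': range(10), 'label': [1, 2] * 5})

# Passing the same random_state fixes the shuffle, so both calls
# return the same rows in the same order.
train_a, test_a = train_test_split(toy, train_size=0.8, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, train_size=0.8, test_size=0.2, random_state=42)

# 80% of 10 rows go to training and 20% to testing.
print(len(train_a), len(test_a))

# The two training sets contain identical row indices.
print(train_a.index.equals(train_b.index))

# No row appears in both the training and the testing set.
print(set(train_a.index).isdisjoint(test_a.index))
```

The same `random_state` argument could be added to the call on `df` above if repeatable output is wanted.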

### Training Data:

In [9]:
X_train[:5]

| | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 11.84 | 2.89 | 2.23 | 18.0 | 112 | 1.72 | 1.32 | 0.43 | 0.95 | 2.65 | 0.96 | 2.52 | 500 |
| 1 | 14.06 | 1.63 | 2.28 | 16.0 | 126 | 3.0 | 3.17 | 0.24 | 2.1 | 5.65 | 1.09 | 3.71 | 780 |
| 2 | 12.08 | 1.33 | 2.3 | 23.6 | 70 | 2.2 | 1.59 | 0.42 | 1.38 | 1.74 | 1.07 | 3.21 | 625 |
| 2 | 13.67 | 1.25 | 1.92 | 18.0 | 94 | 2.1 | 1.79 | 0.32 | 0.73 | 3.8 | 1.23 | 2.46 | 630 |
| 1 | 13.87 | 1.9 | 2.8 | 19.4 | 107 | 2.95 | 2.97 | 0.37 | 1.76 | 4.5 | 1.25 | 3.4 | 915 |


### Testing Data:

In [10]:
X_test[:5]

| | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.2 | 1.76 | 2.45 | 15.2 | 112 | 3.27 | 3.39 | 0.34 | 1.97 | 6.75 | 1.05 | 2.85 | 1450 |
| 3 | 12.25 | 4.72 | 2.54 | 21.0 | 89 | 1.38 | 0.47 | 0.53 | 0.8 | 3.85 | 0.75 | 1.27 | 720 |
| 3 | 14.16 | 2.51 | 2.48 | 20.0 | 91 | 1.68 | 0.7 | 0.44 | 1.24 | 9.7 | 0.62 | 1.71 | 660 |
| 2 | 12.51 | 1.73 | 1.98 | 20.5 | 85 | 2.2 | 1.92 | 0.32 | 1.48 | 2.94 | 1.04 | 3.57 | 672 |
| 2 | 11.03 | 1.51 | 2.2 | 21.5 | 85 | 2.46 | 2.17 | 0.52 | 2.01 | 1.9 | 1.71 | 2.87 | 407 |
