# SPLIT

This notebook provides an example of how to use SML to read in a dataset and split it into training and testing sets. For this use case we use the publicly available [Wine Data Set](https://archive.ics.uci.edu/ml/datasets/Wine).

## SML Query
### Imports
We import the necessary library to use SML.

In [1]:
from sml import execute

### Query
Next we create a query statement. `READ` loads the data, treating the file as delimited by `;` and not using the header. `SPLIT` then assigns 80% of the dataset to training and 20% to testing.

In [2]:
query = 'READ "../data/wine.csv" (separator = ";", header = 0) AND SPLIT (train = .8, test = 0.2)'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/wine.csv
   Delimiter:      ;
   Training Set Split:       80.00%
   Testing Set Split:        20.00%
   Predictiors:        None
   Label:         None
   Algorithm:     None
   Dataset Preview:
   1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
0                                                148               
1                                                168               
2                                                 31               
3                                                 81               
4                                                110               
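The preview above shows each row under a single column whose name is the entire first line. This may indicate that `;` is not the delimiter actually used in the file (the manual read later in this notebook uses `,`). As a quick check, the standard library's `csv.Sniffer` can guess the delimiter from a few sample lines. This is only a sketch: the `sample` string below is copied from the rows shown in this notebook and stands in for the first lines of `../data/wine.csv`.

```python
import csv

# Rows copied from the previews in this notebook; in practice these would be
# the first few lines read from ../data/wine.csv.
sample = (
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
    "1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185\n"
)

# Restrict the guess to the two candidate delimiters in question.
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print(repr(dialect.delimiter))
```

If the detected delimiter differs from the one in the query, adjusting the `separator` value is probably the first thing to try.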




## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import numpy as np

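# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split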


### Read

By default, the Wine dataset does not include its headers, so we specify them manually and read the file into a pandas DataFrame.

In [4]:
# Feature names for the Wine dataset (the file itself has no header row).
header = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols',
          'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue',
          'OD280/OD315 of diluted wines', 'Proline']
df = pd.read_csv('../data/wine.csv', sep=',', names=header)

df.head()

| Unnamed: 0 | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
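Each row of the file holds 14 values while `header` lists 13 names. The extra leading value is the class label (1, 2, or 3), which is why it appears in the unnamed first column above. As a minimal sketch using only the standard library, the mapping between names and values can be made explicit. The `sample` and `records` names are illustrative, and the sample rows are copied from the preview rather than read from disk.

```python
import csv
import io

# The 13 feature names used in this notebook.
header = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium',
          'Total phenols', 'Flavanoids', 'Nonflavanoid phenols',
          'Proanthocyanins', 'Color intensity', 'Hue',
          'OD280/OD315 of diluted wines', 'Proline']

# First two rows of the preview, standing in for ../data/wine.csv.
sample = (
    "1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065\n"
    "1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050\n"
)

records = []
for row in csv.reader(io.StringIO(sample), delimiter=','):
    # The first field is the class label; the remaining 13 are features.
    label, values = int(row[0]), [float(v) for v in row[1:]]
    if len(values) != len(header):
        raise ValueError("row width does not match the header")
    records.append({'class': label, **dict(zip(header, values))})

print(records[0]['class'], records[0]['Alcohol'], records[0]['Proline'])
```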


### Split
Next, we use a scikit-learn function to split the dataset, with 80% going to training and 20% to testing.

In [5]:
X_train, X_test = train_test_split(df, train_size=0.8, test_size=0.2)
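To make explicit what a shuffled 80%/20% split does, here is a minimal standard-library sketch. It is not scikit-learn's implementation: the `shuffle_split` helper and its `seed` parameter are illustrative, and the rounding (test size rounded up, training size rounded down) is an assumption chosen so the two subsets never overlap. The Wine dataset has 178 rows, so plain row indices stand in for the DataFrame here.

```python
import math
import random


def shuffle_split(rows, train_size=0.8, test_size=0.2, seed=0):
    """Shuffle rows and cut them into disjoint training and testing lists."""
    n = len(rows)
    n_test = math.ceil(test_size * n)     # round the test share up
    n_train = math.floor(train_size * n)  # round the training share down
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:n_train + n_test]]
    return train, test


rows = list(range(178))  # stand-in for the 178 rows of the Wine dataset
train, test = shuffle_split(rows)
print(len(train), len(test))  # 142 36
```

With `train_test_split` itself, passing an integer `random_state` makes the shuffle reproducible across runs.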

### Training Data:

In [6]:
X_train[:5]

| Unnamed: 0 | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 12.72 | 1.81 | 2.2 | 18.8 | 86 | 2.2 | 2.53 | 0.26 | 1.77 | 3.9 | 1.16 | 3.14 | 714 |
| 3 | 12.25 | 4.72 | 2.54 | 21.0 | 89 | 1.38 | 0.47 | 0.53 | 0.8 | 3.85 | 0.75 | 1.27 | 720 |
| 2 | 12.17 | 1.45 | 2.53 | 19.0 | 104 | 1.89 | 1.75 | 0.45 | 1.03 | 2.95 | 1.45 | 2.23 | 355 |
| 2 | 12.04 | 4.3 | 2.38 | 22.0 | 80 | 2.1 | 1.75 | 0.42 | 1.35 | 2.6 | 0.79 | 2.57 | 580 |
| 2 | 13.86 | 1.51 | 2.67 | 25.0 | 86 | 2.95 | 2.86 | 0.21 | 1.87 | 3.38 | 1.36 | 3.16 | 410 |


### Testing Data:

In [7]:
X_test[:5]

| Unnamed: 0 | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 12.64 | 1.36 | 2.02 | 16.8 | 100 | 2.02 | 1.41 | 0.53 | 0.62 | 5.75 | 0.98 | 1.59 | 450 |
| 2 | 12.33 | 1.1 | 2.28 | 16.0 | 101 | 2.05 | 1.09 | 0.63 | 0.41 | 3.27 | 1.25 | 1.67 | 680 |
| 1 | 14.22 | 3.99 | 2.51 | 13.2 | 128 | 3.0 | 3.04 | 0.2 | 2.08 | 5.1 | 0.89 | 3.53 | 760 |
| 1 | 14.02 | 1.68 | 2.21 | 16.0 | 96 | 2.65 | 2.33 | 0.26 | 1.98 | 4.7 | 1.04 | 3.59 | 1035 |
| 2 | 12.37 | 1.17 | 1.92 | 19.6 | 78 | 2.11 | 2.0 | 0.27 | 1.04 | 4.68 | 1.12 | 3.48 | 510 |
