# SPLIT

This notebook provides an example of how to use SML to read in a dataset and split the data into training and testing data. For this use case we use the publicly available [Auto MPG Data Set](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) and use logistic regression to classify the MPG.

## SML Query

### Imports
We import the necessary library to use SML.

In [2]:
from sml import execute

In [3]:
query = 'READ "../data/auto-mpg.csv" (separator = "\s+", header = None) AND \
        SPLIT (train = .8, test = .2, validation = .0)'

execute(query, verbose=False)
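
As a rough illustration of what the fractions in the `SPLIT` clause mean in terms of row counts, the sketch below converts them into sizes. It is not SML's implementation: the helper name `split_sizes`, the rounding rule, and the row count of 398 (the number of instances listed for Auto MPG on the UCI page) are assumptions made for this example.

```python
def split_sizes(n_rows, train=0.8, test=0.2, validation=0.0):
    """Convert split fractions into row counts (hypothetical helper)."""
    # The three fractions are expected to cover the whole dataset.
    if abs(train + test + validation - 1.0) > 1e-9:
        raise ValueError("train, test and validation must sum to 1")
    # Round the smaller parts, then give the remainder to the training set
    # so that the three counts always add up to n_rows.
    n_test = round(n_rows * test)
    n_validation = round(n_rows * validation)
    n_train = n_rows - n_test - n_validation
    return n_train, n_test, n_validation


# Assumed row count for the Auto MPG data.
print(split_sizes(398, train=0.8, test=0.2, validation=0.0))
```

With `validation = .0`, as in the query above, no rows are set aside for validation and the data is divided between training and testing only.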



## Manually

The cells below show how to perform the same actions as the SML query manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [4]:
import pandas as pd
import numpy as np

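# train_test_split lives in sklearn.model_selection
# (the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split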

### Read

By default, the Auto MPG data does not include its headers, so we specify them manually and read the file into a pandas DataFrame.

In [5]:
#Names of all of the columns
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
         'model_year', 'origin', 'car_name']

# Import dataset (raw string so that \s is passed to pandas as a regex, not a string escape)
data = pd.read_csv('../data/auto-mpg.csv', sep=r'\s+', header=None, names=names)

data.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino
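
To see why both a whitespace separator and an explicit list of names are needed, the sketch below parses a single line shaped like the raw file using only the standard library. The sample line is reconstructed from the first row shown above and its exact spacing is an assumption. pandas uses its own parser, so this only illustrates the idea.

```python
import shlex

# Column names, as in the cell above.
names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
         'acceleration', 'model_year', 'origin', 'car_name']

# Assumed layout: fields separated by runs of whitespace, car name in quotes.
sample_line = '18.0   8   307.0   130.0   3504.0   12.0   70   1\t"chevrolet chevelle malibu"'

# shlex.split breaks on whitespace but keeps the quoted car name together.
fields = shlex.split(sample_line)

# The file has no header row, so the names must be attached by position.
row = dict(zip(names, fields))
print(row['mpg'], row['car_name'])
```

Every value comes back as a string here, whereas `pd.read_csv` also infers numeric types for the columns.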


### SPLIT
We then separate our labels from our features and use a scikit-learn function to perform an 80%/20% split into training and testing datasets, respectively.

In [7]:

# Separate predictors from labels
X = data[['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', "origin"]]

#Select target column
y = data['mpg']

### Splitting Data into Training and Testing Datasets

In [8]:
#Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2)
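
Conceptually, a random split of this kind shuffles the row positions and then cuts them at the requested fraction. The sketch below shows that idea using only the standard library. It is a simplified illustration, not scikit-learn's actual implementation, and the helper name `shuffle_split` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_split(rows, train_size=0.8, seed=0):
    """Shuffle row positions, then cut them into train and test parts."""
    indices = list(range(len(rows)))
    # A seeded generator makes the shuffle, and so the split, repeatable.
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_size)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(10))  # stand-in for ten dataset rows
train, test = shuffle_split(rows, train_size=0.8, seed=42)
print(len(train), len(test))
```

The shuffle is why the row labels in the outputs below are not in their original order. Because the call to `train_test_split` above does not fix a seed, rerunning the notebook will generally select different rows. Passing the `random_state` argument to `train_test_split` makes the split reproducible.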

#### Training Data: Features

In [9]:
X_train[:5]

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
88,8,302.0,137.0,4042.0,14.5,73,1
161,6,250.0,105.0,3897.0,18.5,75,1
16,6,199.0,97.0,2774.0,15.5,70,1
292,8,360.0,150.0,3940.0,13.0,79,1
148,4,116.0,75.0,2246.0,14.0,74,2


#### Training Data: Labels

In [13]:
y_train.head()

88     14.0
161    16.0
16     18.0
292    18.5
148    26.0
Name: mpg, dtype: float64

#### Testing Data: Features

In [12]:
X_test[:5]

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
284,6,225.0,110.0,3360.0,16.6,79,1
220,4,85.0,70.00,1945.0,16.8,77,3
122,4,121.0,110.0,2660.0,14.0,73,2
126,6,200.0,?,2875.0,17.0,74,1
94,8,440.0,215.0,4735.0,11.0,73,1


#### Testing Data: Labels

In [14]:
y_test.head()

284    20.6
220    33.5
122    24.0
126    21.0
94     13.0
Name: mpg, dtype: float64