# Linear Regression

This notebook shows how to use SML to read in a dataset, split the data into training and testing sets, replace troublesome values such as NaNs, fit a regression model on the dataset, and lastly generate lattice plots and other visual metrics. For this use case we use the publicly available [Auto MPG Data Set](https://archive.ics.uci.edu/ml/datasets/Auto+MPG) and use linear regression to predict the MPG.

## SML Query

### Imports
We import the library necessary to use SML.

In [1]:
from sml import execute

### Query
Next we create a query statement that `READ`s in the data. The file is delimited by whitespace (the `\s+` separator) and has no header row. We then `REPLACE` any values of '?' with the mode of the column, `SPLIT` the dataset so that 80% of it is used for training and 20% for testing, and lastly perform linear regression on the 1st column, using columns 2-8 as the predictors.

In [2]:
query = 'READ "../data/auto-mpg.csv" (separator = "\s+", header = None) AND\
 REPLACE ("?", "mode")'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/auto-mpg.csv
   Delimiter:      \s+
   Training Set Split:       None
   Testing Set Split:        None
   Predictiors:        None
   Label:         None
   Algorithm:     None
   Dataset Preview:
      0  1      2   3       4     5   6  7    8
0  18.0  8  307.0  39  3504.0  12.0  70  1   22
1  15.0  8  350.0  68  3693.0  11.5  70  1  169
2  18.0  8  318.0   2  3436.0  11.0  70  1  296
3  16.0  8  304.0   2  3433.0  12.0  70  1  294
4  17.0  8  302.0  26  3449.0  10.5  70  1  163
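
The summary above reports no split or algorithm, so as a rough illustration of the 80/20 split and linear regression steps described for the query, here is a minimal scikit-learn sketch. It is not SML's implementation: the data is synthetic (seven made-up numeric predictors standing in for columns 2-8), and the names `X`, `y`, `true_coef`, and `r2` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the dataset: 7 numeric predictors and a linear target
# with a little noise (hypothetical values, not the Auto MPG data).
X = rng.normal(size=(200, 7))
true_coef = np.array([1.5, -2.0, 0.5, 0.0, 3.0, -1.0, 0.25])
y = X @ true_coef + rng.normal(scale=0.1, size=200)

# 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training rows, score on the held-out rows.
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # coefficient of determination (R^2)
print(f"R^2 on the test set: {r2:.3f}")
```

Fixing `random_state` makes the split reproducible, which is convenient when comparing results across runs.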




## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from sklearn import linear_model
# sklearn.cross_validation and sklearn.learning_curve were replaced by sklearn.model_selection
from sklearn.model_selection import train_test_split, learning_curve, validation_curve

import seaborn as sns

%matplotlib inline
plt.rcParams['figure.figsize']=(12,12)
sns.set()

### Read

By default the Auto MPG data does not include its headers, so we specify them manually and read the file into a pandas DataFrame.

In [4]:
#Names of all of the columns
names = [
       'mpg'
    ,  'cylinders'
    ,  'displacement'
    ,  'horsepower'
    ,  'weight'
    ,  'acceleration'
    ,  'model_year'
    ,  'origin'
    ,  'car_name'
]

#Import dataset (raw string so that '\s+' is passed through as a regex, not a string escape)
data = pd.read_csv('../data/auto-mpg.csv', sep = r'\s+', header = None, names = names)

data.head()

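|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | car_name |
|---|-----|-----------|--------------|------------|--------|--------------|------------|--------|----------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | plymouth satellite |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | amc rebel sst |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | ford torino |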
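
To make the effect of the whitespace separator concrete, here is a small self-contained sketch that parses two inline rows instead of the file. The row layout (whitespace-separated fields with the car name in double quotes) and the second row's values are assumptions for illustration; the real file may differ.

```python
import io
import pandas as pd

# Two illustrative rows; the layout is assumed, not copied from the file.
sample = (
    '18.0   8   307.0   130.0   3504.   12.0   70  1  "chevrolet chevelle malibu"\n'
    '25.0   4   98.00   ?       2046.   19.0   71  1  "ford pinto"\n'
)

names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
         'acceleration', 'model_year', 'origin', 'car_name']

# r'\s+' treats any run of whitespace as one delimiter; quoted names stay intact.
df = pd.read_csv(io.StringIO(sample), sep=r'\s+', header=None, names=names)

print(df.shape)
print(df['horsepower'].tolist())
```

Because one `horsepower` entry is the placeholder '?', pandas keeps that column as text rather than numbers, which is why a cleaning step is needed before modelling.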


### Replace
Next we convert all '?' symbols to NaNs and drop every row that contains a NaN.

In [5]:
# Convert '?' placeholders to NaN, then drop every row that contains a NaN
data_clean = data.replace('?', np.nan).dropna()
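
Note that dropping rows differs from the SML query, which replaces '?' with the mode of the column. A minimal pandas sketch of that alternative is shown below on a toy frame. This is one plausible reading of mode replacement, not necessarily how SML implements it, and the `toy`, `marked`, and `imputed` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with one '?' placeholder (made-up values for illustration).
toy = pd.DataFrame({
    'mpg': [18.0, 25.0, 21.0, 30.0],
    'horsepower': ['130.0', '?', '130.0', '95.0'],
})

# Mark '?' as missing, then fill each column's gaps with that column's mode.
# mode() can return several rows when there are ties; iloc[0] takes the first.
marked = toy.replace('?', np.nan)
imputed = marked.fillna(marked.mode().iloc[0])

# horsepower was read as text, so convert it once the placeholder is gone.
imputed['horsepower'] = imputed['horsepower'].astype(float)

print(imputed)
```

Mode replacement keeps every row, whereas `dropna()` shrinks the dataset; which is preferable depends on how many values are missing.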