# Elastic Net Regression

This notebook provides an example of how to use SML to read in a dataset, split the data into training and testing data, replace troublesome values such as 'NaNs' from the dataset, and perform Elastic Net Regression on the dataset. For this use-case we use publicly availiable dataset: [Housing Data Set](https://archive.ics.uci.edu/ml/datasets/Housing) and use elastic net regression to predict **[Ask Mike about this...]**.

## SML Query
### Imports
We import the nescessary library to use SML.

In [1]:
from sml import execute

In [2]:
query = 'READ "../data/boston.csv" (separator = "\s+", header = 0) AND \
        SPLIT (train = .8, test = .2, validation = .0) AND \
        REGRESS (predictors = [1,2,3,4,5,6,7,8,9,10,11,12,13], label = 14, algorithm = elastic)'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/boston.csv
   Delimiter:      \s+
   Training Set Split:       80.00%
   Testing Set Split:        20.00%
   Predictiors:        ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13']
   Label:         14
   Algorithm:     elastic
   Dataset Preview:
   0.00632  18.00  2.310  0  0.5380  6.5750  65.20  4.0900  1  296.0  15.30  \
0  0.02731    0.0   7.07  0   0.469   6.421   78.9  4.9671  2  242.0   17.8   
1  0.02729    0.0   7.07  0   0.469   7.185   61.1  4.9671  2  242.0   17.8   
2  0.03237    0.0   2.18  0   0.458   6.998   45.8  6.0622  3  222.0   18.7   
3  0.06905    0.0   2.18  0   0.458   7.147   54.2  6.0622  3  222.0   18.7   
4  0.02985    0.0   2.18  0   0.458   6.430   58.7  6.0622  3  222.0   18.7   

   396.90  4.98  24.00  
0  396.90  9.14   21.6  
1  392.83  4.03   34.7  
2  394.63  2.94   33.4  
3  396.90  5.33   36.2  
4  394.12  5.21   28.7  




## Manually

The subsequent cells below show how the same actions of a SML query can be performed manually.

### Imports
Here we import the necessary libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import numpy as np

from sklearn import cross_validation, metrics
from sklearn.linear_model import ElasticNet
from sklearn.cross_validation import train_test_split
from sklearn.metrics import r2_score
from sklearn import cross_validation, metrics

### READ
By default the Boston Housing data does not include it's headers, so we specify it manually, and read that file into a pandas dataframe.

In [4]:
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', ' MEDV']
f = pd.read_csv("../data/boston.csv", sep = "\s+", header = None, names=names)
f.head()



Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222.0,18.7,396.9,5.33,36.2


### SPLIT
We then seperate our labels from our features and use a sklearn function to perform a 80%/20% split our training and testing dataset respectively.

In [5]:
features = f.drop('LSTAT', 1)
labels = f['LSTAT']
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=.2, random_state=42)

### REGRESS

We fit our Elastic Net Regression model with our training dataset and make predictions on our testing dataset and display the accuracy.

In [6]:
enet = ElasticNet(alpha=0.1)
enet.fit(X_train, y_train)
pred = enet.predict(X_test)
print('Accuracy:', r2_score(pred, y_test))


Accuracy: 0.605909902196
