# Linear Regression

This notebook shows how to use SML to read in a dataset, split it into training and testing sets, replace troublesome values such as NaNs, and perform linear regression. For this use case we use the publicly available [Computer Hardware Dataset](https://archive.ics.uci.edu/ml/datasets/Computer+Hardware) and apply ridge regression to predict compute performance.

## SML Query

### Imports
We import the necessary library to use SML.

In [1]:
from sml import execute

### Query

Next we create a query statement with three parts. First, `READ` loads the data from a file that is delimited by ',' and whose header is not used. Next, `SPLIT` divides the dataset, using 80% of it for training and 20% of it for testing. Lastly, `REGRESS` performs linear regression (here with the ridge algorithm) on the 10th column, using columns 1-9 as the predictors.

In [2]:
query = 'READ "../data/computer.csv" (separator = ",", header = 0) AND \
SPLIT (train = .8, test = .2, validation = .0) AND \
REGRESS (predictors = [1,2,3,4,5,6,7,8,9], label = 10, algorithm = ridge)'

execute(query, verbose=True)


Sml Summary:
   Dataset Path:        ../data/computer.csv
   Delimiter:      ,
   Training Set Split:       80.00%
   Testing Set Split:        20.00%
   Predictiors:        ['1', '2', '3', '4', '5', '6', '7', '8', '9']
   Label:         10
   Algorithm:     ridge
   Dataset Preview:
   adviser  32/60  125   256   6000  256.1  16  128  198  199
0       22    137   29  8000  32000     32   8   32  269  253
1       22     95   29  8000  32000     32   8   32  220  253
2       22    120   29  8000  32000     32   8   32  172  253
3       22    161   29  8000  16000     32   8   16  132  132
4       22     51   26  8000  32000     64   8   32  318  290
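
The query is an ordinary Python string, so it can also be assembled from parts instead of typed by hand. The sketch below is only an illustration: the variable names `predictor_columns` and `label_column` are hypothetical, and it assumes nothing about SML beyond the query text shown above, which it reproduces character for character.

```python
# Hypothetical helper values; only the final string format comes from the query above.
predictor_columns = range(1, 10)  # columns 1-9
label_column = 10

# Join without spaces so the list matches the hand-written form [1,2,...,9].
predictors = ",".join(str(c) for c in predictor_columns)

# Adjacent string literals inside parentheses are concatenated by Python,
# which avoids backslash continuations inside a single quoted string.
query = (
    'READ "../data/computer.csv" (separator = ",", header = 0) AND '
    'SPLIT (train = .8, test = .2, validation = .0) AND '
    f'REGRESS (predictors = [{predictors}], label = {label_column}, algorithm = ridge)'
)

print(query)
```

The resulting `query` can be passed to `execute(query, verbose=True)` exactly as in the cell above.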




## Manually

The cells below show how the same actions as the SML query can be performed manually.

### Imports
Here we import the libraries needed to perform the same actions as the SML query above.

In [3]:
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn import linear_model



### READ

By default the Computer Hardware data does not include its headers, so we specify them manually and read the file into a pandas dataframe.

In [4]:
names = ['vendor name', 'Model Name', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMN', 'CHMAX', 'PRP', 'ERP']

df = pd.read_csv('../data/computer.csv', header = None, names=names)
df.head()

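| | vendor name | Model Name | MYCT | MMIN | MMAX | CACH | CHMN | CHMAX | PRP | ERP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | adviser | 32/60 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | amdahl | 470v/7 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | amdahl | 470v/7a | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | amdahl | 470v/7b | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | amdahl | 470v/7c | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |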
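
The effect of the `header` argument can be checked on a small in-memory sample. The sketch below uses only the first two rows shown above and compares `header=None` with `header=0` in `pd.read_csv`. With `header=0`, pandas treats the first data row as column labels and renames the repeated `256` to `256.1`, which resembles the column labels in the SML dataset preview. Whether SML passes its `header` option through to pandas in this way is an assumption, not something the notebook confirms.

```python
import io

import pandas as pd

# First two rows of the dataset, copied from the output above.
sample = (
    "adviser,32/60,125,256,6000,256,16,128,198,199\n"
    "amdahl,470v/7,29,8000,32000,32,8,32,269,253\n"
)

names = ['vendor name', 'Model Name', 'MYCT', 'MMIN', 'MMAX',
         'CACH', 'CHMN', 'CHMAX', 'PRP', 'ERP']

# header=None: every line is data, and the labels come from `names`.
no_header = pd.read_csv(io.StringIO(sample), header=None, names=names)

# header=0: the first line is consumed as column labels.
first_row_header = pd.read_csv(io.StringIO(sample), header=0)

print(no_header.shape)
print(list(first_row_header.columns))
```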


#### Preprocessing 

Here we iterate through the dataframe and replace the values in each column of dtype object with integer codes.

In [5]:
def encode_categorical(df, cols=None):
    # Encode the given columns, or every column of dtype object if none are given.
    if cols is not None:
        categorical = cols
    else:
        categorical = [col for col in df.columns if df[col].dtype == 'object']

    for feature in categorical:
        # Map each distinct value to an integer code (set order is arbitrary).
        values = list(set(df[feature]))
        numbers = list(range(len(values)))
        df[feature] = df[feature].replace(values, numbers)
    # Note: df is modified in place and also returned.
    return df

df2 = encode_categorical(df)
df2.head()



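| | vendor name | Model Name | MYCT | MMIN | MMAX | CACH | CHMN | CHMAX | PRP | ERP |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13 | 27 | 125 | 256 | 6000 | 256 | 16 | 128 | 198 | 199 |
| 1 | 23 | 138 | 29 | 8000 | 32000 | 32 | 8 | 32 | 269 | 253 |
| 2 | 23 | 96 | 29 | 8000 | 32000 | 32 | 8 | 32 | 220 | 253 |
| 3 | 23 | 121 | 29 | 8000 | 32000 | 32 | 8 | 32 | 172 | 253 |
| 4 | 23 | 162 | 29 | 8000 | 16000 | 32 | 8 | 16 | 132 | 132 |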


### SPLIT
We then separate our labels from our features and use scikit-learn's `train_test_split` function to perform an 80%/20% split into training and testing datasets, respectively.

In [6]:
features = df2.drop('PRP', axis=1)
label = df2['PRP']
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=42)
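
As a quick sanity check of what the split does, the sketch below runs `train_test_split` on a synthetic 100-row dataframe (an assumption made so the example is self-contained; it is not the Computer Hardware data). It confirms the 80/20 sizes, that features and labels stay aligned row by row, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data for illustration: 100 rows, label is twice the feature.
toy = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, random_state=42
)

# Repeating the call with the same seed yields the same rows.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    toy[["x"]], toy["y"], test_size=0.2, random_state=42
)

print(len(X_tr), len(X_te))
print(X_tr.index.equals(y_tr.index))
print(X_te.index.equals(X_te2.index))
```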

### REGRESS

We fit our Ridge regression model on the training dataset, make predictions on the testing dataset, and display the R² score of those predictions.

In [7]:
ridge = linear_model.Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
pred = ridge.predict(X_test)
r2_score(pred, y_test)

0.9607123194827184
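
One detail worth keeping in mind when reading this score: scikit-learn documents the signature as `r2_score(y_true, y_pred)`, and the metric is not symmetric in its arguments, because the total sum of squares is computed from the first one. The cell above passes the predictions first, so the value it reports may differ slightly from the conventional R². The toy numbers below are made up purely to illustrate the difference.

```python
from sklearn.metrics import r2_score

# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

conventional = r2_score(y_true, y_pred)  # documented order: truth first
swapped = r2_score(y_pred, y_true)       # predictions first

print(conventional)
print(swapped)
```

Here the conventional score is 0.98, while the swapped call gives a slightly lower value because the variance of the predictions, rather than of the true labels, ends up in the denominator.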