# Lab 4: Random numbers, splitting data, evaluating model performance

- **Author:** Niall Keleher ([nkeleher@uw.edu](mailto:nkeleher@uw.edu))
- **Date:** 22 Jan 2018
- **Course:** INFX 574: Core Methods in Data Science

### Learning Objectives:
By the end of the lab, you will be able to:
* create dummy variables for use in regressions
* generate random numbers for use in randomization and train-test splits
* identify measures for evaluating regression performance

### Topics:
1. Qualitative/Categorical predictors
2. Generating random numbers 
3. Splitting data into training and test sets
4. Running regressions & generating predictions
5. Model performance

### References: 
* [Pandas - get_dummies()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)
* [random library](https://docs.python.org/2/library/random.html)
* [scikit-learn train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
* [Introduction to Statistical Learning, Lab #5](http://www-bcf.usc.edu/~gareth/ISL/Chapter%205%20Lab.txt)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [3]:
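auto_df = pd.read_csv('Auto.csv')
# Drop rows whose horsepower is missing (coded as '?'), then store the column as numbers
auto_df = auto_df[auto_df.horsepower != '?'].copy()
auto_df['horsepower'] = pd.to_numeric(auto_df['horsepower'])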
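
An alternative to filtering on the `'?'` string is to let `read_csv` treat `'?'` as a missing value, so the column is parsed as numeric from the start. A minimal sketch on a small inline CSV (the three rows below are a hypothetical stand-in, not the contents of `Auto.csv`):

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for Auto.csv
csv_text = "mpg,horsepower,weight\n18.0,130,3504\n25.0,?,2046\n15.0,165,3693\n"

# Treat '?' as missing, then drop the rows where horsepower is missing
sample_df = pd.read_csv(io.StringIO(csv_text), na_values='?')
sample_df = sample_df.dropna(subset=['horsepower'])

print(sample_df.dtypes)
print(len(sample_df))
```

With this approach `horsepower` is stored as a float column, because missing values force a floating-point type during parsing.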

In [4]:
auto_df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


### 1. Qualitative/Categorical predictors - Generate dummy variables in Python

In [4]:
auto_df.cylinders.value_counts()

4    199
8    103
6     83
3      4
5      3
Name: cylinders, dtype: int64

In [5]:
pd.get_dummies(auto_df.cylinders).head()

Unnamed: 0,3,4,5,6,8
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,0,0,1
3,0,0,0,0,1
4,0,0,0,0,1


In [8]:
cyl_dummies = pd.get_dummies(auto_df.cylinders, prefix='cyl')

In [9]:
auto_df2 = pd.concat([auto_df, cyl_dummies], axis=1)

In [10]:
auto_df2.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,cyl_3,cyl_4,cyl_5,cyl_6,cyl_8
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0,0,0,0,1
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,0,0,0,0,1
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,0,0,0,0,1
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,0,0,0,0,1
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,0,0,0,0,1
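
When all five cylinder dummies enter a regression together with an intercept, they are perfectly collinear with it (they sum to one in every row), which is often called the dummy variable trap. A minimal sketch of one common remedy, dropping one category as the baseline with `drop_first=True`, on a small hypothetical series (the values are illustrative, not taken from `Auto.csv`):

```python
import pandas as pd

# Hypothetical cylinder counts for five cars
cylinders = pd.Series([4, 8, 6, 4, 8], name='cylinders')

# drop_first=True omits the first category (4 here), which becomes the baseline
cyl_dummies_k1 = pd.get_dummies(cylinders, prefix='cyl', drop_first=True, dtype=int)

print(cyl_dummies_k1)
```

A car with 4 cylinders is then represented by zeros in every remaining dummy column, and each coefficient is read relative to that baseline.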


### 2. Generating random numbers - randomizing treatment assignment

In [11]:
import random

In [12]:
random.random()  # Random float x, 0.0 <= x < 1.0

0.7354865969322709

In [13]:
random.uniform(1, 100)  # Random float x, 1.0 <= x <= 100.0

80.59187062851099

In [14]:
random.randint(1, 10)  # Integer from 1 to 10, endpoints included

10

In [15]:
random.sample([1, 2, 3, 4, 5],  3)

[5, 1, 3]

In [16]:
random.seed(47653)
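
Setting a seed makes the sequence of pseudo-random numbers reproducible: re-seeding with the same value replays the same draws. A small sketch (the variable names are only illustrative):

```python
import random

random.seed(47653)
first_draws = [random.random() for _ in range(3)]

# Re-seeding with the same value restarts the generator at the same point
random.seed(47653)
second_draws = [random.random() for _ in range(3)]

print(first_draws == second_draws)  # True
```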

In [17]:
raw_data = {'first_name': ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma'],  
        'sex': ['male', 'male', 'female', 'male', 'male', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'sex'])

In [21]:
df

Unnamed: 0,first_name,sex
0,Niall,male
1,Josh,male
2,Li,female
3,Lavi,male
4,Jevin,male
5,Emma,female


In [22]:
df['rand'] = df.apply(lambda row: random.random(), axis=1)

In [23]:
df

Unnamed: 0,first_name,sex,rand
0,Niall,male,0.009981
1,Josh,male,0.897681
2,Li,female,0.804464
3,Lavi,male,0.147438
4,Jevin,male,0.942135
5,Emma,female,0.426891


In [26]:
df['treat'] = (df['rand']<.5)

In [27]:
df

Unnamed: 0,first_name,sex,rand,treat
0,Niall,male,0.009981,True
1,Josh,male,0.897681,False
2,Li,female,0.804464,False
3,Lavi,male,0.147438,True
4,Jevin,male,0.942135,False
5,Emma,female,0.426891,True
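
Thresholding an independent random draw for each person does not guarantee equal group sizes. Above it happens to give three treated and three control units, but other seeds can give uneven groups. If a fixed number of treated units is wanted, one option is to shuffle the names and treat the first half. A sketch under that assumption (the seed value and variable names are arbitrary choices for illustration):

```python
import random

names = ['Niall', 'Josh', 'Li', 'Lavi', 'Jevin', 'Emma']

random.seed(574)  # arbitrary seed, only for reproducibility
shuffled = random.sample(names, k=len(names))  # shuffled copy; names is unchanged

n_treat = len(names) // 2
treated = set(shuffled[:n_treat])
assignment = {name: (name in treated) for name in names}

print(assignment)
```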


### 3. Splitting data into training and test sets

In [48]:
auto_df['rand'] = auto_df.apply(lambda row: random.random(), axis=1)

In [49]:
auto_df['train'] = (auto_df['rand']>.33)

In [50]:
len(auto_df)

392

In [51]:
len(auto_df[auto_df['train']])

279

In [52]:
auto_train = auto_df[auto_df['train']]
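
The rows not flagged for training form the test set. A minimal sketch with a hypothetical ten-row frame (the values and the seed are made up for illustration) shows the complementary selection with `~`:

```python
import random
import pandas as pd

random.seed(1)  # arbitrary seed for a reproducible illustration
toy_df = pd.DataFrame({'mpg': [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0, 14.0, 15.0]})
toy_df['rand'] = [random.random() for _ in range(len(toy_df))]
toy_df['train'] = toy_df['rand'] > .33

toy_train = toy_df[toy_df['train']]
toy_test = toy_df[~toy_df['train']]   # ~ negates the boolean mask

print(len(toy_train), len(toy_test))
```

Every row lands in exactly one of the two frames, so the two lengths always add up to the length of the original frame.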

#### Using scikit-learn

In [53]:
from sklearn.model_selection import train_test_split

In [54]:
X = auto_df['weight']

In [55]:
y = auto_df['mpg']

In [57]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [58]:
len(X_train)

262

In [59]:
len(y_train)

262

In [60]:
len(X_test)

130

In [61]:
len(y_test)

130
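
The sizes above are consistent with the test share being rounded up to a whole number of rows, with the remainder going to training. The arithmetic, as a sketch (the rounding rule is inferred from the counts above rather than quoted from the scikit-learn documentation):

```python
import math

n_rows = 392
test_size = 0.33

n_test = math.ceil(test_size * n_rows)   # 129.36 rounds up to 130
n_train = n_rows - n_test                # the remaining 262 rows

print(n_train, n_test)
```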

### 4. Running regressions & generating predictions

In [62]:
auto_df.head(1)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name,rand,train
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,0.710177,True


In [63]:
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error, r2_score

In [40]:
overfit_mod = smf.ols(formula='mpg ~ weight', data = auto_df)
overfit_result = overfit_mod.fit()
print(overfit_result.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.693
Model:                            OLS   Adj. R-squared:                  0.692
Method:                 Least Squares   F-statistic:                     878.8
Date:                Sat, 27 Apr 2019   Prob (F-statistic):          6.02e-102
Time:                        12:01:40   Log-Likelihood:                -1130.0
No. Observations:                 392   AIC:                             2264.
Df Residuals:                     390   BIC:                             2272.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     46.2165      0.799     57.867      0.0

In [64]:
train_mod = smf.ols(formula='mpg ~ weight', data = auto_train)
train_result = train_mod.fit()
print(train_result.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.706
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     666.5
Date:                Sat, 27 Apr 2019   Prob (F-statistic):           1.08e-75
Time:                        12:48:14   Log-Likelihood:                -798.91
No. Observations:                 279   AIC:                             1602.
Df Residuals:                     277   BIC:                             1609.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     45.8825      0.912     50.323      0.0
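
A model fitted on the training rows is usually judged on rows it has not seen. With a statsmodels formula model, `train_result.predict(...)` accepts a data frame of held-out rows. The sketch below illustrates the same fit-on-train, predict-on-test pattern using `numpy.polyfit` as a stand-in for `smf.ols`, on synthetic data loosely shaped like mpg versus weight (all numbers and names are made up for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed

# Synthetic data: a linear trend plus noise
weight = rng.uniform(1600, 5000, size=100)
mpg = 46.0 - 0.0076 * weight + rng.normal(0, 4, size=100)

# Hold out roughly one third of the rows
is_train = rng.uniform(size=100) > .33

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(weight[is_train], mpg[is_train], deg=1)

# Predict the held-out rows and measure the error there
pred_test = intercept + slope * weight[~is_train]
test_mse = np.mean((mpg[~is_train] - pred_test) ** 2)

print(slope, intercept, test_mse)
```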

### Exercise

#### Use scikit-learn to train a model to predict mpg using weight, horsepower, cylinders, displacement, acceleration, origin, and year

Reference: http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares

In [65]:
from sklearn import linear_model

lin_mod = linear_model.LinearRegression()

**Dropping the dependent variable (`mpg`), the original `cylinders` column (now dummy encoded), and the text column `name`**

In [89]:
X1 = auto_df2.drop(['mpg', 'cylinders', 'name'] , axis=1)
Y1 = auto_df2['mpg']

**Splitting data into training set and test set**

In [90]:
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X1, Y1, test_size=0.33, random_state=42)

**Fitting a linear model over the training data**

In [91]:
lin_mod.fit(X_train_1, y_train_1)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

**Coefficients in our regression model**

In [95]:
lin_mod.coef_

array([ 1.66453084e-02, -4.80275207e-02, -5.24730126e-03,  4.28742471e-03,
        6.99890873e-01,  1.64065843e+00, -4.34239499e+00,  2.33912542e+00,
        2.43999711e+00, -1.08498393e+00,  6.48256389e-01])
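
`coef_` is a bare array whose order follows the columns of the training matrix, so pairing it with the column names makes it readable (for the model above, something like `pd.Series(lin_mod.coef_, index=X1.columns)`). A self-contained sketch of the idea, using `numpy.linalg.lstsq` as a stand-in for `LinearRegression` on noise-free made-up data with hypothetical column names:

```python
import numpy as np

feature_names = ['weight', 'year']          # hypothetical column names
X = np.array([[2.0, 70.0], [3.0, 72.0], [2.5, 76.0], [4.0, 80.0], [3.5, 82.0]])
y = 5.0 - 3.0 * X[:, 0] + 0.5 * X[:, 1]    # exact linear relation, no noise

# Add a column of ones so the first fitted value is the intercept
design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(design, y, rcond=None)

intercept, coefs = params[0], params[1:]
named_coefs = dict(zip(feature_names, coefs))
print(intercept, named_coefs)
```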

**Intercept of the linear model**

In [96]:
lin_mod.intercept_

-15.90109427188423

**Model fit on the test set. For a regression, `score` returns R²: on the held-out data, about 83.12% of the variability in the dependent variable is explained by the independent variables**

In [97]:
lin_mod.score(X_test_1, y_test_1)

0.8312019406413886
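
R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the observed values. A hand computation on four made-up points (illustrative values only):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20.0
r2 = 1 - ss_res / ss_tot

print(round(r2, 4))  # 0.9625
```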

**Printing the predicted values**

In [98]:
y_predict = lin_mod.predict(X_test_1)

In [103]:
print(y_predict)

[26.98637278 26.20285914 35.2097379  25.75315646 29.09663092 31.32662325
  9.72127473 31.45702835 19.27038315 29.87680757 12.5809563  23.22916175
 17.24644093 30.54106059 19.38669583 29.52240768 18.97327547 33.98766723
 27.59596626 30.46141067 17.58094859 35.15628245 35.95272465 15.91161917
 29.37441084 26.30146287 22.38242795 15.07012302 30.07450921 25.37668184
 14.50350986 22.01520971 21.26784714 32.74638873 13.56134585 36.64902752
 12.21439099 25.16867213 12.61968087  6.89030441 14.20307882 28.41243039
 34.9800616  27.37309572 13.23526452  9.67906327 16.54807592 32.09782462
 25.52791949 30.9440646  12.83495595 27.18355754 23.96537295 35.2500702
 25.71518716 16.47694263 21.11400246 20.98331814 23.49289621 26.73960059
  8.31505062 21.27305559 23.19002709 24.53879511 30.48584266 29.37512968
 26.56746599 30.24331624 22.43334794 10.27590938 24.08926293 13.44269749
 24.89005449 21.52582688 22.8978685  24.96902249 14.97797454 15.27369122
 26.62595666 18.79317976 25.75003103 19.73829467 13.

**Checking the mean squared error of the model on the test set**

In [104]:
regression_model_mse = mean_squared_error(y_test_1, y_predict)  # argument order is (y_true, y_pred)
regression_model_mse

8.914115246646885

In [105]:
regression_model_mse**(0.5)

2.985651561493217
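
Taking the square root puts the error back in the units of the response (miles per gallon here); this is usually called the root mean squared error (RMSE). A hand computation on four made-up points (illustrative values only), using only the standard library:

```python
import math

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)   # 0.75 / 4 = 0.1875
rmse = math.sqrt(mse)

print(mse, rmse)
```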