<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Get-the-features-and-target" data-toc-modified-id="Get-the-features-and-target-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get the features and target</a></span></li><li><span><a href="#The-brute-force-approach-for-model-evaluation" data-toc-modified-id="The-brute-force-approach-for-model-evaluation-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The brute-force approach for model evaluation</a></span><ul class="toc-item"><li><span><a href="#Divide-the-data-into-training-and-testing" data-toc-modified-id="Divide-the-data-into-training-and-testing-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Divide the data into training and testing</a></span></li><li><span><a href="#Standardization" data-toc-modified-id="Standardization-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Standardization</a></span></li><li><span><a href="#Training-and-testing" data-toc-modified-id="Training-and-testing-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Training and testing</a></span></li></ul></li><li><span><a href="#A-better-approach" data-toc-modified-id="A-better-approach-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>A better approach</a></span><ul class="toc-item"><li><span><a href="#Divide-the-data-into-$k$-folds" data-toc-modified-id="Divide-the-data-into-$k$-folds-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Divide the data into $k$ folds</a></span></li><li><span><a href="#Standardization,-training,-and-testing" data-toc-modified-id="Standardization,-training,-and-testing-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Standardization, training, and testing</a></span></li></ul></li><li><span><a 
href="#The-best-practice:-Pipeline-+-cross_val_score-+-KFold" data-toc-modified-id="The-best-practice:-Pipeline-+-cross_val_score-+-KFold-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>The best practice: Pipeline + cross_val_score + KFold</a></span><ul class="toc-item"><li><span><a href="#Create-the-pipeline" data-toc-modified-id="Create-the-pipeline-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Create the pipeline</a></span></li><li><span><a href="#Calculate-the-average-accuracy-using-the-pipeline,-cross_val_score,-and-KFold" data-toc-modified-id="Calculate-the-average-accuracy-using-the-pipeline,-cross_val_score,-and-KFold-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Calculate the average accuracy using the pipeline, cross_val_score, and KFold</a></span></li></ul></li></ul></div>

<b>

<p>
<center>
<font size="5">
Machine Learning I (DATS 6202 - O10), Spring 2019
</font>
</center>
</p>

<p>
<center>
<font size="4">
Exercise 2 (Solution)
</font>
</center>
</p>

<p>
<center>
<font size="3">
Data Science, Columbian College of Arts & Sciences, George Washington University
</font>
</center>
</p>

<p>
<center>
<font size="3">
Author: Yuxiao Huang
</font>
</center>
</p>

</b>

# Overview
- Apply Linear Regression on the Housing dataset
- Specifically, you should use the following linear model
$$
MEDV = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n.
$$
- Here:
    - $MEDV$ is the target, which is the median value of owner-occupied homes in \$1000s
    - $x_1, x_2, \ldots, x_n$ are the features in the input dataset
    - $w_0, w_1, \ldots, w_n$ are the parameters
- The goal of this exercise is to
    - walk through the steps in data preprocessing, training, testing, and model evaluation, by dividing the data using a single split and by using cross validation
    - introduce the best practice for data preprocessing, training, testing, and model evaluation
- Complete the missing parts indicated by # Implement me
- Particularly, the code should
    - be bug-free (matching the provided output does not guarantee that your code is bug-free, but a mismatch between the two very likely indicates a bug)
    - be commented
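
As a minimal sketch of the linear model above, the prediction for one sample is the intercept plus the dot product of the weights and the features. The parameter and feature values below are hypothetical and chosen only for illustration; they are not learned from the Housing dataset.

```python
import numpy as np

# Hypothetical parameters and features (assumptions for illustration only)
w0 = 1.0                          # intercept w_0
w = np.array([0.5, -2.0, 3.0])    # weights w_1, ..., w_n (here n = 3)
x = np.array([2.0, 1.0, 4.0])     # features x_1, ..., x_n of one sample

# MEDV = w_0 + w_1 x_1 + ... + w_n x_n
medv = w0 + np.dot(w, x)
print(medv)  # 12.0
```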

# Load data

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                 'python-machine-learning-book-2nd-edition'
                 '/master/code/ch10/housing.data.txt',
                 header=None,
                 sep=r'\s+')  # raw string, so that \s is not treated as a string escape

df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

# Get the features and target

In [2]:
# Implement me
X = df.iloc[:, :-1].values
y = df['MEDV'].values

# The brute-force approach for model evaluation
Here we use a single split to divide the data into training and testing sets. The model is learned from the training data, and its score (for `LinearRegression`, the coefficient of determination $R^2$) is then computed on the testing data.

## Divide the data into training and testing

In [3]:
from sklearn.model_selection import train_test_split

# Divide the data into training and testing (with test_size=0.3 and random_state = 0)
# Implement me
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

## Standardization

In [4]:
from sklearn.preprocessing import StandardScaler

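# Standardize the features and target
# Implement me
# Fit the feature scaler on the training data only, then apply it to the testing data
ss_X = StandardScaler()
X_train = ss_X.fit_transform(X_train)
X_test = ss_X.transform(X_test)
# Use a separate scaler for the target (StandardScaler expects a 2D array)
ss_y = StandardScaler()
y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).reshape(-1)
y_test = ss_y.transform(y_test.reshape(-1, 1)).reshape(-1)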
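
To make the standardization step concrete, the sketch below checks that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the testing data are transformed with the training statistics rather than their own. The data are synthetic (an assumption for illustration, not the Housing data), and the names `X_train_demo`, `X_test_demo`, and `scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the Housing data)
rng = np.random.RandomState(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(30, 2))

# Fit on the training data only, then reuse the fitted scaler on the testing data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train_demo)
X_test_std = scaler.transform(X_test_demo)

# The same transformation written out by hand, using the training statistics
manual = (X_test_demo - X_train_demo.mean(axis=0)) / X_train_demo.std(axis=0)

print(np.allclose(X_test_std, manual))        # True
print(np.allclose(X_train_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_train_std.std(axis=0), 1.0))   # True
```

Only the standardized training data are guaranteed to have zero mean and unit variance; the standardized testing data will generally be close to, but not exactly at, these values.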

## Training and testing

In [5]:
from sklearn.linear_model import LinearRegression

# Declare the linear regression model
lr = LinearRegression()

# Fit the model
# Implement me
lr.fit(X_train, y_train)

print('The score of the model: ', round(lr.score(X_test, y_test), 3))

The score of the model:  0.673
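
For reference, the value returned by `LinearRegression.score` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below verifies this on synthetic data (an assumption for illustration; the names `X_demo`, `y_demo`, and `model` are hypothetical and unrelated to the Housing data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Train on the first 150 samples and test on the remaining 50
model = LinearRegression().fit(X_demo[:150], y_demo[:150])
X_te, y_te = X_demo[150:], y_demo[150:]
y_pred = model.predict(X_te)

# R^2 computed by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(model.score(X_te, y_te), r2_manual))  # True
print(np.isclose(r2_score(y_te, y_pred), r2_manual))   # True
```

Unlike a classification accuracy, $R^2$ is at most 1 and can be negative when the model predicts worse than a constant equal to the mean of the testing target.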


# A better approach
Here we will use cross validation to divide the data into $k$ folds. Unlike the previous case (a single split), where there is only one round of training and testing, here there are $k$ rounds. In round $i$, fold $i$ is used for testing and the remaining $k - 1$ folds are used for training. This results in a list of $k$ scores, one per round. We can then report the mean and standard deviation (std) of this list. The mean measures the average performance of the model: generally, the higher the mean, the better the model. The std, on the other hand, measures the stability of the model across folds: generally, the lower the std, the more stable the model.
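
The splitting scheme can be illustrated on a toy array. This is a minimal sketch with six hypothetical samples and $k = 3$ (the names `X_toy` and `kf_toy` are assumptions for illustration); without shuffling, `KFold` takes consecutive blocks of rows as the testing folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six toy samples (an assumption for illustration), split into k = 3 folds
X_toy = np.arange(6).reshape(-1, 1)
kf_toy = KFold(n_splits=3)

# Collect the (training indices, testing indices) pair of each round
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in kf_toy.split(X_toy)]

for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print('Round', i, 'train:', train_idx, 'test:', test_idx)
# Round 1 train: [2, 3, 4, 5] test: [0, 1]
# Round 2 train: [0, 1, 4, 5] test: [2, 3]
# Round 3 train: [0, 1, 2, 3] test: [4, 5]
```

Each sample appears in exactly one testing fold, so every sample is used for testing once and for training $k - 1$ times.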

## Divide the data into $k$ folds

In [6]:
from sklearn.model_selection import KFold

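# KFold (with n_splits=10)
# Note: random_state only takes effect when shuffle=True, and recent versions of
# scikit-learn raise an error if it is set while shuffle is False, so it is omitted here
# Implement me
kf = KFold(n_splits=10)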

## Standardization, training, and testing

In [7]:
import numpy as np
accs = []

for train_idx, test_idx in kf.split(X, y):
    # Get the training and testing data using train_idx and test_idx
    # Implement me
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

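    # Standardize the features and target
    # Implement me
    # Fit the feature scaler on the training fold only, then apply it to the testing fold
    ss_X = StandardScaler()
    X_train = ss_X.fit_transform(X_train)
    X_test = ss_X.transform(X_test)
    # Use a separate scaler for the target (StandardScaler expects a 2D array)
    ss_y = StandardScaler()
    y_train = ss_y.fit_transform(y_train.reshape(-1, 1)).reshape(-1)
    y_test = ss_y.transform(y_test.reshape(-1, 1)).reshape(-1)

    # Fit the model
    # Implement me
    lr.fit(X_train, y_train)

    # Record the score (R^2) of this round
    accs.append(lr.score(X_test, y_test))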
    
print('The average score of the model: ', round(np.mean(accs), 3))
print('The std score of the accuracy: ', round(np.std(accs), 3))

The average score of the model:  0.203
The std score of the accuracy:  0.595


# The best practice: Pipeline + cross_val_score + KFold
While the manual cross-validation loop above gives a correct result, it is verbose and is not the best practice. Here we show a much simpler approach using the pipeline, cross_val_score, and KFold.

## Create the pipeline
As the name suggests, the [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) is a chain in which the steps of data preprocessing and modeling are connected. Concretely, a pipeline is built from a list of tuples (pairs), where the first item in each tuple is the name of a step and the second item is the transformer (or, for the last step, the estimator) itself. The pipeline will, as you will see in this exercise and later ones, greatly simplify data preprocessing, training and testing, hyperparameter tuning, and model selection.

There are two things you should know about the pipeline. First, the last step of the pipeline is the estimator (a regressor or classifier), and every step before it must be a transformer. Second, the order of the transformers in the pipeline matters. For any two transformers A and B, if A comes before B in the pipeline, the transformation provided by A is performed before that provided by B.

In [8]:
from sklearn.pipeline import Pipeline

# The pipeline with StandardScaler and LinearRegression
pipe_lr = Pipeline([('StandardScaler', StandardScaler()), ('LinearRegression', LinearRegression())])
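
To see what fitting and scoring such a pipeline amounts to, the sketch below compares a pipeline with the same two steps applied by hand. The data are synthetic (an assumption for illustration), and the names `pipe_demo`, `scaler`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(120, 3))
y_demo = X_demo @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.3, size=120)
X_tr, X_te = X_demo[:90], X_demo[90:]
y_tr, y_te = y_demo[:90], y_demo[90:]

# The pipeline: the scaler is fitted on the training data, then the regressor
pipe_demo = Pipeline([('StandardScaler', StandardScaler()),
                      ('LinearRegression', LinearRegression())])
pipe_demo.fit(X_tr, y_tr)
pipe_score = pipe_demo.score(X_te, y_te)

# The same two steps performed by hand, in the same order
scaler = StandardScaler().fit(X_tr)
model = LinearRegression().fit(scaler.transform(X_tr), y_tr)
manual_score = model.score(scaler.transform(X_te), y_te)

print(list(pipe_demo.named_steps))          # ['StandardScaler', 'LinearRegression']
print(np.isclose(pipe_score, manual_score))  # True
```

The pipeline keeps the fitted scaler and regressor together, so the testing data are always transformed with the statistics learned from the training data.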

## Calculate the average accuracy using the pipeline, cross_val_score, and KFold
The [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function is a one-stop shop that wraps what we did in sec. 5: it divides the data into $k$ folds and, in each of the $k$ rounds, first standardizes the features (through the pipeline) and then trains and tests the model.

In [9]:
from sklearn.model_selection import cross_val_score

# Get the list of scores obtained by cross_val_score
# (random_state is omitted because it only takes effect when shuffle=True)
accs = cross_val_score(pipe_lr,
                       X,
                       y,
                       cv=KFold(n_splits=10))

print('The average score of the model: ', round(accs.mean(), 3))
print('The std score of the accuracy: ', round(accs.std(), 3))

The average score of the model:  0.203
The std score of the accuracy:  0.595
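
The mean and std match those in sec. 5 even though the pipeline standardizes only the features: for a linear regression with an intercept, rescaling the target by the training mean and std does not change $R^2$. The equivalence between cross_val_score and a manual loop can also be checked directly. The following is a minimal sketch on synthetic data (an assumption for illustration; the names `X_demo`, `y_demo`, `cv`, and `pipe_demo` are hypothetical).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=100)

cv = KFold(n_splits=5)
pipe_demo = Pipeline([('StandardScaler', StandardScaler()),
                      ('LinearRegression', LinearRegression())])

# One call: split, standardize, train, and test in each round
scores_cv = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv)

# The same procedure written as an explicit loop over the folds
scores_manual = []
for train_idx, test_idx in cv.split(X_demo):
    scaler = StandardScaler().fit(X_demo[train_idx])
    model = LinearRegression().fit(scaler.transform(X_demo[train_idx]),
                                   y_demo[train_idx])
    scores_manual.append(model.score(scaler.transform(X_demo[test_idx]),
                                     y_demo[test_idx]))

print(np.allclose(scores_cv, scores_manual))  # True
```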
