<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load data</a></span></li><li><span><a href="#Get-the-features-and-target" data-toc-modified-id="Get-the-features-and-target-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Get the features and target</a></span></li><li><span><a href="#The-fast-linear-regression-model-(implemented-using-numpy-array)" data-toc-modified-id="The-fast-linear-regression-model-(implemented-using-numpy-array)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The fast linear regression model (implemented using numpy array)</a></span></li><li><span><a href="#Train-and-test-the-fast-linear-regression-model" data-toc-modified-id="Train-and-test-the-fast-linear-regression-model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Train and test the fast linear regression model</a></span></li><li><span><a href="#Train-and-test-the-sklearn-linear-regression-model" data-toc-modified-id="Train-and-test-the-sklearn-linear-regression-model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Train and test the sklearn linear regression model</a></span></li><li><span><a href="#Discussion" data-toc-modified-id="Discussion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Discussion</a></span></li></ul></div>

<b>

<p>
<center>
<font size="5">
Machine Learning I (DATS 6202 - O10), Spring 2019
</font>
</center>
</p>

<p>
<center>
<font size="4">
Exercise 3 (Solution)
</font>
</center>
</p>

<p>
<center>
<font size="3">
Data Science, Columbian College of Arts & Sciences, George Washington University
</font>
</center>
</p>

<p>
<center>
<font size="3">
Author: Yuxiao Huang
</font>
</center>
</p>

</b>

# Overview
- Apply linear regression to the Housing dataset
- In particular, implement your own (fast) model using NumPy arrays
- Complete the missing parts indicated by `# Implement me`
- The code should
    - be bug-free (matching the provided output does not guarantee that your code is bug-free, but a mismatch very likely indicates a bug)
    - be commented

# Load data

In [1]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/rasbt/'
                 'python-machine-learning-book-2nd-edition'
                 '/master/code/ch10/housing.data.txt',
                 header=None,
                 sep=r'\s+')

df.columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 
              'NOX', 'RM', 'AGE', 'DIS', 'RAD', 
              'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

target = 'MEDV'

# Get the features and target

In [2]:
# Implement me
X = df.drop(columns=[target])
y = df[target]

# The fast linear regression model (implemented using numpy array)

In [3]:
from sklearn.base import BaseEstimator, RegressorMixin
import numpy as np

class MyFastLinearRegression(BaseEstimator, RegressorMixin):
    """The fast linear regression model (implemented using numpy array)"""
    
    def __init__(self, n_iter=100, eta=10 ** -4, random_state=0):
        # The number of iterations
        self.n_iter = n_iter
        # The learning rate
        self.eta = eta
        # The random state
        self.random_state = random_state

    def fit(self, X, y):
        """
        The fit function
        
        Parameters
        ----------
        X : the feature matrix
        y : the target vector
        """
        
        # Implement me
        # Initialize the weight for features x0 (the dummy feature), x1, x2, ..., xn
        self.w = np.zeros(1 + X.shape[1])

        # For each iteration
        for _ in range(self.n_iter):
            # Get the net_input
            net_input = self.net_input(X)
            
            # Get the errors
            errors = y - net_input
            
            # Get the update (of the weight) for features x1, x2, ..., xn
            self.w[1:] += self.eta * X.T.dot(errors)
            
            # Get the update (of the weight) for the dummy feature, x0
            self.w[0] += self.eta * errors.sum()

        # Return the fitted estimator (the scikit-learn convention for fit)
        return self

    def net_input(self, X):
        """
        Get the net input
        
        Parameters
        ----------
        X : the feature matrix
        
        Returns
        ----------
        The net input
       
        """
        
        # Implement me
        return np.dot(X, self.w[1:]) + self.w[0]

    def predict(self, X):
        """
        The predict function
        
        Parameters
        ----------
        X : the feature matrix
        
        Returns
        ----------
        The predicted value of the target
        """
        
        # Implement me
        return self.net_input(X)
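
The two `+=` lines in `fit` can be read as batch gradient descent on a sum-of-squared-errors cost. The derivation below is a sketch under that assumption (the code does not name its cost function), writing $\hat{y} = X w_{1:} + w_0$ for the net input and $m$ for the number of training samples:

$$
J(w) = \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)^2
$$

$$
\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right) x_j^{(i)},
\qquad
\frac{\partial J}{\partial w_0} = -\sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)
$$

Stepping against the gradient with learning rate $\eta$ gives

$$
w_{1:} \leftarrow w_{1:} + \eta \, X^{\top} \left( y - \hat{y} \right),
\qquad
w_0 \leftarrow w_0 + \eta \sum_{i=1}^{m} \left( y^{(i)} - \hat{y}^{(i)} \right)
$$

which corresponds to `self.eta * X.T.dot(errors)` and `self.eta * errors.sum()`. Because the gradient is summed rather than averaged over the samples, the effective step size grows with the size of the training set.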

# Train and test the fast linear regression model

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Create the pipeline with StandardScaler and MyFastLinearRegression
# Implement me
pipe_mflr = Pipeline([('StandardScaler', StandardScaler()), ('MyFastLinearRegression', MyFastLinearRegression())])

scores = cross_val_score(pipe_mflr,
                         X,
                         y,
                         # random_state is omitted: it has no effect without shuffling
                         cv=KFold(n_splits=10))

print(scores.mean(), scores.std())

0.32329520498092723 0.4907918945638258
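
The two numbers printed above are the mean and standard deviation of the per-fold scores. Since no `scoring` argument is passed, `cross_val_score` falls back to the estimator's own `score` method, which for a `RegressorMixin` subclass is the coefficient of determination $R^2$. A minimal sketch of that metric, using made-up numbers and a hypothetical helper `r2` that is not part of the exercise:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares around the mean of the true values
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r2(y_true, y_true)                       # exact predictions
baseline = r2(y_true, np.full(4, y_true.mean()))   # always predict the mean
worse = r2(y_true, y_true[::-1])                   # worse than predicting the mean

print(perfect, baseline, worse)
```

Note that $R^2$ is negative whenever a model does worse than predicting the mean of the evaluated targets. That may partly explain the large standard deviation across the 10 unshuffled folds, although the per-fold scores would need to be inspected to confirm it.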


# Train and test the sklearn linear regression model

In [5]:
from sklearn.linear_model import LinearRegression

# Create the pipeline with StandardScaler and LinearRegression
# Implement me
pipe_sklr = Pipeline([('StandardScaler', StandardScaler()), ('LinearRegression', LinearRegression())])

scores = cross_val_score(pipe_sklr,
                         X,
                         y,
                         # random_state is omitted: it has no effect without shuffling
                         cv=KFold(n_splits=10))

print(scores.mean(), scores.std())

0.2025289900605652 0.5952960169512242


# Discussion
The results in sections 5 and 6 show that the model we implemented has a higher mean cross-validation score than the sklearn model (about 0.32 versus 0.20) and a lower standard deviation across folds (about 0.49 versus 0.60). In that sense it is both more accurate and more stable. However, this is not a fair comparison, since I (somewhat) fine-tuned the hyperparameters of our model. Later we will discuss the best practice for (serious) hyperparameter tuning and model selection.
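
One way to see why the hyperparameters matter for this comparison: with the default `n_iter=100` and `eta=10 ** -4`, gradient descent may stop well before it reaches the least-squares solution that `LinearRegression` computes directly. The sketch below uses synthetic data (not the Housing dataset) and a hypothetical helper `gd_weights` that copies the update rule from `fit`. It measures the distance between the gradient-descent weights and the `np.linalg.lstsq` solution for several iteration counts:

```python
import numpy as np

def gd_weights(X, y, n_iter, eta):
    """Batch gradient descent with the same update rule as MyFastLinearRegression.fit
    (hypothetical standalone helper, for illustration only)."""
    w = np.zeros(1 + X.shape[1])
    for _ in range(n_iter):
        errors = y - (X.dot(w[1:]) + w[0])
        w[1:] += eta * X.T.dot(errors)
        w[0] += eta * errors.sum()
    return w

# Synthetic, standardized data (an assumption; not the Housing dataset)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X.dot(np.array([2.0, -1.0, 0.5])) + 4.0 + rng.normal(scale=0.1, size=200)

# Least-squares solution, with a leading column of ones for the intercept
A = np.column_stack([np.ones(len(X)), X])
w_ols, *_ = np.linalg.lstsq(A, y, rcond=None)

# Distance between the gradient-descent weights and the least-squares weights
gaps = {n: np.linalg.norm(gd_weights(X, y, n, 1e-4) - w_ols)
        for n in (10, 100, 1000)}

for n, gap in gaps.items():
    print(n, gap)
```

On this synthetic data the gap shrinks as `n_iter` grows, so given enough iterations the two approaches arrive at essentially the same weights. Stopping early leaves the weights closer to their zero initialization, which can act like a form of regularization. Whether that effect accounts for the score difference on the Housing dataset is an assumption that proper hyperparameter tuning would need to check.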