# Linear Regression

In this tutorial we will implement a linear regression model. We will also implement a function that splits the available data into a training and a testing part.

## Problem Setting

We will use the Boston Housing Dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the city of Boston, Massachusetts, in 1978. Our goal is to predict the median value of the houses in a particular town in the Boston area given its attributes. Check the file `housing.names` for more information on the attributes.

In [2]:
!pip install matplotlib


In [3]:
%matplotlib inline
#import urllib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

# %aimport my_module
# my_module.py


In [5]:
!pip install scikit-learn


In [6]:
import sklearn

In [7]:
from sklearn import datasets

In [None]:
# In the notebook, type "datasets." and press Tab to browse the loaders; non-interactively:
dir(datasets)

In [None]:
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; this notebook ran on 0.19.1
boston = load_boston()
#testfile = urllib.URLopener()
#testfile.retrieve("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names", "housing.names")
# Wrap the feature matrix in a DataFrame with readable column names.
df = pd.DataFrame(boston.data)
df.columns = ['crime_rate','res_land_zoned','industry','charles_river','nox','avg_num_rooms','prop_bf_1940','dst_emply_center','rd_highway_idx','tax_rate','stdnt_tchr_ratio','prop_blacks','low_status_pct']
X = boston.data    # feature matrix, one row per town
y = boston.target  # median house value (the regression target)

In [None]:
print(boston.DESCR)

In [None]:
df.head(10)

In [None]:
df.shape

## Least Squares 

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
least_squares = LinearRegression(normalize=True)

In [None]:
least_squares.fit(X, y)
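
To make the idea behind the fit above concrete, here is a minimal sketch of ordinary least squares written with NumPy only. It is not scikit-learn's implementation; the synthetic data and the names `X_demo`, `y_demo`, `coef_demo` and `intercept_demo` are assumptions made for illustration. The intercept is handled by appending a column of ones to the feature matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration, not the Boston data):
# y = 2*x0 - 3*x1 + 5 + small noise
X_demo = rng.normal(size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 5.0 + 0.01 * rng.normal(size=200)

# Append a column of ones so the intercept is learned as an extra weight.
A = np.hstack([X_demo, np.ones((X_demo.shape[0], 1))])

# Solve min_w ||A w - y||^2 with a numerically stable least-squares solver.
w, residuals, rank, singular_values = np.linalg.lstsq(A, y_demo, rcond=None)

coef_demo, intercept_demo = w[:-1], w[-1]
print("weights:", coef_demo, "intercept:", intercept_demo)
```

The recovered weights should be close to the values used to generate the data, which is a quick sanity check before moving on to real data.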

In [None]:
!pip install seaborn

In [None]:
import seaborn as sns

sns.barplot(x=least_squares.coef_, y=df.columns, orient="h")
plt.title("Weights of least squares")

But how good is our model on the data?

In [None]:
least_squares.score(X, y)

In [None]:
LinearRegression??

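For regressors such as `LinearRegression`, the `score` method returns the coefficient of determination $R^2$ of the predictions, not the loss the model was fitted with (here the squared error). The `sklearn.metrics` package contains many other metrics.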
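
For regression models, scikit-learn documents `score` as the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$. The following is a small sketch of that formula in NumPy; the helper name `r2` and the toy values are assumptions for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])

print(r2(y_true_demo, y_true_demo))                       # perfect predictions: 1.0
print(r2(y_true_demo, np.full(4, y_true_demo.mean())))    # always predicting the mean: 0.0
```

A model that does worse than always predicting the mean gets a negative value, so $R^2$ is not bounded below by zero.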

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
y_pred = least_squares.predict(X) # computes the predictions
mean_squared_error(y, y_pred)

## An unpleasant surprise

In reality, however, our model has to make predictions on *unseen* data. Let's simulate this situation by holding out part of the data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [None]:
print(X_train.shape, X_test.shape, X.shape)
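
To show what such a split does under the hood, here is a minimal hand-written sketch. It is not scikit-learn's `train_test_split`; the function name `split_train_test`, its parameters and the toy arrays are hypothetical. It shuffles the row indices once and applies the same permutation to features and targets so that rows stay aligned.

```python
import numpy as np

def split_train_test(X, y, test_size=0.25, seed=None):
    """Shuffle the rows and hold out a fraction `test_size` as the test set."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)                 # one shuffled ordering of the row indices
    n_test = int(np.ceil(test_size * n))      # round the test set size up
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_demo is [2*i, 2*i + 1], and y_demo[i] = i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

Xtr, Xte, ytr, yte = split_train_test(X_demo, y_demo, test_size=0.25, seed=0)
print(Xtr.shape, Xte.shape)  # (7, 2) (3, 2)
```

Fixing the seed makes the split reproducible, which plays the same role as `random_state=42` in the cell above.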

In [None]:
least_squares.fit(X_train, y_train)

In [None]:
print("Train score: %.4f" % least_squares.score(X_train, y_train))
print("Test score: %.4f" % least_squares.score(X_test, y_test))

The test score is noticeably lower than the train score! This gap between performance on seen and unseen data is the effect of overfitting.

*Exercise*: Compute the MSE for the train and test set. How does it compare to the situation before?

In [None]:
# %load ./solutions/mse.py

## Other models...

Try other models, play with the parameters and try to beat the best score!

*Extra*: Inspect the features not used by the Lasso model and remove them from the feature set. Does this improve the performance of all models, not only the Lasso?

In [None]:
from sklearn.linear_model import Ridge       # L_2 penalty
from sklearn.linear_model import Lasso       # L_1 penalty
from sklearn.linear_model import ElasticNet  # mix of L_1 and L_2 penalties

In [None]:
lasso = Lasso(alpha=0.05, normalize=True)
lasso.fit(X_train, y_train)

In [None]:
lasso.coef_

In [None]:
print("Train score: %.4f" % lasso.score(X_train, y_train))
print("Test score: %.4f" % lasso.score(X_test, y_test))

In [None]:
used_features = lasso.coef_ != 0.0
print(used_features)

In [None]:
X_reduced = X[:, used_features]

X_train_red, X_test_red, y_train_red, y_test_red = train_test_split(X_reduced, y, random_state=42)

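The regularization strength $\lambda$ (the `alpha` parameter in scikit-learn) can be lowered, because the uninformative features have been removed and there is less noise for the model to fit.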
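
To illustrate how the regularization strength controls which features the Lasso keeps, here is a minimal coordinate-descent sketch in NumPy. It is a simplified stand-in, not scikit-learn's solver: it assumes centered data, no intercept, and a fixed number of sweeps, and the names `lasso_cd`, `soft_threshold`, `X_demo` and `y_demo` are hypothetical. It minimizes $\frac{1}{2n}\lVert y - Xw\rVert^2 + \lambda \lVert w \rVert_1$.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values with |z| <= t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=200):
    """Cyclic coordinate descent for the Lasso (assumes centered X and y)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution added back.
            r = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w

# Synthetic data (an assumption): only the first two of five features matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo -= X_demo.mean(axis=0)
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=100)
y_demo -= y_demo.mean()

for alpha in (0.01, 0.5, 10.0):
    w = lasso_cd(X_demo, y_demo, alpha)
    print("alpha=%.2f  non-zero weights: %d" % (alpha, np.count_nonzero(w)))
```

With a very large penalty every weight is thresholded to exactly zero, while a small penalty keeps the informative features close to their least-squares values.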

In [None]:
lasso = Lasso(alpha=0.01, normalize=True)
lasso.fit(X_train_red, y_train_red)

In [None]:
print("Train score: %.4f" % lasso.score(X_train_red, y_train_red))
print("Test score: %.4f" % lasso.score(X_test_red, y_test_red))

## Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

In [None]:
clf_pipe = make_pipeline(
    PolynomialFeatures(degree=5),
    ElasticNet(alpha=0.001, l1_ratio=0.3, normalize=True, random_state=42)
)

clf_pipe.fit(X_train, y_train)

In [None]:
print("Train score: %.4f" % clf_pipe.score(X_train, y_train))
print("Test score: %.4f" % clf_pipe.score(X_test, y_test))

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score(y_test, clf_pipe.predict(X_test))

In [None]:
clf_pipe.steps

In [None]:
clf_pipe.steps[1][1].coef_.shape
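
The shape printed by the last cell can be predicted by counting monomials. The sketch below assumes that `PolynomialFeatures` keeps its defaults for the bias column and interaction terms, so it generates every monomial of total degree at most 5, including the constant term. The helper names `n_poly_features` and `enumerate_monomials` are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb

def n_poly_features(n_features, degree):
    """Number of monomials of total degree <= degree, constant term included."""
    return comb(n_features + degree, degree)

def enumerate_monomials(n_features, degree):
    """List each monomial as a tuple of feature indices, e.g. (0, 0, 2) = x0^2 * x2."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(n_features), d))
    return terms

print(n_poly_features(2, 2))    # 6: 1, a, b, a^2, a*b, b^2
print(n_poly_features(13, 5))   # 8568 columns for the 13 housing features

# The closed-form count agrees with explicit enumeration.
print(len(enumerate_monomials(13, 5)) == n_poly_features(13, 5))  # True
```

With thousands of expanded features and only a few hundred samples, regularization such as the `ElasticNet` penalty in the pipeline is what keeps the fit from being hopelessly underdetermined.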