# Boston Housing Data

The Boston Housing dataset is a classic example used in machine learning. It contains information collected in the 1970s by the U.S. Census Service concerning housing in the Boston area. It consists of 506 instances, each with 13 features and a target (the median home price in \$1000s).

The objective of this tutorial is to build a linear model that predicts the home price from a set of features.

Feature Information:

    0. CRIM      per capita crime rate by town
    1. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    2. INDUS     proportion of non-retail business acres per town
    3. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    4. NOX       nitric oxides concentration (parts per 10 million)
    5. RM        average number of rooms per dwelling
    6. AGE       proportion of owner-occupied units built prior to 1940
    7. DIS       weighted distances to five Boston employment centres
    8. RAD       index of accessibility to radial highways
    9. TAX      full-value property-tax rate per \$10,000
    10. PTRATIO  pupil-teacher ratio by town
    11. B        1000(Bk - 0.63)^2 where Bk is the proportion of [people of African American descent] by town
    12. LSTAT    % lower status of the population

Target:

    0. Price     Median value of owner-occupied homes in $1000's

Examples of Ridge and Lasso regression in Python:

- https://machinelearningmastery.com/ridge-regression-with-python/
- https://towardsdatascience.com/ridge-and-lasso-regression-a-complete-guide-with-python-scikit-learn-e20e34bcbf0b

About k-fold cross-validation: https://machinelearningmastery.com/k-fold-cross-validation/

## Import libraries

Libraries needed for this exercise.

In [None]:
import numpy as np

from pandas import read_csv

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
%matplotlib inline

## 1. Load the dataset

We import the housing data from the scikit-learn library, then load it into a pandas DataFrame using `pd.DataFrame`.

a) How many rows and columns does this dataset have? Show the first 5 examples.

b) Check that there are no missing values. For this, use the `isnull().sum()` method chain.
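
As a minimal sketch of the kind of calls involved, here is the same inspection on a small hypothetical table. The name `toy_df` and its values are illustrative assumptions, not the real Boston data, and one missing value is planted on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for boston_df (illustrative values only)
toy_df = pd.DataFrame({
    "Price": [24.0, 21.6, 34.7, 33.4],
    "CRIM": [0.006, 0.027, np.nan, 0.032],  # one missing value planted on purpose
    "RM": [6.58, 6.42, 7.19, 7.00],
})

n_rows, n_cols = toy_df.shape               # (number of rows, number of columns)
print(n_rows, n_cols)

print(toy_df.head(2))                       # first 2 rows; head() defaults to 5

missing_per_column = toy_df.isnull().sum()  # number of NaN entries per column
print(missing_per_column)
```

A column with a non-zero count in `missing_per_column` would need cleaning before fitting a model.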

In [None]:
import pandas as pd
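# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.
from sklearn.datasets import load_boston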

# Load data
boston = load_boston()
boston_df = pd.DataFrame(boston.data, columns = boston.feature_names)
boston_df.insert(0, 'Price', boston.target)

# a) Show some examples


# b) Check if there are any missing values


## 2. Data exploration

We first assign the feature values to a numpy array $X$ and the target values to a numpy array $y$.

a) Check the dimensions of $X$ and $y$.

b) Make a histogram for each feature and for the target.

c) Show a scatter plot of each feature vs the target (optional: calculate the correlation coefficient).
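
For the optional correlation coefficient, one possible approach is `np.corrcoef`. The sketch below uses synthetic arrays (`X_toy`, `y_toy` are assumed names, not the Boston data) so that it runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for X (100 examples, 3 features) and y
X_toy = rng.normal(size=(100, 3))
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=100)

print(X_toy.shape, y_toy.shape)  # dimensions of features and target

# Pearson correlation coefficient of each feature column with the target
correlations = np.array(
    [np.corrcoef(X_toy[:, i], y_toy)[0, 1] for i in range(X_toy.shape[1])]
)
print(correlations)
```

Here the first feature should be strongly positively correlated with the target, the second negatively, and the third only weakly, since it does not enter `y_toy` at all.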

In [None]:
X = boston.data   # Features
y = boston.target # Target

# a) X and y dimensions


# b) Show histograms


# c) Scatter plots


## 3. Split data in train and test samples

We now split the full dataset into a training sample and a test sample using scikit-learn's `train_test_split`.

Check the size of each sample.
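
A short sketch of how the sample sizes can be inspected, using small synthetic arrays (`X_toy` and `y_toy` are assumed names) rather than the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 examples with 2 features each
X_toy = np.arange(40, dtype=float).reshape(20, 2)
y_toy = np.arange(20, dtype=float)

# Same call pattern as in the tutorial: 30% of the examples go to the test sample
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=3
)

print(X_tr.shape, X_te.shape)  # feature arrays
print(y_tr.shape, y_te.shape)  # target arrays
```

Fixing `random_state` makes the split reproducible from one run to the next.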

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)



## 4. Linear regression

Now let's construct a predictive model using linear regression:

$$y_{pred} = w_0 + \sum_{i=1}^{N=13} w_i X_i$$

For this we use the scikit-learn model described here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

a) Fit the linear regression model using the training dataset and print the parameters (weights and bias term) of the fit.

b) Get the predicted model output, `y_train_pred`, using the training dataset. Make a scatter plot of the true target value, `y_train`, vs the predicted value, `y_train_pred`.

c) Calculate the root mean square error (RMSE) between `y_train` and `y_train_pred`. For this you can use the scikit-learn function `mean_squared_error()`, which returns the mean squared error, and take the square root of its result.

d) Finally, apply the model to the test dataset: repeat steps b) and c) with the test sample. Do you think the model is acceptable? Is there an overfitting problem?
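
The fit, parameter read-out and RMSE steps can be sketched as follows on synthetic data. The names `toy_model`, `X_toy`, `y_toy` and the chosen true weights are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Synthetic data generated from a known linear model plus small noise
X_toy = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_toy = 4.0 + X_toy @ true_w + rng.normal(scale=0.1, size=200)

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)
print("weights:", toy_model.coef_)     # w_1 ... w_N
print("bias:", toy_model.intercept_)   # w_0

# Predictions and root mean square error on the same sample
y_toy_pred = toy_model.predict(X_toy)
rmse = np.sqrt(mean_squared_error(y_toy, y_toy_pred))
print("RMSE: %.3f" % rmse)
```

Comparing the RMSE obtained on the training sample with the one obtained on the test sample is one simple way to look for overfitting: a test RMSE much larger than the training RMSE would be a warning sign.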

In [None]:
# Fit of the model
model1 = LinearRegression()




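## 5. Ridge penalty (a.k.a. L2 regularization)

Let's see if a penalized linear algorithm can improve the modelling and prediction of the data. For this we use Ridge regression (also called L2 regularization), which keeps the same linear model for $y_{pred}$ but adds a penalty on the squared weights to the loss function minimized in the fit:

$$L(w) = \sum_{j=1}^{M} \left(y_j - y_{pred,j}\right)^2 + \lambda \sum_{i=1}^{N=13} w_i^2$$

Here the first sum runs over the $M$ training examples, and the bias term $w_0$ is not penalized.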

See the scikit-learn implementation https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

a) Train the model on the training dataset with a regularization parameter $\lambda = 1$.

b) Apply the algorithm to the test data and check the quality of the model. Do you see any improvement in the data modelling and prediction? Try other values of $\lambda$.

c) Optional: try the Lasso penalty: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html. Does it help?
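
To get a feeling for what the regularization parameter does, the sketch below fits `Ridge` for a few values of `alpha` on synthetic data and records the norm of the fitted weights. The data and the list of `alphas` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)

# Synthetic data: 50 examples, 5 features, two of which are irrelevant
X_toy = rng.normal(size=(50, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=50)

alphas = [0.01, 1.0, 100.0]
weight_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_toy, y_toy)
    # Euclidean norm of the fitted weights (bias term excluded)
    weight_norms.append(float(np.linalg.norm(ridge.coef_)))
    print("alpha = %6.2f  ->  |w| = %.3f" % (alpha, weight_norms[-1]))
```

A larger `alpha` shrinks the weights towards zero. Whether this improves the prediction on the test sample depends on the data, which is what the questions above ask you to check.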

In [None]:
# define model
model2 = Ridge(alpha=1)  # alpha is scikit-learn's name for the lambda hyperparameter



## 6. Estimating model performance: Cross-validation

Instead of splitting the dataset into a single training sample and a single test sample, we can use cross-validation to better estimate the performance of a fitted model. For this we apply the following [procedure](https://machinelearningmastery.com/k-fold-cross-validation/):
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
  - Take the group as test data set
  - Take the k-1 remaining groups as a training data set
  - Fit a model on the training set and evaluate it on the test set
  - Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
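
The steps above can be sketched by hand with numpy only. This is an illustrative re-implementation on synthetic data, using an ordinary least-squares fit via `np.linalg.lstsq`. It is not how scikit-learn implements cross-validation internally, and all names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data: 100 examples, 2 features
X_toy = rng.normal(size=(100, 2))
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=100)

k = 5
indices = rng.permutation(len(X_toy))   # shuffle the dataset randomly
folds = np.array_split(indices, k)      # split into k groups

fold_rmse = []
for i in range(k):
    test_idx = folds[i]                 # this group is the test set
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Fit a linear model (with a bias column) on the k-1 remaining groups
    A_train = np.column_stack([np.ones(len(train_idx)), X_toy[train_idx]])
    w, *_ = np.linalg.lstsq(A_train, y_toy[train_idx], rcond=None)

    # Evaluate on the held-out group, keep the score and discard the model
    A_test = np.column_stack([np.ones(len(test_idx)), X_toy[test_idx]])
    residuals = y_toy[test_idx] - A_test @ w
    fold_rmse.append(float(np.sqrt(np.mean(residuals ** 2))))

# Summarize the scores
print("Mean RMSE: %.2f +- %.2f" % (np.mean(fold_rmse), np.std(fold_rmse)))
```

Each example is used exactly once for testing, so the spread of the per-fold scores gives an idea of how much the performance estimate depends on the particular split.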

a) Look at the example below. What are the different parameters? What does the output correspond to?

b) Apply cross-validation to the other models. Can you say whether one performs better than the others?


In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

cv = KFold(n_splits=5, shuffle=False)
scores = cross_val_score(model1, X, y, scoring='neg_mean_squared_error', cv=cv)
scores = np.absolute(scores)
print('Mean RMS: %.2f +- %.2f' % (np.mean(np.sqrt(scores)),np.std(np.sqrt(scores))))