> **Note:** In most sessions you will be solving exercises posed in a Jupyter notebook like this one. Because you are cloning a GitHub repository that only we can push to, you should **NEVER EDIT** any of the files you pull from GitHub. Instead, either make a new notebook and write your solutions there, or **make a copy of this notebook and save it somewhere else** on your computer, outside the `sds` folder that you cloned, and write your answers in the copy. If you edit the notebook you pulled from GitHub, those edits (possibly your solutions to the exercises) may be overwritten and lost the next time you pull from GitHub. This is important, so don't hesitate to ask if it is unclear.

# Exercise Set 13: Model building process and model selection

*Morning, August 21, 2018*

In Exercise Set 13 we will investigate how to build machine learning models using a formalized pipeline from preprocessed (i.e. tidy) data to a model.

We import our standard packages. We are not interested in seeing the convergence warnings from scikit-learn, so we suppress them for now.

In [73]:
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)

import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import seaborn as sns

## Model validation

In what follows we will use the "train" data for two purposes. First, we use it to perform model selection. Then we estimate/train the selected model on all the training data.


> **Ex. 13.1.0:** Begin by reloading the housing dataset from Ex. 12.2.0 using the code below. 

In [74]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

cal_house = fetch_california_housing()    
X = pd.DataFrame(data=cal_house['data'], 
                 columns=cal_house['feature_names'])\
             .iloc[:,:-2]
y = cal_house['target']
print(X.head())
print(y[0:10])

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467
[4.526 3.585 3.521 3.413 3.422 2.697 2.992 2.414 2.267 2.611]


> **Ex. 13.1.1:** Make a for loop with 10 iterations where you:
1. Split the input data into train and test, where the test sample should be one third of the data. (Set a new random state for each iteration of the loop, so each iteration makes a different split.)
2. Further split the training data into two evenly sized bins; the first is for training models and the other is for validation.
3. Train a linear regression model on the sub-training data. Compute the RMSE for out-of-sample predictions on the test and validation data. Save the RMSE.

> You should now have a 10x2 DataFrame with 10 RMSE values from both the test data set and the validation data set. Compute descriptive statistics of the RMSE for the out-of-sample predictions on test and validation data. Are they similar?
>   They are hopefully pretty similar. This shows us that we can split the train data and use the split to fit and validate the model.

>> *Hint*: you can reuse any code used to solve exercises 12.2.X.
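The exercise asks for the RMSE, which is the square root of the mean squared error. As a minimal sketch of that relationship, the helper `rmse` and the toy numbers below are made up for illustration and are not part of the exercise data:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy numbers (hypothetical, not taken from the housing data)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

mse_value = np.mean((y_true - y_pred) ** 2)   # (1 + 0 + 4 + 0) / 4 = 1.25
rmse_value = rmse(y_true, y_pred)             # square root of 1.25
print(mse_value, rmse_value)
```

Because the RMSE is on the same scale as the target, it is often easier to interpret than the MSE.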

In [75]:
# [Answer to Ex. 13.1.1]
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error as mse

pipe_linear = make_pipeline(LinearRegression())

RMSE_val = []
RMSE_test = []

for i in range(1,11):
    # splitting into development (2/3) and test data (1/3)
    X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=1/3, random_state=i)
    # splitting development into train (1/3) and validation (1/3)
    X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=1/2, random_state=i)
    pipe_linear.fit(X_train, y_train)
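    # NOTE: mse(...) returns the mean squared error, so the values stored here are MSEs;
    # wrap the call in np.sqrt(...) to obtain the RMSE that the exercise asks for.
    RMSE_val.append(round(mse(y_val, pipe_linear.predict(X_val)), 3))
    RMSE_test.append(round(mse(y_test, pipe_linear.predict(X_test)), 3))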
    
        
RMSE_df = pd.DataFrame({'Validation':RMSE_val, 'Test':RMSE_test})
print(RMSE_df)

   Validation   Test
0       0.635  0.616
1       0.610  0.661
2       0.628  0.632
3       0.605  0.618
4      15.205  6.274
5       1.512  4.746
6       0.642  0.625
7       0.635  0.598
8       0.614  0.629
9       0.659  0.728


## Model building

> **Ex. 13.1.2:** Construct a model building pipeline which 

> 1. adds polynomial features without bias;
> 1. scales the features to mean zero and unit standard deviation;
> 1. estimates a Lasso model.

>> *Hint:* a modelling pipeline can be constructed with `make_pipeline` from `sklearn.pipeline`.

In [76]:
# [Answer to Ex. 13.1.2]

perform_val = []
perform_test = []
lambdas = np.logspace(-4, 4, 33)

for lambda_ in lambdas:
    pipe_lasso = make_pipeline(#PolynomialFeatures(include_bias=False),   # Adds terms of 2nd (maybe higher) order.
                               StandardScaler(),   # Scales the data
                               Lasso(alpha=lambda_, random_state=3))  
    pipe_lasso.fit(X_train, y_train)
    y_pred_val = pipe_lasso.predict(X_val)
    y_pred_test = pipe_lasso.predict(X_test)
    
    perform_val.append(mse(y_pred_val, y_val))
    perform_test.append(mse(y_pred_test, y_test))
    
#hyperparam_perform = pd.Series(perform,index=lambdas)

perform_df = pd.DataFrame({'Validation' : perform_val, 'Test' : perform_test})
print(perform_val)
print(perform_test)
print(perform_df.head(20))

#print(np.logspace(-4, 4, 33))

[0.658726546734106, 0.6585731534457759, 0.658301706410954, 0.6578239462427644, 0.6569871334354807, 0.6555417894145541, 0.6531093268473812, 0.6492116568364986, 0.643617326355236, 0.6378877034777192, 0.6407652919879558, 0.6411618733783598, 0.6548707182673363, 0.6997485446809155, 0.7782771498636788, 0.9888705565654492, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891, 1.293734309577891]
[0.7277304108895697, 0.7274222620245782, 0.7268763202665885, 0.7259113018194504, 0.7242159161951764, 0.7212646337301869, 0.7162137251988034, 0.7078575888747219, 0.6949903527450726, 0.678408729509855, 0.6704357627634729, 0.6636140858926931, 0.6839262961063322, 0.7315738927553227, 0.8136566843037268, 1.0348028168741916, 1.3558249085474579, 1.3558249085474579,

## Cross validation
In machine learning, we have two types of parameters: those that are learned from
the training data, for example, the weights in logistic regression, and the parameters
of a learning algorithm that are optimized separately. The latter are the tuning
parameters, also called *hyperparameters*, of a model, for example, the regularization
parameter in logistic regression or the depth parameter of a decision tree.
  
   
When we want to optimize over both normal parameters and hyperparameters, we do this using nested loops (two-layered cross validation). In the outer loop we vary the hyperparameters, and in the inner loop we do cross validation for the model with the specific selection of hyperparameters. This way we can find the model with the lowest mean MSE.
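The two-layered loop described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (`X_demo`, `y_demo` and `lambdas_demo` are made-up names, not the housing data), and the inner loop is delegated to `cross_val_score` from `sklearn.model_selection` rather than written out by hand.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.5, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=200)

lambdas_demo = np.logspace(-3, 1, 5)
mean_mse = {}

# Outer loop: one candidate value of the hyperparameter at a time
for lambda_ in lambdas_demo:
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=lambda_))
    # Inner loop (handled by cross_val_score): 5-fold cross validation.
    # The scorer returns the negative MSE, so we flip the sign below.
    neg_mse = cross_val_score(pipe, X_demo, y_demo, cv=5,
                              scoring='neg_mean_squared_error')
    mean_mse[lambda_] = -neg_mse.mean()

# The hyperparameter value with the lowest mean MSE across folds
best_lambda = min(mean_mse, key=mean_mse.get)
print(best_lambda, mean_mse[best_lambda])
```

Fitting the scaler inside the pipeline means it is re-estimated on each training fold, so no information from the validation fold leaks into the preprocessing.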

> **Ex. 13.1.3:**
Run a Lasso regression using the pipeline from `Ex 13.1.2`. In the outer loop, search through the lambdas specified below. 
In the inner loop, perform 5-fold cross validation on the selected model and store the MSE for each fold, then compute the *average* MSE across the folds for each lambda. Which lambda gives the lowest average test MSE?


> ```python 
lambdas =  np.logspace(-4, 4, 12)
```

>> *Hint:* `KFold` in `sklearn.model_selection` may be useful.

In [103]:

print(type(y_dev))
print(len(y_dev))
print(type(X_dev))
print(len(X_dev))
print(X_dev)
print(lambdas)
print(len(mseCV))


<class 'numpy.ndarray'>
13760
<class 'numpy.ndarray'>
13760
[[3.88640000e+00 3.30000000e+01 4.21751412e+00 9.35028249e-01
  1.57100000e+03 4.43785311e+00]
 [4.04260000e+00 1.50000000e+01 6.88480392e+00 1.10294118e+00
  1.26700000e+03 3.10539216e+00]
 [4.99900000e-01 2.90000000e+01 2.37327189e+00 1.05529954e+00
  2.69000000e+03 1.23963134e+01]
 ...
 [5.53360000e+00 6.00000000e+00 4.90533563e+00 9.65576592e-01
  2.16000000e+03 3.71772806e+00]
 [2.20590000e+00 3.50000000e+01 2.74849095e+00 9.97987928e-01
  2.16000000e+03 4.34607646e+00]
 [2.67630000e+00 1.60000000e+01 3.95301028e+00 1.09985316e+00
  1.67400000e+03 2.45814978e+00]]
[1.00000000e-04 5.33669923e-04 2.84803587e-03 1.51991108e-02
 8.11130831e-02 4.32876128e-01 2.31012970e+00 1.23284674e+01
 6.57933225e+01 3.51119173e+02 1.87381742e+03 1.00000000e+04]
12


In [138]:
# [Answer to Ex. 13.1.3]
from sklearn.model_selection import KFold

# X_dev is a DataFrame after train_test_split; convert it to an array so that
# the integer fold indices from KFold can be used directly as X_dev[idx]
X_dev = np.array(X_dev)

lambdas =  np.logspace(-4, 4, 12)
print(lambdas)
print('------')
kfolds = KFold(n_splits = 5)
mseCV = []

# For each lambda a model is fitted for each of the K=5 folds. The models are used to predict
# the validation set of y's using the validation set of X's. The mean squared errors are calculated
# and appended to mseCV.
for lambda_ in lambdas:
    pipe_lassoCV = make_pipeline(PolynomialFeatures(degree=3, include_bias=False),
                                 StandardScaler(),
                                 Lasso(alpha=lambda_, random_state=1))    
    mseCV_ = []
    
    for train_idx, val_idx in kfolds.split(X_dev, y_dev):
        X_train, y_train = X_dev[train_idx], y_dev[train_idx]
        X_val, y_val = X_dev[val_idx], y_dev[val_idx] 
        pipe_lassoCV.fit(X_train, y_train)
        mseCV_.append(mse(pipe_lassoCV.predict(X_val), y_val))    
    mseCV.append(mseCV_)

# Printing first list of MSE's in mseCV
print(mseCV[0])

# Creating data frame containing smallest average MSE and corresponding lambdas
optimalCV = pd.DataFrame(mseCV, index=lambdas).mean(axis=1).nsmallest(1)
optimalCV = pd.DataFrame(optimalCV, columns=['MSE'])
optimalCV.index.name = 'Lambda'

print(optimalCV)

[1.00000000e-04 5.33669923e-04 2.84803587e-03 1.51991108e-02
 8.11130831e-02 4.32876128e-01 2.31012970e+00 1.23284674e+01
 6.57933225e+01 3.51119173e+02 1.87381742e+03 1.00000000e+04]
------
[26.727605302803166, 0.49873874836689375, 0.5643463612209814, 0.5137438688379456, 0.6920419086209986]
               MSE
Lambda            
0.081113  0.658859


> **Ex. 13.1.4:** *Automated Cross Validation in one dimension:* 
Now we want to repeat exercise 13.1.3 in a more automated fashion. 
When you are doing cross validation with one hyperparameter, you can automate the process by using `validation_curve` from `sklearn.model_selection`. Use this function to search through the values of lambda and find the value of lambda which gives the lowest test error.  

> Check if you got the same output for the manual implementation (Ex. 13.1.3) and the automated implementation (Ex. 13.1.4).

> *Hint:* You should set the scoring parameter to `neg_mean_squared_error`.

> BONUS: Plot the average MSE-test and MSE-train against the different values of lambda. (*Hint*: Use logarithmic axes, and lambda as index)
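As a rough sketch of how `validation_curve` is called, the example below uses synthetic data rather than the housing data; `X_demo`, `y_demo` and `lambdas_demo` are made-up names. It assumes a pipeline built with `make_pipeline`, which names each step after its lowercased class, so the Lasso penalty is addressed as `lasso__alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

pipe = make_pipeline(StandardScaler(), Lasso())
lambdas_demo = np.logspace(-3, 1, 5)

# One row per parameter value, one column per fold
train_scores, test_scores = validation_curve(
    pipe, X_demo, y_demo,
    param_name='lasso__alpha',
    param_range=lambdas_demo,
    scoring='neg_mean_squared_error',
    cv=5,
)

# Scores are negative MSE, so flip the sign before averaging over folds
mean_test_mse = -test_scores.mean(axis=1)
best_lambda = lambdas_demo[mean_test_mse.argmin()]
print(best_lambda)
```

The returned arrays can be wrapped in a DataFrame indexed by the lambdas, which makes the bonus plot straightforward.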

In [11]:
# [Answer to Ex. 13.1.4]

When you have *more than one* hyperparameter, you will want to fit the model to all the possible combinations of hyperparameters. This is done in an approach called `Grid Search`, which is implemented in `sklearn.model_selection` as `GridSearchCV`.
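A minimal sketch of a grid search over two hyperparameters might look like the following. The data are synthetic and the grid (polynomial degree and Lasso penalty) is chosen only for illustration; names such as `X_demo` and `param_grid` are assumptions, not part of the exercises.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic term (an assumption for illustration only)
rng = np.random.RandomState(2)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=150)

pipe = make_pipeline(PolynomialFeatures(include_bias=False),
                     StandardScaler(),
                     Lasso(max_iter=10000))

# Keys follow the '<step name>__<parameter>' convention of sklearn pipelines
param_grid = {
    'polynomialfeatures__degree': [1, 2],
    'lasso__alpha': np.logspace(-3, 1, 5),
}

# Every combination in the grid (2 x 5 = 10) is evaluated with 5-fold CV
gs = GridSearchCV(pipe, param_grid, scoring='neg_mean_squared_error', cv=5)
gs.fit(X_demo, y_demo)

print(gs.best_params_)
print(-gs.best_score_)   # mean cross-validated MSE of the best combination
```

The number of fits grows multiplicatively with the size of each grid dimension, which is why larger searches can take a while.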

> **Ex. 13.1.5:** To get to know `Grid Search` we want to implement it in one dimension. Using `GridSearchCV`, implement the Lasso with the same lambdas as before (`lambdas =  np.logspace(-4, 4, 12)`), 10-fold CV and (negative) mean squared error as the scoring variable. Which value of lambda gives the lowest test error? 

In [14]:
# [Answer to Ex. 13.1.5]

> **Ex. 13.1.6 BONUS** Expand the Lasso pipe from the last exercise with a Principal Component Analysis (PCA), and expand the Grid Search to search in two dimensions (both along the values of lambda and the number of principal components, `n_components`). Is `n_components` a hyperparameter? Which hyperparameters does the Grid Search select as the best?

> NB. This might take a while to calculate. 

In [16]:
# [Answer to Ex. 13.1.6]

> **Ex. 13.1.7 BONUS** Repeat the previous exercise, now using `RandomizedSearchCV` with 20 iterations.

In [18]:
# [Answer to Ex. 13.1.7]



> **Ex. 13.1.8 BONUS** Read about nested cross validation. How might we implement this in answer 13.1.6?
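One common way to set up nested cross validation in scikit-learn is to pass a `GridSearchCV` object to `cross_val_score`, so that the hyperparameter search is repeated inside every outer fold. The sketch below is an illustration on synthetic data, not the intended answer: it leaves out the PCA step for brevity, and the names `inner_cv`, `outer_cv`, `X_demo` and `y_demo` are made up.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (an assumption for illustration only)
rng = np.random.RandomState(3)
X_demo = rng.normal(size=(120, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=120)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # selects the hyperparameter
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # estimates generalization error

pipe = make_pipeline(StandardScaler(), Lasso())
search = GridSearchCV(pipe, {'lasso__alpha': np.logspace(-3, 1, 5)},
                      scoring='neg_mean_squared_error', cv=inner_cv)

# For each outer fold, the grid search is refit on the outer training part
# only, and the selected model is scored on the held-out outer fold.
outer_scores = cross_val_score(search, X_demo, y_demo,
                               scoring='neg_mean_squared_error', cv=outer_cv)
outer_mse = -outer_scores
print(outer_mse.mean())
```

The outer scores estimate how well the whole procedure (search plus fit) generalizes, rather than the performance of one fixed hyperparameter value.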


In [20]:
# [Answer to Ex. 13.1.8]