# Case Study 1

In [52]:
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
import numpy as np

## Reading Data
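The Boston dataset is a small dataset of 506 samples and 13 features, used for regression problems. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so this notebook requires an older release.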

In [53]:
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data['data'], 
                                                    data['target'],
                                                    random_state=0)


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h
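
As a quick check of the split sizes, the sketch below applies `train_test_split` with its default `test_size` of 0.25 to placeholder arrays with the same shape as the dataset described above (506 samples, 13 features). The zero-filled arrays and the short variable names are stand-ins for illustration, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the dataset (values are irrelevant here).
X = np.zeros((506, 13))
y = np.zeros(506)

# With no test_size given, 25% of the samples are held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

print(X_tr.shape, X_te.shape)
```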

In [54]:
from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

## Setting up the pipeline
The pipeline we are going to set up is composed of the following tasks:

- Data Normalization
- Dimensionality Reduction: Principal Component Analysis (PCA) and a univariate feature selection algorithm as possible candidates.
- Regression

We start by implementing the steps manually, without scikit-learn's dedicated pipeline module, to highlight how many repetitive operations are needed. We instantiate a single estimator for every step of the pipeline:

### Without a Pipeline
(not efficient)

In [55]:
scaler = StandardScaler()
pca = PCA()
ridge = Ridge()

In [56]:
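# Caution: these assignments overwrite X_train with the scaled, PCA-transformed data,
# so every later cell that uses X_train sees the transformed features.
X_train = scaler.fit_transform(X_train)   # standardize each feature
X_train = pca.fit_transform(X_train)      # project onto the principal components
ridge.fit(X_train, y_train);              # fit the regressor on the transformed data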
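
The repetition becomes clearer at prediction time: every fitted step has to be re-applied to new data by hand, in the same order, calling `transform` rather than `fit_transform`. The sketch below uses synthetic data from `make_regression` as a stand-in for the housing data and keeps the raw arrays under separate names (all variable names here are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 13 features.
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaler, pca, ridge = StandardScaler(), PCA(), Ridge()

# Training: fit each step on the output of the previous one.
X_tr_scaled = scaler.fit_transform(X_tr)
X_tr_reduced = pca.fit_transform(X_tr_scaled)
ridge.fit(X_tr_reduced, y_tr)

# Prediction: re-apply the already fitted steps (transform only, no refitting).
X_te_reduced = pca.transform(scaler.transform(X_te))
y_pred = ridge.predict(X_te_reduced)
manual_r2 = ridge.score(X_te_reduced, y_te)
```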

### With a Pipeline

The pipeline is just an ordered list of steps, each with a name and a corresponding object instance. The `Pipeline` class relies on the common interface that every scikit-learn estimator implements, with methods such as `fit`, `transform` and `predict`.
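
To illustrate that shared interface, here is a minimal sketch of a custom transformer that can be dropped into a pipeline. `MeanCenterer` is a hypothetical class written for this example, not part of scikit-learn; it only needs `fit` and `transform` (with `fit_transform` supplied by `TransformerMixin`) for `Pipeline` to chain it with a final estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: subtracts the per-feature training mean."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self  # fit returns self so that calls can be chained

    def transform(self, X):
        return np.asarray(X) - self.mean_


# Random stand-in data with a known linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

demo_pipe = Pipeline([("center", MeanCenterer()), ("regressor", Ridge())])
demo_pipe.fit(X, y)            # fits and applies the transformer, then fits Ridge
y_pred = demo_pipe.predict(X)  # applies transform, then Ridge.predict
```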



In [57]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('reduce_dim', PCA()),
        ('regressor', Ridge())
        ])

## Use the Pipeline

In [58]:
pipe = pipe.fit(X_train, y_train)

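You can inspect the fitted attributes of each step in the pipeline. The line below returns the per-feature means learned by the `StandardScaler` step for its 13 input features. They are numerically zero here because `X_train` had already been standardized and PCA-transformed by the manual cells above before the pipeline was fitted. That mismatch with the untransformed `X_test` also explains why the test scores reported in this notebook are strongly negative.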

In [59]:
pipe.steps[0][1].mean_

array([ 4.68695736e-18,  4.68695736e-17,  1.17173934e-17,  3.28087015e-17,
        2.81217442e-17, -4.68695736e-17,  8.90521898e-17,  2.46065261e-17,
       -6.09304457e-17,  1.17173934e-17, -1.75760901e-17, -4.68695736e-18,
        3.28087015e-17])
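
Indexing `pipe.steps` by position works, but a step can also be looked up by name through `named_steps`, or by indexing the pipeline with the step name. The sketch below uses random stand-in data and the illustrative name `demo_pipe`, and compares the scaler's `mean_` with the column means of the data the pipeline was fitted on.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 13))  # stand-in features
y = rng.normal(size=100)                            # stand-in target

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reduce_dim", PCA()),
    ("regressor", Ridge()),
]).fit(X, y)

by_position = demo_pipe.steps[0][1]        # steps holds (name, estimator) tuples
by_name = demo_pipe.named_steps["scaler"]  # same object, looked up by step name
by_index = demo_pipe["scaler"]             # Pipeline also supports indexing by name

means_match = np.allclose(by_name.mean_, X.mean(axis=0))
```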

In [60]:
print('Testing score: ', pipe.score(X_test, y_test))

Testing score:  -4035.603930701569
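
For regressors, `score` returns the coefficient of determination R², whose best value is 1.0 and which turns negative when the model does worse than always predicting the mean of `y`. The sketch below, on synthetic stand-in data, compares a pipeline's `score` with `r2_score` applied to its predictions (the names `demo_pipe`, `pipe_r2` and `manual_r2` are illustrative).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

demo_pipe = Pipeline([("scaler", StandardScaler()), ("regressor", Ridge())])
demo_pipe.fit(X_tr, y_tr)

# Pipeline.score transforms X_te with the fitted steps, then calls Ridge.score (R^2).
pipe_r2 = demo_pipe.score(X_te, y_te)
manual_r2 = r2_score(y_te, demo_pipe.predict(X_te))
```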


## Pipeline Tuning

Hyperparameters are parameters that are not learned during fitting; they are set beforehand and tuned to maximize model performance on validation data, for example through a grid search.
Let's start with a simple example, where we aim to optimize:
- the number of components selected by the PCA
- the regularization factor of the ridge regression model

We are going to use the `GridSearchCV` class from `sklearn.model_selection`.

Concerning PCA, we want to evaluate how the score varies with the number of components, from 1 to 10:

In [61]:
n_features_to_test = np.arange(1, 11)

As for the regularization factor (in the ridge regression), we consider an exponential range of values:

In [69]:
alpha_to_test = 2.0**np.arange(-6, +6)
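
Since `np.arange` excludes its upper bound, the exponents run from -6 to 5, which gives 12 candidate values. A short sketch to make the range concrete (`alpha_grid` is an illustrative name):

```python
import numpy as np

alpha_grid = 2.0 ** np.arange(-6, 6)  # exponents -6, -5, ..., 5

print(len(alpha_grid))                     # number of candidate values
print(alpha_grid.min(), alpha_grid.max())  # smallest and largest alpha
```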

We want to evaluate every combination of these parameters together with two candidate scalers.
First, we define a dictionary with all the parameters to combine in the evaluation:

In [70]:
params = {'reduce_dim__n_components': n_features_to_test,
          'regressor__alpha': alpha_to_test,
          'scaler': [StandardScaler(), RobustScaler()]}

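Note the convention used to name the parameters:
<ol>
<li> the name of the pipeline step,</li>
<li> followed by a double underscore (<code>__</code>),</li>
<li> and finally the name of the parameter within the step.</li>
</ol>

A key that is only a step name, such as `'scaler'`, replaces the whole step with each of the listed estimators.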
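
The naming convention can be checked directly on a pipeline: `get_params` lists every name that a grid may use, and `set_params` accepts the same names. A minimal sketch, rebuilding the three-step pipeline under the illustrative name `demo_pipe`:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reduce_dim", PCA()),
    ("regressor", Ridge()),
])

# Every tunable name, including the nested <step>__<parameter> ones.
available = demo_pipe.get_params()
print("reduce_dim__n_components" in available, "regressor__alpha" in available)

# Nested names set a parameter inside a step...
demo_pipe.set_params(reduce_dim__n_components=5, regressor__alpha=8.0)
# ...while a bare step name replaces the whole step.
demo_pipe.set_params(scaler=RobustScaler())
```
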
The optimization is invoked as follows:

In [71]:
from sklearn.model_selection import GridSearchCV
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))

Fitting 5 folds for each of 240 candidates, totalling 1200 fits
Final score is:  -33516.40161736743
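
The candidate count reported in the log can be reproduced with `ParameterGrid`, which enumerates the combinations of a parameter dictionary. The sketch below rebuilds the grid under the illustrative name `grid` and multiplies by the 5 folds that `GridSearchCV` uses by default.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import RobustScaler, StandardScaler

grid = {
    "reduce_dim__n_components": np.arange(1, 11),  # 10 values
    "regressor__alpha": 2.0 ** np.arange(-6, 6),   # 12 values
    "scaler": [StandardScaler(), RobustScaler()],  # 2 values
}

n_candidates = len(ParameterGrid(grid))  # 10 * 12 * 2 combinations
n_fits = n_candidates * 5                # one fit per candidate and fold

print(n_candidates, n_fits)
```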


In [72]:
gridsearch.best_params_

{'reduce_dim__n_components': 10,
 'regressor__alpha': 8.0,
 'scaler': StandardScaler()}
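
Besides `best_params_`, a fitted search exposes the cross-validated score of the winning candidate (`best_score_`) and the full table of results (`cv_results_`). The sketch below runs a deliberately small grid on synthetic stand-in data; the names `demo_pipe`, `small_grid` and `search` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=13, noise=10.0, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reduce_dim", PCA()),
    ("regressor", Ridge()),
])
small_grid = {
    "reduce_dim__n_components": [2, 6, 10],
    "regressor__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(demo_pipe, small_grid).fit(X, y)

# Mean cross-validated R^2 of the winning candidate.
print(search.best_score_)

# cv_results_ holds one entry per candidate; rank 1 marks the best one.
results = search.cv_results_
best_row = int(np.argmin(results["rank_test_score"]))
print(results["params"][best_row], results["mean_test_score"][best_row])
```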

## Pipeline Tuning (advanced)
The same approach can be applied to the dimensionality reduction step, for example to choose between `PCA` and `SelectKBest`. The only complication is that the two use different parameter names: `PCA` is controlled by `n_components`, while `SelectKBest` is controlled by `k`.
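
To make the difference concrete, the sketch below reduces synthetic stand-in data to three columns with each method. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_regression(n_samples=100, n_features=13, random_state=0)

# PCA builds 3 new features as linear combinations of all 13 inputs.
X_pca = PCA(n_components=3).fit_transform(X)

# SelectKBest keeps the 3 original columns with the highest f_regression scores.
selector = SelectKBest(f_regression, k=3).fit(X, y)
X_kbest = selector.transform(X)
kept_columns = selector.get_support(indices=True)

print(X_pca.shape, X_kbest.shape, kept_columns)
```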

Fortunately, `GridSearchCV` also accepts a list of parameter dictionaries, each of which is expanded into its own grid, which solves this issue:


In [73]:
scalers_to_test = [StandardScaler(), RobustScaler()]
params = [
        {'scaler': scalers_to_test,
         'reduce_dim': [PCA()],
         'reduce_dim__n_components': n_features_to_test,
         'regressor__alpha': alpha_to_test},

        {'scaler': scalers_to_test,
         'reduce_dim': [SelectKBest(f_regression)],
         'reduce_dim__k': n_features_to_test,
         'regressor__alpha': alpha_to_test}
        ]

In [74]:
from sklearn.model_selection import GridSearchCV
gridsearch = GridSearchCV(pipe, params, verbose=1).fit(X_train, y_train)
print('Final score is: ', gridsearch.score(X_test, y_test))

Fitting 5 folds for each of 480 candidates, totalling 2400 fits
Final score is:  -3504.192468048363


In [68]:
gridsearch.best_params_

{'reduce_dim': SelectKBest(score_func=<function f_regression at 0x12b01b0d0>),
 'reduce_dim__k': 10,
 'regressor__alpha': 16.0,
 'scaler': StandardScaler()}
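
After a search over a list of grids, `best_estimator_` holds the whole pipeline refitted with the winning settings, so the selected reduction step can be inspected directly. The sketch below uses a small grid on synthetic stand-in data and handles either outcome, since which reducer wins depends on the data; the names `demo_pipe`, `small_grid`, `search` and `winner` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=13, n_informative=4,
                       noise=10.0, random_state=0)

demo_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("reduce_dim", PCA()),
    ("regressor", Ridge()),
])
small_grid = [
    {"reduce_dim": [PCA()], "reduce_dim__n_components": [2, 4, 8]},
    {"reduce_dim": [SelectKBest(f_regression)], "reduce_dim__k": [2, 4, 8]},
]
search = GridSearchCV(demo_pipe, small_grid).fit(X, y)

# best_estimator_ is the full pipeline, refitted on all of X with the best settings.
winner = search.best_estimator_.named_steps["reduce_dim"]
print(type(winner).__name__)

if isinstance(winner, SelectKBest):
    # Indices of the input columns that the selector kept.
    print(winner.get_support(indices=True))
else:
    # Fraction of variance captured by each retained principal component.
    print(winner.explained_variance_ratio_)
```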