### Codio Activity 9.5: Using `StandardScaler`

This activity focuses on using the `StandardScaler` for scaling the data by converting it to $z$-scores.  To begin, you will scale data using just NumPy functions.  Then, you will use the scikit-learn transformer and incorporate it into a `Pipeline` with a `Ridge` regression model.  

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

### The Dataset

For this example, we will use a housing dataset that is part of the scikit-learn datasets module.  The dataset is chosen because it has multiple features on very different scales.  It is loaded and explored below. Your task is to predict `MedHouseVal` using all the other features after scaling and applying regularization with the `Ridge` estimator. 

In [2]:
cali = fetch_california_housing(as_frame=True)

In [3]:
cali.frame.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [5]:
print(cali.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

In [6]:
cali.frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB


In [8]:
X = cali.frame.drop('MedHouseVal', axis = 1)

In [9]:
y = cali.frame['MedHouseVal']

In [15]:
X_train,X_test,y_train,y_test = train_test_split(X,y, random_state = 42, test_size = 0.3)

### Problem 1

#### Scaling the Train data

Recall that **standard scaling** consists of subtracting the feature mean from each data point and then dividing by the standard deviation of the feature.  Below, you are to scale `X_train` by subtracting each column's mean and dividing by each column's standard deviation.  Be sure to use the `numpy` mean and standard deviation functions, computing one statistic per column and otherwise keeping the default settings.  

Assign your results to `X_train_scaled` below.  
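As a rough illustration of the formula above, the sketch below standardizes a small made-up array with NumPy. The array `toy` and its values are hypothetical and are not part of the housing data; the point is only that passing `axis=0` yields one mean and one standard deviation per column.

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0],
                [4.0, 400.0]])

# axis=0 computes one statistic per column (feature).
col_means = toy.mean(axis=0)
col_stds = toy.std(axis=0)  # NumPy's default is ddof=0 (population std)

# z-score: subtract the column mean, divide by the column std.
toy_scaled = (toy - col_means) / col_stds

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns should be on the same footing even though the raw values differ by a factor of 100.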

In [11]:
# axis=0 gives one mean and one standard deviation per column; without it,
# np.mean on a DataFrame can collapse to a single overall mean in recent pandas versions.
X_train_scaled = (X_train - np.mean(X_train, axis=0)) / np.std(X_train, axis=0)
X_train_scaled.head()


### Problem 2

#### Scale the test data

To scale the test data, use the mean and standard deviation of the **training** data.  In practice, you would not have seen the test data, so you would not be able to compute its mean and standard deviation.  Instead, you assume it is similar to your training data and use what you know to scale it.  

Assign the response as an array to `X_test_scaled` below.
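To make the idea concrete, here is a minimal sketch with a made-up one-feature split (the names `toy_train` and `toy_test` are hypothetical). It reuses the training statistics on the test rows rather than recomputing them.

```python
import numpy as np

# Hypothetical toy split with a single feature.
toy_train = np.array([[0.0], [2.0], [4.0]])
toy_test = np.array([[2.0], [6.0]])

# Statistics come from the training rows only.
train_mean = toy_train.mean(axis=0)
train_std = toy_train.std(axis=0)

# Reuse the training statistics on the test rows.
toy_test_scaled = (toy_test - train_mean) / train_std

print(toy_test_scaled)
print(toy_test_scaled.mean(axis=0))  # generally not 0, and that is expected
```

Because the test rows are scaled with training statistics, their scaled mean and standard deviation will usually not be exactly 0 and 1.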

In [17]:
# Scale the test rows with the per-column statistics of the training data.
X_test_scaled = (X_test - np.mean(X_train, axis=0)) / np.std(X_train, axis=0)
X_test_scaled.head()


### Problem 3

#### Using `StandardScaler`

- Instantiate a `StandardScaler` transformer. Assign the result to `scaler`.
- Use the `.fit_transform` method on `scaler` to transform the training data. Assign the result to `X_train_scaled`.
- Use the `.transform` method on `scaler` to transform the test data. Assign the result to `X_test_scaled`.
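The steps above can be sketched on synthetic data to check that `StandardScaler` reproduces the manual NumPy computation. The arrays `toy_train` and `toy_test` are hypothetical random data, not the housing features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: two features on very different scales.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(50, 2))
toy_test = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(10, 2))

toy_scaler = StandardScaler()
sk_train = toy_scaler.fit_transform(toy_train)  # learns mean/std, then scales
sk_test = toy_scaler.transform(toy_test)        # reuses the learned mean/std

# Manual equivalent using the training statistics for both sets.
manual_train = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(sk_train, manual_train))
print(np.allclose(sk_test, manual_test))
print(toy_scaler.mean_, toy_scaler.scale_)  # statistics learned during fit
```

The fitted scaler stores the training statistics in `mean_` and `scale_`, which is why `transform` never needs to see the training data again.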

In [18]:
scaler = StandardScaler()

In [21]:
X_train_scaled = scaler.fit_transform(X_train)
X_train_scaled

array([[ 0.13350629,  0.50935748,  0.18106017, ..., -0.01082519,
        -0.80568191,  0.78093406],
       [-0.53221805, -0.67987313, -0.42262953, ..., -0.08931585,
        -1.33947268,  1.24526986],
       [ 0.1709897 , -0.36274497,  0.07312833, ..., -0.04480037,
        -0.49664515, -0.27755183],
       ...,
       [-0.49478713,  0.58863952, -0.59156984, ...,  0.01720102,
        -0.75885816,  0.60119118],
       [ 0.96717102, -1.07628333,  0.39014889, ...,  0.00482125,
         0.90338501, -1.18625198],
       [-0.68320166,  1.85715216, -0.82965604, ..., -0.0816717 ,
         0.99235014, -1.41592345]], shape=(14448, 8))

In [22]:
X_test_scaled = scaler.transform(X_test)
X_test_scaled

array([[-1.1526893 , -0.28346293, -0.50781822, ...,  0.06127763,
         0.19166399,  0.28664112],
       [-0.70640568,  0.11294728, -0.16252032, ..., -0.03551561,
        -0.23911452,  0.06196251],
       [-0.20830675,  1.85715216, -0.59546738, ..., -0.14215427,
         1.00639726, -1.42590916],
       ...,
       [-0.19155996, -0.99700129, -0.6830438 , ..., -0.06058827,
        -0.92742367,  0.8358555 ],
       [-0.11911302, -1.47269353,  0.02607207, ...,  0.03461374,
         1.01576201, -0.84673764],
       [-0.43304974, -0.91771925, -0.84872893, ..., -0.0407528 ,
        -0.70266966,  0.67109119]], shape=(6192, 8))

### Problem 4

#### Building a `Pipeline`

Now, construct a pipeline with named steps `scaler` and `ridge` that takes in your data, applies the `StandardScaler` and fits a `Ridge` model with default settings. Next, use the `fit` function to train this pipeline on `X_train` and `y_train`. Assign your pipeline to `scaled_pipe`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_train`. Assign your result to `train_preds`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_test`. Assign your result to `test_preds`.

Use the `mean_squared_error` function to compute the MSE between `y_train` and `train_preds`. Assign your result to `train_mse`.

Use the `mean_squared_error` function to compute the MSE between `y_test` and `test_preds`. Assign your result to `test_mse`.
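As a hedged sketch of what the pipeline does internally, the example below compares a two-step `Pipeline` with the same steps carried out by hand. The data (`toy_X`, `toy_y`) are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data with features on different scales.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])
toy_y = toy_X @ np.array([2.0, 0.3, 0.01]) + rng.normal(scale=0.1, size=100)

# Pipeline: scale, then fit Ridge with default settings.
toy_pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge())])
toy_pipe.fit(toy_X, toy_y)
pipe_preds = toy_pipe.predict(toy_X)

# The same two steps done manually.
manual_scaler = StandardScaler().fit(toy_X)
manual_ridge = Ridge().fit(manual_scaler.transform(toy_X), toy_y)
manual_preds = manual_ridge.predict(manual_scaler.transform(toy_X))

print(np.allclose(pipe_preds, manual_preds))
print(list(toy_pipe.named_steps))            # step names in order
print(mean_squared_error(toy_y, pipe_preds))  # (y_true, y_pred)
```

The advantage of the pipeline is that a single `fit` or `predict` call applies every step in order, so the scaler is always fit on whatever data is passed to `fit`.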

In [23]:
scaled_pipe = Pipeline([('scaler', StandardScaler()),
                        ('ridge', Ridge())])

In [24]:
scaled_pipe.fit(X_train,y_train)

In [25]:
train_preds = scaled_pipe.predict(X_train)
test_preds = scaled_pipe.predict(X_test)

In [26]:
train_mse = mean_squared_error(y_train, train_preds)
train_mse

0.5233577493232345

In [28]:
test_mse = mean_squared_error(y_test, test_preds)
test_mse

0.5305437338152265

### Codio Activity 9.6: Using `GridSearchCV`

This activity focuses on using `GridSearchCV` to search over different hyperparameter values within the `Ridge` estimator.  You will first use the grid search to search parameters for an estimator.  Then, you will incorporate a pipeline into the grid search and identify the step in the pipeline you are searching along with the hyperparameters. 

In [29]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.datasets import fetch_california_housing

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### The Data

We again use the California housing dataset from scikit-learn.  You are building regression models with `MedHouseVal` as the target.  The data is loaded and described below.  

In [30]:
cali = fetch_california_housing(as_frame=True)

In [31]:
cali.frame.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [32]:
X = cali.frame.drop('MedHouseVal', axis = 1)
y = cali.frame['MedHouseVal']

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### Problem 1

#### Dictionary for grid search
As discussed in the videos, to search over hyperparameters you have to create a dictionary with a key whose name is exactly that of the hyperparameter to search over.  With the `Ridge` estimator, this will be `alpha`.  Create a dictionary with `alpha` as the key and values `[0.1, 1.0, 10.0]` and assign to the variable `params_dict` below.  
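One way to confirm which names are valid dictionary keys is to inspect the estimator's `get_params()` output, as in this small sketch. The variable name `ridge_param_names` is only illustrative.

```python
from sklearn.linear_model import Ridge

# get_params() returns every constructor argument of the estimator;
# these names are the ones a parameter grid may use as keys.
ridge_param_names = sorted(Ridge().get_params().keys())

print(ridge_param_names)
print('alpha' in ridge_param_names)
```

Any key in the grid dictionary that does not appear in this list would not correspond to a `Ridge` hyperparameter.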

In [34]:
params_dict = {'alpha':[0.1,1.0,10.0]}
params_dict

{'alpha': [0.1, 1.0, 10.0]}

In [35]:
print(params_dict.values())
params_dict.keys()

dict_values([[0.1, 1.0, 10.0]])


dict_keys(['alpha'])

### Problem 2

#### Creating the grid search object

Instantiate a `Ridge()` regressor and assign to `ridge`.

Next, use `GridSearchCV` to instantiate a grid search object using `ridge` as the estimator. Set the argument `param_grid` equal to `params_dict`. Assign your grid to `grid` below. 
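The activity relies on the default settings of `GridSearchCV`, which use 5-fold cross-validation and the estimator's own `score` method (R² for `Ridge`). As an optional variation that is not required here, the sketch below spells out `cv` and `scoring` explicitly; the name `toy_grid` is hypothetical.

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Optional variation: make the cross-validation settings explicit.
toy_grid = GridSearchCV(
    estimator=Ridge(),
    param_grid={'alpha': [0.1, 1.0, 10.0]},
    scoring='neg_mean_squared_error',  # higher is better, so MSE is negated
    cv=5,                              # number of cross-validation folds
)

grid_settings = toy_grid.get_params()
print(grid_settings['scoring'], grid_settings['cv'])
```

Choosing a scoring string changes how candidates are ranked, so the selected `alpha` may differ from the one found with the default scorer.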

In [36]:
ridge = Ridge()
grid = GridSearchCV(ridge, param_grid = params_dict)

In [38]:
print(grid.get_params()['param_grid'])
grid

{'alpha': [0.1, 1.0, 10.0]}


### Problem 3

#### Performing the grid search

- Use the `fit` function on `grid` to train your model using `X_train`  and `y_train`.
- Use the `predict` function on `grid` to compute the predictions on `X_train`. Assign your result to `train_preds`.
- Use the `predict` function on `grid` to compute the predictions on `X_test`. Assign your result to `test_preds`.
- Use the `mean_squared_error` function to compute the MSE between `y_train` and `train_preds`. Assign your result to `train_mse`.
- Use the `mean_squared_error` function to compute the MSE between `y_test` and `test_preds`. Assign your result to `test_mse`.
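With the default `refit=True`, a fitted grid search retrains the best candidate on all of the data passed to `fit`, and `predict` delegates to that model. The following sketch checks this on synthetic data; `toy_X`, `toy_y`, and `toy_grid` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic regression data.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 2))
toy_y = toy_X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=60)

toy_grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]})
toy_grid.fit(toy_X, toy_y)

# predict on the grid uses the refit best estimator.
grid_preds = toy_grid.predict(toy_X)
best_preds = toy_grid.best_estimator_.predict(toy_X)

print(np.allclose(grid_preds, best_preds))
```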


In [39]:
grid.fit(X_train,y_train)

In [40]:
train_preds = grid.predict(X_train)
test_preds = grid.predict(X_test)

In [41]:
train_mse = mean_squared_error(y_train,train_preds)
train_mse

0.5233576299656518

In [43]:
test_mse = mean_squared_error(y_test,test_preds)
test_mse

0.530561502747035

### Problem 4

#### Identify optimal alpha value

Use your fit grid to determine the optimal alpha value.  Assign this as a float to `best_alpha` below.  (**Hint**: Use the `best_params_` attribute of the fit grid.)
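Since `best_params_` is a dictionary, the float has to be pulled out by key. The sketch below shows this on synthetic data and also looks at `cv_results_`, which records the score of every candidate; all variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic regression data.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(60, 2))
toy_y = toy_X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=60)

toy_grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]})
toy_grid.fit(toy_X, toy_y)

# best_params_ maps each searched hyperparameter name to its best value.
best_alpha_value = toy_grid.best_params_['alpha']

# cv_results_ holds one entry per candidate, in the order they were tried.
tried_alphas = [params['alpha'] for params in toy_grid.cv_results_['params']]
mean_scores = toy_grid.cv_results_['mean_test_score']

print(best_alpha_value)
for alpha, score in zip(tried_alphas, mean_scores):
    print(alpha, score)
```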

In [45]:
best_alpha = grid.best_params_['alpha']
best_alpha

0.1

In [46]:
print(f'Best alpha: {best_alpha}')

Best alpha: 0.1


### Problem 5

#### Pipeline with Grid Search

To search over the hyperparameters of a step inside a `Pipeline` with `GridSearchCV`, prefix each key in your parameter dictionary with the name you gave that step.  For example, to search over a ridge estimator's alpha value, we create a pipeline with steps named `scale` and `ridge` that applies the `StandardScaler` followed by the `Ridge` regressor.  To search over the ridge object's `alpha` parameter, we write `ridge__alpha`. (Note there are two underscores here.)
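A quick way to see the available nested names is to list the pipeline's parameters, as sketched below. The step names mirror those of the pipeline provided in this problem; `toy_pipe` and `nested_names` are hypothetical variable names.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])

# Nested parameters follow the pattern <step name>__<parameter name>.
nested_names = [name for name in toy_pipe.get_params() if '__' in name]
print(nested_names)

# The same naming scheme works with set_params.
toy_pipe.set_params(ridge__alpha=10.0)
print(toy_pipe.named_steps['ridge'].alpha)
```

Any name in `nested_names` can serve as a key in the grid search dictionary, so the scaler's settings could be searched in the same way as `ridge__alpha`.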

Below, you are provided a pipeline and dictionary ready to be used in a new grid search.  You are to instantiate, fit, and score a grid search on the train and test data using mean squared error. Create your grid object as `grid_2` below and assign the training error and test error to `model_2_train_mse` and `model_2_test_mse`.  Determine the optimal value for `alpha` and assign it as a dictionary to `model_2_best_alpha` below.

In [47]:
pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])

In [48]:
param_dict = {'ridge__alpha': [0.001, 0.1, 1.0, 10.0, 100.0, 1000.0]}

In [49]:
### GRADED

grid_2 = ''
model_2_train_mse = ''
model_2_test_mse = ''
model_2_best_alpha = ''

### BEGIN SOLUTION
grid_2 = GridSearchCV(pipe, param_grid=param_dict)
grid_2.fit(X_train, y_train)
train_preds = grid_2.predict(X_train)
test_preds = grid_2.predict(X_test)
model_2_train_mse = mean_squared_error(y_train, train_preds)
model_2_test_mse = mean_squared_error(y_test, test_preds)
model_2_best_alpha = grid_2.best_params_
### END SOLUTION

# Answer check
print(f'Test MSE: {model_2_test_mse}')
print(f'Best Alpha: {list(model_2_best_alpha.values())[0]}')

Test MSE: 0.5305677582888798
Best Alpha: 0.001
