### Required Assignment 9.3: Using StandardScaler

**Estimated Time: 45 Minutes**

**Total Points: 40**


This activity focuses on using the `StandardScaler` to scale the data by converting it to $z$-scores.  To begin, you will scale data using just NumPy functions.  Then, you will use the scikit-learn transformer and incorporate it into a `Pipeline` with a `Ridge` regression model.  

#### Index

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)


In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

### The Dataset

For this example, we will use a housing dataset, a version of which is also available through the scikit-learn `datasets` module.  The dataset was chosen because it has multiple features on very different scales.  It is loaded from a local CSV file and explored below. Your task is to predict `median_house_value` using the other numeric features after scaling and applying regularization with the `Ridge` estimator. 

In [2]:
cali = pd.read_csv('data/housing.csv')

In [3]:
cali.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
print(cali.describe())


          longitude      latitude  housing_median_age   total_rooms  \
count  20640.000000  20640.000000        20640.000000  20640.000000   
mean    -119.569704     35.631861           28.639486   2635.763081   
std        2.003532      2.135952           12.585558   2181.615252   
min     -124.350000     32.540000            1.000000      2.000000   
25%     -121.800000     33.930000           18.000000   1447.750000   
50%     -118.490000     34.260000           29.000000   2127.000000   
75%     -118.010000     37.710000           37.000000   3148.000000   
max     -114.310000     41.950000           52.000000  39320.000000   

       total_bedrooms    population    households  median_income  \
count    20433.000000  20640.000000  20640.000000   20640.000000   
mean       537.870553   1425.476744    499.539680       3.870671   
std        421.385070   1132.462122    382.329753       1.899822   
min          1.000000      3.000000      1.000000       0.499900   
25%        296.00000

In [5]:
print(cali.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None


In [6]:
X = cali.drop('median_house_value', axis = 1)
y = cali['median_house_value']

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

### Problem 1

#### Scaling the Train data

**10 Points**

Recall that **standard scaling** consists of subtracting the feature mean from each datapoint and subsequently dividing by the standard deviation of the feature.  Below, you are to scale `X_train` by subtracting the mean and dividing by the standard deviation.  Be sure to use the `numpy` mean and standard deviation functions with default settings.  

Assign your results to `X_train_scaled` below.  
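
As a small illustration of the formula described above, the sketch below standardizes a made-up two-column array with NumPy. The array `toy` and its values are hypothetical and are not taken from the housing data.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

# Column-wise mean and standard deviation (NumPy's default is ddof=0)
mu = toy.mean(axis=0)
sigma = toy.std(axis=0)

# z-score: subtract the mean, then divide by the standard deviation
toy_scaled = (toy - mu) / sigma

print(toy_scaled.mean(axis=0))  # approximately 0 for each column
print(toy_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns are on the same footing even though the raw values differ by two orders of magnitude.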

In [8]:
### GRADED

X_train_scaled = ''

### BEGIN SOLUTION
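# Keep only the numeric columns (this drops the text column ocean_proximity)
X_train_numeric = X_train.select_dtypes(include='number')
# pandas .mean() and .std() skip missing values; .std() defaults to ddof=1
X_train_scaled = (X_train_numeric - X_train_numeric.mean()) / X_train_numeric.std()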
### END SOLUTION

# Answer check
print(X_train_scaled.mean())
print('-----------------')
print(X_train_scaled.std())

longitude            -3.241163e-15
latitude             -1.466035e-15
housing_median_age   -5.434314e-17
total_rooms           1.524559e-17
total_bedrooms       -1.204893e-16
population            6.885104e-17
households            3.049117e-17
median_income         1.389316e-16
dtype: float64
-----------------
longitude             1.0
latitude              1.0
housing_median_age    1.0
total_rooms           1.0
total_bedrooms        1.0
population            1.0
households            1.0
median_income         1.0
dtype: float64
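
The answer check above prints a standard deviation of exactly 1.0 because both the solution and the check rely on the pandas `.std()` method, which defaults to the sample standard deviation (`ddof=1`). NumPy's `np.std` and scikit-learn's `StandardScaler` use the population form (`ddof=0`) instead. The sketch below shows the difference on a made-up series; the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the population standard deviation is exactly 2
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pandas_std = s.std()               # pandas default: ddof=1 (sample)
numpy_std = np.std(s.to_numpy())   # NumPy default: ddof=0 (population)

print(pandas_std, numpy_std)

# Passing ddof explicitly makes the two agree
print(np.isclose(s.std(ddof=0), numpy_std))
```

With roughly 14,000 training rows the two versions differ only slightly, but the difference is enough to explain small mismatches between hand-scaled values and `StandardScaler` output.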


### Problem 2

#### Scale the test data

**10 Points**

To scale the test data, use the mean and standard deviation of the **training** data.  In practice, you would not have seen the test data, so you would not be able to compute its mean and standard deviation.  Instead, you assume it is distributed similarly to your training data and scale it with the statistics you already have.  

Assign the result to `X_test_scaled` below.
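
The idea of reusing training statistics can be sketched on synthetic data. Everything below (the random draws, the names `train` and `test`) is a hypothetical stand-in for the real split.

```python
import numpy as np

# Hypothetical train/test samples drawn from the same distribution
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 1))
test = rng.normal(loc=50.0, scale=10.0, size=(80, 1))

# Statistics come from the training split only
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train_z = (train - train_mean) / train_std
test_z = (test - train_mean) / train_std  # reuse the training statistics

print(train_z.mean(), train_z.std())  # approximately 0 and 1 by construction
print(test_z.mean(), test_z.std())    # near 0 and 1, but not exactly
```

The scaled test data is not expected to have a mean of exactly 0 or a standard deviation of exactly 1, since it was scaled with statistics from a different sample.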

In [9]:
### GRADED

X_test_scaled = ''

### BEGIN SOLUTION
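# Keep only the numeric columns in both splits
X_train_numeric = X_train.select_dtypes(include='number')
X_test_numeric = X_test.select_dtypes(include='number')

# Scale the test set with the training mean and standard deviation
X_test_scaled = (X_test_numeric - X_train_numeric.mean()) / X_train_numeric.std()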
### END SOLUTION

# Answer check
print(X_test_scaled.mean())
print('-----------------')
print(X_test_scaled.std())

longitude             0.023961
latitude             -0.029354
housing_median_age    0.016943
total_rooms          -0.014141
total_bedrooms       -0.015922
population           -0.007164
households           -0.013351
median_income        -0.010885
dtype: float64
-----------------
longitude             1.000881
latitude              1.000107
housing_median_age    0.992540
total_rooms           1.028338
total_bedrooms        1.012935
population            0.977183
households            1.000965
median_income         0.991110
dtype: float64


### Problem 3

#### Using `StandardScaler`

**10 Points**

- Instantiate a `StandardScaler` transformer. Assign the result to `scaler`.
- Use the `.fit_transform` method on `scaler` to transform the training data. Assign the result to `X_train_scaled`.
- Use the `.transform` method on `scaler` to transform the test data. Assign the result to `X_test_scaled`.
- Use `X_train_numeric` and `X_test_numeric` to include only numeric values. 
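
The steps above can be sketched on a tiny made-up array to show that `StandardScaler` matches the manual calculation when NumPy's default `ddof=0` is used. The arrays `toy_train` and `toy_test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data
toy_train = np.array([[1.0, 10.0],
                      [2.0, 20.0],
                      [3.0, 60.0]])
toy_test = np.array([[4.0, 30.0]])

# Fit on the training data only, then apply the same transform to the test data
toy_scaler = StandardScaler()
train_z = toy_scaler.fit_transform(toy_train)
test_z = toy_scaler.transform(toy_test)

# Manual equivalent using the training mean and population standard deviation
manual_train = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)
manual_test = (toy_test - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(train_z, manual_train))
print(np.allclose(test_z, manual_test))
```

The fitted statistics are stored in `toy_scaler.mean_` and `toy_scaler.scale_`, the same attributes printed in the answer check.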

In [10]:
### GRADED

scaler = ''
X_train_scaled = ''
X_test_scaled = ''
# Select only numeric columns
X_train_numeric = X_train.select_dtypes(include='number')
X_test_numeric = X_test.select_dtypes(include='number')
### BEGIN SOLUTION

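scaler = StandardScaler()
# Learn the mean and scale from the training data and transform it
X_train_scaled = scaler.fit_transform(X_train_numeric)
# Reuse the training statistics on the test data (no refitting)
X_test_scaled = scaler.transform(X_test_numeric)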
### END SOLUTION

# Answer check
print(scaler.mean_)
print('----------')
print(scaler.scale_)

[-119.5841023    35.6506693    28.57537375 2644.93923034  539.82828073
 1427.92732558  501.07059801    3.87689155]
----------
[2.00286090e+00 2.13566827e+00 1.26131971e+01 2.16297957e+03
 4.19772219e+02 1.14018573e+03 3.82207992e+02 1.90484248e+00]


### Problem 4

#### Building a `Pipeline`

**15 Points**

Now, construct a pipeline with named steps `imputer`, `scaler`, and `ridge` that takes in your data, fills missing values with a `SimpleImputer` using the mean strategy, applies the `StandardScaler`, and fits a `Ridge` model with default settings. The imputation step is needed because `total_bedrooms` contains missing values, which `Ridge` cannot handle. Next, use the `fit` function to train this pipeline on `X_train_numeric` and `y_train`. Assign your pipeline to `scaled_pipe`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_train_numeric`. Assign your result to `train_preds`.

Use the `predict` function on `scaled_pipe` to compute the predictions on `X_test_numeric`. Assign your result to `test_preds`.

Use the `mean_squared_error` function to compute the MSE between `y_train` and `train_preds`. Assign your result to `train_mse`.

Use the `mean_squared_error` function to compute the MSE between `y_test` and `test_preds`. Assign your result to `test_mse`.

Use `X_train_numeric` and `X_test_numeric` to include only numeric values. 
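
A minimal sketch of such a pipeline on synthetic data is shown below. The data, the coefficients, and the names `X_toy`, `y_toy`, and `toy_pipe` are all hypothetical; only the estimator classes come from the imports at the top of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales, with a linear target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 1000.0])
y_toy = X_toy @ np.array([2.0, 0.05, 0.001]) + rng.normal(scale=0.1, size=100)

# Remove a few values so the imputer has something to fill
X_toy[::10, 1] = np.nan

# Each step is fit in order; the output of one step feeds the next
toy_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('ridge', Ridge()),
])
toy_pipe.fit(X_toy, y_toy)

toy_preds = toy_pipe.predict(X_toy)
toy_mse = mean_squared_error(y_toy, toy_preds)

# Fitted steps can be inspected by name
print(toy_pipe.named_steps['scaler'].mean_)
print(toy_mse)
```

Because the imputer and scaler live inside the pipeline, calling `predict` on new data applies the statistics learned during `fit` without any manual bookkeeping.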

In [14]:
### GRADED

scaled_pipe = ''
train_preds = ''
test_preds = ''
train_mse = ''
test_mse = ''

# Select only numeric columns
X_train_numeric = X_train.select_dtypes(include='number')
X_test_numeric = X_test.select_dtypes(include='number')

### BEGIN SOLUTION

scaled_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('ridge', Ridge())
]).fit(X_train_numeric, y_train)

# Predict and evaluate
train_preds = scaled_pipe.predict(X_train_numeric)
test_preds = scaled_pipe.predict(X_test_numeric)
train_mse = mean_squared_error(y_train, train_preds)
test_mse = mean_squared_error(y_test, test_preds)
### END SOLUTION

# Answer check
print(f'Train MSE: {train_mse}')
print(f'Test MSE: {test_mse}')


Train MSE: 4865156880.463843
Test MSE: 4851689342.065711
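
MSE is expressed in squared units of the target, which makes values like those above hard to read. One common way to interpret them, sketched here with the two printed values copied in by hand, is to take the square root so the error is in the same units as `median_house_value`. The variable names are hypothetical.

```python
import numpy as np

# Values copied from the printed output above
reported_train_mse = 4865156880.463843
reported_test_mse = 4851689342.065711

# Root mean squared error, in the units of the target
train_rmse = np.sqrt(reported_train_mse)
test_rmse = np.sqrt(reported_test_mse)

print(f'Train RMSE: {train_rmse:.2f}')
print(f'Test RMSE: {test_rmse:.2f}')
```

The train and test errors are close to each other, which suggests the model is not overfitting badly on this split.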
