### Module 8-1 Learning Notebook: Scaling Data

The importance of data scaling:

It is common to have data where the scale of values differs from variable to variable. For example, one variable may be measured in feet and another in meters, or one in pounds and another in percent body fat.

Some machine learning algorithms perform much better when all of the variables are scaled to the same or a similar range. Two common approaches are:
- <B>"normalization"</B>: scale every variable to the interval from 0 to 1
    
<img src="images/norm.png" alt="Normalization" style="width: 700px;"/>    
    
- <B>"standardization"</B>: scale so that the data has a mean of 0 and a standard deviation of 1

<img src="images/stand.png" alt="Standardization" style="width: 700px;"/><BR>    
Scaling often improves the performance of parametric algorithms that use a weighted sum of the inputs, such as linear models (especially when they are regularized or trained with gradient descent) and neural networks, as well as models that use distance measures, such as Support Vector Machines (SVMs) and K-Nearest Neighbors (KNN). We will learn these algorithms soon.
    
Some algorithms, like decision trees, are not sensitive to the scale of the features, but scaling the data does not hurt them.

As such, it is a good practice to scale input data. This lesson is about scaling data.
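
As a rough sketch of the two scalings described above, normalization computes `(x - min) / (max - min)` and standardization computes `(x - mean) / std`. The example below uses only the Python standard library. The `weights` list and the helper names `normalize` and `standardize` are made up for illustration and are not part of this notebook's dataset.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    # Note: a constant column (hi == lo) would divide by zero here
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical weights in lbs, chosen only for illustration
weights = [150.0, 160.0, 170.0, 180.0, 190.0]

weights_norm = normalize(weights)
weights_stand = standardize(weights)

print(weights_norm)   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(weights_stand)  # centered on 0 with a standard deviation of 1
```

The scikit-learn scalers used later in this notebook apply the same arithmetic to every column at once.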
    
A few guidelines:
- You don't scale the target variable (the 'y' or dependent variable); leave it in its original scale
- You fit the scaler to only the training data (X_train), not the whole dataset, so that no information from the test set leaks into training
- You must scale the test data (X_test) before you use the trained model to predict values
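
The guidelines above can be sketched without any library: the scaling parameters are learned from the training values only, and the same parameters are then applied to the test values. The pulse readings and the helper `scale` below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical pulse readings, split by hand into train and test
pulse_train = [50.0, 52.0, 58.0, 62.0, 46.0]
pulse_test = [44.0, 56.0, 74.0]

# "Fit": learn the scaling parameters from the training data only
train_min, train_max = min(pulse_train), max(pulse_train)

def scale(values):
    """Apply the training-set min and max to any data, train or test."""
    return [(v - train_min) / (train_max - train_min) for v in values]

pulse_train_scaled = scale(pulse_train)
pulse_test_scaled = scale(pulse_test)

print(pulse_train_scaled)  # every value falls within [0, 1]
print(pulse_test_scaled)   # [-0.125, 0.625, 1.75]
```

Test values can land outside the 0 to 1 interval when they fall outside the range seen in training. That is expected and is not a reason to refit the scaler on the test data.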

In this exercise, we will:
- Load a very small diabetes dataset
- Split the data into train and test sets
- Demonstrate scaling X_train data using a normalized and standardized scaler
- Compare multiple linear regressions on scaled data
- Discuss why we don't see model improvement in this notebook

In [2]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import boto3
import pandas as pd
import numpy as np

### 1. Load the diabetes dataset

In [3]:
# Get the data from the S3 bucket: machinelearning-read-only
# Create session and S3 client
sess = boto3.session.Session()
s3 = sess.client('s3')
# Set variables 
source_bucket = 'machinelearning-read-only'
source_key = 'data/diabetes.csv'
# Load the dataframe
response = s3.get_object(Bucket=source_bucket, Key=source_key)
# The 'Body' is of type streaming body. We can put this right into a dataframe
df = pd.read_csv(response.get("Body")) 
print('The size of the complete dataset:',df.shape)
df.head(3)

The size of the complete dataset: (20, 3)


| | Weight | Waist | Pulse |
|---|---|---|---|
| 0 | 191.0 | 36.0 | 50.0 |
| 1 | 189.0 | 37.0 | 52.0 |
| 2 | 193.0 | 38.0 | 58.0 |
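
Readers without access to the S3 bucket may be able to follow along with a local copy. The rows shown above coincide with the first rows of the Linnerud physiological measurements that ship with scikit-learn. Assuming the S3 file holds the same 20 rows (an assumption, not something this notebook states), a sketch of a local load could look like this. The name `df_local` is introduced here only for illustration.

```python
from sklearn.datasets import load_linnerud

# Assumption: the S3 file matches scikit-learn's Linnerud physiological
# measurements (Weight, Waist, Pulse for 20 subjects).
linnerud = load_linnerud(as_frame=True)
df_local = linnerud.target  # DataFrame with columns Weight, Waist, Pulse

print('The size of the complete dataset:', df_local.shape)
print(df_local.head(3))
```

If the shapes or values differ from the output above, the assumption does not hold and the S3 file should be treated as the authoritative source.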


In [4]:
# Check out the descriptive stats of the raw data
df.describe()

| | Weight | Waist | Pulse |
|---|---|---|---|
| count | 20.0 | 20.0 | 20.0 |
| mean | 178.6 | 35.4 | 56.1 |
| std | 24.690505 | 3.201973 | 7.210373 |
| min | 138.0 | 31.0 | 46.0 |
| 25% | 160.75 | 33.0 | 51.5 |
| 50% | 176.0 | 35.0 | 55.0 |
| 75% | 191.5 | 37.0 | 60.5 |
| max | 247.0 | 46.0 | 74.0 |


### 2. Split the data into train and test sets

In [5]:
# Create features and target dataframes
X = df[['Weight','Pulse']]
y = df['Waist'] 
# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25,random_state = 44)
# We are going to train the scaler using the X_train data
X_train.shape

(15, 2)

### 3. Demonstrate scaling X_train data using a normalized and standardized scaler

#### Normalization:

In [7]:
# Normalization scaler: scale all columns to 0 - 1 intervals
# Import the MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Create the object
norm_scaler = MinMaxScaler()
# Compute the minimum and maximum to be used for later scaling.
norm_scaler.fit(X_train)
# Do the scaling, this returns a numpy array
norm_scaled_array = norm_scaler.transform(X_train) 
norm_scaled_array

array([[0.25688073, 0.21428571],
       [0.16513761, 0.28571429],
       [0.46788991, 0.21428571],
       [0.66972477, 0.35714286],
       [0.14678899, 0.64285714],
       [0.26605505, 0.5       ],
       [0.34862385, 1.        ],
       [0.28440367, 0.14285714],
       [0.34862385, 0.28571429],
       [0.46788991, 0.        ],
       [0.48623853, 0.14285714],
       [0.        , 0.78571429],
       [1.        , 0.14285714],
       [0.17431193, 0.21428571],
       [0.22018349, 0.57142857]])
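
To see what the fitted scaler actually stores, the sketch below fits a `MinMaxScaler` on a tiny made-up array and reproduces the transform by hand from the `data_min_` and `data_max_` attributes. The array `X_demo` and the names `demo_scaler`, `demo_scaled`, and `by_hand` are hypothetical and separate from the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical training rows (Weight, Pulse) used only for illustration
X_demo = np.array([[150.0, 50.0],
                   [200.0, 70.0],
                   [175.0, 60.0]])

demo_scaler = MinMaxScaler()
# fit_transform() is fit() followed by transform() on the same data
demo_scaled = demo_scaler.fit_transform(X_demo)

# The fitted scaler stores the per-column minimum and maximum it learned
print(demo_scaler.data_min_)
print(demo_scaler.data_max_)

# Reproduce the transform by hand from the stored parameters
by_hand = (X_demo - demo_scaler.data_min_) / (demo_scaler.data_max_ - demo_scaler.data_min_)
print(np.allclose(demo_scaled, by_hand))

# inverse_transform() maps scaled values back to the original units
print(demo_scaler.inverse_transform(demo_scaled))
```

`fit_transform` is a convenient shortcut for training data only. Test data should still go through `transform` with the already fitted scaler.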

In [9]:
# Let's create a dataframe out of the array
X_train_norm = pd.DataFrame(data = norm_scaled_array, columns = X_train.columns)
# And check out the descriptive stats
X_train_norm.describe()

| | Weight | Pulse |
|---|---|---|
| count | 15.0 | 15.0 |
| mean | 0.353517 | 0.366667 |
| std | 0.244454 | 0.277781 |
| min | 0.0 | 0.0 |
| 25% | 0.197248 | 0.178571 |
| 50% | 0.284404 | 0.285714 |
| 75% | 0.46789 | 0.535714 |
| max | 1.0 | 1.0 |


#### Standardization:

In [10]:
# Standardized scaler: scale all columns to std dev = 1 & centered at 0
# Import the StandardScaler
from sklearn.preprocessing import StandardScaler
# Create the object
stand_scaler = StandardScaler()
# Compute the mean and standard deviation to be used for later scaling.
stand_scaler.fit(X_train)
# Do the scaling, this returns a numpy array
stand_scaled_array = stand_scaler.transform(X_train) 
# Create a dataframe out of the array
X_train_stand = pd.DataFrame(data = stand_scaled_array, columns = X_train.columns)
X_train_stand.describe()

| | Weight | Pulse |
|---|---|---|
| count | 15.0 | 15.0 |
| mean | 7.401487e-17 | 1.258253e-16 |
| std | 1.035098 | 1.035098 |
| min | -1.496907 | -1.366314 |
| 25% | -0.6616951 | -0.7009013 |
| 50% | -0.2926479 | -0.3016537 |
| 75% | 0.4842935 | 0.6299239 |
| max | 2.737423 | 2.359997 |
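
The `std` row above shows 1.035098 rather than exactly 1. This is not a scaling error: `StandardScaler` divides by the population standard deviation (dividing by n), while `describe()` reports the sample standard deviation (dividing by n - 1). With n = 15 training rows the two differ by a factor of sqrt(15/14). The sketch below reproduces that ratio with the standard library, using a made-up column of 15 values.

```python
import math
from statistics import mean, pstdev, stdev

# Hypothetical column of 15 training values, for illustration only
values = [float(v) for v in range(1, 16)]

# Standardize the way StandardScaler does: divide by the population std dev
mu, sigma = mean(values), pstdev(values)
scaled = [(v - mu) / sigma for v in values]

print(pstdev(scaled))      # population std dev (divides by n): 1.0 up to rounding
print(stdev(scaled))       # sample std dev (divides by n - 1), as describe() reports
print(math.sqrt(15 / 14))  # the expected ratio for n = 15, about 1.035098
```

The near-zero means in the table (on the order of 1e-16) are floating-point rounding, so the data is centered on 0 as intended.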


### 4. Compare multiple linear regressions on scaled data

In [11]:
# Keep track of model performance
model_performance = []

In [12]:
# First, use just the unscaled data
#
# Create linear regression object
ori_model = linear_model.LinearRegression()
#
# Train the model using the training sets
ori_model.fit(X_train, y_train)
#
print('Unscaled model performance:')
print('The linear model has equation of:')
x1_coef = ori_model.coef_.item(0)
x2_coef = ori_model.coef_.item(1)
intercept = ori_model.intercept_.item(0)
print("y = {} * x1 + {} * x2 + {}".format(x1_coef,x2_coef,intercept))
# Make predictions of waist size using the weight and pulse values from the test dataset
y_pred = ori_model.predict(X_test)
# Now, use the waist size prediction and the true waist size to see how well our model does
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
print("Coefficient of determination: %.2f" % r2)
print("MSE:",mse)
model_performance.append(['unscaled',r2,mse])

Unscaled model performance:
The linear model has equation of:
y = 0.11324628938445999 * x1 + -0.024528558257205475 * x2 + 16.588395258602095
Coefficient of determination: 0.73
MSE: 0.75


In [13]:
# Second, normalized data
#
# Create linear regression object
norm_model = linear_model.LinearRegression()
#
# Train the model using the normalized features and the original targets
norm_model.fit(X_train_norm, y_train)
#
print('Normalized model performance:')
print('The linear model has equation of:')
x1_coef = norm_model.coef_.item(0)
x2_coef = norm_model.coef_.item(1)
intercept = norm_model.intercept_.item(0)
print("y = {} * x1 + {} * x2 + {}".format(x1_coef,x2_coef,intercept))
# *******
# We must NORMALIZE (or "transform") test data to use it on this model
X_test_norm = norm_scaler.transform(X_test) # We are using the same scaler we created earlier
y_pred = norm_model.predict(X_test_norm) # Now we can predict using the scaled data
# *******
# Now, use the waist size prediction and the true waist size to see how well our model does
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
print("Coefficient of determination: %.2f" % r2)
print("MSE:",mse)
model_performance.append(['normalized',r2,mse])

Normalized model performance:
The linear model has equation of:
y = 12.343845542906138 * x1 + -0.686799631201753 * x2 + 31.08806951382612
Coefficient of determination: 0.73
MSE: 0.75


In [14]:
# Third, standardized data
#
# Create linear regression object
stand_model = linear_model.LinearRegression()
#
# Train the model using the standardized features and the original targets
stand_model.fit(X_train_stand, y_train)
#
print('Standardized model performance:')
print('The linear model has equation of:')
x1_coef = stand_model.coef_.item(0)
x2_coef = stand_model.coef_.item(1)
intercept = stand_model.intercept_.item(0)
print("y = {} * x1 + {} * x2 + {}".format(x1_coef,x2_coef,intercept))
# *******
# We must STANDARDIZE (or "transform") test data to use it on this model
X_test_stand = stand_scaler.transform(X_test) # We are using the same scaler we created earlier
y_pred = stand_model.predict(X_test_stand) # Now we can predict using the scaled data
# *******
# Now, use the waist size prediction and the true waist size to see how well our model does
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
print("Coefficient of determination: %.2f" % r2)
print("MSE:",mse)
model_performance.append(['standardized',r2,mse])

Standardized model performance:
The linear model has equation of:
y = 2.915182296690269 * x1 + -0.18431089389555544 * x2 + 35.2
Coefficient of determination: 0.73
MSE: 0.75


In [15]:
# Show comparisons: Why don't we see improvement?
model_performance

[['unscaled', 0.73, 0.75],
 ['normalized', 0.73, 0.75],
 ['standardized', 0.73, 0.75]]
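
Why are the three results identical? Ordinary least squares is unaffected by rescaling the features: multiplying a column by a constant or shifting it only rescales the matching coefficient and adjusts the intercept, so the predictions stay the same. The printed equations show this. The normalized coefficients divided by the unscaled ones give 109 for x1 and 28 for x2, which are the Weight range (247 - 138) and Pulse range (74 - 46) from `describe()`. The sketch below illustrates the same effect on synthetic data, and adds a Ridge regression as one example of a model where scaling does change the fit. All data and names in it are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler

# Synthetic data for illustration only: two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(180, 25, 50),  # weight-like feature
                          rng.normal(0, 1, 50)])    # small-scale feature
y_demo = 0.1 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1, 50)

demo_scaler = MinMaxScaler().fit(X_demo)
X_demo_norm = demo_scaler.transform(X_demo)

# Ordinary least squares: same predictions with or without scaling
ols_raw = LinearRegression().fit(X_demo, y_demo)
ols_norm = LinearRegression().fit(X_demo_norm, y_demo)
ols_same = np.allclose(ols_raw.predict(X_demo), ols_norm.predict(X_demo_norm))
print('OLS predictions identical:', ols_same)

# The coefficients differ only by each column's range (max - min)
coef_match = np.allclose(ols_raw.coef_, ols_norm.coef_ / demo_scaler.data_range_)
print('Coefficients related by the range:', coef_match)

# Ridge regression penalizes coefficient size, so scaling changes the fit
ridge_raw = Ridge(alpha=10.0).fit(X_demo, y_demo)
ridge_norm = Ridge(alpha=10.0).fit(X_demo_norm, y_demo)
ridge_same = np.allclose(ridge_raw.predict(X_demo), ridge_norm.predict(X_demo_norm))
print('Ridge predictions identical:', ridge_same)
```

Scaling is therefore expected to matter for regularized, gradient-based, and distance-based models, but not for the plain linear regression used in this notebook.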

### 5. Make sure to transform before making predictions with the model

In [16]:
# Predict my waist size from [Weight, Pulse]
myData = [[210,54]]
print('Original model:', ori_model.predict(myData))
print('Normalized model:', norm_model.predict(norm_scaler.transform(myData)))
print('Standardized model:', stand_model.predict(stand_scaler.transform(myData)))      

Original model: [39.04557388]
Normalized model: [39.04557388]
Standardized model: [39.04557388]
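
One way to avoid forgetting the transform step is to bundle the scaler and the model with scikit-learn's `make_pipeline`. The sketch below is an optional alternative, not the approach this notebook uses. It runs on synthetic, noise-free data, and the names `pipe`, `new_row`, `pipe_pred`, and `manual_pred` are introduced only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data for illustration only (columns: weight, pulse)
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.normal(180, 25, 30), rng.normal(56, 7, 30)])
y_demo = 0.11 * X_demo[:, 0] - 0.02 * X_demo[:, 1] + 16.0

# The pipeline fits the scaler and the model together on the training data
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_demo, y_demo)

# predict() applies the fitted scaler automatically before the regression
new_row = [[210, 54]]
pipe_pred = pipe.predict(new_row)

# Same result as scaling by hand with the pipeline's own fitted steps
scaler_step = pipe.named_steps['standardscaler']
model_step = pipe.named_steps['linearregression']
manual_pred = model_step.predict(scaler_step.transform(new_row))
print(np.allclose(pipe_pred, manual_pred))
```

Because the pipeline is fitted only on the training data, it also follows the guideline of never fitting the scaler on test data.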
