### Module 8 Lab: Scaling Data Practice
    
Recall a few guidelines:
- Don't scale the target variable (the 'y', or dependent variable). Leave it in its original scale.
- Fit the scaler on the training data only (X_train), not on the whole dataset.
- Scale the test data (X_test) with that same fitted scaler before you use the trained model to predict values.
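
As a reminder of what these guidelines look like in code, here is a minimal sketch on made-up data. The names `X_demo`, `y_demo`, `X_tr`, and `X_te` are placeholders for illustration and are not part of the lab; `StandardScaler` stands in for whichever scaler you choose.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data for illustration: two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.uniform(300, 5000, 100), rng.integers(1, 11, 100)])
y_demo = rng.uniform(50_000, 500_000, 100)  # the target is left in its original units

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean and std from the training rows only
X_te_scaled = scaler.transform(X_te)      # reuse those training statistics on the test rows
```

Calling `fit` (or `fit_transform`) on the test rows would leak information about the test set into the scaler, which is why only `transform` is used there.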

In this lab:
- Load the Ames, Iowa housing dataset from S3
- Split the data into train and test sets
- Scale X_train data using both a normalizing scaler and a standardizing scaler
- Use a new type of linear regression algorithm based on Stochastic Gradient Descent (SGD)
    - You'll see a very large difference in model performance when you train using scaled data
- Compare linear regression models for the 3 datasets: Unscaled, Normalized, Standardized

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model # This includes the SGD algorithm
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import boto3
import pandas as pd
import numpy as np

### 1. Load the Ames, Iowa housing dataset from S3 into a pandas DataFrame
This dataset is similar to the Boston Housing dataset, just bigger.

The goal is to predict house sale price.

In [None]:
# Get the data from the S3 bucket: machinelearning-read-only
source_bucket = 'machinelearning-read-only'
source_key = 'data/AmesHousing.csv'
# Create session and S3 client
sess = boto3.session.Session()
s3 = sess.client('s3')
# Load the dataframe
response = s3.get_object(Bucket=source_bucket, Key=source_key)
df = pd.read_csv(response.get("Body")) 
print('The size of the complete dataset:',df.shape)
df.head(3)

#### 1A. Isolate the features and target variable
For simplification, use only 2 features from the dataset:
- 'Gr Liv Area' - size of the house in square feet
- 'Overall Qual' - rates the overall material and finish of the house

Use 'SalePrice' for the target variable.

In [None]:
# Isolate the relevant variables
X = df[['Gr Liv Area', 'Overall Qual']]
y = df[['SalePrice']]
print(X.shape)
print(y.shape)
# notice the scale of the features
X.describe()

#### Notice the vastly different scales of the variables
- Living area: 334 to 5642
- Overall quality: 1 to 10

This will make it hard for a gradient-based linear regression algorithm, like the one used later in this lab, to predict accurately.

### 2. Split the data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

### 3. Scale X_train data using a normalized and standardized scaler
Store the scaled data in these pandas DataFrames:
- X_train_norm
- X_train_stand
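
One detail worth knowing: scikit-learn scalers return NumPy arrays by default, so the result has to be wrapped to get a DataFrame back. The sketch below shows one way to do that on a tiny made-up table. `toy_train` and `toy_train_norm` are hypothetical names, and the values are invented for illustration, not taken from the Ames data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for X_train, with made-up values
toy_train = pd.DataFrame(
    {"Gr Liv Area": [900, 1500, 2400, 3100], "Overall Qual": [4, 6, 7, 9]},
    index=[10, 11, 12, 13],
)

scaler = MinMaxScaler()
scaled_array = scaler.fit_transform(toy_train)  # plain NumPy array, column names are lost

# Wrap the array again, reusing the original column names and row index
toy_train_norm = pd.DataFrame(
    scaled_array, columns=toy_train.columns, index=toy_train.index
)
print(toy_train_norm)
```

Keeping the original index makes it easier to line the scaled rows up with the matching rows of the target later on.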

#### Normalization:

In [None]:
# your code here

#### Standardization:

In [None]:
# your code here

#### Check:
You should have 3 X_train DataFrames:
- X_train: unscaled data
- X_train_norm: normalized data
- X_train_stand: standardized data
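
For reference, and assuming the default settings of `MinMaxScaler` and `StandardScaler`, each value $x$ in a feature column is transformed as follows, where $x_{\min}$, $x_{\max}$, $\mu$, and $\sigma$ are the minimum, maximum, mean, and standard deviation of that column in X_train:

$$
x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad x_{\text{stand}} = \frac{x - \mu}{\sigma}
$$

So the normalized training columns should range from 0 to 1, and the standardized training columns should have a mean of about 0 and a standard deviation of about 1.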

### 4. Use the Stochastic Gradient Descent (SGD) algorithm
Let's introduce a new type of linear regression algorithm, the SGD regressor:

It fits the same kind of linear model as a regular linear regression, but it is trained in a more sophisticated way. It uses something called a "loss function," which measures how far the predictions are from the true values. As the algorithm trains, it repeatedly evaluates how well it is performing and adjusts the coefficients accordingly to improve the predictions of the model.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor.score

This algorithm is sensitive to features with differing scales. We will see big improvements once we bring the features onto similar scales.

At this stage, we can use it in much the same way as the regular linear regression model. It is not as intuitive as a simple linear regression model, but it is more flexible (for example, it supports different loss functions and handles large datasets well).
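
To make the "evaluate and adjust" idea concrete, here is a minimal hand-written sketch of stochastic gradient descent for a squared loss, using NumPy only. It is an illustration under simplifying assumptions (made-up data, a fixed hand-picked learning rate, no regularization) and is not how `SGDRegressor` is implemented internally. All names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = 3*x1 - 2*x2 + 5 plus a little noise, features on similar scales
X_toy = rng.normal(size=(200, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 5 + rng.normal(scale=0.1, size=200)

w = np.zeros(2)  # coefficients, one per feature
b = 0.0          # intercept
lr = 0.01        # learning rate (step size), picked by hand for this toy problem

for epoch in range(50):
    # Visit the samples one at a time, in a new random order each epoch
    for i in rng.permutation(len(X_toy)):
        error = (X_toy[i] @ w + b) - y_toy[i]  # prediction error for this sample
        # Step against the gradient of the squared loss 0.5 * error**2
        w -= lr * error * X_toy[i]
        b -= lr * error

mse = np.mean((X_toy @ w + b - y_toy) ** 2)
print("coefficients:", w.round(2), "intercept:", round(b, 2), "MSE:", round(mse, 4))
```

Notice that each update is multiplied by the feature values. A feature measured in the thousands produces far larger steps than one measured from 1 to 10, which is one way to see why unscaled features make this kind of training unstable.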

In [None]:
# Hint: When you train the SGD algorithm, it expects a 1d numpy array for the y values. Otherwise, you'll get a warning.
# Convert y_train to a 1d numpy array before training the model
y_train = y_train.values.ravel()
y_train

In [None]:
# Use a list to keep track of model performance
model_performance = []

In [None]:
# First, use just the unscaled data to train and evaluate the model
#
# Create linear regression object
ori_SGD = linear_model.SGDRegressor() # We'll use the default hyperparameters
#
# Train the model using the training data. 
ori_SGD.fit(X_train, y_train)

In [None]:
# Unscaled data continued.....
#
# Make predictions of sales price using the unscaled X_test data
y_pred = ori_SGD.predict(X_test)
# Have a look at the first few of the y_test and y_pred
for row in range(4):
    pp = int(y_pred[row]) # Pull out predicted price
    tp = int(y_test.values[row].item()) # Pull out true price
    error = abs(pp - tp)
    print('Predicted price:', pp, 'True price:', tp, 'Error:', error)
# Now, compute r2 and mse
print('\nUnscaled model performance:')
r2 = round(r2_score(y_test, y_pred),2)
mse = round(mean_squared_error(y_test, y_pred),2)
print("Coefficient of determination: %.2f" % r2)
print("MSE:",mse)
# Store the results into the list
model_performance.append(['unscaled',r2,mse])

##### Wow, big errors and negative R^2. Not a very good regression model...

### Your turn:
- In a similar way, train and evaluate the model using normalized and standardized data. 
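
Since the evaluation steps are the same for every dataset, one option is to wrap them in a small helper. The function below is a hypothetical convenience, not part of the lab. It is demonstrated on a made-up toy model so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(label, model, X_eval, y_eval):
    """Return [label, r2, mse] for a fitted model, rounded to 2 decimals."""
    y_hat = model.predict(X_eval)
    r2 = round(r2_score(y_eval, y_hat), 2)
    mse = round(mean_squared_error(y_eval, y_hat), 2)
    return [label, r2, mse]

# Toy demonstration with made-up data that lies exactly on a line
toy_X = np.array([[0.0], [1.0], [2.0], [3.0]])
toy_y = 2 * toy_X.ravel() + 1
toy_model = LinearRegression().fit(toy_X, toy_y)

result = evaluate("toy", toy_model, toy_X, toy_y)
print(result)
```

In the lab, the same helper could be called with a model trained on scaled data, as long as the test features passed in were transformed with the matching scaler.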

In [None]:
# Train and evaluate using the normalized data
# Your code here:

In [None]:
# Train and evaluate using the standardized data
# Your code here:

In [None]:
# if you stored your values in this list, show it now
model_performance

### Summary:
What we did:

- Picked 2 "intuitive" features from 83 to predict sale price: size (sq. feet) and overall quality
- Scaled the X_train data using both normalization and standardization.
- Tried to use the unscaled data to train the SGD algorithm. It failed.
- Trained the SGD algorithm on both the normalized and standardized data and saw a big improvement.

If we wanted a better model, we could perform feature selection and hyperparameter tuning.