<a href="https://colab.research.google.com/github/nyp-sit/sdaai-iti103/blob/master/session-6/GradientDescent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" align="left"/></a>

# Gradient Descent

Welcome to the programming exercise. This is part of a series of exercises that help you learn different techniques for fine-tuning your model.

**You will learn:**
- how to use `SGDRegressor` to train your model
- the effects of learning rate and feature scaling on gradient descent-based algorithms




## Import required libraries

In [None]:
from __future__ import print_function

import warnings
warnings.filterwarnings('ignore', module='sklearn')

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNetCV

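Here we load the Boston housing prices dataset. The features are not scaled. We train models using several regression methods (Linear Regression, Ridge, Lasso and ElasticNet) and compare their RMSEs. We then compare these RMSEs with those of an SGDRegressor. Note that `load_boston` was removed in scikit-learn 1.2, so this notebook needs an earlier version.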

In [None]:
boston = load_boston()
X = boston['data']
y = boston['target']

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


We define a function to calculate rmse:

In [None]:

from sklearn.metrics import mean_squared_error

def rmse(ytrue, ypredicted):
    return np.sqrt(mean_squared_error(ytrue, ypredicted))
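
As a quick sanity check of the helper above, the following sketch compares it with a hand-computed value on a tiny made-up example. The arrays `y_true_demo` and `y_pred_demo` are illustrative only and are not part of the exercise data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(ytrue, ypredicted):
    return np.sqrt(mean_squared_error(ytrue, ypredicted))

# Made-up values for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0 and 4, so the mean is 5/3
manual_rmse = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))
helper_rmse = rmse(y_true_demo, y_pred_demo)

print(helper_rmse, manual_rmse)
```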
    

We train a plain vanilla **Linear Regression** and calculate the RMSE.

In [None]:
linearRegression = LinearRegression().fit(X_train, y_train)

linearRegression_rmse = rmse(y_test, linearRegression.predict(X_test))

print(linearRegression_rmse)

We then train the model using **Ridge Regression**, and find the best alpha and the corresponding RMSE.

In [None]:
alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]

ridgeCV = RidgeCV(alphas=alphas, 
                  cv=4).fit(X_train, y_train)

ridgeCV_rmse = rmse(y_test, ridgeCV.predict(X_test))

print(ridgeCV.alpha_, ridgeCV_rmse)

We then train the model using **Lasso Regression**, and find the best alpha and the corresponding RMSE.

In [None]:
lasso_alphas = np.array([1e-5, 5e-5, 0.0001, 0.0005])

lassoCV = LassoCV(alphas=lasso_alphas,
                  max_iter=50000,  # integer; recent scikit-learn versions reject floats here
                  cv=3).fit(X_train, y_train)

lassoCV_rmse = rmse(y_test, lassoCV.predict(X_test))

print(lassoCV.alpha_, lassoCV_rmse)  # Lasso is slower

We then train the model using **ElasticNet Regression**, and find the best alpha, the best l1_ratio and the corresponding RMSE.

In [None]:
l1_ratios = np.linspace(0.1, 0.9, 9)

elasticNetCV = ElasticNetCV(alphas=lasso_alphas, 
                            l1_ratio=l1_ratios,
                            max_iter=10000).fit(X_train, y_train)
elasticNetCV_rmse = rmse(y_test, elasticNetCV.predict(X_test))

print(elasticNetCV.alpha_, elasticNetCV.l1_ratio_, elasticNetCV_rmse)

Now we put the best RMSE of each model in a dataframe for comparison.

In [None]:
rmse_vals = [linearRegression_rmse, ridgeCV_rmse, lassoCV_rmse, elasticNetCV_rmse]

labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet']

rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'RMSE'}, inplace=True)
rmse_df


Now let's try the **SGDRegressor**, using the same best hyperparameters found for Ridge, Lasso and ElasticNet, but with the default initial learning rate (`eta0`) of 0.01 and the default learning rate schedule, 'invscaling', i.e. eta = eta0 / pow(t, power_t).
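
To get a feel for how the 'invscaling' schedule shrinks the step size, here is a minimal sketch that evaluates the formula above for a few values of `t`. It assumes `power_t = 0.25`, which is the `SGDRegressor` default as far as we are aware; the list `steps` is chosen only for illustration.

```python
eta0 = 0.01      # initial learning rate quoted above
power_t = 0.25   # assumption: SGDRegressor's default exponent

steps = [1, 10, 100, 1000, 10000]   # illustrative update counts
schedule = [eta0 / pow(t, power_t) for t in steps]

for t, eta in zip(steps, schedule):
    print(f"t={t:>6d}  eta={eta:.5f}")
```

Under this assumption the learning rate only drops by a factor of 10 after 10,000 updates, so a starting value that is too large for the data stays too large for a long time.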

In [None]:
from sklearn.linear_model import SGDRegressor


model_parameters_dict = {
    'Linear': {'penalty': 'none'},
    # Lasso uses an L1 penalty; Ridge uses an L2 penalty
    'Lasso': {'penalty': 'l1',
           'alpha': lassoCV.alpha_},
    'Ridge': {'penalty': 'l2',
           'alpha': ridgeCV.alpha_},
    'ElasticNet': {'penalty': 'elasticnet', 
                   'alpha': elasticNetCV.alpha_ ,
                   'l1_ratio': elasticNetCV.l1_ratio_}
}

new_rmses = {}
for modellabel, parameters in model_parameters_dict.items():
    # following notation passes the dict items as arguments
    SGD = SGDRegressor(**parameters)
    SGD.fit(X_train, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test))

    
rmse_df['RMSE-SGD'] = pd.Series(new_rmses)
rmse_df

**Exercise**

What do you observe about the RMSE? What do you think is the reason for the observed RMSE? 

<details><summary>Click here for answer</summary>
    
Notice how high the error values are! The algorithm is diverging. This can happen when the features are not scaled and/or the learning rate is too high. Let's adjust the learning rate and see what happens.
    
</details>
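
The divergence can be reproduced on a toy problem. The sketch below is not the SGDRegressor algorithm: it uses plain full-batch gradient descent on a single made-up feature (the names `x_demo`, `y_demo` and `gradient_descent` are hypothetical) purely to show how a step size that is too large for the feature scale makes the weight blow up, while a much smaller one converges.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=200)               # unscaled feature in [0, 100]
y_demo = 2.0 * x_demo + rng.normal(0, 1, size=200)   # true slope is 2

def gradient_descent(x, y, lr, n_steps=50):
    """Full-batch gradient descent on MSE for one weight w (no intercept)."""
    w = 0.0
    for _ in range(n_steps):
        grad = -2.0 * np.mean(x * (y - w * x))   # d/dw of mean((y - w*x)^2)
        w -= lr * grad
    return w

w_large_lr = gradient_descent(x_demo, y_demo, lr=0.01)   # too large for this scale
w_small_lr = gradient_descent(x_demo, y_demo, lr=1e-4)   # small enough to converge

print("lr=0.01 :", w_large_lr)
print("lr=1e-4 :", w_small_lr)
```

With features in the hundreds, the gradient is large, so each step with `lr=0.01` overshoots by more than the previous error and the weight grows without bound.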

**Exercise**

Now let's try a smaller learning rate of 1e-7 (i.e. 0.0000001). Apply the same versions of SGD with this `eta0` and compare the resulting RMSEs with the previous SGD results.

<details><summary>Click here for answer</summary>
    
```python

for modellabel, parameters in model_parameters_dict.items():
    # following notation passes the dict items as arguments
    SGD = SGDRegressor(eta0=1e-7, **parameters)
    SGD.fit(X_train, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test))
```
</details>

In [None]:
new_rmses = {}

## START YOUR CODE HERE 


## END YOUR CODE HERE

rmse_df['RMSE-SGD-learningrate'] = pd.Series(new_rmses)
rmse_df

**Exercise**

Now let's scale our training data and try again.

* Fit a `MinMaxScaler` to `X_train` and create a variable `X_train_scaled`.
* Using the same scaler, transform `X_test` and create a variable `X_test_scaled`.
* Apply the same versions of SGD to the scaled data and compare the results. Don't pass in an `eta0` this time.

<details><summary>Click here for answer</summary>
    
```python
    
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

for modellabel, parameters in model_parameters_dict.items():
    # following notation passes the dict items as arguments
    SGD = SGDRegressor(**parameters)
    SGD.fit(X_train_scaled, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test_scaled))

rmse_df['RMSE-SGD-scaled'] = pd.Series(new_rmses)
rmse_df

```
</details>
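
To make explicit what the scaler does, the following sketch applies `MinMaxScaler` to two tiny made-up arrays (`X_train_demo` and `X_test_demo` are illustrative, not the Boston data) and checks the result against the formula (x - min) / (max - min), where min and max come from the training data only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: two features on very different scales
X_train_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test_demo = np.array([[4.0, 200.0]])

demo_scaler = MinMaxScaler()
X_train_demo_scaled = demo_scaler.fit_transform(X_train_demo)   # learns min/max from train
X_test_demo_scaled = demo_scaler.transform(X_test_demo)         # reuses the train min/max

# Same computation by hand, using training statistics only
col_min = X_train_demo.min(axis=0)
col_max = X_train_demo.max(axis=0)
manual_test_scaled = (X_test_demo - col_min) / (col_max - col_min)

print(X_train_demo_scaled)
print(X_test_demo_scaled, manual_test_scaled)
```

Because the test row is transformed with the training minimum and maximum, its values can fall outside [0, 1], as the first feature does here.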

In [None]:
from sklearn.preprocessing import MinMaxScaler

new_rmses = {}

## START YOUR CODE HERE ###



### END YOUR CODE HERE 

rmse_df['RMSE-SGD-scaled'] = pd.Series(new_rmses)
rmse_df

**Exercise**

What do you observe about the RMSE values? Does the scaling help?

<details><summary>Click here for answer</summary>
    
We can see much smaller RMSEs. Scaling has a large impact on the performance of SGD: with all features in a similar range, the default learning rate no longer causes divergence and SGD learns much better.
    
</details>
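
One possible way to keep the scaling and the model together is a scikit-learn pipeline, so the scaler is always fit on the training split only. The sketch below is an optional illustration on synthetic stand-in data (`X_synth`, `y_synth` and the coefficients are made up), not part of the exercise and not the notebook's own method.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical): two features on very different scales
rng = np.random.default_rng(42)
X_synth = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 1000, 500)])
y_synth = 3.0 * X_synth[:, 0] + 0.01 * X_synth[:, 1] + rng.normal(0, 0.1, 500)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_synth, y_synth, random_state=42)

# fit() scales with training statistics; predict() reuses them on new data
sgd_pipeline = make_pipeline(MinMaxScaler(), SGDRegressor(random_state=42))
sgd_pipeline.fit(Xs_train, ys_train)

synth_predictions = sgd_pipeline.predict(Xs_test)
pipeline_rmse = np.sqrt(mean_squared_error(ys_test, synth_predictions))
print(pipeline_rmse)
```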