<img src="https://nyp-aicourse.s3.ap-southeast-1.amazonaws.com/agods/nyp_ago_logo.png" width='400'/>

# Gradient Descent


In this lab, you will learn:
- how to use `SGDRegressor` to train your model
- how the learning rate and feature scaling affect the performance of gradient descent-based algorithms




## Import required libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

We will use the Boston housing dataset for our regression problem. A description of the dataset is available at this [link](https://www.kaggle.com/code/prasadperera/the-boston-housing-dataset/data).

In [None]:
boston = pd.read_csv('data/boston.csv', index_col=0)
boston.info()

## Explore our data

In [None]:
boston.describe()

### Question 1

What do you observe about the range of values of the different features? 


## Split our dataset

MEDV refers to the median value of owner-occupied homes in $1000s. This is our target label. 

### Exercise 1

Create `X` (features/predictors) and `y` (labels) from the boston dataframe. Create a train and test set from `X` and `y` and call them `X_train, X_test, y_train, y_test`.

In [None]:
# Enter your code here




<details><summary>Click here for solution</summary>
    
```python

X = boston.drop('MEDV', axis=1)
y = boston['MEDV']

# Split the data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

```
</details>

## Train a regression model

Now let's try the [`SGDRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html) with its default settings: a starting learning rate (`eta0`) of 0.01 and the 'invscaling' learning rate schedule, i.e. `eta = eta0 / pow(t, power_t)`. We also use the default penalty (L2 regularization) with `alpha=0.0001`.
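To get a feel for the 'invscaling' schedule, the formula above can be evaluated by hand. The sketch below is only an illustration: the helper name `invscaling_eta` is hypothetical, `power_t=0.25` is assumed to be the `SGDRegressor` default, and `t` is treated simply as an update counter.

```python
def invscaling_eta(eta0, t, power_t=0.25):
    """Learning rate at update step t: eta = eta0 / pow(t, power_t)."""
    return eta0 / pow(t, power_t)


# The learning rate shrinks slowly as the number of updates grows
for t in (1, 10, 100, 1000, 10000):
    print(f"t={t:>5}  eta={invscaling_eta(0.01, t):.5f}")
```

Under these assumptions the learning rate starts at `eta0` and decays gradually, so a starting value that is too large for the data can still cause trouble in the early updates.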

In [None]:
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor()
sgd.fit(X_train, y_train)

In [None]:
# squared=False returns the root mean squared error (RMSE) instead of the MSE
rmse = mean_squared_error(y_test, sgd.predict(X_test), squared=False)
print(rmse)

### Question 2

What do you observe about the RMSE? What do you think is the reason for the observed RMSE? 

<details><summary>Click here for answer</summary>
    
Notice how high the error is! The algorithm is diverging. This can happen when the features are not scaled and/or the learning rate is too high. Let's adjust the learning rate and see what happens.
    
</details>

### Exercise 2
Now let's try a smaller learning rate of 1e-7 (i.e. 0.0000001). Train `SGDRegressor` again with this learning rate and compare the new RMSE with the previous one.

Complete the code in the following code cell.


In [None]:
# Enter your code here 




<details><summary>Click here for answer</summary>
    
```python
sgd = SGDRegressor(eta0=1e-7, random_state=42)
sgd.fit(X_train, y_train)

rmse = mean_squared_error(y_test, sgd.predict(X_test), squared=False)
print(rmse)
```
</details>

### Exercise 3

Now let's scale our training data and try again.

* Fit a `StandardScaler` to `X_train` and create a variable `X_train_scaled`.
* Using the same scaler, transform `X_test` and create a variable `X_test_scaled`.
* Train `SGDRegressor` on the scaled data and compare the results. Use the default `eta0` this time.

Complete the code in the following code cell.


In [None]:
# Enter your code here 




<details><summary>Click here for answer</summary>
    
```python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
sgd = SGDRegressor(random_state=42)
sgd.fit(X_train_scaled, y_train)

rmse = mean_squared_error(y_test, sgd.predict(X_test_scaled), squared=False)
print(rmse)

```
</details>
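As a variation on the solution above, the scaler and the regressor can be chained with `make_pipeline`, so the scaler is fitted on the training data only and reused automatically at prediction time. This is a minimal sketch, not part of the lab solution: it uses a synthetic dataset from `make_regression` as a stand-in for the Boston data, and the names `X_demo`, `y_demo`, `pipe` and `rmse_demo` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the lab data (an assumption, not the Boston dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)
X_demo[:, 0] *= 500  # put one feature on a much larger scale than the others

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# The pipeline scales the features, then trains SGD on the scaled values
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=42))
pipe.fit(X_tr, y_tr)

# Square root of the MSE gives the RMSE
rmse_demo = mean_squared_error(y_te, pipe.predict(X_te)) ** 0.5
print(rmse_demo)
```

With a pipeline, `pipe.predict` applies the fitted scaler before predicting, so there is no separate `X_test_scaled` variable to keep track of.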

### Question 3

What do you observe about the value of RMSE? What can you conclude? 

<details><summary>Click here for answer</summary>
    
We can see a smaller RMSE. Feature scaling has a large impact on the performance of SGD and helps it learn better.
    
</details>
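The effect discussed in Questions 2 and 3 can be reproduced on a toy problem. The sketch below uses plain batch gradient descent written in NumPy, not scikit-learn's `SGDRegressor`, and the data, the function name `gradient_descent` and the learning rate of 0.1 are all assumptions chosen for illustration.

```python
import numpy as np


def gradient_descent(X, y, learning_rate, n_steps=20):
    """Batch gradient descent for least squares; returns the final MSE."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        error = X @ w + b - y
        # Gradients of the mean squared error with respect to w and b
        w -= learning_rate * 2 * (X.T @ error) / len(y)
        b -= learning_rate * 2 * error.mean()
    return float(np.mean((X @ w + b - y) ** 2))


rng = np.random.default_rng(42)
# Two hypothetical features on very different scales
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 700, 200)])
y = 3.0 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.5, 200)

# Standardize each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

mse_raw = gradient_descent(X, y, learning_rate=0.1)
mse_scaled = gradient_descent(X_scaled, y, learning_rate=0.1)
print(f"MSE without scaling: {mse_raw:.3e}")
print(f"MSE with scaling:    {mse_scaled:.3f}")
```

With the same learning rate, the updates on the unscaled data overshoot further at every step because of the large-valued feature, while the standardized data converges. Lowering the learning rate would also stop the divergence, which mirrors Exercise 2.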