# Loss Functions

In this exercise, you will compare the effects of Loss functions on a `LinearRegression` model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
627,0.98,514.5,294.0,110.25,7.0,0.4,33.375
549,0.76,661.5,416.5,122.5,7.0,0.4,40.41
490,0.86,588.0,294.0,147.0,7.0,0.25,31.23
342,0.9,563.5,318.5,122.5,7.0,0.25,32.55
725,0.9,563.5,318.5,122.5,7.0,0.4,34.075


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
By theory, you would use a Mean Square Error (MSE) Loss function. It would penalize outlier predictions and prevent your model from committing large errors. This would ensure smaller temperature variations and a lower risk for plants.

</details>

> YOUR ANSWER HERE

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
data.columns

Index(['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
       'Overall Height', 'Glazing Area', 'Average Temperature'],
      dtype='object')

In [4]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler() # Instanciate StandarScaler

scaler.fit(data[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
       'Overall Height', 'Glazing Area']]) # Fit scaler to data

data[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
       'Overall Height', 'Glazing Area']] = scaler.transform(data[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
       'Overall Height', 'Glazing Area']]) # Use scaler to transform data

data.head()

Unnamed: 0,Relative Compactness,Surface Area,Wall Area,Roof Area,Overall Height,Glazing Area,Average Temperature
0,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,18.44
1,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,18.44
2,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,18.44
3,2.041777,-1.785875,-0.561951,-1.470077,1.0,-1.760447,18.44
4,1.284979,-1.229239,0.0,-1.198678,1.0,-1.760447,24.56


### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [72]:
# Import the model
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate

X = data[['Relative Compactness', 'Surface Area', 'Wall Area', 'Roof Area',
       'Overall Height', 'Glazing Area']]
y = data['Average Temperature']

# Instantiate model
model = SGDRegressor(loss='squared_error') # OLS solved by SGD

# 10-Fold Cross validate model
cv_results = cross_validate(model, X, y, cv=10, scoring=['neg_mean_absolute_error', 'r2'])
cv_results


{'fit_time': array([0.01289701, 0.00723195, 0.00572205, 0.00573087, 0.00491595,
        0.00456715, 0.00358582, 0.0036521 , 0.00368214, 0.00286889]),
 'score_time': array([0.00255895, 0.00132513, 0.00109386, 0.00092793, 0.00078368,
        0.0006988 , 0.0006969 , 0.00061703, 0.00056291, 0.00056195]),
 'test_neg_mean_absolute_error': array([-3.30594405, -1.86080148, -2.26583991, -2.3878028 , -1.63106736,
        -2.10876866, -1.82200267, -2.2572098 , -2.34660874, -1.83459744]),
 'test_r2': array([0.78492701, 0.90944194, 0.89570315, 0.88399678, 0.93132141,
        0.89676498, 0.92708227, 0.91539285, 0.89594701, 0.93878284])}

❓ Compute 
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in °C of all your folds and save it in the variable `max_error`?

(Tips: `max_error` is an accepted scoring metric in sklearn)

In [73]:
max_error = cv_results["test_neg_mean_absolute_error"].max()
max_error

-1.631067361175653

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [80]:
from sklearn.linear_model import SGDRegressor

lin_reg_sgd = SGDRegressor(loss='squared_error') # OLS solved by SGD

# 10-Fold Cross validate model
cv_results = cross_validate(lin_reg_sgd , X, y, cv=10, scoring=['neg_mean_squared_error', 'r2'])

r2_mae = cv_results['test_r2'].mean()
cv_results

{'fit_time': array([0.0115881 , 0.00676203, 0.00584698, 0.00539875, 0.00484705,
        0.0046351 , 0.00338888, 0.00354004, 0.00453186, 0.0030551 ]),
 'score_time': array([0.00235581, 0.00132728, 0.00139308, 0.00110841, 0.00092292,
        0.00080681, 0.00079107, 0.00071502, 0.00061798, 0.00056672]),
 'test_neg_mean_squared_error': array([-17.7681022 ,  -6.32452985,  -9.27145734,  -9.94289013,
         -5.98612216,  -9.74913301,  -6.70521272,  -9.29503459,
         -9.71068136,  -5.93018612]),
 'test_r2': array([0.78668386, 0.908479  , 0.89417098, 0.88297672, 0.93113792,
        0.8966073 , 0.92711156, 0.91563633, 0.8949165 , 0.93903267])}

❓ Compute 
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error of all your folds, store it in `max_error_mae`?

In [77]:
max_error_mae = cv_results['test_neg_mean_squared_error'].max()
max_error_mae

-5.89526099858917

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although mean cross-validated r2 scores are approximately similar between the two models, the one optimized on a MAE has more chance to make larger mistakes from time to time, increasing the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code and push your notebook

In [81]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/gulecs/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /Users/gulecs/code/gulecsec/data-loss-functions/tests
plugins: anyio-3.6.1, dash-2.7.0, asyncio-0.19.0
asyncio: mode=strict
[1mcollecting ... [0mcollected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order [32mPASSED[0m[32m   [ 33%][0m
test_loss_functions.py::TestLossFunctions::test_r2 [32mPASSED[0m[32m                [ 66%][0m
test_loss_functions.py::TestLossFunctions::test_r2_mae [32mPASSED[0m[32m            [100%][0m



💯 You can commit your code:

[1;32mgit[39m add tests/loss_functions.pickle

[32mgit[39m commit -m [33m'Completed loss_functions step'[39m

[32mgit[39m push origin master

