# Loss Functions

In this exercise, you will compare the effects of different loss functions on a linear regression model.

👇 Let's download a CSV file to use for this challenge and parse it into a DataFrame.

In [1]:
import pandas as pd

data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/04-Under-the-Hood/loss_functions_dataset.csv")
data.sample(5)

| Unnamed: 0 | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 450 | 0.79 | 637.0 | 343.0 | 147.0 | 7.0 | 0.25 | 41.245 |
| 586 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 0.4 | 32.205 |
| 332 | 0.62 | 808.5 | 367.5 | 220.5 | 3.5 | 0.25 | 15.43 |
| 667 | 0.64 | 784.0 | 343.0 | 220.5 | 3.5 | 0.4 | 19.995 |
| 680 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 0.4 | 31.955 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on their climate needs. 

🌿 You know that plants can handle small temperature variations, but are exponentially more sensitive as the temperature variations increase. 

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large prediction errors much more heavily than small ones and discourages the model from making them. This keeps temperature deviations small and lowers the risk for the plants.
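
To make this concrete, the sketch below scores two sets of prediction errors with both losses. The error values are made up for illustration and do not come from the dataset, and the two helper functions are hand-written rather than taken from a library.

```python
# Hypothetical prediction errors in °C (illustrative values, not from the dataset).
steady_errors = [2.0, 2.0, 2.0, 2.0]   # four moderate misses
spiky_errors = [0.0, 0.0, 0.0, 8.0]    # three perfect predictions, one large miss


def mean_absolute_error(errors):
    """Average of the absolute errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Average of the squared errors."""
    return sum(e ** 2 for e in errors) / len(errors)


for name, errors in [("steady", steady_errors), ("spiky", spiky_errors)]:
    print(name, "MAE:", mean_absolute_error(errors), "MSE:", mean_squared_error(errors))
```

Both sets have the same MAE, but the MSE of the set containing a single 8 °C miss is four times larger, so a model trained on MSE is pushed harder to avoid that kind of error.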

</details>

> YOUR ANSWER HERE

## 2. Application

### 2.1 Preprocessing

❓ Standardise the features

In [2]:
# Scaling puts all features on a comparable scale
# StandardScaler removes the mean and scales each feature to unit variance
# The mean and variance it estimates can be influenced by outliers

In [3]:
from sklearn.preprocessing import StandardScaler

In [4]:
X = data.loc[:, 'Relative Compactness': 'Glazing Area']

In [5]:
scaler = StandardScaler().fit(X)

In [6]:
X_scaled = scaler.transform(X)
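
As a rough illustration of what `StandardScaler` does to each column, the sketch below standardizes one feature by hand using only the standard library. The values are a handful of `Surface Area` entries copied from the sample rows shown earlier, and the helper name `standardize` is made up for this example. It uses the population standard deviation (dividing by n), which matches the estimator `StandardScaler` relies on.

```python
from statistics import fmean, pstdev

# A few `Surface Area` values taken from the sample rows shown earlier.
surface_area = [637.0, 588.0, 808.5, 784.0, 588.0]


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (divides by n)
    return [(v - mean) / std for v in values]


scaled = standardize(surface_area)
print([round(v, 3) for v in scaled])
print("mean:", round(fmean(scaled), 6), "std:", round(pstdev(scaled), 6))
```

After the transformation the column has a mean of about 0 and a standard deviation of about 1, which is the property the scaled features in `X_scaled` should share.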

### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)



In [7]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDRegressor
import numpy as np

In [8]:
sgd_model = SGDRegressor(loss="squared_error")

In [18]:
sgd_model_cv = cross_validate(
    sgd_model,
    X_scaled,
    data['Average Temperature'],
    cv=10,
    scoring = ['r2', 'max_error']
)

In [19]:
# `scoring` defines the metrics used to evaluate the model on each fold
# An R2 of 0.9 means the model explains 90% of the variance in the target

In [20]:
sgd_model_cv

{'fit_time': array([0.00728083, 0.00726199, 0.00744796, 0.00655794, 0.00684786,
        0.00859714, 0.00665212, 0.00622201, 0.00667691, 0.00621796]),
 'score_time': array([0.00127316, 0.00093699, 0.00103211, 0.00081706, 0.00105119,
        0.00093269, 0.00138974, 0.00100374, 0.00090599, 0.00092125]),
 'test_r2': array([0.78550761, 0.90935988, 0.89541707, 0.88363235, 0.93126754,
        0.89668963, 0.92700824, 0.91627863, 0.8949807 , 0.93870006]),
 'test_max_error': array([-9.86541317, -8.62014575, -8.74908218, -9.18442593, -8.90632617,
        -8.52280223, -8.53799553, -8.88384145, -8.37975956, -7.7578171 ])}

❓ Compute:
- the mean cross-validated R2 score and save it in the variable `r2`
- the single biggest prediction error in °C of all your folds and save it in the variable `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)
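
For intuition about the two metrics, here is a minimal sketch that computes them by hand on made-up numbers. The temperatures and function names are hypothetical. Keep in mind that scikit-learn scorers follow a "higher is better" convention, so the `max_error` scorer reports the negated value; this is why the fold scores in the output above are negative.

```python
# Hypothetical true and predicted temperatures in °C (not from the dataset).
y_true = [20.0, 25.0, 30.0, 35.0, 40.0]
y_pred = [22.0, 24.0, 31.0, 33.0, 45.0]


def r2_score_by_hand(y_true, y_pred):
    """R2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def max_error_by_hand(y_true, y_pred):
    """Largest absolute difference between a prediction and its target."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


print("R2:", r2_score_by_hand(y_true, y_pred))
print("max error (°C):", max_error_by_hand(y_true, y_pred))
```

R2 summarizes the fit over all samples, while the max error looks only at the single worst prediction, so the two can disagree about which model is safer.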

In [21]:
r2 = sgd_model_cv['test_r2'].mean()
r2

0.8978841701834407

In [22]:
max_error = abs(sgd_model_cv['test_max_error']).max()
max_error

9.865413166310766

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

❓ **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [None]:
# epsilon_insensitive ignores errors smaller than epsilon
# The larger epsilon is, the larger the errors the loss tolerates
# With epsilon=0 it reduces to the absolute error (MAE)

In [24]:
mae_model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)
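
Why this gives an MAE can be sketched in a few lines: the epsilon-insensitive loss is `max(0, |y - y_pred| - epsilon)`, so with `epsilon=0` it is exactly the absolute error. The function below is a hand-written illustration with hypothetical values, not scikit-learn's internal implementation.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Absolute error, with errors smaller than epsilon counted as zero."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical target and predictions in °C.
target = 30.0
for prediction in (29.5, 31.0, 34.0):
    mae_like = epsilon_insensitive_loss(target, prediction, epsilon=0)
    tolerant = epsilon_insensitive_loss(target, prediction, epsilon=1.0)
    print(prediction, "epsilon=0:", mae_like, "epsilon=1:", tolerant)
```

With `epsilon=1.0`, any prediction within 1 °C of the target costs nothing, whereas `epsilon=0` charges every error at its absolute value.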

In [26]:
mae_model_cv = cross_validate(
    mae_model,
    X_scaled,
    data['Average Temperature'],
    cv=10,
    scoring = ['r2', 'max_error']
)

❓ Compute:
- the mean cross-validated R2 score, store it in `r2_mae`
- the single biggest prediction error of all your folds, store it in `max_error_mae`

In [27]:
mae_model_cv

{'fit_time': array([0.01056075, 0.00931311, 0.01145387, 0.00948095, 0.01081181,
        0.01105189, 0.01077914, 0.01217103, 0.00845122, 0.01218796]),
 'score_time': array([0.00097013, 0.00091195, 0.00142217, 0.00086212, 0.0010879 ,
        0.00089288, 0.00084519, 0.00081301, 0.00070977, 0.00082922]),
 'test_r2': array([0.74182237, 0.87335621, 0.87425106, 0.84743669, 0.91756668,
        0.87542607, 0.91806341, 0.89960911, 0.87717452, 0.93612138]),
 'test_max_error': array([-11.13937467, -10.71038773, -10.65542318, -11.18446019,
        -11.13179256, -10.88212706, -10.78203196, -11.08202627,
        -11.01161193, -10.08176081])}

In [28]:
r2_mae = mae_model_cv['test_r2'].mean()

In [29]:
max_error_mae = abs(mae_model_cv['test_max_error']).max()
max_error_mae

11.184460188836322

## 3. Conclusion

❓ Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘 Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar, the model optimized on MAE makes larger worst-case errors (about 11.2 °C versus 9.9 °C in the runs above), which increases the risk of killing plants!

    
</details>

> YOUR ANSWER HERE

# 🏁 Check your code and push your notebook

In [30]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae
)

result.write()
print(result.check())


platform darwin -- Python 3.10.6, pytest-7.1.3, pluggy-1.0.0 -- /Users/bingobango/.pyenv/versions/tom/bin/python3
cachedir: .pytest_cache
rootdir: /Users/bingobango/code/lewagon/data-loss-functions/tests
plugins: anyio-3.6.1, asyncio-0.19.0
asyncio: mode=strict
collecting ... collected 3 items

test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED   [ 33%]
test_loss_functions.py::TestLossFunctions::test_r2 PASSED                [ 66%]
test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED            [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master

