# Loss Functions

In this exercise, you will compare how different loss functions affect a linear regression model.

👇 Import the data from the attached csv file

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data.csv')
df.head()

|   | Relative Compactness | Surface Area | Wall Area | Roof Area | Overall Height | Glazing Area | Average Temperature |
|---|---|---|---|---|---|---|---|
| 0 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 1 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 2 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 3 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 0.0 | 18.44 |
| 4 | 0.9 | 563.5 | 318.5 | 122.5 | 7.0 | 0.0 | 24.56 |


🎯 Your task is to predict the average temperature inside a greenhouse based on its design. Your temperature predictions will help you select the appropriate greenhouse design for each plant, based on its climate needs.

🌿 You know that plants can handle small temperature variations, but become exponentially more sensitive as those variations grow.

## 1. Theory 

❓ Theoretically, which Loss function would you train your model on to limit the risk of killing plants?

<details>
<summary> 🆘 Answer </summary>
    
In theory, you would use a Mean Squared Error (MSE) loss function. Because it squares each error, it penalizes large errors much more heavily than small ones, which discourages the model from making large mistakes. This keeps temperature deviations smaller and lowers the risk for the plants.

</details>

We could use the mean squared error, since it penalizes large errors more heavily.
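To make the difference concrete, here is a minimal sketch comparing the two losses on two made-up sets of prediction errors. The error values and the helper names `mae` and `mse` are assumptions for illustration only and are not part of the exercise data.

```python
# Two hypothetical sets of prediction errors (in °C) with the same mean absolute error.
steady_errors = [2.0, 2.0, 2.0, 2.0]  # four moderate misses
spiky_errors = [0.0, 0.0, 0.0, 8.0]   # three perfect predictions, one large miss


def mae(errors):
    """Mean absolute error: every degree of error costs the same."""
    return sum(abs(e) for e in errors) / len(errors)


def mse(errors):
    """Mean squared error: the cost grows with the square of the error."""
    return sum(e ** 2 for e in errors) / len(errors)


for name, errors in [("steady", steady_errors), ("spiky", spiky_errors)]:
    print(f"{name}: MAE = {mae(errors):.1f}, MSE = {mse(errors):.1f}")
```

Both sets have an MAE of 2.0, but the MSE is 4.0 for the steady errors and 16.0 for the spiky ones. A model trained on MSE is therefore pushed to avoid the single 8 °C miss, which is the kind of error that would hurt the plants most.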

## 2. Application

### 2.1 Preprocessing

👇 Scale the features

In [3]:
from sklearn.preprocessing import StandardScaler

In [4]:
y = df['Average Temperature']
X = df.drop(columns='Average Temperature')

In [6]:
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)

In [7]:
X_std

array([[ 2.04177671, -1.78587489, -0.56195149, -1.47007664,  1.        ,
        -1.76044698],
       [ 2.04177671, -1.78587489, -0.56195149, -1.47007664,  1.        ,
        -1.76044698],
       [ 2.04177671, -1.78587489, -0.56195149, -1.47007664,  1.        ,
        -1.76044698],
       ...,
       [-1.36381225,  1.55394308,  1.12390297,  0.97251224, -1.        ,
         1.2440492 ],
       [-1.36381225,  1.55394308,  1.12390297,  0.97251224, -1.        ,
         1.2440492 ],
       [-1.36381225,  1.55394308,  1.12390297,  0.97251224, -1.        ,
         1.2440492 ]])
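As a rough illustration of what the scaler does to each column, the sketch below standardizes one feature by hand. It uses only the five `Surface Area` values visible in `df.head()` above, so the numbers will not match `X_std`, which is computed on the full dataset. `StandardScaler` uses the population standard deviation, which corresponds to `pstdev` here.

```python
from statistics import fmean, pstdev

# The five `Surface Area` values shown in df.head(); the full column has more rows.
surface_area = [514.5, 514.5, 514.5, 514.5, 563.5]

mean = fmean(surface_area)   # column mean
std = pstdev(surface_area)   # population standard deviation (ddof = 0)

# Standardize: subtract the mean, then divide by the standard deviation.
scaled = [(value - mean) / std for value in surface_area]

print(scaled)
print(fmean(scaled), pstdev(scaled))  # approximately 0 and 1
```

After scaling, every feature has mean 0 and unit variance, which matters for SGD because its updates are sensitive to the scale of the inputs.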

### 2.2 Modeling

In this section, you are going to verify the theory by evaluating models optimized on different Loss functions.

### Least Squares (MSE) Loss

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **Least Squares Loss** (MSE)
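For intuition about what 10-fold cross-validation does with the rows, here is a minimal sketch of contiguous, unshuffled fold splitting. The function name `kfold_indices` and the sample count of 23 are hypothetical; in practice `cross_validate` handles the splitting for you.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical dataset of 23 rows split into 10 folds.
for train, test in kfold_indices(23, 10):
    print(f"train on {len(train)} rows, test on rows {test}")
```

Each row lands in exactly one test fold, so every observation is used once for evaluation and nine times for training.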



In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate

In [9]:
# 'squared_error' is the least-squares loss (it was named 'squared_loss' before scikit-learn 1.0)
sgd_model = SGDRegressor(loss='squared_error')

sgd_model_cv = cross_validate(sgd_model, X_std, y, cv=10, scoring=['r2', 'max_error'])
sgd_model_cv

{'fit_time': array([0.01006889, 0.00620961, 0.00661397, 0.00672102, 0.00764275,
        0.0056932 , 0.0055294 , 0.00530601, 0.00583553, 0.00532317]),
 'score_time': array([0.0012846 , 0.00100589, 0.00127101, 0.00113201, 0.00109673,
        0.00078869, 0.00084448, 0.00078321, 0.00070763, 0.00076413]),
 'test_r2': array([0.7855601 , 0.9089638 , 0.89531786, 0.88402824, 0.93143739,
        0.89653359, 0.92694057, 0.91602822, 0.89486122, 0.93908332]),
 'test_max_error': array([-9.8923053 , -8.66749942, -8.77407728, -9.19315991, -8.80216432,
        -8.66363922, -8.59634092, -8.88748563, -8.42272289, -7.69900127])}

👇 Compute
- the mean cross-validated R2 score `r2`
- the single biggest prediction error, in °C, across all your folds `max_error`

(Tip: `max_error` is an accepted scoring metric in sklearn)

In [18]:
r2 = sgd_model_cv['test_r2'].mean()
r2

0.897875431370011

In [19]:
max_error = abs(sgd_model_cv['test_max_error']).max()
max_error

9.892305299038252
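The `test_max_error` values above are negative because sklearn scorers follow a "higher is better" convention, so error metrics are reported with their sign flipped. The sketch below repeats the aggregation in plain Python on the rounded scores printed above; the variable names `mean_r2` and `worst_error` are only illustrative.

```python
# Per-fold scores copied (rounded) from the cross_validate output above.
test_r2 = [0.7855601, 0.9089638, 0.89531786, 0.88402824, 0.93143739,
           0.89653359, 0.92694057, 0.91602822, 0.89486122, 0.93908332]
test_max_error = [-9.8923053, -8.66749942, -8.77407728, -9.19315991, -8.80216432,
                  -8.66363922, -8.59634092, -8.88748563, -8.42272289, -7.69900127]

# Average R2 over the 10 folds.
mean_r2 = sum(test_r2) / len(test_r2)

# Take absolute values first: the largest error is the most negative score.
worst_error = max(abs(score) for score in test_max_error)

print(mean_r2, worst_error)
```

Calling `.max()` directly on the negative scores would instead return the smallest worst-case error among the folds, which is why the absolute value is taken first.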

### Mean Absolute Error (MAE) Loss

What if we optimize our model on the MAE instead?

👇 **10-Fold Cross-validate** a Linear Regression model optimized by **Stochastic Gradient Descent** (SGD) on a **MAE** Loss

<details>
<summary>💡 Hints</summary>

- MAE loss cannot be directly specified in `SGDRegressor`. It must be engineered by adjusting the right parameters

</details>

In [14]:
# MAE loss engineered by using the epsilon-insensitive loss with epsilon = 0
mae_model = SGDRegressor(loss="epsilon_insensitive", epsilon=0)

# Cross Validate Model
mae_sgd = cross_validate(mae_model, X_std, y, cv = 10, scoring = ['r2','max_error'])
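To see why `epsilon=0` yields an absolute-error loss, here is a small sketch of the epsilon-insensitive loss for a single prediction. The function name and the temperature values are assumptions for illustration; the formula max(0, |y - ŷ| - ε) is the standard definition of this loss.

```python
def epsilon_insensitive(y_true, y_pred, epsilon):
    """Ignore errors smaller than epsilon; grow linearly beyond that."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical true and predicted temperatures in °C.
y_true, y_pred = 20.0, 23.0

# With a 1 °C tolerance, only the part of the error beyond 1 °C is penalized.
print(epsilon_insensitive(y_true, y_pred, epsilon=1.0))

# With epsilon = 0 there is no tolerance, so the loss is exactly |y_true - y_pred|.
print(epsilon_insensitive(y_true, y_pred, epsilon=0.0))
```

Averaging this loss over all samples with `epsilon=0` gives the mean absolute error, which is what the `SGDRegressor` above is optimizing.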

👇 Compute
- the mean cross-validated R2 score `r2_mae`
- the single biggest prediction error across all your folds `max_error_mae`

In [20]:
r2_mae = mae_sgd['test_r2'].mean()
r2_mae

0.8762114954984318

In [21]:
max_error_mae = abs(mae_sgd['test_max_error']).max()
max_error_mae

11.220344424837414

## 3. Conclusion

❓Which of the models you evaluated seems the most appropriate for your task?

<details>
<summary> 🆘Answer </summary>
    
Although the mean cross-validated R2 scores of the two models are similar, the model optimized on MAE is more likely to make large errors from time to time, which increases the risk of killing plants!

    
</details>

The model trained with the mean squared error loss.

# 🏁 Check your code

In [22]:
from nbresult import ChallengeResult

result = ChallengeResult('loss_functions',
    r2 = r2,
    r2_mae = r2_mae,
    max_error = max_error,
    max_error_mae = max_error_mae,                     
)
result.write()
print(result.check())

platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0 -- /home/matheus/.pyenv/versions/lewagon/bin/python3
cachedir: .pytest_cache
rootdir: /home/matheus/code/matheussposito/data-challenges-869/05-ML/04-Under-the-hood/01-Loss-Functions
plugins: anyio-3.4.0
collecting ... collected 3 items

tests/test_loss_functions.py::TestLossFunctions::test_max_error_order PASSED [ 33%]
tests/test_loss_functions.py::TestLossFunctions::test_r2 PASSED          [ 66%]
tests/test_loss_functions.py::TestLossFunctions::test_r2_mae PASSED      [100%]



💯 You can commit your code:

git add tests/loss_functions.pickle

git commit -m 'Completed loss_functions step'

git push origin master
