
# Exercise 3: Regression

The goal of this exercise is to learn to evaluate a machine learning model using several regression metrics.

Preliminary:

- Import the California Housing data set and split it into a train set and a test set (10%). Fit a linear regression on the train set. *The goal is to focus on the metrics, which is why the code to fit the linear regression is given.*

```python
#imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
#data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
#split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=13)
#pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
#fit
pipe.fit(X_train, y_train)
```

1. Predict on the train set and test set

2. Compute R2, Mean Squared Error, and Mean Absolute Error on both the train set and the test set
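
As a reference for step 2, the three metrics can be sketched by hand with NumPy and compared against the scikit-learn functions. The arrays `y_true_toy` and `y_pred_toy` are made-up values for illustration only; they are not taken from the California Housing data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical toy values, not from the California Housing data set
y_true_toy = np.array([3.0, 1.5, 2.0, 4.5])
y_pred_toy = np.array([2.5, 1.0, 2.5, 4.0])

residuals = y_true_toy - y_pred_toy

# Mean Absolute Error: average absolute residual
mae_manual = np.mean(np.abs(residuals))

# Mean Squared Error: average squared residual
mse_manual = np.mean(residuals ** 2)

# R2: 1 - (residual sum of squares / total sum of squares around the mean of y_true)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn (true values first, predictions second)
print(np.isclose(mae_manual, mean_absolute_error(y_true_toy, y_pred_toy)))
print(np.isclose(mse_manual, mean_squared_error(y_true_toy, y_pred_toy)))
print(np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy)))
```

MAE and MSE are expressed in the units of the target (and its square), while R2 is unitless and compares the model against always predicting the mean of the true values.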


In [52]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
# split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=13)
# pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)

# fit
pipe.fit(X_train, y_train)

# 1.
y_pred_train = pipe.predict(X_train) 
y_pred_test = pipe.predict(X_test)

# testing
print(y_pred_train[:10])
print(y_pred_test[:10])

# 2.
# r2_score expects (y_true, y_pred); unlike MSE and MAE, it is not symmetric
r2_train = r2_score(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)

print('r2 on the train set:', r2_train)
print('MAE on the train set:', mae_train)
print('MSE on the train set:', mse_train)

r2_test = r2_score(y_test, y_pred_test)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)

print('----------------')
print('r2 on the test set:', r2_test)
print('MAE on the test set:', mae_test)
print('MSE on the test set:', mse_test)

# The model scores slightly better on the train set than on the test set. This is
# common: it is easier to do well on an exam you studied for than on one that differs
# from what you prepared. The results are still modest: R2 ~ 0.6, so about 40% of the
# variance in the target is left unexplained. Fitting a non-linear model such as a
# Random Forest on this data may improve the results. That is the goal of exercise 5.



[1.54505951 2.21338527 2.2636205  3.3258957  1.51710076 1.63209319
 2.9265211  0.78080924 1.21968217 0.72656239]
[ 1.82212706  1.98357668  0.80547979 -0.19259114  1.76072418  3.27855815
  2.12056804  1.96099917  2.38239663  1.21005304]
r2 on the train set: 0.6079874818809449
MAE on the train set: 0.5300159371615256
MSE on the train set: 0.5210784446797678
----------------
r2 on the test set: 0.5903435927516574
MAE on the test set: 0.5454023699809112
MSE on the test set: 0.5537420654727397
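
One detail worth checking when computing these metrics: `r2_score` takes the true values first and the predictions second, and unlike MSE and MAE it is not symmetric in its arguments. The sketch below illustrates this on synthetic data. The arrays `X_demo` and `y_demo`, the slope, and the noise level are all assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (assumption for illustration): a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=2.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_demo_pred = model.predict(X_demo)

# Correct order: true values first, predictions second
r2_correct = r2_score(y_demo, y_demo_pred)
# Swapped order: normalizes by the variance of the predictions instead
r2_swapped = r2_score(y_demo_pred, y_demo)

# MSE gives the same value in either order because (a - b)**2 == (b - a)**2
mse_a = mean_squared_error(y_demo, y_demo_pred)
mse_b = mean_squared_error(y_demo_pred, y_demo)

print('r2 (y_true, y_pred):', r2_correct)
print('r2 (y_pred, y_true):', r2_swapped)
print('MSE identical in both orders:', np.isclose(mse_a, mse_b))
```

For an ordinary least squares fit with an intercept, evaluated on its own training data, the swapped call works out to `2 - 1/R2`, which is always lower than the true R2 unless the fit is perfect.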
