<a href="https://colab.research.google.com/github/pontofio/Cours/blob/main/Copie_de_Notebook_1_Regression_Lineaire_Etudiant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🏠 Linear Regression – California Housing
👨‍🏫 Instructor: Dr. Khalil HADDAOUI

**Concrete problem**: estimate the **median house price** of a district from local characteristics (median income, house age, population, latitude/longitude, etc.).  
**Objectives**: a clean pipeline, a baseline, improvements (polynomial features, Ridge/Lasso), and evaluation (RMSE/MAE/R²).


## 📐 Multiple linear model
$$
\hat y = \beta_0 + \sum_{j=1}^p \beta_j x_j
$$
The coefficients are estimated by least squares, which minimizes the sum of squared residuals.  
**Regularization**: Ridge (L2) and Lasso (L1) add a penalty on the coefficients that reduces variance and overfitting.

- Ridge: $\min_\beta \|y-X\beta\|_2^2 + \alpha \|\beta\|_2^2$
- Lasso: $\min_\beta \|y-X\beta\|_2^2 + \alpha \|\beta\|_1$

**Why?** Noisy or correlated data → unstable coefficients → regularization stabilizes them and the model generalizes better.
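
To make the stabilizing effect concrete, here is a minimal sketch using the closed-form Ridge solution $\hat\beta = (X^\top X + \alpha I)^{-1} X^\top y$ on synthetic data. The helper `ridge_closed_form`, the arrays `X_demo` / `y_demo`, and the choice of two nearly identical features are illustrative assumptions, not part of the notebook; the intercept is omitted for brevity.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min ||y - X b||^2 + alpha * ||b||^2 (no intercept, for brevity)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data (assumption): x2 is almost a copy of x1, so the two
# columns are strongly correlated and least squares becomes unstable.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 1e-3 * rng.normal(size=200)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=200)

beta_ols = ridge_closed_form(X_demo, y_demo, alpha=0.0)    # plain least squares
beta_ridge = ridge_closed_form(X_demo, y_demo, alpha=1.0)  # L2-penalized

print("OLS coefficients  :", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With correlated columns, least squares can assign large coefficients of opposite sign to the two features, while the penalized solution tends to split the effect between them and keeps the coefficient norm smaller.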


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import numpy as np, matplotlib.pyplot as plt

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X.head(), y.head()


(   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
 0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
 1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
 2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
 3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
 4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   
 
    Longitude  
 0    -122.23  
 1    -122.22  
 2    -122.24  
 3    -122.25  
 4    -122.25  ,
 0    4.526
 1    3.585
 2    3.521
 3    3.413
 4    3.422
 Name: MedHouseVal, dtype: float64)

In [None]:
# Baseline: predict the training-set mean for every test sample
import numpy as np
yhat_mean = np.full_like(y_test, y_train.mean())
rmse = np.sqrt(mean_squared_error(y_test, yhat_mean))  # RMSE is the square root of the MSE
mae = mean_absolute_error(y_test, yhat_mean)
r2  = r2_score(y_test, yhat_mean)
print(f"Baseline -> RMSE: {rmse:.4f} | MAE: {mae:.4f} | R2: {r2:.4f}")


Baseline -> RMSE: 1.1449 | MAE: 0.9061 | R2: -0.0002
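
As a reminder of what the three metrics measure, the sketch below computes them by hand with NumPy. The function `regression_metrics` and the toy array `y_toy` are hypothetical names introduced for illustration; the notebook itself relies on `sklearn.metrics`.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))         # root of the mean squared error
    mae = np.mean(np.abs(residuals))                # mean absolute error
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mae": mae, "r2": r2}

# Toy example (assumption): predict the mean of the very same values.
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
metrics_toy = regression_metrics(y_toy, np.full_like(y_toy, y_toy.mean()))
print(metrics_toy)
```

Predicting the mean of the evaluated values gives R² = 0 by construction. The baseline cell predicts the *training* mean on the *test* set, which is why its R² comes out slightly below zero.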


In [None]:
# TODO: build a pipeline: standardization + linear regression
pipe_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression())
])

# 5-fold cross-validated RMSE on the training set (the scorer is negated, hence the minus sign)
cv_rmse = (-cross_val_score(pipe_lr, X_train, y_train, scoring="neg_root_mean_squared_error", cv=5)).mean()
print(f"CV RMSE (LinearRegression): {cv_rmse:.4f}")

# TODO: fit and evaluate on the test set (RMSE, MAE, R2)
pipe_lr.fit(X_train, y_train)
yhat = pipe_lr.predict(X_test)
print(f"Test RMSE: {np.sqrt(mean_squared_error(y_test, yhat)):.4f}")  # square root of the MSE
print(f"Test MAE : {mean_absolute_error(y_test, yhat):.4f}")
print(f"Test R2  : {r2_score(y_test, yhat):.4f}")


CV RMSE (LinearRegression): 0.7205
Test RMSE: 0.7456
Test MAE : 0.5332
Test R2  : 0.5758


In [None]:
# TODO: add polynomial features (degree=2) and compare
pipe_poly = Pipeline([
    ("scaler", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(alpha=1.0))
])
cv_rmse_poly = (-cross_val_score(pipe_poly, X_train, y_train, scoring="neg_root_mean_squared_error", cv=3)).mean()
print("CV RMSE (Poly+Ridge):", cv_rmse_poly)

pipe_poly.fit(X_train, y_train)
yhat_poly = pipe_poly.predict(X_test)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, yhat_poly)))
print("Test MAE :", mean_absolute_error(y_test, yhat_poly))
print("Test R2  :", r2_score(y_test, yhat_poly))


CV RMSE (Poly+Ridge): 3.125718216312988
Test RMSE: 0.68021523026453
Test MAE : 0.4670525215371786
Test R2  : 0.6469096540341559


In [None]:
# TODO: grid search over Ridge's alpha.
# alpha (for Ridge or Lasso) is a hyperparameter; GridSearchCV finds its best value by cross-validation.
# Note: run the imports cell first; in a fresh kernel this cell fails with
# "NameError: name 'Pipeline' is not defined".
param_grid = {"model__alpha": [0.1, 1.0, 10.0]}
pipe_ridge = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
gs = GridSearchCV(pipe_ridge, param_grid, scoring="neg_root_mean_squared_error", cv=5, n_jobs=-1)
gs.fit(X_train, y_train)

print("Best alpha:", gs.best_params_["model__alpha"])
best = gs.best_estimator_  # pipeline refit on the full training set with the best alpha
yhat_best = best.predict(X_test)

print("Test RMSE:", np.sqrt(mean_squared_error(y_test, yhat_best)))
print("Test MAE :", mean_absolute_error(y_test, yhat_best))
print("Test R2  :", r2_score(y_test, yhat_best))
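
For intuition, a grid search with cross-validation can be sketched by hand as below. This is a simplified illustration under stated assumptions, not what `GridSearchCV` does internally: it uses a closed-form Ridge without intercept, contiguous unshuffled folds, and synthetic data. `ridge_fit`, `cv_rmse_for_alpha`, `X_syn` and `y_syn` are hypothetical names.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form Ridge without intercept (data assumed roughly centered)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def cv_rmse_for_alpha(X, y, alpha, n_folds=5):
    """Average validation RMSE over contiguous folds for one alpha."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    scores = []
    for val_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[val_idx] = False                 # hold out the current fold
        beta = ridge_fit(X[train_mask], y[train_mask], alpha)
        residuals = y[val_idx] - X[val_idx] @ beta
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))

# Synthetic regression problem (assumption, not the California data).
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(150, 4))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.3, size=150)

grid = [0.1, 1.0, 10.0]                             # same candidates as param_grid
results = {alpha: cv_rmse_for_alpha(X_syn, y_syn, alpha) for alpha in grid}
best_alpha = min(results, key=results.get)          # lowest average RMSE wins
print(results, "-> best alpha:", best_alpha)
```

Each candidate is scored only on folds it was not fitted on, so the selected alpha reflects out-of-sample error rather than training fit.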

### ✅ Key takeaways
- Always start with a baseline.
- Use pipelines to avoid data leakage.
- Regularize when features are correlated or the model overfits.
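
The leakage point can be illustrated with a small sketch: scaling statistics are computed on the training rows only and then reused on held-out rows, which is roughly what a `Pipeline` arranges inside each cross-validation fold. The helpers `standardize_fit` and `standardize_apply` and the synthetic array `X_all` are assumptions made for this example.

```python
import numpy as np

def standardize_fit(X):
    """Learn per-column mean and standard deviation from the given rows."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0          # guard against constant columns
    return mean, std

def standardize_apply(X, mean, std):
    """Scale rows with statistics learned elsewhere."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_all = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_tr, X_va = X_all[:80], X_all[80:]

mean_tr, std_tr = standardize_fit(X_tr)             # training rows only
X_tr_s = standardize_apply(X_tr, mean_tr, std_tr)
X_va_s = standardize_apply(X_va, mean_tr, std_tr)   # reuse training statistics

print("train means after scaling     :", X_tr_s.mean(axis=0))
print("validation means after scaling:", X_va_s.mean(axis=0))
```

Fitting the scaler on all rows before splitting would let information from the held-out rows influence the transformation, which tends to make validation scores look better than they should.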
