# Exercise 15.15 — Linear Regression with the Diabetes Dataset (scikit-learn)

**Goal:** Recreate the same overall workflow from the Chapter 15.5 case study, but using the **Diabetes** dataset.

**What this notebook does:**
1. Load the Diabetes dataset from `sklearn.datasets`
2. Explore the data (shape, feature names, basic stats)
3. Visualize relationships and the target distribution
4. Build a baseline Linear Regression model
5. Evaluate with train/test split + metrics
6. Try a few model-selection ideas (Ridge/Lasso, Polynomial features) and compare results

> **Tip:** Rewrite the markdown explanations in your own words before submitting.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)


## 1) Load the Diabetes dataset
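The Diabetes dataset has 442 samples, 10 baseline variables (features), and a quantitative target (a measure of disease progression one year after baseline). As returned by `load_diabetes`, the features are already mean-centered and scaled.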


In [None]:
diabetes = load_diabetes(as_frame=True)

X = diabetes.data      # DataFrame
y = diabetes.target    # Series

print("X shape:", X.shape)
print("y shape:", y.shape)
X.head()


In [None]:
# Feature names
diabetes.feature_names


## 2) Quick data checks


In [None]:
X.info()


In [None]:
# Missing values (should be none)
X.isna().sum()


In [None]:
# Basic stats for features and target
display(X.describe())
display(y.describe())


## 3) Visualizations
We'll look at:
- Histogram of the target
- A few scatter plots of features vs target


In [None]:
plt.figure(figsize=(7,4))
plt.hist(y, bins=30)
plt.title("Target Distribution (Diabetes progression)")
plt.xlabel("Target")
plt.ylabel("Count")
plt.show()


In [None]:
# Pick a few features to visualize vs target
features_to_plot = ["bmi", "bp", "s5", "s1"]

for f in features_to_plot:
    plt.figure(figsize=(6,4))
    plt.scatter(X[f], y, s=10)
    plt.title(f"{f} vs Target")
    plt.xlabel(f)
    plt.ylabel("Target")
    plt.show()
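
The scatter plots cover only four of the ten features. As a rough numeric complement, the sketch below computes the Pearson correlation of every feature with the target and orders the features by absolute correlation. The names `X_d`, `y_d`, and `target_corr` are introduced here for illustration, and correlation only captures linear association.

```python
from sklearn.datasets import load_diabetes

# Reload the data so this cell runs on its own.
X_d, y_d = load_diabetes(return_X_y=True, as_frame=True)

# Pearson correlation between each feature column and the target.
target_corr = X_d.corrwith(y_d)

# Order features from strongest to weakest absolute correlation.
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr.reindex(order)

print(target_corr.round(3))
```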


## 4) Train/Test split
We'll hold out a test set to evaluate generalization.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Train:", X_train.shape, "Test:", X_test.shape)


## 5) Baseline model: Linear Regression
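We'll train a basic Linear Regression model and evaluate it on the test set using:
- MAE (mean absolute error)
- MSE / RMSE (mean squared error and its square root)
- R² (coefficient of determination)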


In [None]:
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

y_pred = lin_reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Linear Regression Performance (Test Set)")
print("MAE :", round(mae, 3))
print("MSE :", round(mse, 3))
print("RMSE:", round(rmse, 3))
print("R^2 :", round(r2, 3))
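
As a sanity check on what these metrics mean, the sketch below computes them directly from their definitions and compares the results with scikit-learn. The toy arrays `y_true_toy` and `y_pred_toy` are made up for illustration and are not taken from the Diabetes data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values for illustration only (not from the Diabetes dataset).
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true_toy - y_pred_toy

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # back in the units of the target

ss_res = np.sum(errors ** 2)                            # residual sum of squares
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot                       # coefficient of determination

# Each comparison should print True.
print(np.isclose(mae_manual, mean_absolute_error(y_true_toy, y_pred_toy)))
print(np.isclose(mse_manual, mean_squared_error(y_true_toy, y_pred_toy)))
print(np.isclose(r2_manual, r2_score(y_true_toy, y_pred_toy)))
```

RMSE is often the easiest of these to interpret because it is expressed in the same units as the target.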


### Predicted vs Actual plot
If the model predicts well, the points should fall close to the diagonal line where predicted y equals actual y.


In [None]:
plt.figure(figsize=(6,6))
plt.scatter(y_test, y_pred, s=12)
plt.title("Predicted vs Actual (Linear Regression)")
plt.xlabel("Actual y")
plt.ylabel("Predicted y")

# Reference line y = x: a perfect prediction would land exactly on it
min_y = min(y_test.min(), y_pred.min())
max_y = max(y_test.max(), y_pred.max())
plt.plot([min_y, max_y], [min_y, max_y])
plt.show()


## 6) Model selection ideas (following the case study)

### A) Ridge Regression (L2 regularization)
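We use a Pipeline with StandardScaler because the L2 penalty is applied to the coefficients, so its effect depends on the scale of each feature. The pipeline also fits the scaler only on the data passed to `fit`, which keeps test data out of the scaling step.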


In [None]:
ridge_model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0, random_state=42))
])

ridge_model.fit(X_train, y_train)
ridge_pred = ridge_model.predict(X_test)

ridge_mse = mean_squared_error(y_test, ridge_pred)
ridge_rmse = np.sqrt(ridge_mse)
ridge_mae = mean_absolute_error(y_test, ridge_pred)
ridge_r2 = r2_score(y_test, ridge_pred)

print("Ridge Regression Performance (Test Set)")
print("MAE :", round(ridge_mae, 3))
print("RMSE:", round(ridge_rmse, 3))
print("R^2 :", round(ridge_r2, 3))
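
The value `alpha=1.0` above is a single fixed choice. One possible way to pick the regularization strength from the data is `RidgeCV`, sketched below. The candidate grid `alpha_grid` and the names `X_d`, `y_d`, `X_tr`, `X_te`, `y_tr`, `y_te`, and `ridge_cv_model` are assumptions introduced for this example; the split mirrors the one used earlier.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reload and split the data so this cell runs on its own.
X_d, y_d = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Hypothetical candidate grid: 13 values from 1e-3 to 1e3 on a log scale.
alpha_grid = np.logspace(-3, 3, 13)

ridge_cv_model = Pipeline([
    ("scaler", StandardScaler()),
    # With cv left at its default, RidgeCV uses an efficient leave-one-out scheme.
    ("ridge", RidgeCV(alphas=alpha_grid)),
])
ridge_cv_model.fit(X_tr, y_tr)

best_alpha = ridge_cv_model.named_steps["ridge"].alpha_
ridge_cv_r2 = r2_score(y_te, ridge_cv_model.predict(X_te))

print("Selected alpha:", best_alpha)
print("Test R^2:", round(ridge_cv_r2, 3))
```

The alpha search uses only the training portion, so the test set still gives an untouched estimate of generalization.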


### B) Lasso Regression (L1 regularization)
Lasso can shrink some coefficients to exactly zero, which acts as a form of built-in feature selection.


In [None]:
lasso_model = Pipeline([
    ("scaler", StandardScaler()),
    ("lasso", Lasso(alpha=0.05, max_iter=10000, random_state=42))
])

lasso_model.fit(X_train, y_train)
lasso_pred = lasso_model.predict(X_test)

lasso_mse = mean_squared_error(y_test, lasso_pred)
lasso_rmse = np.sqrt(lasso_mse)
lasso_mae = mean_absolute_error(y_test, lasso_pred)
lasso_r2 = r2_score(y_test, lasso_pred)

print("Lasso Regression Performance (Test Set)")
print("MAE :", round(lasso_mae, 3))
print("RMSE:", round(lasso_rmse, 3))
print("R^2 :", round(lasso_r2, 3))
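
To see the sparsity effect directly, the sketch below refits the same kind of pipeline for a few values of `alpha` and counts how many coefficients remain non-zero. The list of alphas and the names `X_d`, `y_d`, `X_tr`, `y_tr`, `sparse_model`, and `nonzero_counts` are illustrative assumptions, not part of the original exercise.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reload and split the data so this cell runs on its own.
X_d, y_d = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

nonzero_counts = {}
for alpha in [0.05, 1.0, 10.0, 100.0]:  # hypothetical values to compare
    sparse_model = Pipeline([
        ("scaler", StandardScaler()),
        ("lasso", Lasso(alpha=alpha, max_iter=10000)),
    ])
    sparse_model.fit(X_tr, y_tr)

    # Count the coefficients that Lasso did not shrink to exactly zero.
    coefs = sparse_model.named_steps["lasso"].coef_
    nonzero_counts[alpha] = int(np.sum(coefs != 0))
    print(f"alpha={alpha:<6} non-zero coefficients: {nonzero_counts[alpha]} of {coefs.size}")
```

A large enough penalty removes every feature, leaving a model that predicts only the intercept.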


### C) Polynomial features + Linear Regression
Polynomial features can capture non-linear relationships and feature interactions, but the extra terms also make overfitting more likely.
We'll try degree=2 as a basic experiment.


In [None]:
poly2_model = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression())
])

poly2_model.fit(X_train, y_train)
poly2_pred = poly2_model.predict(X_test)

poly2_mse = mean_squared_error(y_test, poly2_pred)
poly2_rmse = np.sqrt(poly2_mse)
poly2_mae = mean_absolute_error(y_test, poly2_pred)
poly2_r2 = r2_score(y_test, poly2_pred)

print("Polynomial (degree=2) + Linear Regression Performance (Test Set)")
print("MAE :", round(poly2_mae, 3))
print("RMSE:", round(poly2_rmse, 3))
print("R^2 :", round(poly2_r2, 3))
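
One way to check for overfitting is to compare training and test scores. The sketch below counts the features produced by the degree-2 expansion and reports R² on both sets for a plain linear model and the polynomial pipeline. The names `X_d`, `y_d`, `X_tr`, `X_te`, `y_tr`, `y_te`, `linear_fit`, and `poly_fit` are introduced here for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Reload and split the data so this cell runs on its own.
X_d, y_d = load_diabetes(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)

# Degree 2 without bias: 10 linear + 10 squared + 45 pairwise interaction terms.
expander = PolynomialFeatures(degree=2, include_bias=False).fit(X_tr)
n_poly_features = expander.n_output_features_
print("Features after expansion:", n_poly_features)

linear_fit = LinearRegression().fit(X_tr, y_tr)
poly_fit = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("linreg", LinearRegression()),
]).fit(X_tr, y_tr)

# score() returns R^2 for regressors.
linear_train_r2 = linear_fit.score(X_tr, y_tr)
poly_train_r2 = poly_fit.score(X_tr, y_tr)
print("Linear train R^2:", round(linear_train_r2, 3),
      "test R^2:", round(linear_fit.score(X_te, y_te), 3))
print("Poly2  train R^2:", round(poly_train_r2, 3),
      "test R^2:", round(poly_fit.score(X_te, y_te), 3))
```

Because the polynomial model contains every term of the linear model, its training R² cannot be lower. Only the test score shows whether the extra terms generalize.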


## 7) Compare models (summary table)


In [None]:
results = pd.DataFrame({
    "Model": ["Linear Regression", "Ridge (alpha=1.0)", "Lasso (alpha=0.05)", "Poly2 + Linear"],
    "MAE":  [mae, ridge_mae, lasso_mae, poly2_mae],
    "RMSE": [rmse, ridge_rmse, lasso_rmse, poly2_rmse],
    "R2":   [r2, ridge_r2, lasso_r2, poly2_r2]
}).sort_values("RMSE")

results


## 8) Cross-validation (optional but good practice)
To avoid relying on a single train/test split, we can compare models using cross-validation.
We’ll use **R²** as the scoring metric here.


In [None]:
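def cv_r2(model, X, y, folds=5):
    """Return the mean and standard deviation of R² across CV folds."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="r2")
    return scores.mean(), scores.std()

# cross_val_score clones each estimator and refits it on every fold,
# so the pipelines fitted earlier are not reused as-is.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge (alpha=1.0)": ridge_model,
    "Lasso (alpha=0.05)": lasso_model,
    "Poly2 + Linear": poly2_model
}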

cv_rows = []
for name, model in models.items():
    mean_r2, std_r2 = cv_r2(model, X, y, folds=5)
    cv_rows.append((name, mean_r2, std_r2))

cv_df = pd.DataFrame(cv_rows, columns=["Model", "CV Mean R2", "CV Std R2"]).sort_values("CV Mean R2", ascending=False)
cv_df
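
Passing an integer as `cv` for a regressor uses `KFold` without shuffling, so the folds follow the row order of the data. A possible variation, sketched below, shuffles the rows with a fixed seed and reports RMSE alongside R². The names `X_d`, `y_d`, `shuffled_cv`, and `cv_out`, and the choice of seed, are assumptions made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Reload the data so this cell runs on its own.
X_d, y_d = load_diabetes(return_X_y=True, as_frame=True)

# Assumption: shuffled 5-fold splitting with a fixed seed for reproducibility.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_out = cross_validate(
    LinearRegression(),
    X_d,
    y_d,
    cv=shuffled_cv,
    scoring=["r2", "neg_root_mean_squared_error"],
)

fold_r2 = cv_out["test_r2"]
# scikit-learn negates error metrics so that higher is always better; flip the sign back.
fold_rmse = -cv_out["test_neg_root_mean_squared_error"]

print("R^2  per fold:", fold_r2.round(3), "mean:", round(fold_r2.mean(), 3))
print("RMSE per fold:", fold_rmse.round(2), "mean:", round(fold_rmse.mean(), 2))
```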


## 9) Conclusion (rewrite in your own words)

In this notebook, I loaded the Diabetes dataset from scikit-learn, explored the features and target,
and built a baseline Linear Regression model. I evaluated performance using standard regression metrics and a predicted vs actual plot.
Then I tested model-selection ideas (Ridge, Lasso, and polynomial features) to see if they reduced the test error.
Finally, I used cross-validation to compare models more reliably across multiple splits.
