# California Housing ML Project

## Objective
This project demonstrates a complete **end-to-end regression pipeline** using the **California Housing dataset**. It predicts the median house value of each census block group (expressed in units of $100,000) from socio-economic and geographic features.

---

## Why This Project?

- **Real-world Data**: Derived from the 1990 U.S. census, with one row per California census block group.
- **Covers Full ML Lifecycle**: From EDA to preprocessing, training, evaluation, and explainability.
- **Multi-model Comparison**: Linear, Ridge, and Polynomial Ridge Regression.
- **Model Tuning**: Hyperparameter optimization using `GridSearchCV`.
- **Explainable AI**: Visualizations, residual analysis, and permutation importance.
- **Generative AI Integration**: Summarizes results using a GPT-style model.

---

## What This Project Helps You Learn

- How to explore and visualize real-world tabular data.
- How to prepare data for regression tasks using scaling and feature engineering.
- Differences between simple linear models and polynomial transformations.
- How to compare models with evaluation metrics like MSE and R².
- How to interpret residuals and feature importances.
- How to integrate a text generation model (like GPT-2) to produce natural language summaries of ML outcomes.
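The two metrics named in the list above can be computed by hand to see what they measure. The following is a minimal sketch using NumPy only; the helper names `mse` and `r2` and the toy values are illustrative assumptions, not part of the project code, which uses `mean_squared_error` and `r2_score` from scikit-learn.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return float(1.0 - ss_res / ss_tot)


# Toy values chosen for easy arithmetic (not project data).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print("MSE:", mse(y_true, y_pred))  # residuals -0.5, 0, 0.5, 0 -> 0.125
print("R2:", r2(y_true, y_pred))    # 1 - 0.5 / 5.0 -> 0.9
```

A lower MSE means smaller average squared error, while R² reports the share of target variance the model explains, so the two are usually read together.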

---

## Techniques & Tools Used

| Category              | Tools / Libraries                              |
|-----------------------|------------------------------------------------|
| Data Loading          | `sklearn.datasets.fetch_california_housing`   |
| Visualization         | `matplotlib`, `seaborn`                        |
| Preprocessing         | `StandardScaler`, `PolynomialFeatures`        |
| Models                | `LinearRegression`, `Ridge`                    |
| Tuning                | `GridSearchCV`                                 |
| Evaluation            | `mean_squared_error`, `r2_score`              |
| Interpretability      | `permutation_importance`, residual plots       |
| Generative Summary    | `transformers.pipeline("text-generation")`     |
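As a rough illustration of how the preprocessing, model, and tuning tools in the table fit together, the sketch below runs `GridSearchCV` over a scaled polynomial ridge pipeline. It uses synthetic data from `make_regression` so that it runs without a download; the step names, the parameter grid, and the data are assumptions for illustration and may differ from the grid used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption), not the California Housing dataset.
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a pipeline step.
param_grid = {
    "poly__degree": [1, 2],
    "ridge__alpha": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out R2:", search.score(X_test, y_test))
```

Keeping the scaler inside the pipeline is a design choice worth noting: the search then refits every preprocessing step on the training portion of each fold, which avoids leaking validation data into the scaling statistics.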

---

## Results

- **Best Model**: Polynomial Ridge Regression  
- **R² Score**: ~0.65  
- **Top Features**: MedInc, Latitude, Longitude, AveRooms, AveBedrms
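A feature ranking like the one above can be obtained with `permutation_importance`, which shuffles one column at a time and measures how much the score drops. The sketch below is a minimal example on synthetic data with hypothetical feature names (`feat_0` to `feat_4`); it is not the notebook's exact cell and does not reproduce the ranking reported here.

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption). With shuffle=False the informative
# features are the first n_informative columns, i.e. feat_0 and feat_1.
X, y = make_regression(
    n_samples=400, n_features=5, n_informative=2,
    noise=5.0, shuffle=False, random_state=0,
)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = Ridge(alpha=1.0).fit(X_train, y_train)

# Importance is measured on held-out data: the mean drop in R2 over 10 shuffles.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Permutation importance is model-agnostic, but correlated features can share or mask each other's importance, which matters for pairs such as AveRooms and AveBedrms.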

---

## Bonus: AI Summary

Using GPT-2, we generated a natural language summary of the experiment. This simulates automated ML reporting using generative models.
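One way such a summary step might look is sketched below. The helper names `build_prompt` and `summarize`, the prompt wording, and the placeholder scores are assumptions for illustration, not the notebook's actual cell. The `transformers` import is deferred so that the prompt builder runs on its own; calling `summarize` downloads the `gpt2` weights on first use.

```python
def build_prompt(results):
    """Turn a list of {"Model", "MSE", "R2"} dicts into a plain-text prompt."""
    lines = [
        f"{r['Model']}: MSE={r['MSE']:.3f}, R2={r['R2']:.3f}" for r in results
    ]
    best = max(results, key=lambda r: r["R2"])
    return (
        "Regression results:\n"
        + "\n".join(lines)
        + f"\nThe best model was {best['Model']} because"
    )


def summarize(results, max_new_tokens=60):
    """Let GPT-2 continue the prompt (requires the transformers package)."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    output = generator(
        build_prompt(results), max_new_tokens=max_new_tokens, do_sample=False
    )
    return output[0]["generated_text"]


# Placeholder scores for illustration only; these are not measured results.
example_results = [
    {"Model": "ModelA", "MSE": 0.5, "R2": 0.6},
    {"Model": "ModelB", "MSE": 0.4, "R2": 0.7},
]
prompt = build_prompt(example_results)
print(prompt)
```

Because GPT-2 only continues text, ending the prompt mid-sentence steers the continuation toward an explanation of the listed scores.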

---

## Future Work

- Integrate SHAP for deeper model explanations.
- Try other regression models (SVR, XGBoost, etc.).
- Adapt the pipeline to a classification task on a different dataset.
- Deploy as an interactive app using Streamlit or Gradio.

---

In [1]:
!pip install -q transformers datasets matplotlib seaborn scikit-learn

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.inspection import permutation_importance
from transformers import pipeline



In [3]:
data = fetch_california_housing()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

In [None]:
print("Dataset Preview:")
display(df.head())
print("\nCorrelation Heatmap:")
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()

In [None]:
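X = df.drop("target", axis=1)
y = df["target"]
# Split first so the scaler never sees test rows (avoids data leakage).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit scaling statistics on training data only
X_test = scaler.transform(X_test)        # reuse the training statistics on the test set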

In [None]:
models = {
    "Linear": Pipeline([("lr", LinearRegression())]),
    "Ridge": Pipeline([("ridge", Ridge(alpha=1.0))]),
    "PolyRidge": Pipeline([
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("ridge", Ridge(alpha=1.0))
    ])
}

results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results.append({
        "Model": name,
        "MSE": mean_squared_error(y_test, y_pred),
        "R2": r2_score(y_test, y_pred)
    })

In [None]:
# === RESULTS TABLE ===
results_df = pd.DataFrame(results)
print("\nModel Comparison Table:")
display(results_df.sort_values("R2", ascending=False))