Helpful links:

https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm

https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html

In [17]:
#@title Imports & setup (run this first)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (mean_squared_error, r2_score, accuracy_score,
                             precision_score, recall_score, f1_score,
                             classification_report, confusion_matrix, ConfusionMatrixDisplay)
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.datasets import load_digits, fetch_california_housing

---
# Exercise 1 (Regression – House Prices)

**Dataset:** `sklearn.datasets.fetch_california_housing` (Are all features useful?)

![](https://user-images.githubusercontent.com/48794028/148332938-4e66d4ca-2d16-474f-8482-340aef6a48d0.png)

Tasks:
- Split data into train/test.
- Train a `DecisionTreeRegressor`, `RandomForestRegressor`, and `GradientBoostingRegressor`.
- Compare models using `RMSE`.
- Plot predicted vs. true values for test set.


**Attribute Information:**
- `MedInc`        median income in block group
- `HouseAge`      median house age in block group
- `AveRooms`      average number of rooms per household
- `AveBedrms`     average number of bedrooms per household
- `Population`    block group population
- `AveOccup`      average number of household members
- `Latitude`      block group latitude
- `Longitude`     block group longitude

In [None]:
# TODO: Load dataset
# df = fetch_california_housing(as_frame=True)
data = fetch_california_housing(as_frame=True)
# print(df.DESCR)
print(data.DESCR)

# # Read the DataFrame, first using the feature data
# df = pd.DataFrame(data.data, columns=data.feature_names)

# # Add a target column, and fill it with the target data
# df['target'] = data.target
# # Show the first five rows
# df.head()

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per ce

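|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|--------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |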


In [None]:
# Prepare features and target

X = data.data  # features (DataFrame with 8 columns)
y = data.target  # MedHouseVal

# X = df[['Longitude', 'Latitude']]
# y = df['target']

# columns_drop = ["Longitude", "Latitude"]
# X = df.drop(columns_drop, axis=1)
# y = df[columns_drop]

# TODO: Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape


print("Features:", list(X.columns))
print("Shape:", X.shape, "| Target shape:", y.shape)
X.head()


Features: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Shape: (20640, 8) | Target shape: (20640,)


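|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |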


In [None]:
# TODO: Train DT, RF, GB

from math import sqrt
from xgboost import XGBRegressor

# Models
models = {
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42)
}

# DT
dt_base = DecisionTreeRegressor(random_state=42)
dt_base.fit(X_train, y_train)

y_te_pred = dt_base.predict(X_test)

print("Baseline Decision Tree Regressor")
print("-"*32)
print("Test MSE: %.3f" % ( mean_squared_error(y_test, y_te_pred)))


# RF


# GB



Baseline Decision Tree Regressor
--------------------------------
Test MSE: 0.495
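
The `# RF` and `# GB` steps are left open in the cell above. One possible way to fit all three regressors and compare them by RMSE, as the task asks, is sketched below. The helper name `compare_regressors` is hypothetical (not part of the notebook), the hyperparameters are copied from the `models` dict, and XGBoost is left out on the assumption that it may not be installed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(X_train, X_test, y_train, y_test, seed=42):
    """Fit DT, RF and GB regressors and return {model name: test RMSE}.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    models = {
        "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=seed),
        "RandomForest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "GradientBoosting": GradientBoostingRegressor(n_estimators=100, random_state=seed),
    }
    rmse = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        # RMSE is the square root of the MSE, so it is in the target's units.
        rmse[name] = float(np.sqrt(mean_squared_error(y_test, y_pred)))
    return rmse
```

With the split from the earlier cell, `compare_regressors(X_train, X_test, y_train, y_test)` should return one RMSE per model, in units of $100,000.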


In [None]:
# Plot predictions

# DT
# Visualize the learned tree
plt.figure(figsize=(12, 8))
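# feature_names must list all 8 training features; max_depth=3 only limits what is drawn
plot_tree(dt_base, feature_names=list(X_train.columns), max_depth=3, filled=True)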
plt.title("Baseline Decision Tree Structure")
plt.show()
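
The task list also asks for a plot of predicted vs. true values on the test set. A minimal sketch follows; `plot_pred_vs_true` is a hypothetical helper name, and the marker size and transparency are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_pred_vs_true(y_true, y_pred, title="Predicted vs. true"):
    """Scatter predictions against true values with a y = x reference line.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.scatter(y_true, y_pred, s=5, alpha=0.3)
    # Points on the dashed diagonal are predicted exactly.
    lo = min(y_true.min(), y_pred.min())
    hi = max(y_true.max(), y_pred.max())
    ax.plot([lo, hi], [lo, hi], "r--", label="perfect prediction")
    ax.set_xlabel("True value")
    ax.set_ylabel("Predicted value")
    ax.set_title(title)
    ax.legend()
    return ax
```

For the baseline tree this could be called as `plot_pred_vs_true(y_test, y_te_pred, "Decision tree")`, followed by `plt.show()`.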



In [None]:
# unlock seed and do repetitions

NUM_REPS = 10
SEEDS = list(range(NUM_REPS))

res_dict = {"Model": [], "Seed": [], "RMSE": []}

# ...

# Models
models = {
    "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42)
}

# ...

# show results
res_df = pd.DataFrame(res_dict)
print(res_df.groupby("Model").agg({"RMSE": ["mean", "std"]}))
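
The parts marked `# ...` are left open above. One way the repetition loop could look is sketched below, assuming that "unlocking the seed" means using each seed both for the train/test split and for the models' `random_state`. `repeated_rmse` is a hypothetical helper, and XGBoost is omitted on the assumption that it may not be installed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def repeated_rmse(X, y, seeds, n_estimators=100):
    """Re-split and refit once per seed; return one row per (model, seed).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    rows = []
    for seed in seeds:
        # Assumption: the seed controls both the split and the models.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed
        )
        models = {
            "DecisionTree": DecisionTreeRegressor(max_depth=5, random_state=seed),
            "RandomForest": RandomForestRegressor(
                n_estimators=n_estimators, random_state=seed
            ),
            "GradientBoosting": GradientBoostingRegressor(
                n_estimators=n_estimators, random_state=seed
            ),
        }
        for name, model in models.items():
            model.fit(X_tr, y_tr)
            rmse = float(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
            rows.append({"Model": name, "Seed": seed, "RMSE": rmse})
    return pd.DataFrame(rows)
```

The returned frame has the same columns as `res_dict`, so `repeated_rmse(X, y, SEEDS).groupby("Model").agg({"RMSE": ["mean", "std"]})` gives the mean and spread per model.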

---
# Exercise 2 (Regression – California Housing Extended)

Same dataset.

**Tasks:**
- Train models with different `max_depth` and `n_estimators`.
- Use cross-validation to compare performance.
- Visualize feature importance for Random Forest & Gradient Boosting.

In [None]:
# TODO: Try different hyperparameters (max_depth, n_estimators)

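# Define parameter grids for GridSearchCV
param_grid_rf = {
    'max_depth': [3, 5, 10, 15],
    'n_estimators': [50, 100, 200]
}

param_grid_gb = {
    'max_depth': [3, 5, 10, 15],
    'n_estimators': [50, 100, 200]
}

# GridSearchCV for Random Forest
# NOTE: cv=5, the RMSE-based scoring and n_jobs=-1 are assumed settings; adjust as needed.
grid_search_rf = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid_rf,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
grid_search_rf.fit(X_train, y_train)

# best_score_ is the negated RMSE here (closer to 0 is better)
print("Best parameters for random forest:", grid_search_rf.best_params_)
print("Best score for random forest:", grid_search_rf.best_score_)

# GridSearchCV for Gradient Boosting (same assumed settings)
grid_search_gb = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid_gb,
    cv=5,
    scoring="neg_root_mean_squared_error",
    n_jobs=-1,
)
grid_search_gb.fit(X_train, y_train)

print("\nBest parameters for gradient boosting:", grid_search_gb.best_params_)
print("Best score for gradient boosting:", grid_search_gb.best_score_)

# Refitted models with the best hyperparameters found
best_rf_model = grid_search_rf.best_estimator_
best_gb_model = grid_search_gb.best_estimator_

print("Best RF", best_rf_model)
print("Best GB", best_gb_model)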

In [None]:
# TODO: Use cross_val_score for evaluation
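
A minimal sketch of how `cross_val_score` could be used here, assuming RMSE is the metric of interest as in Exercise 1. scikit-learn scorers follow a "higher is better" convention, so the RMSE scorer returns negated values and the sign is flipped back. `cv_rmse` is a hypothetical helper name.

```python
from sklearn.model_selection import cross_val_score


def cv_rmse(model, X, y, cv=5):
    """Return (mean, std) of the cross-validated RMSE of a regressor.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse = -scores  # undo the sign flip applied by the scorer
    return float(rmse.mean()), float(rmse.std())
```

For example, `cv_rmse(best_rf_model, X_train, y_train)` and `cv_rmse(best_gb_model, X_train, y_train)` would put both tuned models on the same footing.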


In [None]:
# TODO: Plot feature importances
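
Fitted random forest and gradient boosting models both expose a `feature_importances_` attribute (impurity-based importances). A minimal plotting sketch follows; `plot_feature_importances` is a hypothetical helper, and the horizontal bar layout is just one reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_importances(model, feature_names, title="Feature importances"):
    """Draw a horizontal bar chart of a fitted model's feature importances.

    Hypothetical helper for illustration; returns the importances sorted
    in ascending order so the most important feature is drawn at the top.
    """
    importances = pd.Series(
        model.feature_importances_, index=list(feature_names)
    ).sort_values()
    fig, ax = plt.subplots(figsize=(6, 4))
    importances.plot.barh(ax=ax)
    ax.set_xlabel("Impurity-based importance")
    ax.set_title(title)
    return importances
```

A call such as `plot_feature_importances(best_rf_model, X_train.columns, "Random Forest")`, repeated for the gradient boosting model, would allow a side-by-side comparison.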

---
# Exercise 3 (Classification – Digits Dataset)

**Dataset:** `load_digits()`

**Tasks:**
- Multi-class classification (0–9).
- Train DT, RF, GB.
- Evaluate with cross-validation accuracy.
- Print classification report.
- Plot confusion matrix.
- Visualize 10 misclassified digits → discuss why.

In [None]:
# TODO: Load digits dataset
digits = load_digits(as_frame=True)
print(digits.DESCR)

X, y = digits.data, digits.target

X.shape, y.shape



.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 1797
:Number of Attributes: 64
:Attribute Information: 8x8 image of integer pixels in the range 0..16.
:Missing Attribute Values: None
:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
:Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an in

In [None]:
# TODO: Train/test split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.shape, X_test.shape

Features: ['pixel_0_0', 'pixel_0_1', 'pixel_0_2', 'pixel_0_3', 'pixel_0_4', 'pixel_0_5', 'pixel_0_6', 'pixel_0_7', 'pixel_1_0', 'pixel_1_1', 'pixel_1_2', 'pixel_1_3', 'pixel_1_4', 'pixel_1_5', 'pixel_1_6', 'pixel_1_7', 'pixel_2_0', 'pixel_2_1', 'pixel_2_2', 'pixel_2_3', 'pixel_2_4', 'pixel_2_5', 'pixel_2_6', 'pixel_2_7', 'pixel_3_0', 'pixel_3_1', 'pixel_3_2', 'pixel_3_3', 'pixel_3_4', 'pixel_3_5', 'pixel_3_6', 'pixel_3_7', 'pixel_4_0', 'pixel_4_1', 'pixel_4_2', 'pixel_4_3', 'pixel_4_4', 'pixel_4_5', 'pixel_4_6', 'pixel_4_7', 'pixel_5_0', 'pixel_5_1', 'pixel_5_2', 'pixel_5_3', 'pixel_5_4', 'pixel_5_5', 'pixel_5_6', 'pixel_5_7', 'pixel_6_0', 'pixel_6_1', 'pixel_6_2', 'pixel_6_3', 'pixel_6_4', 'pixel_6_5', 'pixel_6_6', 'pixel_6_7', 'pixel_7_0', 'pixel_7_1', 'pixel_7_2', 'pixel_7_3', 'pixel_7_4', 'pixel_7_5', 'pixel_7_6', 'pixel_7_7']
Shape: (1797, 64) | Target shape: (1797,)


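|   | pixel_0_0 | pixel_0_1 | pixel_0_2 | pixel_0_3 | pixel_0_4 | pixel_0_5 | pixel_0_6 | pixel_0_7 | pixel_1_0 | pixel_1_1 | ... | pixel_6_6 | pixel_6_7 | pixel_7_0 | pixel_7_1 | pixel_7_2 | pixel_7_3 | pixel_7_4 | pixel_7_5 | pixel_7_6 | pixel_7_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 5.0 | 13.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 13.0 | 10.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 12.0 | 13.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 16.0 | 10.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 4.0 | 15.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 11.0 | 16.0 | 9.0 | 0.0 |
| 3 | 0.0 | 0.0 | 7.0 | 15.0 | 13.0 | 1.0 | 0.0 | 0.0 | 0.0 | 8.0 | ... | 9.0 | 0.0 | 0.0 | 0.0 | 7.0 | 13.0 | 13.0 | 9.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 11.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 16.0 | 4.0 | 0.0 | 0.0 |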


In [None]:
# TODO: Train models

# Models (classifiers, since the digits target is a class label)
models = {
    "DecisionTree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    # 5-fold cross-validated accuracy on the full dataset
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")


In [None]:
# TODO: Evaluate with CV + classification report
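
A possible shape for this step is sketched below: cross-validated accuracy on the training split, followed by a classification report on the held-out test split. `evaluate_classifier` is a hypothetical helper, and running the cross-validation on the training split only (rather than on all of `X`) is an assumption made here to keep the test set untouched.

```python
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score


def evaluate_classifier(model, X_train, X_test, y_train, y_test, cv=5):
    """Return (CV accuracy scores, classification report text).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Cross-validation refits clones of the model, so `model` itself is
    # still unfitted after this call.
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    model.fit(X_train, y_train)
    report = classification_report(y_test, model.predict(X_test))
    return cv_scores, report
```

Calling it once per classifier and printing `cv_scores.mean()` together with `report` covers both evaluation tasks from the list above.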


In [None]:
# TODO: Plot confusion matrix

ConfusionMatrixDisplay

In [None]:
# TODO: Visualize misclassified examples
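
One way to look at the errors is to reshape each misclassified 64-pixel row back into its 8x8 image and label it with the true and predicted digit. The sketch below assumes the inputs are the test features, test labels and predictions in matching order; `plot_misclassified` is a hypothetical helper name.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_misclassified(X, y_true, y_pred, n=10, n_cols=5):
    """Show up to n misclassified 8x8 digits; return their positions.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    X = np.asarray(X)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Positions (not DataFrame index labels) where the prediction is wrong.
    wrong = np.flatnonzero(y_true != y_pred)[:n]
    n_rows = max(1, -(-len(wrong) // n_cols))  # ceiling division
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(2 * n_cols, 2 * n_rows), squeeze=False
    )
    for ax in axes.ravel():
        ax.axis("off")
    for ax, idx in zip(axes.ravel(), wrong):
        ax.imshow(X[idx].reshape(8, 8), cmap="gray_r")
        ax.set_title(f"true {y_true[idx]}, pred {y_pred[idx]}")
    return wrong
```

For a fitted classifier `clf`, `plot_misclassified(X_test, y_test, clf.predict(X_test))` followed by `plt.show()` displays the first ten errors for discussion.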