# <i class="fa-solid fa-dumbbell"></i> Exercises

Please fill in the missing code pieces indicated by `...`. The imports are always provided at the top of each code chunk, which should give you a hint about which functions and classes to use. Have a look at the online documentation if you are unsure how to use them.

## Exercise 1: Model Selection

Today we are working with the **California Housing** dataset, which you are already familiar with from our earlier exploration of resampling methods.
This dataset is based on the 1990 U.S. Census and includes features describing California districts.

1) Familiarize yourself with the data
    - What kind of features are in the dataset? What is the target?

In [None]:
from sklearn.datasets import fetch_california_housing

# TODO: Get data
...

# TODO: Extract features and target
...

2) Baseline model 
    - Create a baseline linear regression model using **all** features and evaluate it with 5-fold cross-validation, using R² as the performance metric
    - Print the individual and average R²
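
As a hint for the mechanics, here is a minimal sketch of 5-fold cross-validation with R² scoring. It uses synthetic data from `make_regression` as a stand-in (an assumption made for illustration), so the names `X_demo`, `y_demo`, and `r2_scores` are not part of the exercise and the numbers say nothing about the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (illustrative assumption, not the housing data)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# cross_val_score refits the model on each of the 5 training folds
# and returns one R² value per held-out fold
r2_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="r2"
)

print("R² per fold:", np.round(r2_scores, 3))
print(f"Mean R²    : {r2_scores.mean():.3f}")
```

The same pattern should carry over to the real features and target once they are extracted.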

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

# TODO: Fit and evaluate regression model
...

3) Apply a forward stepwise selection to find a simpler suitable model.
    - Split the data into 80% training data and 20% testing data (print the shapes to confirm it was successful)
    - Perform a forward stepwise selection with a linear regression model, 5-fold CV, R² score, and `parsimonious` feature selection (refer to [documentation](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/) for further information)
    - Print the best CV R² as well as the chosen features
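
To make the idea behind forward stepwise selection concrete, the sketch below implements the greedy loop by hand with `cross_val_score` instead of using `mlxtend`. It is a simplified illustration on synthetic data (an assumption), not the library's implementation: at each step it adds the feature that gives the highest mean CV R². The `parsimonious` option in `mlxtend` additionally prefers the smallest subset whose score is within one standard error of the best; that rule is not implemented here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative assumption): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=300)

selected = []                             # indices chosen so far, in selection order
remaining = list(range(X_demo.shape[1]))  # indices not yet chosen
history = []                              # (feature subset, mean CV R²) after each step

while remaining:
    # Score every candidate that could be added to the current subset
    candidate_scores = {
        j: cross_val_score(
            LinearRegression(), X_demo[:, selected + [j]], y_demo,
            cv=5, scoring="r2",
        ).mean()
        for j in remaining
    }
    best = max(candidate_scores, key=candidate_scores.get)
    selected.append(best)
    remaining.remove(best)
    history.append((tuple(selected), candidate_scores[best]))

for subset, score in history:
    print(f"{len(subset)} feature(s) {subset}: mean CV R² = {score:.3f}")
```

In this toy setup the score should stop improving noticeably once the two informative columns are in the model, which is the kind of plateau a simpler model can exploit.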

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

# TODO: Split data into training and test sets
...

# TODO: Perform forward sequential feature selection
sfs_forward = ...

# Summary of results
print(">> Forward SFS:")
print(f"   Best CV R²      : {sfs_forward.k_score_:.3f}")
print(f"   Optimal # feats : {len(sfs_forward.k_feature_idx_)}")
print(f"   Feature names   : {sfs_forward.k_feature_names_}")

4) Evaluate the model on the test set

In [None]:
# TODO: Get the selected features as a list
...

# TODO: Train and evaluate the model
...

## Exercise 2: LASSO

Please implement a Lasso regression model similar to the Ridge model in the [Regularization](2_Regularization) section.
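
As a rough template for the workflow (split, scale, fit with cross-validated penalty), here is a minimal sketch on synthetic data. The data, the coefficient vector `true_coef`, and all `*_demo` names are assumptions made for illustration and are unrelated to the Hitters data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): 3 informative and 3 pure-noise predictors
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 6))
true_coef = np.array([4.0, -3.0, 2.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# LassoCV picks the penalty strength alpha by internal cross-validation
lasso = LassoCV(cv=5).fit(X_train_s, y_train)

print(f"Chosen alpha : {lasso.alpha_:.4f}")
print("Coefficients :", np.round(lasso.coef_, 3))
print(f"Test R²      : {lasso.score(X_test_s, y_test):.3f}")
```

Unlike Ridge, the L1 penalty can set coefficients exactly to zero, so it is worth inspecting `coef_` to see which predictors survive.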

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm 

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Data related processing
hitters = sm.datasets.get_rdataset("Hitters", "ISLR").data
hitters_subset = hitters[["Salary", "AtBat", "Runs", "RBI", "CHits", "CAtBat", "CRuns", "CWalks", "Assists", "Hits", "HmRun", "Years", "Errors", "Walks"]].copy()

# TODO: Drop highly correlated features and rows with missing data
...

# TODO: Get the target (y) and features (X), then split into training and test set
...

# TODO: Scale predictors to mean=0 and std=1
...

# TODO: Implement Lasso 
...

## Exercise 3: Principal Component Analysis

For today’s practical session, we will work with the **Diabetes** dataset built into `scikit-learn`. This dataset contains medical information from 442 diabetes patients:

* **Features (X):** 10 baseline variables (age, sex, BMI, average blood pressure, and six blood serum measures).
* **Target (y):** a quantitative measure of disease progression one year after baseline.

You can read more here: [https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html)

**Tasks:**

1. **Inspect & clean (already implemented)**

   * Display summary statistics (`df.describe()`) for all 10 features.
   * Check for missing values. (Hint: this dataset has none, but verify.)

2. **Standardize**

   * Use `StandardScaler()` to transform each feature to mean 0, variance 1.

3. **PCA & scree plot**

   * Fit `PCA()` to the standardized feature matrix.
   * Plot the **explained variance ratio** for each principal component (a scree plot).
   * Decide how many components to retain (e.g., cumulative variance ≥ 80%).

4. **Interpret loadings**

   * Examine `pca.components_`.
   * For the first two retained PCs, list the top 3 features by absolute loading.
   * Infer what physiological patterns these components might represent.

5. **Project the data for visualization**

   * Compute the PCA projection: `X_pca = pca.transform(X_std)`.

6. **Plot the results (already implemented)**
   * Create a 2D scatter of PC1 vs. PC2, coloring points by whether the target is **above** or **below** the median progression value.
   * Do patients with more rapid progression cluster differently?
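
The standardize, fit, and inspect steps above can be sketched on a small synthetic matrix. Everything here (the data, the hypothetical `feature_names`, the 80% threshold) is an assumption chosen for illustration, not a result for the Diabetes data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): 4 features, two of them strongly correlated
rng = np.random.default_rng(2)
base = rng.normal(size=(150, 3))
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=150),  # near-copy of the first feature
    base[:, 1],
    base[:, 2],
])
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

X_demo_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA().fit(X_demo_std)

# Cumulative explained variance shows how many components reach a threshold
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.80)) + 1
print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components needed for >= 80% :", n_keep)

# Rows of components_ are the PCs; columns are the original features
for i in range(2):
    loadings = pca_demo.components_[i]
    top = np.argsort(np.abs(loadings))[::-1][:3]
    print(f"PC{i + 1} top features:",
          [(feature_names[j], round(float(loadings[j]), 2)) for j in top])
```

Because `f0` and `f1` are nearly identical, we would expect both to load heavily on the first component; correlated blood serum measures may behave similarly in the real data.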

In [None]:
from sklearn.datasets import load_diabetes

# Load the data as a DataFrame
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame
df.rename(columns={'target': 'Disease progression'}, inplace=True)

X = df.drop(columns='Disease progression')
y = df['Disease progression']

# 1. Inspect the data
df.head()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

import seaborn as sns
sns.set_theme(style="whitegrid")

# 2. Standardize the data
scaler = StandardScaler()
X_std = ...

# 3. Perform the PCA
pca = ...

# 4. Get the explained variance ratio
explained_variance = ...

# 5. Project into PCA space
X_pca = ...

# 6. Plot the explained variance and 2D PCA projection
fig, ax = plt.subplots(1,2, figsize=(15, 5))

ax[0].plot(..., marker='o')
ax[0].set(xlabel='Number of Components', ylabel='Cumulative Explained Variance', title='Scree Plot')

sns.scatterplot(..., hue=y, palette='viridis', alpha=0.6, ax=ax[1])
ax[1].set(xlabel='Principal Component 1', ylabel='Principal Component 2');

## Exercise 3.2: PCR and PLS

In this exercise, we will compare PCR and PLS on the classic Diabetes dataset from scikit-learn. This dataset contains 10 baseline variables (age, BMI, blood pressure, etc.) and a quantitative target: a measure of disease progression one year after baseline.

We start by loading the data and extracting the features (`X`) as well as the target (`y`):

In [None]:
from sklearn.datasets import load_diabetes
import pandas as pd
import matplotlib.pyplot as plt

data = load_diabetes(as_frame=True)
X = data.data
y = data.target

X.head()

How many predictors does the dataset have? Are any of them obviously correlated? Visualize them with a correlation matrix/heatmap.

In [None]:
import seaborn as sns

# TODO: Plot the correlation matrix with sns.heatmap()
...

Please apply PCR with a range of components to see how the performance changes.

1. Use 10-fold CV
2. Try 1 to 10 components
3. Create a model pipeline with `make_pipeline()`
4. Evaluate the models with `cross_val_score()`
5. Plot the $R^2$ over the components
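
For orientation, this is a minimal sketch of a single PCR pipeline evaluated with 10-fold CV, using synthetic data and a fixed number of components. The data and the names `cv_demo`, `pcr_demo`, and `scores_demo` are assumptions for illustration; the exercise asks you to loop over 1 to 10 components on the Diabetes data instead.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=6, noise=15.0, random_state=3
)

cv_demo = KFold(n_splits=10, shuffle=True, random_state=0)

# PCR = standardize -> PCA -> linear regression on the retained components.
# Keeping all steps in one pipeline means scaling and PCA are refit inside each fold.
pcr_demo = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())
scores_demo = cross_val_score(pcr_demo, X_demo, y_demo, cv=cv_demo, scoring="r2")

print(f"PCR with 3 components: mean R² = {scores_demo.mean():.3f}")
```

Building the pipeline inside the loop, with `n_components=n`, gives one mean R² per setting to plot.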

In [None]:
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

# TODO: Set up variables
cv = ...
n_components = ...
pcr_scores = []

# TODO: Loop over number of components, create pipeline, evaluate with cross_val_score()
for n in n_components:
    model = ...
    scores = ...
    pcr_scores.append(scores.mean())

# TODO: Plot results
...

Now, do the exact same thing with `PLSRegression`. How does PLS compare to PCR?

In [None]:
from sklearn.cross_decomposition import PLSRegression

pls_scores = []

# TODO: Loop over number of components, create PLS model, evaluate with cross_val_score()
for n in n_components:
    pls = ...
    scores = ...
    pls_scores.append(scores.mean())

# TODO: Plot PLS and PCR results together
...

## Exercise 4: Logistic Regression

For today's exercise we will use the **Breast Cancer Wisconsin (Diagnostic)** dataset. It is used for predicting whether a breast tumor is malignant (cancerous) or benign (non-cancerous) and contains information derived from images of breast mass samples obtained through fine needle aspirates.

The dataset consists of 569 samples with 30 features that measure various characteristics of cell nuclei, such as radius, texture, perimeter, and area. Each sample is labeled as either **malignant (1)** or **benign (0)**.

1. Please [visit the documentation](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic) and familiarize yourself with the dataset
2. Take an initial look at the features (predictors) and targets (outcomes)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from ucimlrepo import fetch_ucirepo

# Fetch dataset 
breast_cancer_wisconsin_diagnostic = fetch_ucirepo(id=17) 
  
# Get data (as pandas dataframes) 
X = ...
y = ...

# Convert y to a 1D array (this is the required input for the logistic regression model)
y = ...

# Print some general information and statistics about the dataset 
...

1. Split the data into training and test sets (stratify `y`)
2. Create and fit a baseline model using only 2-3 interpretable predictors of your choice
    - Print the model coefficients
3. Evaluate on the test set:
    - Accuracy
    - Confusion matrix
    - Classification report
    - Compare test accuracy to train accuracy. Is there a big gap?

*Hint: If you get a warning about convergence, try setting `max_iter=10000` in the logistic regression class.*
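
The split, fit, and evaluate steps above can be sketched as follows. The example uses synthetic data from `make_classification` (an assumption for illustration), so the three features and all printed values are unrelated to the breast cancer data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary data (illustrative assumption, not the breast cancer data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# stratify=y_demo keeps the class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows = true class, columns = predicted class

print("Coefficients:", np.round(clf.coef_, 3), "Intercept:", np.round(clf.intercept_, 3))
print(f"Train accuracy: {clf.score(X_train, y_train):.3f}")
print(f"Test accuracy : {accuracy_score(y_test, y_pred):.3f}")
print(cm)
print(classification_report(y_test, y_pred))
```

A train accuracy far above the test accuracy would hint at overfitting; with only a few predictors the gap is usually small.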

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1. Split the data
...

# 2. Baseline model with three features
...

# 3. Evaluate the baseline model
...

4. Use all predictors and build a pipeline with standardisation
5. Evaluate on the test set
6. Compare:
    - Baseline model vs full model (test accuracy, precision, recall for malignant)
    - Is the full model clearly better?
    - Is there any sign of overfitting (train vs test)?
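
A pipeline bundles the scaler and the classifier so that the scaling parameters are learned from the training data only. Below is a minimal sketch on synthetic data whose columns are deliberately put on very different scales; the data and the step names `"scaler"` and `"clf"` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption) with columns on very different scales
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=1
)
X_demo = X_demo * np.logspace(0, 3, 10)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=1
)

# fit() standardizes using training statistics, then fits the classifier;
# score() and predict() reuse those same statistics on new data
full_model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=10000)),
])
full_model.fit(X_train, y_train)

train_acc = full_model.score(X_train, y_train)
test_acc = full_model.score(X_test, y_test)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy : {test_acc:.3f}")
```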

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 4. Full model with all features
...

# 5. Evaluate the full model
...

7. Create a custom plot which visualizes the confusion matrix. It should contain:
    - The four squares of the matrix (color coded)
    - Labels of the actual values in the middle of each square
    - Labels for all squares
    - A colorbar
    - A title

8. Use scikit-learn to do the same

In [None]:
# 7. Plot a custom confusion matrix
...

# 8. Use scikit-learn
from sklearn.metrics import ConfusionMatrixDisplay

...
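
One possible way to build the custom plot is with `imshow` and `ax.text`. The sketch below uses hypothetical counts in `cm_demo` (not results from the exercise) and assumes the label order benign = 0, malignant = 1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical counts for illustration: rows = true class, columns = predicted class
cm_demo = np.array([[50, 3],
                    [4, 43]])
class_labels = ["benign", "malignant"]

fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(cm_demo, cmap="Blues")  # color-coded squares
fig.colorbar(im, ax=ax)

ax.set_xticks([0, 1])
ax.set_xticklabels(class_labels)
ax.set_yticks([0, 1])
ax.set_yticklabels(class_labels)
ax.set(xlabel="Predicted label", ylabel="True label",
       title="Confusion matrix (demo counts)")

# Write each count in the centre of its square, with a readable text color
for i in range(2):
    for j in range(2):
        color = "white" if cm_demo[i, j] > cm_demo.max() / 2 else "black"
        ax.text(j, i, str(cm_demo[i, j]), ha="center", va="center", color=color)

fig.tight_layout()
```

For the scikit-learn version, `ConfusionMatrixDisplay.from_predictions(y_true, y_pred)` produces a comparable figure directly from the labels.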

## Exercise 5: LDA, QDA & Naïve Bayes

Once again, we will use the Iris dataset for classification analysis. Your task is to compare the performance of LDA, QDA, and Gaussian Naïve Bayes!

1. Load the `iris` dataset from `sklearn.datasets`. We will use only the first two features (sepal length and width)
2. `TODO:` Split the data into training and test sets ([use stratification!](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
3. `TODO:` Fit LDA, QDA, and Naïve Bayes classifiers to the training data and print the classification report for all models on the test data
4. Plot the decision boundaries for all three models
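
All three classifiers share the same `fit` / `predict` / `score` interface, so they can be compared in a single loop. The sketch below does this on synthetic 2D blobs; the data, the chosen `centers`, and the `models` dictionary are assumptions for illustration, not the Iris solution.

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic 3-class data in two dimensions (illustrative assumption)
X_demo, y_demo = make_blobs(
    n_samples=300, centers=[[0, 0], [4, 0], [2, 4]], cluster_std=1.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Gaussian NB": GaussianNB(),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name:12s} test accuracy: {accuracies[name]:.3f}")
```

For the exercise itself, `classification_report(y_test, model.predict(X_test), target_names=target_names)` gives the per-class breakdown that plain accuracy hides.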

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from matplotlib.colors import ListedColormap

# 1. Load data
iris = load_iris()
X = iris.data[:, :2]
y = iris.target
target_names = iris.target_names

# 2. Split into train/test
...

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 3. TODO: Fit an LDA model and print the classification report
lda = ...

In [None]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# 3. TODO: Fit a QDA model and print the classification report
qda = ...

In [None]:
from sklearn.naive_bayes import GaussianNB

# 3. TODO: Fit a Gaussian Naive Bayes model and print the classification report
gnb = ...

Once you have trained all three models, you can plot their decision boundaries based on the training data. For this, you can use the provided helper function below:

In [None]:
# 4. Plot the decision boundaries for all 3 classifiers

# Helper function to plot decision boundaries
def plot_decision_boundary(model, X, y, title, ax):
    h = .02
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

    ax.contourf(xx, yy, Z, cmap=cmap_light, alpha=0.2)
    scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=30)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_title(title)
    ax.set_xlabel('Sepal length')
    ax.set_ylabel('Sepal width')

# Create plots for all 3 classifiers
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
...  # LDA
...  # QDA
...  # Naïve Bayes
plt.tight_layout()