# **TP5 - Interpretable Models**

**Course: Advanced Programming For Data Science** <br>
**Lecturer: Dr. Sothea HAS** <br> 
**Mr. Vesal KHEAN**

-------


**Objective:** In this lab, you will move beyond simple prediction to understanding the *relationships* between physical measurements and the age of an abalone. You will learn to use statistical tests to select significant features, handle *non-linear patterns* with *polynomial features*, and control model complexity using *regularization*. The resulting models should also be interpreted at the end of this work.

**The Jupyter Notebook for this TP can be downloaded here: [TP5_Interpretable_Models.ipynb](https://hassothea.github.io/Advanced_Programming_for_DS/TP/TP5_Interpretable_Models.ipynb){target='_blank'}.**


## **1\. Abalone dataset: EDA and Test**

[![](https://asc-aqua.org/wp-content/uploads/2023/03/shutterstock_1916916224-1-2500x1250.jpg)](https://asc-aqua.org/wp-content/uploads/2023/03/shutterstock_1916916224-1-2500x1250.jpg){target="_blank"}

Before interpreting any model, we must ensure the relationships it captures are statistically significant. We will use `statsmodels` to obtain detailed statistical summaries (`p-values`) for our coefficients.

**A. Import dataset:** For more information about the dataset, see [Abalone at the UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/dataset/1/abalone){target='_blank'}. The Abalone dataset can be imported from the repository as shown below.

In [3]:
#| code-fold: true

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
column_names = ["Type", "Length", "Diameter", "Height", "Whole_weight", 
                "Shucked_weight", "Viscera_weight", "Shell_weight", "Rings"]

df = pd.read_csv(url, names=column_names)

**B. EDA and Data Preprocessing:** 

- What's the dimension of this dataset? How many quantitative and qualitative variables are there in this dataset?

- Create a statistical summary and visualize the distribution of each variable in the dataset.

- Identify and handle problems if there are any in this dataset.

- Study the correlation matrix of this dataset. Comment on this correlation matrix.

In [None]:
# To do


**C. Simple and Multiple Linear Regression (OLS)**

- Split the dataset into $80\%-20\%$ train-test data.

- Fit a simple linear regression model (`slr`) predicting `Rings` from the single most reasonable input. Use the `statsmodels` module here because it provides a comprehensive summary table for statistical testing.

- Fit a multiple linear regression model (`mlr`) using all available features.

- Evaluate the performance of your models on the test data.

In [None]:
# To do



**D. Interpretation & Significance Testing:** For each model fitted above, analyze the `P>|t|` column of its summary table.

- **Hypothesis**: Null Hypothesis ($H_0$) is that the coefficient is 0 (no effect). If $P < 0.05$, we reject $H_0$.

- **Observation**: Look for features with P-values $> 0.05$.

- **Refinement**: Below, we will try dropping a feature if it appears insignificant.

In [None]:
# To do


## **2\. Polynomial Features (Complexity)**

Simple linear models assume a straight-line relationship. However, biological growth is often non-linear. We will introduce **Polynomial Features** to capture interaction terms (e.g., $Length \times Diameter$) and curvature.

**A. Creating Interaction Terms**
Use `PolynomialFeatures` to generate a new feature set.

In [None]:
# To do


**B. The Danger of Overfitting**
Fit a standard Linear Regression model on these new polynomial features. Notice the magnitude of the coefficients.

In [None]:
# To do


## **3\. Regularization (Ridge & Lasso)**

High-degree polynomials can lead to exploding coefficients. One may use **Regularization** to constrain the model, trading a little bias for significantly reduced variance.

**A. Ridge Regression (L2 Penalty)** 

Use a `Pipeline` to ensure the data is scaled before regularization (crucial for Ridge/Lasso, whose penalties depend on the scale of each feature), and `RidgeCV` to select the best $\alpha$ automatically by cross-validation.
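
One possible shape for such a pipeline is sketched below on synthetic data. The arrays `X_demo` and `y_demo`, the polynomial degree, and the grid of `alphas` are all assumptions chosen for illustration, not recommended settings for the abalone data.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with an interaction and a squared effect (hypothetical)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = (4 * X_demo[:, 0] * X_demo[:, 1]
          + X_demo[:, 2] ** 2
          + rng.normal(0, 0.1, 300))

# Candidate penalty strengths on a log scale
alphas = np.logspace(-3, 3, 13)

# Expand features, then scale, then fit Ridge with built-in cross-validation
ridge_pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    RidgeCV(alphas=alphas),
)
ridge_pipe.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
best_alpha = ridge_pipe.named_steps["ridgecv"].alpha_
train_r2 = ridge_pipe.score(X_demo, y_demo)
print("Selected alpha:", best_alpha)
print("Training R2:", train_r2)
```

Placing the scaler inside the pipeline means it is fitted on the training data only whenever the pipeline is fitted, which avoids leaking test-set statistics into the model.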

In [4]:
# To do


**B. Visualizing the Regularization Path (Ridge)**

This plot visualizes how increasing the regularization strength ($\alpha$) shrinks the coefficients towards zero, reducing model complexity.
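
The data behind such a plot can be computed by refitting Ridge over a grid of $\alpha$ values and storing the coefficients each time. The sketch below does this on synthetic data; the arrays, the true coefficients, and the grid are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data: four features, the last one has no effect (hypothetical)
rng = np.random.default_rng(3)
X_raw = rng.normal(size=(200, 4))
y_path = X_raw @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.5, 200)
X_std = StandardScaler().fit_transform(X_raw)

# Refit Ridge for each alpha and stack the coefficient vectors row by row
alphas = np.logspace(-2, 4, 30)
coef_path = np.array([Ridge(alpha=a).fit(X_std, y_path).coef_ for a in alphas])

# Euclidean norm of the coefficient vector at each alpha
coef_norms = np.linalg.norm(coef_path, axis=1)
print("Norm at smallest alpha:", coef_norms[0])
print("Norm at largest alpha:", coef_norms[-1])
```

Calling `plt.plot(alphas, coef_path)` followed by `plt.xscale("log")` then draws one curve per feature, each shrinking towards zero as $\alpha$ grows.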

In [None]:
# To do


**C. Lasso feature selection**

Unlike Ridge, Lasso can drive coefficients exactly to zero, effectively performing feature selection. Build polynomial features of a chosen degree from the selected features, then fit a Lasso regression. Tune the penalization strength $\alpha$ and keep track of how the model's coefficients change at each value of $\alpha$.
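
The selection effect can be observed by counting the non-zero coefficients as $\alpha$ increases. The following is a minimal sketch on synthetic data in which only the first two of six features influence the target; the data and the list of $\alpha$ values are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only columns 0 and 1 influence the target (hypothetical)
rng = np.random.default_rng(4)
X_raw = rng.normal(size=(300, 6))
y_lasso = 3.0 * X_raw[:, 0] - 2.0 * X_raw[:, 1] + rng.normal(0, 0.5, 300)
X_std = StandardScaler().fit_transform(X_raw)

# Track the number of non-zero coefficients for each penalty strength
nonzero_counts = {}
for alpha in [0.001, 0.01, 0.1, 0.5, 1.0, 10.0]:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_std, y_lasso)
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))
    print(f"alpha={alpha:<6} non-zero={nonzero_counts[alpha]} "
          f"coef={np.round(lasso.coef_, 3)}")
```

In practice `LassoCV` can choose $\alpha$ by cross-validation, in the same way that `RidgeCV` does for Ridge.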


In [None]:
# To do

- Interpret each of the models you obtained.

## **References & Further Reading**

* **Textbook**: [Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/) by Christoph Molnar.
    * *Recommended Reading*: Chapter 4.1 on [Linear Regression](https://christophm.github.io/interpretable-ml-book/limo.html) for a deeper understanding of weights, p-values, and interpretation.

* **Documentation**:
    * [Statsmodels OLS](https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLS.html): Official documentation for the Ordinary Least Squares implementation used for statistical testing.
    * [Scikit-Learn: Polynomial Features](https://scikit-learn.org/stable/modules/preprocessing.html#polynomial-features): Guide on generating polynomial and interaction features to capture non-linear relationships.
    * [Scikit-Learn: Generalized Linear Models](https://scikit-learn.org/stable/modules/linear_model.html): Comprehensive documentation for Ridge (L2) and Lasso (L1) regularization methods.

* **Dataset**:
    * [UCI Machine Learning Repository: Abalone Dataset](https://archive.ics.uci.edu/ml/datasets/abalone): The original source of the data, including variable definitions and data collection details.
