# COMP0189: Applied Artificial Intelligence
## Week 7 (Model Interpretation and Feature Selection)


## Learning goals 🎯
1. Learn how to use different strategies for interpreting machine learning models.
2. Learn how to properly implement feature selection to avoid leaking information.

### Acknowledgements
- https://scikit-learn.org/stable/
- https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#id1

In [None]:
%pip install scikit-learn==1.6.1 matplotlib seaborn pandas

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Part 1: A common error: leaking information

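We will start with a toy example to illustrate a common mistake when using feature selection. We will create a random dataset with 10,000 features and 100 samples. Both the features and the targets are pure noise, so no model should be able to predict the targets.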

In [2]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
X_test = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))
y_test = rnd.normal(size=(100,))

In [None]:
print(X.shape)

We might consider 10,000 features to be far too many and decide that we need feature selection. So, let's select the 5% most informative features.

In [None]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression,
                          percentile=5)
select.fit(X, y)
X_sel = select.transform(X)

print(X_sel.shape)

Now we will create a pipeline to pre-process the data and fit a regression model to see if we can predict the random labels from the selected features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

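These look like great results, but how can a model score so well on a random dataset?

The score is an artefact of information leakage: the features were selected using all samples, including their targets, before the data was split into train and test sets. The selection step had therefore already seen the test targets.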

### Task 1: Implement a correct pipeline to pre-process the data, select the top 5% of features and train a regression model to predict the random labels.
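One possible way to approach this task is sketched below. It assumes the same kind of random data as above and moves the selection step inside the pipeline, so that it is fitted on the training split only. The names `X_rand`, `y_rand` and `leak_free_pipe` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random data of the same shape as above: 100 samples, 10,000 noise features.
rnd = np.random.RandomState(seed=0)
X_rand = rnd.normal(size=(100, 10000))
y_rand = rnd.normal(size=(100,))

# Split first, so the held-out samples play no part in feature selection.
X_tr, X_te, y_tr, y_te = train_test_split(X_rand, y_rand, random_state=0)

# Scaling, selection and regression are all fitted on the training split only.
leak_free_pipe = make_pipeline(
    StandardScaler(),
    SelectPercentile(score_func=f_regression, percentile=5),
    Ridge(),
)
leak_free_pipe.fit(X_tr, y_tr)

# Number of features kept by the selection step (second pipeline step).
n_selected = int(leak_free_pipe[1].get_support().sum())
test_r2 = leak_free_pipe.score(X_te, y_te)
print(f"Selected features: {n_selected}")
print(f"Held-out R^2 without leakage: {test_r2:.3f}")
```

Because the selector never sees the held-out targets, the test score is an honest estimate of how well noise can be predicted.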

These results are much closer to what we would expect with random labels.

# Part 2: Model interpretation and feature selection

## Breast Cancer Wisconsin (Diagnostic) Dataset (WDBC)

For this part, we will use data from the **Breast Cancer Wisconsin (Diagnostic) Dataset (WDBC)**.

**Source:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)  

**Samples:** 569 (357 Benign, 212 Malignant)  

**Target Variable:** Diagnosis (**M** = Malignant, **B** = Benign in the original data; scikit-learn encodes it as 0 = Malignant, 1 = Benign)  

### Features (30 total)
- **10 Cell Nucleus Characteristics**, including:
  - Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave Points, Symmetry, Fractal Dimension  
- Each characteristic is reported as a **Mean**, a **Standard Error (SE)**, and a **Worst** value (the mean of the three largest values)  



In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset
data = load_breast_cancer()

# Convert to DataFrame
df = pd.DataFrame(data.data, columns=data.feature_names)
df['diagnosis'] = data.target  # scikit-learn encoding: 0 = Malignant, 1 = Benign

# Display the first few rows
df.head()

Now we identify the features X and the target y. The column `diagnosis` is our target variable (i.e., the variable which we want to predict).

In [None]:
# Define features (X) and target (y)
X = df.drop(columns=["diagnosis"])  # Keep only the 30 feature columns
y = df["diagnosis"]  # Target variable (0 = Malignant, 1 = Benign)

# Display summary statistics
X.describe(include="all")

In [None]:
X.head()

Our target for prediction: Diagnosis.


In [None]:
# Define the target variable (y)
y = df["diagnosis"].values.ravel()

# Display the first few values
df["diagnosis"]

We now split the sample into a train and a test dataset. Only the train dataset will be used in the following exploratory analysis. This is a way to emulate a real situation where predictions are performed on an unknown target, and we don’t want our analysis and decisions to be biased by our knowledge of the test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

First, let’s get some insights by looking at a matrix of the pairwise correlations between all features. Only numerical variables are used.
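The correlation matrix can be computed along the following lines. This is a minimal sketch that reloads the data so it runs on its own; the names `features`, `corr` and `strongest_pairs` are illustrative, and the matrix is computed on the training split only, in line with the note above about not using the test data for exploration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the data as a DataFrame and reproduce the train/test split.
features, target = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(features, target, random_state=42)

# Pearson correlation between every pair of features (training split only).
corr = X_tr.corr()

# Keep the upper triangle (excluding the diagonal) so each pair appears once,
# then list the most strongly positively correlated pairs.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strongest_pairs = (
    corr.where(upper_mask).stack().dropna().sort_values(ascending=False).head(5)
)
print(strongest_pairs)
```

To view the full matrix as a heatmap, `corr` can be passed to `sns.heatmap`, for example with `cmap="coolwarm"` and `center=0`.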

We can see that several features are strongly correlated. For example, "mean radius", "mean perimeter" and "mean area" are very strongly correlated with each other. They are also correlated to all other features in the same way. This indicates that these 3 features provide the same or very similar information about the tumor shape.

Before designing a machine learning pipeline, we should check the type of data that we are dealing with:

In [None]:
# Check dataset information
df.info()

All features are numerical and unbounded, suggesting we should scale all of them before training.

## Task 2: Machine Learning Pipeline


### Task 2.1 Implement a **machine learning pipeline** that includes **preprocessing and cross-validation** to optimize the model's hyperparameters. 
- Use the pipeline with **linear SVM** and **regularized logistic regression with L1 and elastic-net regularization** to predict whether a tumor is **malignant or benign** based on the given features. 
- Create a table to show the performance of the different models. 
- Plot the confusion matrix and ROC curve for each model.

In [None]:
from sklearn.compose import make_column_transformer

# Preprocessing: Standardize numerical features
preprocessor = make_column_transformer(
    (StandardScaler(), X.columns),  # Standardize all features
    verbose_feature_names_out=False,
)

In [None]:
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC 

def train_cv(model, param_grid):
    preprocess_and_train = Pipeline(steps=[
        ("preprocess", preprocessor),
        ("classify", model)
    ])

    grid_search = GridSearchCV(
        estimator=preprocess_and_train,
        param_grid=param_grid,
        n_jobs=-1,
        error_score=0,
        refit=True
    )

    # Fit GridSearchCV
    return grid_search.fit(X_train, y_train) 

# Define the hyperparameter grid for each model and run the search
cv_svc = train_cv(
    LinearSVC(dual="auto", random_state=42),
    {'classify__C': [0.1, 1]}
)
model_svc=cv_svc.best_estimator_

cv_lasso = train_cv(
    LogisticRegression(
        penalty="l1",  # Lasso (L1 regularization)
        solver="liblinear",  # Required for L1 penalty
        max_iter=100000,
    ),
    {'classify__C': np.logspace(-3, 3, 10)}
)
model_Lasso = cv_lasso.best_estimator_

cv_en = train_cv(
    LogisticRegression(
        penalty="elasticnet",  # Elastic net (mix of L1 and L2 regularization)
        solver="saga",  # Only solver that supports the elastic-net penalty
        max_iter=100000,
    ),
    {'classify__C': np.logspace(-3, 3, 10), "classify__l1_ratio": [0.1, 0.5, 0.9]}
)
model_EN = cv_en.best_estimator_

print("Done training models")

Done training models


In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# Evaluate models and plot confusion matrices and ROC curves
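A possible starting point for the evaluation is sketched below. It is self-contained, so it refits two small stand-in models instead of reusing `model_svc`, `model_Lasso` and `model_EN`; the `C` values and the names `demo_models` and `performance` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

# Stand-ins for the tuned models; the C values are arbitrary assumptions.
demo_models = {
    "Linear SVM": make_pipeline(
        StandardScaler(), LinearSVC(C=0.1, random_state=42)
    ),
    "Logistic (L1)": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    ),
}

rows = []
for name, model in demo_models.items():
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    # Continuous scores are needed for the ROC curve and its AUC.
    scores = model.decision_function(X_te)
    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_te, y_pred),
        "roc_auc": roc_auc_score(y_te, scores),
    })
    print(name)
    print(confusion_matrix(y_te, y_pred))

# One row per model, one column per metric.
performance = pd.DataFrame(rows).set_index("model")
print(performance.round(3))
```

For the plots, `ConfusionMatrixDisplay.from_estimator(model, X_te, y_te)` and `RocCurveDisplay.from_estimator(model, X_te, y_te)` draw the corresponding figures for a fitted model.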

### Task 2.2 Plot the variability of the model coefficients across folds for the linear models

In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate
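One way to collect the coefficients across folds is sketched below, assuming a single L1-regularized logistic regression with an arbitrary `C=1.0`. `cross_validate` with `return_estimator=True` keeps the pipeline fitted on each fold, so the coefficients can be read back afterwards. The names `l1_demo`, `coefs` and `coef_summary` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

# Illustrative model; C=1.0 is an assumption, not a tuned value.
l1_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)

# 5 folds repeated 3 times gives 15 fitted pipelines.
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
cv_results = cross_validate(l1_demo, X_tr, y_tr, cv=cv, return_estimator=True)

# One row per fitted pipeline, one column per feature.
coefs = pd.DataFrame(
    [est[-1].coef_.ravel() for est in cv_results["estimator"]],
    columns=X_tr.columns,
)

# Mean and spread of each coefficient, largest absolute mean first.
coef_summary = coefs.agg(["mean", "std"]).T
coef_summary = coef_summary.reindex(
    coef_summary["mean"].abs().sort_values(ascending=False).index
)
print(coef_summary.head(10))
```

The variability can then be plotted with, for example, `sns.boxplot(data=coefs, orient="h")`. Because the features are standardized inside the pipeline, the coefficients are on a comparable scale.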



Discussion: Are the coefficients similar across the different models?

### Task 2.3 Plot the permutation feature importance for the different models.

In [None]:
from sklearn.inspection import permutation_importance
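The sketch below shows how `permutation_importance` can be applied to one fitted pipeline. It assumes a linear SVM with an arbitrary `C=0.1` and computes the importance on the held-out split, which is one common choice and not necessarily the one intended here. The names `svm_demo` and `perm_table` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

# Illustrative model; C=0.1 is an assumption, not a tuned value.
svm_demo = make_pipeline(StandardScaler(), LinearSVC(C=0.1, random_state=42))
svm_demo.fit(X_tr, y_tr)

# Shuffle each feature 10 times and record the drop in accuracy.
perm = permutation_importance(
    svm_demo, X_te, y_te, n_repeats=10, random_state=0
)

# Mean and standard deviation of the score drop, most important first.
perm_table = pd.DataFrame(
    {"mean": perm.importances_mean, "std": perm.importances_std},
    index=X_te.columns,
).sort_values("mean", ascending=False)
print(perm_table.head(10))
```

When features are strongly correlated, as several are in this dataset, permuting one of them may change the score very little because the model can still rely on the others.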



Discussion: Are the feature coefficients similar to the permutation importance for the different models?

### Task 2.4 Implement a similar pipeline for tree-based models and use the pipeline with Random Forest and Gradient Boosting trees to predict the tumour malignancy from the other features.
- Create a table to show the performance of the different models. 
- Plot the confusion matrix and ROC curve for each model.

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random Forest Model
rf_model = make_pipeline(
    preprocessor,
    RandomForestClassifier(max_depth=4, random_state=0, n_estimators=100, n_jobs=-1)
)

# Fit Random Forest model
rf_model.fit(X_train, y_train)

# Gradient Boosting Model
gb_model = make_pipeline(
    preprocessor,
    GradientBoostingClassifier(max_depth=4, random_state=0, n_estimators=100)
)

# Fit Gradient Boosting model
gb_model.fit(X_train, y_train)

# These are the pre-trained models that you can use for the next sections
tree_models = [
    ("Random Forest", rf_model),
    ("Gradient Boosting", gb_model)
]

print("Done training models")

Done training models


### Task 2.5 Plot the feature importance for the different tree-based models

### Task 2.6 Plot the permutation feature importance for the different tree-based models
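The two kinds of importance can be put side by side as in the sketch below. It assumes a Random Forest with the same settings as `rf_model` but without the scaling step (tree splits do not depend on feature scale), and it computes the permutation importance on the held-out split. The names `rf_demo` and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

rf_demo = RandomForestClassifier(max_depth=4, n_estimators=100, random_state=0)
rf_demo.fit(X_tr, y_tr)

# Impurity-based importance: computed from the training data, sums to 1.
impurity = pd.Series(rf_demo.feature_importances_, index=X_tr.columns)

# Permutation importance: drop in held-out accuracy when a feature is shuffled.
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=10, random_state=0)

comparison = pd.DataFrame(
    {"impurity_based": impurity, "permutation": perm.importances_mean},
    index=X_tr.columns,
).sort_values("impurity_based", ascending=False)
print(comparison.head(10))
```

The same pattern applies to a fitted `GradientBoostingClassifier`, which also exposes `feature_importances_`. A horizontal bar chart can be drawn with `comparison.plot.barh()`.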

Discussion: Are the feature importance and permutation feature importance similar for the different models?

### Task 2.7 For the best tree-based model, use partial dependence plots to investigate the dependence between the target response and each feature

In [None]:
from sklearn.inspection import PartialDependenceDisplay



## Task 3: Include feature selection within the cross-validation pipeline implemented in Task 2.1 and try two different feature selection strategies (select k best and recursive feature elimination) with the linear SVM model.
- Create a table to show the performance of the different models. 
- Plot the confusion matrix and ROC curve for each model.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
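A minimal sketch of feature selection inside the cross-validated pipeline is given below. Placing the selector as a pipeline step means it is refitted on the training part of every fold, which avoids the leakage shown in Part 1. The candidate numbers of features, the use of `LinearSVC` as the estimator inside `RFE`, and the names `make_search`, `kbest_search` and `rfe_search` are assumptions for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X_bc, y_bc = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=42)

def make_search(selector, param_name):
    """Scale, select, classify; tune the number of selected features by CV."""
    pipe = Pipeline(steps=[
        ("scale", StandardScaler()),
        ("select", selector),
        ("classify", LinearSVC(random_state=42, max_iter=10000)),
    ])
    # Candidate feature counts are arbitrary assumptions.
    return GridSearchCV(pipe, {f"select__{param_name}": [5, 10, 20]}, cv=5)

# Univariate selection with the ANOVA F-test.
kbest_search = make_search(SelectKBest(score_func=f_classif), "k")
kbest_search.fit(X_tr, y_tr)

# Recursive feature elimination driven by a linear SVM.
rfe_search = make_search(
    RFE(LinearSVC(random_state=42, max_iter=10000)), "n_features_to_select"
)
rfe_search.fit(X_tr, y_tr)

for name, search in [("SelectKBest", kbest_search), ("RFE", rfe_search)]:
    mask = search.best_estimator_.named_steps["select"].get_support()
    print(name, search.best_params_, f"test accuracy: {search.score(X_te, y_te):.3f}")
    print("  selected:", list(X_tr.columns[mask]))
```

Comparing the two printed feature lists gives a first answer to whether the strategies agree on which features matter.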



Discussion: Did the model performance improve with feature selection?

### Task 3.2 Plot the variability of the coefficients across folds for the linear model based on the selected features.

Discussion: Are similar features selected using the different strategies?