# COMP0189: Applied Artificial Intelligence
## Week 7 (Model Interpretation and Feature selection)


## Learning goals 🎯
1. Learn how to use different strategies for interpreting machine learning models.
2. Learn how to properly implement feature selection to avoid leaking information.

### Acknowledgements
- https://scikit-learn.org/stable/
- https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#id1

In [None]:
%pip install scikit-learn==1.7.2 matplotlib==3.10.8 pandas==2.3.3 seaborn==0.13.2 imbalanced-learn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster import hierarchy

# Part 1: A common error: leaking information

We will start with a toy example to illustrate a common mistake when using feature selection. We will create a random dataset with 10,000 features and 100 samples; the labels are random too, so there is no real signal to learn.

In [None]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))

In [None]:
print(X.shape)

We might consider 10,000 features too many and decide to use feature selection. So, let's select the 5% most informative features.

In [None]:
from sklearn.feature_selection import SelectPercentile, f_regression

select = SelectPercentile(score_func=f_regression,
                          percentile=5)
select.fit(X, y)
X_sel = select.transform(X)

print(X_sel.shape)

Now we will create a pipeline to pre-process the data and fit a regression model to see if we can predict the random labels from the selected features.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

This score looks great, but how can a model perform so well on a random dataset?

The result is due to information leakage: the features were selected using all samples, before splitting the data into train and test sets, so the test labels influenced which features were kept.

### Task 1: Implement a correct pipeline that pre-processes the data, selects the top 5% of features, and trains a regression model to predict the random labels.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = ...
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)

These results are much closer to what we would expect with random labels.
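The effect of leakage can also be checked with cross-validation. The sketch below is only an illustration: it regenerates noise data of the same shape under new, hypothetical names (`X_noise`, `y_noise`) and compares selecting features once on all samples against selecting them inside the pipeline, where the selector is refit on each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Pure-noise data (hypothetical names): the labels carry no signal.
rng = np.random.RandomState(0)
X_noise = rng.normal(size=(100, 10000))
y_noise = rng.normal(size=(100,))

# Leaky: features are chosen using all samples, including future test folds.
X_leaky = SelectPercentile(f_regression, percentile=5).fit_transform(X_noise, y_noise)
leaky_scores = cross_val_score(
    make_pipeline(StandardScaler(), Ridge()), X_leaky, y_noise, cv=5
)

# Honest: selection is a pipeline step, so it only ever sees the training fold.
honest_pipe = make_pipeline(
    StandardScaler(), SelectPercentile(f_regression, percentile=5), Ridge()
)
honest_scores = cross_val_score(honest_pipe, X_noise, y_noise, cv=5)

print(f"Leaky mean R^2:  {leaky_scores.mean():.2f}")
print(f"Honest mean R^2: {honest_scores.mean():.2f}")
```

The only difference between the two runs is where the selector is fitted, so any gap between the scores comes from the leak.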

# Part 2: Model interpretation and feature selection

### QSAR Biodegradation Dataset

**Source:** [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/254/qsar+biodegradation)

**Samples:** 1,055 chemicals (356 ready biodegradable, 699 not ready biodegradable)

**Target Variable:** Experimental class (**RB** = ready biodegradable, **NRB** = not ready biodegradable)

**Features:** 41 molecular descriptors (e.g., SpMax_L, nHM, F04[C-N], nO, nN) used to classify biodegradability.

**Purpose:** Development of Quantitative Structure-Activity Relationship (QSAR) models to predict the biodegradability of chemical compounds.

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("biodegradation.csv")


# Display the first few rows
df.head()

Now we identify features X and targets y. The column "experimental class" is our target variable (i.e., the variable which we want to predict).

In [None]:
df.replace(["RB", "NRB"], [1, 0], inplace=True)  # encode the class labels as integers
# Rename the target for better readability
df.rename(columns={"experimental class": "degradable"}, inplace=True)

# Define features (X) and target (y)
X = df.drop(columns=["degradable"])  # Exclude the target column
y = df["degradable"]  # Target variable (1 = ready biodegradable, 0 = not ready biodegradable)

# Display summary statistics
X.describe(include="all")

In [None]:
X.head()

Our target for prediction: degradable.


In [None]:
# Display the first few values
df["degradable"]

The classes are imbalanced (356 RB vs. 699 NRB), so we handle this by randomly undersampling the majority class.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
print("Before Undersampling, counts of label '1': {}".format(sum(y == 1)))
print("Before Undersampling, counts of label '0': {} \n".format(sum(y == 0)))

rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)

print('After Undersampling, the shape of X_res: {}'.format(X_res.shape))
print('After Undersampling, the shape of y_res: {} \n'.format(y_res.shape))

print("After Undersampling, counts of label '1': {}".format(sum(y_res == 1)))
print("After Undersampling, counts of label '0': {}".format(sum(y_res == 0)))

## Exploratory data analysis

We now split the sample into a train and a test dataset. Only the train dataset will be used in the following exploratory analysis. This is a way to emulate a real situation where predictions are performed on an unknown target, and we don't want our analysis and decisions to be biased by our knowledge of the test data.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res, random_state=42,stratify=y_res)

First, let's get some insights by looking at a matrix showing the pairwise correlations between all features. Only numerical variables will be used.

In [None]:
correlation_matrix = X_train.corr()

plt.figure(figsize=(8, 7))
sns.heatmap(correlation_matrix, cmap='coolwarm', center=0, square=True,
xticklabels=correlation_matrix.columns, yticklabels=correlation_matrix.columns)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Matrix')
plt.tight_layout()
plt.show()

It is often easier to see structure in the correlation matrix if we reorder the features using hierarchical clustering.

In [None]:
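from scipy.spatial.distance import squareform

# Compute the correlation matrix
correlation_matrix = X_train.corr()

# Compute distance matrix using absolute correlation (to consider both positive/negative)
distance_matrix = 1 - np.abs(correlation_matrix)

# linkage expects a condensed distance vector; a square matrix would be read as raw observations.
# checks=False tolerates tiny floating-point deviations from symmetry / zero diagonal.
condensed_distances = squareform(distance_matrix.to_numpy(), checks=False)

# Perform hierarchical clustering
linkage = hierarchy.linkage(condensed_distances, method='average')
order = hierarchy.dendrogram(linkage, no_plot=True)['leaves']
reordered_corr = correlation_matrix.iloc[order, order]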

# Plot clustered heatmap
plt.figure(figsize=(8, 7))
sns.heatmap(reordered_corr, cmap='coolwarm', center=0, square=True,
xticklabels=reordered_corr.columns, yticklabels=reordered_corr.columns)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Matrix (clustered order)')
plt.tight_layout()
plt.show()

Before designing a machine learning pipeline, we should check the type of data that we are dealing with:

In [None]:
# Check dataset information
df.info()

All features are numerical and unbounded, suggesting we should scale all of them before training.

## Task 2: Machine Learning Pipeline


### Task 2.1 Implement a **machine learning pipeline** that includes **preprocessing and cross-validation** to optimize the model's hyperparameters.
- Use the pipeline with **linear SVM** and **regularized logistic regression with L1 and elastic-net regularization** to predict whether a chemical is **degradable or non-degradable** based on the given features.
- Create a table to show the performance of the different models.
- Plot the confusion matrix and ROC curve for each model.
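As a rough orientation for the structure asked for here, the following sketch tunes a scaled classifier with `GridSearchCV` on synthetic stand-in data. Everything suffixed `_demo` is hypothetical, and a default (L2) `LogisticRegression` is swapped in to keep the example short; it is not the solution to the task.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), not the biodegradation dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# The final step is named "classify", so grid keys take the form "classify__<param>".
demo_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression(max_iter=1000)),
])

# The scaler is refit inside every CV split, so validation folds never influence it.
demo_grid = {"classify__C": [0.01, 0.1, 1, 10]}
demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5, scoring="balanced_accuracy")
demo_search.fit(X_demo_train, y_demo_train)

print(demo_search.best_params_)
print(f"Held-out balanced accuracy: {demo_search.score(X_demo_test, y_demo_test):.3f}")
```

Because `refit=True` is the default, `demo_search.best_estimator_` is a pipeline refitted on the whole training split with the best parameters.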

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Preprocessing: Standardize numerical features
preprocessor = ...

In [None]:
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def optimise_hyperparameters(model: BaseEstimator, param_grid: dict):
    preprocess_and_train = ...

    grid_search = ...

    # Fit GridSearchCV
    return grid_search.fit(X_train, y_train)

# defining parameter range
cv_svc = optimise_hyperparameters(
    LinearSVC(dual="auto", random_state=42),
    {'classify__C': [0.1, 1]}
)
model_svc=cv_svc.best_estimator_

cv_lasso = optimise_hyperparameters(
    LogisticRegression(
        penalty="l1",  # Lasso (L1 regularization)
        solver="liblinear",  # Required for L1 penalty
        max_iter=100000,
    ),
    {'classify__C': np.logspace(-3, 3, 10)}
)
model_Lasso = cv_lasso.best_estimator_

cv_en = optimise_hyperparameters(
    LogisticRegression(
        penalty="elasticnet",
        solver="saga",
        max_iter=100000,
    ),
    {'classify__C': np.logspace(-3, 3, 10), "classify__l1_ratio": [0.1, 0.5, 0.9]}
)
model_EN = cv_en.best_estimator_

print("Done training models")

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, balanced_accuracy_score
from sklearn.pipeline import Pipeline

def get_metrics(name: str, model: Pipeline, use_proba: bool = False):
    # Predictions
    y_pred_test = model.predict(X_test)

    # Continuous scores for AUC: class-1 probabilities if requested, otherwise decision-function values
    y_proba_test = model.predict_proba(X_test)[:, 1] if use_proba else model.decision_function(X_test)

    # Compute classification metrics
    metrics_test = {
        "Model": name,
        "Accuracy": balanced_accuracy_score(y_test, y_pred_test),
        "Precision": precision_score(y_test, y_pred_test),
        "Recall": recall_score(y_test, y_pred_test),
        "F1-score": f1_score(y_test, y_pred_test),
        "AUC": roc_auc_score(y_test, y_proba_test),
    }

    return metrics_test

results_df = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-score", "AUC"])
results_df.loc[0] = get_metrics("SVC", model_svc)
results_df.loc[1] = get_metrics("Logistic Regression (L1)", model_Lasso)
results_df.loc[2] = get_metrics("Logistic Regression (ElasticNet)", model_EN)

results_df

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

def plot_evaluation_graphs(models: list[tuple[str, Pipeline]]):
    fig, ax = plt.subplots(len(models), 2, figsize=(10, 4 * len(models)))

    for i, model in enumerate(models):
        # Confusion Matrix
        ax[i][0].set_title(f"{model[0]} - confusion matrix")
        # Classes are sorted as [0, 1], i.e. [non-degradable, degradable]
        ConfusionMatrixDisplay.from_estimator(model[1], X_test, y_test, display_labels=["non-degradable", "degradable"], ax=ax[i][0], cmap="Blues")

        # ROC Curve
        ax[i][1].set_title(f"{model[0]} - ROC curve")
        RocCurveDisplay.from_estimator(model[1], X_test, y_test, ax=ax[i][1])

    fig.tight_layout()
    plt.show()

models = [
    ("SVC", model_svc),
    ("LR (L1)", model_Lasso),
    ("LR (EN)", model_EN),
]
plot_evaluation_graphs(models)

### Task 2.2 Plot the variability of the model coefficients across folds for the linear models (please rank the coefficients to facilitate interpretability)
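One possible way to collect per-fold coefficients is sketched below on synthetic data. The `_demo` names and the feature names `f0`...`f5` are hypothetical, and the fold/repeat counts are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical feature names

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds x 3 repeats = 15 fitted copies of the pipeline.
demo_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)
demo_cv_results = cross_validate(
    demo_pipe, X_demo, y_demo, cv=demo_cv, return_estimator=True
)

# One row of coefficients per fitted pipeline; est[-1] is the final estimator.
demo_coefs = pd.DataFrame(
    [est[-1].coef_.ravel() for est in demo_cv_results["estimator"]],
    columns=demo_names,
)

# Rank features by median absolute coefficient, then summarise the spread.
demo_ranking = demo_coefs.abs().median().sort_values(ascending=False).index
print(demo_coefs[demo_ranking].describe().loc[["mean", "std"]].round(3))
```

Since the features are standardized inside the pipeline, the coefficient magnitudes are on a comparable scale across features.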

In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate

def get_coefficients(model: Pipeline):
    # Get feature names from the preprocessing pipeline
    feature_names = ...

    # Define repeated k-fold cross-validation
    cv = ...

    # Perform cross-validation and store estimators
    cv_model = ...

    # Extract coefficients from trained models
    return pd.DataFrame(
        [est[-1].coef_.ravel() for est in cv_model["estimator"]],  # Extracting coefficients correctly
        columns=feature_names
    )

coefficients = [
    ("SVC", get_coefficients(model_svc)),
    ("LR (L1)", get_coefficients(model_Lasso)),
    ("LR (EN)", get_coefficients(model_EN)),
]

In [None]:
# Plot coefficient variability with ranked coefficients
def plot_coefficients(name: str, coefs: pd.DataFrame):
    # Calculate median absolute value for each feature and sort
    median_abs = coefs.abs().median().sort_values(ascending=False)
    sorted_features = median_abs.index.tolist()

    # Reorder coefficients by median absolute value
    coefs_sorted = coefs[sorted_features]

    plt.figure(figsize=(10, 10))
    sns.stripplot(data=coefs_sorted, orient="h", palette="dark:k", alpha=0.5)
    sns.boxplot(data=coefs_sorted, orient="h", color="cyan", saturation=0.5)
    plt.axvline(x=0, color=".5")
    plt.xlabel("Coefficient")
    plt.suptitle(f"{name} - Optimal Regularization (Sorted by Feature Importance)")
    plt.subplots_adjust(left=0.3)
    plt.show()

for name, coefs in coefficients:
    plot_coefficients(name, coefs)

Discussion: Are the coefficients similar across the different models?

### Task 2.3 Plot the permutation feature importance for the different models.
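For orientation, here is a minimal sketch of `permutation_importance` on synthetic data. The `_demo` names are hypothetical, and `n_repeats=10` and `scoring="accuracy"` are arbitrary choices for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# With shuffle=False, the informative features are the first columns (f0..f2).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical feature names
X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_pipe.fit(X_demo_train, y_demo_train)

# Each column of the held-out data is shuffled n_repeats times; the drop in
# score relative to the unshuffled data is recorded for every repeat.
demo_perm = permutation_importance(
    demo_pipe, X_demo_test, y_demo_test,
    n_repeats=10, scoring="accuracy", random_state=0,
)

demo_importances = pd.Series(demo_perm.importances_mean, index=demo_names)
print(demo_importances.sort_values(ascending=False).round(3))
```

Passing the whole pipeline means the raw columns are permuted before scaling, so the importances refer to the original features.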

In [None]:
from sklearn.inspection import permutation_importance

# Extract feature names
feature_names = X_test.columns if hasattr(X_test, 'columns') else [f"Feature {i}" for i in range(X_test.shape[1])]

# Compute permutation importance for each fitted pipeline on the test set
result_svc = ...

result_lasso = ...

result_en = ...

# Plot feature importances
fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(np.arange(0, 41) + 0.25, result_svc.importances_mean, yerr=result_svc.importances_std, width=0.25, label="SVC")
ax.bar(np.arange(0, 41) + 0.50, result_lasso.importances_mean, yerr=result_lasso.importances_std, width=0.25, label="LASSO")
ax.bar(np.arange(0, 41) + 0.75, result_en.importances_mean, yerr=result_en.importances_std, width=0.25, label="ElasticNet")
ax.set_xticks(np.arange(0, 41) + 0.5, feature_names, rotation=45, ha="right")
ax.legend()
ax.set_ylabel("Mean Accuracy Decrease")
ax.set_xlabel("Feature")

plt.show()

Discussion: Are the feature coefficients similar to the permutation importances for the different models?

### Task 2.4 Implement a similar pipeline for tree-based models and use the pipeline with Random Forest and Gradient Boosting trees to predict the degradability from the other features.
- Create a table to show the performance of the different models.
- Plot the confusion matrix and ROC curve for each model.

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Random Forest Model
rf_model = ...

# Fit Random Forest model
rf_model.fit(X_train, y_train)

# Gradient Boosting Model
gb_model = ...

# Fit Gradient Boosting model
gb_model.fit(X_train, y_train)

print("Done training models")

In [None]:
results_df = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-score", "AUC"])
results_df.loc[0] = get_metrics("Random Forest", rf_model, use_proba=True)
results_df.loc[1] = get_metrics("Gradient Boosting", gb_model)

results_df

In [None]:
models = [
    ("RF", rf_model),
    ("GB", gb_model)
]

plot_evaluation_graphs(models)

### Task 2.5 Plot the feature importance for the different tree-based models
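Tree ensembles in scikit-learn expose impurity-based importances through the `feature_importances_` attribute. The sketch below reads them from a forest fitted on synthetic data; the `_demo` names are hypothetical, and the per-tree standard deviation is just one possible way to show variability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo_names = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical feature names

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances are normalised so that they sum to 1.
rf_demo_importances = pd.Series(rf_demo.feature_importances_, index=demo_names)

# Spread of the importances across the individual trees of the forest.
rf_demo_std = pd.Series(
    np.std([tree.feature_importances_ for tree in rf_demo.estimators_], axis=0),
    index=demo_names,
)

print(rf_demo_importances.sort_values(ascending=False).round(3))
```

These importances are computed from the training data alone, which is one reason they can disagree with permutation importances measured on a test set.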

In [None]:
def plot_feature_importance(model: Pipeline):
    # Access the final estimator (the tree-based classifier) inside the pipeline
    random_forest_classifier = model.steps[-1][1]

    # Get feature importances
    feature_importances = ...

    # Extract feature names from the preprocessor
    feature_names = ...

    # Create a pandas Series for better visualization
    importances_series = pd.Series(feature_importances, index=feature_names)

    # Plot feature importances
    fig, ax = plt.subplots(figsize=(10, 6))
    importances_series.sort_values().plot.barh(ax=ax, color="forestgreen", alpha=0.7)
    ax.set_title("Feature Importance")
    ax.set_xlabel("Importance")
    fig.tight_layout()
    plt.show()

plot_feature_importance(rf_model)

In [None]:
plot_feature_importance(gb_model)

### Task 2.6 Plot the permutation feature importance for the different tree-based models

In [None]:
def plot_permutation_importance(name: str, model):
    # Extract feature names
    feature_names = X_test.columns if hasattr(X_test, 'columns') else [f"Feature {i}" for i in range(X_test.shape[1])]

    # Compute permutation importance of the fitted model on the test set
    result = ...

    # Sort means and standard deviations together so the error bars stay aligned
    order = np.argsort(result.importances_mean)
    importances = pd.Series(result.importances_mean[order], index=np.asarray(feature_names)[order])

    # Plot feature importances with horizontal error bars
    fig, ax = plt.subplots(figsize=(10, 8))
    importances.plot.barh(xerr=result.importances_std[order], ax=ax, color="forestgreen", alpha=0.8)
    ax.set_title(f"Feature Importances using Permutation - {name}")
    ax.set_xlabel("Mean Accuracy Decrease")
    ax.set_ylabel("Features")
    fig.tight_layout()
    plt.show()

plot_permutation_importance("Random Forest", rf_model)

In [None]:
plot_permutation_importance("Gradient Boosting", gb_model)

Discussion: Are the feature importance and permutation feature importance similar for the different models?

### Task 2.7 For the best tree-based model, use partial dependence plots to investigate the dependence between the target response and each feature
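To see what a partial dependence plot actually draws, the sketch below computes the underlying numbers with `sklearn.inspection.partial_dependence` instead of plotting them. It uses synthetic data and hypothetical `_demo` names, and inspects only the first feature.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
gb_demo = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)

# kind="both" returns the averaged curve and one curve per sample (ICE).
pd_demo = partial_dependence(
    gb_demo, X_demo, features=[0], kind="both", grid_resolution=20
)

average_curve = pd_demo["average"][0]   # one value per grid point
ice_curves = pd_demo["individual"][0]   # one row per sample

# The partial dependence curve is the mean of the per-sample ICE curves.
print(np.allclose(ice_curves.mean(axis=0), average_curve))
print(average_curve.round(3))
```

`PartialDependenceDisplay.from_estimator` accepts the same `kind` argument (`"average"`, `"individual"`, or `"both"`) and draws these curves directly.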

In [None]:
from sklearn.inspection import PartialDependenceDisplay

# Predictions & Probabilities for AUC Computation
y_pred_gb = ...
y_pred_rf = ...

y_proba_gb = ...  # Probabilities for positive class
y_proba_rf = ...

# Compute classification metrics
metrics = {
    "Gradient Boosting": {
        "Accuracy": balanced_accuracy_score(y_test, y_pred_gb),
        "AUC": roc_auc_score(y_test, y_proba_gb),
        "F1-score": f1_score(y_test, y_pred_gb),
    },
    "Random Forest": {
        "Accuracy": balanced_accuracy_score(y_test, y_pred_rf),
        "AUC": roc_auc_score(y_test, y_proba_rf),
        "F1-score": f1_score(y_test, y_pred_rf),
    }
}

# Print performance comparison
for model, scores in metrics.items():
    print(f"\n{model} Performance:")
    for metric, value in scores.items():
        print(f"{metric}: {value:.4f}")

# Select the best model (based on AUC)
best_model = ...
best_model_name = "Gradient Boosting" if best_model is gb_model else "Random Forest"
print(f"\nBest Model Selected: {best_model_name}")

# Partial Dependence Plot (for best model)
features_to_plot = preprocessor.get_feature_names_out()[:6]  # Plot first 6 features for clarity

fig, ax = plt.subplots(figsize=(10, 8))
# code here
PartialDependenceDisplay.from_estimator(...)
plt.suptitle(f"Partial Dependence Plot - {best_model_name}", fontsize=14)
plt.show()

In [None]:
# Generate Individual Conditional Expectation (ICE) plots
fig, ax = plt.subplots(figsize=(12, 8))
# code here
PartialDependenceDisplay.from_estimator(...)

plt.suptitle("ICE plots", fontsize=14)
plt.tight_layout()
plt.show()

## Task 3: Include feature selection within the cross-validation pipeline implemented in Task 2.1 and try two different feature selection strategies (select k best and recursive feature elimination) with the linear SVM model.
- Create a table to show the performance of the different models.
- Plot the confusion matrix and ROC curve for each model.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

def make_feature_selection_pipline(feature_selection_step):
    param_grid = {'linearsvc__C': [0.1, 1,]}

    model_svc_select = ...

    return GridSearchCV(
        estimator=model_svc_select,
        param_grid=param_grid,
        n_jobs=-1,
        error_score=0,
        verbose=1,
        refit=True,
    )

kbest_pipeline = make_feature_selection_pipline(
    SelectKBest(score_func=f_classif, k=10)
)

# Fit GridSearchCV
kbest_result = kbest_pipeline.fit(X_train, y_train)

In [None]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rfe_pipeline = make_feature_selection_pipline(
    RFE(estimator=LogisticRegression(max_iter=5000, solver="liblinear"), n_features_to_select=10),
)

rfe_result=rfe_pipeline.fit(X_train, y_train)

In [None]:
results_df = pd.DataFrame(columns=["Model", "Accuracy", "Precision", "Recall", "F1-score", "AUC"])
results_df.loc[0] = get_metrics("KBest", kbest_result.best_estimator_)
results_df.loc[1] = get_metrics("RFE", rfe_result.best_estimator_)

results_df

In [None]:
plot_evaluation_graphs([
    ("KBest", kbest_result.best_estimator_),
    ("RFE", rfe_result.best_estimator_)
])

Discussion: Did the model performance improve with feature selection?

### Task 3.2 Plot the variability of the coefficients across folds for the linear model based on the selected features (please rank the coefficients to facilitate interpretability).

In [None]:
coefs_kbest = ...
plot_coefficients("KBest", coefs_kbest)

coefs_rfe = ...
plot_coefficients("RFE", coefs_rfe)

Discussion: Are similar features selected using the different strategies?