# COMP0189: Applied Artificial Intelligence

## Week 7 (Model Interpretation and Feature Selection)

## Learning goals 🎯

1. Learn how to properly implement feature selection to avoid leaking information.
2. Learn how to use different strategies for interpreting machine learning models.

### Acknowledgements

- https://scikit-learn.org/stable/
- https://scikit-learn.org/stable/auto_examples/inspection/plot_linear_model_coefficient_interpretation.html#id1


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import seaborn as sns


# Part 1: A common error: leaking information


We will start with a toy example to illustrate a common mistake when using feature selection. We will create a random dataset with 10,000 features and 100 samples.


In [2]:
rnd = np.random.RandomState(seed=0)
X = rnd.normal(size=(100, 10000))
X_test = rnd.normal(size=(100, 10000))
y = rnd.normal(size=(100,))
y_test = rnd.normal(size=(100,))


In [3]:
print(X.shape)


(100, 10000)


We might consider that 10,000 is a very high number of features and that we need to use feature selection. So, let's select the 5% most informative features.


In [4]:
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import RidgeCV

select = SelectPercentile(score_func=f_regression, percentile=5)
select.fit(X, y)
X_sel = select.transform(X)

print(X_sel.shape)


(100, 500)


Now we will create a pipeline to pre-process the data and fit a regression model to see if we can predict the random labels from the selected features.


In [5]:
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X_sel, y, random_state=0)
pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)


0.9047168401499722

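These look like great results, but how can a model predict random labels so well?

The score is an artefact of information leakage: the feature selector was fitted on all 100 samples, including those that later ended up in the test split, so the selected features were chosen partly because they correlate by chance with the test labels.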
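
The mechanism behind the leak can be sketched with NumPy alone. This is an illustrative toy, not the notebook's solution: the helper `abs_correlation` and the variable names are assumptions introduced here. Features are ranked by absolute Pearson correlation, which gives the same ordering as the F-statistic used by `f_regression` for a single feature. The selected features look informative on the data used for selection, but not on independent data.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_features = 100, 10000

# Two independent random datasets: one used for selection, one never seen.
X_seen = rng.normal(size=(n_samples, n_features))
y_seen = rng.normal(size=n_samples)
X_fresh = rng.normal(size=(n_samples, n_features))
y_fresh = rng.normal(size=n_samples)


def abs_correlation(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))


# Keep the 5% of features most correlated with the labels we have seen.
n_keep = n_features // 20
top = np.argsort(abs_correlation(X_seen, y_seen))[-n_keep:]

corr_seen = abs_correlation(X_seen[:, top], y_seen).mean()
corr_fresh = abs_correlation(X_fresh[:, top], y_fresh).mean()
print(f"mean |r| on the data used for selection: {corr_seen:.2f}")
print(f"mean |r| on independent data:            {corr_fresh:.2f}")
```

Any sample that took part in the selection step behaves like the "seen" data here, which is why a test split carved out afterwards gives an optimistic score.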


### Task 1: Implement a correct pipeline to pre-process the data, select the top 5% features and train a regression model to predict the random labels.


In [6]:
X_train, X_test, y_train, y_test = train_test_split(None)
pipe = make_pipeline(None, select, Ridge())
pipe.fit(None)
pipe.score(None)


TypeError: Expected sequence or array-like, got <class 'NoneType'>

These results are much closer to what we would expect with random labels.


# Part 2: Model interpretation and feature selection


For this part we will use data from the “Current Population Survey” from 1985 to predict wage as a function of various features such as experience, age, or education.


We fetch the data from OpenML. Note that setting the parameter as_frame to True will retrieve the data as a pandas dataframe.


In [None]:
from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)


Now we identify features X and targets y: the column WAGE is our target variable (i.e., the variable which we want to predict).


In [None]:
X = survey.data[survey.feature_names]
X.describe(include="all")


Note that the dataset contains categorical and numerical variables. We will need to take this into account when preprocessing the dataset thereafter.


In [None]:
X.head()


Our target for prediction: the wage. Wages are described as floating-point numbers in dollars per hour.


In [None]:
y = survey.target.values.ravel()
survey.target.head()


We now split the sample into a train and a test dataset. Only the train dataset will be used in the following exploratory analysis. This is a way to emulate a real situation where predictions are performed on an unknown target, and we don’t want our analysis and decisions to be biased by our knowledge of the test data.


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)


First, let’s get some insights by looking at the variable distributions and at the pairwise relationships between them. Only numerical variables will be used. In the following plot, each dot represents a sample.


In [None]:
train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind="reg", diag_kind="kde")


Looking closely at the WAGE distribution reveals that it has a long tail. For this reason, we should take its logarithm to turn it approximately into a normal distribution (linear models such as ridge or lasso work best for a normal distribution of error).
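
A minimal sketch of the log-transform idea on synthetic data, assuming a hypothetical long-tailed target whose base-10 logarithm is linear in a single feature. `TransformedTargetRegressor` fits the regressor on `func(y)` and maps predictions back with `inverse_func`; all variable names below are illustrative.

```python
import numpy as np
from scipy.special import exp10
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x_toy = rng.uniform(0, 2, size=(200, 1))
# Hypothetical long-tailed target: log10(y) = 0.5 * x + small Gaussian noise.
y_toy = exp10(0.5 * x_toy[:, 0] + 0.1 * rng.normal(size=200))

toy_ttr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log10,        # applied to y before fitting
    inverse_func=exp10,   # applied to predictions
)
toy_ttr.fit(x_toy, y_toy)

slope_log_scale = toy_ttr.regressor_.coef_[0]  # coefficient on the log10 scale
pred = toy_ttr.predict(x_toy)                  # predictions back on the original scale
print(f"slope on log10 scale: {slope_log_scale:.2f}")
```

Note that the fitted coefficients live on the transformed (log) scale, which matters when interpreting them later.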

The WAGE is increasing when EDUCATION is increasing. Note that the dependence between WAGE and EDUCATION represented here is a marginal dependence, i.e., it describes the behavior of a specific variable without keeping the others fixed.

Also, the EXPERIENCE and AGE are strongly linearly correlated.
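
The difference between marginal and conditional dependence, and the effect of correlated features, can be illustrated with a small synthetic sketch. The variables `x1`, `x2` and `target` are hypothetical and unrelated to the wage data: `x2` is correlated with `x1`, but the target depends on `x1` only.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)            # strongly correlated with x1
target = 2.0 * x1 + 0.5 * rng.normal(size=n)  # depends on x1 only

# Marginal view: x2 alone looks strongly related to the target.
marginal_corr = np.corrcoef(x2, target)[0, 1]

# Conditional view: least squares with both features (plus an intercept).
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, target, rcond=None)

print(f"marginal correlation of x2 with target: {marginal_corr:.2f}")
print(f"coefficient of x1: {coef[1]:.2f}, coefficient of x2: {coef[2]:.2f}")
```

A pair plot shows only the marginal view; a model coefficient describes the effect of a feature with the others held fixed.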


Before designing a machine learning pipeline, we should check the type of data that we are dealing with:


In [None]:
survey.data.info()


As seen previously, the dataset contains columns with different data types, and we need to apply a specific preprocessing step to each data type.


## Task 2: Implement a machine learning pipeline that includes pre-processing and cross-validation to optimize the models' hyperparameters, and use the pipeline with ridge, lasso and elastic-net regression to predict the wages from the other features.


In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge, RidgeCV, LassoCV
from sklearn.pipeline import make_pipeline

categorical_columns = ["RACE", "OCCUPATION", "SECTOR", "MARR", "UNION", "SEX", "SOUTH"]
numerical_columns = ["EDUCATION", "EXPERIENCE", "AGE"]

preprocessor = make_column_transformer(
    (None, categorical_columns),
    (None, numerical_columns),
    verbose_feature_names_out=False,
)


alphas = np.logspace(-10, 10, 21)  # alpha values to be chosen from by cross-validation

model_Ridge = make_pipeline(
    None,
    TransformedTargetRegressor(
        regressor=None,
        func=np.log10,
        inverse_func=sp.special.exp10,
    ),
)
model_Ridge.fit(None)


In [None]:
model_Ridge[-1].regressor_.alpha_


In [None]:
from sklearn.linear_model import LassoCV

alphas = np.logspace(-10, 10, 21)  # alpha values to be chosen from by cross-validation

model_Lasso = make_pipeline(
    None,
    TransformedTargetRegressor(
        regressor=None,
        func=np.log10,
        inverse_func=sp.special.exp10,
    ),
)
model_Lasso.fit(None)


In [None]:
model_Lasso[-1].regressor_.alpha_


In [None]:
from sklearn.linear_model import ElasticNetCV

alphas = np.logspace(-10, 10, 21)  # alpha values to be chosen from by cross-validation

model_EN = make_pipeline(
    preprocessor,
    TransformedTargetRegressor(
        regressor=None,
        func=np.log10,
        inverse_func=sp.special.exp10,
    ),
)
model_EN.fit(None)


In [None]:
model_EN[-1].regressor_.alpha_


### Task 2.1 Check the performance of the computed models by plotting their predictions on the test set and computing their median absolute error.


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = median_absolute_error(None)
y_pred = None
mae_test = median_absolute_error(None)
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Ridge model, optimum regularization")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = median_absolute_error(None)
y_pred = None
mae_test = median_absolute_error(None)
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Lasso model, optimum regularization")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = median_absolute_error(None)
y_pred = None
mae_test = median_absolute_error(None)
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Elastic-Net model, optimum regularization")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


### Task 2.2 Plot the variability of the models' coefficients across folds for the linear models.


In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate

feature_names = model_Ridge[:-1].get_feature_names_out()
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_model = cross_validate(
    None,
    None,
    None,
    cv=cv,
    return_estimator=True,
    n_jobs=2,
)
coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model["estimator"]], columns=feature_names
)


In [None]:
plt.figure(figsize=(9, 7))
sns.stripplot(data=coefs, orient="h", palette="dark:k", alpha=0.5)
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.title("Coefficient importance and its variability")
plt.xlabel("Coefficient importance")
plt.suptitle("Ridge model, optimal regularization")
plt.subplots_adjust(left=0.3)
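
The cross-validation pattern used in this task can be sketched on synthetic data. This is an illustrative toy, assuming a plain `StandardScaler` plus `RidgeCV` pipeline rather than the wage pipeline; the `toy_` names and the `x0`..`x3` column labels are made up. With `return_estimator=True`, `cross_validate` keeps one fitted pipeline per fold, from which the coefficients can be collected.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy, y_toy = make_regression(
    n_samples=120, n_features=4, noise=10.0, random_state=0
)
toy_model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 7)))
toy_cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)

# One fitted copy of the pipeline is returned for each of the 25 splits.
toy_results = cross_validate(toy_model, X_toy, y_toy, cv=toy_cv, return_estimator=True)

toy_coefs = pd.DataFrame(
    [est[-1].coef_ for est in toy_results["estimator"]],
    columns=[f"x{i}" for i in range(X_toy.shape[1])],
)
# Spread of each coefficient across the splits.
print(toy_coefs.describe().loc[["mean", "std"]])
```

A coefficient whose spread across splits is large relative to its mean should be interpreted with caution.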


In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate

feature_names = model_Lasso[:-1].get_feature_names_out()
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_model = cross_validate(
    None,
    None,
    None,
    cv=cv,
    return_estimator=True,
    n_jobs=2,
)
coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model["estimator"]], columns=feature_names
)


In [None]:
plt.figure(figsize=(9, 7))
sns.stripplot(data=coefs, orient="h", palette="dark:k", alpha=0.5)
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.title("Coefficient importance and its variability")
plt.xlabel("Coefficient importance")
plt.suptitle("Lasso model, optimal regularization")
plt.subplots_adjust(left=0.3)


In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate

feature_names = None[:-1].get_feature_names_out()
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_model = cross_validate(
    None,
    None,
    None,
    cv=cv,
    return_estimator=True,
    n_jobs=2,
)
coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model["estimator"]], columns=feature_names
)


In [None]:
plt.figure(figsize=(9, 7))
sns.stripplot(data=coefs, orient="h", palette="dark:k", alpha=0.5)
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.title("Coefficient importance and its variability")
plt.xlabel("Coefficient importance")
plt.suptitle("Elastic-net model, optimal regularization")
plt.subplots_adjust(left=0.3)


Discussion: Are the coefficients similar across the different models?


### Task 2.3 Plot the permutation feature importance for the different models.


In [None]:
from sklearn.inspection import permutation_importance

feature_names = (
    X_test.columns
    if hasattr(X_test, "columns")
    else [f"feature {i}" for i in range(X_test.shape[1])]
)

result = permutation_importance(None)

forest_importances = pd.Series(result.importances_mean, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model - Ridge")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()

plt.xticks(rotation=45, ha="right")

plt.show()
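
For reference, a minimal sketch of `permutation_importance` on synthetic data, assuming a plain `Ridge` model and a made-up target that depends strongly on the first feature, weakly on the second and not at all on the third. The function shuffles one input column at a time on the given data and records the drop in the estimator's score (R² by default for regressors).

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 3))
# Hypothetical target: strong effect of column 0, weak effect of column 1.
y_toy = 3.0 * X_toy[:, 0] + 0.5 * X_toy[:, 1] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
toy_ridge = Ridge().fit(X_tr, y_tr)

# Importances are computed on held-out data, with 10 shuffles per column.
toy_result = permutation_importance(
    toy_ridge, X_te, y_te, n_repeats=10, random_state=0
)
for i in range(X_toy.shape[1]):
    print(
        f"feature {i}: {toy_result.importances_mean[i]:.3f} "
        f"+/- {toy_result.importances_std[i]:.3f}"
    )
```

When the estimator is a full pipeline, the shuffling happens on the raw input columns, so there is one importance per original feature rather than per encoded column.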


In [None]:
feature_names = (
    X_test.columns
    if hasattr(X_test, "columns")
    else [f"feature {i}" for i in range(X_test.shape[1])]
)

result = permutation_importance(None)

forest_importances = pd.Series(None, index=None)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model - Lasso")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()

plt.xticks(rotation=45, ha="right")

plt.show()


In [None]:
feature_names = (
    X_test.columns
    if hasattr(X_test, "columns")
    else [f"feature {i}" for i in range(X_test.shape[1])]
)

result = permutation_importance(None)

forest_importances = pd.Series(None)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model - Elastic Net")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()

plt.xticks(rotation=45, ha="right")

plt.show()


Discussion: Are the feature coefficients similar to the permutation importances for the different models?


### Task 2.4 Implement a similar pipeline for tree-based models and use the pipeline with random-forest and boosted regression trees to predict the wages from the other features.


In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.compose import TransformedTargetRegressor
import numpy as np
import scipy as sp

# Random Forest Model
rf_model = make_pipeline(
    None,
    TransformedTargetRegressor(None),
)

rf_model.fit(None)


In [None]:
# Gradient Boosting Model
gb_model = make_pipeline(
    None,
)
gb_model.fit(None)


### Task 2.5 Check the performance of the tree-based models by plotting their predictions on the test set and computing their median absolute error.


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = median_absolute_error(None)
y_pred = None
mae_test = None
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Random Forest model, fixed parameters")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = None
y_pred = None
mae_test = None
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Gradient Boosting model, fixed parameters")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


### Task 2.6 Plot the feature importance for the different tree-based models


In [None]:
# Access the RandomForestRegressor object inside the TransformedTargetRegressor which is inside the pipeline
random_forest_regressor = rf_model.named_steps['transformedtargetregressor'].regressor_

# Get feature importances
feature_importances = random_forest_regressor.None


feature_names = preprocessor.get_feature_names_out()


# Create a pandas series with feature importances
importances_series = pd.Series(feature_importances, index=feature_names)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
importances_series.sort_values().plot.barh(ax=ax)
ax.set_title("Feature Importance - Random Forest")
ax.set_xlabel("Importance")
plt.tight_layout()
plt.show()


In [None]:
# Access the GradientBoostingRegressor object inside the TransformedTargetRegressor which is inside the pipeline
gradient_boosting_regressor = gb_model.named_steps['transformedtargetregressor'].regressor_

# Get feature importances
feature_importances = gradient_boosting_regressor.None


feature_names = preprocessor.get_feature_names_out()


# Create a pandas series with feature importances
importances_series = pd.Series(feature_importances, index=feature_names)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
importances_series.sort_values().plot.barh(ax=ax)
ax.set_title("Feature Importance - Gradient Boosting")
ax.set_xlabel("Importance")
plt.tight_layout()
plt.show()


### Task 2.7 Plot the permutation feature importance for the different tree-based models


In [None]:
from sklearn.inspection import permutation_importance


feature_names = (
    X_test.columns
    if hasattr(X_test, "columns")
    else [f"feature {i}" for i in range(X_test.shape[1])]
)

result = permutation_importance(None)

forest_importances = pd.Series(result.importances_mean, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model - Random Forest")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()

plt.xticks(rotation=45, ha="right")
plt.show()


In [None]:
feature_names = (
    X_test.columns
    if hasattr(X_test, "columns")
    else [f"feature {i}" for i in range(X_test.shape[1])]
)

result = permutation_importance(None)

forest_importances = pd.Series(result.importances_mean, index=feature_names)

fig, ax = plt.subplots()
forest_importances.plot.bar(yerr=result.importances_std, ax=ax)
ax.set_title("Feature importances using permutation on full model - Gradient Boosting")
ax.set_ylabel("Mean accuracy decrease")
fig.tight_layout()

plt.xticks(rotation=45, ha="right")
plt.show()


Discussion: Are the impurity-based feature importances and the permutation feature importances similar for the different models?
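
To make the comparison concrete, here is a hedged toy sketch with one informative and one unrelated synthetic feature. The data and the `toy_` names are assumptions for illustration. The impurity-based importances are derived from the training data and normalised to sum to one, whereas the permutation importances measure the drop in score on the data passed in (held-out data here).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
informative = rng.normal(size=300)
unrelated = rng.normal(size=300)  # has no effect on the target
X_toy = np.column_stack([informative, unrelated])
y_toy = 2.0 * informative + 0.5 * rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)
toy_forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (from the fitted trees, sum to 1).
toy_impurity = toy_forest.feature_importances_
# Permutation importances (drop in R^2 on the held-out split).
toy_perm = permutation_importance(
    toy_forest, X_te, y_te, n_repeats=10, random_state=0
)

print("impurity-based:", np.round(toy_impurity, 3))
print("permutation:   ", np.round(toy_perm.importances_mean, 3))
```

The two measures are on different scales, so it is the ranking of the features, not the raw values, that is worth comparing.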


### Task 2.8 For the best tree-based model, use partial dependence plots to investigate the dependence between the target response and each feature.


In [None]:
from sklearn.metrics import r2_score, mean_absolute_error

# Predictions from Gradient Boosting model
y_pred_gb = gb_model.predict(X_test)

# Predictions from Random Forest model
y_pred_rf = rf_model.predict(X_test)

# Compute R² score
r2_gb = r2_score(y_test, y_pred_gb)
r2_rf = r2_score(y_test, y_pred_rf)

# Compute Mean Absolute Error
mae_gb = mean_absolute_error(y_test, y_pred_gb)
mae_rf = mean_absolute_error(y_test, y_pred_rf)

print(f"Gradient Boosting R² Score: {r2_gb:.4f}")
print(f"Random Forest R² Score: {r2_rf:.4f}")

print(f"Gradient Boosting MAE: {mae_gb:.4f} $/hour")
print(f"Random Forest MAE: {mae_rf:.4f} $/hour")


In [None]:
# refer to documentation ... create list of categories which signal which features are categorical

# Get the list of all feature names
feature_names = X_train.columns.tolist()

# Get the list of numeric feature names
numeric_feature_names = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Initialize an empty list to store the boolean values
is_categorical = []

# Iterate over all feature names
for feature in feature_names:
    # If the feature is not in the list of numeric features, it is categorical
    if feature not in numeric_feature_names:
        is_categorical.append(None)
    else:
        is_categorical.append(None)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Generate partial dependence plots for selected features using the entire pipeline
fig, ax = plt.subplots(figsize=(12, 10))  # Adjusted figure size
PartialDependenceDisplay.from_estimator(None)

plt.xticks(rotation=45)  # Rotate x-axis labels if needed
plt.subplots_adjust(bottom=0.2, hspace=1, wspace=0.4)  # Adjust spacing
plt.tight_layout()  # Adjust layout

plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Individual (only for numeric)

# Generate partial dependence plots for selected features using the entire pipeline
fig, ax = plt.subplots(figsize=(12, 8))
PartialDependenceDisplay.from_estimator(None)

plt.show()
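
What an average partial dependence curve computes can be sketched by hand. This is a manual illustration on synthetic data, not the scikit-learn implementation; the helper `manual_partial_dependence` and the toy data are assumptions. For each grid value, the chosen feature is set to that value for every sample and the model's predictions are averaged.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(-1, 1, size=(300, 2))
# Hypothetical target: depends on column 0 only.
y_toy = 2.0 * X_toy[:, 0] + 0.1 * rng.normal(size=300)
toy_gb = GradientBoostingRegressor(random_state=0).fit(X_toy, y_toy)


def manual_partial_dependence(model, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for all samples
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)


grid = np.linspace(-1, 1, 5)
curve_x0 = manual_partial_dependence(toy_gb, X_toy, 0, grid)
curve_x1 = manual_partial_dependence(toy_gb, X_toy, 1, grid)
print("feature 0:", np.round(curve_x0, 2))
print("feature 1:", np.round(curve_x1, 2))
```

The curve for the informative feature rises across the grid, while the curve for the unrelated feature stays almost flat.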


## Task 3: Include feature selection within the cross-validation pipeline implemented in Task 2 and try two different feature selection strategies (select k best and recursive feature elimination) with the ridge regression model.


In [None]:
from sklearn.feature_selection import RFECV, RFE, SelectKBest
from sklearn.svm import SVR


alphas = np.logspace(-10, 10, 21)  # alpha values to be chosen from by cross-validation
model = make_pipeline(None)
# selector = RFECV(model, cv=5)
model.fit(X_train, y_train)
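
As a hedged sketch of the two strategies on synthetic data (not the wage pipeline), the example below places `SelectKBest` and `RFE` inside a pipeline so that the selection is refitted on each training fold. The toy target depends on the first two of eight columns only; `k=2` and `n_features_to_select=2` are illustrative choices.

```python
import numpy as np
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 8))
# Hypothetical target: only columns 0 and 1 matter.
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 0.5 * rng.normal(size=200)

kbest_pipe = make_pipeline(
    StandardScaler(), SelectKBest(f_regression, k=2), Ridge()
)
rfe_pipe = make_pipeline(
    StandardScaler(), RFE(Ridge(), n_features_to_select=2), Ridge()
)

for name, pipe in [("SelectKBest", kbest_pipe), ("RFE", rfe_pipe)]:
    # Selection happens inside each training fold, so no test data leaks in.
    scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
    pipe.fit(X_toy, y_toy)
    selected = np.flatnonzero(pipe[1].get_support())
    print(f"{name}: selected columns {selected}, mean R^2 = {scores.mean():.2f}")
```

`SelectKBest` scores each feature on its own, while `RFE` repeatedly fits the model and drops the feature with the smallest coefficient, so the two can disagree when features are correlated.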


### Task 3.1 Check the performance of the computed models by plotting their predictions on the test set and computing their median absolute error.


In [None]:
from sklearn.metrics import PredictionErrorDisplay, median_absolute_error

mae_train = None
y_pred = None
mae_test = None
scores = {
    "MedAE on training set": f"{mae_train:.2f} $/hour",
    "MedAE on testing set": f"{mae_test:.2f} $/hour",
}

_, ax = plt.subplots(figsize=(5, 5))
display = PredictionErrorDisplay.from_predictions(
    y_test, y_pred, kind="actual_vs_predicted", ax=ax, scatter_kwargs={"alpha": 0.5}
)
ax.set_title("Ridge model + RFE and optimum regularization")
for name, score in scores.items():
    ax.plot([], [], " ", label=f"{name}: {score}")
ax.legend(loc="upper left")
plt.tight_layout()


Discussion: Did the model performance improve with feature selection?


### Task 3.2 Plot the variability of the coefficients across folds for the linear model based on the selected features.


In [None]:
from sklearn.model_selection import RepeatedKFold, cross_validate

feature_names = model[:-1].get_feature_names_out()
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
cv_model = cross_validate(None)
coefs = pd.DataFrame(
    [est[-1].regressor_.coef_ for est in cv_model["estimator"]], columns=feature_names
)


In [None]:
plt.figure(figsize=(9, 7))
sns.stripplot(data=coefs, orient="h", palette="dark:k", alpha=0.5)
sns.boxplot(data=coefs, orient="h", color="cyan", saturation=0.5)
plt.axvline(x=0, color=".5")
plt.title("Coefficient importance and its variability")
plt.xlabel("Coefficient importance")
plt.suptitle("Ridge model + RFE and optimum regularization")
plt.subplots_adjust(left=0.3)


Discussion: Are similar features selected using the different strategies?
