# Feature Importance

In the context of [📓Feature Engineering](ml-feature-engineering.ipynb), it can be helpful to incorporate feedback from the machine learning algorithm about which features it relied on for its prediction. Several algorithms therefore provide **feature importance scores** after training. These are also frequently used to explain the predictions of a model.

## Preamble

In [None]:
import data_science_learning_paths
data_science_learning_paths.setup_plot_style()

In [None]:
import pandas
import numpy

## Example: Feature Importance in Titanic Survival Model

In the following, we build a simple classifier on the Titanic dataset:

In [None]:
data_path = "../.assets/data/titanic/titanic.csv"

In [None]:
data = pandas.read_csv(data_path)

Many implementations of machine learning models (for example, those in `scikit-learn`) expose weights or scores that indicate how much each feature contributes to the model's decision. Computing and visualizing this **feature importance** after model training is a helpful step in feature engineering.

Consider a simplistic classification model for survival on the Titanic using `scikit-learn`: 

In [None]:
features = ["Pclass", "Age", "SibSp", "Parch", "Sex"]
target =  "Survived"

In [None]:
data = data[features + [target]].dropna()

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
data["Sex"] = LabelEncoder().fit_transform(data["Sex"])

In [None]:
data.head()

In [None]:
X, y = data[features], data[target]

In [None]:
from sklearn.model_selection import cross_val_score

## Feature Importance in Various ML Algorithms

There is no single definition of feature importance that applies universally. Depending on the internals of the ML algorithm, different metrics can be used to quantify how important a feature is for the decision.

### Decision Tree Algorithms

Recall how a decision tree-based algorithm splits the samples at its nodes to arrive at a decision. This [visual introduction to decision tree learning](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/) might help with that.

Each internal node of a decision tree splits on one specific feature, and nodes closer to the root are applied to more samples. Feature importance can therefore be measured by the **error reduction in a node, weighted by the number of samples that are routed through it**. More specifically:
1. Initialize an array `feature_importances` of all zeros with size `n_features`.
2. Traverse the tree: for each internal node that splits on feature `i`, compute the error reduction of that node multiplied by the number of samples that were routed to the node, and add this quantity to `feature_importances[i]`.
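
The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `scikit-learn` implementation: the dictionary-based tree, the helper names `leaf` and `split`, and the impurity numbers are all hypothetical, and the final normalization (so that the scores sum to 1) is an added convention.

```python
# Hypothetical hand-built tree: every node stores its impurity ("error") and
# sample count; internal nodes also store the feature index they split on.

def leaf(impurity, n_samples):
    return {"feature": None, "impurity": impurity, "n_samples": n_samples}


def split(feature, impurity, left, right):
    return {
        "feature": feature,
        "impurity": impurity,
        "n_samples": left["n_samples"] + right["n_samples"],
        "left": left,
        "right": right,
    }


def feature_importances(root, n_features):
    importances = [0.0] * n_features  # step 1: all zeros
    stack = [root]
    while stack:  # step 2: visit every node
        node = stack.pop()
        if node["feature"] is None:
            continue  # leaves do not split on any feature
        left, right = node["left"], node["right"]
        # Error reduction of the node multiplied by its sample count:
        # n * (impurity - n_left/n * impurity_left - n_right/n * impurity_right)
        weighted_reduction = (
            node["n_samples"] * node["impurity"]
            - left["n_samples"] * left["impurity"]
            - right["n_samples"] * right["impurity"]
        )
        importances[node["feature"]] += weighted_reduction
        stack.extend([left, right])
    total = sum(importances)
    # Assumed convention: normalize so that the importances sum to 1
    return [v / total for v in importances] if total > 0 else importances


# Root splits on feature 0; its left child splits on feature 1.
# Feature 2 is never used. Impurities are Gini values of made-up class counts.
tree = split(
    0, 0.5,
    split(1, 0.32, leaf(0.0, 30), leaf(0.5, 20)),
    leaf(0.32, 50),
)
importances = feature_importances(tree, n_features=3)
print(importances)
```

In this toy tree the root split accounts for about three quarters of the total weighted error reduction, the second split for the rest, and the unused feature receives an importance of zero.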

The algorithm [was first described in 1984](https://books.google.de/books/about/Classification_and_Regression_Trees.html?id=JwQx-WOmSyQC&redir_esc=y). It generalizes naturally to ensembles of decision trees, such as _Random Forest_ or _Gradient-boosted Trees_, by combining the importances of the individual trees.
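
One common way to carry this over to an ensemble is to average the normalized importance vectors of the individual trees. The sketch below assumes that simple averaging scheme; the function name `ensemble_importances` and the per-tree numbers are made up for illustration, and a given library may weight or normalize the trees differently.

```python
def ensemble_importances(per_tree_importances):
    """Average per-tree importance vectors feature by feature."""
    n_trees = len(per_tree_importances)
    n_features = len(per_tree_importances[0])
    return [
        sum(tree[i] for tree in per_tree_importances) / n_trees
        for i in range(n_features)
    ]


# Illustrative (made-up) normalized importances of three trees, three features
per_tree = [
    [0.75, 0.25, 0.00],
    [0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50],
]
forest_importances = ensemble_importances(per_tree)
print(forest_importances)
```

Because each per-tree vector sums to 1, the averaged vector does too, so the result can be read the same way as the importances of a single tree.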

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [None]:
classifier = DecisionTreeClassifier()

In [None]:
cross_val_score(
    classifier,
    X, 
    y,
    scoring="f1",
    cv=5
).mean()

In [None]:
classifier.fit(X, y)

In [None]:
# Plot the impurity-based importances, labelled by feature name
pandas.Series(
    classifier.feature_importances_,
    index=X.columns
).sort_values().plot(kind="barh")

In [None]:
classifier = RandomForestClassifier()


In [None]:
cross_val_score(
    classifier,
    X, 
    y,
    scoring="f1",
    cv=5
).mean()

In [None]:
# Fit on the full dataset, then plot the importances by feature name
pandas.Series(
    classifier.fit(X, y).feature_importances_,
    index=X.columns
).sort_values().plot(kind="barh")

### Linear Models

As discussed, different models require different approaches. For example, **Logistic Regression** is a linear model for binary classification. Its raw coefficients depend on the scale of each feature, so to gauge feature importance it is [recommended to extract the coefficients of the model and multiply each by the standard deviation of its feature](https://stackoverflow.com/questions/34052115/how-to-find-the-importance-of-the-features-for-a-logistic-regression-model).
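
To see why the scaling matters, consider a small standalone sketch. The data, the feature names `age` and `flag`, and the coefficients are all made up for illustration; only the rule "coefficient times standard deviation" comes from the text above.

```python
import statistics


def scaled_coefficients(coefficients, columns):
    """Multiply each coefficient by the standard deviation of its feature."""
    return {
        name: coef * statistics.pstdev(values)
        for (name, values), coef in zip(columns.items(), coefficients)
    }


# Made-up data: "age" (in years) varies widely, the 0/1 "flag" varies little
columns = {
    "age": [20, 30, 40, 50, 60],
    "flag": [0, 1, 0, 1, 0],
}
coefficients = [-0.04, 1.0]  # hypothetical fitted coefficients

scaled = scaled_coefficients(coefficients, columns)
print(scaled)
```

Judged by the raw coefficients, `flag` looks 25 times more important than `age`. After scaling, `age` has the larger magnitude, because a typical change in age moves the model output more than a typical change in the flag.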

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
classifier = LogisticRegression()

In [None]:
cross_val_score(
    classifier,
    X, 
    y,
    scoring="f1",
    cv=5
).mean()

In [None]:
X.columns

In [None]:
classifier.fit(X, y).coef_

In [None]:
# Scale each coefficient by the standard deviation of its feature
pandas.Series(
    classifier.fit(X, y).coef_[0] * numpy.std(X, axis=0),
    index=X.columns
).sort_values().plot(kind="barh")

## SHAP (SHapley Additive exPlanations)

The [SHAP](https://github.com/slundberg/shap) library offers another way to explain a model through feature importances. The authors describe it as

> a unified approach to explain the output of _any_ machine learning model

The authors [argue that SHAP has several advantages over other feature attribution methods](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27), besides being algorithm-agnostic.
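
As the name suggests, SHAP builds on Shapley values, which attribute a prediction to features by averaging each feature's marginal contribution over all orders in which the features could be revealed. The brute-force sketch below illustrates that idea only; it is not how the library computes its values. The names `shapley_values` and `toy_model` are hypothetical, and replacing "missing" features with a fixed baseline value is a simplifying assumption.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values, averaged over all feature orderings."""
    n = len(x)
    values = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)  # start with every feature "missing"
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # reveal feature i
            output = model(current)
            values[i] += output - previous_output  # marginal contribution
            previous_output = output
    return [v / len(orderings) for v in values]


def toy_model(features):
    # Hypothetical model: a main effect of a, an a*b interaction, c is unused
    a, b, c = features
    return 2.0 * a + a * b + 0.0 * c


x = [1.0, 3.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions add up to the difference between the prediction for `x` and the prediction for the baseline, the interaction term is shared equally between the two features involved, and the unused feature receives zero. The cost grows factorially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.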

In [None]:
import shap

In [None]:
# load JS visualization code to notebook
shap.initjs()

In [None]:
classifier = RandomForestClassifier()
classifier.fit(X, y)

In [None]:
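# explain the model's predictions using SHAP values
# (same syntax works for LightGBM, CatBoost, and scikit-learn models)
# note: for a classifier, SHAP values are computed per class; whether they
# come back as a list of arrays or one 3-D array depends on the SHAP version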
explainer = shap.TreeExplainer(classifier)
shap_values = explainer.shap_values(X)

# summarize the effects of all the features
shap.summary_plot(shap_values, X)

## Exercise: Feature Importance in the House Price Model

1. Train a regression model of your choice on the feature-rich house price data set.
2. Inspect the feature importance values provided by different methods.
3. Experiment with a few model engineering choices and observe how they affect the feature importances.

In [None]:
price_data = data_science_learning_paths.datasets.read_house_prices()

Note: We have already done basic preprocessing on the dataset (dropped very sparse variables, encoded ordinal and categorical variables, etc.). If you don't agree with these steps, feel free to start from the unprocessed dataset:

In [None]:
ls ../.assets/data/house/

The attribute documentation is essential for working with this dataset:

In [None]:
!head -n 30 ../.assets/data/house/data_description.txt

In [None]:
price_data.head()

In [None]:
# Your code here

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_