# Model understanding

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/model_understanding.ipynb)

## Setup

In [None]:
pip install ydf -U

In [1]:
import ydf
import pandas as pd

## What is model understanding?

The goal of **model understanding** is to understand how a model works in order to identify potential modeling or data issues and improve decision-making with the model's outputs.

This notebook presents multiple techniques for interpreting a model trained on the adult dataset.


## Gathering dataset and training model

In [2]:
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

# Train a model
model = ydf.RandomForestLearner(label="income").train(train_ds)

Train model on 22792 examples
Model trained in 0:00:01.200725


## Model description

The **model description** contains two key pieces of information for understanding the model:

**Variable importances**: Variable importances rank the input features by how much the model relies on them. A variable with a high score is more useful to the model when making predictions than a variable with a low score.

Several measures of variable importance exist. For instance, "num_nodes" counts how many nodes in the decision trees use a particular feature.
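To make the idea of a node count concrete, here is a minimal sketch on a toy forest. The nested-dict tree representation and the `count_split_features` helper are hypothetical, chosen only for illustration; they are not YDF's internal format or API.

```python
from collections import Counter

# Hypothetical toy forest: each tree is a nested dict.
# Non-leaf nodes have a "feature" key; leaves only hold a "value".
toy_forest = [
    {"feature": "age", "children": [
        {"feature": "education", "children": [{"value": 0}, {"value": 1}]},
        {"value": 1},
    ]},
    {"feature": "age", "children": [{"value": 0}, {"value": 1}]},
]


def count_split_features(node, counts):
    """Recursively count the non-leaf nodes that test each feature."""
    if "feature" in node:
        counts[node["feature"]] += 1
        for child in node["children"]:
            count_split_features(child, counts)


num_nodes = Counter()
for tree in toy_forest:
    count_split_features(tree, num_nodes)

print(dict(num_nodes))
```

In this toy forest, "age" is used by more nodes than "education", so it would receive the higher node-count score.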

Another type of variable importance is "mean {decrease,increase} in [metric name]". These measures estimate how much the quality of the model would degrade if the feature were removed. In practice, retraining the model without a feature is expensive, so the feature's values are shuffled instead.
For example, a "mean decrease in accuracy" of 0.1 for a feature means that removing this feature would reduce the accuracy of the model by ~0.1.
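The shuffling approach described above can be sketched with NumPy alone. This is an illustrative sketch, not YDF's implementation: `toy_predict` is a hypothetical stand-in for a trained model, and the function names, number of shuffling rounds, and toy data are assumptions.

```python
import numpy as np


def accuracy(predict, features, labels):
    """Fraction of examples for which `predict` returns the correct label."""
    return float(np.mean(predict(features) == labels))


def mean_decrease_in_accuracy(predict, features, labels, column, num_rounds=5, seed=0):
    """Average drop in accuracy after shuffling one feature column."""
    rng = np.random.default_rng(seed)
    baseline = accuracy(predict, features, labels)
    drops = []
    for _ in range(num_rounds):
        shuffled = features.copy()
        # Shuffling breaks the link between this feature and the label
        # while keeping the feature's overall distribution.
        shuffled[:, column] = rng.permutation(shuffled[:, column])
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return float(np.mean(drops))


# Toy data: the label depends only on feature 0; feature 1 is pure noise.
data_rng = np.random.default_rng(42)
toy_features = data_rng.normal(size=(1000, 2))
toy_labels = (toy_features[:, 0] > 0).astype(int)


def toy_predict(features):
    """Hypothetical stand-in for a trained model (ignores feature 1)."""
    return (features[:, 0] > 0).astype(int)


importance_informative = mean_decrease_in_accuracy(
    toy_predict, toy_features, toy_labels, column=0)
importance_noise = mean_decrease_in_accuracy(
    toy_predict, toy_features, toy_labels, column=1)
print(importance_informative, importance_noise)
```

Shuffling the informative feature lowers the accuracy substantially, while shuffling the noise feature leaves it unchanged because the toy model never reads that column.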

The variable importance measures shown by YDF complement each other; no single measure gives a full understanding of the model.

**Structure**: Decision forest models are made up of decision trees. The *structure* tab of the model description shows the first decision tree in the model. This can be helpful for understanding how the model is making predictions in general.


In [3]:
model.describe()

## Model analysis

In contrast to the model description, the **model analysis** requires a test dataset. The most informative results of the analysis are [partial dependence plots (PDP)](https://christophm.github.io/interpretable-ml-book/pdp.html), which show how the model's average prediction changes with the value of one feature, marginalizing over the other features. The model analysis also shows variable importances computed on the provided dataset.
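As a rough illustration of what a partial dependence curve computes, the sketch below forces one feature to each value of a grid for every example and averages the predictions. It is a simplified sketch under stated assumptions, not the code behind the analysis: `toy_predict`, the grid, and the toy data are all hypothetical.

```python
import numpy as np


def partial_dependence(predict, features, column, grid):
    """Average prediction when `column` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        modified = features.copy()
        # Overwrite the feature for every example; other features stay as observed.
        modified[:, column] = value
        curve.append(float(np.mean(predict(modified))))
    return np.array(curve)


def toy_predict(features):
    """Hypothetical stand-in for a trained model's numerical output."""
    return 2.0 * features[:, 0] + features[:, 1] ** 2


data_rng = np.random.default_rng(0)
toy_features = data_rng.normal(size=(500, 2))
grid = np.linspace(-1.0, 1.0, 5)

pdp_curve = partial_dependence(toy_predict, toy_features, column=0, grid=grid)
print(pdp_curve)
```

Because the toy model is linear in feature 0, the resulting curve is a straight line with slope 2; the contribution of feature 1 is averaged out into a constant offset.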

**Note:** Model analysis is computationally expensive. On large datasets, you can use the `sampling` parameter to run the analysis more quickly on a random subset of the data. For example, `sampling=0.1` analyzes roughly 10% of the examples.

In [4]:
model.analyze(test_ds, sampling=0.1)

## Counterfactual clusters

Counterfactual examples are the training examples that the model considers most similar to the example being predicted. Examining clusters of counterfactual examples can provide insight into how the model sees and segments the examples.
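One common way to define similarity "according to a model" for a random forest is the proximity: the fraction of trees in which two examples reach the same leaf. This notebook does not state which measure YDF uses, so the sketch below is an assumption for illustration only. The leaf indices are made up, and `leaf_proximity` is a hypothetical helper.

```python
import numpy as np


def leaf_proximity(query_leaves, train_leaves):
    """Fraction of trees in which each query/training pair shares a leaf.

    Both inputs have shape [num_examples, num_trees] and hold leaf indices.
    """
    # Compare leaf indices pairwise, tree by tree, then average over trees.
    same_leaf = query_leaves[:, np.newaxis, :] == train_leaves[np.newaxis, :, :]
    return same_leaf.mean(axis=2)


# Made-up leaf indices for 3 training examples and 1 query in a 4-tree forest.
train_leaves = np.array([
    [0, 2, 1, 3],
    [0, 2, 1, 0],
    [1, 0, 0, 0],
])
query_leaves = np.array([[0, 2, 1, 3]])

proximity = leaf_proximity(query_leaves, train_leaves)
# Sort training examples from most to least similar to the query.
nearest = np.argsort(-proximity, axis=1)
print(proximity)
print(nearest)
```

The training examples at the front of `nearest` play the role of the most similar examples for the query; grouping them is what yields clusters.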

For more information, see the standalone [counterfactual notebook](../counterfactual).