# Example distance

## Setup


In [None]:
pip install ydf scikit-learn plotly -U

In [None]:
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets
import numpy as np

## What is an example distance?

Decision forest models define an **implicit measure of proximity or similarity between two examples**, referred to as a **distance**. The distance reflects how similarly the model treats two examples. Informally, **two examples are close if they are of the same class and for the same reasons**.

This distance is useful for understanding models and their predictions. For example, we can use it for clustering, manifold learning, or simply to look at the training examples that are nearest to a test example. This can help us to understand why the model made its predictions.

Keep in mind that a decision forest's distance is just one of many reasonable distance measures on a dataset. One of its advantages is that it allows comparing features on different scales and with different semantics.

In this notebook, we will train a model and use its distance to:

- Find training examples that are neighbors of a test example and use them to explain the model's predictions.

- Map all the examples onto an interactive two-dimensional plot (also known as a 2D manifold) and automatically detect two-dimensional clusters of examples that behave similarly.

- Apply hierarchical clustering to explain how the model works as a whole.

**The More You Know:** [Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman), the author of the [random forest](https://developers.google.com/machine-learning/glossary#random-forest) learning algorithm, [proposed](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#prox) a method to measure the *proximity* between two examples using a pre-trained Random Forest (RF) model. He describes this method as <i>"[...] one of the most useful tools in random forests"</i>. When using Random Forest models, this is the distance used by YDF.
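
To make the idea concrete, here is a minimal NumPy sketch of Breiman's proximity: the proximity of two examples is the fraction of trees in which they fall in the same leaf, and the distance is one minus that proximity. The function `breiman_distance` and the leaf-index matrix `toy_leaves` are hypothetical names invented for this illustration. The sketch is not YDF's implementation, and it assumes the leaf reached by each example in each tree is already known.

```python
import numpy as np


def breiman_distance(leaves_a, leaves_b):
    """Distance = 1 - fraction of trees in which two examples share a leaf.

    leaves_a: int array of shape [n_a, n_trees] with the leaf id reached by
      each example in each tree (hypothetical input for this sketch).
    leaves_b: int array of shape [n_b, n_trees], same convention.
    Returns a float array of shape [n_a, n_b].
    """
    leaves_a = np.asarray(leaves_a)
    leaves_b = np.asarray(leaves_b)
    # same_leaf[i, j, t] is True if example i of A and example j of B
    # reach the same leaf in tree t.
    same_leaf = leaves_a[:, None, :] == leaves_b[None, :, :]
    proximity = same_leaf.mean(axis=2)
    return 1.0 - proximity


# Three toy examples routed through three toy trees.
toy_leaves = np.array([
    [0, 3, 1],
    [0, 3, 2],
    [5, 4, 2],
])
toy_distances = breiman_distance(toy_leaves, toy_leaves)
print(toy_distances)
```

In this toy case, the first two examples share a leaf in two of the three trees, so their distance is 1/3, while the first and third examples never share a leaf, so their distance is 1.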


## Find closest training examples to a test example

Let's download a classification dataset.

In [None]:
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

# Print the first 5 training examples
train_ds.head(5)

We train a random forest on this dataset.

In [None]:
model = ydf.RandomForestLearner(label="income").train(train_ds)

We need to select an example to explain. Let's select the first example of the testing dataset.

In [None]:
selected_example = test_ds[:1]
selected_example

On this example, the model predicts:

In [None]:
model.predict(selected_example)

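The model outputs the probability of the positive class `>50K`, which is about 0.01 here. In other words, it predicts the negative class `<=50K` with probability $1 - 0.01 = 0.99$, i.e. 99%.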

Now, we compute the distance between the selected test example and all the training examples.

In [None]:
distances = model.distance(train_ds, selected_example).squeeze()

print("distances:",distances)

Let's find the five training examples with the smallest distance to our chosen example.

In [None]:
close_train_idxs = np.argsort(distances)[:5]
print("close_train_idxs:",close_train_idxs)

print("Closest training examples:")
train_ds.iloc[close_train_idxs]

**Observations:**

- For the chosen example, the model predicted class `<=50K`. For the five closest examples, the model had the same prediction.
- The closest examples share many feature values, such as `education`, `marital status`, `occupation`, `race`, and working between 37 and 40 `hours per week`. This explains well why these examples are close to each other.
- The examples' `age`s range between 30 and 40, meaning the model sees this age range as equivalent for those examples.
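
When inspecting neighbors, it can help to see the distance values alongside the indexes. The sketch below uses a hypothetical helper, `k_nearest`, that returns both. It runs on a small hand-written distance vector rather than on the model's output.

```python
import numpy as np


def k_nearest(distances, k=5):
    """Returns the indexes and distances of the k smallest entries.

    distances: 1D array of distances from one query example to a set of
      candidate examples.
    """
    distances = np.asarray(distances)
    k = min(k, distances.size)
    # A stable sort keeps tied examples in their original order.
    idxs = np.argsort(distances, kind="stable")[:k]
    return idxs, distances[idxs]


toy_vector = np.array([0.9, 0.1, 0.5, 0.1, 0.7])
toy_idxs, toy_dists = k_nearest(toy_vector, k=3)
print("indexes:", toy_idxs)
print("distances:", toy_dists)
```

In this notebook, calling `k_nearest(distances, k=5)` on the distance vector computed above should give indexes that can be passed to `train_ds.iloc[...]`, together with the corresponding distances.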


## Two-dimensional projections of the examples

Another use of the distance is to project the examples onto a two-dimensional plane. For that, we use [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).

In [None]:
from sklearn.manifold import TSNE # For 2d projections
from plotly.offline import iplot # For interactive plots
import plotly.graph_objs as go

In [None]:
# Pairwise distance between all testing examples
distances = model.distance(test_ds, test_ds)

In [None]:
# Find 2d projection
t_sne = TSNE(
    # Number of dimensions to display. 3d is also possible.
    n_components=2,
    # Control the shape of the projection. Higher values create more
    # distinct but also more collapsed clusters. Typical values are
    # between 5 and 50.
    perplexity=20,
    metric="precomputed",
    init="random",
    verbose=1,
    learning_rate="auto").fit_transform(distances)
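
With `metric="precomputed"`, scikit-learn's `TSNE` expects a square matrix of non-negative distances and does not accept `init="pca"`, which is why `init="random"` is used above. The sketch below is a hypothetical helper, `check_precomputed_distances`, that checks these requirements on a toy matrix, together with two properties we would expect from a pairwise distance matrix of a dataset with itself (zero diagonal and symmetry). The tolerance `atol` is an arbitrary choice for this illustration.

```python
import numpy as np


def check_precomputed_distances(matrix, atol=1e-6):
    """Reports basic properties of a candidate pairwise distance matrix."""
    matrix = np.asarray(matrix)
    is_square = matrix.ndim == 2 and matrix.shape[0] == matrix.shape[1]
    return {
        "square": is_square,
        "non_negative": bool((matrix >= 0).all()),
        # The two checks below only make sense for square matrices.
        "zero_diagonal": is_square
        and bool(np.allclose(np.diag(matrix), 0.0, atol=atol)),
        "symmetric": is_square
        and bool(np.allclose(matrix, matrix.T, atol=atol)),
    }


good_matrix = np.array([[0.0, 0.5], [0.5, 0.0]])
bad_matrix = np.array([[0.0, 0.5], [0.4, 0.0]])  # Not symmetric.
good_report = check_precomputed_distances(good_matrix)
bad_report = check_precomputed_distances(bad_matrix)
print(good_report)
print(bad_report)
```

Running `check_precomputed_distances(distances)` before fitting t-SNE is a cheap way to catch a mistake such as passing a non-square matrix computed between two different datasets.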

Let's create an interactive plot with the example features.

In [None]:
def example_to_html(example):
    return "<br>".join([f"<b>{k}:</b> {v}" for k, v in example.items()])


def interactive_plot(dataset, projections):
    colors = (dataset["income"] == ">50K").map(lambda x: ["red", "blue"][x])
    labels = list(dataset.apply(example_to_html, axis=1).values)
    args = {
        "data": [
            go.Scatter(
                x=projections[:, 0],
                y=projections[:, 1],
                text=labels,
                mode="markers",
                marker={"color": colors, "size": 3},
            )
        ],
        "layout": go.Layout(width=500, height=500, template="simple_white"),
    }
    iplot(args)


interactive_plot(test_ds, t_sne)

**Note:** Move your mouse over the plot to see the values of the examples.

The colors represent the labels. We can see clusters of uniform colors (clusters where all the labels are the same), and clusters of mixed colors (clusters where the model has difficulty making good predictions).

Can you make sense of those clusters?


## Cluster examples

We can also cluster examples. [Many methods](https://scikit-learn.org/stable/modules/clustering.html) are available. Let's use `AgglomerativeClustering`. 

In [None]:
from sklearn.cluster import AgglomerativeClustering

num_clusters = 6
clustering = AgglomerativeClustering(
    n_clusters=num_clusters,
    metric="precomputed",
    linkage="average",
).fit(distances)
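
Before looking at feature statistics, a quick summary of each cluster's size and share of positive labels can indicate which clusters are uniform and which are mixed. The sketch below uses a hypothetical helper, `cluster_label_summary`, and runs on toy cluster assignments rather than on the clustering above.

```python
import numpy as np


def cluster_label_summary(cluster_ids, is_positive):
    """Maps each cluster id to (number of examples, fraction of positives)."""
    cluster_ids = np.asarray(cluster_ids)
    is_positive = np.asarray(is_positive, dtype=bool)
    summary = {}
    for cluster in np.unique(cluster_ids):
        mask = cluster_ids == cluster
        summary[int(cluster)] = (
            int(mask.sum()),
            float(is_positive[mask].mean()),
        )
    return summary


toy_summary = cluster_label_summary(
    cluster_ids=[0, 0, 1, 1, 1, 2],
    is_positive=[True, False, False, False, False, True],
)
print(toy_summary)
```

In this notebook, the equivalent call would be `cluster_label_summary(clustering.labels_, test_ds["income"] == ">50K")`.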

Next, we print the statistics of the features and one example in each cluster.

In [None]:
from IPython.display import display

for cluster_idx in range(num_clusters):
    selected_examples = test_ds[clustering.labels_ == cluster_idx]
    print(f"Cluster #{cluster_idx} with {len(selected_examples)} examples")
    print("=============================")
    # Statistics of the numerical features in the cluster
    display(selected_examples.describe())
    # First example of the cluster
    display(selected_examples.iloc[:1])
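
The first example of a cluster is not necessarily typical of it. One possible alternative is to show the cluster's medoid, i.e. the member with the smallest total distance to the other members. The sketch below uses a hypothetical helper, `cluster_medoids`, and runs on a toy distance matrix. It assumes the matrix is a pairwise distance matrix indexed like the cluster labels.

```python
import numpy as np


def cluster_medoids(distances, labels):
    """Maps each cluster id to the index of its medoid.

    distances: square array [n, n] of pairwise distances.
    labels: array [n] of cluster ids.
    """
    distances = np.asarray(distances)
    labels = np.asarray(labels)
    medoids = {}
    for cluster in np.unique(labels):
        members = np.flatnonzero(labels == cluster)
        # Distances restricted to the members of this cluster.
        within = distances[np.ix_(members, members)]
        # The medoid minimizes the total distance to the other members.
        medoids[int(cluster)] = int(members[within.sum(axis=1).argmin()])
    return medoids


toy_matrix = np.array([
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.3, 1.0],
    [0.9, 0.3, 0.0, 1.0],
    [1.0, 1.0, 1.0, 0.0],
])
toy_labels = np.array([0, 0, 0, 1])
toy_medoids = cluster_medoids(toy_matrix, toy_labels)
print(toy_medoids)
```

In this notebook, `cluster_medoids(distances, clustering.labels_)` would return one index per cluster, and `test_ds.iloc[[index]]` would display the corresponding example.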