# Prediction Insights

## Overview

Howso Engine enables powerful predictions with complete attribution and detailed explanations to make learning from
and debugging your data and predictions as easy as possible. For more information on predictions with Howso Engine,
check out the [predictions user guide](https://docs.howso.com/user_guide/basics/predictions.html).

In [1]:
from pprint import pprint

import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

## Setup

The [basic workflow guide](https://docs.howso.com/user_guide/basics/basic_workflow.html) covers the steps in this section in more detail. This recipe focuses on the insights.

### Load Data and Create Trainee

In [2]:
df = pd.read_csv("../../data/iris/iris.tsv.gz", sep="\t", compression="gzip")
train_data = df.iloc[:-30]
new_data = df[~df.index.isin(train_data.index)]
features = infer_feature_attributes(train_data)

t = Trainee(features=features)

df

Unnamed: 0.1,Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,27,5.2,3.5,1.5,0.2,Iris-setosa
1,14,5.8,4.0,1.2,0.2,Iris-setosa
2,25,5.0,3.0,1.6,0.2,Iris-setosa
3,82,5.8,2.7,3.9,1.2,Iris-versicolor
4,97,6.2,2.9,4.3,1.3,Iris-versicolor
...,...,...,...,...,...,...
145,126,6.2,2.8,4.8,1.8,Iris-virginica
146,124,6.7,3.3,5.7,2.1,Iris-virginica
147,78,6.0,2.9,4.5,1.5,Iris-versicolor
148,125,7.2,3.2,6.0,1.8,Iris-virginica


### Train, Analyze, and React

In [3]:
t.train(train_data)
t.analyze()
t.react_into_features(similarity_conviction=True)

reaction = t.react(
    contexts=new_data,
    context_features=["sepal length", "sepal width", "petal length", "petal width"],
    action_features=["class"],
    details={
        "influential_cases": True,
        "similarity_conviction": True,
        "feature_contributions": True,
        "feature_residuals": True,
        "robust_influences": True,
        "robust_residuals": True,
        "local_case_feature_residual_convictions": True,
        "categorical_action_probabilities": True,
    }
)

Note that, unlike in the basic workflow guide, the `react` call here requests several `details`. These details supply the data behind the insights explored in the rest of this recipe.

For more information, see [the API documentation for `Trainee.react()`](https://docs.howso.com/api_reference/_autosummary/howso.engine.html#howso.engine.Trainee.react).

### Inspect the Predictions

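Howso Engine can make accurate predictions even on small datasets. The plot below overlays the predicted classes of the 30 held-out cases (stars) on the training data.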

In [4]:
train_data = train_data.astype({"class": str})
predicted_data = pd.concat([new_data.reset_index(drop=True).drop(columns="class"), reaction["action"]], axis=1)
predicted_data = predicted_data.astype({"class": str})

cmap = px.colors.qualitative.D3
fig = go.Figure()
for i, (label, group) in enumerate(train_data.groupby("class")):
    fig.add_trace(go.Scatter(
        x=group["petal length"],
        y=group["petal width"],
        mode="markers",
        name=label,
        marker=dict(color=cmap[i], opacity=0.75),
        legendgroup="trained",
        legendgrouptitle_text="Trained class",
    ))

for i, (label, group) in enumerate(predicted_data.groupby("class")):
    fig.add_trace(go.Scatter(
        x=group["petal length"],
        y=group["petal width"],
        mode="markers",
        marker=dict(size=12, symbol="star", color=cmap[i], opacity=0.75),
        name=label,
        legendgroup="predicted",
        legendgrouptitle_text="Predicted class",
        hovertext=group.index,
    ))

fig.update_layout(
    xaxis_title="Petal Length",
    yaxis_title="Petal Width",
    width=1250,
    title="Trained and Predicted Values"
)
fig.show()

For categorical action features, the `categorical_action_probabilities` detail adds further context to each prediction. It can highlight cases that lie near the border between two classes, as some of the points above do. The closer a case is to a class border, the
more mixed its categorical action probabilities tend to be.

In [5]:
pprint(reaction["details"]["categorical_action_probabilities"], compact=True)

[{'class': {'Iris-versicolor': 1}},
 {'class': {'Iris-versicolor': 0.1890321448500335,
            'Iris-virginica': 0.8109678551499665}},
 {'class': {'Iris-virginica': 1}}, {'class': {'Iris-virginica': 1}},
 {'class': {'Iris-versicolor': 0.3946197371788092,
            'Iris-virginica': 0.6053802628211907}},
 {'class': {'Iris-versicolor': 1}}, {'class': {'Iris-setosa': 1}},
 {'class': {'Iris-virginica': 1}},
 {'class': {'Iris-versicolor': 0.8057564626950173,
            'Iris-virginica': 0.19424353730498264}},
 {'class': {'Iris-versicolor': 1}}, {'class': {'Iris-virginica': 1}},
 {'class': {'Iris-versicolor': 1}}, {'class': {'Iris-setosa': 1}},
 {'class': {'Iris-versicolor': 0.20118009066416837,
            'Iris-virginica': 0.7988199093358316}},
 {'class': {'Iris-versicolor': 1}}, {'class': {'Iris-setosa': 1}},
 {'class': {'Iris-setosa': 1}},
 {'class': {'Iris-versicolor': 0.2046895097802475,
            'Iris-virginica': 0.7953104902197524}},
 {'class': {'Iris-virginica': 1}},
 ...
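
One way to act on these probabilities is to flag predictions whose top class probability falls below a cutoff. The sketch below is illustrative only: it works on the first five entries copied from the output above, and both the helper name `borderline_cases` and the 0.9 cutoff are assumptions rather than part of the Howso API.

```python
# First five entries copied from the output above.
probabilities = [
    {"class": {"Iris-versicolor": 1}},
    {"class": {"Iris-versicolor": 0.1890321448500335,
               "Iris-virginica": 0.8109678551499665}},
    {"class": {"Iris-virginica": 1}},
    {"class": {"Iris-virginica": 1}},
    {"class": {"Iris-versicolor": 0.3946197371788092,
               "Iris-virginica": 0.6053802628211907}},
]


def borderline_cases(probs, feature="class", cutoff=0.9):
    """Return (index, top_class, top_probability) for low-confidence cases."""
    flagged = []
    for i, entry in enumerate(probs):
        # Pick the class with the highest probability for this case.
        top_class, top_prob = max(entry[feature].items(), key=lambda kv: kv[1])
        if top_prob < cutoff:
            flagged.append((i, top_class, top_prob))
    return flagged


flagged = borderline_cases(probabilities)
for index, label, prob in flagged:
    print(f"case {index}: {label} ({prob:.3f})")
```

In the notebook, the same helper could be applied to the full list in `reaction["details"]["categorical_action_probabilities"]`.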

### Insight 1: Which Cases Contributed?

Howso provides complete attribution for every prediction, showing exactly which cases influenced it. This can be used to understand or debug predictions. For instance,
we can inspect the influential cases for case `13`, which was on the decision boundary in the plot above. The influential cases have a mix of class values, and none
stands out in terms of influence weight:

In [6]:
inf_cases_13 = pd.DataFrame(reaction["details"]["influential_cases"][13])
inf_cases_13

Unnamed: 0,sepal width,.session_training_index,petal length,petal width,.influence_weight,sepal length,.session,class
0,2.8,99,5.1,1.5,0.207779,6.3,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-virginica
1,2.5,107,4.9,1.5,0.20118,6.3,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-versicolor
2,2.7,115,4.9,1.8,0.1995,6.3,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-virginica
3,3.0,8,5.1,1.8,0.196327,5.9,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-virginica
4,2.7,41,5.1,1.9,0.195214,5.8,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-virginica


Compare that to the influential cases for case `20`, which is firmly in the center of the isolated cluster. Here, all influential cases share a single class value (`Iris-setosa`).

In [7]:
inf_cases_20 = pd.DataFrame(reaction["details"]["influential_cases"][20])
inf_cases_20

Unnamed: 0,sepal width,.session_training_index,petal length,petal width,.influence_weight,sepal length,.session,class
0,3.5,48,1.4,0.3,0.201282,5.1,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-setosa
1,3.4,63,1.6,0.4,0.200762,5.0,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-setosa
2,3.4,9,1.7,0.2,0.199698,5.4,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-setosa
3,3.3,116,1.7,0.5,0.199316,5.1,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-setosa
4,3.4,43,1.4,0.2,0.198941,5.2,e0f30f66-05ab-4817-a8a8-01d6de3255b9,Iris-setosa
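
To compare the two tables numerically, the influence weights can be summed per class. The sketch below uses the `class` and `.influence_weight` values copied from the outputs above; the helper name `weight_by_class` is hypothetical. For case `13`, the sums come out close to the categorical action probabilities printed earlier for that case (about 0.80 `Iris-virginica` and 0.20 `Iris-versicolor`), although this recipe does not state that the probabilities are always derived this way.

```python
from collections import defaultdict

# (class, .influence_weight) pairs copied from the tables above.
influential_13 = [
    ("Iris-virginica", 0.207779),
    ("Iris-versicolor", 0.20118),
    ("Iris-virginica", 0.1995),
    ("Iris-virginica", 0.196327),
    ("Iris-virginica", 0.195214),
]
influential_20 = [
    ("Iris-setosa", 0.201282),
    ("Iris-setosa", 0.200762),
    ("Iris-setosa", 0.199698),
    ("Iris-setosa", 0.199316),
    ("Iris-setosa", 0.198941),
]


def weight_by_class(cases):
    """Sum influence weights per class label."""
    totals = defaultdict(float)
    for label, weight in cases:
        totals[label] += weight
    return dict(totals)


weights_13 = weight_by_class(influential_13)
weights_20 = weight_by_class(influential_20)
print(weights_13)
print(weights_20)
```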


The influential cases also support further insights, such as identifying anomalous cases among them using `similarity_conviction`.
This can help identify data that make predictions noisier, or indicate what type of data should be collected to improve predictive power in the future.

In [8]:
inf_case_indices = inf_cases_13[[".session", ".session_training_index"]].values.tolist()

anom_df = t.get_cases(
    case_indices=inf_case_indices,
    features=["sepal length", "sepal width", "petal length", "petal width", "class", "similarity_conviction"]
)

anom_df.sort_values(by="similarity_conviction")

Unnamed: 0,sepal length,sepal width,petal length,petal width,class,similarity_conviction
1,6.3,2.5,4.9,1.5,Iris-versicolor,0.973029
0,6.3,2.8,5.1,1.5,Iris-virginica,0.974979
4,5.8,2.7,5.1,1.9,Iris-virginica,1.00847
3,5.9,3.0,5.1,1.8,Iris-virginica,1.008903
2,6.3,2.7,4.9,1.8,Iris-virginica,1.018925


In [9]:
inf_case_indices = inf_cases_20[[".session", ".session_training_index"]].values.tolist()

anom_df = t.get_cases(
    case_indices=inf_case_indices,
    features=["sepal length", "sepal width", "petal length", "petal width", "class", "similarity_conviction"],
)

anom_df.sort_values(by="similarity_conviction")

Unnamed: 0,sepal length,sepal width,petal length,petal width,class,similarity_conviction
3,5.1,3.3,1.7,0.5,Iris-setosa,0.973216
2,5.4,3.4,1.7,0.2,Iris-setosa,0.994447
1,5.0,3.4,1.6,0.4,Iris-setosa,1.004502
0,5.1,3.5,1.4,0.3,Iris-setosa,1.016745
4,5.2,3.4,1.4,0.2,Iris-setosa,1.017943


### Insight 2: Which Features Contributed?

In addition to attributing predictions to cases, Howso provides robust feature contributions that explain which features contributed to each prediction.
For case `13`, `petal width` and `petal length` account for the majority of the contribution to the
prediction, whereas for case `20` they are only slightly higher than the contributions of the other features. This could indicate
that different features matter more for predicting different classes within this dataset, and that focusing on those features would be prudent.

In [10]:
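from IPython.display import display  # implicit in notebooks; explicit import for other environments

fcs_13 = pd.DataFrame(reaction["details"]["feature_contributions"][13:14], index=[13])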
fcs_20 = pd.DataFrame(reaction["details"]["feature_contributions"][20:21], index=[20])

display(fcs_13)
display(fcs_20)

Unnamed: 0,sepal width,petal length,petal width,sepal length
13,0.070531,0.219668,0.413826,0.062466


Unnamed: 0,sepal width,petal length,petal width,sepal length
20,0.099688,0.161151,0.164165,0.121481
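
To put a number on the comparison, each contribution can be expressed as a share of the case's total. This is a rough sketch on the values copied from the output above; the helper `petal_share` is hypothetical, and treating the contributions as additive shares is a simplification made for illustration.

```python
# Feature contributions copied from the output above.
contributions = {
    13: {"sepal width": 0.070531, "petal length": 0.219668,
         "petal width": 0.413826, "sepal length": 0.062466},
    20: {"sepal width": 0.099688, "petal length": 0.161151,
         "petal width": 0.164165, "sepal length": 0.121481},
}


def petal_share(case_contributions):
    """Fraction of the total contribution that comes from the petal features."""
    total = sum(case_contributions.values())
    petal = case_contributions["petal length"] + case_contributions["petal width"]
    return petal / total


shares = {case: petal_share(values) for case, values in contributions.items()}
for case, share in shares.items():
    print(f"case {case}: petal features account for {share:.1%}")
```

With these values, the petal features account for roughly 83% of the total for case `13` and roughly 60% for case `20`.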


### Insight 3: How Certain is the Prediction?

[Residuals](https://docs.howso.com/user_guide/basic_capabilities/residuals.html) can characterize the uncertainty of the data around the prediction. They tell us which features are hard to predict in the region of the data around each case we're predicting.
If a prediction is less accurate than expected, the residuals can show which features were noisy and may have contributed to the problem.

In [11]:
pd.DataFrame(reaction["details"]["feature_residuals"])

Unnamed: 0,sepal width,petal length,petal width,sepal length,class
0,0.112866,0.386771,0.241644,0.371817,0.323734
1,0.267267,0.543147,0.226509,0.295007,0.614795
2,0.15383,0.342449,0.247924,0.327834,0.236677
3,0.246963,0.668989,0.388309,0.385393,0.435977
4,0.129884,0.277258,0.214948,0.302588,0.230014
5,0.239334,0.32881,0.201537,0.345923,0.155144
6,0.24518,0.762376,0.41092,0.474553,0.173611
7,0.182487,0.263886,0.143381,0.238102,0.164148
8,0.102455,0.182326,0.071838,0.246158,0.136051
9,0.273972,0.26143,0.133425,0.309792,0.133051


We can also use [`residual conviction`](https://docs.howso.com/user_guide/basic_capabilities/conviction.html#prediction-residual-conviction) to determine which features are uncertain in a scale-invariant manner,
which is useful when comparing features that have different scales.

In [12]:
pd.DataFrame(reaction["details"]["local_case_feature_residual_convictions"])

Unnamed: 0,sepal width,petal length,petal width,sepal length,class
0,1.699643,0.983376,1.029561,0.832451,2.142331
1,0.551831,0.896642,1.223313,1.474671,1.014693
2,5.324531,0.597609,0.773709,0.888187,0.684881
3,0.625588,0.569873,1.213294,0.625019,0.737548
4,0.605051,0.578764,0.591898,1.071607,0.7742
5,0.763777,3.423105,1.36224,1.357466,1.113531
6,0.651159,0.627725,0.815623,2.570424,0.426374
7,0.622999,1.445718,1.115362,1.109143,1.01173
8,0.399236,1.309102,0.9735,0.577948,1.453606
9,0.302498,2.428158,1.000755,1.701415,1.185692
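
Because these values are scale-invariant, they can be compared across features directly. The sketch below ranks entries by how far they are from 1 on a log scale, using the first three rows copied from the output above. The helper `farthest_from_one` is hypothetical, and the sketch deliberately makes no claim about whether a high or low value is preferable; see the linked conviction guide for the interpretation.

```python
import math

# First three rows copied from the output above.
convictions = [
    {"sepal width": 1.699643, "petal length": 0.983376, "petal width": 1.029561,
     "sepal length": 0.832451, "class": 2.142331},
    {"sepal width": 0.551831, "petal length": 0.896642, "petal width": 1.223313,
     "sepal length": 1.474671, "class": 1.014693},
    {"sepal width": 5.324531, "petal length": 0.597609, "petal width": 0.773709,
     "sepal length": 0.888187, "class": 0.684881},
]


def farthest_from_one(rows, top_n=3):
    """Return (case, feature, value) tuples sorted by |log(value)|, descending."""
    entries = [
        (case, feature, value)
        for case, row in enumerate(rows)
        for feature, value in row.items()
    ]
    # |log(value)| treats 2.0 and 0.5 as equally far from 1.
    entries.sort(key=lambda entry: abs(math.log(entry[2])), reverse=True)
    return entries[:top_n]


top = farthest_from_one(convictions)
for case, feature, value in top:
    print(f"case {case}, {feature}: {value:.3f}")
```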


### Insight 4: How Anomalous are the Predicted Cases?

By getting the [`similarity conviction`](https://docs.howso.com/user_guide/basic_capabilities/conviction.html#similarity-conviction) of the cases that we predict, we can determine which of them are anomalous relative to the trained data.
This could help to highlight cases that are unusually difficult or easy to predict, as well as discover potentially malicious or poisoned cases.

In [13]:
pprint(reaction["details"]["similarity_conviction"], compact=True)

[1.260835585599987, 1.227050047339045, 1.228695629346302, 1.2004965983246094,
 1.1157927709242215, 1.2554462857202653, 1.2458724147103153, 1.2189325533603856,
 1.14063203996104, 1.2225394771011675, 1.2749788292171218, 1.2004284342383167,
 1.2492500480477184, 1.2487892256961939, 1.2200623536345236, 1.232008421897604,
 1.2162095101968016, 1.267347152937185, 1.1469941030810846, 1.0109209497664502,
 1.2063451120669346, 1.2543503045201094, 1.1488859912924438, 1.2350786536296818,
 1.240763441000068, 1.27176056913252, 1.2424706351202268, 1.2415632688477891,
 1.1933066876452634, 1.2212768946158383]
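
A quick way to triage these values is to sort the predicted cases by similarity conviction and look at the lowest ones first. The sketch below uses the values copied from the output above, rounded to four decimals for brevity. The helper name `lowest_conviction` is hypothetical, and the sketch assumes that a lower similarity conviction marks a case as more anomalous relative to the trained data.

```python
# Similarity convictions copied from the output above, rounded to 4 decimals.
similarity_conviction = [
    1.2608, 1.2271, 1.2287, 1.2005, 1.1158, 1.2554, 1.2459, 1.2189,
    1.1406, 1.2225, 1.2750, 1.2004, 1.2493, 1.2488, 1.2201, 1.2320,
    1.2162, 1.2673, 1.1470, 1.0109, 1.2063, 1.2544, 1.1489, 1.2351,
    1.2408, 1.2718, 1.2425, 1.2416, 1.1933, 1.2213,
]


def lowest_conviction(values, n=3):
    """Return (case_index, conviction) pairs for the n lowest values."""
    ranked = sorted(enumerate(values), key=lambda pair: pair[1])
    return ranked[:n]


lowest = lowest_conviction(similarity_conviction)
for case, value in lowest:
    print(f"case {case}: {value}")
```

Here the lowest value belongs to case `19`, though at about 1.01 it is still above 1.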
