# Prediction Insights

## Overview

Howso Engine enables powerful predictions with complete attribution and detailed explanations to make learning from
and debugging your data and predictions as easy as possible. For more information on predictions with Howso Engine,
check out the [predictions user guide](https://docs.howso.com/user_guide/basics/predictions.html).

In [1]:
from pprint import pprint

import pandas as pd
from pmlb import fetch_data
import plotly.graph_objects as go
import plotly.express as px

from howso.engine import Trainee
from howso.utilities import infer_feature_attributes
import howso.visuals as vis

## Setup

The [basic workflow guide](https://docs.howso.com/user_guide/basics/basic_workflow.html) covers the steps in this section in more detail. This recipe focuses on the insights.

### Load Data and Create Trainee

In [2]:
df = fetch_data("iris", local_cache_dir="data")
train_data = df.iloc[:-30]
new_data = df[~df.index.isin(train_data.index)]
features = infer_feature_attributes(train_data)

t = Trainee(features=features)


### Train, Analyze, and React

In [3]:
t.train(train_data)
t.analyze()
t.react_into_features(similarity_conviction=True)

reaction = t.react(
    contexts=new_data,
    context_features=["sepal-length", "sepal-width", "petal-length", "petal-width"],
    action_features=["target"],
    details={
        "influential_cases": True,
        "similarity_conviction": True,
        "feature_contributions": True,
        "feature_residuals": True,
        "robust_influences": True,
        "robust_residuals": True,
        "local_case_feature_residual_convictions": True,
        "categorical_action_probabilities": True,
    }
)

Note that, unlike in the basic workflow guide, we include several `details` in the react call. These details enable the insights explored in the rest of this recipe.

For more information, see [the API documentation for `Trainee.react()`](https://docs.howso.com/api_reference/_autosummary/howso.engine.html#howso.engine.Trainee.react).

### Inspect the Predictions

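Howso Engine can predict accurately even on small datasets such as this one. The plot below overlays the predicted cases (stars) on the training data, colored by target.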

In [5]:
train_data = train_data.astype({"target": str})
predicted_data = pd.concat([new_data.reset_index(drop=True).drop(columns="target"), reaction["action"]], axis=1)
predicted_data = predicted_data.astype({"target": str})

cmap = px.colors.qualitative.D3
fig = go.Figure()
for label, group in train_data.groupby("target"):
    fig.add_trace(go.Scatter(
        x=group["petal-length"],
        y=group["petal-width"],
        mode="markers",
        name=label,
        marker=dict(color=cmap[int(label)], opacity=0.75),
        legendgroup="trained",
        legendgrouptitle_text="Trained Target",
    ))

for label, group in predicted_data.groupby("target"):
    fig.add_trace(go.Scatter(
        x=group["petal-length"],
        y=group["petal-width"],
        mode="markers",
        marker=dict(size=12, symbol="star", color=cmap[int(label)], opacity=0.75),
        name=label,
        legendgroup="predicted",
        legendgrouptitle_text="Predicted Target",
        hovertext=group.index,
    ))

fig.update_layout(
    xaxis_title="Petal Length",
    yaxis_title="Petal Width",
    width=1250,
    title="Trained and Predicted Values"
)
fig.show()

For categorical action features, the `categorical_action_probabilities` detail helps explain each prediction. It highlights cases that
lie near the border between two classes, as some of the points in the plot above do. The closer a case is to a class border, the
more mixed its categorical action probabilities tend to be.

In [6]:
pprint(reaction["details"]["categorical_action_probabilities"], compact=True)

[{'target': {'2': 1}}, {'target': {'2': 1}}, {'target': {'2': 1}},
 {'target': {'2': 1}},
 {'target': {'1': 0.21046908640357165, '2': 0.7895309135964284}},
 {'target': {'1': 1}}, {'target': {'1': 1}}, {'target': {'1': 1}},
 {'target': {'1': 1}}, {'target': {'1': 1}}, {'target': {'0': 1}},
 {'target': {'0': 1}}, {'target': {'0': 1}}, {'target': {'0': 1}},
 {'target': {'0': 1}}, {'target': {'2': 1}}, {'target': {'2': 1}},
 {'target': {'2': 1}}, {'target': {'2': 1}}, {'target': {'2': 1}},
 {'target': {'1': 1}}, {'target': {'1': 1}},
 {'target': {'1': 0.9011716794111256, '2': 0.0988283205888744}},
 {'target': {'1': 0.9080591869001543, '2': 0.09194081309984566}},
 {'target': {'1': 0.41668432138164674, '2': 0.5833156786183532}},
 {'target': {'0': 1}}, {'target': {'0': 1}}, {'target': {'0': 1}},
 {'target': {'0': 1}}, {'target': {'0': 1}}]
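
As a rough sketch of how this detail might be used, the snippet below flags predictions whose top class probability falls below a cutoff. The helper name `find_borderline_cases` and the `0.95` threshold are illustrative assumptions, not part of the Howso API; the sample entries are copied from the output above.

```python
def find_borderline_cases(probabilities, action_feature="target", threshold=0.95):
    """Return (index, top_probability) pairs for cases whose most likely
    class has a probability below `threshold`."""
    borderline = []
    for index, case in enumerate(probabilities):
        top_probability = max(case[action_feature].values())
        if top_probability < threshold:
            borderline.append((index, top_probability))
    return borderline


# Three consecutive entries copied from the output above (cases 3, 4, and 5).
sample = [
    {"target": {"2": 1}},
    {"target": {"1": 0.21046908640357165, "2": 0.7895309135964284}},
    {"target": {"1": 1}},
]

for index, top_probability in find_borderline_cases(sample):
    print(f"sample entry {index}: top class probability {top_probability:.3f}")
```

Passing `reaction["details"]["categorical_action_probabilities"]` instead of `sample` would scan every prediction. With the output shown above and this threshold, that would flag cases `4`, `22`, `23`, and `24`.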


### Insight 1: Which Cases Contributed?

Howso provides complete attribution for every prediction, showing exactly which cases influenced it. This can be used to understand or debug predictions. For instance,
we can inspect the influential cases for case `24`, which lies on the decision boundary in the plot above. The influential cases have a mix of target values, and none
stands out in terms of influence weight:

In [13]:
inf_cases_24 = pd.DataFrame(reaction["details"]["influential_cases"][24])
inf_cases_24

| | petal-width | .session_training_index | .influence_weight | petal-length | target | sepal-length | .session | sepal-width |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.5 | 22 | 0.156602 | 4.9 | 1 | 6.9 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.1 |
| 1 | 1.5 | 36 | 0.147282 | 4.7 | 1 | 6.7 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.1 |
| 2 | 1.8 | 77 | 0.130476 | 5.5 | 2 | 6.5 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.0 |
| 3 | 1.8 | 32 | 0.116727 | 5.5 | 2 | 6.4 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.1 |
| 4 | 1.8 | 47 | 0.113298 | 4.9 | 2 | 6.3 | 28879251-80f5-4db9-a45f-14c019ed894d | 2.7 |
| 5 | 1.5 | 17 | 0.113021 | 5.1 | 2 | 6.3 | 28879251-80f5-4db9-a45f-14c019ed894d | 2.8 |
| 6 | 1.5 | 24 | 0.1128 | 4.6 | 1 | 6.5 | 28879251-80f5-4db9-a45f-14c019ed894d | 2.8 |
| 7 | 1.8 | 2 | 0.109794 | 4.8 | 2 | 6.2 | 28879251-80f5-4db9-a45f-14c019ed894d | 2.8 |
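
One way to connect this table to the prediction is to total the influence weights per target value. The sketch below does this with values copied from the table above. In this particular output the totals agree with the categorical action probabilities reported for case `24` (about 0.417 for class `1` and 0.583 for class `2`); whether that relationship holds in general is an assumption here, not something this recipe documents.

```python
from collections import defaultdict

# (target, .influence_weight) pairs copied from the influential cases of case 24.
influential_cases = [
    ("1", 0.156602), ("1", 0.147282), ("2", 0.130476), ("2", 0.116727),
    ("2", 0.113298), ("2", 0.113021), ("1", 0.1128), ("2", 0.109794),
]

# Sum the influence weights separately for each target value.
weight_by_target = defaultdict(float)
for target, weight in influential_cases:
    weight_by_target[target] += weight

for target, total in sorted(weight_by_target.items()):
    print(f"target {target}: total influence weight {total:.4f}")
```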


Compare that to the influential cases for case `29`, which sits firmly in the center of the isolated cluster. Here all influential cases share a single target value.

In [15]:
inf_cases_29 = pd.DataFrame(reaction["details"]["influential_cases"][29])
inf_cases_29

| | petal-width | .session_training_index | .influence_weight | petal-length | target | sepal-length | .session | sepal-width |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 115 | 0.173459 | 1.7 | 0 | 5.1 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.3 |
| 1 | 0.3 | 27 | 0.148637 | 1.4 | 0 | 5.1 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.5 |
| 2 | 0.3 | 56 | 0.128717 | 1.3 | 0 | 5.0 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.5 |
| 3 | 0.2 | 89 | 0.123908 | 1.5 | 0 | 5.0 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.4 |
| 4 | 0.4 | 55 | 0.109599 | 1.5 | 0 | 5.4 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.4 |
| 5 | 0.2 | 14 | 0.107736 | 1.4 | 0 | 5.0 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.3 |
| 6 | 0.2 | 43 | 0.104767 | 1.4 | 0 | 5.1 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.5 |
| 7 | 0.2 | 101 | 0.103177 | 1.5 | 0 | 5.2 | 28879251-80f5-4db9-a45f-14c019ed894d | 3.5 |


With the influential cases we can derive additional insights, such as identifying anomalous cases among them using `similarity_conviction`.
This can help identify data that make predictions noisier, or indicate what type of data should be collected to improve predictive power in the future.

In [16]:
inf_case_indices = inf_cases_24[[".session", ".session_training_index"]].values.tolist()

anom_df = t.get_cases(
    case_indices=inf_case_indices,
    features=["sepal-length", "sepal-width", "petal-length", "petal-width", "target", "similarity_conviction"]
)

anom_df.sort_values(by="similarity_conviction")

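| | sepal-length | sepal-width | petal-length | petal-width | target | similarity_conviction |
|---|---|---|---|---|---|---|
| 5 | 6.3 | 2.8 | 5.1 | 1.5 | 2 | 0.894774 |
| 0 | 6.9 | 3.1 | 4.9 | 1.5 | 1 | 0.95226 |
| 6 | 6.5 | 2.8 | 4.6 | 1.5 | 1 | 0.99231 |
| 1 | 6.7 | 3.1 | 4.7 | 1.5 | 1 | 1.167468 |
| 3 | 6.4 | 3.1 | 5.5 | 1.8 | 2 | 1.263361 |
| 2 | 6.5 | 3.0 | 5.5 | 1.8 | 2 | 1.306136 |
| 4 | 6.3 | 2.7 | 4.9 | 1.8 | 2 | 1.354666 |
| 7 | 6.2 | 2.8 | 4.8 | 1.8 | 2 | 1.389361 |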


In [20]:
inf_case_indices = inf_cases_29[[".session", ".session_training_index"]].values.tolist()

anom_df = t.get_cases(
    case_indices=inf_case_indices,
    features=["sepal-length", "sepal-width", "petal-length", "petal-width", "target", "similarity_conviction"],
)

anom_df.sort_values(by="similarity_conviction")

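| | sepal-length | sepal-width | petal-length | petal-width | target | similarity_conviction |
|---|---|---|---|---|---|---|
| 0 | 5.1 | 3.3 | 1.7 | 0.5 | 0 | 0.536097 |
| 4 | 5.4 | 3.4 | 1.5 | 0.4 | 0 | 0.72731 |
| 5 | 5.0 | 3.3 | 1.4 | 0.2 | 0 | 1.146473 |
| 2 | 5.0 | 3.5 | 1.3 | 0.3 | 0 | 1.174412 |
| 3 | 5.0 | 3.4 | 1.5 | 0.2 | 0 | 1.275603 |
| 1 | 5.1 | 3.5 | 1.4 | 0.3 | 0 | 1.345639 |
| 7 | 5.2 | 3.5 | 1.5 | 0.2 | 0 | 1.436891 |
| 6 | 5.1 | 3.5 | 1.4 | 0.2 | 0 | 1.504348 |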


### Insight 2: Which Features Contributed?

In addition to providing attribution to cases, Howso also provides robust feature contributions to explain which features contributed to each prediction.
When inspecting the feature contributions for case `24`, we see that `petal-width` and `petal-length` provide the majority of the contribution to the
prediction, whereas for case `29` they have only slightly higher contributions than the other features. This could indicate
that different features matter more for predicting different classes in this dataset, and that focusing on those features would be prudent.

In [30]:
fcs_24 = pd.DataFrame(reaction["details"]["feature_contributions"][24:25], index=[24])
fcs_29 = pd.DataFrame(reaction["details"]["feature_contributions"][29:30], index=[29])

display(fcs_24)
display(fcs_29)


| | petal-width | petal-length | sepal-length | sepal-width |
|---|---|---|---|---|
| 24 | 0.396829 | 0.290322 | 0.07591 | 0.113755 |

| | petal-width | petal-length | sepal-length | sepal-width |
|---|---|---|---|---|
| 29 | 0.18456 | 0.182821 | 0.142534 | 0.118256 |
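
To put a number on "the majority of the contribution", the sketch below computes the share of each case's total contribution that comes from the two petal features, using the values shown above. Normalizing by the sum of absolute contributions is an illustrative choice made here, not a Howso convention, and `contribution_share` is a hypothetical helper.

```python
# Feature contributions copied from the two outputs above, keyed by case index.
contributions = {
    24: {"petal-width": 0.396829, "petal-length": 0.290322,
         "sepal-length": 0.07591, "sepal-width": 0.113755},
    29: {"petal-width": 0.18456, "petal-length": 0.182821,
         "sepal-length": 0.142534, "sepal-width": 0.118256},
}


def contribution_share(case_contributions, features):
    """Fraction of the summed absolute contributions that comes from `features`."""
    total = sum(abs(value) for value in case_contributions.values())
    selected = sum(abs(case_contributions[name]) for name in features)
    return selected / total


petal_features = ["petal-width", "petal-length"]
shares = {
    case_index: contribution_share(case_contributions, petal_features)
    for case_index, case_contributions in contributions.items()
}

for case_index, share in shares.items():
    print(f"case {case_index}: petal features account for {share:.1%} of the total contribution")
```

With these values the petal features account for roughly 78% of the total for case `24` and roughly 58% for case `29`.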


### Insight 3: How Certain is the Prediction?

Residuals characterize the uncertainty of the data around the prediction. They tell us which features are hard to predict in the region of the data around each case we're predicting.
If a prediction is less accurate than expected, the residuals can indicate which features were noisy and may have contributed to the problem.

In [32]:
pd.DataFrame(reaction["details"]["feature_residuals"])

| | petal-width | petal-length | target | sepal-length | sepal-width |
|---|---|---|---|---|---|
| 0 | 0.233953 | 0.233947 | 0.094208 | 0.358054 | 0.180603 |
| 1 | 0.314069 | 0.538957 | 0.356791 | 0.300647 | 0.216106 |
| 2 | 0.313955 | 0.515043 | 0.338897 | 0.314033 | 0.296043 |
| 3 | 0.380251 | 0.382194 | 0.09375 | 0.379551 | 0.16923 |
| 4 | 0.275574 | 0.445493 | 0.254714 | 0.363986 | 0.197646 |
| 5 | 0.179338 | 0.318666 | 0.102229 | 0.300717 | 0.233942 |
| 6 | 0.189839 | 0.447 | 0.318182 | 0.421429 | 0.409611 |
| 7 | 0.140272 | 0.378 | 0.180952 | 0.37141 | 0.281185 |
| 8 | 0.194962 | 0.360718 | 0.258733 | 0.299183 | 0.182117 |
| 9 | 0.18661 | 0.368168 | 0.18144 | 0.224058 | 0.160727 |


We can also use the residual conviction (the `local_case_feature_residual_convictions` detail) to determine which features are uncertain in a
scale-invariant manner, which is useful when comparing features with different scales.

In [35]:
pd.DataFrame(reaction["details"]["local_case_feature_residual_convictions"])

| | petal-width | petal-length | target | sepal-length | sepal-width |
|---|---|---|---|---|---|
| 0 | 1.45771 | 0.770319 | 1.235247 | 1.24584 | 0.665037 |
| 1 | 2.311301 | 2.467796 | 1.300407 | 0.982263 | 0.865275 |
| 2 | 1.830438 | 1.203424 | 1.271511 | 3.790964 | 2.037936 |
| 3 | 1.41811 | 1.059149 | 0.74834 | 1.177897 | 0.609685 |
| 4 | 1.062281 | 0.971513 | 0.85077 | 1.63111 | 1.102028 |
| 5 | 0.741677 | 1.202031 | 1.162123 | 3.126954 | 1.254369 |
| 6 | 1.206962 | 0.683119 | 0.432293 | 0.701598 | 1.157099 |
| 7 | 1.732242 | 0.366978 | 1.476163 | 0.760095 | 1.121176 |
| 8 | 0.697083 | 1.623602 | 1.292816 | 1.362158 | 0.656937 |
| 9 | 0.98887 | 1.344432 | 0.670218 | 1.689911 | 1.285247 |
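
Because these values are scale-invariant, a single cutoff can be applied across all features. The sketch below lists the case and feature pairs whose residual conviction falls below a cutoff. It assumes that lower conviction means the feature is more uncertain than expected around that case; the `0.5` cutoff and the helper `low_conviction_features` are illustrative, and the rows are copied from the output above.

```python
# Rows 5 to 7 copied from the output above, keyed by case index.
residual_convictions = {
    5: {"petal-width": 0.741677, "petal-length": 1.202031, "target": 1.162123,
        "sepal-length": 3.126954, "sepal-width": 1.254369},
    6: {"petal-width": 1.206962, "petal-length": 0.683119, "target": 0.432293,
        "sepal-length": 0.701598, "sepal-width": 1.157099},
    7: {"petal-width": 1.732242, "petal-length": 0.366978, "target": 1.476163,
        "sepal-length": 0.760095, "sepal-width": 1.121176},
}


def low_conviction_features(convictions, threshold=0.5):
    """Return (case_index, feature, conviction) tuples below `threshold`,
    sorted from lowest conviction to highest."""
    flagged = [
        (case_index, feature, value)
        for case_index, row in convictions.items()
        for feature, value in row.items()
        if value < threshold
    ]
    return sorted(flagged, key=lambda item: item[2])


for case_index, feature, value in low_conviction_features(residual_convictions):
    print(f"case {case_index}: {feature} has residual conviction {value:.3f}")
```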


### Insight 4: How Anomalous are the Predicted Cases?

By getting the `similarity conviction` of the cases that we predict, we can determine which of them are anomalous relative to the trained data.
This could help to highlight cases that are unusually difficult or easy to predict, as well as discover potentially malicious or poisoned cases.

In [34]:
pprint(reaction["details"]["similarity_conviction"], compact=True)

[0.944973813536701, 1.0634885812371422, 2.4591688656461055, 0.9459478034438302,
 1.5043716628329622, 1.3148960999454378, 0.8593179782980951, 0.6111352473264055,
 1.092451006936613, 1.1240962791385916, 0.6156625668364372, 0.9759978812244915,
 0.8364501577689606, 1.2205135784345802, 1.105640525836773, 1.1640762161052323,
 0.8811111989833429, 0.8022076896889794, 0.9228534517611904, 1.0371725437694723,
 1.078206170467167, 0.9808268108925481, 1.033230709602799, 1.2635154300459304,
 0.9022696746946363, 0.6142630858335968, 0.9599712973636446, 1.5076132971921148,
 1.2673090525379531, 0.9304718966382368]
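
A simple way to act on this list is to rank the predicted cases by conviction and review the lowest ones first. The sketch below does so with the values printed above, assuming that lower similarity conviction means a case is more anomalous relative to the trained data. The `0.7` cutoff and the helper `least_similar_cases` are illustrative choices, not Howso defaults.

```python
# Similarity convictions copied from the output above, in prediction order.
similarity_convictions = [
    0.944973813536701, 1.0634885812371422, 2.4591688656461055, 0.9459478034438302,
    1.5043716628329622, 1.3148960999454378, 0.8593179782980951, 0.6111352473264055,
    1.092451006936613, 1.1240962791385916, 0.6156625668364372, 0.9759978812244915,
    0.8364501577689606, 1.2205135784345802, 1.105640525836773, 1.1640762161052323,
    0.8811111989833429, 0.8022076896889794, 0.9228534517611904, 1.0371725437694723,
    1.078206170467167, 0.9808268108925481, 1.033230709602799, 1.2635154300459304,
    0.9022696746946363, 0.6142630858335968, 0.9599712973636446, 1.5076132971921148,
    1.2673090525379531, 0.9304718966382368,
]


def least_similar_cases(convictions, threshold=0.7):
    """Return (index, conviction) pairs below `threshold`, lowest conviction first."""
    flagged = [(index, value) for index, value in enumerate(convictions) if value < threshold]
    return sorted(flagged, key=lambda item: item[1])


for index, value in least_similar_cases(similarity_convictions):
    print(f"case {index}: similarity conviction {value:.3f}")
```

With this cutoff, cases `7`, `25`, and `10` are flagged, in that order.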
