# Overview

At a very high level, Howso Engine is about: 

- Making an accurate prediction (even with limited or sparse data!) 

- Explaining the prediction process 

- Showing key properties of the data 

In this notebook, we use the Adult data set as an example to demonstrate some of Howso Engine’s capabilities, including identifying the cases and features that contribute to predictions, analyzing anomalies, and finding potential improvements to the data.


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from howso import engine
from howso.utilities import infer_feature_attributes
from howso.visuals import plot_feature_importances, plot_anomalies, plot_dataset

In [2]:
# Load adult data
df = pd.read_csv('data/adult.data', header=None)

# Specify column names
df.columns = ['age', 'workclass', 'fnlwgt', 'education', 
              'education-num', 'marital-status', 'occupation',
              'relationship', 'race', 'sex', 'capital-gain', 
              'capital-loss', 'hours-per-week', 'native-country', 'target']

# Sample the data for demo purposes
df = df.sample(1_000).reset_index(drop=True)

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,45,State-gov,213646,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,7298,0,40,United-States,>50K
1,58,Private,242670,HS-grad,9,Never-married,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
2,27,Private,211032,Preschool,1,Married-civ-spouse,Farming-fishing,Other-relative,White,Male,41310,0,24,Mexico,<=50K
3,39,Private,165799,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K
4,28,Private,196690,Assoc-voc,11,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,1669,42,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,20,?,369678,HS-grad,9,Never-married,?,Not-in-family,Other,Male,0,0,43,United-States,<=50K
996,50,Local-gov,138358,Some-college,10,Separated,Other-service,Unmarried,Black,Female,0,0,28,United-States,<=50K
997,27,Federal-gov,148153,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,<=50K
998,45,Private,343377,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K


In [3]:
partial_features = {
    "education": {"type": "nominal"}
}

# Infer feature types
features = infer_feature_attributes(df, features=partial_features)

# Specify the context and action features
action_features = ['target']
context_features = features.get_names(without=['target'])

In [4]:
# Create the trainee with a custom name
t = engine.Trainee(name='Engine - Predictions and Explanations Recipe', features=features, overwrite_existing=True)

# Train
t.train(df)

# Analyze the model
t.analyze(action_features=action_features)


In [5]:
t.react_into_trainee(residuals=True)

accuracy = t.get_prediction_stats(stats=['accuracy'])['target'].iloc[0]

print("Test set prediction accuracy: {acc}".format(acc=accuracy))

Test set prediction accuracy: 0.753


# Explain

How were the predictions made?

Howso Engine provides detailed explanations for complete model transparency. Let's examine a subset of the explanations.


## Feature importance (global)

The feature importance information provides insight into which features were the primary drivers of the predictions. This is important to understand in the context of AI bias and discrimination (e.g., a sensitive attribute being the primary contributor to a prediction).

This information is available at the global level (overall model), but can also be extracted at the local level (regional model for each case).
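As a rough illustration of what a mean decrease in accuracy measures, the sketch below scrambles one feature at a time and records how much accuracy drops. It is a generic permutation-style example and is not Howso Engine's internal computation; `toy_predict`, the toy rows, and the helper names are all hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def mean_decrease_in_accuracy(predict, rows, labels, feature, repeats=20, seed=0):
    """Average accuracy drop after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        scrambled = [{**row, feature: value} for row, value in zip(rows, shuffled)]
        drops.append(baseline - accuracy(predict, scrambled, labels))
    return sum(drops) / repeats


# Toy data: the label depends only on education-num, never on age.
rows = [{"age": 20 + i, "education-num": 5 + (i % 12)} for i in range(40)]
labels = [">50K" if row["education-num"] >= 12 else "<=50K" for row in rows]


def toy_predict(row):
    # Hypothetical stand-in for a trained model.
    return ">50K" if row["education-num"] >= 12 else "<=50K"


mda = {
    feature: mean_decrease_in_accuracy(toy_predict, rows, labels, feature)
    for feature in ("age", "education-num")
}
print(mda)
```

In this toy setup, scrambling "age" cannot change any prediction, so its MDA is zero, while scrambling "education-num" lowers accuracy and yields a positive MDA.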

In [6]:
# Extract the global MDA (mean decrease in accuracy)
t.react_into_trainee(action_feature=action_features[0], mda_robust=True, residuals=True)
global_mda = t.get_prediction_stats(action_feature=action_features[0], stats=['mda'])
plot_feature_importances(global_mda, title="Global Mean Decrease in Accuracy (MDA)", yaxis_title="MDA")

## Feature uncertainty (global)

Are there any noisy features? 

Howso Engine’s performance is robust against noisy features and can maintain a high level of accuracy despite noisy data.

Part of the reason Howso Engine can maintain this level of performance is its characterization of feature uncertainties (residuals). The feature residuals can be extracted for user review. Note that the residuals are in the same units as the original features, which makes them easy to interpret. For example, the residual for the “age” feature is in years, as in the original data.

Feature residuals are available at the global level (overall model) and at the local level (regional model for each case).
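Since this notebook reads the residuals from the `mae` statistic, a residual can be thought of as a mean absolute error in the feature's own units. The following minimal sketch shows that arithmetic for "age"; the predicted values are hypothetical and the helper `mean_absolute_error` is only for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference, in the same units as the inputs."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


actual_age = [45, 58, 27, 39, 28]     # ages in years
predicted_age = [40, 50, 35, 41, 30]  # hypothetical predictions, in years

age_residual = mean_absolute_error(actual_age, predicted_age)
print(f"age residual: {age_residual} years")
```

Here the absolute errors are 5, 8, 8, 2, and 2 years, so the residual is 5.0 years.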


In [7]:
# Global feature residuals
global_feature_residuals = t.get_prediction_stats(stats=['mae']).T.rename(columns={'mae':'residuals'}).sort_values('residuals', ascending=False)
global_feature_residuals.iloc[0:10]

feature,residuals
fnlwgt,85815.645968
capital-gain,2108.756136
age,11.446891
hours-per-week,9.251702
occupation,0.902195
education-num,0.813602
education,0.813602
relationship,0.735356
marital-status,0.646141
workclass,0.516781


# "Show me..."

Howso Engine can be used to show interesting information pertaining to the data and model, such as anomalous cases and potential model improvements. 
 
For each prediction, Howso Engine can also extract the influential cases and boundary cases to provide an exact explanation of the prediction process. More details on what’s available can be found in the notebook “2-interpretability.ipynb”.


## Anomalous cases

Anomalous cases can exist in the data as either an outlier or inlier. Outliers are cases which are very different than other cases. Inliers are cases which are too similar to other cases and do not follow the expected distribution. Inliers can be an indication of a fraudulent case that is “too good to be true”. 
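The distinction can be sketched with plain Python, mirroring the approach this notebook takes: cases with low familiarity conviction are treated as anomalous, and those are then split by whether their distance contribution is above or below the average of the anomalous group. The case values and the helper `tag_anomalies` are hypothetical, and the 0.75 threshold is a choice rather than a fixed rule.

```python
def tag_anomalies(cases, conviction_threshold=0.75):
    """Return the anomalous cases, each tagged as 'inlier' or 'outlier'."""
    anomalous = [
        case for case in cases
        if case["familiarity_conviction"] <= conviction_threshold
    ]
    if not anomalous:
        return []
    mean_distance = sum(c["distance_contribution"] for c in anomalous) / len(anomalous)
    return [
        {
            **case,
            "category": "inlier" if case["distance_contribution"] < mean_distance else "outlier",
        }
        for case in anomalous
    ]


# Hypothetical cases for illustration only.
cases = [
    {"id": "a", "familiarity_conviction": 0.06, "distance_contribution": 38.0},
    {"id": "b", "familiarity_conviction": 0.14, "distance_contribution": 1.2},
    {"id": "c", "familiarity_conviction": 2.61, "distance_contribution": 4.8},
]

tagged = tag_anomalies(cases)
for case in tagged:
    print(case["id"], case["category"])
```

Case "c" has high conviction and is not flagged at all; "a" is far from its neighbors (outlier) and "b" is unusually close to them (inlier).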



In [8]:
# Store the familiarity conviction, which will be used to identify anomalous cases
t.analyze()
t.react_into_features(familiarity_conviction_addition=True, distance_contribution=True)
stored_convictions = t.get_cases(session=t.active_session, features=df.columns.tolist() + ['familiarity_conviction_addition','.session_training_index', '.session', 'distance_contribution'])

stored_convictions

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,familiarity_conviction_addition,.session_training_index,.session,distance_contribution
0,45,State-gov,213646,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,7298,0,40,United-States,>50K,2.609425,0,baba7666-8dac-40ae-a565-3e2642d0ef01,4.810184
1,58,Private,242670,HS-grad,9,Never-married,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,11.774335,1,baba7666-8dac-40ae-a565-3e2642d0ef01,5.886059
2,27,Private,211032,Preschool,1,Married-civ-spouse,Farming-fishing,Other-relative,White,Male,41310,0,24,Mexico,<=50K,0.094427,2,baba7666-8dac-40ae-a565-3e2642d0ef01,29.202523
3,39,Private,165799,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,<=50K,0.293359,3,baba7666-8dac-40ae-a565-3e2642d0ef01,1.918331
4,28,Private,196690,Assoc-voc,11,Never-married,Machine-op-inspct,Not-in-family,White,Female,0,1669,42,United-States,<=50K,0.436970,4,baba7666-8dac-40ae-a565-3e2642d0ef01,14.927844
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,20,?,369678,HS-grad,9,Never-married,?,Not-in-family,Other,Male,0,0,43,United-States,<=50K,2.814812,995,baba7666-8dac-40ae-a565-3e2642d0ef01,9.646100
996,50,Local-gov,138358,Some-college,10,Separated,Other-service,Unmarried,Black,Female,0,0,28,United-States,<=50K,5.397442,996,baba7666-8dac-40ae-a565-3e2642d0ef01,8.796548
997,27,Federal-gov,148153,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,Asian-Pac-Islander,Male,0,0,40,United-States,<=50K,13.211536,997,baba7666-8dac-40ae-a565-3e2642d0ef01,8.087714
998,45,Private,343377,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K,1.310455,998,baba7666-8dac-40ae-a565-3e2642d0ef01,4.075570


In [9]:
# Threshold to determine which cases will be deemed anomalous
convict_threshold = 0.75

# Extract the anomalous cases
low_convicts = stored_convictions[stored_convictions['familiarity_conviction_addition'] <= convict_threshold ].sort_values('familiarity_conviction_addition', ascending=True)

# Average distance contribution will be used to determine if a case is an outlier or inlier
average_dist_contribution = low_convicts['distance_contribution'].mean()

# A case with a distance contribution greater than average is tagged as an outlier, and vice versa for inliers
cat = ['inlier' if d < average_dist_contribution else 'outlier' for d in low_convicts['distance_contribution']]
low_convicts['category'] = cat

## Outliers

Let’s examine a few outlier cases. Outliers are cases which are very different than other cases.

In [10]:
# Extract the outlier cases
outliers = low_convicts[low_convicts['category'] == 'outlier'].reset_index(drop=True)
outliers

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,familiarity_conviction_addition,.session_training_index,.session,distance_contribution,category
0,38,Federal-gov,37683,Prof-school,15,Never-married,Prof-specialty,Not-in-family,Asian-Pac-Islander,Female,99999,0,57,Canada,>50K,0.059752,545,baba7666-8dac-40ae-a565-3e2642d0ef01,37.956412,outlier
1,54,Self-emp-not-inc,269068,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,99999,0,50,Philippines,>50K,0.078857,453,baba7666-8dac-40ae-a565-3e2642d0ef01,31.899997,outlier
2,36,Self-emp-inc,216711,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,99999,0,50,?,>50K,0.093408,856,baba7666-8dac-40ae-a565-3e2642d0ef01,28.934321,outlier
3,27,Private,211032,Preschool,1,Married-civ-spouse,Farming-fishing,Other-relative,White,Male,41310,0,24,Mexico,<=50K,0.094427,2,baba7666-8dac-40ae-a565-3e2642d0ef01,29.202523,outlier
4,49,Self-emp-not-inc,43348,Doctorate,16,Never-married,Prof-specialty,Not-in-family,White,Male,99999,0,70,United-States,>50K,0.097505,416,baba7666-8dac-40ae-a565-3e2642d0ef01,28.162358,outlier
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
100,50,Federal-gov,32801,Bachelors,13,Married-civ-spouse,Exec-managerial,Wife,Amer-Indian-Eskimo,Female,0,0,40,United-States,>50K,0.689422,11,baba7666-8dac-40ae-a565-3e2642d0ef01,12.985062,outlier
101,42,Private,367533,10th,6,Married-civ-spouse,Craft-repair,Own-child,Other,Male,0,0,43,United-States,>50K,0.694713,886,baba7666-8dac-40ae-a565-3e2642d0ef01,12.960099,outlier
102,27,Private,150025,5th-6th,3,Never-married,Handlers-cleaners,Other-relative,White,Male,0,0,40,Guatemala,<=50K,0.720323,239,baba7666-8dac-40ae-a565-3e2642d0ef01,12.826780,outlier
103,35,Private,90273,7th-8th,4,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,?,>50K,0.732626,905,baba7666-8dac-40ae-a565-3e2642d0ef01,12.770016,outlier


In [11]:
# Cache global non-robust residuals into trainee
t.react_into_trainee(residuals=True)

# Request the global and local case feature residual convictions (using robust residuals)
details = {'robust_residuals': True,
           'global_case_feature_residual_convictions': True, 
           'local_case_feature_residual_convictions': True}

# Specify outlier cases
outliers_indices = outliers[['.session', '.session_training_index']].values

# React to get the details of each case
results = t.react(case_indices=outliers_indices, 
                  preserve_feature_values=df.columns.tolist(), 
                  leave_case_out=True, 
                  details=details)

In [12]:
# Extract the global and local case feature residual convictions
global_case_feature_residual_convictions = pd.DataFrame(results['details']['global_case_feature_residual_convictions'])[df.columns.tolist()]
local_case_feature_residual_convictions = pd.DataFrame(results['details']['local_case_feature_residual_convictions'])[df.columns.tolist()]

In [13]:
plot_anomalies(outliers, local_case_feature_residual_convictions, title="Outliers", yaxis_title="Residual Conviction")

The heat map explains the reason why each case was an outlier. The darker the shade of red, the higher the contribution to the case being an outlier. 

## Inliers

Let’s examine a few inlier cases. Inliers are cases which are too similar to other cases and do not follow the expected distribution. Inliers can be an indication of a fraudulent case that is “too good to be true”. 

In [14]:
# Get the inlier cases
inliers = low_convicts[low_convicts['category'] == 'inlier'].reset_index(drop=True)
inliers

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target,familiarity_conviction_addition,.session_training_index,.session,distance_contribution,category
0,37,Private,184556,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,40,United-States,<=50K,0.144743,963,baba7666-8dac-40ae-a565-3e2642d0ef01,1.156757,inlier
1,36,Private,182013,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,40,United-States,<=50K,0.145409,61,baba7666-8dac-40ae-a565-3e2642d0ef01,1.160345,inlier
2,37,Private,179468,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K,0.169928,565,baba7666-8dac-40ae-a565-3e2642d0ef01,1.146730,inlier
3,38,Private,159179,HS-grad,9,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K,0.172387,111,baba7666-8dac-40ae-a565-3e2642d0ef01,1.165582,inlier
4,23,Private,186014,Some-college,10,Never-married,Adm-clerical,Own-child,White,Female,0,0,15,United-States,<=50K,0.182321,849,baba7666-8dac-40ae-a565-3e2642d0ef01,1.287704,inlier
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135,18,?,171088,Some-college,10,Never-married,?,Own-child,White,Female,0,0,40,United-States,<=50K,0.731097,335,baba7666-8dac-40ae-a565-3e2642d0ef01,3.327584,inlier
136,35,Private,160910,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K,0.735100,702,baba7666-8dac-40ae-a565-3e2642d0ef01,3.282702,inlier
137,44,Private,98211,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,40,United-States,>50K,0.740318,241,baba7666-8dac-40ae-a565-3e2642d0ef01,3.307519,inlier
138,54,Local-gov,279452,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,40,United-States,<=50K,0.740868,630,baba7666-8dac-40ae-a565-3e2642d0ef01,3.312440,inlier


In [15]:
# Specify the inlier cases
inliers_indices = inliers[['.session', '.session_training_index']].values

# React to get the details of each case
results = t.react(case_indices=inliers_indices, 
                  preserve_feature_values=df.columns.tolist(), 
                  leave_case_out=True, 
                  details=details)

In [16]:
# Extract the global and local case feature residual convictions
global_case_feature_residual_convictions = pd.DataFrame(results['details']['global_case_feature_residual_convictions'])[df.columns.tolist()]
local_case_feature_residual_convictions = pd.DataFrame(results['details']['local_case_feature_residual_convictions'])[df.columns.tolist()]

In [17]:
plot_anomalies(inliers, local_case_feature_residual_convictions, title="Inliers", yaxis_title="Residual Conviction")

The heat map explains the reason why each case was an inlier. The darker the shade of blue, the higher the contribution to the case being an inlier.

## Potential improvements

Sparse regions of the model or underdefined problems can make it difficult to make an accurate prediction. Howso Engine can be used to identify potential data or model improvements by examining the residual conviction and density.

In [18]:
# Identify cases for investigation
partial_train_df = stored_convictions
partial_train_cases = partial_train_df[['.session', '.session_training_index']]


In [19]:
# Residual convictions are output via the global_case_feature_residual_convictions detail
details = {'global_case_feature_residual_convictions':True}

# Get the residual convictions for the specified cases
new_result = t.react(case_indices=partial_train_cases.values.tolist(), 
                     leave_case_out=True, 
                     preserve_feature_values=df.drop(action_features, axis=1).columns.tolist(), 
                     action_features=action_features,
                     details=details)

In [20]:
# Extract residual conviction
target_residual_convictions = [ x['target'] for x in new_result['details']['global_case_feature_residual_convictions'] ]

# Binarize residual conviction
convict_threshold = 0.75
low_residual_conviction = [1 if x <= convict_threshold else 0 for x in target_residual_convictions]

# Density is just the inverse of distance_contribution
density = 1 / partial_train_df['distance_contribution']

# Add new features to the dataframe
partial_train_df['density'] = density
partial_train_df['target_residual_conviction'] = target_residual_convictions
partial_train_df['low_residual_conviction'] = low_residual_conviction

In [21]:
partial_train_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,...,hours-per-week,native-country,target,familiarity_conviction_addition,.session_training_index,.session,distance_contribution,density,target_residual_conviction,low_residual_conviction
0,45,State-gov,213646,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,40,United-States,>50K,2.609425,0,baba7666-8dac-40ae-a565-3e2642d0ef01,4.810184,0.207892,0.651194,1
1,58,Private,242670,HS-grad,9,Never-married,Adm-clerical,Unmarried,White,Female,...,40,United-States,<=50K,11.774335,1,baba7666-8dac-40ae-a565-3e2642d0ef01,5.886059,0.169893,22.815632,0
2,27,Private,211032,Preschool,1,Married-civ-spouse,Farming-fishing,Other-relative,White,Male,...,24,Mexico,<=50K,0.094427,2,baba7666-8dac-40ae-a565-3e2642d0ef01,29.202523,0.034244,0.659037,1
3,39,Private,165799,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,40,United-States,<=50K,0.293359,3,baba7666-8dac-40ae-a565-3e2642d0ef01,1.918331,0.521286,1.943868,0
4,28,Private,196690,Assoc-voc,11,Never-married,Machine-op-inspct,Not-in-family,White,Female,...,42,United-States,<=50K,0.436970,4,baba7666-8dac-40ae-a565-3e2642d0ef01,14.927844,0.066989,1.710386,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,20,?,369678,HS-grad,9,Never-married,?,Not-in-family,Other,Male,...,43,United-States,<=50K,2.814812,995,baba7666-8dac-40ae-a565-3e2642d0ef01,9.646100,0.103669,2.010284,0
996,50,Local-gov,138358,Some-college,10,Separated,Other-service,Unmarried,Black,Female,...,28,United-States,<=50K,5.397442,996,baba7666-8dac-40ae-a565-3e2642d0ef01,8.796548,0.113681,1.775804,0
997,27,Federal-gov,148153,HS-grad,9,Married-civ-spouse,Handlers-cleaners,Husband,Asian-Pac-Islander,Male,...,40,United-States,<=50K,13.211536,997,baba7666-8dac-40ae-a565-3e2642d0ef01,8.087714,0.123644,0.995310,0
998,45,Private,343377,Some-college,10,Married-civ-spouse,Adm-clerical,Husband,White,Male,...,40,United-States,<=50K,1.310455,998,baba7666-8dac-40ae-a565-3e2642d0ef01,4.075570,0.245364,15.429044,0


In [22]:
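# Helper function to linearly rescale values into [min_size, max_size] for plotting
def get_sizes(min_size, max_size, series):
    min_value = series.min()
    max_value = series.max()

    # Slope of the linear map from [min_value, max_value] to [min_size, max_size]
    m = (max_size - min_size) / (max_value - min_value)

    # Shift by min_value so the smallest value maps exactly to min_size
    sizes = (series - min_value) * m + min_size
    return sizes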

partial_train_df["density"] = get_sizes(5, 500, partial_train_df["density"])
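A linear rescale into a marker-size range is easy to check by hand on a few numbers. The sketch below is a standalone, pure-Python version of that idea; the function name `rescale` and the density values are hypothetical.

```python
def rescale(values, new_min, new_max):
    """Linearly map values so min(values) -> new_min and max(values) -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Degenerate case: all values are equal, so use the middle of the range.
        return [(new_min + new_max) / 2 for _ in values]
    slope = (new_max - new_min) / (old_max - old_min)
    return [(value - old_min) * slope + new_min for value in values]


densities = [0.25, 0.5, 1.0]  # hypothetical density values
sizes = rescale(densities, 5, 500)
print(sizes)
```

The smallest density maps to the minimum marker size and the largest to the maximum, with everything else placed proportionally in between.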

In [23]:
plot_dataset(partial_train_df, x="age", y="education-num", size="density", hue="low_residual_conviction", alpha=0.4)

The above graph is a visualization of the data set in two dimensions, with color indicating residual conviction and size representing the density of the data. More specifically, orange represents the low-conviction points (points whose predictions are very uncertain), and a small size represents low density. Therefore, adding more data to regions with small, orange points can improve model performance.


On the other hand, a large orange point indicates a case that lies in a dense region but was not predictable. This suggests that the problem is not well defined or that the data is missing key features.
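The reading of the plot described above can be expressed as a small decision rule. This is only a sketch under stated assumptions: the function `suggest_improvement`, the density cutoff of 0.1, and the example points are hypothetical, and density is taken as the inverse of distance contribution as in this notebook.

```python
def suggest_improvement(density, low_residual_conviction, density_cutoff):
    """Map a point's density and conviction flag to a suggested follow-up."""
    if not low_residual_conviction:
        return "no action"
    if density < density_cutoff:
        # Uncertain prediction in a sparse region.
        return "add more data in this region"
    # Uncertain prediction despite many nearby cases.
    return "revisit problem definition or add features"


# Hypothetical points: (distance_contribution, low_residual_conviction flag)
points = [(20.0, True), (2.0, True), (5.0, False)]
density_cutoff = 0.1  # assumed cutoff; choose one that suits the data

suggestions = [
    suggest_improvement(1 / distance, flag, density_cutoff)
    for distance, flag in points
]
print(suggestions)
```

In practice the cutoff could be derived from the data, for example from a quantile of the observed densities, rather than fixed in advance.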
