# Recipe: Generative Reacts and Generating Data Using the Howso Engine

## Overview 

Although most Engine recipes use discriminative analysis, Engine can easily be configured to perform generative analysis. The difference lies in how a prediction is chosen: **discriminative** analysis returns the maximum likelihood estimate (MLE), whereas **generative** analysis samples a prediction from a probability distribution derived from the trained data. A defining feature of Howso Engine is its use of `conviction` to condition that sample to be more or less unusual (or surprising).
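To make the distinction concrete, here is a minimal, library-free sketch. It does not call Howso Engine or reproduce its internals; the toy distribution and the function names `discriminative_predict` and `generative_predict` are hypothetical and exist only for illustration.

```python
import random

# Toy predicted distribution over the two classes of a binary target.
# These probabilities are made up for illustration only.
class_probs = {0: 0.3, 1: 0.7}


def discriminative_predict(probs):
    """Return the single most likely class (the mode of the distribution)."""
    return max(probs, key=probs.get)


def generative_predict(probs, rng):
    """Draw one class at random, weighted by its probability."""
    classes = list(probs)
    weights = [probs[c] for c in classes]
    return rng.choices(classes, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the sketch is repeatable
print("discriminative:", discriminative_predict(class_probs))
print("generative draws:", [generative_predict(class_probs, rng) for _ in range(10)])
```

The discriminative call always returns the same class for a given distribution, while repeated generative draws vary and, over many draws, follow the class probabilities.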

In this recipe, we will review how to perform generative analysis using Howso Engine and contrast its results with discriminative analysis. Then, we will demonstrate a possible use case for these generative predictions: synthesizing data.

In [1]:
import pandas as pd
from pmlb import fetch_data

from howso import engine
from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

## Step 1: Load Data
We will use the `Adult` dataset, where the Action Feature (`target`) is a binary indicator of whether a person makes over $50k/year.

In [2]:
df_original = fetch_data('adult', local_cache_dir="data/adult")

# Use the first 1000 rows as training data
df = df_original.iloc[:1000]

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39.0,7,77516.0,9,13.0,4,1,1,4,1,2174.0,0.0,40.0,39,1
1,50.0,6,83311.0,9,13.0,2,4,0,4,1,0.0,0.0,13.0,39,1
2,38.0,4,215646.0,11,9.0,0,6,1,4,1,0.0,0.0,40.0,39,1
3,53.0,4,234721.0,1,7.0,2,6,0,2,1,0.0,0.0,40.0,39,1
4,28.0,4,338409.0,9,13.0,2,10,5,2,0,0.0,0.0,40.0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56.0,4,112840.0,11,9.0,2,4,0,4,1,0.0,0.0,55.0,39,0
996,45.0,4,89325.0,12,14.0,0,10,1,4,1,0.0,0.0,45.0,39,1
997,48.0,1,33109.0,9,13.0,0,4,4,4,1,0.0,0.0,58.0,39,0
998,40.0,4,82465.0,15,10.0,2,7,0,4,1,2580.0,0.0,40.0,39,1


## Step 2: Compare Discriminative and Generative Reacts

A *generative* `react` call looks almost identical to a *discriminative* one; the only difference is that a `desired_conviction` value is set. This conviction moves Engine's predictions into probability space and indicates how surprising the generated point will be. Here, we ask Engine for generative predictions on a test dataset using both a high conviction (`desired_conviction=10`, meaning the prediction is ten times less surprising than average) and a low conviction (`desired_conviction=0.1`, meaning the prediction is ten times more surprising than average), and compare the results with a discriminative prediction.
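The "ten times less surprising than average" reading can be written as a simple ratio. The sketch below assumes conviction is the average (expected) surprisal divided by the surprisal of a given prediction, which is consistent with the description above. The function name and the numeric surprisal values are hypothetical, and this is not Engine's internal computation.

```python
def conviction(expected_surprisal, observed_surprisal):
    """Ratio of average surprisal to the surprisal of one prediction.

    Assumption for illustration: values above 1 mean "less surprising
    than average", values below 1 mean "more surprising than average".
    """
    if observed_surprisal <= 0:
        raise ValueError("observed_surprisal must be positive")
    return expected_surprisal / observed_surprisal


average = 2.0  # hypothetical average surprisal, arbitrary units

# A prediction ten times less surprising than average (high conviction)
print(conviction(average, 0.2))
# A prediction ten times more surprising than average (low conviction)
print(conviction(average, 20.0))
```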

As usual, we must first initialize our Trainee with the feature attributes, train, and analyze.

In [3]:
# Infer feature attributes
features = infer_feature_attributes(df)

# Display the feature attributes for verification
features

{'age': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 7.0, 'max': 148.0}},
 'workclass': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': 'integer', 'size': 8},
  'bounds': {'allow_null': False}},
 'fnlwgt': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 8103.0, 'max': 1202604.0}},
 'education': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': 'integer', 'size': 8},
  'bounds': {'allow_null': False}},
 'education-num': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 1.0, 'max': 20.0}},
 'marital-status': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': 'integer', 'size': 8},
  'bounds': {'allow_null': False}},
 

In [4]:
# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)

# Create the Trainee
t = Trainee(
    features=features,
    overwrite_existing=True
)

# Train
t.train(df)

# Targeted Analysis
t.analyze(context_features=context_features, action_features=action_features)


Additionally, we will store the context values and target values for a set of untrained cases. These will be useful for our demonstrations.

In [5]:
# Obtain context values for test cases, for which you want a generative prediction, and the corresponding action values
context_values = df_original[context_features].iloc[1001:1011].reset_index(drop=True)
action_values = df_original[action_features].iloc[1001:1011].reset_index(drop=True)

First, we will perform a generative react on the test cases with high conviction, which should return less surprising and more accurate results:

In [6]:
# Perform generative react with high conviction
result_gen_high = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values,
    desired_conviction=10,
    num_cases_to_generate=len(context_values)
)

Then we will perform a generative react on the test cases with low conviction, which should return more surprising and less accurate results:

In [7]:
# Perform generative react with low conviction
result_gen_low = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values,
    desired_conviction=0.1,
    num_cases_to_generate=len(context_values)
)

Finally, we will perform a discriminative react on the same test cases:

In [8]:
# Perform discriminative react by not specifying 'desired_conviction'
result_disc = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values
)

We will compare results between the three types of prediction. 

In [9]:
action_values['gen-high'] = result_gen_high['action']
action_values['gen-low'] = result_gen_low['action']
action_values['disc'] = result_disc['action']
action_values

Unnamed: 0,target,gen-high,gen-low,disc
0,0,0,1,0
1,0,1,0,1
2,1,0,0,1
3,1,1,0,1
4,1,0,1,0
5,1,0,0,1
6,0,1,0,1
7,0,0,1,0
8,1,1,0,1
9,1,1,0,1


In [10]:
# Accuracy (%) = share of test cases where the prediction equals the true target
acc_gen_high = (action_values['target'] == action_values['gen-high']).mean() * 100
acc_gen_low = (action_values['target'] == action_values['gen-low']).mean() * 100
acc_disc = (action_values['target'] == action_values['disc']).mean() * 100

print('Accuracy')

print('Generative - High Conviction', acc_gen_high)
print('Generative - Low Conviction', acc_gen_low)
print('Discriminative', acc_disc)

Accuracy
Generative - High Conviction 50.0
Generative - Low Conviction 30.0
Discriminative 70.0
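
As a sanity check, the percentages above can be recomputed by hand from the ten rows of the comparison table. The sketch below copies those rows into plain Python lists and needs neither pandas nor Engine; the helper name `accuracy_pct` is hypothetical.

```python
# Columns copied from the comparison table above (rows 0-9).
target   = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]
gen_high = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
gen_low  = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
disc     = [0, 1, 1, 1, 0, 1, 1, 0, 1, 1]


def accuracy_pct(truth, predicted):
    """Percentage of positions where the prediction equals the true value."""
    matches = sum(t == p for t, p in zip(truth, predicted))
    return matches * 100 / len(truth)


print("Generative - High Conviction", accuracy_pct(target, gen_high))  # 50.0
print("Generative - Low Conviction", accuracy_pct(target, gen_low))    # 30.0
print("Discriminative", accuracy_pct(target, disc))                    # 70.0
```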


In this run, the high-conviction generative predictions were more accurate (50%) than the low-conviction ones (30%), and the discriminative predictions were the most accurate (70%). This ordering is what we would typically expect, but with only ten test cases the differences are noisy, and generative predictions can occasionally be more accurate than discriminative ones on untrained data.

## Step 3: Synthetic Data Generation with Generative React

Generative analysis is also useful to create synthetic data, as it can generate entirely new points that follow the distribution of the original data.

Synthetic data creation is similar to the generative analysis performed in Step 2, but we recommend using a *targetless* analysis. Two additional parameters are worth considering for your use case:

- `generate_new_cases`: A string that specifies whether generated feature values should represent completely new cases. The available values are "no", "attempt", and "always".
- `num_cases_to_generate`: The number of cases to synthesize generatively.
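
One way to picture the three `generate_new_cases` settings is as a uniqueness filter on sampled cases. The toy sampler below is only an analogy, under one reading of the option names: `"no"` returns whatever is drawn, `"always"` returns only cases absent from the training set (or `None` if it cannot find one), and `"attempt"` tries for a new case but falls back to any draw. The functions `draw_case` and `generate`, the retry limit, and the tiny dataset are hypothetical, and none of this reflects how Engine generates cases internally.

```python
import random

# Tiny stand-in for a training set of (feature_a, feature_b) cases.
training_cases = {(0, 0), (0, 1), (1, 0)}


def draw_case(rng):
    """Sample a candidate case uniformly from a 2x2 grid (toy generator)."""
    return (rng.randint(0, 1), rng.randint(0, 1))


def generate(mode, rng, max_tries=50):
    """Return one case according to a uniqueness policy."""
    if mode not in {"no", "attempt", "always"}:
        raise ValueError(f"unknown mode: {mode!r}")
    candidate = draw_case(rng)
    if mode == "no":
        return candidate  # duplicates of training cases are allowed
    for _ in range(max_tries):
        if candidate not in training_cases:
            return candidate  # found a case not present in the training set
        candidate = draw_case(rng)
    # Ran out of retries: "attempt" falls back to any draw, "always" gives up.
    return candidate if mode == "attempt" else None


rng = random.Random(0)  # seeded so the sketch is repeatable
for mode in ("no", "attempt", "always"):
    print(mode, [generate(mode, rng) for _ in range(5)])
```

In this analogy, stricter settings trade a higher chance of failure or extra sampling work for the guarantee that no training case is reproduced verbatim.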

In [11]:
# Targetless optimization
t.analyze()

# Synthesize entire cases
synth = t.react(action_features=df.columns.tolist(),
                desired_conviction=5,
                generate_new_cases='no',
                num_cases_to_generate=len(df))


In [12]:
# Print out synthetic dataset
synthetic_data = synth['action']
synthetic_data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,20.0,4,259913.0,15,10.0,2,8,0,4,1,458.0,0.0,7.0,39,1
1,46.0,4,206890.0,15,13.0,2,10,1,4,0,15184.0,0.0,57.0,39,0
2,35.0,2,44707.0,11,10.0,4,6,1,4,1,0.0,1.0,51.0,39,1
3,61.0,4,267499.0,15,9.0,2,12,0,4,1,0.0,0.0,63.0,39,1
4,43.0,4,241354.0,9,13.0,4,10,1,4,1,0.0,28.0,18.0,39,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56.0,4,255691.0,7,13.0,0,3,1,4,0,0.0,0.0,42.0,39,1
996,61.0,0,36148.0,0,9.0,0,0,3,0,1,0.0,0.0,8.0,39,1
997,31.0,0,877547.0,15,10.0,2,0,5,2,0,15.0,3.0,41.0,39,1
998,19.0,0,134707.0,15,10.0,4,6,3,4,1,0.0,0.0,39.0,39,1


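Upon visual inspection, the synthetic data resembles the training data, with feature values falling in similar ranges. A well-configured synthesis should preserve the relationships between features and each feature's marginal distribution, although confirming this takes more than a glance at a few rows.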
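
Visual inspection can be backed by a simple numeric comparison of a feature's marginal distribution. The sketch below uses only the standard library and the `age` values from the first five displayed rows of each table, so the numbers are illustrative rather than a real evaluation; the helper name `summarize` is hypothetical.

```python
from statistics import mean, stdev

# `age` values from the first five displayed rows of each table above.
original_age = [39.0, 50.0, 38.0, 53.0, 28.0]
synthetic_age = [20.0, 46.0, 35.0, 61.0, 43.0]


def summarize(values):
    """Basic summary statistics for one continuous feature."""
    return {
        "mean": mean(values),
        "stdev": stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


for name, values in (("original", original_age), ("synthetic", synthetic_age)):
    stats = summarize(values)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

On the full data, the same helper could be called with `df['age'].tolist()` and `synthetic_data['age'].tolist()`. For nominal features such as `workclass`, comparing category frequencies (for example with `collections.Counter`) would be the analogous check.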

## Conclusion

This recipe demonstrates how to use the `react` method of the Howso Engine to make both discriminative and generative predictions. Generative `react` calls can be used to inspect the probability distribution of a feature's values in different regions of the data based on varying contexts, while discriminative `react` calls are useful for more typical inference use cases.

Additionally, we show that a generative `react` can be used to generate entire collections of data that resemble the training data. We also discuss some of the important parameters for this use case, which let users control the level of noise and privacy when synthesizing new data.