# Recipe: Generative React and Generating Synthetic Data Using Engine
## Overview 

Although most Engine recipes use discriminative analysis, Engine can easily be configured to perform generative analysis. What is the difference? **Discriminative** analysis uses the maximum likelihood estimate (MLE) to make predictions. In contrast, **generative** analysis samples a prediction from the likelihood distribution. A defining feature of Howso Engine is that it uses the concept of `conviction` to condition its sample from the likelihood distribution to be more or less unusual (or surprising).
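
The distinction between taking the maximum likelihood estimate and sampling from the likelihood distribution can be sketched with a toy example. This is only an illustration using Python's standard library, not Howso Engine's implementation; the probabilities and the function names `discriminative_prediction` and `generative_prediction` are made up for the example.

```python
import random

# Toy likelihood distribution over the two classes of a binary target.
# The probabilities are invented for illustration only.
likelihood = {0: 0.7, 1: 0.3}


def discriminative_prediction(dist):
    """Return the single most likely value (the argmax of the distribution)."""
    return max(dist, key=dist.get)


def generative_prediction(dist, rng):
    """Draw one value at random, weighted by its likelihood."""
    values = list(dist)
    weights = [dist[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the draws are repeatable
disc = discriminative_prediction(likelihood)
draws = [generative_prediction(likelihood, rng) for _ in range(1000)]
share_of_ones = sum(draws) / len(draws)

print(disc)           # the most likely class, identical on every call
print(share_of_ones)  # roughly 0.3, because class 1 is sampled about 30% of the time
```

The discriminative prediction never changes for a given distribution, while repeated generative predictions reproduce the distribution itself, including its less likely values.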

In this recipe, we will review how to perform generative analysis using Howso Engine and contrast its results with discriminative analysis. Then, we will demonstrate a possible use case for these generative predictions: synthetic data creation.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
from pmlb import fetch_data

from howso import engine
from howso.engine import Trainee
from howso.utilities import infer_feature_attributes

We will be using the `Adult` dataset where the Action Feature is a binary indicator of whether a person makes over $50k/year.

In [2]:
df_original = fetch_data('adult', local_cache_dir="data/adult")

# Use the first 1000 data points as training data
df = df_original.iloc[:1000]

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39.0,7,77516.0,9,13.0,4,1,1,4,1,2174.0,0.0,40.0,39,1
1,50.0,6,83311.0,9,13.0,2,4,0,4,1,0.0,0.0,13.0,39,1
2,38.0,4,215646.0,11,9.0,0,6,1,4,1,0.0,0.0,40.0,39,1
3,53.0,4,234721.0,1,7.0,2,6,0,2,1,0.0,0.0,40.0,39,1
4,28.0,4,338409.0,9,13.0,2,10,5,2,0,0.0,0.0,40.0,5,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,56.0,4,112840.0,11,9.0,2,4,0,4,1,0.0,0.0,55.0,39,0
996,45.0,4,89325.0,12,14.0,0,10,1,4,1,0.0,0.0,45.0,39,1
997,48.0,1,33109.0,9,13.0,0,4,4,4,1,0.0,0.0,58.0,39,0
998,40.0,4,82465.0,15,10.0,2,7,0,4,1,2580.0,0.0,40.0,39,1


In [3]:
# Infer features using dataframe format
features = infer_feature_attributes(df)

# Section 1: Generative React

A *generative* `react` call is very similar to a *discriminative* `react` call; the only difference is that a `desired_conviction` value is set. This conviction moves Engine's predictions into probability space and indicates how surprising the new point generated by Engine will be. Here, we will ask Engine for a generative outcome on a test dataset using both a high conviction (`desired_conviction=10`, indicating the prediction is ten times less surprising than average) and a low conviction (`desired_conviction=0.1`, indicating the prediction is ten times more surprising than average), and compare the results to a discriminative outcome.
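
Taking the description above at face value, conviction can be read as the ratio of average surprisal to the surprisal of the generated point. The sketch below only illustrates that arithmetic under this assumption; `relative_surprisal` is a hypothetical helper and not part of Howso Engine's API.

```python
def relative_surprisal(desired_conviction):
    """Surprisal of a generated point relative to the average (average = 1.0).

    Assumes conviction = average surprisal / point surprisal, following the
    description in the text. This is not a Howso Engine function.
    """
    if desired_conviction <= 0:
        raise ValueError("desired_conviction must be positive")
    return 1.0 / desired_conviction


# A conviction of 1 corresponds to an average level of surprise.
for conviction in (10, 1, 0.1):
    print(conviction, relative_surprisal(conviction))
```

Under this reading, a conviction of 10 asks for a point one tenth as surprising as average, and a conviction of 0.1 asks for a point ten times as surprising.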

In [4]:
# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)

# Create the Trainee
t = Trainee(
    features=features,
    overwrite_existing=True
)

# Train
t.train(df)

# Targeted Analysis
t.analyze(context_features=context_features, action_features=action_features)


In [5]:
# Obtain context values for test cases, for which you want a generative prediction, and the corresponding action values
context_values = df_original[context_features].iloc[1001:1011].reset_index(drop=True)
action_values = df_original[action_features].iloc[1001:1011].reset_index(drop=True)

First, we will perform a generative react on the test cases with high conviction, which should return less surprising and more accurate results:

In [6]:
# Perform generative react with high conviction
result_gen_high = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values,
    desired_conviction=10,
    num_cases_to_generate=len(context_values),
)

Then we will perform a generative react on the test cases with low conviction, which should return more surprising and less accurate results:

In [7]:
# Perform generative react with low conviction
result_gen_low = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values,
    desired_conviction=0.1,
    num_cases_to_generate=len(context_values),
)

Finally, we will perform a discriminative react on the same test cases:

In [8]:
# Perform discriminative react
result_disc = t.react(
    context_features=context_features,
    action_features=action_features,
    contexts=context_values,
)

We will compare results between the three types of prediction. 

In [9]:
action_values['gen-high'] = result_gen_high['action']
action_values['gen-low'] = result_gen_low['action']
action_values['disc'] = result_disc['action']
action_values

Unnamed: 0,target,gen-high,gen-low,disc
0,0,0,1,0
1,0,0,1,0
2,1,1,1,1
3,1,1,0,1
4,1,0,1,0
5,1,1,0,1
6,0,1,1,0
7,0,0,1,0
8,1,1,0,1
9,1,1,0,1


In [10]:
# Accuracy is the percentage of test cases where the prediction matches the target
acc_gen_high = (action_values['target'] == action_values['gen-high']).mean() * 100
acc_gen_low = (action_values['target'] == action_values['gen-low']).mean() * 100
acc_disc = (action_values['target'] == action_values['disc']).mean() * 100

print('Accuracy')

print('Generative - High Conviction', acc_gen_high)
print('Generative - Low Conviction', acc_gen_low)
print('Discriminative', acc_disc)

Accuracy
Generative - High Conviction 80.0
Generative - Low Conviction 20.0
Discriminative 90.0


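We can see here that generative predictions with high conviction are more accurate than generative predictions with low conviction, while the discriminative predictions are the most accurate overall. Keep in mind that with only ten test cases each prediction shifts accuracy by ten percentage points, and that generative predictions are sampled, so these figures are illustrative and can vary between runs.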
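
One way to see why the discriminative prediction is expected to score highest is a small calculation for a binary target. The sketch below makes an idealized assumption that is not claimed by the recipe: the predicted distribution equals the true distribution, with one class having probability `p`. The function names are hypothetical.

```python
def argmax_accuracy(p):
    """Expected accuracy when always predicting the more likely class."""
    return max(p, 1 - p)


def sampling_accuracy(p):
    """Expected accuracy when the prediction is sampled from the distribution.

    The prediction and the true label are treated as independent draws from
    the same binary distribution, so they agree with probability
    p^2 + (1 - p)^2.
    """
    return p * p + (1 - p) * (1 - p)


for p in (0.5, 0.7, 0.9):
    print(p, argmax_accuracy(p), round(sampling_accuracy(p), 4))
```

Under this assumption, sampling is never more accurate than taking the most likely value, and the two only coincide when the outcome is certain or a coin flip. This is consistent with the ordering observed above, where a high desired conviction keeps sampled predictions closer to the likely value and a low one deliberately favors surprising values.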

# Section 2: Synthetic Data Generation

Generative analysis is also useful for creating synthetic data, as it can generate entirely new points that follow the distribution of the original data.

Synthetic data creation is similar to the generative analysis performed in Section 1, but uses a *targetless* optimization scheme. You will also set two parameters:

- `generate_new_cases`: whether a completely new case is always generated
- `num_cases_to_generate`: the number of synthetic cases to generate

In [11]:
# Infer features using dataframe format
features = infer_feature_attributes(df)

# Create the Trainee
t = Trainee(
    features=features,
    overwrite_existing=True
)

# Train
t.train(df)

# Targetless optimization
t.analyze()

# Synthesize
synth = t.react(action_features=df.columns.tolist(), # What features to generate? In this case, the same features as the original data
                desired_conviction=5, # Set at GeminAI's default desired conviction value
                generate_new_cases='always', # Indicates that we always want to create entirely new cases from the original data
                num_cases_to_generate=len(df)) # Number of new points to generate? In this case, the same number as the original data


In [12]:
# Print out synthetic dataset
synthetic_data = synth['action']
synthetic_data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,27.0,4,181392.0,11,10.0,0,4,1,4,1,0.0,80.0,77.0,39,1
1,69.0,2,432941.0,3,2.0,0,0,0,2,1,0.0,1.0,40.0,39,1
2,43.0,4,73258.0,9,13.0,2,10,0,4,1,0.0,2.0,39.0,39,1
3,49.0,7,197962.0,8,9.0,2,10,0,4,1,0.0,6.0,44.0,39,1
4,43.0,4,48954.0,9,9.0,0,1,4,4,1,14.0,0.0,40.0,32,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,49.0,7,338443.0,1,9.0,2,8,0,4,1,0.0,0.0,43.0,39,1
996,34.0,4,258422.0,11,9.0,4,3,3,2,1,0.0,0.0,58.0,39,1
997,33.0,4,138232.0,7,13.0,2,6,0,1,1,6446.0,0.0,39.0,39,0
998,24.0,4,130938.0,13,10.0,6,0,3,4,1,220.0,3.0,21.0,39,1
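
Because `generate_new_cases='always'` asks for entirely new cases, one simple sanity check is to confirm that no synthetic row is a verbatim copy of a training row. The sketch below is a standard-library illustration, not part of the recipe; `exact_duplicates` is a hypothetical helper, and the sample rows are the first two rows of each table printed above.

```python
def exact_duplicates(original_rows, synthetic_rows):
    """Return the synthetic rows that also appear verbatim in the original rows."""
    seen = {tuple(row) for row in original_rows}
    return [row for row in synthetic_rows if tuple(row) in seen]


# First two rows of the original training data shown earlier.
original_rows = [
    (39.0, 7, 77516.0, 9, 13.0, 4, 1, 1, 4, 1, 2174.0, 0.0, 40.0, 39, 1),
    (50.0, 6, 83311.0, 9, 13.0, 2, 4, 0, 4, 1, 0.0, 0.0, 13.0, 39, 1),
]

# First two rows of the synthetic data shown above.
synthetic_rows = [
    (27.0, 4, 181392.0, 11, 10.0, 0, 4, 1, 4, 1, 0.0, 80.0, 77.0, 39, 1),
    (69.0, 2, 432941.0, 3, 2.0, 0, 0, 0, 2, 1, 0.0, 1.0, 40.0, 39, 1),
]

duplicates = exact_duplicates(original_rows, synthetic_rows)
print(len(duplicates))  # number of synthetic rows copied verbatim from the original
```

With the DataFrames from this recipe, the rows could be supplied as `df.itertuples(index=False)` and `synthetic_data.itertuples(index=False)`. Note that this check only catches exact matches; it says nothing about how close a synthetic row is to an original one.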
