# Introduction
> Before deploying a Machine Learning model, it is important to understand the performance of the model

> The What-if-tool is a visual interface desgined by Google which helps analyze the machine learning models with minimal lines of code

> This Notebook will show you how to use the What-if-tool, we will be using a dataset that is provided for all demos. In this example we will explore how we can use the Google What If Tool in order to measure the performance of a linear classifier model and examine how different features affect the models prediction

## Required Imports

In [2]:
import pandas as pd
import numpy as np
import tensorflow as tf
import functools

from sklearn.model_selection import train_test_split
from witwidget.notebook.visualization import WitConfigBuilder
from witwidget.notebook.visualization import WitWidget

## Create required definitions

In [3]:
# Creates a tf feature spec from the dataframe and columns specified.
def create_feature_spec(df, columns=None):
    feature_spec = {}
    if columns == None:
        columns = df.columns.values.tolist()
    for f in columns:
        if df[f].dtype is np.dtype(np.int64):
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.int64)
        elif df[f].dtype is np.dtype(np.float64):
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.float32)
        else:
            feature_spec[f] = tf.io.FixedLenFeature(shape=(), dtype=tf.string)
    return feature_spec

# Creates simple numeric and categorical feature columns from a feature spec and a
# list of columns from that spec to use.
#
# NOTE: Models might perform better with some feature engineering such as bucketed
# numeric columns and hash-bucket/embedding columns for categorical features.
def create_feature_columns(columns, feature_spec):
    ret = []
    for col in columns:
        if feature_spec[col].dtype is tf.int64 or feature_spec[col].dtype is tf.float32:
            ret.append(tf.feature_column.numeric_column(col))
        else:
            ret.append(tf.feature_column.indicator_column(
                tf.feature_column.categorical_column_with_vocabulary_list(col, list(df[col].unique()))))
    return ret

# An input function for providing input to a model from tf.Examples
def tfexamples_input_fn(examples, feature_spec, label, mode=tf.estimator.ModeKeys.EVAL,
                       num_epochs=None, 
                       batch_size=64):
    def ex_generator():
        for i in range(len(examples)):
            yield examples[i].SerializeToString()
    dataset = tf.data.Dataset.from_generator(
      ex_generator, tf.dtypes.string, tf.TensorShape([]))
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=2 * batch_size + 1)
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(lambda tf_example: parse_tf_example(tf_example, label, feature_spec))
    dataset = dataset.repeat(num_epochs)
    return dataset

# Parses Tf.Example protos into features for the input function.
def parse_tf_example(example_proto, label, feature_spec):
    parsed_features = tf.io.parse_example(serialized=example_proto, features=feature_spec)
    target = parsed_features.pop(label)
    return parsed_features, target

# Converts a dataframe into a list of tf.Example protos.
def df_to_examples(df, columns=None):
    examples = []
    if columns == None:
        columns = df.columns.values.tolist()
    for index, row in df.iterrows():
        example = tf.train.Example()
        for col in columns:
            if df[col].dtype is np.dtype(np.int64):
                example.features.feature[col].int64_list.value.append(int(row[col]))
            elif df[col].dtype is np.dtype(np.float64):
                example.features.feature[col].float_list.value.append(row[col])
            elif row[col] == row[col]:
                example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))
        examples.append(example)
    return examples

# Converts a dataframe column into a column of 0's and 1's based on the provided test.
# Used to force label columns to be numeric for binary classification using a TF estimator.
def make_label_column_numeric(df, label_column, test):
  df[label_column] = np.where(test(df[label_column]), 1, 0)

## Read dataset

In [4]:
csv_path = 'dataset/propublica_data_for_fairml.csv'

# Set the column names for the columns in the CSV. If the CSV's first line is a header line containing
# the column names, then set this to None. In this example we do not need to define the column names

# csv_columns = ["Two_yr_Recidivism", "Number_of_Priors", "score_factor", "Age_Above_FourtyFive", "Age_Below_TwentyFive", 
#                "African_American", "Asian", "Hispanic", "Native_American", "Other", "Female", "Misdemeanor" ]

# Read the dataset from the provided CSV and print out information about it
df = pd.read_csv(csv_path, skipinitialspace=True)
df

# If using csv_columns defined above use the code shown below:
# df = pd.read_csv(csv_path, names=csv_columns, skipinitialspace=True)


Unnamed: 0,Two_yr_Recidivism,Number_of_Priors,score_factor,Age_Above_FourtyFive,Age_Below_TwentyFive,African_American,Asian,Hispanic,Native_American,Other,Female,Misdemeanor
0,0,0,0,1,0,0,0,0,0,1,0,0
1,1,0,0,0,0,1,0,0,0,0,0,0
2,1,4,0,0,1,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,0,1
4,1,14,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
6167,0,0,1,0,1,1,0,0,0,0,0,0
6168,0,0,0,0,1,1,0,0,0,0,0,0
6169,0,0,0,1,0,0,0,0,0,1,0,0
6170,0,3,0,0,0,1,0,0,0,0,1,1


## Specify input columns and column to predict
> In this example we will try to use our dataset to identify and predict if an individual will have a misdemeanor based on the features / data provided

> NOTE: In this example all of our values are already binary (1 & 0). In the case that you wish to use text based values this can be done. However, the column that you are predicting on needs to be binary (1 & 0). If our dataset had text values and a misdemeanor was identified by "Yes" or "No", we could use the below code example to turn those values into Binary, again, only the field we are predicting on requires binary values.

> Example code: make_label_column_numeric(df, label_column, lambda val: val == 'Yes')


In [5]:
# Set the column in the dataset you wish for the model to predict

label_column = 'Misdemeanor'

# Set list of all columns from the dataset we will use for the model input
input_features = ["Two_yr_Recidivism", "Number_of_Priors", "score_factor", "Age_Above_FourtyFive", "Age_Below_TwentyFive", 
                "African_American", "Asian", "Hispanic", "Native_American", "Other", "Female"]

# Create a list containing all input features and the label column
features_and_labels = input_features + [label_column]

## Convert dataset to tf.example protos

In [6]:
examples = df_to_examples(df)

## Create and train the linear classifier

In [7]:
num_steps = 2000  #@param {type: "number"}

# Create a feature spec for the classifier
feature_spec = create_feature_spec(df, features_and_labels)

# Define and train the classifier
train_inpf = functools.partial(tfexamples_input_fn, examples, feature_spec, label_column)
classifier = tf.estimator.LinearClassifier(
    feature_columns=create_feature_columns(input_features, feature_spec))
classifier.train(train_inpf, steps=num_steps)

Instructions for updating:
Use Keras preprocessing layers instead, either directly or via the `tf.keras.utils.FeatureSpace` utility. Each of `tf.feature_column.*` has a functional equivalent in `tf.keras.layers` for feature preprocessing when training a Keras model.
Instructions for updating:
Use tf.keras instead.
Instructions for updating:
Use tf.keras instead.
Instructions for updating:
Use tf.keras instead.
Instructions for updating:
Use tf.keras instead.
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\kchp100\\AppData\\Local\\Temp\\tmptxmkx01t', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol

INFO:tensorflow:global_step/sec: 27.155
INFO:tensorflow:loss = 0.558056, step = 100 (3.688 sec)
INFO:tensorflow:global_step/sec: 30.6823
INFO:tensorflow:loss = 0.56280184, step = 200 (3.258 sec)
INFO:tensorflow:global_step/sec: 39.9146
INFO:tensorflow:loss = 0.65184355, step = 300 (2.503 sec)
INFO:tensorflow:global_step/sec: 43.2876
INFO:tensorflow:loss = 0.6953557, step = 400 (2.312 sec)
INFO:tensorflow:global_step/sec: 39.3894
INFO:tensorflow:loss = 0.66575015, step = 500 (2.539 sec)
INFO:tensorflow:global_step/sec: 43.9131
INFO:tensorflow:loss = 0.6295401, step = 600 (2.276 sec)
INFO:tensorflow:global_step/sec: 39.8001
INFO:tensorflow:loss = 0.63185656, step = 700 (2.514 sec)
INFO:tensorflow:global_step/sec: 28.3034
INFO:tensorflow:loss = 0.6153143, step = 800 (3.533 sec)
INFO:tensorflow:global_step/sec: 35.7253
INFO:tensorflow:loss = 0.5769005, step = 900 (2.800 sec)
INFO:tensorflow:global_step/sec: 32.3486
INFO:tensorflow:loss = 0.567963, step = 1000 (3.091 sec)
INFO:tensorflow:gl

<tensorflow_estimator.python.estimator.canned.linear.LinearClassifierV2 at 0x27ec4fb92b0>

## Split our data so that we have train & test data

In [8]:
train_df, test_df = train_test_split(df, test_size=0.33, random_state=42)

## Invoke the What-If Tool for test data and the trained models

In [9]:
num_datapoints = 2000  #@param {type: "number"}
tool_height_in_px = 1000  #@param {type: "number"}

# Load up the test dataset
test_examples = df_to_examples(test_df[0:num_datapoints])

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(test_examples[0:num_datapoints]).set_estimator_and_feature_spec(
    classifier, feature_spec).set_label_vocab(['Will have Misdemeanor', 'Will not have Misdemeanor'])
a = WitWidget(config_builder, height=tool_height_in_px)

# .set_compare_estimator_and_feature_spec(
#     classifier2, feature_spec).set_label_vocab(['Under 50K', 'Over 50K'])
# a = WitWidget(config_builder, height=tool_height_in_px)

# Display our WitWidget defined

In [10]:
a

WitWidget(config={'model_type': 'classification', 'label_vocab': ['Will have Misdemeanor', 'Will not have Misd…

# Compare models using Google What If Toolkit
> We can also use the Google What If Toolkit not on a singular model but on multiple models. We can use the Toolkit in order to compare how two models predict, this can be useful for the selection of a relevant model to a specific use
> In the below example we will deploy an additional DNN Classifier model and identify how this compares to our Linear Classifier model

## Create and train our DNN Classifier model

In [31]:
num_steps_2 = 2000
classifier2 = tf.estimator.DNNClassifier(
    feature_columns=create_feature_columns(input_features, feature_spec),
    hidden_units=[128, 64, 32])
classifier2.train(train_inpf, steps=num_steps_2)



<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifierV2 at 0x224dd4da970>

## Invoke What-If Tool for test data and the trained models

In [32]:
num_datapoints = 2000  #@param {type: "number"}
tool_height_in_px = 1000  #@param {type: "number"}

# Load up the test dataset
test_examples = df_to_examples(test_df[0:num_datapoints])

# Setup the tool with the test examples and the trained classifier
config_builder = WitConfigBuilder(test_examples[0:num_datapoints]).set_estimator_and_feature_spec(
    classifier, feature_spec).set_compare_estimator_and_feature_spec(
    classifier2, feature_spec).set_label_vocab(['Under 50K', 'Over 50K'])
a = WitWidget(config_builder, height=tool_height_in_px)

# Display our witwidget defined

In [1]:
a

NameError: name 'a' is not defined

#### Exploration ideas

- Organize datapoints by setting X-axis scatter to "inference score 1" and Y-axis scatter to "inference score 2" to see how each datapoint differs in score between the linear model (1) and DNN model (2). Points off the diagonal have differences in results between the two models.
  - Are there patterns of which datapoints don't agree between the two models?
  - If you set the ground truth feature dropdown in the "Performance + Fairness" tab to "Over-50K", then you can color or bin the datapoints by "inference correct 1" or "inference correct 2". Are there patterns of which datapoints are incorrect for model 1? For model 2?

- Explore performance of the two models through the confusion matrices in the "Performance + Fairness" tab. Which model is best? Train either model for longer and see if you can change this. Are the rates of errors (false positives and false negatives) that the two models make different?
  - Click the "optimize threshold" button to set the optimal positive classification threshold for each model based on the current cost ratio of 1. How do those thresholds and the resulting confusion matrices differ?
    - Change the cost ratio and optimize the threshold again. How does the threshold and performance change on the two models?
  - Slice the dataset by features, such as "sex" or "race". Does either model have more-equal performance between slices?
    - Use the threshold optimization buttons to set optimal thresholds based on the different fairness constraints. How does performance between slices differ between the two models. Does one require larger differences in threshold values per slice to achieve the desired constraint?

- Looking at the create_feature_columns function in the "Define helper methods" cell, categorical features use one-hot encodings in the model. Perhaps change a many-valued categorical feature, such as education to use an embedding layer. Does anything change in the model behavior (can look through partial dependence plots as one way to investigate).