Copyright 2018 Google LLC.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

# Evaluation code


__Disclaimer__
*   This notebook contains experimental code, which may change without notice.
*   The ideas presented here are relevant to fairness, but they are not the whole story!



# Notebook summary

This notebook evaluates a list of models along two dimensions:
- "Performance": how well the model classifies the data (intended bias). Currently, we use the AUC.
- "Bias": how much unintended bias the model contains. Currently, we use the pinned AUC.

This notebook takes the following steps:

- Defines the models to evaluate and specifies their signatures (expected inputs/outputs).
- Writes input functions to generate two datasets:
    - A "performance dataset", used for the first set of metrics. This dataset should have a format similar to the training data (a piece of text and a label).
    - A "bias dataset", used for the second set of metrics. This dataset contains a piece of text and a label, plus subgroup information on which to evaluate unintended bias.
- Runs predictions with the `utils_export` package.
- Evaluates metrics.

In [1]:
%load_ext autoreload

In [2]:
%autoreload 2

In [3]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import getpass
from IPython.display import display
import json
import nltk
import numpy as np
import pandas as pd
import pkg_resources
import os
import random
import re
import seaborn as sns

import tensorflow as tf
from tensorflow.python.lib.io import file_io

In [4]:
#from google.colab import auth
#auth.authenticate_user()

In [5]:
#!pip install -U -q git+https://github.com/conversationai/unintended-ml-bias-analysis

In [6]:
from unintended_ml_bias import model_bias_analysis

In [7]:
import input_fn_example
from utils_export.dataset import Dataset, Model
from utils_export import utils_cloudml
from utils_export import utils_tfrecords

In [8]:
os.environ['GCS_READ_CACHE_MAX_SIZE_MB'] = '0'  # Faster access to GCS files; see https://github.com/tensorflow/tensorflow/issues/15530

In [9]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/nthain/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Settings

### Global variables

In [10]:
# User inputs
PROJECT_NAME = 'conversationai-models'

# Part 1: Defining your model

An important user input is the description of the deployed models to evaluate.

1- Defining which models will be used.
$MODEL_NAMES defines the models to evaluate (format: "model_name:version").

2- Defining the model signature.
Currently, the `Dataset` API does not detect the signature of a CMLE model, so this information is given by a `Model` instance.
You need to describe:
- The input spec (argument `feature_keys_spec`): what the input file should contain. It is a dictionary mapping field names to their types.
- The prediction keys (argument `prediction_keys`): the name of the prediction field in the model output.
- The example key (argument `example_key`): a unique identifier for each sentence, generated by the `Dataset` API (i.e., your input data does not need to have this field).
    - When using Cloud MLE for batch predictions, data is processed in an unpredictable order. To match the returned predictions with your input instances, you must define instance keys.
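
To illustrate why the example key matters, here is a minimal sketch of re-aligning out-of-order batch predictions with their inputs. The helper `join_predictions` and the record layout are hypothetical (we assume each prediction line is a JSON object that echoes back the example key); only the key name `comment_key` and the output field `probabilities` come from this notebook.

```python
import json


def join_predictions(instances, prediction_lines, example_key='comment_key'):
    """Attach each prediction to its input instance using the example key.

    Hypothetical helper: assumes every prediction line is a JSON object
    containing the example key next to the model output.
    """
    predictions_by_key = {}
    for line in prediction_lines:
        record = json.loads(line)
        predictions_by_key[record[example_key]] = record
    # Walk the inputs in their original order and look up each prediction.
    return [(instance, predictions_by_key.get(instance[example_key]))
            for instance in instances]


instances = [
    {'comment_key': 0, 'tokens': ['hello', 'world']},
    {'comment_key': 1, 'tokens': ['good', 'morning']},
]
# Predictions come back in a different order than the inputs.
prediction_lines = [
    '{"comment_key": 1, "probabilities": [0.2, 0.8]}',
    '{"comment_key": 0, "probabilities": [0.9, 0.1]}',
]
joined = join_predictions(instances, prediction_lines)
```

Without a key, the only way to pair rows would be by position, which the unpredictable processing order rules out.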

In [11]:
# User inputs:
MODEL_NAMES = [
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738', # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748', # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820', # ??
    'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828', # ??
]

In [12]:
# User inputs: Model description (see above for more info).
TEXT_FEATURE_NAME = 'tokens'  # Input defined in the serving function called in run.py (arg: `text_feature_name`).
SENTENCE_KEY = 'comment_key'  # Input key defined in the serving function called in run.py (arg: `example_key_name`).
#LABEL_NAME_PREDICTION_MODEL = 'scores'  # Output prediction: typically $label_name/logistic
LABEL_NAME_PREDICTION_MODEL = 'probabilities'  # Output prediction: typically $label_name/logistic

In [13]:
model_input_spec = {
    TEXT_FEATURE_NAME: utils_tfrecords.EncodingFeatureSpec.LIST_STRING} #library will use this automatically

model = Model(
    feature_keys_spec=model_input_spec,
    prediction_keys=LABEL_NAME_PREDICTION_MODEL,
    example_key=SENTENCE_KEY,
    model_names=MODEL_NAMES,
    project_name=PROJECT_NAME)

# Part 2: Defining the input_fn

In [14]:
def tokenizer(text, lowercase=True):
  """Converts text to a list of words.

  Args:
    text: piece of text to tokenize (UTF-8 encoded bytes or unicode string).
    lowercase: whether to include lowercasing in preprocessing (boolean).

  Returns:
    A list of strings (words).
  """
  # Decode only when given raw bytes, so already-decoded text also works.
  if isinstance(text, bytes):
    text = text.decode('utf-8')
  words = nltk.word_tokenize(text)
  if lowercase:
    words = [w.lower() for w in words]
  return words

### Defining input_fn

We first need to define the input_fn functions that will be fed to the `Dataset` API.
An input_fn must meet the following requirements:
- It returns a pandas DataFrame.
- It has an argument 'max_n_examples' to control the size of the DataFrame.
- The DataFrame contains at least a field $TEXT_FEATURE_NAME, which maps to tokenized text (a list of words), AND a field 'label' (1 for toxic and 0 otherwise in the toxicity datasets; an integer class index in the multiclass dataset used here).

We will define two different input_fn (one for performance, one for bias). The bias input_fn should also contain identity information.

Note: You can use ANY input_fn that meets these requirements. You can find a few examples of input_fn in the file input_fn_example.py (for the toxicity and civil_comments datasets).
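
As a rough illustration of these requirements, the sketch below defines a toy input_fn. The function `toy_input_fn`, its rows, and the label values are made up for this example, and we have not checked it against the `Dataset` class's compatibility test; it only shows the expected shape of the return value.

```python
import pandas as pd

TEXT_FEATURE_NAME = 'tokens'  # Same field name as used in this notebook.


def toy_input_fn(max_n_examples=None):
    """Hypothetical input_fn: returns tokenized text and a label per row."""
    rows = [
        {TEXT_FEATURE_NAME: ['she', 'is', 'a', 'nurse'], 'label': 17},
        {TEXT_FEATURE_NAME: ['he', 'teaches', 'physics'], 'label': 25},
        {TEXT_FEATURE_NAME: ['he', 'practices', 'law'], 'label': 3},
    ]
    df = pd.DataFrame(rows)
    # Honor the size limit when one is given.
    if max_n_examples is not None:
        df = df.head(max_n_examples)
    return df


toy_df = toy_input_fn(max_n_examples=2)
```

A bias input_fn would follow the same pattern and add the identity columns to each row.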

In [15]:
# User inputs: Choose which one you want to use OR create your own!
INPUT_FN_PERFORMANCE = input_fn_example.create_input_fn_biasbios(
    tokenizer,
    model_input_comment_field=TEXT_FEATURE_NAME,
    )

# Part 3: Running prediction

### Performance dataset

In [16]:
# User inputs
SIZE_PERFORMANCE_DATA_SET = 10000

In [17]:
# Pattern for path of tf_records
PERFORMANCE_DATASET_DIR = os.path.join(
    'gs://conversationai-models/',
    getpass.getuser(),
    'tfrecords',
    'performance_dataset_dir')
print(PERFORMANCE_DATASET_DIR)

gs://conversationai-models/nthain/tfrecords/performance_dataset_dir


In [18]:
dataset_performance = Dataset(INPUT_FN_PERFORMANCE, PERFORMANCE_DATASET_DIR)
random.seed(2018) # Need to set seed before loading data to be able to reload same data in the future
dataset_performance.load_data(SIZE_PERFORMANCE_DATA_SET, random_filter_keep_rate=0.5)

INFO:tensorflow:input_fn is compatible with the `Dataset` class.
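
The seed is set before loading because the data is randomly filtered. How `load_data` applies `random_filter_keep_rate` internally is not shown in this notebook, so the sketch below uses a hypothetical `random_filter` helper only to illustrate the principle: with the same seed, a random keep/drop filter selects the same rows on every run.

```python
import random


def random_filter(items, keep_rate):
    """Keep each item independently with probability `keep_rate`.

    Hypothetical stand-in for a random filtering step.
    """
    return [item for item in items if random.random() < keep_rate]


random.seed(2018)
first_run = random_filter(range(20), keep_rate=0.5)

# Resetting the seed reproduces exactly the same selection.
random.seed(2018)
second_run = random_filter(range(20), keep_rate=0.5)
```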




In [19]:
dataset_performance.show_data()

Unnamed: 0,tokens,gender,label
0,"[in, her, role, ,, she, is, a, member, of, an,...",F,17
1,"[his, blog, www.donaldhtaylorjr.blogspot.com, ...",M,25
2,"[he, has, primarily, reported, for, the, atlan...",M,12
3,"[andrea, 's, area, of, expertise, is, in, whol...",F,25
4,"[dr., milane, was, trained, as, a, national, c...",F,25
5,"[he, is, also, visiting, associate, professor,...",M,25
6,"[her, research, focuses, on, the, trafficking,...",F,25
7,"[he, has, been, licensed, to, practice, law, i...",M,3
8,"[after, a, two-year, postdoctoral, fellowship,...",M,25
9,"[prior, to, teaching, ,, she, was, an, account...",F,31


In [20]:
dataset_performance.show_data().shape

(10000, 3)

In [21]:
dataset_performance.show_data().columns

Index([u'tokens', u'gender', u'label'], dtype='object')

In [22]:
CLASS_NAMES = range(33)

In [23]:
INPUT_DATA = 'gs://conversationai-models/biosbias/dataflow_dir/data-preparation-20190220165938/eval-00000-of-00003.tfrecord'
record_iterator = tf.python_io.tf_record_iterator(path=INPUT_DATA)
string_record = next(record_iterator)
example = tf.train.Example()
example.ParseFromString(string_record)
text = example.features.feature
print(example)

features {
  feature {
    key: "comment_text"
    value {
      bytes_list {
        value: " In her role, she is a member of an innovative team-based care model which has been recognized by Wall Street Journal and the Robert Wood Johnson Foundation. A process improvement leader with a passion for serving vulnerable populations, Amberly was recognized by her colleagues with the first Daisy Award for Extraordinary Nurses at Cambridge Health Alliance. Amberly holds a BS in Nursing from Valparaiso University and a Masters in Public Health from the University of Massachusetts Amherst. read more"
      }
    }
  }
  feature {
    key: "gender"
    value {
      bytes_list {
        value: "F"
      }
    }
  }
  feature {
    key: "title"
    value {
      int64_list {
        value: 17
      }
    }
  }
}



In [24]:
# Set recompute_predictions=False to save time if predictions are available.
dataset_performance.add_model_prediction_to_data(model, recompute_predictions=False, class_names=CLASS_NAMES)

INFO:tensorflow:Model is compatible with the `Dataset` instance.


In [25]:
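def _load_predictions(pred_file):
    """Loads newline-delimited JSON predictions; returns [] if parsing fails."""
    with file_io.FileIO(pred_file, 'r') as f:
        # The prediction file needs to fit in memory.
        try:
            predictions = [json.loads(line) for line in f]
        except ValueError:
            # json.loads raises ValueError on malformed input.
            predictions = []
    return predictions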

model_name_tmp = MODEL_NAMES[0]
prediction_file = dataset_performance.get_path_prediction(model_name_tmp)
print(prediction_file)
prediction_file = os.path.join(prediction_file,
                                 'prediction.results-00000-of-00001')
print(len(_load_predictions(prediction_file)[0]['probabilities']))

gs://conversationai-models/nthain/tfrecords/performance_dataset_dir/prediction_data_tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738
33


### Post processing

In [26]:
test_performance_df = dataset_performance.show_data()

In [27]:
test_bias_df = test_performance_df.copy()

### Analyzing final results

In [28]:
test_performance_df.head()

Unnamed: 0,tokens,gender,label,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_0,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_1,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_2,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_3,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_4,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_5,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_6,...,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_23,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_24,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_25,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_26,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_27,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_28,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_29,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_30,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_31,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_32
0,"[in, her, role, ,, she, is, a, member, of, an,...",F,17,0.001687,1.814099e-11,0.002681,0.009853,0.004227,0.055716,0.003005,...,0.003351,0.013561,0.00204,0.001682,0.0004412969,6.086852000000001e-17,0.001606,0.001379,0.014635,3.2e-05
1,"[his, blog, www.donaldhtaylorjr.blogspot.com, ...",M,25,0.014774,2.716771e-13,0.005496,0.022347,0.003845,0.08448,9.6e-05,...,0.010309,0.001055,0.001062,0.006205,9.439933e-07,5.250679e-18,0.001204,0.00015,0.015252,0.000779
2,"[he, has, primarily, reported, for, the, atlan...",M,12,0.016779,8.870694e-16,0.001688,0.071343,0.00056,0.029823,3.2e-05,...,0.018767,0.022292,0.077598,0.033979,8.196229e-05,3.315851e-11,0.007313,0.002565,0.118167,0.001603
3,"[andrea, 's, area, of, expertise, is, in, whol...",F,25,0.017742,1.019689e-15,0.01715,0.052085,0.002097,0.052322,0.002627,...,0.00158,0.145462,0.000637,0.000337,0.0003909138,1.304484e-21,0.011515,0.000922,0.029867,1e-06
4,"[dr., milane, was, trained, as, a, national, c...",F,25,0.015531,1.783027e-12,0.196227,0.016471,0.00269,4e-05,0.001384,...,0.013445,0.003754,0.22009,0.081232,7.920414e-05,2.406181e-13,0.150817,0.014913,0.071632,0.000142


In [29]:
test_bias_df.head()

Unnamed: 0,tokens,gender,label,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_0,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_1,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_2,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_3,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_4,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_5,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738_6,...,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_23,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_24,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_25,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_26,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_27,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_28,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_29,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_30,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_31,tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828_32
0,"[in, her, role, ,, she, is, a, member, of, an,...",F,17,0.001687,1.814099e-11,0.002681,0.009853,0.004227,0.055716,0.003005,...,0.003351,0.013561,0.00204,0.001682,0.0004412969,6.086852000000001e-17,0.001606,0.001379,0.014635,3.2e-05
1,"[his, blog, www.donaldhtaylorjr.blogspot.com, ...",M,25,0.014774,2.716771e-13,0.005496,0.022347,0.003845,0.08448,9.6e-05,...,0.010309,0.001055,0.001062,0.006205,9.439933e-07,5.250679e-18,0.001204,0.00015,0.015252,0.000779
2,"[he, has, primarily, reported, for, the, atlan...",M,12,0.016779,8.870694e-16,0.001688,0.071343,0.00056,0.029823,3.2e-05,...,0.018767,0.022292,0.077598,0.033979,8.196229e-05,3.315851e-11,0.007313,0.002565,0.118167,0.001603
3,"[andrea, 's, area, of, expertise, is, in, whol...",F,25,0.017742,1.019689e-15,0.01715,0.052085,0.002097,0.052322,0.002627,...,0.00158,0.145462,0.000637,0.000337,0.0003909138,1.304484e-21,0.011515,0.000922,0.029867,1e-06
4,"[dr., milane, was, trained, as, a, national, c...",F,25,0.015531,1.783027e-12,0.196227,0.016471,0.00269,4e-05,0.001384,...,0.013445,0.003754,0.22009,0.081232,7.920414e-05,2.406181e-13,0.150817,0.014913,0.071632,0.000142


# Part 4: Run evaluation metrics

## Performance metrics

### Data Format

At this point, our performance data is in the DataFrame `test_performance_df`, with columns:

- label: the true class of the example (an integer class index here; for toxicity data, True if the comment is toxic, False otherwise).
- < model name >_< class >: one column per model and class; cells contain the score from that model for that class.

You can run the analysis below on any data in this format. Subgroup labels can be generated via words in the text as done above, or come from human labels if you have them.

### Run AUC

In [30]:
import sklearn.metrics as metrics

In [31]:
test_performance_df.label.value_counts()

25    3295
3      890
22     661
12     542
26     507
23     494
17     481
31     427
30     343
7      268
2      265
18     209
16     202
24     197
29     194
10     185
6      156
0      141
8      102
5       87
20      67
4       58
32      50
19      41
9       39
11      37
27      32
21      30
Name: label, dtype: int64

In [32]:
test_performance_df['label'] == 3

0       False
1       False
2       False
3       False
4       False
5       False
6       False
7        True
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18      False
19       True
20      False
21      False
22      False
23      False
24      False
25      False
26      False
27      False
28       True
29      False
        ...  
9970    False
9971    False
9972    False
9973    False
9974     True
9975    False
9976    False
9977    False
9978    False
9979    False
9980    False
9981    False
9982    False
9983    False
9984    False
9985    False
9986    False
9987    False
9988    False
9989    False
9990    False
9991    False
9992    False
9993    False
9994    False
9995    False
9996    False
9997    False
9998    False
9999    False
Name: label, Length: 10000, dtype: bool

In [33]:
_model = 'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738'
_class = 3
test_performance_df['{}_{}'.format(_model, _class)]

0       0.009853
1       0.022347
2       0.071343
3       0.052085
4       0.016471
5       0.101164
6       0.011855
7       0.001939
8       0.577954
9       0.128116
10      0.014246
11      0.022629
12      0.050127
13      0.205395
14      0.038603
15      0.045960
16      0.652514
17      0.099024
18      0.055800
19      0.167238
20      0.056128
21      0.073346
22      0.040896
23      0.046719
24      0.066602
25      0.015700
26      0.018788
27      0.099245
28      0.744404
29      0.054567
          ...   
9970    0.025056
9971    0.032513
9972    0.059166
9973    0.030145
9974    0.146219
9975    0.132243
9976    0.061952
9977    0.497093
9978    0.154263
9979    0.033800
9980    0.041427
9981    0.000079
9982    0.071002
9983    0.961150
9984    0.017224
9985    0.113003
9986    0.040686
9987    0.729384
9988    0.025192
9989    0.066657
9990    0.025502
9991    0.011763
9992    0.007214
9993    0.004737
9994    0.044174
9995    0.125944
9996    0.199613
9997    0.0188

In [34]:
auc_list = []
for _model in MODEL_NAMES:
    for _class in CLASS_NAMES:
        fpr, tpr, thresholds = metrics.roc_curve(
            test_performance_df['label'] == _class,
            test_performance_df['{}_{}'.format(_model, _class)])
        _auc = metrics.auc(fpr, tpr)
        auc_list.append(_auc)
        print ('Auc for class {} model {}: {}'.format(_class, _model, _auc))

Auc for class 0 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.472880379306
Auc for class 1 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: nan
Auc for class 2 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.494346987625
Auc for class 3 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.5094779166
Auc for class 4 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.579115768006
Auc for class 5 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.495869234756
Auc for class 6 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.468048349118
Auc for class 7 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.485770898896
Auc for class 8 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.491489665173
Auc for class 9 model t



Auc for class 30 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748: 0.493638808206
Auc for class 31 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748: 0.508299713945
Auc for class 32 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748: 0.457780904523
Auc for class 0 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.496740926496
Auc for class 1 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: nan
Auc for class 2 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.499153608357
Auc for class 3 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.499355443456
Auc for class 4 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.519405656255
Auc for class 5 model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.510566062676
Auc for class 6 mo
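
The `nan` reported for class 1 is consistent with that class not appearing in the `value_counts` output above: a one-vs-rest AUC is undefined when there are no positive examples. The sketch below is not the scikit-learn implementation used in the cell; it is a plain-Python stand-in (the helper `one_vs_rest_auc` is hypothetical) that computes the same quantity as the fraction of correctly ordered (positive, negative) pairs.

```python
def one_vs_rest_auc(labels, scores, target_class):
    """AUC for `target_class` versus all other classes.

    Computed as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score, counting ties as one half.
    Returns NaN when either side has no examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == target_class]
    negatives = [s for y, s in zip(labels, scores) if y != target_class]
    if not positives or not negatives:
        return float('nan')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [3, 25, 3, 17]
scores_for_class_3 = [0.9, 0.2, 0.6, 0.7]
auc_present = one_vs_rest_auc(labels, scores_for_class_3, target_class=3)
auc_absent = one_vs_rest_auc(labels, scores_for_class_3, target_class=1)
```

The pairwise loop is quadratic, so it is only meant for small illustrations; for the full dataset, the `roc_curve` and `auc` calls above are the practical choice.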

In [55]:
def get_class_from_col_name(col_name):
    pattern = r'^.*_(\d+)$'
    return int(re.search(pattern, col_name).group(1))
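
The model names themselves contain underscores followed by digits (for example `v_20190306_132738`), so it is worth checking that the pattern picks up only the trailing class index. The short check below repeats the function so that it runs on its own; the leading `.*` is greedy, which makes the capture group bind to the last `_<digits>` suffix.

```python
import re


def get_class_from_col_name(col_name):
    pattern = r'^.*_(\d+)$'
    return int(re.search(pattern, col_name).group(1))


model = 'tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738'
single_digit = get_class_from_col_name('{}_{}'.format(model, 3))
double_digit = get_class_from_col_name('{}_{}'.format(model, 12))
```

Note that a column without a numeric suffix, such as `label`, makes `re.search` return `None`, so the function raises an `AttributeError`. It should only be applied to the per-class score columns.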

In [62]:
def find_best_class(df, model_name, class_names):
    model_class_names = ['{}_{}'.format(model_name, class_name) for class_name in class_names]
    sub_df = df[model_class_names]
    df['{}_class'.format(model_name)] = sub_df.idxmax(axis=1).apply(get_class_from_col_name)

In [63]:
for _model in MODEL_NAMES:
    find_best_class(test_performance_df, _model, CLASS_NAMES)

In [64]:
accuracy_list = []
for _model in MODEL_NAMES:
    is_correct = (test_performance_df['{}_class'.format(_model)] == test_performance_df['label'])
    _acc = sum(is_correct)/len(is_correct)
    accuracy_list.append(_acc)
    print ('Accuracy for model {}: {}'.format(_model, _acc))

Accuracy for model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132738: 0.0572
Accuracy for model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132748: 0.0639
Accuracy for model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132820: 0.0681
Accuracy for model tf_trainer_tf_gru_attention_multiclass_biosbias_glove:v_20190306_132828: 0.0623
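
One way to put these accuracy numbers in context is to compare them with a majority-class baseline, that is, the accuracy obtained by always predicting the most frequent label. The helper below is a hypothetical sketch on toy labels. Applied to this sample, where label 25 accounts for 3295 of the 10000 examples according to the `value_counts` output above, the baseline would be 0.3295.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / float(len(labels))


toy_labels = [25, 25, 25, 3, 17]
baseline = majority_class_baseline(toy_labels)
```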


## Unintended Bias Metrics

### Data Format
At this point, our bias data is in the DataFrame `test_bias_df`, with columns:

*   label: True if the comment is toxic, False otherwise.
*   < model name >: One column per model; cells contain the score from that model.
*   < subgroup >: One column per identity, True if the comment mentions this identity.

You can run the analysis below on any data in this format. Subgroup labels can be generated via words in the text as done above, or come from human labels if you have them.
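
The subgroup columns have to exist in the DataFrame before the cells below can index them. As a sketch, assuming the `gender` field is the subgroup of interest for this dataset, boolean columns could be derived as follows. The column names `female` and `male` and the toy rows are assumptions made for illustration, not part of the original pipeline.

```python
import pandas as pd

toy_bias_df = pd.DataFrame({
    'gender': ['F', 'M', 'M', 'F'],
    'label': [17, 25, 12, 25],
})

# Hypothetical subgroup columns derived from the existing `gender` field.
toy_bias_df['female'] = toy_bias_df['gender'] == 'F'
toy_bias_df['male'] = toy_bias_df['gender'] == 'M'

# Keep only subgroups with enough examples (the notebook uses 20; 2 here).
MIN_EXAMPLES = 2
subgroups = [name for name in ['female', 'male']
             if toy_bias_df[name].sum() >= MIN_EXAMPLES]
```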


In [35]:
identity_terms_civil_included = []
for _term in input_fn_example.identity_terms_civil:
    if sum(test_bias_df[_term]) >= 20:
        print ('keeping {}'.format(_term))
        identity_terms_civil_included.append(_term)

KeyError: 'male'

In [None]:
test_bias_df['model_1'] = test_bias_df['tf_gru_attention_civil:v_20181109_164318']
test_bias_df['model_2'] = test_bias_df['tf_gru_attention_civil:v_20181109_164403']
test_bias_df['model_3'] = test_bias_df['tf_gru_attention_civil:v_20181109_164535']
test_bias_df['model_4'] = test_bias_df['tf_gru_attention_civil:v_20181109_164630']

In [None]:
MODEL_NAMES = ['model_1', 'model_2', 'model_3', 'model_4']

In [None]:
bias_metrics = model_bias_analysis.compute_bias_metrics_for_models(test_bias_df, identity_terms_civil_included, MODEL_NAMES, 'label')

In [None]:
model_bias_analysis.plot_auc_heatmap(bias_metrics, MODEL_NAMES)

In [None]:
model_bias_analysis.plot_aeg_heatmap(bias_metrics, MODEL_NAMES)