# Edge Probing Side-by-Side Examples

This notebook loads predictions from two runs and mines them for interesting examples where one run wins or loses against the other.

In [None]:
import sys, os, re, json
from importlib import reload

from src.utils import utils

In [None]:
from tqdm import tqdm
import pandas as pd
import numpy as np

In [3]:
import analysis; reload(analysis)

<module 'analysis' from '/nfs/jsalt/home/iftenney/jiant_test/probing/analysis.py'>

In [4]:
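def load_raw_preds(task_name, exp_type, split_name="val"):
    """Load the raw prediction records from a run's JSON file, as a list."""
    # project_dir is a global, set in the configuration cell below.
    preds_path = f"{project_dir}/{exp_type}-{task_name}/run/{task_name}_{split_name}.json"
    return list(utils.load_json_data(preds_path))

def load_task_preds(task_name, exp_type, split_name="val"):
    """Load a run's predictions as an analysis.Predictions object."""
    run_dir = f"{project_dir}/{exp_type}-{task_name}/run"
    return analysis.Predictions.from_run(run_dir, task_name, split_name)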

Use the cell below to configure what to load:

- `project_dir` should be the path to a directory of experiments
- `exp_types` should be the two experiment types to compare
- `task_name` is the task to look at
- `split_name` is the split (`val` or `test`) to look at

At minimum, you'll want to point `project_dir` to something available on your system.

This assumes that the project directory contains experiments named as `{exp_type}-{task_name}`, each containing a single run named `run`.

In [5]:
# project_dir = "/nfs/jsalt/exp/edges-20180926-elmofix"
# exp_types = ["elmo-chars", "elmo-full"]
# project_dir = "/nfs/jsalt/exp/edges-20190124-bert"
# exp_types = ["bert-base-uncased-lex", "bert-base-uncased-cat"]
project_dir = "/nfs/jsalt/home/iftenney/exp/bert_mix_20190129"
exp_types = ["bert-large-uncased-lex", "bert-large-uncased-mix"]

# task_name = "edges-srl-conll2012"
# task_name = "edges-spr2"
task_name = "edges-coref-ontonotes-conll"
split_name = "val"  # look at development / validation sets
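
Before loading anything, it can help to confirm that the configured values resolve to the `{exp_type}-{task_name}/run` layout described above. The following is a minimal, standalone sketch of such a check. `expected_run_paths` and the `/tmp/my_project`, `model-a`, and `model-b` values are hypothetical and used only for illustration; the file name pattern mirrors the one built in `load_raw_preds`.

```python
from pathlib import Path


def expected_run_paths(project_dir, exp_types, task_name, split_name="val"):
    """Map each experiment type to the prediction file it is expected to provide."""
    paths = {}
    for exp_type in exp_types:
        run_dir = Path(project_dir) / f"{exp_type}-{task_name}" / "run"
        paths[exp_type] = run_dir / f"{task_name}_{split_name}.json"
    return paths


# Hypothetical configuration, for illustration only.
demo_paths = expected_run_paths(
    "/tmp/my_project", ["model-a", "model-b"], "edges-coref-ontonotes-conll"
)
for exp_type, path in demo_paths.items():
    status = "found" if path.exists() else "MISSING"
    print(f"{exp_type}: {path} [{status}]")
```

Running a check like this with the real `project_dir` and `exp_types` should surface a misnamed experiment directory before the slower loading cells run.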

We'll load the raw predictions (records loaded from JSON) and also process them with `load_task_preds`, which uses `analysis.Predictions`, into a long-form DataFrame. The DataFrame makes it easy to score groups of targets; the example index then lets us retrieve the full records from the raw predictions.

In [6]:
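# Pass split_name explicitly so the configured split is the one that is loaded
r0 = load_raw_preds(task_name, exp_types[0], split_name)
r1 = load_raw_preds(task_name, exp_types[1], split_name)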

In [7]:
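# Pass split_name explicitly so the configured split is the one that is loaded
p0 = load_task_preds(task_name, exp_types[0], split_name)
p1 = load_task_preds(task_name, exp_types[1], split_name)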

02/09/2019 02:25:52 - INFO - root -   Loading vocabulary from /nfs/jsalt/home/iftenney/exp/bert_mix_20190129/bert-large-uncased-lex-edges-coref-ontonotes-conll/vocab
02/09/2019 02:25:52 - INFO - allennlp.data.vocabulary -   Loading token dictionary from /nfs/jsalt/home/iftenney/exp/bert_mix_20190129/bert-large-uncased-lex-edges-coref-ontonotes-conll/vocab.
02/09/2019 02:25:52 - INFO - root -   Loading predictions from /nfs/jsalt/home/iftenney/exp/bert_mix_20190129/bert-large-uncased-lex-edges-coref-ontonotes-conll/run/edges-coref-ontonotes-conll_val.json
02/09/2019 02:25:53 - INFO - root -   Loading vocabulary from /nfs/jsalt/home/iftenney/exp/bert_mix_20190129/bert-large-uncased-mix-edges-coref-ontonotes-conll/vocab
02/09/2019 02:25:53 - INFO - allennlp.data.vocabulary -   Loading token dictionary from /nfs/jsalt/home/iftenney/exp/bert_mix_20190129/bert-large-uncased-mix-edges-coref-ontonotes-conll/vocab.
02/09/2019 02:25:53 - INFO - root -   Loading predictions from /nfs/jsalt/home/i

The cell below will score the predictions; this might take a couple minutes to run on larger datasets.

In [8]:
def score_by_example(df):
    """Score targets grouped by example index; returns one row per example."""
    gb = df.groupby(by='ex_idx')
    records = []
    for key, idxs in tqdm(gb.groups.items()):
        sub_df = df.loc[idxs]
        record = analysis.Predictions.score_long_df(sub_df)
        record['ex_idx'] = key
        records.append(record)
    score_df = pd.DataFrame.from_records(records)

    # Fill missing (NaN) scores: 1.0 for precision and recall, 0.0 for F1
    score_df['precision'] = analysis.get_precision(score_df).fillna(value=1.0)
    score_df['recall'] = analysis.get_recall(score_df).fillna(value=1.0)
    score_df['f1'] = analysis.get_f1(score_df).fillna(value=0.0)

    return score_df

s0 = score_by_example(p0.target_df_long)
s1 = score_by_example(p1.target_df_long)

02/09/2019 02:25:54 - INFO - root -   Generating long-form target DataFrame. May be slow... 
02/09/2019 02:25:54 - INFO - root -   span2 detected; adding span_distance to long-form DataFrame.
02/09/2019 02:25:54 - INFO - root -   Done!
100%|██████████| 5044/5044 [00:07<00:00, 689.80it/s]
02/09/2019 02:26:01 - INFO - root -   Generating long-form target DataFrame. May be slow... 
02/09/2019 02:26:01 - INFO - root -   span2 detected; adding span_distance to long-form DataFrame.
02/09/2019 02:26:01 - INFO - root -   Done!
100%|██████████| 5044/5044 [00:07<00:00, 695.68it/s]
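
The `fillna` calls in `score_by_example` choose a value for examples whose score would otherwise be missing. The sketch below shows one way such gaps can arise, assuming the standard count-based definitions of precision, recall, and F1; the actual `analysis.get_precision`, `analysis.get_recall`, and `analysis.get_f1` helpers may be implemented differently. `add_prf` and the toy counts are hypothetical.

```python
import pandas as pd


def add_prf(score_df):
    """Add precision / recall / F1 columns computed from per-example counts."""
    out = score_df.copy()
    tp, fp, fn = out["tp_count"], out["fp_count"], out["fn_count"]
    # In pandas, 0 / 0 evaluates to NaN; fillna then applies a convention.
    out["precision"] = (tp / (tp + fp)).fillna(1.0)  # nothing was predicted
    out["recall"] = (tp / (tp + fn)).fillna(1.0)     # nothing was there to find
    out["f1"] = (2 * tp / (2 * tp + fp + fn)).fillna(0.0)
    return out


# Toy counts (assumed values): a good example, an empty one, and a total miss.
toy = pd.DataFrame({
    "ex_idx":   [0, 1, 2],
    "tp_count": [3, 0, 0],
    "fp_count": [1, 0, 2],
    "fn_count": [0, 0, 2],
})
scored = add_prf(toy)
print(scored)
```

Under these assumptions, an example with no positive predictions and no positive targets gets precision and recall of 1.0 but an F1 of 0.0, so it never appears as a win when sorting by F1.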


### Find examples with a large gain from base -> expt

We'll group targets by the input example (`ex_idx`) and look at the scores across the whole example. Each of the cells below should print a filtered set of example indices that might be worth looking at.

`mdf` merges the per-example scores from base and expt runs (i.e. `exp_types[0]` and `exp_types[1]`) so we can look for sentences where, for example, the lexical model does poorly but the full-context model gets most of the targets right.

In [11]:
mdf = pd.merge(s0, s1, how='inner', on='ex_idx', suffixes=("_base", "_expt"))

mdf['f1_delta'] = mdf["f1_expt"] - mdf["f1_base"]
mdf['abs_f1_delta'] = mdf['f1_delta'].map(np.abs)
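
To make the merge concrete, here is a small standalone sketch with made-up scores. It mirrors the call above: an inner join on `ex_idx`, so only examples scored in both runs survive, with `_base` and `_expt` suffixes applied to the overlapping column names. The `base`, `expt`, `merged`, and `wins` names and all values are hypothetical.

```python
import pandas as pd

# Toy per-example scores (assumed values). Example 2 exists only in base,
# example 3 only in expt, so the inner join drops both.
base = pd.DataFrame({"ex_idx": [0, 1, 2], "f1": [0.0, 0.5, 1.0]})
expt = pd.DataFrame({"ex_idx": [0, 1, 3], "f1": [1.0, 0.25, 0.8]})

merged = pd.merge(base, expt, how="inner", on="ex_idx",
                  suffixes=("_base", "_expt"))
merged["f1_delta"] = merged["f1_expt"] - merged["f1_base"]
merged["abs_f1_delta"] = merged["f1_delta"].abs()

# Largest gains first; sort ascending instead to surface losses.
wins = merged.sort_values(by="f1_delta", ascending=False)
print(wins)
```

A positive `f1_delta` means the expt run did better on that example, and a negative one means the base run did.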

In [12]:
# Keep examples with at least 3 true positives in expt, ranked by largest F1 gain
mdf[mdf['tp_count_expt'] >= 3].sort_values(by="f1_delta", ascending=False).head(10)

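| | ex_idx | fn_count_base | fp_count_base | tn_count_base | tp_count_base | precision_base | recall_base | f1_base | fn_count_expt | fp_count_expt | tn_count_expt | tp_count_expt | precision_expt | recall_expt | f1_expt | f1_delta | abs_f1_delta |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1485 | 1485 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2496 | 2496 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4167 | 4167 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2298 | 2298 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 658 | 658 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1302 | 1302 | 6 | 6 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 6 | 6 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3862 | 3862 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 2 | 3 | 0.75 | 1.0 | 0.857143 | 0.857143 | 0.857143 |
| 4570 | 4570 | 5 | 5 | 1 | 1 | 0.166667 | 0.166667 | 0.166667 | 0 | 0 | 6 | 6 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.833333 |
| 854 | 854 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 |
| 2308 | 2308 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 |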


In [13]:
# Find short examples (under 15 whitespace-separated tokens) with at least 3 true positives in expt
edf = p0.example_df.copy()
edf['num_tokens'] = edf['text'].map(lambda s: len(s.split()))
edf = edf[['idx', 'num_tokens']]
fdf = mdf.merge(edf, left_on='ex_idx', right_on="idx")

mask = (fdf['num_tokens'] < 15) & (fdf['tp_count_expt'] >= 3)
fdf[mask].sort_values(by="f1_delta", ascending=False).head(30)


| | ex_idx | fn_count_base | fp_count_base | tn_count_base | tp_count_base | precision_base | recall_base | f1_base | fn_count_expt | fp_count_expt | tn_count_expt | tp_count_expt | precision_expt | recall_expt | f1_expt | f1_delta | abs_f1_delta | idx | num_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1485 | 1485 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1485 | 12 |
| 658 | 658 | 3 | 3 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 658 | 14 |
| 1868 | 1868 | 4 | 4 | 2 | 2 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 6 | 6 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 1868 | 14 |
| 2562 | 2562 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 2562 | 14 |
| 2308 | 2308 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 2308 | 11 |
| 2303 | 2303 | 4 | 4 | 2 | 2 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 6 | 6 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 2303 | 14 |
| 1817 | 1817 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 1817 | 10 |
| 2138 | 2138 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 2138 | 13 |
| 2039 | 2039 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 2039 | 13 |
| 1284 | 1284 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0 | 0 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.666667 | 0.666667 | 1284 | 13 |
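
The length filter used in this cell can be sketched on its own as follows. The token count is a plain whitespace split, so for text that appears to be whitespace-joined wordpieces (such as `help ##ers` in the example printed at the end of this notebook) it counts wordpieces rather than words. The first sentence is taken from that example; the second sentence, the score values, and the `examples`, `scores`, `joined`, and `short` names are hypothetical.

```python
import pandas as pd

examples = pd.DataFrame({
    "idx": [0, 1],
    "text": [
        "the soldier was a religious man , one of his close help ##ers .",
        "a much longer sentence " * 5,  # made-up filler, 20 tokens
    ],
})
# Whitespace tokenization: "help ##ers" counts as two tokens.
examples["num_tokens"] = examples["text"].map(lambda s: len(s.split()))

# Toy scores (assumed values) for the same two examples.
scores = pd.DataFrame({
    "ex_idx": [0, 1],
    "tp_count_expt": [3, 4],
    "f1_delta": [0.5, 1.0],
})

joined = scores.merge(examples[["idx", "num_tokens"]],
                      left_on="ex_idx", right_on="idx")
mask = (joined["num_tokens"] < 15) & (joined["tp_count_expt"] >= 3)
short = joined[mask].sort_values(by="f1_delta", ascending=False)
print(short)
```

Only the first example passes both conditions; the second has the larger gain but is too long.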


In [14]:
# Find short examples where expt still scores poorly and F1 barely changes from base
edf = p0.example_df.copy()
edf['num_tokens'] = edf['text'].map(lambda s: len(s.split()))
edf = edf[['idx', 'num_tokens']]
fdf = mdf.merge(edf, left_on='ex_idx', right_on="idx")

mask = (fdf['num_tokens'] < 20) 
mask &= (fdf['tp_count_expt'] <= 2)
# mask &= fdf["tp_count_base"] <= 1
mask &= fdf['f1_expt'] <= 0.5
fdf[mask].sort_values(by="abs_f1_delta", ascending=True).head(30)


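| | ex_idx | fn_count_base | fp_count_base | tn_count_base | tp_count_base | precision_base | recall_base | f1_base | fn_count_expt | fp_count_expt | tn_count_expt | tp_count_expt | precision_expt | recall_expt | f1_expt | f1_delta | abs_f1_delta | idx | num_tokens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 24 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 24 | 9 |
| 3476 | 3476 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3476 | 19 |
| 3207 | 3207 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3207 | 14 |
| 2881 | 2881 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2881 | 4 |
| 2730 | 2730 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2730 | 13 |
| 2692 | 2692 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2692 | 11 |
| 2113 | 2113 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2113 | 14 |
| 1870 | 1870 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 2 | 2 | 1 | 1 | 0.333333 | 0.333333 | 0.333333 | 0.0 | 0.0 | 1870 | 18 |
| 2536 | 2536 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2536 | 12 |
| 1734 | 1734 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1734 | 12 |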


### Plot the diffs on these examples

The `ex_idx` column printed in each of the above cells is the index of that example in the lists of raw predictions. Enter it in the cell below to show the predictions (and ground truth) on base and expt for that example.

By default, it will print all predicted classes with `p(class) > 0.5`, and give their predicted probabilities in parentheses.

In [15]:
# ex_idx = 1868 # "And their important leaders ..."
# ex_idx = 1767 # "Jesus looked at the man ..."
# ex_idx = 2305 # "He would not stop to help him either ."
# ex_idx = 3476 # "It wasn't until Lee Teng - Hui ..."
# ex_idx = 4031 # only one local ringer
# ex_idx = 1734 # book of life
# ex_idx = 3066 # martha stewart
# ex_idx = 3866 # among its new customers
# ex_idx = 3651 # Tien stressed that the president ...
ex_idx = 2113
print(analysis.EdgeProbingExample(r0[ex_idx], label_vocab=p0.all_labels))
print(analysis.EdgeProbingExample(r1[ex_idx], label_vocab=p1.all_labels))

Text (14): the soldier was a religious man , one of his close help ##ers .

  span1: [ 0, 2)	"the soldier"
  span2: [ 9,10)	"his"
  label: (1)		 0
  pred:  		 1 (0.92)

Text (14): the soldier was a religious man , one of his close help ##ers .

  span1: [ 0, 2)	"the soldier"
  span2: [ 9,10)	"his"
  label: (1)		 0
  pred:  		 1 (0.88)
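
The `pred:` lines above list each class whose probability exceeds the threshold, followed by that probability in parentheses. The sketch below illustrates that thresholding rule only. It is not the implementation inside `analysis.EdgeProbingExample`; `format_preds` and the probability of 0.08 for class `0` are hypothetical.

```python
def format_preds(probs, labels, threshold=0.5):
    """List every label whose probability exceeds threshold, as 'label (p)'."""
    return [
        f"{label} ({p:.2f})"
        for label, p in zip(labels, probs)
        if p > threshold
    ]


# Two-class label vocabulary, as in the coreference output above.
labels = ["0", "1"]
print(format_preds([0.08, 0.92], labels))  # prints ['1 (0.92)']
```

Because the comparison is strict, a class with a probability of exactly 0.5 would not be listed under this rule.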

