# Edge Probing Predictions Sandbox

Use this notebook as a starting point for data analysis on Edge Probing predictions. The code below (from `probing/analysis.py`) loads predictions from a run, does some pre-processing for convenience, and exposes two main DataFrames for analysis: `example_df` and `target_df`.

We load the data into Pandas so it's easier to filter by various fields, and to select particular columns of interest (such as `labels.khot` and `preds.proba` for computing metrics). For an introduction to Pandas, see here: https://pandas.pydata.org/pandas-docs/stable/10min.html 

In [1]:
import sys, os, re, json
import itertools
import collections
from importlib import reload
import pandas as pd
import numpy as np
from sklearn import metrics

In [2]:
import analysis
reload(analysis)

run_dir = "/nfs/jsalt/home/iftenney/exp/edges-20180725/elmo-full-edges-spr2/run"
preds = analysis.Predictions.from_run(run_dir, 'edges-spr2', 'test')
print("Number of examples: %d" % len(preds.example_df))
print("Number of targets:  %d" % len(preds.target_df))

Number of examples: 276
Number of targets:  582


### Top-level example info

`preds.example_df` contains information on the top-level examples. Mostly, this just stores the input text and any metadata fields that were present in the original data. This is useful if you want to link the targets back to the text, but you shouldn't need it to compute most metrics.

In [3]:
preds.example_df.head()

idx,idx,info.grammatical,info.sent-id,info.sent_id,info.source,info.split,preds.proba,text
0,0,5.0,1008,1008,SPR2,test,"[[0.9560839533805847, 0.06530793756246567, 0.0...","In a timid voice , he says : &quot; If an airp..."
1,1,5.0,1009,1009,SPR2,test,"[[0.8448460102081299, 0.16005221009254456, 0.0...",&quot; Wonderful ! &quot; Winston beams .
2,2,5.0,1017,1017,SPR2,test,"[[0.9815192222595215, 0.02376113459467888, 0.0...",&quot; Our new lunar transportation system uti...
3,3,2.0,1023,1023,SPR2,test,"[[0.9837549328804016, 0.10073678940534592, 0.0...",They want to use LTS to tie into NASA &apos; s...
4,4,5.0,1024,1024,SPR2,test,"[[0.9833780527114868, 0.02780323289334774, 0.0...",&quot; We are so excited that the White House ...


### Target info and predictions

`preds.target_df` contains the per-target input fields (`span1`, `span2`, and `label`) as well as any metadata associated with individual targets. The `idx` column references a row in `example_df` that this target belongs to, if you need to recover the original text.
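
As a rough illustration of linking targets back to their text, the sketch below joins a toy target frame to a toy example frame on `idx` and recovers the span tokens. The two frames are hypothetical stand-ins for `preds.example_df` and `preds.target_df`, `span_text` is a helper invented here, and the reading of spans as end-exclusive offsets over whitespace-separated tokens is an assumption inferred from the sample rows in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for preds.example_df and preds.target_df.
example_df = pd.DataFrame({
    "idx": [0, 1],
    "text": ["In a timid voice , he says : hello", "Wonderful ! Winston beams ."],
})
target_df = pd.DataFrame({
    "idx": [0, 1],
    "span1": [[6, 7], [3, 4]],
    "span2": [[5, 6], [2, 3]],
})

def span_text(text, span):
    """Return the tokens covered by an end-exclusive [start, end) span.

    Assumes whitespace tokenization, which is a guess about the data format.
    """
    start, end = span
    return " ".join(text.split()[start:end])

# Attach the example text to every target row via the shared idx column.
merged = target_df.merge(example_df[["idx", "text"]], on="idx", how="left")
merged["span1_text"] = [span_text(t, s) for t, s in zip(merged["text"], merged["span1"])]
merged["span2_text"] = [span_text(t, s) for t, s in zip(merged["text"], merged["span2"])]
print(merged[["idx", "span1_text", "span2_text"]])
```

A left join keeps every target row even if its example is missing, which makes gaps easy to spot as `NaN` text.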

The loader code does some preprocessing for convenience. In particular, we add a `label.ids` column which maps the list-of-string `label` column into a list of integer ids for these targets, as well as `label.khot` which contains a K-hot encoding of these ids. 

Each entry in `label.khot` should align to the corresponding entry in `preds.proba`, which contains the model's predicted probabilities $\hat{y} \in [0,1]$ for each class.
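
To make the alignment concrete, here is a minimal sketch of how a K-hot vector can be built from a list of label ids and compared position by position with a probability vector. The helper `to_khot` and the probability values are hypothetical; the loader in `analysis.py` may build these columns differently.

```python
import numpy as np

def to_khot(ids, num_labels):
    """Return a length-num_labels 0/1 vector with ones at the given ids."""
    khot = np.zeros(num_labels, dtype=np.int32)
    khot[list(ids)] = 1
    return khot

# Label ids of the first target shown in this notebook; the task has 20 labels.
label_ids = [0, 6, 7, 8, 10, 15, 17, 19]
khot = to_khot(label_ids, num_labels=20)

# Hypothetical probabilities, one per class, in the same order as khot.
proba = np.linspace(0.05, 0.95, num=20)
pred = (proba >= 0.5).astype(np.int32)

# Position i of khot and position i of pred refer to the same class.
assert khot.shape == pred.shape
print(khot.tolist())
print("positions where prediction matches label:", int((khot == pred).sum()))
```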

For specific analysis, it might be easier to work with the wide and long forms of this DataFrame - see cells below.

In [4]:
preds.target_df.head()

,idx,info.is_pilot,info.pred_lemma,info.span1_text,info.span2_txt,label,preds.proba,span1,span2,label.ids,label.khot
0,0,False,say,says,he,"[awareness, existed_after, existed_before, exi...","[0.9560839533805847, 0.06530793756246567, 0.00...","[6, 7]","[5, 6]","[0, 6, 7, 8, 10, 15, 17, 19]","[1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."
1,0,False,carry,carrying,winston peters,"[awareness, change_of_location, change_of_stat...","[0.8325857520103455, 0.8400908708572388, 0.158...","[12, 13]","[13, 15]","[0, 1, 4, 6, 7, 8, 10, 15, 17, 18, 19]","[1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."
2,0,False,blow,blown,an airplane carrying winston peters,"[change_of_location, change_of_state, existed_...","[0.21247316896915436, 0.7873210310935974, 0.15...","[16, 17]","[10, 15]","[1, 3, 7, 8]","[0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ..."
3,1,False,beam,beams,winston,"[awareness, change_of_state_continuous, existe...","[0.8448460102081299, 0.16005221009254456, 0.02...","[5, 6]","[4, 5]","[0, 4, 6, 7, 8, 10, 13, 15, 17, 18, 19]","[1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, ..."
4,2,False,tell,told,kistler,"[awareness, existed_after, existed_before, exi...","[0.9815192222595215, 0.02376113459467888, 0.01...","[30, 31]","[29, 30]","[0, 6, 7, 8, 10, 15, 17, 19]","[1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."


### Wide and Long Data

For background on these views, see https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data

Here's a "wide" version of the data, with the usual metadata plus `2 * num_labels` columns: `label.true.<label_name>` and `preds.proba.<label_name>` for each target class.

In [5]:
preds.target_df_wide.head()

Generating wide-form target DataFrame. May be slow...


,idx,info.is_pilot,info.pred_lemma,info.span1_text,info.span2_txt,span1,span2,label.true.awareness,label.true.change_of_location,label.true.change_of_possession,...,preds.proba.instigation,preds.proba.location_of_event,preds.proba.makes_physical_contact,preds.proba.partitive,preds.proba.predicate_changed_argument,preds.proba.sentient,preds.proba.stationary,preds.proba.volition,preds.proba.was_for_benefit,preds.proba.was_used
0,0,False,say,says,he,"[6, 7]","[5, 6]",1,0,0,...,0.93704,0.003339,0.003404,0.334562,0.002472,0.970853,0.003125,0.952037,0.173383,0.947076
1,0,False,carry,carrying,winston peters,"[12, 13]","[13, 15]",1,1,0,...,0.695675,0.002821,0.00418,0.141632,0.002505,0.734775,0.002232,0.594617,0.231695,0.937795
2,0,False,blow,blown,an airplane carrying winston peters,"[16, 17]","[10, 15]",0,1,0,...,0.215082,0.003516,0.00576,0.194907,0.00254,0.184872,0.002241,0.036405,0.105332,0.814549
3,1,False,beam,beams,winston,"[5, 6]","[4, 5]",1,0,0,...,0.937726,0.003944,0.004915,0.20567,0.00305,0.865843,0.002972,0.82597,0.469503,0.964462
4,2,False,tell,told,kistler,"[30, 31]","[29, 30]",1,0,0,...,0.907287,0.005821,0.005082,0.348461,0.003519,0.956151,0.003972,0.974265,0.537767,0.951819
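
For intuition about what the wide form contains, the sketch below expands two list-valued columns into one column per label. It is only one possible way to get this layout, not necessarily how `target_df_wide` is generated; the label set, the toy frame, and the `expand` helper are all hypothetical.

```python
import pandas as pd

# Hypothetical label set and target frame; the real ones come from preds.
all_labels = ["awareness", "sentient", "volition"]
target_df = pd.DataFrame({
    "idx": [0, 0],
    "label.khot": [[1, 0, 1], [0, 1, 0]],
    "preds.proba": [[0.9, 0.2, 0.7], [0.1, 0.6, 0.4]],
})

def expand(df, column, prefix, labels):
    """Turn a list-valued column into one column per label."""
    wide = pd.DataFrame(df[column].tolist(), index=df.index)
    wide.columns = [prefix + name for name in labels]
    return wide

# Metadata columns first, then 2 * num_labels label/prediction columns.
wide_df = pd.concat(
    [
        target_df.drop(columns=["label.khot", "preds.proba"]),
        expand(target_df, "label.khot", "label.true.", all_labels),
        expand(target_df, "preds.proba", "preds.proba.", all_labels),
    ],
    axis=1,
)
print(wide_df.columns.tolist())
```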


We can fairly easily compute per-label metrics from the wide form, by selecting the appropriate pair of columns:

In [6]:
wide_df = preds.target_df_wide
scores_by_label = {}
for label in preds.all_labels:
    y_true = wide_df['label.true.' + label]
    y_pred = wide_df['preds.proba.' + label] >= 0.5
    score = metrics.f1_score(y_true=y_true, y_pred=y_pred)
    scores_by_label[label] = score
scores = pd.Series(scores_by_label)
print(scores)
print("Macro average F1: %.04f" % scores.mean())

awareness                     0.908507
change_of_location            0.254237
change_of_possession          0.048780
change_of_state               0.273224
change_of_state_continuous    0.515284
changes_possession            0.000000
existed_after                 0.955736
existed_before                0.918866
existed_during                0.987826
exists_as_physical            0.000000
instigation                   0.803185
location_of_event             0.000000
makes_physical_contact        0.000000
partitive                     0.068966
predicate_changed_argument    0.000000
sentient                      0.900840
stationary                    0.000000
volition                      0.857658
was_for_benefit               0.570136
was_used                      0.925069
dtype: float64
Macro average F1: 0.4494
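
Several labels above score exactly 0. One way this can happen is that the model never predicts a label at the 0.5 threshold, so there are no true positives. The sketch below computes per-label F1 by hand on hypothetical data to show the effect on the macro average; `f1_binary` is a helper written for this illustration, and returning 0.0 for an empty denominator is a convention chosen here.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is 0."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0

# Hypothetical data: a frequent label and a rare label that is never predicted.
y_true = {"frequent": [1, 1, 1, 0], "rare": [0, 0, 1, 0]}
proba = {"frequent": [0.9, 0.8, 0.4, 0.1], "rare": [0.1, 0.2, 0.3, 0.1]}

scores = {
    label: f1_binary(y_true[label], np.asarray(proba[label]) >= 0.5)
    for label in y_true
}
# The macro average weights every label equally, so the rare label
# pulls the mean down as much as the frequent label pulls it up.
macro_f1 = float(np.mean(list(scores.values())))
print(scores, macro_f1)
```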




And here's a "long" version of the same, with a single `label` column, and one column each for `label.true` and `preds.proba` for that label:

In [7]:
preds.target_df_long.head()

Generating long-form target DataFrame. May be slow...


,idx,info.is_pilot,info.pred_lemma,info.span1_text,info.span2_txt,label,label.true,preds.proba,span1,span2
0,0,False,say,says,he,awareness,1,0.956084,"[6, 7]","[5, 6]"
1,0,False,say,says,he,partitive,0,0.334562,"[6, 7]","[5, 6]"
2,0,False,carry,carrying,winston peters,partitive,0,0.141632,"[12, 13]","[13, 15]"
3,0,False,blow,blown,an airplane carrying winston peters,partitive,0,0.194907,"[16, 17]","[10, 15]"
4,0,False,blow,blown,an airplane carrying winston peters,was_used,0,0.814549,"[16, 17]","[10, 15]"
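
The relationship between the two layouts can be sketched with `DataFrame.melt`: each `label.true.<label_name>` / `preds.proba.<label_name>` pair in the wide form becomes one row in the long form. This is an illustrative reconstruction on a hypothetical frame, not necessarily how `target_df_long` is built; `melt_prefix` is a helper invented for this example.

```python
import pandas as pd

# Hypothetical wide-form frame with two labels.
labels = ["awareness", "sentient"]
wide_df = pd.DataFrame({
    "idx": [0, 1],
    "label.true.awareness": [1, 0],
    "label.true.sentient": [0, 1],
    "preds.proba.awareness": [0.9, 0.2],
    "preds.proba.sentient": [0.3, 0.8],
})

def melt_prefix(df, prefix, value_name, labels):
    """Melt all '<prefix><label>' columns into (label, value_name) rows."""
    cols = [prefix + name for name in labels]
    # reset_index() keeps the target row number as an 'index' column,
    # so rows from the two melts can be paired up again afterwards.
    part = df.reset_index().melt(
        id_vars=["index", "idx"], value_vars=cols,
        var_name="label", value_name=value_name,
    )
    part["label"] = part["label"].str[len(prefix):]
    return part

long_df = melt_prefix(wide_df, "label.true.", "label.true", labels).merge(
    melt_prefix(wide_df, "preds.proba.", "preds.proba", labels),
    on=["index", "idx", "label"],
)
print(long_df)
```

The result has `num_targets * num_labels` rows, one per (target, label) pair.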


We can easily get the set of labels available here:

In [8]:
preds.target_df_long.label.unique()

array(['awareness', 'partitive', 'was_used', 'predicate_changed_argument',
       'change_of_possession', 'sentient', 'stationary',
       'change_of_location', 'volition', 'was_for_benefit',
       'existed_after', 'change_of_state', 'existed_before',
       'existed_during', 'changes_possession', 'exists_as_physical',
       'instigation', 'change_of_state_continuous', 'location_of_event',
       'makes_physical_contact'], dtype=object)

We can also compute micro-averaged metrics by thresholding `preds.proba` at 0.5 and comparing it against `label.true` across all rows, since each row is a single (target, label) pair:

In [9]:
from sklearn import metrics
long_df = preds.target_df_long
metrics.f1_score(y_true=long_df['label.true'], y_pred=(long_df['preds.proba'] >= 0.5))

0.8322035913901773
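
The micro-averaged F1 is much higher than the macro average of 0.4494 computed earlier because it pools every (target, label) decision before scoring, so frequent labels dominate. The sketch below shows both averages on hypothetical data using plain NumPy; the arrays and the helpers `counts` and `f1_from_counts` are invented for illustration.

```python
import numpy as np

# Hypothetical K-hot labels and probabilities: 3 targets x 2 labels.
khot = np.array([[1, 0], [1, 0], [1, 1]])
proba = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]])
pred = proba >= 0.5

def counts(y_true, y_pred):
    """Return (TP, FP, FN) for binary arrays of equal shape."""
    y_true = y_true.astype(bool)
    return (int(np.sum(y_true & y_pred)),
            int(np.sum(~y_true & y_pred)),
            int(np.sum(y_true & ~y_pred)))

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), taken to be 0.0 if the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Micro: flatten everything, then score once (the long-form computation).
micro_f1 = f1_from_counts(*counts(khot.ravel(), pred.ravel()))

# Macro: score each label column separately, then average the scores.
macro_f1 = float(np.mean([
    f1_from_counts(*counts(khot[:, j], pred[:, j]))
    for j in range(khot.shape[1])
]))
print("micro: %.4f  macro: %.4f" % (micro_f1, macro_f1))
```

Here the second label is never predicted, so it contributes an F1 of 0 to the macro average but only a single false negative to the pooled micro counts.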