# Edge Probing Predictions Sandbox

Use this notebook as a starting point for #datascience on Edge Probing predictions. The code below (from `probing/analysis.py`) will load predictions from a run, do some pre-processing for convenience, and expose two DataFrames for analysis.

We load the data into Pandas so it's easier to filter by various fields, and to select particular columns of interest (such as `labels.khot` and `preds.proba` for computing metrics). For an introduction to Pandas, see here: https://pandas.pydata.org/pandas-docs/stable/10min.html 

In [1]:
import sys, os, re, json
import itertools
import collections
from importlib import reload
import pandas as pd
import numpy as np
from sklearn import metrics

The latest runs are here:

In [2]:
ls /nfs/jsalt/home/iftenney/exp/edges-20180913/

[0m[01;34mcove-edges-constituent-ontonotes[0m/
[01;34mcove-edges-coref-ontonotes-conll[0m/
[01;34mcove-edges-dep-labeling-ewt[0m/
[01;34mcove-edges-dpr[0m/
[01;34mcove-edges-ner-ontonotes[0m/
[01;34mcove-edges-spr1[0m/
[01;34mcove-edges-spr2[0m/
[01;34mcove-edges-srl-conll2012[0m/
[01;34melmo-chars-edges-constituent-ontonotes[0m/
[01;34melmo-chars-edges-coref-ontonotes-conll[0m/
[01;34melmo-chars-edges-dep-labeling-ewt[0m/
[01;34melmo-chars-edges-dpr[0m/
[01;34melmo-chars-edges-ner-ontonotes[0m/
[01;34melmo-chars-edges-spr1[0m/
[01;34melmo-chars-edges-spr2[0m/
[01;34melmo-chars-edges-srl-conll2012[0m/
[01;34melmo-full-edges-constituent-ontonotes[0m/
[01;34melmo-full-edges-coref-ontonotes-conll[0m/
[01;34melmo-full-edges-dep-labeling-ewt[0m/
[01;34melmo-full-edges-dpr[0m/
[01;34melmo-full-edges-ner-ontonotes[0m/
[01;34melmo-full-edges-spr1[0m/
[01;34melmo-full-edges-spr2[0m/
[01;34melmo-full-edges-srl-conll2012[0m/

The `elmo-chars` experiments probe the char CNN layer only (lexical baseline), while the `elmo-full` models use full ELMo with learned mixing weights. The run dir for each is just called "run" by default. 

In [11]:
import analysis
reload(analysis)

run_dir = "/nfs/jsalt/home/iftenney/exp/edges-20180913/elmo-full-edges-spr2/run"
preds = analysis.Predictions.from_run(run_dir, 'edges-spr2', 'test')
print("Number of examples: %d" % len(preds.example_df))
print("Number of total targets: %d" % len(preds.target_df))
print("Labels (%d total):" % len(preds.all_labels))
print(preds.all_labels)

Number of examples: 276
Number of total targets: 582
Labels (20 total):
['awareness', 'change_of_location', 'change_of_possession', 'change_of_state', 'change_of_state_continuous', 'changes_possession', 'existed_after', 'existed_before', 'existed_during', 'exists_as_physical', 'instigation', 'location_of_event', 'makes_physical_contact', 'partitive', 'predicate_changed_argument', 'sentient', 'stationary', 'volition', 'was_for_benefit', 'was_used']


### Top-level example info

`preds.example_df` contains information on the top-level examples. Mostly, this just stores the input text and any metadata fields that were present in the original data. This is useful if you want to link the targets back to the text, but you shouldn't need it to compute most metrics.

In [12]:
preds.example_df.head()

idx,idx,info.grammatical,info.sent-id,info.sent_id,info.source,info.split,text
0,0,5.0,1008,1008,SPR2,test,"In a timid voice , he says : &quot; If an airp..."
1,1,5.0,1009,1009,SPR2,test,&quot; Wonderful ! &quot; Winston beams .
2,2,5.0,1017,1017,SPR2,test,&quot; Our new lunar transportation system uti...
3,3,2.0,1023,1023,SPR2,test,They want to use LTS to tie into NASA &apos; s...
4,4,5.0,1024,1024,SPR2,test,&quot; We are so excited that the White House ...


### Target info and predictions

`preds.target_df` contains the per-target input fields (`span1`, `span2`, and `label`) as well as any metadata associated with individual targets. The `idx` column identifies the row in `example_df` that each target belongs to, in case you need to recover the original text.
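
As a minimal sketch of linking targets back to their text, the cell below joins a toy `target_df` to a toy `example_df` on `idx` and slices out the span tokens. It assumes that spans are half-open `(start, end)` token offsets into the whitespace-split `text`; that convention is an assumption here, not something the loader documents in this notebook. The two frames and the `span_text` helper are illustrative stand-ins and are not part of `analysis.py`.

```python
import pandas as pd

# Toy stand-ins for preds.example_df and preds.target_df (illustrative only).
example_df = pd.DataFrame({
    "idx": [1],
    "text": ["&quot; Wonderful ! &quot; Winston beams ."],
})
target_df = pd.DataFrame({
    "idx": [1],
    "span1": [(5, 6)],
    "span2": [(4, 5)],
})

def span_text(text, span):
    """Return the tokens covered by a half-open (start, end) span (assumed convention)."""
    start, end = span
    return " ".join(text.split()[start:end])

# Attach the example text to each target via the shared idx column.
joined = target_df.merge(example_df[["idx", "text"]], on="idx", how="left")
joined["span1_text"] = [span_text(t, s) for t, s in zip(joined["text"], joined["span1"])]
joined["span2_text"] = [span_text(t, s) for t, s in zip(joined["text"], joined["span2"])]
print(joined[["idx", "span1_text", "span2_text"]])
```

If the real `example_df` uses `idx` as both its index name and a column, pandas may reject the merge as ambiguous; calling `reset_index(drop=True)` on it first is one way around that.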

The loader code does some preprocessing for convenience. In particular, we add a `label.ids` column which maps the list-of-string `label` column into a list of integer ids for these targets, as well as `label.khot` which contains a K-hot encoding of these ids. 
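
To make the encoding concrete, here is a rough sketch of how a list of label strings could be mapped to ids and then to a K-hot vector. The `to_khot` helper and the four-label vocabulary are hypothetical and chosen for brevity; the actual loader may build these columns differently.

```python
import numpy as np

# Toy label vocabulary; the real one comes from preds.all_labels.
all_labels = ["awareness", "existed_after", "sentient", "volition"]
label_to_id = {name: i for i, name in enumerate(all_labels)}

def to_khot(label_names, num_labels):
    """Map label strings to sorted integer ids and a K-hot vector (hypothetical helper)."""
    ids = sorted(label_to_id[name] for name in label_names)
    khot = np.zeros(num_labels, dtype=np.int32)
    khot[ids] = 1  # set a 1 at every position whose label is present
    return ids, khot

ids, khot = to_khot(["awareness", "sentient"], len(all_labels))
print(ids)
print(khot)
```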

Each entry in `label.khot` should align to the corresponding entry in `preds.proba`, which contains the model's predicted probabilities $\hat{y} \in [0,1]$ for each class.

For some analyses, it may be easier to work with the wide or long form of this DataFrame; see the cells below.

In [13]:
preds.target_df.head()

Unnamed: 0,idx,info.is_pilot,info.pred_lemma,info.span1_text,info.span2_txt,label,preds.proba,span1,span2,label.ids,label.khot
0,0,False,say,says,he,"[awareness, existed_after, existed_before, exi...","[0.9507238268852234, 0.08021300286054611, 0.00...","(6, 7)","(5, 6)","[0, 6, 7, 8, 10, 15, 17, 19]","[1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."
1,0,False,carry,carrying,winston peters,"[awareness, change_of_location, change_of_stat...","[0.8147344589233398, 0.8972967863082886, 0.146...","(12, 13)","(13, 15)","[0, 1, 4, 6, 7, 8, 10, 15, 17, 18, 19]","[1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."
2,0,False,blow,blown,an airplane carrying winston peters,"[change_of_location, change_of_state, existed_...","[0.20997169613838196, 0.7638567686080933, 0.11...","(16, 17)","(10, 15)","[1, 3, 7, 8]","[0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ..."
3,1,False,beam,beams,winston,"[awareness, change_of_state_continuous, existe...","[0.5660699605941772, 0.15035615861415863, 0.03...","(5, 6)","(4, 5)","[0, 4, 6, 7, 8, 10, 13, 15, 17, 18, 19]","[1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, ..."
4,2,False,tell,told,kistler,"[awareness, existed_after, existed_before, exi...","[0.9896626472473145, 0.022328440099954605, 0.0...","(30, 31)","(29, 30)","[0, 6, 7, 8, 10, 15, 17, 19]","[1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, ..."
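
Because `label.khot` and `preds.proba` are aligned position by position, metrics can also be computed straight from these list columns. The sketch below stacks both columns of a toy frame into `[num_targets, num_labels]` arrays and computes a micro-averaged F1 at a 0.5 threshold. The values are made up, and it assumes each cell holds a plain Python list.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

# Toy stand-in for preds.target_df with three labels (values are made up).
target_df = pd.DataFrame({
    "label.khot": [[1, 0, 1], [0, 1, 0], [1, 1, 0]],
    "preds.proba": [[0.9, 0.2, 0.4], [0.1, 0.8, 0.3], [0.7, 0.6, 0.2]],
})

# Stack the per-target lists into [num_targets, num_labels] arrays.
y_true = np.array(target_df["label.khot"].tolist())
y_prob = np.array(target_df["preds.proba"].tolist())
assert y_true.shape == y_prob.shape

# Threshold the probabilities and score all (target, label) pairs together.
y_pred = (y_prob >= 0.5).astype(int)
micro_f1 = metrics.f1_score(y_true, y_pred, average="micro")
print("Micro F1: %.04f" % micro_f1)
```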


### Wide and Long Data

For background on these views, see https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data

Here's a "wide" version of the data, with the usual metadata plus `2 * num_labels` columns: `label.true.<label_name>` and `preds.proba.<label_name>` for each target class.

In [14]:
preds.target_df_wide.head()

Unnamed: 0,idx,info.is_pilot,info.pred_lemma,info.span1_text,info.span2_txt,span1,span2,label.true.awareness,label.true.change_of_location,label.true.change_of_possession,...,preds.proba.instigation,preds.proba.location_of_event,preds.proba.makes_physical_contact,preds.proba.partitive,preds.proba.predicate_changed_argument,preds.proba.sentient,preds.proba.stationary,preds.proba.volition,preds.proba.was_for_benefit,preds.proba.was_used
0,0,False,say,says,he,"(6, 7)","(5, 6)",1,0,0,...,0.93686,0.004401,0.003261,0.254228,0.002805,0.97576,0.003733,0.938958,0.143198,0.945751
1,0,False,carry,carrying,winston peters,"(12, 13)","(13, 15)",1,1,0,...,0.742384,0.006977,0.006286,0.07898,0.006507,0.667234,0.004935,0.652489,0.387525,0.932438
2,0,False,blow,blown,an airplane carrying winston peters,"(16, 17)","(10, 15)",0,1,0,...,0.173488,0.009159,0.013924,0.277387,0.00445,0.194129,0.004115,0.029275,0.124122,0.724787
3,1,False,beam,beams,winston,"(5, 6)","(4, 5)",1,0,0,...,0.918481,0.009543,0.007555,0.120318,0.00782,0.808103,0.007416,0.578732,0.544744,0.919867
4,2,False,tell,told,kistler,"(30, 31)","(29, 30)",1,0,0,...,0.92249,0.015724,0.011915,0.3147,0.009969,0.963026,0.009797,0.985373,0.749102,0.960411


We can fairly easily compute per-label metrics from the wide form, by selecting the appropriate pair of columns:

In [15]:
wide_df = preds.target_df_wide
scores_by_label = {}
for label in preds.all_labels:
    y_true = wide_df['label.true.' + label]
    y_pred = wide_df['preds.proba.' + label] >= 0.5
    score = metrics.f1_score(y_true=y_true, y_pred=y_pred)
    scores_by_label[label] = score
scores = pd.Series(scores_by_label)
print(scores)
print("Macro average F1: %.04f" % scores.mean())

awareness                     0.897436
change_of_location            0.251969
change_of_possession          0.048780
change_of_state               0.387097
change_of_state_continuous    0.595890
changes_possession            0.000000
existed_after                 0.951686
existed_before                0.919081
existed_during                0.987826
exists_as_physical            0.000000
instigation                   0.806565
location_of_event             0.000000
makes_physical_contact        0.000000
partitive                     0.055944
predicate_changed_argument    0.000000
sentient                      0.888519
stationary                    0.000000
volition                      0.845735
was_for_benefit               0.640316
was_used                      0.917910
dtype: float64
Macro average F1: 0.4597
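
Several labels above score exactly 0.0. F1 is zero whenever a label has no true positives, which includes the case where the model never predicts that label at the 0.5 threshold. One way to check is to tabulate positive counts next to F1, sketched here on toy wide-form data. The frame and its values are illustrative, and the `zero_division` argument assumes scikit-learn 0.22 or later.

```python
import pandas as pd
from sklearn import metrics

# Toy wide-form frame with two labels (values are made up).
wide_df = pd.DataFrame({
    "label.true.awareness": [1, 1, 0, 1],
    "preds.proba.awareness": [0.9, 0.8, 0.3, 0.4],
    "label.true.stationary": [0, 1, 0, 0],
    "preds.proba.stationary": [0.1, 0.2, 0.05, 0.1],
})
labels = ["awareness", "stationary"]

rows = []
for label in labels:
    y_true = wide_df["label.true." + label]
    y_pred = (wide_df["preds.proba." + label] >= 0.5).astype(int)
    rows.append({
        "label": label,
        "n_true": int(y_true.sum()),  # gold positives
        "n_pred": int(y_pred.sum()),  # predicted positives at the 0.5 threshold
        "f1": metrics.f1_score(y_true, y_pred, zero_division=0),
    })
summary = pd.DataFrame(rows).set_index("label")
print(summary)
```

A label with `n_pred` equal to zero can never score above 0.0, regardless of how many gold positives it has.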




And here's a "long" version of the same, with a single `label` column, and one column each for `label.true` and `preds.proba` for that label:

In [16]:
preds.target_df_long.head()

Unnamed: 0,idx,label,label.true,preds.proba
0,0,awareness,1,0.950724
1,0,change_of_location,0,0.080213
2,0,change_of_possession,0,0.007079
3,0,change_of_state,0,0.093276
4,0,change_of_state_continuous,0,0.160939
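
To show how the two views relate, the sketch below reshapes a toy wide-form frame into the long form by emitting one block of rows per label and interleaving them by target. This is only one possible construction and is not necessarily how `analysis.py` builds `target_df_long`; the frame and its values are made up.

```python
import pandas as pd

# Toy wide-form frame with two targets and two labels (values are made up).
wide = pd.DataFrame({
    "idx": [0, 0],
    "label.true.awareness": [1, 0],
    "label.true.volition": [1, 1],
    "preds.proba.awareness": [0.9, 0.2],
    "preds.proba.volition": [0.7, 0.4],
})
labels = ["awareness", "volition"]

# Build one frame per label, keeping the original row index of each target.
frames = []
for label in labels:
    frames.append(pd.DataFrame({
        "idx": wide["idx"],
        "label": label,
        "label.true": wide["label.true." + label],
        "preds.proba": wide["preds.proba." + label],
    }))

# A stable sort on the target index groups each target's labels together.
long = pd.concat(frames).sort_index(kind="mergesort").reset_index(drop=True)
print(long)
```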


We can easily get the set of labels available here:

In [17]:
preds.target_df_long.label.unique()

array(['awareness', 'change_of_location', 'change_of_possession',
       'change_of_state', 'change_of_state_continuous',
       'changes_possession', 'existed_after', 'existed_before',
       'existed_during', 'exists_as_physical', 'instigation',
       'location_of_event', 'makes_physical_contact', 'partitive',
       'predicate_changed_argument', 'sentient', 'stationary', 'volition',
       'was_for_benefit', 'was_used'], dtype=object)

We can also compute micro-averaged metrics by comparing the `label.true` column against thresholded `preds.proba` values:

In [18]:
from sklearn import metrics
long_df = preds.target_df_long
metrics.f1_score(y_true=long_df['label.true'], y_pred=(long_df['preds.proba'] >= 0.5))

0.8297897060532125
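
The long form also supports per-label metrics by grouping on the `label` column. The sketch below does this on a toy frame; the `f1_at_threshold` helper is hypothetical, and the values are made up. On real data this should agree with the wide-form loop, provided both views hold the same values.

```python
import pandas as pd
from sklearn import metrics

# Toy long-form frame with two labels and three targets (values are made up).
long_df = pd.DataFrame({
    "label": ["awareness", "volition"] * 3,
    "label.true": [1, 1, 0, 1, 1, 0],
    "preds.proba": [0.9, 0.7, 0.2, 0.4, 0.6, 0.1],
})

def f1_at_threshold(group, threshold=0.5):
    """Binary F1 for one label's rows at the given threshold (hypothetical helper)."""
    y_pred = (group["preds.proba"] >= threshold).astype(int)
    return metrics.f1_score(group["label.true"], y_pred)

# One F1 score per label, then the unweighted mean across labels.
per_label = pd.Series({
    label: f1_at_threshold(group) for label, group in long_df.groupby("label")
})
print(per_label)
print("Macro average F1: %.04f" % per_label.mean())
```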