This notebook is the scratch paper I used while writing [`evaluate.py`](./evaluate.py). I used this whenever I needed an interactive console to try ideas and view results.

## Load and run `evaluate` module

In [1]:
from importlib import reload # recompile module and re-execute module-level code
import evaluate as ev; reload(ev)

<module 'evaluate' from 'D:\\This PC\\Documents\\Github\\nlp-sandbox\\enron\\evaluate.py'>

In [2]:
%run -i evaluate.py

Importing the model from model.pkl
Loading test data
notes.cleaned.tsv
Featurizing test data
Evaluating

Accuracy: 0.32044560943643513

Confusion matrix:
Layout
[[tn   fp]
 [fn   tp]]

[[ 393 1036]
 [   1   96]]

Classification report:
             precision    recall  f1-score   support

        neg       1.00      0.28      0.43      1429
        pos       0.08      0.99      0.16        97

avg / total       0.94      0.32      0.41      1526


Accuracy (on candidates with 1+ feature): 0.7192691029900332

Confusion matrix:
[[393 168]
 [  1  40]]

Classification report:
             precision    recall  f1-score   support

        neg       1.00      0.70      0.82       561
        pos       0.19      0.98      0.32        41

avg / total       0.94      0.72      0.79       602



**Improvement #1**: restricting the accuracy calculation to candidates with 1+ features (above) shows that the model should learn to be more uncertain about candidates it has zero features for.
   - This could mean not requiring a VERB in order for a candidate to have features. (Although it might be reasonable to scope the problem to tasks of this form first, nail it, then expand the definition.)
   - This could mean excluding 0-feature examples from the training data, with the idea that this will force the model to be more uncertain when it comes across 0-feature candidates at test time. But is this a bad bias to introduce?
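
The two confusion matrices above already show where the errors come from. A quick sketch, using only the numbers printed above (`summarize` is a helper name made up for this note, not something in `evaluate.py`):

```python
import numpy as np

def summarize(cm):
    """Accuracy plus precision/recall for `pos` from a 2x2 matrix laid out [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    return {
        "accuracy": (tn + tp) / (tn + fp + fn + tp),
        "pos_precision": tp / (tp + fp),
        "pos_recall": tp / (tp + fn),
    }

all_cm = np.array([[393, 1036], [1, 96]])   # every candidate
feat_cm = np.array([[393, 168], [1, 40]])   # candidates with 1+ feature

all_candidates = summarize(all_cm)
with_features = summarize(feat_cm)

# The 1+ feature candidates are a subset of all candidates,
# so the difference is the matrix for the zero-feature candidates.
zero_feature_cm = all_cm - feat_cm
print(all_candidates, with_features, zero_feature_cm, sep="\n")
```

The difference comes out as `[[0, 868], [0, 56]]`: all 924 zero-feature candidates were predicted `pos`, and they account for 868 of the 1036 false positives.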

**Improvement #2**: re-train the model using spaCy v2

I decided to upgrade spaCy in the middle of building this notebook (to use the `displacy` visualizer later on), going from `1.9.0` to `2.0.7`. I didn't expect it to affect much until I re-trained my model. Wrong.

**Measure**|**v1.9.0**|**v2.0.7 (eval-only)**
-----|-----|-----
Accuracy|0.286|0.312
Accuracy on 1+ features|0.633|0.698
Confusion Matrix|[[ 356 1086] [ 4 80]]|[[ 393 1049] [ 1 83]]
Confusion Matrix on 1+ features|[[356 216] [ 4 24]]|[[393 181] [ 1 27]]

So I'm excited to see more performance boosts simply from upgrading spaCy!

_On the negative side, featurization is slower._

Other options after upgrade:
- Add [new morphology info](https://spacy.io/api/annotation#pos-tagging) as features (VerbForm, Tense, Aspect)

**TODO**: address label imbalance

## Using: get_sample_predictions

In [3]:
most_incorrect = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_INCORRECT, n=20)
most_incorrect

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
396,ENA Origination Support (Getting info from Tyc...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.973595,0.026405,0.973595
926,wanted only to buy and sell conventional reins...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",pos,0.928333,0.071667,0.928333
1091,"end, after court hearings on the issue, the Dr...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.907394,0.092606,0.907394
1023,the market observations (as opposed to histori...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.881702,0.118298,0.881702
1121,late Tuesday accused 29 Enron officers and dir...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos,0.880221,0.119779,0.880221
777,"John Elder, creator of a grass-roots investmen...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_...",pos,0.879704,0.120296,0.879704
167,"Managing Director, Enterprise Risk",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.877372,0.122628,0.877372
1282,use the,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.873541,0.126459,0.873541
803,2. Decision support system: a model recommendi...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.869085,0.130915,0.869085
196,Published: June 2000,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.860724,0.139276,0.860724


In [4]:
ev.print_full_df_column(most_incorrect, ['candidate_task', 'label'], True)

candidate_task	label
ENA Origination Support (Getting info from Tycholiz)--$$XXK	neg
wanted only to buy and sell conventional reinsurance, Mr. Sweeney	neg
end, after court hearings on the issue, the Drexel estate	neg
the market observations (as opposed to historical data).	neg
late Tuesday accused 29 Enron officers and directors of	neg
John Elder, creator of a grass-roots investment program to develop	neg
Managing Director, Enterprise Risk	neg
use the	neg
2. Decision support system: a model recommending transactions to a trader without	neg
Published: June 2000	neg
"payments as necessary ""in order to protect and maintain the"	neg
prove - then the judge can void the payments and order the	neg
3. Fully automated trading system, setting prices and executing trades on an electronic trading platform.	neg
to petition the bankruptcy court to cash in unused vacation	neg
massive directional trading and representing huge risks to investors. One way to counter this	neg
contracts based on cold wea

### Manual annotations

**candidate\_task**|**label**|**annotation**
:-----:|:-----:|:-----:
ENA Origination Support (Getting info from Tycholiz)--$$XXK|neg|[3(VBG)]
New York Times on the Web, please contact Alyson|neg|[1]
Racer at alyson@nytimes.com or visit our online media|neg|[1]
wanted only to buy and sell conventional reinsurance, Mr. Sweeney|neg|[3(VBD;VB;VB)]
visit http://www.adamsmark.com/resv/rescheck.html.|neg|[1]
freeze: bridge|neg|[1]
use the following link for enrononline: www.enrononline.com|neg|[1]
end, after court hearings on the issue, the Drexel estate|neg|[2]
seek to buy replacement coverage to offset depletions in their|neg|[1][2]
For general information about NYTimes.com, write to|neg|[1][2]
the market observations (as opposed to historical data).|neg|[3(VBN)]
execute the agreement and return it to me via fax no. (713) 646-3490.  I will|neg|[1]
bring these issues to the table, in a meeting involving all the concerned parties|neg|[1]
late Tuesday accused 29 Enron officers and directors of|neg|[2][3(VBD)]
John Elder, creator of a grass-roots investment program to develop|neg|[2][3(VB)]
Get FREE shipping on orders of $75 or more at Starbucks.com|neg|[1]
Managing Director, Enterprise Risk|neg|[3(VBG)]
use the|neg|[1][2]
2. Decision support system: a model recommending transactions to a trader without|neg|[3(VBG)]
convince the external parties (stock analysts, creditors,  credit rating agencies) about|neg|[1]

**Annotation Key**:
- [1] = Reasonable task
- [2] = Sentence fragment
- [3(\*)] = Overweighted VERB (VB\* forms)


([Markdown Table Generator](https://jakebathman.github.io/Markdown-Table-Generator/))

#### Improvements
- **#3** Add feature around the specific kind of VERB (present vs past, etc)
- **#4** Might be useful to classify the greater body of text (e.g., article, paragraph, ?) as something worth further parsing by line - to help avoid sentence fragmentation or pulling "reasonable tasks" out of irrelevant text.
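
A minimal sketch of what improvement #3 could look like, assuming the featurizer has access to the fine-grained Penn Treebank tags (the `VB*` forms used in the annotations above). The function name and the `VERB_TAG_` feature prefix are hypothetical, not part of the current featurizer:

```python
from collections import Counter

def verb_tag_features(tags):
    """Count each fine-grained verb tag (VB, VBD, VBG, VBN, VBP, VBZ) separately."""
    counts = Counter(tag for tag in tags if tag.startswith("VB"))
    # Hypothetical feature names; the real featurizer may name these differently.
    return {"VERB_TAG_" + tag: n for tag, n in counts.items()}

# Tag sequence for "wanted only to buy and sell conventional reinsurance, Mr. Sweeney",
# annotated above as [3(VBD;VB;VB)].
tags = ['VBD', 'RB', 'TO', 'VB', 'CC', 'VB', 'JJ', 'NN', ',', 'NNP', 'NNP']
features = verb_tag_features(tags)
print(features)
```

With features like these, a past-tense `VBD` would no longer carry the same weight as an imperative-looking `VB`.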

In [5]:
ev.render_displacy(most_incorrect, ev.DisplacyStyles.dep)

['NNP', 'NNP', 'NNP', '-LRB-', 'VBG', 'NN', 'IN', 'NN']
-----



['VBD', 'RB', 'TO', 'VB', 'CC', 'VB', 'JJ', 'NN', ',', 'NNP', 'NNP']
-----



['VB', ',', 'IN', 'NN', 'NNS', 'IN', 'DT', 'NN', ',', 'DT', 'NNP', 'NN']
-----



['DT', 'NN', 'NNS', '-LRB-', 'IN', 'VBN', 'IN', 'JJ', 'NNS', '-RRB-', '.']
-----



['JJ', 'NNP', 'VBD', 'CD', 'NNP', 'NNS', 'CC', 'NNS', 'IN']
-----



['NNP', 'NNP', ',', 'NN', 'IN', 'DT', 'NN', 'HYPH', 'NNS', 'NN', 'NN', 'TO', 'VB']
-----



['VBG', 'NNP', ',', 'NN', 'NN']
-----



['VB', 'DT']
-----



['LS', '.', 'NN', 'NN', 'NN', ':', 'DT', 'NN', 'VBG', 'NNS', 'IN', 'DT', 'NN', 'IN']
-----



['VBN', ':', 'NNP', 'CD']
-----



['NNS', 'IN', 'JJ', "''", 'IN', 'NN', 'TO', 'VB', 'CC', 'VB', 'DT']
-----



['VB', ':', 'RB', 'DT', 'NN', 'MD', 'VB', 'DT', 'NNS', 'CC', 'VB', 'DT']
-----



['LS', '.', 'RB', 'VBN', 'NN', 'NN', ',', 'VBG', 'NNS', 'CC', 'VBG', 'NNS', 'IN', 'DT', 'JJ', 'NN', 'NN', '.']
-----



['TO', 'VB', 'DT', 'NN', 'NN', 'TO', 'VB', 'IN', 'JJ', 'NN']
-----



['JJ', 'JJ', 'NN', 'CC', 'VBG', 'JJ', 'NNS', 'IN', 'NNS', '.', 'CD', 'NN', 'TO', 'VB', 'DT']
-----



['NNS', 'VBN', 'IN', 'JJ', 'NN', '.']
-----



['CD', 'NNS', 'VBN', 'IN', 'NN', '-LRB-', 'JJ', 'NN', 'IN', 'NNP', '-RRB-', '.']
-----



['JJ', 'NNS', 'VBN', 'IN', 'DT', 'NN', ':']
-----



['NNS', 'IN', 'RB', 'CD', 'NNS', ',', 'VBG', 'IN', 'JJ', 'NNS']
-----



['UH', 'VB', 'PRP$', 'NN', 'CD', 'HYPH', 'CD']
-----



In [6]:
most_incorrect_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_INCORRECT_POS, n=20)
most_incorrect_pos

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
248,I need your dues by COB today!,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg,0.44212,0.55788,0.55788


**Minor observation**: I've experimented with "commitment" and "request" models trained on email data that would have easily labeled this correctly.

In [7]:
most_correct = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT, n=20)
most_correct

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.02173,0.97827,0.97827
894,"Currently, the trading that is taking place ty...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.022262,0.977738,0.977738
109,cop and Ernie the taxi driver in Frank Capra's...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022798,0.977202,0.977202
612,"The tulips are too excitable, it is winter here.",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.02371,0.97629,0.97629
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.026688,0.973312,0.973312
1001,The most beautiful thing we can experience is ...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.029022,0.970978,0.970978
121,"35. ""Stewardesses"" is the longest word that is...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.030214,0.969786,0.969786
1027,trading desks is perfect. One problem that is ...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_ADJ': 2, 'FOLLOWING...",neg,0.033225,0.966775,0.966775
1000,Imagination is more important than knowledge...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.034947,0.965053,0.965053
1032,One way to address the problem is to implement...,<ProbDist with 2 samples>,neg,"{'VERB': 4, 'FOLLOWING_POS_DET': 2, 'FOLLOWING...",neg,0.035199,0.964801,0.964801


In [8]:
ev.print_full_df_column(most_correct, ['candidate_task', 'label'], True)

candidate_task	label
5. A shark is the only fish that can blink with both eyes.	neg
Currently, the trading that is taking place typically involves	neg
"cop and Ernie the taxi driver in Frank Capra's ""It's a Wonderful Life."""	neg
The tulips are too excitable, it is winter here.	neg
Another aspect of this problem is prioritization of different projects. In my view	neg
"The most beautiful thing we can experience is the mysterious.  It is	neg
the source of all true art and science."	neg
"35. ""Stewardesses"" is the longest word that is typed with only the left"	neg
trading desks is perfect. One problem that is persistent in some portfolios is the disconnect	neg
Imagination is more important than knowledge...	neg
One way to address the problem is to implement a module that compares systematically	neg
What is missing is the industrial type process that guarantees the quality	neg
building is an American flag.	neg
The old man steps up to the tee and hits the ball. It goes sailing	neg
2. I ha

In [9]:
most_correct_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT_POS, n=20)
most_correct_pos

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
1134,"New York Times on the Web, please contact Alyson",<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.93827,0.06173,0.93827
1517,"install v4.0 build 23, follow old instructions...",<ProbDist with 2 samples>,pos,"{'VERB': 4, 'FOLLOWING_POS_NOUN': 3, 'FOLLOWIN...",pos,0.937159,0.062841,0.937159
1135,Racer at alyson@nytimes.com or visit our onlin...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.93368,0.06632,0.93368
273,Check last years tax return for RRSP contribut...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.923122,0.076878,0.923122
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.912509,0.087491,0.912509
1255,Call Ambrosia ? Monday morning,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.912509,0.087491,0.912509
1328,freeze: bridge,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.911395,0.088605,0.911395
605,Send ED-mail to Ed.,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.910623,0.089377,0.910623
1280,use the following link for enrononline: www.en...,<ProbDist with 2 samples>,pos,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.908801,0.091199,0.908801
163,Check for ML,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.905871,0.094129,0.905871


In [10]:
most_uncertain = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_UNCERTAIN, n=20)
most_uncertain

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
648,Even through the gift paper I could hear them ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PRON': 1, 'FOLLOWIN...",neg,0.499242,0.500758,0.500758
727,very interested in this guy and would be ready...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.500941,0.499059,0.500941
1108,"to certain employees, and not proceed with the...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.502288,0.497712,0.502288
1097,up as to whether there was fair consideration ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.502583,0.497417,0.502583
559,Freeman indicated that the Power Authority has...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_ADP': 2, 'FOLLOWING...",pos,0.502932,0.497068,0.502932
786,"For your convenience, this workshop will be he...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.49622,0.50378,0.50378
998,You cannot simultaneously prevent and prepare ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_CCONJ': 1, 'FOLLOWI...",neg,0.495866,0.504134,0.504134
903,"of Trade, which began trading the options in 1...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.493854,0.506146,0.506146
380,maine impossible to get to...next option?,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510187,0.489813,0.510187
632,"Their smiles catch onto my skin, little smilin...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510244,0.489756,0.510244


In [11]:
ev.print_full_df_column(most_uncertain, ['candidate_task', 'pred', 'label'], True)

candidate_task	pred	label
Even through the gift paper I could hear them breathe	neg	neg
very interested in this guy and would be ready to bring him over	pos	neg
to certain employees, and not proceed with the payments	pos	neg
up as to whether there was fair consideration given for the	pos	neg
Freeman indicated that the Power Authority has signed letters of intent to purchase output from 14 biomass facilities in the Central Valley, as well as 400 MW generated by wind.	pos	neg
For your convenience, this workshop will be held at three different times	neg	neg
You cannot simultaneously prevent and prepare for war.	neg	neg
of Trade, which began trading the options in 1996.	neg	neg
maine impossible to get to...next option?	pos	pos
Their smiles catch onto my skin, little smiling hooks.	pos	neg
paste into this e-mail system it happens automatically.	pos	neg
a mild winter, and it would be able to use the futures to hedge a	neg	neg
Keeping it all together, through the change.	neg	neg
tmpl_name=sto

In [12]:
ev.confusion_matrix(most_uncertain[['candidate_task', 'prob_dist', 'label', 'features']].itertuples(index=False))

array([[11,  8],
       [ 0,  1]])
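
The `MOST_UNCERTAIN` rows above are ordered by `max_dist`, smallest first, i.e. closest to a 50/50 split. A minimal sketch of that ordering on a toy frame, assuming `max_dist` is simply the larger of `pos_dist` and `neg_dist` (consistent with the table, but this is not the code from `evaluate.py`):

```python
import pandas as pd

def most_uncertain(df, n=5):
    """Return the n rows whose winning probability is closest to 0.5."""
    df = df.copy()
    df["max_dist"] = df[["pos_dist", "neg_dist"]].max(axis=1)
    return df.sort_values("max_dist").head(n)

# Toy rows; the probabilities are borrowed from the tables above.
toy = pd.DataFrame({
    "candidate_task": ["a", "b", "c", "d"],
    "pos_dist": [0.973595, 0.499242, 0.02173, 0.510187],
})
toy["neg_dist"] = 1 - toy["pos_dist"]

result = most_uncertain(toy, n=2)
print(result)
```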

#### Q: What TP candidates (actual tasks) didn't have features?

In [13]:
all_correct_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT_POS)

In [14]:
len(all_correct_pos)

96

In [15]:
no_features_mask = all_correct_pos['features'] == {}
tp_no_features = all_correct_pos.iloc[np.where(no_features_mask)]
tp_no_features

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
746,McCormack,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
747,Article about the group,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
604,"Dentist on Jan 25, 7:45.",<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
753,"1, Li Xiao",<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
754,2. Accom,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
1172,Krishna's letters,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
755,3. Barbara Weather,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
764,10:00 Sat,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
984,Headcount,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
1171,EES presentations to Rhys,<ProbDist with 2 samples>,pos,{},pos,0.560698,0.439302,0.560698
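
The `== {}` comparison above works, but it leans on how pandas compares an object column against a dict. A slightly more explicit alternative, sketched on a toy frame (the rows and the shortened feature dict are illustrative, not real output), checks the length of each feature dict instead:

```python
import pandas as pd

toy = pd.DataFrame({
    "candidate_task": ["Headcount", "Check for ML", "10:00 Sat"],
    "features": [{}, {"VERB": 1}, {}],
})

# True where the feature dict is empty
no_features_mask = toy["features"].map(len) == 0
tp_no_features = toy[no_features_mask]
print(tp_no_features)
```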


**Improvement #5**: Manual re-labeling due to my "split by new line" decision is necessary for more accurate results. All sample datasets have instances of mis-labeled data.

In [16]:
ev.render_displacy(tp_no_features, [ev.DisplacyStyles.dep, ev.DisplacyStyles.ent])

['NN']
-----



['NN', 'IN', 'DT', 'NN']
-----



['NN', 'IN', '', 'NNP', 'CD', ',', 'CD', '.']
-----



['LS', ',', 'NNP', 'NNP']
-----



['LS', '.', 'ADD']
-----



['NNP', 'POS', 'NNS']
-----



['LS', '.', 'NNP', 'NNP']
-----



['CD', '', 'NNP']
-----



['NNP']
-----



['NNP', 'NNS', 'IN', 'NNP']
-----



['NNS']
-----



['JJ', 'NN', '.']
-----



['NNS', ',']
-----



['NNP']
-----



['NNP', 'IN', 'CD', 'NN']
-----



['NN', 'IN', '', 'NNP', 'CD', ',', 'CD', '.']
-----



['NN', 'IN', 'NN']
-----



['NN']
-----



['LS', '.', 'NNP', 'NNP']
-----



['$', 'CD', '', 'IN', 'NNP', 'NNP']
-----



[':', 'NNP', ',', 'NNP', 'NNS']
-----



[':', 'NNP', ',', 'NNP', 'NNP']
-----



['HYPH', 'NNP', ',', 'NNP', 'NNP']
-----



['$', 'CD']
-----



['IN', 'NNP']
-----



['$', 'CD']
-----



['IN', 'NNP']
-----



['CD']
-----



['LS', '.', 'NNP', 'NNP']
-----



['LS', '.', 'NN']
-----



['NNP', 'NNP', 'NNP', 'NNP', 'CD', 'NN']
-----



['NN', 'NNS']
-----



['NNS', 'IN', 'DT', 'NN', '.']
-----



['NNP', ',']
-----



['NN']
-----



['NN', 'CC', 'NN', 'NN']
-----



['NNP', 'IN', 'NNP', '.']
-----



['NN', 'CD', 'NNP', 'NN', 'IN', 'NNP']
-----



['NN', 'NN']
-----



['NNP']
-----



['LS', '.', 'NN']
-----



['NNP']
-----



['NNP']
-----



['NN', 'IN', 'NN']
-----



['NNP', 'NN']
-----



['LS', '.', 'NNP', '', 'IN', 'RB']
-----



['LS', '.', 'NN', 'NN', 'NNP', 'NNP']
-----



['NNP', 'NNP']
-----



['NNP']
-----



['JJ', 'HYPH', 'NN']
-----



['NN']
-----



['NNP', '', 'CD', 'NN', '_SP', 'PRP', 'MD']
-----



['NN', 'IN', 'NNP']
-----



['CD', 'SYM', 'CD']
-----



['NNP', 'NNP']
-----



['NNP', 'NNP']
-----



### Observations

- Model is bad at:
    - Incomplete sentences/fragments
    - Tasks without verbs (noun phrases)
    - Taking verb tense into account
- Model is good at:
    - Finding tasks in weird contexts
- Model is most uncertain about: 

## Writing: get_sample_predictions

Understanding code from [fast.ai 2018: lesson 1](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb) for getting various types of sample predictions.

`def rand_by_mask(mask, n): return np.random.choice(np.where(mask)[0], n, replace=False)`

    np.random.choice(a, size=None, replace=True, p=None)

    np.where(condition[, x, y])

`def rand_by_correct(preds, is_correct): return rand_by_mask((preds.preds == preds.labels)==is_correct)`
    
    np.where((preds.preds == preds.labels)==is_correct)

Goals:
1. Get `preds` into a form so that `rand_by_mask` works on it
2. Get `preds` into a form so that we can find what is correct vs incorrect
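
To make the two NumPy calls concrete, here is `rand_by_mask` run on a toy boolean mask (the mask values are made up): `np.where(mask)[0]` turns the mask into the indices where it is `True`, and `np.random.choice(..., replace=False)` draws `n` distinct indices from those.

```python
import numpy as np

def rand_by_mask(mask, n):
    # np.where(mask)[0] -> array of indices where mask is True
    return np.random.choice(np.where(mask)[0], n, replace=False)

mask = np.array([True, False, True, True, False, True])
print(np.where(mask)[0])   # the candidate indices

picked = rand_by_mask(mask, 3)
print(picked)              # 3 distinct indices, all pointing at True entries
```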

### What is `preds`?

In [17]:
type(preds)

list

In [18]:
type(preds[1]), preds[1]

(tuple,
 ('http://gasfundy.corp.enron.com/gas/framework/default.asp',
  <ProbDist with 2 samples>,
  'neg',
  {}))

where `tuple == (candidate_task, prediction_dist, label, features)`

In [19]:
[(p[0], p[1].max(), p[2], p[3]) for p in preds][1]

('http://gasfundy.corp.enron.com/gas/framework/default.asp', 'pos', 'neg', {})

In [20]:
type(preds[1][1]), preds[1][1]

(nltk.probability.DictionaryProbDist, <ProbDist with 2 samples>)

In [21]:
import nltk
??nltk.probability.DictionaryProbDist

In [22]:
preds[1][1].samples()

dict_keys(['neg', 'pos'])

In [23]:
preds[1][1].max()

'pos'

In [24]:
preds[1][1].prob('pos'), preds[1][1].prob('neg')

(0.56069809800114434, 0.43930190199885577)

### Form of `preds` that works with `rand_by_mask(mask, n)`?

For `np.random.choice`, the first argument must be a 1-dimensional array-like or an int

In [25]:
import numpy as np
np_preds = np.array(preds)
np_preds.shape

(1526, 4)

`preds` as it stands will not work. Instead, we can feed in the length of `preds` and let `np.random.choice` select random indices of `preds`.

In [26]:
n = 5 # number of results
mask = len(np_preds)

In [27]:
np_random = np_preds[np.random.choice(mask, n)]

In [28]:
np_random

array([['http://www.science-finance.fr/publications.html',
        <ProbDist with 2 samples>, 'neg', {}],
       ['They pass the way gulls pass inland in their white caps,',
        <ProbDist with 2 samples>, 'neg',
        {'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING_POSTAG_DT': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_VERB': 1, 'PARENT_POSTAG_VBP': 1, 'CHILD_DEP_NSUBJ': 2, 'CHILD_POS_PRON': 1, 'CHILD_POSTAG_PRP': 1, 'CHILD_DEP_DOBJ': 1, 'CHILD_POS_NOUN': 2, 'CHILD_POSTAG_NN': 1, 'CHILD_DEP_PUNCT': 1, 'CHILD_POS_PUNCT': 1, 'CHILD_POSTAG_,': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING_POSTAG_RB': 1, 'PARENT_DEP_DOBJ': 1, 'PARENT_POS_NOUN': 1, 'PARENT_POSTAG_NN': 1, 'CHILD_POSTAG_NNS': 1, 'CHILD_DEP_ADVMOD': 1, 'CHILD_POS_ADV': 1, 'CHILD_POSTAG_RB': 1, 'CHILD_DEP_PREP': 1, 'CHILD_POS_ADP': 1, 'CHILD_POSTAG_IN': 1}],
       ['Houston Chronicle Online', <ProbDist with 2 samples>, 'neg', {}],
       ['ID:         mcconnellmarkr', <ProbDist with 2 samples>, 'neg', {}],
       ['https://www.admissio

In [29]:
len(np_random)

5
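
One caveat with the call above: `np.random.choice` samples with replacement by default, so the same row can come back more than once. A small sketch of the same index-sampling idea with `replace=False` (the sizes here are arbitrary):

```python
import numpy as np

n_rows, n = 10, 5

# Passing an int samples from np.arange(n_rows); replace=False keeps the indices distinct.
idxs = np.random.choice(n_rows, n, replace=False)
print(idxs)
```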

### Form of `preds` that works with `rand_by_correct(preds, is_correct)`?

`preds` must be broken down into predictions and labels. So first, let's get the predictions.

Remember that each `tuple` in `preds` == `(candidate_task, prediction_dist, label, features)`

In [30]:
[p[1].max() for p in preds][:n]

['pos', 'pos', 'pos', 'pos', 'pos']

Labels can be retrieved similarly

In [31]:
[p[2] for p in preds][:n]

['neg', 'neg', 'neg', 'neg', 'neg']

`preds` might be more readable as a pandas DataFrame, so let's convert it.

In [32]:
import pandas as pd

preds_df = pd.DataFrame.from_records(preds, columns=['candidate_task', 'prob_dist', 'label', 'features'])
preds_df['pred'] = [p.max() for p in preds_df['prob_dist']]

In [33]:
preds_df[:n]

Unnamed: 0,candidate_task,prob_dist,label,features,pred
0,https://www4.rsweb.com/61045/,<ProbDist with 2 samples>,neg,{},pos
1,http://gasfundy.corp.enron.com/gas/framework/d...,<ProbDist with 2 samples>,neg,{},pos
2,212 836 5030,<ProbDist with 2 samples>,neg,{},pos
3,http://fundamentals.corp.enron.com/main.asp,<ProbDist with 2 samples>,neg,{},pos
4,adams 30 631,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos


In [34]:
[p.max() for p in preds_df['prob_dist']][:n]

['pos', 'pos', 'pos', 'pos', 'pos']

In [35]:
preds_df['label'][:n].values

array(['neg', 'neg', 'neg', 'neg', 'neg'], dtype=object)

Putting this all together, we can write our own:

#### `rand_by_correct(preds, n=5, is_correct=True)`

In [36]:
def rand_by_mask(mask, n):
    return np.random.choice(np.where(mask)[0], n, replace=False)

def rand_by_correct(preds_df, n = 5, is_correct = True):
    return preds_df.iloc[rand_by_mask(([p.max() for p in preds_df['prob_dist']] == preds_df['label'].values)==is_correct, n)]

In [37]:
# correct
rand_by_correct(preds_df)

Unnamed: 0,candidate_task,prob_dist,label,features,pred
995,The most incomprehensible thing about the worl...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",neg
1012,crirticism is to emphasize the quality of Enro...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PART': 1, 'FOLLOWIN...",neg
1094,means it probably does become an issue in the ...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_PRON': 1, 'FOLLOWIN...",neg
16,"5) In Section 9(a), the non - compete clause i...",<ProbDist with 2 samples>,neg,"{'VERB': 11, 'FOLLOWING_POS_VERB': 1, 'FOLLOWI...",neg
104,20. An ostrich's eye is bigger than its brain.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg


In [38]:
# incorrect
rand_by_correct(preds_df, is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred
737,for a week to discuss the weather forecasting ...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos
433,Feb-04,<ProbDist with 2 samples>,neg,{},pos
1438,ID: Enr161,<ProbDist with 2 samples>,neg,{},pos
1164,http://www.stoft.com/x/cal/index.shtml,<ProbDist with 2 samples>,neg,{},pos
281,EB4054,<ProbDist with 2 samples>,neg,{},pos
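
Since `preds_df` already has a `pred` column, the mask does not need to call `.max()` on every `ProbDist` again. A sketch of that variant on a toy frame (the labels and predictions below are made up, and only the two columns the function needs are included):

```python
import numpy as np
import pandas as pd

def rand_by_correct(preds_df, n=5, is_correct=True):
    # Boolean mask: True where the prediction's correctness matches is_correct
    mask = (preds_df["pred"] == preds_df["label"]).values == is_correct
    idxs = np.random.choice(np.where(mask)[0], n, replace=False)
    return preds_df.iloc[idxs]

toy = pd.DataFrame({
    "label": ["neg", "neg", "pos", "neg", "pos", "neg"],
    "pred":  ["neg", "pos", "pos", "neg", "neg", "pos"],
})

correct = rand_by_correct(toy, n=2)
incorrect = rand_by_correct(toy, n=3, is_correct=False)
print(correct, incorrect, sep="\n\n")
```

Note that `replace=False` raises a `ValueError` if `n` is larger than the number of matching rows.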


Cool! Adding more goals: this time, getting the 'most correct' and 'most incorrect' predictions.

`def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]`
    
    numpy.argsort(a, axis=-1, kind='quicksort', order=None)

`def most_by_correct(y, is_correct): 
    mult = -1 if (y==1)==is_correct else 1
    return most_by_mask(((preds == data.val_y)==is_correct) & (data.val_y == y), mult)`
    
    mult: flips the sort order. With mult = 1, argsort puts the lowest probabilities first; with mult = -1, the highest probabilities come first. Which end counts as 'most correct' depends on the class y and on is_correct.
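
The effect of `mult` is easiest to see on a toy array (the probabilities below are made up): `np.argsort` always sorts ascending, so multiplying by `-1` is what turns "lowest first" into "highest first".

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.6, 0.4])

lowest_first = np.argsort(1 * probs)    # indices of probs from smallest to largest
highest_first = np.argsort(-1 * probs)  # indices of probs from largest to smallest
print(lowest_first, highest_first)
```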

Goals:
1. Get `preds` into a form so that `most_by_mask` works on it
2. Get `preds` into a form so that we can find what is _most_ correct vs _most_ incorrect

Unlike the `Cats vs Dogs` predictions, NLTK provides a probability distribution with a separate probability for each possible label.

Sorting is easy if we break up the values within the `prob_dist` column

In [39]:
[p.prob('pos') for p in preds_df['prob_dist']][:n]

[0.56069809800114434,
 0.56069809800114434,
 0.56069809800114434,
 0.56069809800114434,
 0.59312608626826946]

In [40]:
[p.prob('neg') for p in preds_df['prob_dist']][:n]

[0.43930190199885577,
 0.43930190199885577,
 0.43930190199885577,
 0.43930190199885577,
 0.40687391373173049]

In [41]:
preds_df['pos_dist'] = [p.prob('pos') for p in preds_df['prob_dist']]
preds_df['neg_dist'] = [p.prob('neg') for p in preds_df['prob_dist']]

In [42]:
preds_df[:n]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
0,https://www4.rsweb.com/61045/,<ProbDist with 2 samples>,neg,{},pos,0.560698,0.439302
1,http://gasfundy.corp.enron.com/gas/framework/d...,<ProbDist with 2 samples>,neg,{},pos,0.560698,0.439302
2,212 836 5030,<ProbDist with 2 samples>,neg,{},pos,0.560698,0.439302
3,http://fundamentals.corp.enron.com/main.asp,<ProbDist with 2 samples>,neg,{},pos,0.560698,0.439302
4,adams 30 631,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos,0.593126,0.406874


In [43]:
preds_df.sort_values('pos_dist', ascending=False)[:5]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
396,ENA Origination Support (Getting info from Tyc...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.973595,0.026405
1134,"New York Times on the Web, please contact Alyson",<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.93827,0.06173
1517,"install v4.0 build 23, follow old instructions...",<ProbDist with 2 samples>,pos,"{'VERB': 4, 'FOLLOWING_POS_NOUN': 3, 'FOLLOWIN...",pos,0.937159,0.062841
1135,Racer at alyson@nytimes.com or visit our onlin...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.93368,0.06632
926,wanted only to buy and sell conventional reins...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",pos,0.928333,0.071667


In [44]:
preds_df.sort_values('neg_dist', ascending=False)[:5]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.02173,0.97827
894,"Currently, the trading that is taking place ty...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.022262,0.977738
109,cop and Ernie the taxi driver in Frank Capra's...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022798,0.977202
612,"The tulips are too excitable, it is winter here.",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.02371,0.97629
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.026688,0.973312


In [45]:
preds[1][1].samples()

dict_keys(['neg', 'pos'])

Drumroll...

#### most_by_correct(preds_df, label, n=5, is_correct=True)

In [46]:
def most_by_correct(preds_df, label, n=5, is_correct=True):
    label_options = list(preds_df['prob_dist'][0].samples())
    # note the 2-class assumption
    other_label = label_options[abs(label_options.index(label)-1)]
    # if is_correct, label dist is what matters; if incorrect, look at other_label dist
    dist_column = (label if is_correct else other_label) + '_dist'
    
    correct_mask = ([p.max() for p in preds_df['prob_dist']] == preds_df['label'].values)==is_correct
    target_label_mask = preds_df['label'].values == label
    
    return preds_df.iloc[np.where(correct_mask & target_label_mask)].sort_values(dist_column, ascending=False)[:n]

In [47]:
most_by_correct(preds_df, 'pos')

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1134,"New York Times on the Web, please contact Alyson",<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.93827,0.06173
1517,"install v4.0 build 23, follow old instructions...",<ProbDist with 2 samples>,pos,"{'VERB': 4, 'FOLLOWING_POS_NOUN': 3, 'FOLLOWIN...",pos,0.937159,0.062841
1135,Racer at alyson@nytimes.com or visit our onlin...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.93368,0.06632
273,Check last years tax return for RRSP contribut...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.923122,0.076878
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.912509,0.087491


In [48]:
most_by_correct(preds_df, 'neg')

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.02173,0.97827
894,"Currently, the trading that is taking place ty...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.022262,0.977738
109,cop and Ernie the taxi driver in Frank Capra's...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022798,0.977202
612,"The tulips are too excitable, it is winter here.",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.02371,0.97629
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.026688,0.973312


In [49]:
most_by_correct(preds_df, 'pos', is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
248,I need your dues by COB today!,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg,0.44212,0.55788


In [50]:
most_by_correct(preds_df, 'neg', is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
396,ENA Origination Support (Getting info from Tyc...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.973595,0.026405
926,wanted only to buy and sell conventional reins...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",pos,0.928333,0.071667
1091,"end, after court hearings on the issue, the Dr...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.907394,0.092606
1023,the market observations (as opposed to histori...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.881702,0.118298
1121,late Tuesday accused 29 Enron officers and dir...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos,0.880221,0.119779


Final goal for `get_sample_predictions`: surface the most uncertain predictions.

`most_uncertain = np.argsort(np.abs(probs - 0.5))[:4]`

Basically, we sort by distance from 0.5 (on either side), so the probabilities closest to 0.5 come first.

#### most_uncertain(preds_df, label, n=5)

In [51]:
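def most_uncertain(preds_df, label, n=5):
    dist_column = label + '_dist'
    # rank rows by how far the label's probability is from 0.5; smallest distance first
    distance = np.abs(preds_df[dist_column].values - 0.5)
    return preds_df.iloc[np.argsort(distance)][:n]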

In [52]:
most_uncertain(preds_df, 'pos', 10)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
648,Even through the gift paper I could hear them ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PRON': 1, 'FOLLOWIN...",neg,0.499242,0.500758
727,very interested in this guy and would be ready...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.500941,0.499059
1108,"to certain employees, and not proceed with the...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.502288,0.497712
1097,up as to whether there was fair consideration ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.502583,0.497417
559,Freeman indicated that the Power Authority has...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_ADP': 2, 'FOLLOWING...",pos,0.502932,0.497068
786,"For your convenience, this workshop will be he...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.49622,0.50378
998,You cannot simultaneously prevent and prepare ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_CCONJ': 1, 'FOLLOWI...",neg,0.495866,0.504134
903,"of Trade, which began trading the options in 1...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.493854,0.506146
380,maine impossible to get to...next option?,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510187,0.489813
632,"Their smiles catch onto my skin, little smilin...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510244,0.489756


In [53]:
most_uncertain(preds_df, 'neg', 10)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
648,Even through the gift paper I could hear them ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PRON': 1, 'FOLLOWIN...",neg,0.499242,0.500758
727,very interested in this guy and would be ready...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.500941,0.499059
1108,"to certain employees, and not proceed with the...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.502288,0.497712
1097,up as to whether there was fair consideration ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.502583,0.497417
559,Freeman indicated that the Power Authority has...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_ADP': 2, 'FOLLOWING...",pos,0.502932,0.497068
786,"For your convenience, this workshop will be he...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.49622,0.50378
998,You cannot simultaneously prevent and prepare ...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_CCONJ': 1, 'FOLLOWI...",neg,0.495866,0.504134
903,"of Trade, which began trading the options in 1...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.493854,0.506146
380,maine impossible to get to...next option?,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510187,0.489813
632,"Their smiles catch onto my skin, little smilin...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.510244,0.489756


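Turns out, both `label`s return the same list. That makes sense with only two classes: `neg_dist` is `1 - pos_dist`, so both columns are exactly the same distance from 0.5 and sort identically.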
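
A quick check of that symmetry on made-up numbers, assuming two classes so that `neg_dist = 1 - pos_dist` (the `toy` frame below is a stand-in for `preds_df`, not real data):

```python
import numpy as np
import pandas as pd

# made-up probabilities; only the pos/neg relationship matters here
toy = pd.DataFrame({"pos_dist": [0.94, 0.51, 0.02, 0.497, 0.60]})
toy["neg_dist"] = 1.0 - toy["pos_dist"]

# positions of the rows, ordered from most to least uncertain
order_pos = np.argsort(np.abs(toy["pos_dist"].to_numpy() - 0.5))
order_neg = np.argsort(np.abs(toy["neg_dist"].to_numpy() - 0.5))

same_order = list(order_pos) == list(order_neg)
```

Both orderings come out the same, which suggests the `label` argument of `most_uncertain` is only meaningful once there are more than two classes.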

## Writing: aggregate_features

Goal: figure out which features correlate with (in)correct predictions.

Example set of feature dictionaries...

In [54]:
most_incorrect[['features']]

Unnamed: 0,features
396,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN..."
926,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING..."
1091,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI..."
1023,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING..."
1121,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING..."
777,"{'VERB': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_..."
167,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI..."
1282,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING..."
803,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN..."
196,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI..."


Convert one feature dictionary to dataframe...

In [55]:
most_incorrect[['features']].iloc[1].values[0]

{'CHILD_DEP_ADVMOD': 1,
 'CHILD_DEP_AUX': 1,
 'CHILD_DEP_CC': 1,
 'CHILD_DEP_CONJ': 1,
 'CHILD_DEP_DOBJ': 1,
 'CHILD_DEP_NPADVMOD': 1,
 'CHILD_DEP_PUNCT': 1,
 'CHILD_DEP_XCOMP': 1,
 'CHILD_POSTAG_,': 1,
 'CHILD_POSTAG_CC': 1,
 'CHILD_POSTAG_NN': 1,
 'CHILD_POSTAG_NNP': 1,
 'CHILD_POSTAG_RB': 1,
 'CHILD_POSTAG_TO': 1,
 'CHILD_POSTAG_VB': 2,
 'CHILD_POS_ADV': 1,
 'CHILD_POS_CCONJ': 1,
 'CHILD_POS_NOUN': 1,
 'CHILD_POS_PART': 1,
 'CHILD_POS_PROPN': 1,
 'CHILD_POS_PUNCT': 1,
 'CHILD_POS_VERB': 2,
 'FOLLOWING_POSTAG_CC': 1,
 'FOLLOWING_POSTAG_JJ': 1,
 'FOLLOWING_POSTAG_RB': 1,
 'FOLLOWING_POS_ADJ': 1,
 'FOLLOWING_POS_ADV': 1,
 'FOLLOWING_POS_CCONJ': 1,
 'PARENT_DEP_ROOT': 2,
 'PARENT_DEP_XCOMP': 1,
 'PARENT_POSTAG_VB': 1,
 'PARENT_POSTAG_VBD': 2,
 'PARENT_POS_VERB': 3,
 'VERB': 3}

In [56]:
f_dict = pd.DataFrame.from_dict(most_incorrect[['features']].iloc[1].values[0], orient='index').reset_index()
f_dict

Unnamed: 0,index,0
0,VERB,3
1,FOLLOWING_POS_ADV,1
2,FOLLOWING_POSTAG_RB,1
3,PARENT_DEP_ROOT,2
4,PARENT_POS_VERB,3
5,PARENT_POSTAG_VBD,2
6,CHILD_DEP_XCOMP,1
7,CHILD_POS_VERB,2
8,CHILD_POSTAG_VB,2
9,CHILD_DEP_PUNCT,1


In [57]:
f_dict.columns = ['feature', 'count']
f_dict

Unnamed: 0,feature,count
0,VERB,3
1,FOLLOWING_POS_ADV,1
2,FOLLOWING_POSTAG_RB,1
3,PARENT_DEP_ROOT,2
4,PARENT_POS_VERB,3
5,PARENT_POSTAG_VBD,2
6,CHILD_DEP_XCOMP,1
7,CHILD_POS_VERB,2
8,CHILD_POSTAG_VB,2
9,CHILD_DEP_PUNCT,1


Aggregating two feature dataframes into one by feature occurrence count...

In [58]:
pd.concat([f_dict, f_dict]).groupby('feature').sum().reset_index()

Unnamed: 0,feature,count
0,CHILD_DEP_ADVMOD,2
1,CHILD_DEP_AUX,2
2,CHILD_DEP_CC,2
3,CHILD_DEP_CONJ,2
4,CHILD_DEP_DOBJ,2
5,CHILD_DEP_NPADVMOD,2
6,CHILD_DEP_PUNCT,2
7,CHILD_DEP_XCOMP,2
8,"CHILD_POSTAG_,",2
9,CHILD_POSTAG_CC,2


Putting it all together...

In [59]:
f_dicts = []
for i in range(len(most_incorrect)):
    f_dict = pd.DataFrame.from_dict(most_incorrect[['features']].iloc[i].values[0], orient='index').reset_index()
    f_dict.columns = ['feature', 'count']
    f_dicts.append(f_dict)
f_dicts

[               feature  count
 0                 VERB      1
 1   FOLLOWING_POS_NOUN      1
 2  FOLLOWING_POSTAG_NN      1
 3      PARENT_DEP_ROOT      1
 4     PARENT_POS_PROPN      1
 5    PARENT_POSTAG_NNP      1
 6       CHILD_DEP_DOBJ      1
 7       CHILD_POS_NOUN      1
 8      CHILD_POSTAG_NN      1,                 feature  count
 0                  VERB      3
 1     FOLLOWING_POS_ADV      1
 2   FOLLOWING_POSTAG_RB      1
 3       PARENT_DEP_ROOT      2
 4       PARENT_POS_VERB      3
 5     PARENT_POSTAG_VBD      2
 6       CHILD_DEP_XCOMP      1
 7        CHILD_POS_VERB      2
 8       CHILD_POSTAG_VB      2
 9       CHILD_DEP_PUNCT      1
 10      CHILD_POS_PUNCT      1
 11       CHILD_POSTAG_,      1
 12   CHILD_DEP_NPADVMOD      1
 13      CHILD_POS_PROPN      1
 14     CHILD_POSTAG_NNP      1
 15  FOLLOWING_POS_CCONJ      1
 16  FOLLOWING_POSTAG_CC      1
 17     CHILD_DEP_ADVMOD      1
 18        CHILD_POS_ADV      1
 19      CHILD_POSTAG_RB      1
 20        CHILD_D

In [60]:
pd.concat(f_dicts).groupby('feature').sum().sort_values('count', ascending=False).reset_index()

Unnamed: 0,feature,count
0,VERB,29
1,PARENT_DEP_ROOT,23
2,PARENT_POS_VERB,15
3,CHILD_DEP_DOBJ,13
4,CHILD_POSTAG_IN,12
5,CHILD_POS_ADP,12
6,CHILD_POS_NOUN,12
7,PARENT_POS_NOUN,11
8,CHILD_DEP_PREP,11
9,PARENT_POSTAG_VB,10


#### aggregate_features(preds_df, n=0)

In [61]:
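def aggregate_features(preds_df, n=0):
    # build one (feature, count) dataframe per row of preds_df
    f_dicts = []
    for i in range(len(preds_df)):  # iterate over the argument, not the global most_incorrect
        f_dict = pd.DataFrame.from_dict(preds_df[['features']].iloc[i].values[0], orient='index').reset_index()
        f_dict.columns = ['feature', 'count']
        f_dicts.append(f_dict)
    
    # sum counts per feature across all rows, most frequent first
    agg = pd.concat(f_dicts).groupby('feature').sum().sort_values('count', ascending=False).reset_index()
    
    # n <= 0 means "return every feature"
    if n <= 0:
        return agg
    else:
        return agg[:n]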
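
The per-row dataframes are not strictly needed for the summing step. A minimal alternative sketch using `collections.Counter`, assuming each entry in the `features` column is a plain `dict` of counts; the name `aggregate_features_counter` and the toy dictionaries are hypothetical:

```python
from collections import Counter

import pandas as pd

def aggregate_features_counter(feature_dicts, n=0):
    # add up the counts of every feature dictionary, key by key
    totals = Counter()
    for features in feature_dicts:
        totals.update(features)
    # most_common(None) returns every feature, most frequent first
    ranked = totals.most_common(n if n > 0 else None)
    return pd.DataFrame(ranked, columns=["feature", "count"])

# made-up feature dictionaries in the same shape as the `features` column
toy_features = [
    {"VERB": 1, "PARENT_DEP_ROOT": 1},
    {"VERB": 3, "PARENT_DEP_ROOT": 2, "CHILD_DEP_XCOMP": 1},
]
agg = aggregate_features_counter(toy_features)
```

Under the same assumption it could be called as `aggregate_features_counter(most_incorrect['features'], 10)`. Features with equal counts may come back in a different order than in the `groupby` version.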

In [62]:
aggregate_features(most_incorrect)

Unnamed: 0,feature,count
0,VERB,29
1,PARENT_DEP_ROOT,23
2,PARENT_POS_VERB,15
3,CHILD_DEP_DOBJ,13
4,CHILD_POSTAG_IN,12
5,CHILD_POS_ADP,12
6,CHILD_POS_NOUN,12
7,PARENT_POS_NOUN,11
8,CHILD_DEP_PREP,11
9,PARENT_POSTAG_VB,10


In [63]:
aggregate_features(most_incorrect, 10)

Unnamed: 0,feature,count
0,VERB,29
1,PARENT_DEP_ROOT,23
2,PARENT_POS_VERB,15
3,CHILD_DEP_DOBJ,13
4,CHILD_POSTAG_IN,12
5,CHILD_POS_ADP,12
6,CHILD_POS_NOUN,12
7,PARENT_POS_NOUN,11
8,CHILD_DEP_PREP,11
9,PARENT_POSTAG_VB,10


In [64]:
aggregate_features(most_correct, 10)

Unnamed: 0,feature,count
0,VERB,49
1,PARENT_POS_VERB,37
2,CHILD_DEP_NSUBJ,35
3,CHILD_POS_NOUN,35
4,PARENT_DEP_ROOT,34
5,CHILD_POSTAG_NN,28
6,PARENT_POSTAG_VBZ,28
7,CHILD_POS_PUNCT,24
8,CHILD_DEP_PUNCT,24
9,CHILD_POS_ADJ,16
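
The two tables are hard to compare by eye because the raw counts come from sets of different sizes. One possible next step, sketched under assumptions: turn each table into shares of its own total and look at the difference. The helper `compare_feature_shares` and the toy tables are hypothetical, and a share here is a fraction of all feature occurrences, not a per-candidate rate.

```python
import pandas as pd

def compare_feature_shares(agg_correct, agg_incorrect):
    def share(agg):
        # fraction of all feature occurrences taken up by each feature
        counts = agg.set_index("feature")["count"]
        return counts / counts.sum()

    # outer-join the two share columns; a feature missing on one side gets 0
    out = pd.concat(
        {"correct": share(agg_correct), "incorrect": share(agg_incorrect)},
        axis=1,
    ).fillna(0.0)
    # positive diff = relatively more common among incorrect predictions
    out["diff"] = out["incorrect"] - out["correct"]
    return out.sort_values("diff", ascending=False)

# made-up aggregates in the shape returned by aggregate_features
toy_correct = pd.DataFrame({"feature": ["VERB", "CHILD_DEP_NSUBJ"], "count": [6, 4]})
toy_incorrect = pd.DataFrame({"feature": ["VERB", "CHILD_DEP_DOBJ"], "count": [5, 5]})
comparison = compare_feature_shares(toy_correct, toy_incorrect)
```

With the real data this would be something like `compare_feature_shares(aggregate_features(most_correct), aggregate_features(most_incorrect))`; features at the top of the result lean towards incorrect predictions, features at the bottom towards correct ones.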
