This notebook is the scratch paper I used while writing [`evaluate.py`](./evaluate.py). I used this whenever I needed an interactive console to try ideas and view results.

## Load and run `evaluate` module

In [1]:
from importlib import reload # recompile module and re-execute module-level code
import evaluate as ev; reload(ev)

<module 'evaluate' from 'D:\\This PC\\Documents\\Github\\nlp-sandbox\\enron\\evaluate.py'>

In [2]:
%run -i evaluate.py

Importing the model from model.v2.pkl
Loading test data
notes.cleaned.tsv
Featurizing test data
Evaluating

Accuracy: 0.2961992136304063

Confusion matrix:
Layout
[[tn   fp]
 [fn   tp]]

[[ 357 1072]
 [   2   95]]

Classification report:
             precision    recall  f1-score   support

        neg       0.99      0.25      0.40      1429
        pos       0.08      0.98      0.15        97

avg / total       0.94      0.30      0.38      1526


Accuracy (on candidates with 1+ feature): 0.6578073089700996

Confusion matrix:
[[357 204]
 [  2  39]]

Classification report:
             precision    recall  f1-score   support

        neg       0.99      0.64      0.78       561
        pos       0.16      0.95      0.27        41

avg / total       0.94      0.66      0.74       602



**Improvement #1**: the accuracy gap above (0.30 overall vs. 0.66 on candidates with 1+ feature) shows the model should learn to be more uncertain about candidates for which it has zero features.
   - This could mean not requiring a VERB in order for a candidate to have features. (Although it might be reasonable to scope this problem to tasks of this form first, nail it, then expand the definition.)
   - This could mean not including training data with 0 features, with the idea that this will force the model to be more uncertain when it comes across 0-feature candidates in test. But is this a bad bias to introduce?
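
To see how much the zero-feature candidates drag accuracy down, one can subtract the two confusion matrices printed above. This is only an arithmetic sketch, assuming the second matrix covers a strict subset of the first; the variable names are mine, not from `evaluate.py`.

```python
import numpy as np

# Confusion matrices copied from the output above, layout [[tn, fp], [fn, tp]]
cm_all = np.array([[357, 1072], [2, 95]])           # all 1526 candidates
cm_with_features = np.array([[357, 204], [2, 39]])  # candidates with 1+ feature

# Assumption: whatever is left over is the set of zero-feature candidates
cm_zero_features = cm_all - cm_with_features

def accuracy(cm):
    """Fraction of correct predictions: (tn + tp) / total."""
    return np.trace(cm) / cm.sum()

print(cm_zero_features)
print(accuracy(cm_all), accuracy(cm_with_features), accuracy(cm_zero_features))
```

Under that assumption every zero-feature candidate lands in the predicted-`pos` column (868 false positives, 56 true positives), so accuracy on that slice is about 0.06.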

**Improvement #2**: re-train model using spacy v2

I decided to upgrade spacy in the middle of building this notebook (to use the `displacy` visualizer package later on). Went from `1.9.0` to `2.0.7`. I didn't expect it to affect much until I re-trained my model. Wrong.

**Measure**|**v1.9.0**|**v2.0.7 (eval-only)**|**v2.0.7 (re-train)**
-----|-----|-----|-----
Accuracy|0.286|0.312|0.296
Accuracy on 1+ features|0.633|0.698|0.658
Confusion Matrix|[[ 356 1086] [ 4 80]]|[[ 393 1049] [ 1 83]]|[[ 357 1072] [   2   95]]
Confusion Matrix on 1+ features|[[356 216] [ 4 24]]|[[393 181] [ 1 27]]|[[357 204] [  2  39]]

_On the negative side, featurization is slower._

Additional options after upgrade:
- Add [new morphology info](https://spacy.io/api/annotation#pos-tagging) as features (VerbForm, Tense, Aspect)

**TODO**: address label imbalance

## Using: get_sample_predictions

In [3]:
most_incorrect = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_INCORRECT, n=20)
most_incorrect

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
1282,use the,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.963347,0.036653,0.963347
529,Visit,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_...",pos,0.960683,0.039317,0.960683
1091,"end, after court hearings on the issue, the Dr...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.955679,0.044321,0.955679
396,ENA Origination Support (Getting info from Tyc...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.949537,0.050463,0.949537
1121,late Tuesday accused 29 Enron officers and dir...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos,0.942597,0.057403,0.942597
1031,systematically fail.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.932591,0.067409,0.932591
926,wanted only to buy and sell conventional reins...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",pos,0.925261,0.074739,0.925261
167,"Managing Director, Enterprise Risk",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.902369,0.097631,0.902369
752,Please note your account 00235424-0015,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",pos,0.901583,0.098417,0.901583
196,Published: June 2000,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.889471,0.110529,0.889471


In [4]:
ev.print_full_df_column(most_incorrect, ['candidate_task', 'label'], True)

candidate_task	label
use the	neg
Visit	neg
end, after court hearings on the issue, the Drexel estate	neg
ENA Origination Support (Getting info from Tycholiz)--$$XXK	neg
late Tuesday accused 29 Enron officers and directors of	neg
systematically fail.	neg
wanted only to buy and sell conventional reinsurance, Mr. Sweeney	neg
Managing Director, Enterprise Risk	neg
Please note your account 00235424-0015	neg
Published: June 2000	neg
stock options contracts can be hedged with IBM stock, Mr. Lane	neg
Sink out of sight, and the water went over my head.	neg
and to build balanced portfolios.	neg
take over the job of administering the plan.	neg
Certificate Issued: 07/16/2001  96 FERC ? 61,078 (2001)	neg
If you are experiencing problems with today's issue, you may log in to the publisher's site to view the issue here <http://ftenergyusa.com/login_form.asp?FinalURL=/gasdaily/default.asp>. User Name: ENET3318 Password: enron	neg
Just days before Enron (news/quote) filed for bankruptcy	neg
If the fact

### Manual annotations

**candidate\_task**|**label**|**annotation**
:-----:|:-----:|:-----:
ENA Origination Support (Getting info from Tycholiz)--$$XXK|neg|[3(VBG)]
New York Times on the Web, please contact Alyson|neg|[1]
Racer at alyson@nytimes.com or visit our online media|neg|[1]
wanted only to buy and sell conventional reinsurance, Mr. Sweeney|neg|[3(VBD;VB;VB)]
visit http://www.adamsmark.com/resv/rescheck.html.|neg|[1]
freeze: bridge|neg|[1]
use the following link for enrononline: www.enrononline.com|neg|[1]
end, after court hearings on the issue, the Drexel estate|neg|[2]
seek to buy replacement coverage to offset depletions in their|neg|[1][2]
For general information about NYTimes.com, write to|neg|[1][2]
the market observations (as opposed to historical data).|neg|[3(VBN)]
execute the agreement and return it to me via fax no. (713) 646-3490.  I will|neg|[1]
bring these issues to the table, in a meeting involving all the concerned parties|neg|[1]
late Tuesday accused 29 Enron officers and directors of|neg|[2][3(VBD)]
John Elder, creator of a grass-roots investment program to develop|neg|[2][3(VB)]
Get FREE shipping on orders of $75 or more at Starbucks.com|neg|[1]
Managing Director, Enterprise Risk|neg|[3(VBG)]
use the|neg|[1][2]
2. Decision support system: a model recommending transactions to a trader without|neg|[3(VBG)]
convince the external parties (stock analysts, creditors,  credit rating agencies) about|neg|[1]

**Annotation Key**:
- [1] = Reasonable task
- [2] = Sentence fragment
- [3(\*)] = Overweighted VERB (VB\* forms)
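
A quick way to tally the annotation codes, rather than counting by eye. This is a sketch that assumes each code is the digit right after an opening bracket; the list is copied by hand from the table above.

```python
import re
from collections import Counter

# Annotation strings copied from the table above, in row order
annotations = [
    "[3(VBG)]", "[1]", "[1]", "[3(VBD;VB;VB)]", "[1]",
    "[1]", "[1]", "[2]", "[1][2]", "[1][2]",
    "[3(VBN)]", "[1]", "[1]", "[2][3(VBD)]", "[2][3(VB)]",
    "[1]", "[3(VBG)]", "[1][2]", "[3(VBG)]", "[1]",
]

# Pull out the digit that follows each "[" and count occurrences
counts = Counter(code for a in annotations for code in re.findall(r"\[(\d)", a))
print(counts)
```

On these 20 rows that gives 12 reasonable tasks, 6 sentence fragments, and 7 overweighted-VERB cases (a row can carry more than one code).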


([Markdown Table Generator](https://jakebathman.github.io/Markdown-Table-Generator/))

#### Improvements
- **#3** Add feature around the specific kind of VERB (present vs past, etc)
- **#4** Might be useful to classify the greater body of text (e.g., article, paragraph, ?) as something worth further parsing by line - to help avoid sentence fragmentation or pulling "reasonable tasks" out of irrelevant text.
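
Improvement #3 could start as small as counting the fine-grained Penn Treebank verb tags. A rough sketch, assuming the tags are already available as a list (like the ones `render_displacy` prints) and using a made-up `VERB_TAG_` feature prefix that the current featurizer does not have:

```python
from collections import Counter

def verb_tag_features(tags):
    """Count each fine-grained verb tag (VB, VBD, VBG, ...) as its own feature.

    The 'VERB_TAG_' prefix is hypothetical; it is not a feature name
    produced by the existing featurizer.
    """
    return dict(Counter("VERB_TAG_" + t for t in tags if t.startswith("VB")))

# Tags for "wanted only to buy and sell conventional reinsurance, Mr. Sweeney"
tags = ['VBD', 'RB', 'TO', 'VB', 'CC', 'VB', 'JJ', 'NN', ',', 'NNP', 'NNP']
features = verb_tag_features(tags)
print(features)
```

Separating `VBD` from `VB` like this might let the classifier weigh a past-tense verb differently from an imperative-looking one, which is what the `[3(*)]` annotations point at.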

In [5]:
ev.render_displacy(most_incorrect, ev.DisplacyStyles.dep)

['VB', 'DT']
-----



['VB']
-----



['VB', ',', 'IN', 'NN', 'NNS', 'IN', 'DT', 'NN', ',', 'DT', 'NNP', 'NN']
-----



['NNP', 'NNP', 'NNP', '-LRB-', 'VBG', 'NN', 'IN', 'NN']
-----



['JJ', 'NNP', 'VBD', 'CD', 'NNP', 'NNS', 'CC', 'NNS', 'IN']
-----



['RB', 'VB', '.']
-----



['VBD', 'RB', 'TO', 'VB', 'CC', 'VB', 'JJ', 'NN', ',', 'NNP', 'NNP']
-----



['VBG', 'NNP', ',', 'NN', 'NN']
-----



['UH', 'VB', 'PRP$', 'NN', 'CD', 'HYPH', 'CD']
-----



['VBN', ':', 'NNP', 'CD']
-----



['NN', 'NNS', 'NNS', 'MD', 'VB', 'VBN', 'IN', 'NNP', 'NN', ',', 'NNP', 'NNP']
-----



['VB', 'IN', 'IN', 'NN', ',', 'CC', 'DT', 'NN', 'VBD', 'IN', 'PRP$', 'NN', '.']
-----



['CC', 'TO', 'VB', 'JJ', 'NNS', '.']
-----



['VB', 'RP', 'DT', 'NN', 'IN', 'VBG', 'DT', 'NN', '.']
-----



['NNP', 'VBD', ':', 'CD', '', 'CD', 'NN', '.', 'CD', '-LRB-', 'CD', '-RRB-']
-----



['IN', 'PRP', 'VBP', 'VBG', 'NNS', 'IN', 'NN', 'POS', 'NN', ',', 'PRP', 'MD', 'VB', 'RP', 'IN', 'DT', 'NN', 'POS', 'NN', 'TO', 'VB', 'DT', 'NN', 'RB', 'XX', 'ADD', 'SYM', 'RB', 'SYM', 'NN', 'NNP', 'NNP', ':', 'NNP', 'NNP', ':', 'NN']
-----



['RB', 'NNS', 'IN', 'NNP', '-LRB-', 'NN', 'SYM', 'NN', '-RRB-', 'VBD', 'IN', 'NN']
-----



['IN', 'DT', 'NNS', 'VBP', 'RB', 'VB', 'DT', 'NN', ',', 'VB', 'DT', 'NNS', '.']
-----



['TO', 'VB', 'DT', 'NN', 'NN', 'TO', 'VB', 'IN', 'JJ', 'NN']
-----



['NNP', 'VBD', 'NNP', '``', 'NNP', "''", 'NNS', 'IN', 'NNP', 'NNP']
-----



In [6]:
most_incorrect_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_INCORRECT_POS, n=20)
most_incorrect_pos

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
380,maine impossible to get to...next option?,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",neg,0.330498,0.669502,0.669502
191,Start project with Elene for Mark,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.480333,0.519667,0.519667


In [7]:
most_correct = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT, n=20)
most_correct

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
1215,Call Grant about Tony,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.996769,0.003231,0.996769
762,Bring AER,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.995642,0.004358,0.995642
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315,0.991685
1255,Call Ambrosia ? Monday morning,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315,0.991685
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.013304,0.986696,0.986696
121,"35. ""Stewardesses"" is the longest word that is...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01595,0.98405,0.98405
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01672,0.98328,0.98328
88,building is an American flag.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.019714,0.980286,0.980286
990,Common sense is the collection of prejudices a...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022955,0.977045,0.977045
163,Check for ML,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.973474,0.026526,0.973474


In [8]:
ev.print_full_df_column(most_correct, ['candidate_task', 'label'], True)

candidate_task	label
Call Grant about Tony	pos
Bring AER	pos
visit http://www.adamsmark.com/resv/rescheck.html.	pos
Call Ambrosia ? Monday morning	pos
Another aspect of this problem is prioritization of different projects. In my view	neg
"35. ""Stewardesses"" is the longest word that is typed with only the left"	neg
5. A shark is the only fish that can blink with both eyes.	neg
building is an American flag.	neg
Common sense is the collection of prejudices acquired by age eighteen.	neg
Check for ML	pos
Like an eye between two white lids that will not shut.	neg
"cop and Ernie the taxi driver in Frank Capra's ""It's a Wonderful Life."""	neg
Send check to Lisa	pos
800   357  4410  ext  6422  Amy regardinding Telefonica	pos
The tulips are too excitable, it is winter here.	neg
"The release of atomic energy has not created a new problem. It has	neg
merely made more urgent the necessity of solving an existing one."	neg
There were reports of further attacks overnight on the Kajaki Dam area, whe

In [9]:
most_correct_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT_POS, n=20)
most_correct_pos

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
1215,Call Grant about Tony,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.996769,0.003231,0.996769
762,Bring AER,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.995642,0.004358,0.995642
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315,0.991685
1255,Call Ambrosia ? Monday morning,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315,0.991685
163,Check for ML,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.973474,0.026526,0.973474
696,Send check to Lisa,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.971138,0.028862,0.971138
459,800 357 4410 ext 6422 Amy regardinding T...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.96901,0.03099,0.96901
411,Get Kerri a gift.,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.964902,0.035098,0.964902
1328,freeze: bridge,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.964522,0.035478,0.964522
1016,"convince the external parties (stock analysts,...",<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.961552,0.038448,0.961552


In [10]:
most_uncertain = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_UNCERTAIN, n=20)
most_uncertain

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
965,"Currently, most of the trades are between secu...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.500492,0.499508,0.500492
573,Davis has taken heat for locking up too much e...,<ProbDist with 2 samples>,neg,"{'VERB': 6, 'FOLLOWING_POS_VERB': 3, 'FOLLOWIN...",neg,0.498548,0.501452,0.501452
614,"I am learning peacefulness, lying by myself qu...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",pos,0.501745,0.498255,0.501745
721,about Australian markets and was greatly impre...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.496814,0.503186,0.503186
1411,Incentive fee (in main contract) defined to on...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.504848,0.495152,0.504848
1317,trader I see an opportunity to sell soybeans a...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.505635,0.494365,0.505635
901,"The oldest of the insurance-related, exchange-...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",neg,0.49251,0.50749,0.50749
116,31. The microwave was invented after a researc...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.491751,0.508249,0.508249
904,"Initially, there was substantial interest in t...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg,0.491579,0.508421,0.508421
909,"so far this year, he said. Last year, there wa...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.512387,0.487613,0.512387


In [11]:
ev.print_full_df_column(most_uncertain, ['candidate_task', 'pred', 'label'], True)

candidate_task	pred	label
Currently, most of the trades are between securities dealers	pos	neg
Davis has taken heat for locking up too much electricity at high prices. Similarly, the Department of Water Resources has been criticized for amateurish purchasing practices.	neg	neg
I am learning peacefulness, lying by myself quietly	pos	neg
about Australian markets and was greatly impressed with the	neg	neg
Incentive fee (in main contract) defined to only include the sale of gas.	pos	neg
trader I see an opportunity to sell soybeans and buy corn. I am thinking	pos	neg
The oldest of the insurance-related, exchange-based derivative	neg	neg
31. The microwave was invented after a researcher walked by radar tube and a	neg	neg
Initially, there was substantial interest in the options, but the soft	neg	neg
so far this year, he said. Last year, there was increased interest in	pos	neg
"I know not with what weapons World War III will be fought, but World	neg
War IV will be fought with sticks and stones

In [12]:
ev.confusion_matrix(most_uncertain[['candidate_task', 'prob_dist', 'label', 'features']].itertuples(index=False))

array([[10,  9],
       [ 1,  0]])

#### Q: What TP candidates (actual tasks) didn't have features?

In [13]:
all_correct_pos = ev.get_sample_predictions(preds_df, sample_type=ev.SamplePredictions.MOST_CORRECT_POS)

In [14]:
len(all_correct_pos)

95

In [15]:
no_features_mask = all_correct_pos['features'] == {}
tp_no_features = all_correct_pos.iloc[np.where(no_features_mask)]
tp_no_features

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist,max_dist
984,Headcount,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
1253,RT,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
764,10:00 Sat,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
1485,Monday at 11:30 Tara,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
1252,"Thanks,",<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
755,3. Barbara Weather,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
754,2. Accom,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
753,"1, Li Xiao",<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
747,Article about the group,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029
746,McCormack,<ProbDist with 2 samples>,pos,{},pos,0.522029,0.477971,0.522029


**Improvement #5**: Manual re-labeling due to my "split by new line" decision is necessary for more accurate results. All sample datasets have instances of mis-labeled data.

In [16]:
ev.render_displacy(tp_no_features, [ev.DisplacyStyles.dep, ev.DisplacyStyles.ent])

['NNP']
-----



['NNP']
-----



['CD', '', 'NNP']
-----



['NNP', 'IN', 'CD', 'NN']
-----



['NNS', ',']
-----



['LS', '.', 'NNP', 'NNP']
-----



['LS', '.', 'ADD']
-----



['LS', ',', 'NNP', 'NNP']
-----



['NN', 'IN', 'DT', 'NN']
-----



['NN']
-----



['JJ', 'NN', '.']
-----



['NNP', 'NNS', 'IN', 'NNP']
-----



['NN', 'IN', '', 'NNP', 'CD', ',', 'CD', '.']
-----



['NN', 'IN', '', 'NNP', 'CD', ',', 'CD', '.']
-----



['NNP', 'POS', 'NNS']
-----



['NN', 'IN', 'NN']
-----



['NNS']
-----



['NN']
-----



['$', 'CD', '', 'IN', 'NNP', 'NNP']
-----



[':', 'NNP', ',', 'NNP', 'NNS']
-----



[':', 'NNP', ',', 'NNP', 'NNP']
-----



['HYPH', 'NNP', ',', 'NNP', 'NNP']
-----



['$', 'CD']
-----



['IN', 'NNP']
-----



['$', 'CD']
-----



['IN', 'NNP']
-----



['CD']
-----



['LS', '.', 'NNP', 'NNP']
-----



['LS', '.', 'NN']
-----



['NNP', 'NNP', 'NNP', 'NNP', 'CD', 'NN']
-----



['NN', 'NNS']
-----



['NNS', 'IN', 'DT', 'NN', '.']
-----



['NNP', ',']
-----



['NN']
-----



['NN', 'CC', 'NN', 'NN']
-----



['NNP', 'IN', 'NNP', '.']
-----



['NN', 'CD', 'NNP', 'NN', 'IN', 'NNP']
-----



['NN', 'NN']
-----



['LS', '.', 'NN', 'NN', 'NNP', 'NNP']
-----



['LS', '.', 'NNP', 'NNP']
-----



['LS', '.', 'NN']
-----



['NNP']
-----



['NNP']
-----



['NN', 'IN', 'NN']
-----



['NNP', 'NN']
-----



['LS', '.', 'NNP', '', 'IN', 'RB']
-----



['NNP']
-----



['NNP', 'NNP']
-----



['NNP', 'NNP']
-----



['NNP']
-----



['JJ', 'HYPH', 'NN']
-----



['NN']
-----



['NNP', '', 'CD', 'NN', '_SP', 'PRP', 'MD']
-----



['NN', 'IN', 'NNP']
-----



['CD', 'SYM', 'CD']
-----



['NNP', 'NNP']
-----



### Observations

- Model is bad at:
    - Incomplete sentences/fragments
    - Tasks without verbs (noun phrases)
    - Taking verb tense into account
- Model is good at:
    - Finding tasks in weird contexts
- Model is most uncertain about: 

## Writing: get_sample_predictions

Understanding code from [fast.ai 2018: lesson 1](https://github.com/fastai/fastai/blob/master/courses/dl1/lesson1.ipynb) for getting various types of sample predictions.

`def rand_by_mask(mask, n): return np.random.choice(np.where(mask)[0], n, replace=False)`

    np.random.choice(a, size=None, replace=True, p=None)

    np.where(condition[, x, y])

`def rand_by_correct(preds, is_correct): return rand_by_mask((preds.preds == preds.labels)==is_correct)`
    
    np.where((preds.preds == preds.labels)==is_correct)

Goals:
1. Get `preds` into a form so that `rand_by_mask` works on it
2. Get `preds` into a form so that we can find what is correct vs incorrect
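
To convince myself how the two numpy calls fit together, here is a toy run. The arrays are made up for illustration and are not model output.

```python
import numpy as np

def rand_by_mask(mask, n):
    # np.where(mask)[0] -> indices where mask is True;
    # np.random.choice then draws n of them without repeats
    return np.random.choice(np.where(mask)[0], n, replace=False)

# Toy predictions and labels (hypothetical values)
toy_preds = np.array(['pos', 'pos', 'neg', 'pos', 'neg', 'neg'])
toy_labels = np.array(['neg', 'pos', 'neg', 'neg', 'neg', 'pos'])

correct_mask = toy_preds == toy_labels  # True at positions 1, 2 and 4
picked = rand_by_mask(correct_mask, 2)
print(np.where(correct_mask)[0], picked)
```

Whatever two indices come back, they are always drawn from the positions where the mask is `True`.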

### What is `preds`?

In [17]:
type(preds)

list

In [18]:
type(preds[1]), preds[1]

(tuple,
 ('http://gasfundy.corp.enron.com/gas/framework/default.asp',
  <ProbDist with 2 samples>,
  'neg',
  {}))

where `tuple == (candidate_task, prediction_dist, label, features)`

In [19]:
[(p[0], p[1].max(), p[2], p[3]) for p in preds][1]

('http://gasfundy.corp.enron.com/gas/framework/default.asp', 'pos', 'neg', {})

In [20]:
type(preds[1][1]), preds[1][1]

(nltk.probability.DictionaryProbDist, <ProbDist with 2 samples>)

In [21]:
import nltk
??nltk.probability.DictionaryProbDist

In [22]:
preds[1][1].samples()

dict_keys(['neg', 'pos'])

In [23]:
preds[1][1].max()

'pos'

In [24]:
preds[1][1].prob('pos'), preds[1][1].prob('neg')

(0.52202910464797891, 0.47797089535202103)

### Form of `preds` that works with `rand_by_mask(mask, n)`?

For `np.random.choice`, the first argument must be 1-dimensional (a 1-D array-like) or an int.

In [25]:
import numpy as np
np_preds = np.array(preds)
np_preds.shape

(1526, 4)

`preds` as it stands will not work. Instead, we can feed in the length of `preds` and let `np.random.choice` select random indices of `preds`.
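
A small sketch of why the 2-D array is rejected and how sampling row indices gets around it. The array here is a toy stand-in for `np_preds`.

```python
import numpy as np

rows = np.arange(12).reshape(6, 2)  # toy 2-D array standing in for np_preds

# Passing the 2-D array directly is rejected by np.random.choice
try:
    np.random.choice(rows, 3)
    raised = False
except ValueError:
    raised = True

# Passing an int n samples from np.arange(n), i.e. from the row indices
idx = np.random.choice(len(rows), 3, replace=False)
sample = rows[idx]
print(raised, idx, sample.shape)
```

Note that `np.random.choice` draws with replacement unless `replace=False` is passed, so without it the same row can show up twice.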

In [26]:
n = 5 # number of results
mask = len(np_preds)

In [27]:
np_random = np_preds[np.random.choice(mask, n)]

In [28]:
np_random

array([['http://www.science-finance.fr/publications.html',
        <ProbDist with 2 samples>, 'neg', {}],
       ['They pass the way gulls pass inland in their white caps,',
        <ProbDist with 2 samples>, 'neg',
        {'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING_POSTAG_DT': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_VERB': 1, 'PARENT_POSTAG_VBP': 1, 'CHILD_DEP_NSUBJ': 2, 'CHILD_POS_PRON': 1, 'CHILD_POSTAG_PRP': 1, 'CHILD_DEP_DOBJ': 1, 'CHILD_POS_NOUN': 2, 'CHILD_POSTAG_NN': 1, 'CHILD_DEP_PUNCT': 1, 'CHILD_POS_PUNCT': 1, 'CHILD_POSTAG_,': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING_POSTAG_RB': 1, 'PARENT_DEP_DOBJ': 1, 'PARENT_POS_NOUN': 1, 'PARENT_POSTAG_NN': 1, 'CHILD_POSTAG_NNS': 1, 'CHILD_DEP_ADVMOD': 1, 'CHILD_POS_ADV': 1, 'CHILD_POSTAG_RB': 1, 'CHILD_DEP_PREP': 1, 'CHILD_POS_ADP': 1, 'CHILD_POSTAG_IN': 1}],
       ['Houston Chronicle Online', <ProbDist with 2 samples>, 'neg', {}],
       ['ID:         mcconnellmarkr', <ProbDist with 2 samples>, 'neg', {}],
       ['https://www.admissio

In [29]:
len(np_random)

5

### Form of `preds` that works with `rand_by_correct(preds, is_correct)`?

`preds` must be broken down into predictions and labels. So first, let's get the predictions.

Remember that each `tuple` in `preds` == `(candidate_task, prediction_dist, label, features)`

In [30]:
[p[1].max() for p in preds][:n]

['pos', 'pos', 'pos', 'pos', 'neg']

Labels can be retrieved similarly

In [31]:
[p[2] for p in preds][:n]

['neg', 'neg', 'neg', 'neg', 'neg']

`preds` might be more readable as a pandas DataFrame, so let's convert it.

In [32]:
import pandas as pd

preds_df = pd.DataFrame.from_records(preds, columns=['candidate_task', 'prob_dist', 'label', 'features'])
preds_df['pred'] = [p.max() for p in preds_df['prob_dist']]

In [33]:
preds_df[:n]

Unnamed: 0,candidate_task,prob_dist,label,features,pred
0,https://www4.rsweb.com/61045/,<ProbDist with 2 samples>,neg,{},pos
1,http://gasfundy.corp.enron.com/gas/framework/d...,<ProbDist with 2 samples>,neg,{},pos
2,212 836 5030,<ProbDist with 2 samples>,neg,{},pos
3,http://fundamentals.corp.enron.com/main.asp,<ProbDist with 2 samples>,neg,{},pos
4,adams 30 631,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",neg


In [34]:
[p.max() for p in preds_df['prob_dist']][:n]

['pos', 'pos', 'pos', 'pos', 'neg']

In [35]:
preds_df['label'][:n].values

array(['neg', 'neg', 'neg', 'neg', 'neg'], dtype=object)

Putting this all together, we can write our own:

#### `rand_by_correct(preds, n=5, is_correct=True)`

In [36]:
def rand_by_mask(mask, n):
    return np.random.choice(np.where(mask)[0], n, replace=False)

def rand_by_correct(preds_df, n = 5, is_correct = True):
    return preds_df.iloc[rand_by_mask(([p.max() for p in preds_df['prob_dist']] == preds_df['label'].values)==is_correct, n)]
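
Since `preds_df` already has a `pred` column, the mask could also be built from the two columns directly. A possible variant, not what `evaluate.py` does: the function name, the `seed` parameter, and the toy rows are all mine.

```python
import numpy as np
import pandas as pd

def rand_by_correct_simple(df, n=5, is_correct=True, seed=None):
    """Sample up to n rows whose prediction matches (or misses) the label."""
    rng = np.random.default_rng(seed)
    mask = (df['pred'] == df['label']).to_numpy() == is_correct
    idxs = np.flatnonzero(mask)
    # Cap n so that asking for more rows than exist does not raise
    chosen = rng.choice(idxs, size=min(n, len(idxs)), replace=False)
    return df.iloc[chosen]

# Toy rows, not real predictions
toy = pd.DataFrame({
    'candidate_task': ['task a', 'task b', 'task c', 'task d'],
    'pred':  ['pos', 'pos', 'neg', 'neg'],
    'label': ['pos', 'neg', 'pos', 'neg'],
})
correct = rand_by_correct_simple(toy, seed=0)
incorrect = rand_by_correct_simple(toy, is_correct=False, seed=0)
print(correct, incorrect, sep='\n')
```

The `min(n, len(idxs))` guard is a design choice: with `replace=False`, requesting more samples than there are matching rows would otherwise raise a `ValueError`.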

In [37]:
# correct
rand_by_correct(preds_df)

Unnamed: 0,candidate_task,prob_dist,label,features,pred
15,4) What happens to the Restricted Stock grant ...,<ProbDist with 2 samples>,neg,"{'VERB': 9, 'FOLLOWING_POS_ADP': 2, 'FOLLOWING...",neg
787,and locations in the Houston area. Choose the...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg
1064,who had dealings with the company.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg
1152,"our website,I would like you to chair one of t...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PRON': 1, 'FOLLOWIN...",neg
966,"themselves, but, eventually, the contracts wil...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg


In [38]:
# incorrect
rand_by_correct(preds_df, is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred
1111,"payments as necessary ""in order to protect and...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_CCONJ': 1, 'FOLLOWI...",pos
470,PaineWebber (713) 654 0371 Rocky Emery ...,<ProbDist with 2 samples>,neg,{},pos
606,http://www.evite.com/reminders?li=reg,<ProbDist with 2 samples>,neg,{},pos
798,"Greg,",<ProbDist with 2 samples>,neg,{},pos
191,Start project with Elene for Mark,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg


Cool! Adding more goals... this time around getting the 'most correct/incorrect'

`def most_by_mask(mask, mult):
    idxs = np.where(mask)[0]
    return idxs[np.argsort(mult * probs[idxs])[:4]]`
    
    numpy.argsort(a, axis=-1, kind='quicksort', order=None)

`def most_by_correct(y, is_correct): 
    mult = -1 if (y==1)==is_correct else 1
    return most_by_mask(((preds == data.val_y)==is_correct) & (data.val_y == y), mult)`
    
    mult: makes it so that if 'correct' == 'lowest probabilities', the argsort sees lower == more correct. else, higher == more correct

Goals:
1. Get `preds` into a form so that `most_by_mask` works on it
2. Get `preds` into a form so that we can find what is _most_ correct vs _most_ incorrect
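
The `mult` trick is easier to see on a toy array: `np.argsort` always sorts ascending, so multiplying by -1 flips which end comes first. The values are made up.

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.6, 0.1])  # toy probabilities, not model output

lowest_first = np.argsort(1 * probs)    # mult = 1: smallest probability first
highest_first = np.argsort(-1 * probs)  # mult = -1: largest probability first

print(lowest_first, highest_first)
```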

Unlike the `Cats vs Dogs` predictions, NLTK gives each candidate a probability distribution with a separate probability for every possible label.

Sorting is easy if we break up the values within the `prob_dist` column.

In [39]:
[p.prob('pos') for p in preds_df['prob_dist']][:n]

[0.52202910464797891,
 0.52202910464797891,
 0.52202910464797891,
 0.52202910464797891,
 0.33223657166405279]

In [40]:
[p.prob('neg') for p in preds_df['prob_dist']][:n]

[0.47797089535202103,
 0.47797089535202103,
 0.47797089535202103,
 0.47797089535202103,
 0.66776342833594737]

In [41]:
preds_df['pos_dist'] = [p.prob('pos') for p in preds_df['prob_dist']]
preds_df['neg_dist'] = [p.prob('neg') for p in preds_df['prob_dist']]

In [42]:
preds_df[:n]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
0,https://www4.rsweb.com/61045/,<ProbDist with 2 samples>,neg,{},pos,0.522029,0.477971
1,http://gasfundy.corp.enron.com/gas/framework/d...,<ProbDist with 2 samples>,neg,{},pos,0.522029,0.477971
2,212 836 5030,<ProbDist with 2 samples>,neg,{},pos,0.522029,0.477971
3,http://fundamentals.corp.enron.com/main.asp,<ProbDist with 2 samples>,neg,{},pos,0.522029,0.477971
4,adams 30 631,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",neg,0.332237,0.667763


In [43]:
preds_df.sort_values('pos_dist', ascending=False)[:5]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1215,Call Grant about Tony,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.996769,0.003231
762,Bring AER,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.995642,0.004358
1255,Call Ambrosia ? Monday morning,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315
163,Check for ML,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.973474,0.026526


In [44]:
preds_df.sort_values('neg_dist', ascending=False)[:5]

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.013304,0.986696
121,"35. ""Stewardesses"" is the longest word that is...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01595,0.98405
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01672,0.98328
88,building is an American flag.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.019714,0.980286
990,Common sense is the collection of prejudices a...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022955,0.977045


In [45]:
preds[1][1].samples()

dict_keys(['neg', 'pos'])

Drumroll...

#### most_by_correct(preds_df, label, n=5, is_correct=True)

In [46]:
def most_by_correct(preds_df, label, n=5, is_correct=True):
    label_options = list(preds_df['prob_dist'][0].samples())
    # note the 2-class assumption
    other_label = label_options[abs(label_options.index(label)-1)]
    # if is_correct, label dist is what matters; if incorrect, look at other_label dist
    dist_column = label + '_dist' if is_correct == True else other_label + '_dist'
    
    correct_mask = ([p.max() for p in preds_df['prob_dist']] == preds_df['label'].values)==is_correct
    target_label_mask = preds_df['label'].values == label
    
    return preds_df.iloc[np.where(correct_mask & target_label_mask)].sort_values(dist_column, ascending=False)[:n]
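
The same sort-by-column idea might extend to the "most uncertain" samples. A sketch, assuming `max_dist` is simply the larger of `pos_dist` and `neg_dist` (which matches the values in the tables earlier in this notebook, but is my guess at the definition); the rows are toy data.

```python
import pandas as pd

# Toy frame with only the columns needed here; pos_dist values are illustrative
df = pd.DataFrame({
    'candidate_task': ['task a', 'task b', 'task c', 'task d'],
    'pos_dist': [0.996769, 0.500492, 0.013304, 0.498548],
})
df['neg_dist'] = 1 - df['pos_dist']

# Assumption: max_dist is the probability of the winning label
df['max_dist'] = df[['pos_dist', 'neg_dist']].max(axis=1)

# The closer max_dist is to 0.5, the less sure the 2-class model is
most_uncertain = df.sort_values('max_dist', ascending=True).head(2)
print(most_uncertain)
```

Like `most_by_correct`, this leans on the 2-class assumption: with more labels, the floor for `max_dist` would be `1 / number_of_labels` rather than 0.5.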

In [47]:
most_by_correct(preds_df, 'pos')

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1215,Call Grant about Tony,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.996769,0.003231
762,Bring AER,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.995642,0.004358
33,visit http://www.adamsmark.com/resv/rescheck.h...,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315
1255,Call Ambrosia ? Monday morning,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI...",pos,0.991685,0.008315
163,Check for ML,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.973474,0.026526


In [48]:
most_by_correct(preds_df, 'neg')

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1045,Another aspect of this problem is prioritizati...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.013304,0.986696
121,"35. ""Stewardesses"" is the longest word that is...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01595,0.98405
83,5. A shark is the only fish that can blink wit...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.01672,0.98328
88,building is an American flag.,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.019714,0.980286
990,Common sense is the collection of prejudices a...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",neg,0.022955,0.977045


In [49]:
most_by_correct(preds_df, 'pos', is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
380,maine impossible to get to...next option?,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",neg,0.330498,0.669502
191,Start project with Elene for Mark,<ProbDist with 2 samples>,pos,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",neg,0.480333,0.519667


In [50]:
most_by_correct(preds_df, 'neg', is_correct=False)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
1282,use the,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.963347,0.036653
529,Visit,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_...",pos,0.960683,0.039317
1091,"end, after court hearings on the issue, the Dr...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.955679,0.044321
396,ENA Origination Support (Getting info from Tyc...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN...",pos,0.949537,0.050463
1121,late Tuesday accused 29 Enron officers and dir...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING...",pos,0.942597,0.057403


Final goal for `get_sample_predictions`: surface the predictions the model is most uncertain about.

`most_uncertain = np.argsort(np.abs(probs -0.5))[:4]`

This sorts the candidates by how close their probability is to 0.5 (on either side) and keeps the closest ones.
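A toy run of that one-liner, as a sketch: the `probs` values below are made up for illustration and are not taken from the model.

```python
import numpy as np

# Hypothetical 'pos' probabilities for six candidates (made-up values).
probs = np.array([0.97, 0.52, 0.10, 0.49, 0.75, 0.33])

# Distance from 0.5: the smaller it is, the less sure the classifier was.
distance = np.abs(probs - 0.5)

# argsort returns positions ordered by ascending distance; keep the first 4.
most_uncertain_idx = np.argsort(distance)[:4]

print(most_uncertain_idx)         # positions of the 4 least certain candidates
print(probs[most_uncertain_idx])  # their probabilities
```

Note that `argsort` returns positions, not index labels, which is why the function version below pairs it with `.iloc`.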

#### most_uncertain(preds_df, label, n=5)

In [51]:
def most_uncertain(preds_df, label, n=5):
    dist_column = label + '_dist'
    # argsort gives row positions ordered by distance from 0.5, hence .iloc
    order = np.argsort(np.abs(preds_df[dist_column].values - 0.5))
    return preds_df.iloc[order][:n]

In [52]:
most_uncertain(preds_df, 'pos', 10)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
965,"Currently, most of the trades are between secu...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.500492,0.499508
573,Davis has taken heat for locking up too much e...,<ProbDist with 2 samples>,neg,"{'VERB': 6, 'FOLLOWING_POS_VERB': 3, 'FOLLOWIN...",neg,0.498548,0.501452
614,"I am learning peacefulness, lying by myself qu...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",pos,0.501745,0.498255
721,about Australian markets and was greatly impre...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.496814,0.503186
1411,Incentive fee (in main contract) defined to on...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.504848,0.495152
1317,trader I see an opportunity to sell soybeans a...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.505635,0.494365
901,"The oldest of the insurance-related, exchange-...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",neg,0.49251,0.50749
116,31. The microwave was invented after a researc...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.491751,0.508249
904,"Initially, there was substantial interest in t...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg,0.491579,0.508421
909,"so far this year, he said. Last year, there wa...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.512387,0.487613


In [53]:
most_uncertain(preds_df, 'neg', 10)

Unnamed: 0,candidate_task,prob_dist,label,features,pred,pos_dist,neg_dist
965,"Currently, most of the trades are between secu...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.500492,0.499508
573,Davis has taken heat for locking up too much e...,<ProbDist with 2 samples>,neg,"{'VERB': 6, 'FOLLOWING_POS_VERB': 3, 'FOLLOWIN...",neg,0.498548,0.501452
614,"I am learning peacefulness, lying by myself qu...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",pos,0.501745,0.498255
721,about Australian markets and was greatly impre...,<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING...",neg,0.496814,0.503186
1411,Incentive fee (in main contract) defined to on...,<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_ADP': 1, 'FOLLOWING...",pos,0.504848,0.495152
1317,trader I see an opportunity to sell soybeans a...,<ProbDist with 2 samples>,neg,"{'VERB': 5, 'FOLLOWING_POS_DET': 1, 'FOLLOWING...",pos,0.505635,0.494365
901,"The oldest of the insurance-related, exchange-...",<ProbDist with 2 samples>,neg,"{'VERB': 2, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",neg,0.49251,0.50749
116,31. The microwave was invented after a researc...,<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_VERB': 1, 'FOLLOWIN...",neg,0.491751,0.508249
904,"Initially, there was substantial interest in t...",<ProbDist with 2 samples>,neg,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING...",neg,0.491579,0.508421
909,"so far this year, he said. Last year, there wa...",<ProbDist with 2 samples>,neg,"{'VERB': 3, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI...",pos,0.512387,0.487613


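Turns out, both `label`s return the same list. That makes sense with two classes: `neg_dist` is `1 - pos_dist`, so both columns are the same distance from 0.5 and sort identically.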
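A quick check of why the ordering does not depend on `label`, as a sketch. The four `pos_dist` values are copied from the tables above. The `toy` frame is only for illustration.

```python
import numpy as np
import pandas as pd

# pos_dist values copied from rows 965, 573, 1215 and 1045 above.
toy = pd.DataFrame({'pos_dist': [0.500492, 0.498548, 0.996769, 0.013304]})
# With two classes the probabilities sum to 1.
toy['neg_dist'] = 1 - toy['pos_dist']

# Rank rows by distance from 0.5 using each column in turn.
pos_order = np.argsort(np.abs(toy['pos_dist'].values - 0.5))
neg_order = np.argsort(np.abs(toy['neg_dist'].values - 0.5))

print(pos_order, neg_order)  # both orderings are the same
```

So the `label` argument of `most_uncertain` is redundant in the 2-class case. It would only matter with three or more labels.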

## Writing: aggregate_features

How to correlate features with (in)correctness.

Example set of feature dictionaries...

In [54]:
most_incorrect[['features']]

Unnamed: 0,features
1282,"{'VERB': 1, 'FOLLOWING_POS_DET': 1, 'FOLLOWING..."
529,"{'VERB': 1, 'PARENT_DEP_ROOT': 1, 'PARENT_POS_..."
1091,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI..."
396,"{'VERB': 1, 'FOLLOWING_POS_NOUN': 1, 'FOLLOWIN..."
1121,"{'VERB': 1, 'FOLLOWING_POS_NUM': 1, 'FOLLOWING..."
1031,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI..."
926,"{'VERB': 3, 'FOLLOWING_POS_ADV': 1, 'FOLLOWING..."
167,"{'VERB': 1, 'FOLLOWING_POS_PROPN': 1, 'FOLLOWI..."
752,"{'VERB': 1, 'FOLLOWING_POS_ADJ': 1, 'FOLLOWING..."
196,"{'VERB': 1, 'FOLLOWING_POS_PUNCT': 1, 'FOLLOWI..."


Convert one feature dictionary to dataframe...

In [55]:
most_incorrect[['features']].iloc[1].values[0]

{'PARENT_DEP_ROOT': 1, 'PARENT_POSTAG_VB': 1, 'PARENT_POS_VERB': 1, 'VERB': 1}

In [56]:
f_dict = pd.DataFrame.from_dict(most_incorrect[['features']].iloc[1].values[0], orient='index').reset_index()
f_dict

Unnamed: 0,index,0
0,VERB,1
1,PARENT_DEP_ROOT,1
2,PARENT_POS_VERB,1
3,PARENT_POSTAG_VB,1


In [57]:
f_dict.columns = ['feature', 'count']
f_dict

Unnamed: 0,feature,count
0,VERB,1
1,PARENT_DEP_ROOT,1
2,PARENT_POS_VERB,1
3,PARENT_POSTAG_VB,1


Aggregating two feature dataframes into one by feature occurrence count...

In [58]:
pd.concat([f_dict, f_dict]).groupby('feature').sum().reset_index()

Unnamed: 0,feature,count
0,PARENT_DEP_ROOT,2
1,PARENT_POSTAG_VB,2
2,PARENT_POS_VERB,2
3,VERB,2


Putting it all together...

In [59]:
f_dicts = []
for i in range(len(most_incorrect)):
    f_dict = pd.DataFrame.from_dict(most_incorrect[['features']].iloc[i].values[0], orient='index').reset_index()
    f_dict.columns = ['feature', 'count']
    f_dicts.append(f_dict)
f_dicts

[               feature  count
 0                 VERB      1
 1    FOLLOWING_POS_DET      1
 2  FOLLOWING_POSTAG_DT      1
 3      PARENT_DEP_ROOT      1
 4      PARENT_POS_VERB      1
 5     PARENT_POSTAG_VB      1
 6       CHILD_DEP_DOBJ      1
 7        CHILD_POS_DET      1
 8      CHILD_POSTAG_DT      1,             feature  count
 0              VERB      1
 1   PARENT_DEP_ROOT      1
 2   PARENT_POS_VERB      1
 3  PARENT_POSTAG_VB      1,                 feature  count
 0                  VERB      1
 1   FOLLOWING_POS_PUNCT      1
 2    FOLLOWING_POSTAG_,      1
 3       PARENT_DEP_ROOT      1
 4       PARENT_POS_VERB      1
 5      PARENT_POSTAG_VB      1
 6       CHILD_DEP_PUNCT      1
 7       CHILD_POS_PUNCT      1
 8        CHILD_POSTAG_,      1
 9        CHILD_DEP_PREP      1
 10        CHILD_POS_ADP      1
 11      CHILD_POSTAG_IN      1
 12       CHILD_DEP_CONJ      1
 13       CHILD_POS_NOUN      1
 14      CHILD_POSTAG_NN      1,                feature  count
 0     

In [60]:
pd.concat(f_dicts).groupby('feature').sum().sort_values('count', ascending=False).reset_index()

Unnamed: 0,feature,count
0,VERB,31
1,PARENT_POS_VERB,28
2,PARENT_DEP_ROOT,27
3,CHILD_POS_NOUN,19
4,PARENT_POSTAG_VB,18
5,CHILD_DEP_DOBJ,14
6,CHILD_POS_PUNCT,13
7,CHILD_DEP_PUNCT,13
8,CHILD_POS_VERB,12
9,CHILD_POS_ADP,12


#### aggregate_features(preds_df, n=0)

In [61]:
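def aggregate_features(preds_df, n=0):
    # Build one (feature, count) dataframe per row of preds_df
    f_dicts = []
    for i in range(len(preds_df)):  # iterate over the frame passed in, not the global most_incorrect
        f_dict = pd.DataFrame.from_dict(preds_df[['features']].iloc[i].values[0], orient='index').reset_index()
        f_dict.columns = ['feature', 'count']
        f_dicts.append(f_dict)
    
    # Sum counts per feature across all rows, most frequent first
    agg = pd.concat(f_dicts).groupby('feature').sum().sort_values('count', ascending=False).reset_index()
    
    # n <= 0 means "return everything"
    if n <= 0:
        return agg
    else:
        return agg[:n]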
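The same aggregation can also be sketched without building one dataframe per row, by summing the feature dictionaries with `collections.Counter`. This is an alternative for comparison, not what `evaluate.py` does. The name `aggregate_features_counter` and the `toy` frame are made up for illustration.

```python
from collections import Counter

import pandas as pd


def aggregate_features_counter(preds_df, n=0):
    """Sum per-row feature counts; hypothetical alternative to aggregate_features."""
    totals = Counter()
    for features in preds_df['features']:
        totals.update(features)  # adds each feature's count to the running total
    # most_common(None) returns every feature, most frequent first
    ranked = totals.most_common(n if n > 0 else None)
    return pd.DataFrame(ranked, columns=['feature', 'count'])


# Toy input with the same shape as the 'features' column (made-up rows).
toy = pd.DataFrame({'features': [
    {'VERB': 1, 'PARENT_DEP_ROOT': 1},
    {'VERB': 2, 'CHILD_DEP_DOBJ': 1},
]})
result = aggregate_features_counter(toy)
print(result)
```

Features with equal counts may come out in a different order than with `groupby(...).sort_values(...)`, since `Counter.most_common` keeps first-seen order for ties.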

In [62]:
aggregate_features(most_incorrect)

Unnamed: 0,feature,count
0,VERB,31
1,PARENT_POS_VERB,28
2,PARENT_DEP_ROOT,27
3,CHILD_POS_NOUN,19
4,PARENT_POSTAG_VB,18
5,CHILD_DEP_DOBJ,14
6,CHILD_POS_PUNCT,13
7,CHILD_DEP_PUNCT,13
8,CHILD_POS_VERB,12
9,CHILD_POS_ADP,12


In [63]:
aggregate_features(most_incorrect, 10)

Unnamed: 0,feature,count
0,VERB,31
1,PARENT_POS_VERB,28
2,PARENT_DEP_ROOT,27
3,CHILD_POS_NOUN,19
4,PARENT_POSTAG_VB,18
5,CHILD_DEP_DOBJ,14
6,CHILD_POS_PUNCT,13
7,CHILD_DEP_PUNCT,13
8,CHILD_POS_VERB,12
9,CHILD_POS_ADP,12


In [64]:
aggregate_features(most_correct, 10)

Unnamed: 0,feature,count
0,VERB,35
1,PARENT_POS_VERB,27
2,CHILD_POS_NOUN,26
3,PARENT_DEP_ROOT,25
4,CHILD_POSTAG_NN,23
5,CHILD_DEP_PUNCT,18
6,CHILD_POS_PUNCT,18
7,CHILD_DEP_NSUBJ,15
8,CHILD_DEP_DOBJ,14
9,CHILD_POSTAG_.,13
