# First End-to-End Pipeline for Homework 1

In this notebook, we show our first end-to-end pipeline for homework 1. We take the 200 input articles from the training set and build a pipeline that uses machine learning to extract person names from the articles. We then use the results to debug and improve the pipeline for the next run.

In [1]:
import sys
sys.path.append('/u/p/m/pmartinkus/Documents/CS_838')

import HW1 as hw
from sklearn import tree, svm
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

## Generate Examples
First we need to generate all the possible examples. We use the function gen_examples() found in the ExampleGen.py file. Each example consists of a candidate name, its position within the article, the article number, the previous word in the document, the following word, whether or not the last word in the candidate is possessive (defined as ending with an 'apostrophe s'), and the label (whether or not the candidate is a true name). We have limited the maximum length of a candidate to four words. The examples dataframe contains all the generated examples for the training set (200 pseudo-randomly chosen articles), and the test dataframe contains all the generated examples for the test set, to be used later.

In [2]:
examples, test = hw.gen_examples(4)
examples.head()

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Label
0,There,0,0,,are,False,0
1,There are,0,0,,perhaps,False,0
2,There are perhaps,0,0,,few,False,0
3,There are perhaps few,0,0,,women,False,0
4,are,1,0,There,perhaps,False,0
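
The candidate generation described above can be sketched in plain Python. This is only an illustration of the sliding-window idea, not the code in ExampleGen.py: the function name `gen_candidates`, the whitespace tokenization, and the possessive check are assumptions, and the Label column is omitted because it depends on the annotated articles.

```python
def gen_candidates(text, article_id, max_words=4):
    """Return every run of 1..max_words consecutive tokens as a candidate."""
    tokens = text.split()  # assumption: tokens are whitespace-separated
    rows = []
    for start in range(len(tokens)):
        for n in range(1, max_words + 1):
            end = start + n
            if end > len(tokens):
                break
            # Ignore trailing punctuation when checking for "'s".
            last_word = tokens[end - 1].rstrip('.,;:!?"')
            rows.append({
                'String': ' '.join(tokens[start:end]),
                'Position': start,
                'Article': article_id,
                'Previous': tokens[start - 1] if start > 0 else None,
                'Following': tokens[end] if end < len(tokens) else None,
                'Possessive': last_word.endswith("'s"),
            })
    return rows


rows = gen_candidates("There are perhaps few women", article_id=0)
print([row['String'] for row in rows[:5]])
```

For the five-word sentence above, the first candidates are "There", "There are", "There are perhaps", "There are perhaps few", and then "are", which mirrors the order of the rows in the table.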


## Pruning Step
Now that we have generated all possible candidate names (with a maximum of four words), we need to remove candidates that are obviously not actual names.

We can see below that we have generated 180,761 examples, but there are only 1,216 actual names in the documents. With only about 0.7% of the candidates representing positive examples, the learning models will have a difficult time learning anything useful.

In [3]:
print('Number of candidates generated: ' + str(len(examples)))
print('Number of actual names: ' + str(len(examples[examples['Label'] == 1])))

Number of candidates generated: 180761
Number of actual names: 1216


To solve this problem we will use the function block_table() found in the Blocking.py file. This function removes candidates that we have deemed to be obviously false.

The current blocking scheme removes any tuple where any word in the candidate name is not capitalized, and any tuple where there is punctuation within the candidate name. (Note: punctuation before or after the name does not count; we only look for punctuation within the name.)

After blocking, we have reduced the number of examples to only 11,659 and roughly 10% of the candidates are actual names.

In [4]:
blocker = hw.Blocking()
C = blocker.block_table(examples)
C.head()

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Label
0,There,0,0,,are,False,0
1,Monica,10,0,than,Lewinsky,False,0
2,Monica Lewinsky,10,0,than,and,False,1
3,Lewinsky,11,0,Monica,and,False,0
4,Mona,13,0,and,Charen.,False,0


In [5]:
print('Number of candidates after pruning: ' + str(len(C)))

Number of candidates after pruning: 11659
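
The blocking rule can be written as a small predicate. The sketch below is an approximation built from the description of the scheme, not the code in Blocking.py. The name `keep_candidate` is hypothetical, and the handling of a trailing possessive is an assumption. The real blocker may also treat some characters, such as periods, differently.

```python
import string


def keep_candidate(candidate):
    """Return True if the candidate survives blocking, False if it is pruned."""
    # Punctuation before or after the name does not count, so strip it first.
    core = candidate.strip(string.punctuation)
    # Assumption: a trailing possessive "'s" is not counted as inner punctuation.
    if core.endswith("'s"):
        core = core[:-2]
    # Any punctuation left at this point sits inside the name.
    if any(ch in string.punctuation for ch in core):
        return False
    words = core.split()
    # Every word of the candidate must start with a capital letter.
    return bool(words) and all(word[0].isupper() for word in words)


for name in ['Monica Lewinsky', 'There are', 'Charen.', 'Moon Jae-in']:
    print(name, keep_candidate(name))
```

Under these assumptions "Monica Lewinsky" and "Charen." are kept, while "There are" (uncapitalized word) and "Moon Jae-in" (inner hyphen) are pruned.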


Next we want to debug our pruning scheme to check how many actual names it removed. Below, we can see that the blocking scheme removed 8 of the 1,216 names. We believe it is acceptable for the blocker to lose only 0.66% of the names.

In [6]:
blocker.debug(examples)

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Label
0,Guillermo del Toro,481,38,business.,for,False,1
1,Molnar-Banffy,117,44,chairman,Kata,False,1
2,Moon Jae-in,157,62,President,attempts,True,1
3,Moon Jae-in,423,62,President,said,False,1
4,"Martin Luther King, Jr.",200,70,cited,teachings,True,1
5,Bashar al Assad,156,115,President,regime,True,1
6,Mansour Al-Otaibi,55,261,President,said,False,1
7,Alex van der Zwaan,73,274,report,pleaded,False,1
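
A check like this can be expressed generically: collect the positively labeled examples that a blocking predicate rejects, and report what fraction of all positives they make up. The helper names below are hypothetical, and the predicate passed in is a stand-in, not the logic of blocker.debug().

```python
def lost_positives(examples, keep):
    """Return the true names (Label == 1) that the predicate `keep` would remove."""
    return [ex for ex in examples if ex['Label'] == 1 and not keep(ex['String'])]


def fraction_lost(examples, keep):
    """Share of all true names that blocking removes (0.0 if there are none)."""
    positives = [ex for ex in examples if ex['Label'] == 1]
    if not positives:
        return 0.0
    return len(lost_positives(examples, keep)) / len(positives)


# Toy data and a toy predicate: keep only candidates without lowercase words.
toy = [
    {'String': 'Monica Lewinsky', 'Label': 1},
    {'String': 'Guillermo del Toro', 'Label': 1},
    {'String': 'There', 'Label': 0},
]
all_caps = lambda s: all(word[0].isupper() for word in s.split())
print(lost_positives(toy, all_caps), fraction_lost(toy, all_caps))
```

With the counts reported above, the same ratio is 8 / 1216, or roughly 0.0066.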


## Generate the Feature Vectors

Next, we can create the feature vectors. We have created several features to try to identify real names, including whether the name ends in a possessive 'apostrophe s', whether the previous word is capitalized, the length of the candidate name, etc.

We have also created four lists of words:
1. Prev_Whitelist: A list of words that commonly appear before actual names
2. Prev_Blacklist: A list of words that rarely appear before actual names
3. Follow_Whitelist: A list of words that commonly appear after actual names
4. Follow_Blacklist: A list of words that rarely appear after actual names
    
We then created a feature for each list indicating whether the previous/following word is in that list. To generate the lists, we used the gen_rule_data() function. It creates tables of how often each word appears before and after actual names and incorrect candidates.

In [7]:
prev, following = hw.gen_rule_data(C, 4)

In [8]:
prev.sort_values('Correct_Count', ascending=False).head()

Unnamed: 0,Word,Count,Correct_Count,False_Count
39,president,140,70,73
155,said,107,34,76
126,that,82,33,52
105,sen.,70,32,38
19,the,1196,31,1168
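
The frequency table above can be approximated with a pair of counters. The sketch below is a guess at the general approach rather than the implementation of gen_rule_data(): lowercasing is inferred from the table, and `split_lists` with its thresholds (`min_count`, `white_ratio`, `black_ratio`) is a hypothetical way to turn the counts into a whitelist and a blacklist.

```python
from collections import Counter


def context_counts(candidates, column):
    """Count how often each word in `column` accompanies true and false candidates."""
    total, correct = Counter(), Counter()
    for cand in candidates:
        word = (cand.get(column) or '').lower()
        if not word:
            continue  # no neighbouring word, e.g. at the start of an article
        total[word] += 1
        correct[word] += cand['Label']  # Label is 1 for a true name, else 0
    rows = [
        {'Word': w, 'Count': n, 'Correct_Count': correct[w], 'False_Count': n - correct[w]}
        for w, n in total.items()
    ]
    return sorted(rows, key=lambda r: r['Correct_Count'], reverse=True)


def split_lists(rows, min_count=2, white_ratio=0.5, black_ratio=0.1):
    """Hypothetical thresholds: frequent words with a high or low share of true names."""
    eligible = [r for r in rows if r['Count'] >= min_count]
    whitelist = {r['Word'] for r in eligible if r['Correct_Count'] / r['Count'] >= white_ratio}
    blacklist = {r['Word'] for r in eligible if r['Correct_Count'] / r['Count'] <= black_ratio}
    return whitelist, blacklist


toy = [
    {'Previous': 'President', 'Label': 1},
    {'Previous': 'president', 'Label': 1},
    {'Previous': 'the', 'Label': 0},
    {'Previous': 'the', 'Label': 0},
    {'Previous': None, 'Label': 0},
]
print(split_lists(context_counts(toy, 'Previous')))
```

Ranking by the ratio of Correct_Count to Count rather than by Correct_Count alone keeps very common words such as "the" off the whitelist even though they precede many names in absolute terms.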


In [9]:
features = hw.FeatureGen()
feature_vecs = features.gen_features(C)
feature_vecs.head()

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Prev_Caps,Prev_Whitelist,Prev_Blacklist,Follow_Whitelist,Follow_Blacklist,Length,Avg_Length,Prev_Length,Follow_Length,Num_Words,Label
0,There,0,0,,are,False,False,False,False,False,False,5,5.0,0,3,1,0
1,Monica,10,0,than,Lewinsky,False,False,False,False,False,False,6,6.0,4,8,1,0
2,Monica Lewinsky,10,0,than,and,False,False,False,False,False,False,15,7.0,4,3,2,1
3,Lewinsky,11,0,Monica,and,False,True,False,False,False,False,8,8.0,6,3,1,0
4,Mona,13,0,and,Charen.,False,False,False,False,False,False,4,4.0,3,7,1,0


In [10]:
len(feature_vecs)

11659

## Cross Validation of ML Models

Once we created the features, we were able to use our five ML models: Decision Tree, Random Forest, SVM, Logistic Regression, and Linear Regression. The first four models are classifiers from scikit-learn. Since the linear regression model from scikit-learn is not a classifier, we created a wrapper that sets a threshold and classifies tuples based on whether the predicted number is greater than the threshold or less than or equal to it.
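
A thresholding wrapper of this kind can be very small. The class below is a sketch, not the wrapper in HW1: the name `ThresholdClassifier` and the default threshold of 0.5 are assumptions. It accepts any object with `fit` and `predict` methods, so it has no dependency on a particular library.

```python
class ThresholdClassifier:
    """Turn a regressor with fit/predict into a binary classifier."""

    def __init__(self, regressor, threshold=0.5):
        self.regressor = regressor
        self.threshold = threshold  # assumed default; the source does not state a value

    def fit(self, X, y):
        self.regressor.fit(X, y)
        return self

    def predict(self, X):
        # Scores greater than the threshold become 1; scores less than or
        # equal to the threshold become 0.
        return [1 if score > self.threshold else 0 for score in self.regressor.predict(X)]
```

With scikit-learn installed, `ThresholdClassifier(LinearRegression())` should behave the same way, since `LinearRegression` exposes `fit` and `predict`. Raising the threshold would be expected to trade recall for precision.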

In [11]:
hw.cross_val(feature_vecs, 'precision')

Unnamed: 0,Classifier,Run 1,Run 2,Run 3,Run 4,Run 5,Run 6,Run 7,Run 8,Run 9,Run 10,Average
0,Decision Tree,0.6,0.547945,0.483146,0.510417,0.538462,0.534091,0.566667,0.583333,0.474227,0.433962,0.527225
1,Random Forest,0.689189,0.588235,0.494118,0.62766,0.64557,0.590909,0.609195,0.609756,0.602564,0.564706,0.60219
2,SVM,0.897436,0.818182,0.862069,0.894737,0.815789,0.923077,0.857143,0.861111,0.8125,0.763158,0.85052
3,Logistic Regression,0.807692,0.642857,0.714286,0.609756,0.566667,0.678571,0.657143,0.636364,0.538462,0.666667,0.651846
4,Linear Regression,0.754717,0.641509,0.630435,0.608696,0.594203,0.689655,0.57377,0.64,0.571429,0.647059,0.635147


In [12]:
hw.cross_val(feature_vecs, 'recall')

Unnamed: 0,Classifier,Run 1,Run 2,Run 3,Run 4,Run 5,Run 6,Run 7,Run 8,Run 9,Run 10,Average
0,Decision Tree,0.348837,0.375,0.344262,0.348148,0.398305,0.429907,0.490566,0.41129,0.348148,0.383333,0.38778
1,Random Forest,0.387597,0.428571,0.344262,0.422222,0.440678,0.457944,0.471698,0.354839,0.296296,0.4,0.400411
2,SVM,0.271318,0.241071,0.204918,0.251852,0.262712,0.336449,0.283019,0.25,0.192593,0.241667,0.25356
3,Logistic Regression,0.162791,0.160714,0.122951,0.185185,0.144068,0.17757,0.216981,0.112903,0.103704,0.15,0.153687
4,Linear Regression,0.310078,0.303571,0.237705,0.311111,0.347458,0.373832,0.330189,0.258065,0.237037,0.275,0.298404


We can see that the SVM model has the highest precision (0.85 on average), but it suffers from low recall (0.25). Since the Random Forest model has reasonably high precision (0.60) and the highest average recall (0.40), we will choose this model for debugging purposes.

## Debugging the Random Forest

We will now split our training data into a new training set and a test set. We train on the new training set and test the model on the new test set. With the output, we can analyze the false positives and false negatives to better understand the limitations of this model.

In [13]:
train_data, test_data, train_labels, test_labels = hw.split_train_test(feature_vecs, 0.5)

features = ['Possessive', 'Prev_Caps', 'Prev_Whitelist', 'Prev_Blacklist', 'Follow_Whitelist',
                'Follow_Blacklist', 'Length', 'Avg_Length', 'Prev_Length', 'Follow_Length',
                'Num_Words']

print('Length of training set: ' + str(len(train_data)))
print('Length of test set: ' + str(len(test_data)))

Length of training set: 5829
Length of test set: 5830


In [14]:
rf = RandomForestClassifier()
rf.fit(train_data[features], train_labels)
predictions = rf.predict(test_data[features])
results = test_data.copy()
results['Label'] = test_labels
results['Prediction'] = predictions
precision, recall, f1 = hw.evaluate(results)
print('Precision: ' + str(precision))
print('Recall: ' + str(recall))
print('F1: ' + str(f1))

Precision: 0.5891089108910891
Recall: 0.38636363636363635
F1: 0.4666666666666667
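
For reference, the three metrics follow directly from the counts of true positives, false positives, and false negatives. The function below is a stand-alone sketch and not hw.evaluate(), which takes the results dataframe; the name `evaluate_lists` and the choice to return 0.0 for an empty denominator are assumptions.

```python
def evaluate_lists(labels, predictions):
    """Compute precision, recall, and F1 for binary labels (1 = actual name)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # names correctly found
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # non-names marked as names
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # names that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(evaluate_lists([1, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 0]))
```

In the toy call there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.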


In [15]:
# False Positives
fp = results[results['Label'] == 0]
fp[fp['Prediction'] == 1].iloc[0:5]

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Prev_Caps,Prev_Whitelist,Prev_Blacklist,Follow_Whitelist,Follow_Blacklist,Length,Avg_Length,Prev_Length,Follow_Length,Num_Words,Label,Prediction
4874,Kushner,170,95,for,attorney,True,False,False,False,False,False,7,7.0,3,8,1,0,1
1861,President Bush,339,38,denounced,during,False,False,False,False,False,False,14,6.5,9,6,2,0,1
3761,Trump,177,71,with,at,False,False,False,False,False,False,5,5.0,4,2,1,0,1
2214,Dale,153,48,said,"Lee,",False,False,False,False,False,False,4,4.0,4,4,1,0,1
3076,Putin,27,60,attacks.,ordered,False,False,False,False,False,False,5,5.0,8,7,1,0,1


In [16]:
# False Negatives
fn = results[results['Label'] == 1]
fn[fn['Prediction'] == 0].iloc[0:5]

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Prev_Caps,Prev_Whitelist,Prev_Blacklist,Follow_Whitelist,Follow_Blacklist,Length,Avg_Length,Prev_Length,Follow_Length,Num_Words,Label,Prediction
640,Kelly,99,9,"month,",directed,False,False,False,False,False,False,5,5.0,6,8,1,1,0
3964,Rubio,253,73,on,face,True,False,False,True,False,False,5,5.0,2,4,1,1,0
2707,Mitch McConnell,74,55,Leader,are,False,True,True,False,False,False,15,7.0,6,3,2,1,0
9997,Ivanka Trump,109,233,"administration,",visit,True,False,False,False,False,False,12,5.5,15,5,2,1,0
7583,Fintiklis,3,168,between,and,False,False,False,False,False,False,9,9.0,7,3,1,1,0


Next, we reverse the roles: we train on the test set and then test on the training set. This gives us a new set of false positives and false negatives to examine.

In [17]:
rf = RandomForestClassifier()
rf.fit(test_data[features], test_labels)
predictions = rf.predict(train_data[features])
results = train_data.copy()
results['Label'] = train_labels
results['Prediction'] = predictions
precision, recall, f1 = hw.evaluate(results)
print('Precision: ' + str(precision))
print('Recall: ' + str(recall))
print('F1: ' + str(f1))

Precision: 0.566350710900474
Recall: 0.40371621621621623
F1: 0.4714003944773176


In [18]:
# False Positives
fp = results[results['Label'] == 0]
fp[fp['Prediction'] == 1].iloc[0:5]

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Prev_Caps,Prev_Whitelist,Prev_Blacklist,Follow_Whitelist,Follow_Blacklist,Length,Avg_Length,Prev_Length,Follow_Length,Num_Words,Label,Prediction
10816,School District,221,254,Tri-Valley,--,False,True,False,False,False,False,15,7.0,10,2,2,0,1
4201,People,82,77,people.,have,False,False,False,False,False,False,6,6.0,7,4,1,0,1
10682,Kathy,63,248,spokewoman,Harben,False,False,False,False,False,False,5,5.0,10,6,1,0,1
9395,Party,138,213,Communist,"Politburo,",True,True,False,False,False,False,5,5.0,9,10,1,0,1
4198,These,51,77,"advertisers.""",people,False,False,False,False,False,False,5,5.0,13,6,1,0,1


In [19]:
# False Negatives
fn = results[results['Label'] == 1]
fn[fn['Prediction'] == 0].iloc[0:5]

Unnamed: 0,String,Position,Article,Previous,Following,Possessive,Prev_Caps,Prev_Whitelist,Prev_Blacklist,Follow_Whitelist,Follow_Blacklist,Length,Avg_Length,Prev_Length,Follow_Length,Num_Words,Label,Prediction
9954,Guarino,57,232,"street,",saw,False,False,False,False,False,False,7,7.0,7,3,1,1,0
1338,Peterson,66,26,Mr.,wishes,False,True,True,False,False,False,8,8.0,3,6,1,1,0
9437,Song,117,214,"flexible,""",said.,False,False,False,False,True,False,4,4.0,10,5,1,1,0
7919,Jeff Sessions,28,175,A.G.,asking,False,True,False,False,False,False,13,6.0,4,6,2,1,0
6896,Rubio,177,147,way,is,False,False,False,False,False,False,5,5.0,3,2,1,1,0


We noticed that many of the incorrect predictions seem to make little sense. Many have True for some of the right features, but the predicted label is incorrect anyway. We believe this has to do with the features relating to the length of words. We tested the model without the following features: 'Avg_Length', 'Prev_Length', and 'Follow_Length'. We found that without these features precision increased by about 0.15, but recall dropped by about 0.05.

In [20]:
rf_new = RandomForestClassifier()

new_features = ['Possessive', 'Prev_Caps', 'Prev_Whitelist', 'Prev_Blacklist', 'Follow_Whitelist',
                 'Follow_Blacklist', 'Length', 'Num_Words']
rf_new.fit(train_data[new_features], train_labels)
predictions = rf_new.predict(test_data[new_features])

results_new = test_data.copy()
results_new['Label'] = test_labels
results_new['Prediction'] = predictions
precision, recall, f1 = hw.evaluate(results_new)
print('Precision: ' + str(precision))
print('Recall: ' + str(recall))
print('F1: ' + str(f1))

Precision: 0.75
Recall: 0.33116883116883117
F1: 0.45945945945945943


### What We Learned from Debugging:

We found some common themes among incorrect predictions:
1. We included a feature that shows whether the previous word is capitalized, but we forgot to add a similar feature, follow_caps, to show whether the following word is capitalized.
2. Many words were marked as names because they were the first word in a sentence. We need to add a feature that recognizes whether a candidate is the first word in a sentence (or at least preceded by a period). We might also need a feature to recognize whether the previous word is the first word in a sentence. We could also add a similar feature for following words.
3. While some names do include a period (example: George W. Bush), the model often incorrectly labels two capitalized words surrounding a period as a name (example: "... interview with Cohen. Burr...", where one name happens to end a sentence and a different name begins the next). We might want a feature that shows whether the name contains a period.
4. We found some mistakes in labeling that we will go back into the raw documents and fix.
5. Including the length of the name and of the words before and after it greatly affects the model. When we removed these features, we found that precision increased but recall decreased. This could be useful later if we need to increase precision and the recall is high enough that we can afford to reduce it.
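
Some of the features proposed in items 1 to 3 can be prototyped from the candidate string and its neighbouring words alone. The sketch below is illustrative only: apart from follow_caps, the feature names are hypothetical, and treating '.', '!' and '?' as sentence endings is a simplifying assumption that will misfire on abbreviations such as "Mr.".

```python
SENTENCE_END = ('.', '!', '?')
CLOSERS = '"\')'  # closing quotes or brackets that may follow the final period


def proposed_features(candidate, previous, following):
    """Compute the extra features for one candidate and its neighbouring words."""
    prev = (previous or '').rstrip(CLOSERS)
    foll = following or ''
    # Drop trailing punctuation so only periods inside the name are counted.
    core = candidate.rstrip(CLOSERS + '.,;:!?')
    return {
        'Follow_Caps': foll[:1].isupper(),
        # No previous word, or the previous word ends a sentence.
        'Starts_Sentence': prev == '' or prev.endswith(SENTENCE_END),
        'Ends_Sentence': candidate.rstrip(CLOSERS).endswith(SENTENCE_END),
        'Contains_Period': '.' in core,
    }


print(proposed_features('These', 'advertisers."', 'people'))
print(proposed_features('Cohen. Burr', 'with', 'said'))
```

The first call corresponds to one of the false positives listed earlier and sets Starts_Sentence. The second uses the "Cohen. Burr" example from item 3 and sets Contains_Period.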

We also found some possible post-processing rules:
1. Many of the false positives include words like 'President' or other commonly capitalized words such as days of the week. We could include a post-processing rule that will switch the label of any positive predictions that include such words.
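
Such a rule could look like the sketch below. The word list is a small hypothetical sample (titles and days of the week) rather than a list taken from our data, and the function name `post_process` is also an assumption.

```python
import string

# Hypothetical sample of words that should never be part of a predicted name.
NON_NAME_WORDS = {
    'president', 'monday', 'tuesday', 'wednesday', 'thursday',
    'friday', 'saturday', 'sunday',
}


def post_process(candidate, prediction, non_name_words=NON_NAME_WORDS):
    """Flip a positive prediction to 0 if the candidate contains a listed word."""
    if prediction != 1:
        return prediction  # negative predictions are left unchanged
    words = {word.strip(string.punctuation).lower() for word in candidate.split()}
    return 0 if words & non_name_words else 1


print(post_process('President Bush', 1))   # flipped to 0
print(post_process('Monica Lewinsky', 1))  # stays 1
```

A rule like this can only remove positive predictions, so it may raise precision but cannot improve recall.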