# Business Insider Named Entity Recognition

Author: Kathleen Young

Date created: Wed Mar 7 15:15:27 2018

This notebook includes the code to create a named entity recognition (NER) system for a corpus of scraped Business Insider articles. It trains and applies a CRF model to identify CEO and company names. It also uses regular expressions to identify percents.

For more details, please see my report Business_Insider_NER.pdf.

### Import necessary packages

In [1]:
import os
import re
import pandas as pd
import nltk
from nltk.chunk import tree2conlltags
import sklearn_crfsuite
from sklearn import metrics
from sklearn.metrics import classification_report
import eli5
import more_itertools as mit
from collections import Counter

### Read the Business Insider data

The data was scraped from the Business Insider website and includes 730 articles from 2013 and 2014. It is unstructured, natural language data.

In [2]:
# Create the corpus from the articles
corpus = ""
file_dir = "C:/Users/kathleenyoung/Documents/Winter 2018/IEMS 308 Data Science/HW3/articles"
files = os.listdir(file_dir)
for fname in files:
    with open(file_dir + '/' + fname, 'r', encoding='latin-1') as infile:
        corpus += infile.read()

We were provided with examples of CEO names, company names, and percentages to tag the corpus before using supervised learning/classification.

Some of the CEO names were first and last name, while some were only last names. To make tagging the corpus easier, I manually split up the first and last names and made one list of single names.

Note: There were some issues with the data. I removed these from the list: "American Apparel," "American Express," and "Federal Reserve," as they're clearly not CEO names. There very well may have been other flaws I did not catch.

In [3]:
# Get the CEO tags
ceo_tags = pd.read_csv("C:/Users/vn0a6r3/Documents/Tredence/Projects/IEMS 308 Data Science/HW3/TrainingData/ceo.csv", encoding='latin-1', header=None)
ceo_tags = ceo_tags[0].tolist()

# Note there were some issues with this data.

Similar to the CEO names, the company names from above could range from one word to six words. However, splitting up all the names and putting them into a single list didn't work as well as it did for the CEO names, because company names can include very common words that might occur on their own or in other contexts like "United" or "American."

To more accurately tag the data, I split up the company data into four sets, depending on the number of words in the company name. By being able to check that a company word is surrounded by the right number of other company words, it won't overlabel the corpus for words like "American."

In [4]:
# Get the company tags

# One-word company names
company_tags_1 = pd.read_csv("C:/Users/vn0a6r3/Documents/Tredence/Projects/IEMS 308 Data Science/HW3/TrainingData/companies_1.csv", encoding='latin-1', header=None)
company_tags_1 = company_tags_1[0].tolist()
# Two-word
company_tags_2 = pd.read_csv("C:/Users/vn0a6r3/Documents/Tredence/Projects/IEMS 308 Data Science/HW3/TrainingData/companies_2.csv", encoding='latin-1', header=None)
company_tags_2 = company_tags_2[0].tolist()
# Three-word
company_tags_3 = pd.read_csv("C:/Users/vn0a6r3/Documents/Tredence/Projects/IEMS 308 Data Science/HW3/TrainingData/companies_3.csv", encoding='latin-1', header=None)
company_tags_3 = company_tags_3[0].tolist()
# Four-word and beyond
company_tags_4plus = pd.read_csv("C:/Users/vn0a6r3/Documents/Tredence/Projects/IEMS 308 Data Science/HW3/TrainingData/companies_4plus.csv", encoding='latin-1', header=None)
company_tags_4plus = company_tags_4plus[0].tolist()

### Preprocessing

Right now, the corpus is one extremely long string of all 730 Business Insider articles. In order to prepare the corpus for analysis, I used the `nltk` natural language processing package to tokenize sentences, words, add part-of-speech tags, and create noun phrase chunks for IOB tagging. For more information on IOB tagging, see this [article](https://nlpforhackers.io/tag/iob-tagging/).

In [5]:
# Get sentences
sents = nltk.sent_tokenize(corpus)

In [6]:
# Function that word tokenizes and pos tags a sentence
def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

In [7]:
# Word tokenize and pos tag the corpus
tok_tag = []
for sent in sents: tok_tag.append(preprocess(sent))

In [8]:
# Noun phrase chunk for one or more proper nouns
pattern = 'NP: {<NNP>{1,}}'
cp = nltk.RegexpParser(pattern)
cs = []
for sent in tok_tag: cs.append(cp.parse(sent))

In [9]:
# IOB tag
iob_tagged = []
for sent in cs: iob_tagged.append(tree2conlltags(sent))

In order to apply any machine learning, we need a tagged corpus. The following code loops through the words in the corpus and tags it as CEO or COMPANY if the word is a CEO name or a company name from the training data, respectively. If it is neither, it is marked "O."

In [10]:
# For loop that tags the corpus with the CEO and company name data
tagged_corpus = []

# Loop through the iob_tagged corpus
for sent in iob_tagged:
    iob_update = []
    
    # Loop through each sentence in the corpus
    for index, word in enumerate(sent):
        
        # Check if the IOB tag is B-NP (beginning of the noun phrase)
        if (word[2] == 'B-NP'):
            
            # If it's a CEO name, change the tag to CEO and append to iob_update
            if (word[0] in ceo_tags):
                changed_tag = (word[0], word[1], 'CEO')
                iob_update.append(changed_tag)
            
            # If statements to ensure that if a word is a company name word, it is followed by
            # the appropriate number of company name words
            
            # If so, change the tag to COMPANY and append to iob_update
            elif (word[0] in company_tags_1) and (sent[index + 1][2] != "I-NP"):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            elif (word[0] in company_tags_2) and (sent[index + 1][2] == "I-NP") and (sent[index + 1][0] in company_tags_2):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            elif (index + 2 < len(sent)) and (word[0] in company_tags_3) and (sent[index + 1][2] == "I-NP") and (sent[index + 1][0] in company_tags_3) and (sent[index + 2][2] == "I-NP") and (sent[index + 2][0] in company_tags_3):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            elif (index + 3 < len(sent)) and (word[0] in company_tags_4plus) and (sent[index + 1][2] == "I-NP") and (sent[index + 1][0] in company_tags_4plus) and (sent[index + 2][2] == "I-NP") and (sent[index + 2][0] in company_tags_4plus) and (sent[index + 3][2] == "I-NP") and (sent[index + 3][0] in company_tags_4plus):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
                
            # If the B-NP word is not a company or CEO name, label it "O", indicating outside of a tag.
            else:
                iob_update.append((word[0], word[1], 'O'))
        
        # If the IOB tag is a I-NP (inside the noun phrase)
        elif (word[2] == 'I-NP'):
            
            # If it's a CEO name, change the tag to CEO and append to iob_update
            if (word[0] in ceo_tags):
                changed_tag = (word[0], word[1], 'CEO')
                iob_update.append(changed_tag)
                
            # Like above, this ensures that if a word is a company name word,
            # it is surrounded by the appropriate number of company name words.
            
            # If so, change the tag to COMPANY and append
            elif (word[0] in company_tags_2) and (sent[index - 1][2] == "B-NP") and (sent[index - 1][0] in company_tags_2):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            elif (index - 2 >= 0) and (word[0] in company_tags_3) and (sent[index - 2][2] == "B-NP") and (sent[index - 2][0] in company_tags_3) and (sent[index - 1][2] == "I-NP") and (sent[index - 1][0] in company_tags_3):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            elif (index - 3 >= 0) and (word[0] in company_tags_4plus) and (sent[index - 3][2] == "B-NP") and (sent[index - 3][0] in company_tags_4plus) and (sent[index - 2][2] == "B-NP") and (sent[index - 2][0] in company_tags_4plus) and (sent[index - 1][2] == "I-NP") and (sent[index - 1][0] in company_tags_4plus):
                changed_tag = (word[0], word[1], 'COMPANY')
                iob_update.append(changed_tag)
            
            # If the I-NP word is not a company or CEO name, label it "O"
            else:
                iob_update.append((word[0], word[1], 'O'))
        
        # Else append the word as it is
        else:
            iob_update.append(word)
            
    # Append the sentence to the tagged corpus
    tagged_corpus.append(iob_update)

### Get features

The following function gets the features and labels from the tagged corpus that will be used to train the model.

In [11]:
# Function that get features from the tagged corpus
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        # Last three letters
        'word[-3:]': word[-3:],
        # Binary all uppercase
        'word.isupper()': word.isupper(),
        # Binary starts with a capitalized letter
        'word.istitle()': word.istitle(),
        # Binary is digit
        'word.isdigit()': word.isdigit(),
        # POS tag
        'postag': postag,
        # First two letters of POS tag
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            # Similar to the features above, but for the previous word
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        # Binary beginning of sentence
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            # Similar to the features above, but for the next word
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        # Binary end of sentence
        features['EOS'] = True

    return features

In [12]:
# Function that gets features from a sentence
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

In [13]:
# Function that gets the labels from a sentence
def sent2labels(sent):
    return [label for token, postag, label in sent]

In [14]:
# Function that gets the word, POS tag, and label from a sentence
def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [15]:
# Get the features and labels from the tagged_corpus
features = [sent2features(s) for s in tagged_corpus]
labels = [sent2labels(s) for s in tagged_corpus]

### CRF model

CRF, or Conditional Random Fields, is a model commonly used for named entity recognition. CRF models do well in prediction tasks where sequence is important. They take contextual information and states of the neighbors to predict the current condition. This [scikit learn CRF tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html) was extremely helpful in building the model. Here is more information on [Conditional Random Fields](https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541).

In [16]:
# Create the CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)

As per the assignment, this model will be used to extract CEO and company names from the corpus. Because the model will not be used on other data sets I'm not concerned about overfitting, so I will train the model on the entire corpus and use the model on that same corpus instead of splitting into testing and training sets.

In [17]:
# Train CRF model
crf.fit(features, labels)

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=False, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=20,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)

In [18]:
# Use the model to make predictions on the corpus
pred = crf.predict(features)

### Evaluating the CRF model

In [19]:
# Flatten the predictions, corpus, tag data
pred_lst = [item for sublist in pred for item in sublist]
tagged_corpus_lst = [item for sublist in tagged_corpus for item in sublist]
tags_lst = [item[2] for item in tagged_corpus_lst]

The report shows that when the model predicts a CEO or company name, it is often right--80% of the time for CEOs and 81% of the time for companies. However, the recall statistics show that there are many named entities that the model overlooks. This is reflected in the f1-score as well.

In [20]:
# Create a classification report
classes = ['CEO', 'COMPANY', 'O']
print(metrics.classification_report(tags_lst, pred_lst, labels = classes))

              precision    recall  f1-score   support

         CEO       0.80      0.01      0.01    195782
     COMPANY       0.81      0.02      0.05    138828
           O       0.98      1.00      0.99  16492032

   micro avg       0.98      0.98      0.98  16826642
   macro avg       0.86      0.34      0.35  16826642
weighted avg       0.98      0.98      0.97  16826642



In [21]:
# Check the state features
def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-8s %s" % (weight, label, attr))

Here we can see the top positive and negative predicters for the model:

In [22]:
print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(30))
print("\nTop negative:")
print_state_features(Counter(crf.state_features_).most_common()[-30:])

Top positive:
2.303508 O        bias
1.901535 O        postag:NN
1.717225 O        postag[:2]:VB
1.686105 O        BOS
1.547076 COMPANY  postag:NNP
1.498622 O        EOS
1.381504 O        postag:.
1.381504 O        postag[:2]:.
1.359706 O        postag:IN
1.359706 O        postag[:2]:IN
1.327784 O        word.lower():.
1.327784 O        word[-3:]:.
1.265324 O        postag:DT
1.265324 O        postag[:2]:DT
1.012902 O        postag[:2]:JJ
0.931343 O        postag:NNS
0.912922 O        postag:JJ
0.768445 CEO      postag:NNP
0.765750 CEO      word.istitle()
0.741570 O        word.lower():the
0.737990 O        postag[:2]:RB
0.703370 O        postag[:2]:PR
0.698356 O        postag:RB
0.560838 O        postag:PRP
0.545203 O        word.lower():,
0.545203 O        word[-3:]:,
0.545203 O        postag:,
0.545203 O        postag[:2]:,
0.504696 O        -1:postag[:2]:NN
0.496773 O        postag:CC

Top negative:
-0.214271 COMPANY  +1:postag:CD
-0.214271 COMPANY  +1:postag[:2]:CD
-0.220206 O    

And here are the top features for each label individually:

In [25]:
# Inspect model weights
eli5.show_weights(crf, top=10)

From \ To,CEO,COMPANY,O
CEO,-0.177,-1.009,-0.171
COMPANY,-0.739,0.186,-0.666
O,-0.129,-0.311,1.876

Weight?,Feature,Unnamed: 2_level_0
Weight?,Feature,Unnamed: 2_level_1
Weight?,Feature,Unnamed: 2_level_2
+0.768,postag:NNP,
+0.766,word.istitle(),
+0.351,word[-3:]:erg,
+0.344,word[-3:]:son,
+0.325,word.lower():bloomberg,
+0.259,+1:postag:VBD,
… 26394 more positive …,… 26394 more positive …,
… 6893 more negative …,… 6893 more negative …,
-0.257,+1:postag:NN,
-0.533,word.isupper(),

Weight?,Feature
+0.768,postag:NNP
+0.766,word.istitle()
+0.351,word[-3:]:erg
+0.344,word[-3:]:son
+0.325,word.lower():bloomberg
+0.259,+1:postag:VBD
… 26394 more positive …,… 26394 more positive …
… 6893 more negative …,… 6893 more negative …
-0.257,+1:postag:NN
-0.533,word.isupper()

Weight?,Feature
+1.547,postag:NNP
+0.383,postag[:2]:NN
+0.336,word.lower():apple
+0.307,word.istitle()
+0.294,word[-3:]:gan
+0.293,word[-3:]:ple
+0.261,word.lower():reuters
… 7844 more positive …,… 7844 more positive …
… 5043 more negative …,… 5043 more negative …
-0.576,-1:postag[:2]:NN

Weight?,Feature
+2.304,bias
+1.902,postag:NN
+1.717,postag[:2]:VB
+1.686,BOS
+1.499,EOS
+1.382,postag:.
+1.382,postag[:2]:.
+1.360,postag[:2]:IN
+1.360,postag:IN
… 199042 more positive …,… 199042 more positive …


### Extracting the CEO and company names

The assigment is to return lists of the CEO and company names from the model.

In [26]:
# Create a dataframe from the lists
d = {"Word":tagged_corpus_lst, "Predictions":pred_lst}
d_df = pd.DataFrame(d)

In [27]:
# CEO names
df_ceo = d_df[d_df['Predictions'] == 'CEO']

In [28]:
# Company names
df_company = d_df[d_df['Predictions'] == 'COMPANY']

The model can only tag single words as companies or CEOs. If the model tagged two or more words in a row as CEO, it likely indicates a first and last name. The same goes for multi-word company names.

The following code checks if two or more tags have consecutive indicies. In our list of extracted entities, we want these to appear as one item on the list, not two or more. So, we combine them into one word and append them to the extracted entities lists.

In [29]:
# Combine compound names

# Get and flatten the list of indicies where CEO names are
ceo_ind = list(df_ceo.index.values)
ceo_grouped = [list(group) for group in mit.consecutive_groups(ceo_ind)]

In [30]:
# Create compound words from the CEO names with consecutive indicies
ceo_full = ""
ceos = []
for ceo in ceo_grouped:
    ceo_full = ""
    for ind in ceo:
        ceo_full = ceo_full + str(tagged_corpus_lst[ind][0]) + " "
    ceos.append(ceo_full)

In [31]:
# Get and flatten list of indicies for company names
company_ind = list(df_company.index.values)
company_grouped = [list(group) for group in mit.consecutive_groups(company_ind)]

In [32]:
# Same as above, create compound words from the company names with consecutive indicies
company_full = ""
companies = []
for company in company_grouped:
    company_full = ""
    for ind in company:
        company_full = company_full + str(tagged_corpus_lst[ind][0]) + " "
    companies.append(company_full)

### Extracting percentages

Instead of training a model to recognize percents, I used a rules-based method. It is cleaner and more accurate, because all representations of percentages in the training data could be covered with regular expressions.

In [33]:
# Create regular expressions for percent formats and find them in the corpus
reA = re.findall('[0-9-+]+?[0-9.]+%', corpus)
reB = re.findall('[0-9-+]+?[0-9.]+\spercent', corpus)
re1 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?one\spercent', corpus)
re2 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?two\spercent', corpus)
re3 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?three\spercent', corpus)
re4 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?four(?:teen)?\spercent', corpus)
re5 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?five\spercent', corpus)
re6 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?six(?:teen)?\spercent', corpus)
re7 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?seven(?:teen)?\spercent', corpus)
re8 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?eight\spercent', corpus)
re9 = re.findall('(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)?-?nine(?:teen)?\spercent', corpus)
re10 = re.findall('ten\spercent', corpus)
re11 = re.findall('eleven\spercent', corpus)
re12 = re.findall('twelve\spercent', corpus)
re13 = re.findall('thirteen\spercent', corpus)
re15 = re.findall('fifteen\spercent', corpus)
re18 = re.findall('eighteen\spercent', corpus)
re20 = re.findall('twenty\spercent', corpus)
re30 = re.findall('thirty\spercent', corpus)
re40 = re.findall('forty\spercent', corpus)
re50 = re.findall('fifty\spercent', corpus)
re60 = re.findall('sixty\spercent', corpus)
re70 = re.findall('seventy\spercent', corpus)
re80 = re.findall('eighty\spercent', corpus)
re90 = re.findall('ninety\spercent', corpus)
re100 = re.findall('one-?\s?hundred\spercent', corpus)
retst = re.findall('nineteen\spercent', corpus)

In [34]:
# Append all of the percents into a list
percents = reA + reB + re1 + re2 + re3 + re4 + re5 + re6 + re7 + re8 + re9 + re10 + re11 + re12 + re13 + re15 + re18 + re20 + re30 + re40 + re50 + re60 + re70 + re80 + re90 + re100

### Create CSVs with the extracted entities

In [35]:
# Output CSV files for each of the lists (CEOs, companies, percents)
pd.DataFrame(ceos).to_csv("extracted_ceos.csv")
pd.DataFrame(companies).to_csv("extracted_companies.csv")
pd.DataFrame(percents).to_csv("extracted_percents.csv")