# Lab 8: Named entity recognition

- Jacob Eisenstein
- For Georgia Tech CS8803-CSS, Fall 2017

In this project, you'll use Stanford's CoreNLP tagger to tag names of people, places, and organizations in the abolitionist newspaper The Liberator.

You can download the software here:

https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip

Next, unzip it:

In [1]:
! unzip stanford-ner-2017-06-09.zip



In [2]:
from nltk.tag.stanford import StanfordNERTagger
from glob import glob
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
import os
from collections import Counter

Let's build a tagger object. The first argument is the location of the model file; the second is the location of the jar file. Both should have been extracted from the zipfile you downloaded.

In [3]:
tagger = StanfordNERTagger('stanford-ner-2017-06-09/classifiers/english.conll.4class.distsim.crf.ser.gz',
                           path_to_jar='stanford-ner-2017-06-09/stanford-ner.jar')

The StanfordTokenizer will be deprecated in version 3.2.5.
Please use nltk.tag.corenlp.CoreNLPPOSTagger or nltk.tag.corenlp.CoreNLPNERTagger instead.
  super(StanfordNERTagger, self).__init__(*args, **kwargs)


Let's run it. The input is a sequence of tokens. Here we'll just use string split for tokenization.

In [4]:
example = 'Colonel Mustard was in Druid Hills , with the President of the Coca Cola Corporation .'.split()

In [5]:
tagger.tag(example)

[('Colonel', 'O'),
 ('Mustard', 'PERSON'),
 ('was', 'O'),
 ('in', 'O'),
 ('Druid', 'LOCATION'),
 ('Hills', 'LOCATION'),
 (',', 'O'),
 ('with', 'O'),
 ('the', 'O'),
 ('President', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('Coca', 'ORGANIZATION'),
 ('Cola', 'ORGANIZATION'),
 ('Corporation', 'ORGANIZATION'),
 ('.', 'O')]

The output is a labeling of each token. The tag 'O' means 'outside' of any entity name.

Here is a simple function that extracts names from this output.

In [6]:
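def get_entities(tagger_output):
    current_entity = []
    entities = []
    for token,tag in tagger_output:
        if tag != 'O':
            current_entity.append((token,tag))
        else:
            if current_entity != []:
                entities.append(current_entity)
                current_entity = []
    # flush an entity that runs to the end of the input (no trailing 'O')
    if current_entity != []:
        entities.append(current_entity)
    # format each entity as 'token token_TAG', using the tag of its first token
    return ['%s_%s'%(' '.join([tok for tok,tag in entity]),entity[0][1]) for entity in entities]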

Let's run the function.

In [7]:
get_entities(tagger.tag(example))

['Mustard_PERSON',
 'Druid Hills_LOCATION',
 'Coca Cola Corporation_ORGANIZATION']
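Grouping purely on `'O'` boundaries merges two adjacent entities even when their tags differ. As a hedged sketch of one alternative, the variant below uses `itertools.groupby` to start a new entity whenever the tag changes, and it does not rely on a trailing `'O'` token. The name `get_entities_by_tag` and the hand-written `toy_output` are illustrative assumptions, not part of the lab or real tagger output.

```python
from itertools import groupby

def get_entities_by_tag(tagger_output):
    """Group consecutive tokens that share the same non-'O' tag."""
    entities = []
    # groupby yields one (tag, run of pairs) group per maximal run of equal tags
    for tag, group in groupby(tagger_output, key=lambda pair: pair[1]):
        if tag != 'O':
            tokens = [token for token, _ in group]
            entities.append('%s_%s' % (' '.join(tokens), tag))
    return entities

# Hand-written tags (NOT real tagger output): two adjacent entities of
# different types, plus an entity in sentence-final position.
toy_output = [('Lucia', 'PERSON'), ('Coca', 'ORGANIZATION'),
              ('Cola', 'ORGANIZATION'), ('visited', 'O'),
              ('Atlanta', 'LOCATION')]
print(get_entities_by_tag(toy_output))
```

This variant only helps when the tagger assigns different tags to neighboring names; if both names receive the same tag, they are still merged.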

Let's try a harder one.

In [8]:
hard_example = 'I told Lucia Coca Cola was bad for her teeth .'.split()

In [9]:
tagger.tag(hard_example)

[('I', 'O'),
 ('told', 'O'),
 ('Lucia', 'ORGANIZATION'),
 ('Coca', 'ORGANIZATION'),
 ('Cola', 'ORGANIZATION'),
 ('was', 'O'),
 ('bad', 'O'),
 ('for', 'O'),
 ('her', 'O'),
 ('teeth', 'O'),
 ('.', 'O')]

In [10]:
get_entities(tagger.tag(hard_example))

['Lucia Coca Cola_ORGANIZATION']

In [11]:
with_complementizer = 'I told Lucia that Coca Cola was bad for her teeth .'.split()

In [12]:
get_entities(tagger.tag(with_complementizer))

['Lucia_ORGANIZATION', 'Coca Cola_ORGANIZATION']

And this is why you should use complementizers when you write.

# Tagging full documents

To do better word segmentation, make sure you have downloaded the `punkt` tokenization model from nltk.

In [13]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/jacob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Now let's try tagging a document from The Liberator.

Link or copy this directory in from Lab 7 if necessary.

In [19]:
! ln -s ../lab7/liberator-stories/

In [20]:
filename = 'liberator-stories/Issue of April 01, 1853/story006.txt'

In [21]:
tagged_lines = []
with open(filename) as fin:
    for line in fin:
        tagged_lines.append(tagger.tag(word_tokenize(line)))

In [22]:
print(tagged_lines[0][:10])

[('Southern', 'MISC'), ('slaveholders', 'O'), ('have', 'O'), ('a', 'O'), ('passion', 'O'), ('for', 'O'), ('mischiefframed', 'O'), ('into', 'O'), ('law', 'O'), (',', 'O')]


In [23]:
get_entities(tagged_lines[0])

['Southern_MISC',
 'North_LOCATION',
 'Illinois_LOCATION',
 'Virginia_LOCATION',
 'Old Dominion_MISC',
 'Illinoisan_LOCATION',
 'Virginia_LOCATION',
 'Illinois_LOCATION',
 'Northern_ORGANIZATION',
 'question-In_MISC',
 'Democracy_ORGANIZATION',
 'Commonwealth_ORGANIZATION']

**Your turn**: Get a story from today's news, copy it into the variable below, and extract the named entities. Skim the first few lines of the story yourself to see if it's correct. 

In [48]:
your_story = """
paste here
"""

In [53]:
# here's my output
get_entities(tagger.tag(word_tokenize(your_story)))

['Africa_LOCATION',
 'Trump_PERSON',
 'David T. Johnson_PERSON',
 'Cowanda Jones-Johnson_PERSON',
 'The Washington Post_ORGANIZATION',
 'White House_LOCATION',
 'Johnson_PERSON',
 'Myeshia Johnson_PERSON',
 'Johnson_PERSON',
 'Frederica S. Wilson_PERSON',
 'Trump_PERSON',
 'Johnson_PERSON',
 'Trump_PERSON',
 'Trump_PERSON',
 'Wilson_PERSON',
 'Wilson_PERSON',
 'Trump_PERSON']

## Counting entities

We can keep count of all these different entities, using a `Counter` object.

In [175]:
from collections import Counter

In [176]:
Counter(get_entities(tagged_lines[0]))

Counter({'Commonwealth_ORGANIZATION': 1,
         'Democracy_ORGANIZATION': 1,
         'Illinois_LOCATION': 2,
         'Illinoisan_LOCATION': 1,
         'North_LOCATION': 1,
         'Northern_ORGANIZATION': 1,
         'Old Dominion_MISC': 1,
         'Southern_MISC': 1,
         'Virginia_LOCATION': 2,
         'question-In_MISC': 1})

A cool thing about counters is that you can add them up, which makes it easy to keep a running count.

In [177]:
counter1 = Counter(get_entities(tagged_lines[0]))
counter1 += Counter(get_entities(tagged_lines[0]))

In [178]:
counter1

Counter({'Commonwealth_ORGANIZATION': 2,
         'Democracy_ORGANIZATION': 2,
         'Illinois_LOCATION': 4,
         'Illinoisan_LOCATION': 2,
         'North_LOCATION': 2,
         'Northern_ORGANIZATION': 2,
         'Old Dominion_MISC': 2,
         'Southern_MISC': 2,
         'Virginia_LOCATION': 4,
         'question-In_MISC': 2})

We'll use this to incrementally build a counter as we process all stories in a single edition.

Another useful trick is to build a counter from a list. 

In [179]:
the_list = ['a','b','a','a','c','b']

In [180]:
Counter(the_list)

Counter({'a': 3, 'b': 2, 'c': 1})
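The two tricks combine naturally. Here is a minimal sketch on made-up entity lists (standing in for per-line `get_entities` output; the variable names are assumptions for illustration) showing two equivalent ways to keep a running count: `Counter.update`, which counts a list's items in place, and adding up one `Counter` per line.

```python
from collections import Counter

# Made-up per-line entity lists, standing in for get_entities output.
lines = [
    ['Boston_LOCATION', 'God_PERSON'],
    ['Boston_LOCATION'],
    ['Senate_ORGANIZATION', 'God_PERSON', 'Boston_LOCATION'],
]

# Option 1: update a single counter in place, one list at a time
running = Counter()
for entities in lines:
    running.update(entities)

# Option 2: build one Counter per line and add them all up
summed = sum((Counter(entities) for entities in lines), Counter())

print(running == summed)
print(running.most_common(1))
```

Both options give the same counts; `update` avoids creating a new `Counter` object at every step, which may matter on long files.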

**Your turn** 

- Count the number of times a person is mentioned in your news story.
- Count the total number of different person names that are mentioned.

# Comparing entities across multiple texts

Now let's compare the named entities that are mentioned in two specific editions.

In [181]:
edition1 = 'liberator-stories/Issue of November 01, 1850'
edition2 = 'liberator-stories/Issue of November 11, 1859'

Here's a function to compute a running count of entities across the lines and stories of an issue.

In [65]:
def get_entity_counts(directory,show_progress=False):
    entity_counts = Counter()
    for filename in glob(os.path.join(directory,'story*txt')):
        with open(filename) as fin:
            if show_progress: print(filename)
            for line in fin:
                # skip blank or very short lines
                if len(line)>10:
                    output = tagger.tag(word_tokenize(line))
                    entity_counts += Counter(get_entities(output))
    return entity_counts

In [68]:
counts1 = get_entity_counts(edition1,show_progress=True)

liberator-stories/Issue of November 01, 1850/story007.txt
liberator-stories/Issue of November 01, 1850/story000.txt
liberator-stories/Issue of November 01, 1850/story002.txt
liberator-stories/Issue of November 01, 1850/story003.txt
liberator-stories/Issue of November 01, 1850/story006.txt
liberator-stories/Issue of November 01, 1850/story001.txt
liberator-stories/Issue of November 01, 1850/story004.txt
liberator-stories/Issue of November 01, 1850/story005.txt
liberator-stories/Issue of November 01, 1850/story009.txt
liberator-stories/Issue of November 01, 1850/story008.txt


In [77]:
counts2 = get_entity_counts(edition2,show_progress=True)

liberator-stories/Issue of November 11, 1859/story007.txt
liberator-stories/Issue of November 11, 1859/story000.txt
liberator-stories/Issue of November 11, 1859/story002.txt
liberator-stories/Issue of November 11, 1859/story003.txt
liberator-stories/Issue of November 11, 1859/story006.txt
liberator-stories/Issue of November 11, 1859/story001.txt
liberator-stories/Issue of November 11, 1859/story004.txt
liberator-stories/Issue of November 11, 1859/story005.txt
liberator-stories/Issue of November 11, 1859/story009.txt
liberator-stories/Issue of November 11, 1859/story008.txt


In [182]:
counts1.most_common(10)

[('Massachusetts_LOCATION', 14),
 ('Constitution_ORGANIZATION', 13),
 ('United States_LOCATION', 9),
 ('Senate_ORGANIZATION', 9),
 ('God_PERSON', 7),
 ('California_LOCATION', 6),
 ('Whittier_LOCATION', 6),
 ('ERRIEN_ORGANIZATION', 6),
 ('BERRIEN_ORGANIZATION', 6),
 ('North_LOCATION', 5)]

In [183]:
counts2.most_common(10)

[('Harper_PERSON', 13),
 ('Brown_PERSON', 12),
 ('North_LOCATION', 11),
 ('South_LOCATION', 9),
 ('Phillips_PERSON', 9),
 ('Virginia_LOCATION', 8),
 ('Republican_MISC', 7),
 ('Seward_PERSON', 6),
 ('Beecher_PERSON', 4),
 ('Giddings_ORGANIZATION', 4)]

As predicted, John Brown's raid on Harper's Ferry dominates the news in late 1859, even though the NER system mistakenly labels "Harper" as a person. Another key difference is that "North" and "South" seem to play a bigger role in the 1859 data, which might make sense, as the war is only a year and a half away.

**Your turn** What are the ten most frequently-mentioned organizations, across **both** newspaper issues?

In [None]:
# your code here

In [85]:
org_counts.most_common(10)

[('Constitution_ORGANIZATION', 14),
 ('Senate_ORGANIZATION', 9),
 ('ERRIEN_ORGANIZATION', 6),
 ('BERRIEN_ORGANIZATION', 6),
 ('Congress_ORGANIZATION', 6),
 ('Fugitive Slave Bill_ORGANIZATION', 5),
 ('Giddings_ORGANIZATION', 4),
 ('Supreme Court_ORGANIZATION', 4),
 ('Union_ORGANIZATION', 4),
 ('Legislature_ORGANIZATION', 3)]

# Large-scale comparison

Now let's do a large-scale comparison over the years of the dataset. Because the NER system is a little slow, I ran it overnight on our server. You can load in the output as shown:

In [87]:
! tar xzf liberator-nes.tgz

In [90]:
with open('TheLiberator/Issue of April 15, 1859/story001.ne') as fin:
    for line in fin:
        print (line.rstrip())

States_LOCATION
States_LOCATION
SWERED_ORGANIZATION
Unioncan_LOCATION
States_LOCATION
ONSTITUTION_ORGANIZATION


**Your turn** Read this same file into a counter.

In [184]:
# your code here

Now let's build counters for all stories in a year. The following function should help.

In [107]:
get_files_for_year = lambda year : glob('TheLiberator/Issue*%d/story*.ne'%(year))

In [109]:
get_files_for_year(1852)[:5]

['TheLiberator/Issue of November 05, 1852/story041.ne',
 'TheLiberator/Issue of November 05, 1852/story006.ne',
 'TheLiberator/Issue of November 05, 1852/story032.ne',
 'TheLiberator/Issue of November 05, 1852/story020.ne',
 'TheLiberator/Issue of November 05, 1852/story048.ne']

**Your turn** Implement the following function, which should return a counter of entity names for all stories in a given year. Don't forget the two tricks about Counters that I showed you above.

In [119]:
def get_entity_counts_for_year(year):
    counts = Counter()
    # add code here
    return counts

Test your function

In [136]:
get_entity_counts_for_year(1859).most_common(10)

[('Massachusetts_LOCATION', 150),
 ('God_PERSON', 136),
 ('Boston_LOCATION', 127),
 ('South_LOCATION', 77),
 ('Brown_PERSON', 76),
 ('United States_LOCATION', 70),
 ('Constitution_ORGANIZATION', 68),
 ('North_LOCATION', 66),
 ('New York_LOCATION', 62),
 ('Oregon_LOCATION', 58)]

Now let's print the top entities for every year.

In [141]:
for year in range(1846,1866):
    print(year,end=': ')
    print(' '.join([name.split('_')[0] 
                    for name,count 
                    in get_entity_counts_for_year(year).most_common(10)]))

1846: American God Alliance Rev United States Christian America England Boston Massachusetts
1847: God American Mexico Boston United States Christian Douglass England Union South
1848: God South North Senate Boston Union Congress American House United States
1849: God South Boston American Congress Christian New York North United States Southern
1850: God South Boston Webster North Congress United States Massachusetts Union Senate
1851: God American Boston America England United States Union Constitution States North
1852: God Boston American South Constitution New York America Pierce Washington Parker
1853: God Boston Mann American Constitution United States Christian Massachusetts England New York
1854: God Boston Congress American South United States Constitution Ohio North States
1855: God Union South American North Constitution Boston United States Massachusetts New York
1856: Kansas God South Boston North United States Union American Massachusetts New York
1857: God South Boston 

# Pointwise mutual information 

The top names appear to be dominated by a few recurring items: God, American, South, etc.

Remember that we addressed this problem before by using pointwise mutual information (PMI). As a reminder, here is the formula:

\begin{equation}
\text{PMI}(i,j) = \log \frac{P(i,j)}{P(i)P(j)} = \log \frac{P(i \mid j) P(j)}{P(i) P(j)} = \log P(i \mid j) - \log P(i)
\end{equation}
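As a small illustration of the formula, here is a sketch on made-up counts for two hypothetical years, treating $i$ as a name and $j$ as a year. The names `counts_by_year` and `toy_pmi` are invented for this example, and the numbers are not from The Liberator.

```python
import math
from collections import Counter

# Made-up counts for two hypothetical years (NOT from The Liberator).
counts_by_year = {
    1850: Counter({'God': 6, 'Webster': 3, 'Kansas': 1}),
    1856: Counter({'God': 6, 'Webster': 1, 'Kansas': 3}),
}

# Pool the counts over all years to estimate P(i)
total = Counter()
for year_counts in counts_by_year.values():
    total += year_counts
n_total = sum(total.values())

def toy_pmi(name, year):
    """PMI(i, j) = log P(i | j) - log P(i), for a name i seen in year j."""
    year_counts = counts_by_year[year]
    log_p_i_given_j = math.log(year_counts[name] / sum(year_counts.values()))
    log_p_i = math.log(total[name] / n_total)
    return log_p_i_given_j - log_p_i

for name in ['God', 'Webster', 'Kansas']:
    print(name, round(toy_pmi(name, 1850), 3))
```

In this toy data `God` is equally frequent in both years, so its PMI is zero. `Webster` is over-represented in 1850 and gets a positive score, and `Kansas` is under-represented and gets a negative one.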

We can compute this directly from the Counter objects. First let's compute the bottom term, $\log P(i)$, which is the log probability of each entity name, over all years in the dataset.

In [185]:
total_counts = Counter()
for year in range(1846,1866):
    total_counts += get_entity_counts_for_year(year)

In [186]:
import numpy as np # we need this for log

In [188]:
log_pi = {name:np.log(count) for name,count in total_counts.items()}

In [200]:
[(name,log_pi[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', 7.9069154886785871),
 ('South_LOCATION', 7.4442486494967053),
 ('Boston_LOCATION', 7.4193805829186923),
 ('North_LOCATION', 7.2640301428995295),
 ('American_MISC', 7.1808311990445555)]

These are the log counts. Note that they are positive, which means that they can't be log probabilities: $p(x) \leq 1 \Leftrightarrow \log p(x) \leq 0$.

We'll fix this by subtracting $\log N$, where $N$ is the sum of all counts.
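Spelled out, with $c_i$ the count of name $i$ and $N = \sum_k c_k$ (notation introduced here only for illustration), the normalization is:

\begin{equation}
\log P(i) = \log \frac{c_i}{N} = \log c_i - \log N
\end{equation}

Because $c_i \leq N$, every such value is at most zero, as a log probability must be.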

In [248]:
tot_log_N = np.log(sum(total_counts.values()))
log_pi = {name:np.log(count) - tot_log_N
          for name,count in total_counts.items()}

In [249]:
[(name,log_pi[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -3.9701877038264808),
 ('South_LOCATION', -4.4328545430083626),
 ('Boston_LOCATION', -4.4577226095863756),
 ('North_LOCATION', -4.6130730496055383),
 ('American_MISC', -4.6962719934605124)]

In [251]:
# exps of log probabilities should sum to one
sum(np.exp(val) for val in log_pi.values())

1.0000000000010483

Better! 

Now, to compute $\log P(i \mid j)$, you just need to do the same operation, but with the counter for each specific year.

**Your turn**: fill in the function below, which should return a dict of names and log probabilities.

In [225]:
def get_log_pij(year):
    counts = get_entity_counts_for_year(year)
    # your code here
    return #something

In [227]:
# desired output
[(name,get_log_pij(1859)[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -4.2038142706300592),
 ('South_LOCATION', -4.7726637345124274),
 ('Boston_LOCATION', -4.2722820699075204),
 ('North_LOCATION', -4.9268144143396864),
 ('American_MISC', -5.0560261458196925)]

In [232]:
# check that the probabilities still sum to one
sum(np.exp(val) for val in get_log_pij(1859).values())

0.99999999999992628

Now we can compute the PMI. Note that we need only compute PMI for names that appear in the year; for names that never appear in that year, it's undefined ($\log 0$).

In [1]:
def get_PMI(year):
    log_pij = get_log_pij(year)
    # PMI = log P(name | year) - log P(name), for names seen in this year
    pmi = {name:log_pij_name - log_pi[name]
           for name,log_pij_name
           in log_pij.items()}
    return pmi

In [221]:
[(name,get_PMI(1859)[name]) for name,count in total_counts.most_common(5)]

[('God_PERSON', -0.22630052671150569),
 ('South_LOCATION', -0.32690578666815728),
 ('Boston_LOCATION', 0.19328371713988091),
 ('North_LOCATION', -0.2987034873696075),
 ('American_MISC', -0.34265971899987946)]

## Top names by PMI per year

Now we'll call your function to get the top names by PMI for each year. 

We'll focus on names among the 500 most common overall.

In [245]:
# note: despite the variable name, this keeps the 500 most common names
top1k = [name for name,count in total_counts.most_common(500)]

In [247]:
for year in range(1846,1866):
    pmi_year = get_PMI(year)
    pmi_year_filtered = {name:pmi for name,pmi in pmi_year.items() if name in top1k}
    top_names = sorted(pmi_year_filtered,key=pmi_year_filtered.get,reverse=True)[:5]
    print(year,'\t'.join(name.split('_')[0] for name in top_names))

1846 Stephen C. Phillips	National Anti-SlaveryStandard	Bingham	Rio Grande	Habeas Corpus
1847 HORN	Douglass	Frederick	Richmond Enquirer	Essex
1848 Berrien	Corwin	Southerner	Rio Grande	Hale
1849 Cooke	Assembly	Southern	Washington Union	Essex
1850 James Hamlet	Slave Trade Bill	Legislature of the State	Haynau	Berrien
1851 Fugitive SlaveBill	This Committee	Duncan	Fugitive Slave	Fugitive Slave Law
1852 Hungary	Fugitive SlaveLaw	Supreme Court of the UnitedStates	Essex	N.Y. Express
1853 Robespierre	HORACE MANN	Haynau	Horace Mann	Ido
1854 Atchinson	Surrey	Likethe	Boston Daily Advertiser	Court House
1855 OUGLAS	Richmond Enquirer	Fugitive SlaveBill	Slave Bill	Burns
1856 North Fifth	Chicago Press	Frederick ( Md	Buffalo Express	Thisquestion
1857 Charles F. Adams	ILLIAM LLERY HANNING	Stephens	Likethe	State of Virginia
1858 Abolitionistsof	State Constitutions	Slaverywill	Gospel of Christ	Charleston Courier
1859 Ifit	Brewster	WENDELLPHILLIPS	Arsenal	John Brown
1860 Statelaws	WILLIAM ELLERY CHANNING	Sa

These are honestly not so great -- lots of typos and capitalization issues. 

There are also a lot of mentions of other newspapers, which might make sense, depending on when those newspapers ran and whether The Liberator frequently borrowed from them.

Future work on this data could play with smoothing, try TF-IDF instead of PMI, or work with the count of unique stories or issues in which each name appears, rather than the raw counts.
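As a rough sketch of the last idea, counting stories rather than mentions only requires deduplicating each story's entity list before adding it to the counter. The data below is made up, and `stories`, `raw_counts`, and `story_counts` are names assumed for illustration.

```python
from collections import Counter

# Made-up entity lists for three hypothetical stories.
stories = [
    ['Brown_PERSON', 'Brown_PERSON', 'Virginia_LOCATION'],
    ['Brown_PERSON'],
    ['Virginia_LOCATION', 'Boston_LOCATION', 'Boston_LOCATION'],
]

raw_counts = Counter()
story_counts = Counter()
for entities in stories:
    raw_counts.update(entities)         # every mention counts
    story_counts.update(set(entities))  # each name counts once per story

print(raw_counts['Boston_LOCATION'], story_counts['Boston_LOCATION'])
```

A name repeated many times within a single story then contributes only one count, which may reduce the influence of individual long articles.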