---
Exercises: Named Entity Recognition (NER)
===

aka, PERSON Classification
----

![](http://cdn4.teehunter.com/wp-content/uploads/2015/10/wheres-waldo-via-the-telegraph.jpg)

__Where's Waldo in text?__

In [1]:
reset -fs

In [2]:
import nltk

In [5]:
 nltk.ne_chunk(pos_history_nltk[:100], binary=True)

In [3]:
text = "Dr. Brian Spiering works at Galvanize / GalvanizeU, powered by the University of New Haven, in San Francisco, California."
tagged = nltk.pos_tag(nltk.word_tokenize(text))
print(nltk.ne_chunk(tagged,
                    binary=True))

(S
  Dr./NNP
  (NE Brian/NNP Spiering/NNP)
  works/VBZ
  at/IN
  (NE Galvanize/NNP)
  //NNP
  GalvanizeU/NNP
  ,/,
  powered/VBN
  by/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE New/NNP Haven/NNP)
  ,/,
  in/IN
  (NE San/NNP Francisco/NNP)
  ,/,
  (NE California/NNP)
  ./.)


In [1]:
# TODO Change the function call so the output includes more details than just NE/Named Entity

In [7]:
print(nltk.ne_chunk(tagged,
                    binary=False))

(S
  Dr./NNP
  (PERSON Brian/NNP Spiering/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Galvanize/NNP)
  //NNP
  GalvanizeU/NNP
  ,/,
  powered/VBN
  by/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (GPE New/NNP Haven/NNP)
  ,/,
  in/IN
  (GPE San/NNP Francisco/NNP)
  ,/,
  (GPE California/NNP)
  ./.)


It should look something like this:
```
(S
  Dr./NNP
  (PERSON Brian/NNP Spiering/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Galvanize/NNP)
  //NNP
  GalvanizeU/NNP
  ,/,
  powered/VBN
  by/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (GPE New/NNP Haven/NNP)
  ,/,
  in/IN
  (GPE San/NNP Francisco/NNP)
  ,/,
  (GPE California/NNP)
  ./.)
```

In [5]:
# TODO:
# How did the tagger do? 
# Is it surprising that it did so well?
# If it missed a tag, why do you think it did?

> The tagger did decently well. It recognized Brian Spiering as a person (but left out the "Dr." title) and recognized Galvanize as an Organization. However, it split University of New Haven, tagging "University" as an Organization and "New Haven" as a GPE. This is probably because New Haven is a physical location by itself, but part of an organization's name when combined with University.

---
Let's compare that to Stanford's NER Tagger
---

[Stanford NER (Named Entity Recognizer)](http://nlp.stanford.edu/software/CRF-NER.shtml) is one of the most popular Named Entity Recognition tools and is implemented in Java.

---
Installing Java
-----

You're required to have Java working on your system because Java code is at the heart of the program. However, you do __not__ need to worry about the Java part because we provide the commands to compile and run the program.

Assuming you are on a Mac, make sure you have the `homebrew` package manager installed.

In [13]:
import os

In [14]:
if os.system("java -version") == 32512: # Value for 'command not found'
    os.system("brew doctor")
    os.system("brew update")
    os.system("brew install cask")
    os.system("brew cask install java")

You might have to run `brew cask install java` in iTerm (or another terminal) because installing Java sometimes requires a password.
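
The value `32512` in the check above is the raw status that `os.system` returns on Unix-like systems when the shell exits with code 127 ("command not found"). As a hedged alternative, the sketch below asks the standard library whether a `java` executable is on the `PATH`. The helper name `java_available` is made up for illustration.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


# 32512 is exit code 127 shifted left by 8 bits (the Unix wait-status encoding).
print(32512 >> 8)  # 127
print(java_available())
```

This avoids depending on a platform-specific status number, although it only confirms that an executable named `java` exists, not that it runs correctly.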

Now we are ready to play with the Stanford NER Tagger

In [15]:
from nltk.tag.stanford import StanfordNERTagger

Check out the [source code](http://www.nltk.org/_modules/nltk/tag/stanford.html)

It requires external dependencies, including .jar files (Java archives) and classifiers.

We need to download those external dependencies. Let's grab a new-ish version: `Stanford Named Entity Recognizer version 3.4`

In [16]:
base_path = "./stanford-ner-2014-06-16"

# If .jar file is not local, download it and set it up
if not os.path.isfile(base_path+"/stanford-ner.jar"):
    import urllib.request  # plain `import urllib` does not load the `request` submodule
    import shutil
    
    url = "http://nlp.stanford.edu/software/stanford-ner-2014-06-16.zip"
    file_name = url.split("/")[-1]
    
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    
    os.system("brew tap homebrew/dupes") # Install unzip prerequisites
    os.system("brew install unzip") # Install unzip
    os.system("unzip "+file_name) # Unzip
    
    # Now is time to hack...
    # It turns out nltk hardcodes file names, thus we have to manually update the name
    os.rename(base_path+"/classifiers/english.all.3class.distsim.crf.ser.gz", base_path+"/classifiers/all.3class.distsim.crf.ser.gz")     
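
The download cell above shells out to Homebrew's `unzip`. As a sketch of an alternative that needs no external tools, the function below downloads and extracts an archive with only the standard library. The name `fetch_and_unzip` and its `dest_dir` parameter are illustrative assumptions, not part of NLTK or the Stanford distribution.

```python
import os
import shutil
import urllib.request
import zipfile


def fetch_and_unzip(url, dest_dir="."):
    """Download the zip archive at `url` and extract it into `dest_dir`."""
    archive_name = os.path.join(dest_dir, url.split("/")[-1])
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(archive_name, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    # zipfile ships with Python, so no external `unzip` binary is required.
    with zipfile.ZipFile(archive_name) as archive:
        archive.extractall(dest_dir)
    return archive_name


# Example call (not executed here because it needs network access):
# fetch_and_unzip("http://nlp.stanford.edu/software/stanford-ner-2014-06-16.zip")
```

The classifier rename step from the cell above would still be needed afterwards.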

We are ready to define our tagger!

In [17]:
st = StanfordNERTagger(base_path+'/classifiers/all.3class.distsim.crf.ser.gz',
                       base_path+'/stanford-ner.jar')

Let's tag some words

In [18]:
st.tag(nltk.word_tokenize(text))

[('Dr.', 'O'),
 ('Brian', 'PERSON'),
 ('Spiering', 'PERSON'),
 ('works', 'O'),
 ('at', 'O'),
 ('Galvanize', 'LOCATION'),
 ('', 'O'),
 ('GalvanizeU', 'O'),
 (',', 'O'),
 ('powered', 'O'),
 ('by', 'O'),
 ('the', 'O'),
 ('University', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('New', 'ORGANIZATION'),
 ('Haven', 'ORGANIZATION'),
 (',', 'O'),
 ('in', 'O'),
 ('San', 'LOCATION'),
 ('Francisco', 'LOCATION'),
 (',', 'O'),
 ('California', 'LOCATION'),
 ('.', 'O')]

What does the 'O' mean?

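> The 'O' stands for "outside" (or "other"): the token is not part of any named entity that the classifier recognizes. It is not a part-of-speech tag. On this sentence Stanford performs better than NLTK's built-in NE chunker (it tags all of University of New Haven as an Organization), although it mislabels Galvanize as a Location.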
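
Stanford labels one token at a time, whereas NLTK's chunker groups tokens into subtrees. To read multi-word entities out of the Stanford output, one option is to merge consecutive tokens that share the same non-'O' label. The sketch below assumes that simple rule. The helper name `group_entities` is hypothetical, and `sample` is a slice of the output shown above.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of consecutive tokens with the same non-'O' label into entities."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != "O":
            entities.append((" ".join(token for token, _ in group), label))
    return entities


sample = [("the", "O"), ("University", "ORGANIZATION"), ("of", "ORGANIZATION"),
          ("New", "ORGANIZATION"), ("Haven", "ORGANIZATION"), (",", "O"),
          ("in", "O"), ("San", "LOCATION"), ("Francisco", "LOCATION"), (",", "O")]

print(group_entities(sample))
# [('University of New Haven', 'ORGANIZATION'), ('San Francisco', 'LOCATION')]
```

One limitation of this rule: two distinct entities of the same type that sit next to each other with no 'O' token between them are merged into one.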

In [None]:
# TODO: Write code to tag all the 'O' tokens with more descriptive labels. 

In [31]:
nltk_tags = nltk.pos_tag(nltk.word_tokenize(text))

In [85]:
ner_tags = st.tag(nltk.word_tokenize(text))
ner_tags


In [34]:
# Keep the Stanford NER tag where there is one; otherwise fall back to NLTK's POS tag
print([stan if stan[1] != 'O' else pos for stan, pos in zip(ner_tags, nltk_tags)])

[('Dr.', 'NNP'), ('Brian', 'PERSON'), ('Spiering', 'PERSON'), ('works', 'VBZ'), ('at', 'IN'), ('Galvanize', 'LOCATION'), ('/', 'NNP'), ('GalvanizeU', 'NNP'), (',', ','), ('powered', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('University', 'ORGANIZATION'), ('of', 'ORGANIZATION'), ('New', 'ORGANIZATION'), ('Haven', 'ORGANIZATION'), (',', ','), ('in', 'IN'), ('San', 'LOCATION'), ('Francisco', 'LOCATION'), (',', ','), ('California', 'LOCATION'), ('.', '.')]


<details><summary>
Click here for a hint.
</summary>
Conditionally combine both sets of tags
</details>

------
Compare NER tags for NLTK default vs. Stanford
----

Use "History of the United States" by Charles A. Beard and Mary R. Beard as the corpus.

Load the file, then perform NER twice:

1. First with NLTK default
2. Second with Stanford

Compare performance on:

1. Correctness
2. Runtime

Is there a pattern to the misses?

Which tagger would you use? Why? When?

Hints: 
- Start with a small sample to test your code
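
One way to put a number on "correctness" is to compare each tagger's entities against a small hand-labeled answer key and compute precision and recall. The sketch below assumes entities are represented as `(text, label)` pairs. The `gold` set is a hypothetical answer key for the example sentence used earlier, and the `stanford` set is transcribed by hand from the Stanford output for that sentence.

```python
def precision_recall(predicted, gold):
    """Score a set of predicted (text, label) entities against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labeled answer key (an assumption, not from the notebook).
gold = {("Brian Spiering", "PERSON"), ("Galvanize", "ORGANIZATION"),
        ("University of New Haven", "ORGANIZATION"),
        ("San Francisco", "LOCATION"), ("California", "LOCATION")}

# Entities read off the Stanford output for the example sentence.
stanford = {("Brian Spiering", "PERSON"), ("Galvanize", "LOCATION"),
            ("University of New Haven", "ORGANIZATION"),
            ("San Francisco", "LOCATION"), ("California", "LOCATION")}

precision, recall = precision_recall(stanford, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.80 recall=0.80
```

Exact matching on both text and label is strict: a partially correct span (for example, "New Haven" instead of "University of New Haven") counts as a miss.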

In [37]:
with open('../../corpora/history_usa.txt') as f:
    history_text = f.read()

In [44]:
tokens_history_nltk = nltk.word_tokenize(history_text) 

In [45]:
pos_history_nltk = nltk.pos_tag(tokens_history_nltk)

In [51]:
pos_history_nltk[:20]

[('HISTORY', 'NN'),
 ('OF', 'IN'),
 ('THE', 'NNP'),
 ('UNITED', 'NNP'),
 ('STATES', 'NNPS'),
 ('BY', 'NNP'),
 ('CHARLES', 'NNP'),
 ('A', 'NNP'),
 ('.', '.'),
 ('BEARD', 'NNP'),
 ('AND', 'NNP'),
 ('MARY', 'NNP'),
 ('R.', 'NNP'),
 ('BEARD', 'NNP'),
 ('New', 'NNP'),
 ('York', 'NNP'),
 ('THE', 'NNP'),
 ('MACMILLAN', 'NNP'),
 ('COMPANY', 'NNP'),
 ('1921', 'CD')]

In [66]:
ner_nltk_history = nltk.ne_chunk(pos_history_nltk[:100], binary=False)
print(ner_nltk_history)

(S
  HISTORY/NN
  OF/IN
  (ORGANIZATION THE/NNP)
  (ORGANIZATION UNITED/NNP)
  STATES/NNPS
  BY/NNP
  CHARLES/NNP
  A/NNP
  ./.
  (ORGANIZATION BEARD/NNP)
  AND/NNP
  MARY/NNP
  R./NNP
  (ORGANIZATION BEARD/NNP New/NNP York/NNP)
  THE/NNP
  (ORGANIZATION MACMILLAN/NNP)
  COMPANY/NNP
  1921/CD
  _All/NNP
  rights/NNS
  reserved_/NN
  (ORGANIZATION COPYRIGHT/NNP)
  ,/,
  1921/CD
  ,/,
  BY/NNP
  (ORGANIZATION THE/NNP)
  (ORGANIZATION MACMILLAN/NNP)
  COMPANY/NNP
  ./.
  (PERSON Set/NNP)
  up/RB
  and/CC
  electrotyped/VBD
  ./.
  Published/VBN
  (PERSON March/NNP)
  ,/,
  1921/CD
  ./.
  (PERSON Norwood/NNP Press/NNP)
  J.S/NNP
  ./.
  Cushing/NNP
  Co./NNP
  --/:
  (PERSON Berwick/NNP)
  &/CC
  (PERSON Smith/NNP)
  Co./NNP
  NORWOOD/NNP
  ,/,
  MASS./NNP
  ,/,
  (GPE U.S.A/NNP)
  ./.
  PREFACE/NNP
  As/IN
  things/NNS
  now/RB
  stand/VBP
  ,/,
  the/DT
  course/NN
  of/IN
  instruction/NN
  in/IN
  (GPE American/JJ)
  history/NN
  in/IN
  our/PRP$
  public/JJ
  schools/NNS
  embraces/N

In [47]:
# NLTK is very slow for POS

In [56]:
ner_tags = st.tag(nltk.word_tokenize(history_text)[:100])

In [57]:
ner_tags

[('HISTORY', 'O'),
 ('OF', 'O'),
 ('THE', 'O'),
 ('UNITED', 'O'),
 ('STATES', 'O'),
 ('BY', 'O'),
 ('CHARLES', 'O'),
 ('A', 'O'),
 ('.', 'O'),
 ('BEARD', 'O'),
 ('AND', 'O'),
 ('MARY', 'PERSON'),
 ('R.', 'PERSON'),
 ('BEARD', 'PERSON'),
 ('New', 'LOCATION'),
 ('York', 'LOCATION'),
 ('THE', 'O'),
 ('MACMILLAN', 'O'),
 ('COMPANY', 'O'),
 ('1921', 'O'),
 ('_All', 'O'),
 ('rights', 'O'),
 ('reserved_', 'O'),
 ('COPYRIGHT', 'O'),
 (',', 'O'),
 ('1921', 'O'),
 (',', 'O'),
 ('BY', 'O'),
 ('THE', 'O'),
 ('MACMILLAN', 'O'),
 ('COMPANY', 'O'),
 ('.', 'O'),
 ('Set', 'O'),
 ('up', 'O'),
 ('and', 'O'),
 ('electrotyped', 'O'),
 ('.', 'O'),
 ('Published', 'O'),
 ('March', 'O'),
 (',', 'O'),
 ('1921', 'O'),
 ('.', 'O'),
 ('Norwood', 'ORGANIZATION'),
 ('Press', 'ORGANIZATION'),
 ('J.S', 'O'),
 ('.', 'O'),
 ('Cushing', 'ORGANIZATION'),
 ('Co.', 'ORGANIZATION'),
 ('--', 'O'),
 ('Berwick', 'ORGANIZATION'),
 ('&', 'ORGANIZATION'),
 ('Smith', 'ORGANIZATION'),
 ('Co.', 'ORGANIZATION'),
 ('NORWOOD', 'LOCATION

- NLTK gives more information than Stanford for NER (it gives the part of speech in addition to the NE label).
- Stanford only gives the NE label. Stanford is 'more correct' for NER, but it comes at the expense of less descriptive labels for the other category.
- NLTK misses some named entities (e.g. University of New Haven), but it gives more context for the other tags.

In [67]:
import time

In [70]:
%timeit nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(history_text))[:100], binary=False)

1 loop, best of 3: 13.9 s per loop


In [69]:
%timeit st.tag(nltk.word_tokenize(history_text)[:100])

1 loop, best of 3: 4.9 s per loop


> NLTK is almost 3x slower than the Stanford NER system here (13.9 s vs. 4.9 s), although the NLTK timing also includes POS-tagging the entire corpus before slicing to 100 tokens. I would use the Stanford NER system if I wanted to capture the majority of named entities without worrying about the other category. However, if I were more interested in gaining as much context as I could from the corpus, then NLTK may be better. The size of the corpus is a consideration as well.
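
The `%timeit` lines above tokenize the whole corpus inside the timed expression, so part of the measured time is preprocessing rather than tagging. A fairer comparison prepares the tokens once and times only the tagger call. The sketch below shows the idea with the standard library. `best_time` is a hypothetical helper and `dummy_tagger` is a stand-in so the block runs without NLTK or Java.

```python
import time


def best_time(fn, *args, repeat=3):
    """Return the fastest wall-clock time (in seconds) over `repeat` calls to fn(*args)."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def dummy_tagger(tokens):
    """Stand-in for a real tagger; labels every token 'O'."""
    return [(token, "O") for token in tokens]


tokens = "HISTORY OF THE UNITED STATES".split()  # prepared once, outside the timer
elapsed = best_time(dummy_tagger, tokens)
print(f"{elapsed:.6f} s")
```

With the real taggers, the same helper could be called as `best_time(st.tag, tokens)` after tokenizing and slicing the corpus once.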

----
Challenge Exercises
-----

Try NER with spaCy

In [1]:
# TODO: load and setup spacy

In [72]:
# spaCy 1.x import path; spaCy 2+ removed `spacy.en` (use `spacy.load("en_core_web_sm")` there)
from spacy.en import English

In [78]:
nlp = English()

In [79]:
# Test
tokens = nlp("My name is Brian")
print(tokens)
for token in tokens:
    print(token, token.tag_, token.ent_type_, sep="\t| ")

# Output:
"""
My	| PRP$	| 
name	| NN	| 
is	| VBZ	| 
Brian	| NNP	| PERSON
"""

My name is Brian
My	| PRP$	| 
name	| NN	| 
is	| VBZ	| 
Brian	| NNP	| PERSON


'\nMy\t| PRP$\t| \nname\t| NN\t| \nis\t| VBZ\t| \nBrian\t| NNP\t| PERSON\n'

In [76]:
# TODO: Repeat NER for the same text used above

In [92]:

tokens = nlp(history_text[:550])
for token in tokens:
    print(token, token.tag_, token.ent_type_, sep="\t| ")

HISTORY	| NN	| 


	| SP	| 
OF	| IN	| 
THE	| DT	| 


	| SP	| 
UNITED	| NNP	| 
STATES	| NNP	| 



	| SP	| 
BY	| IN	| 



	| SP	| 
CHARLES	| NNP	| 
A.	| NN	| 
BEARD	| NNP	| 


	| SP	| 
AND	| CC	| 


	| SP	| 
MARY	| NNP	| PERSON
R.	| NNP	| PERSON
BEARD	| NNP	| PERSON




	| SP	| 
New	| NNP	| GPE
York	| NNP	| GPE


	| SP	| 
THE	| DT	| 
MACMILLAN	| NNP	| 
COMPANY	| NNP	| 


	| SP	| 
1921	| CD	| 


	| SP	| 
_All	| DT	| 
rights	| NNS	| 
reserved_	| NN	| 


	| SP	| 
COPYRIGHT	| NN	| 
,	| ,	| 
1921	| CD	| 
,	| ,	| 


	| SP	| 
BY	| IN	| 
THE	| DT	| 
MACMILLAN	| NNP	| 
COMPANY.	| NN	| 



	| SP	| 
Set	| VB	| 
up	| RP	| 
and	| CC	| 
electrotyped	| VBN	| 
.	| .	| 
Published	| VBN	| 
March	| NNP	| 
,	| ,	| 
1921	| CD	| 
.	| .	| 





	| SP	| 
Norwood	| NNP	| ORG
Press	| NNP	| ORG


	| SP	| 
J.S.	| NNP	| PERSON
Cushing	| NNP	| PERSON
Co.--Berwick	| NNP	| PERSON
&	| CC	| 
Smith	| NNP	| 
Co.	| NNP	| 


	| SP	| 
NORWOOD	| NNP	| ORG
,	| ,	| 
MASS.	| NNP	| GPE
,	| ,	| 
U.S.A.	| NNP	| 





	| SP	| 
PREFACE

In [5]:
# TODO: Benchmark spacy speed

In [93]:
%timeit nlp(history_text[:550])

100 loops, best of 3: 6.55 ms per loop


In [86]:
%timeit st.tag(nltk.word_tokenize(history_text)[:100])

1 loop, best of 3: 4.59 s per loop


In [87]:
%timeit nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(history_text))[:100], binary=False)

1 loop, best of 3: 13.9 s per loop


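> spaCy is incredibly fast: about 6.55 ms per loop, versus 4.59 s for Stanford and 13.9 s for NLTK on a roughly comparable sample.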

In [None]:
nlp(history_text[:100])

<br>
<br>
----