---
Exercises: Named Entity Recognition (NER)
===

aka, PERSON Classification
----

![](http://cdn4.teehunter.com/wp-content/uploads/2015/10/wheres-waldo-via-the-telegraph.jpg)

__Where's Waldo in text?__

In [1]:
reset -fs

In [12]:
import nltk

In [14]:
text = "Dr. Brian Spiering works at Galvanize / GalvanizeU, powered by the University of New Haven, in San Francisco, California."
tagged = nltk.pos_tag(nltk.word_tokenize(text))
print(nltk.ne_chunk(tagged,
                    binary=True))

(S
  Dr./NNP
  (NE Brian/NNP Spiering/NNP)
  works/VBZ
  at/IN
  (NE Galvanize/NNP)
  //NNP
  GalvanizeU/NNP
  ,/,
  powered/VBN
  by/IN
  the/DT
  (NE University/NNP)
  of/IN
  (NE New/NNP Haven/NNP)
  ,/,
  in/IN
  (NE San/NNP Francisco/NNP)
  ,/,
  (NE California/NNP)
  ./.)


In [1]:
# TODO Change the function call so the output includes more details than just NE/Named Entity

It should look something like this:
```
(S
  Dr./NNP
  (PERSON Brian/NNP Spiering/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Galvanize/NNP)
  //NNP
  GalvanizeU/NNP
  ,/,
  powered/VBN
  by/IN
  the/DT
  (ORGANIZATION University/NNP)
  of/IN
  (GPE New/NNP Haven/NNP)
  ,/,
  in/IN
  (GPE San/NNP Francisco/NNP)
  ,/,
  (GPE California/NNP)
  ./.)
```

In [5]:
# TODO:
# How did the tagger do? 
# Is it surprising that it did so well?
# If it missed a tag, why do you think it did?

---
Let's compare that to Stanford's NER Tagger
---

[Stanford NER (Named Entity Recognizer)](http://nlp.stanford.edu/software/CRF-NER.shtml) is one of the most popular Named Entity Recognition tools. It is implemented in Java.

---
Installing Java
-----

You're required to have Java working on your system because Java code is at the heart of the program. However, you do __not__ need to worry about the Java part because we provide the commands to compile and run the program.

Assuming you are on a Mac, make sure you have the `homebrew` package manager installed.

In [6]:
import os

In [7]:
if os.system("java -version") == 32512: # Value for 'command not found'
    os.system("brew doctor")
    os.system("brew update")
    os.system("brew install cask")
    os.system("brew cask install java")

You might have to run `brew cask install java` yourself in a terminal such as iTerm, because installing Java sometimes requires a password.
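The exit-code comparison above relies on how `os.system` reports "command not found" on Unix-like systems (32512 is exit status 127 shifted left by 8 bits). As a hedged alternative, the sketch below asks the standard library's `shutil.which` whether an executable is on the `PATH`. The helper name `has_command` is made up for this example and is not part of the original notebook.

```python
import shutil


def has_command(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if has_command("java"):
    print("Java found at", shutil.which("java"))
else:
    print("Java not found; install it before using the Stanford tagger.")
```

This only checks that a `java` executable exists; it does not confirm that the installed version works with the tagger.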

Now we are ready to play with the Stanford NER Tagger.

In [8]:
from nltk.tag.stanford import StanfordNERTagger

Check out the [source code](http://www.nltk.org/_modules/nltk/tag/stanford.html)

It requires external dependencies, including .jar files (Java archives) and classifiers.

We need to download those external dependencies. Let's grab a new-ish version: `Stanford Named Entity Recognizer version 3.4`

In [9]:
base_path = "./stanford-ner-2014-06-16"

# If .jar file is not local, download it and set it up
if not os.path.isfile(base_path+"/stanford-ner.jar"):
    import urllib.request  # `import urllib` alone does not load the request submodule
    import shutil
    
    url = "http://nlp.stanford.edu/software/stanford-ner-2014-06-16.zip"
    file_name = url.split("/")[-1]
    
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    
    os.system("brew tap homebrew/dupes") # Install unzip prerequisites
    os.system("brew install unzip") # Install unzip
    os.system("unzip "+file_name) # Unzip
    
    # Now is time to hack...
    # It turns out nltk hardcodes file names, thus we have to manually update the name
    os.rename(base_path+"/classifiers/english.all.3class.distsim.crf.ser.gz", base_path+"/classifiers/all.3class.distsim.crf.ser.gz")     

We are ready to define our tagger!

In [10]:
st = StanfordNERTagger(base_path+'/classifiers/all.3class.distsim.crf.ser.gz',
                       base_path+'/stanford-ner.jar')

Let's tag some words

In [15]:
st.tag(nltk.word_tokenize(text))

[('Dr.', 'O'),
 ('Brian', 'PERSON'),
 ('Spiering', 'PERSON'),
 ('works', 'O'),
 ('at', 'O'),
 ('Galvanize', 'LOCATION'),
 ('', 'O'),
 ('GalvanizeU', 'O'),
 (',', 'O'),
 ('powered', 'O'),
 ('by', 'O'),
 ('the', 'O'),
 ('University', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('New', 'ORGANIZATION'),
 ('Haven', 'ORGANIZATION'),
 (',', 'O'),
 ('in', 'O'),
 ('San', 'LOCATION'),
 ('Francisco', 'LOCATION'),
 (',', 'O'),
 ('California', 'LOCATION'),
 ('.', 'O')]
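The tagger labels one token at a time, so a multi-word name such as "University of New Haven" arrives as four separate rows. A minimal sketch of stitching adjacent tokens with the same label back into phrases follows. It uses only the standard library and a subset of the output above, copied in by hand; `group_entities` and its `skip_label` parameter are hypothetical names for this example.

```python
from itertools import groupby

# A subset of the token-level output shown above, copied by hand.
tagged = [
    ("Dr.", "O"),
    ("Brian", "PERSON"),
    ("Spiering", "PERSON"),
    ("works", "O"),
    ("at", "O"),
    ("Galvanize", "LOCATION"),
    (",", "O"),
    ("University", "ORGANIZATION"),
    ("of", "ORGANIZATION"),
    ("New", "ORGANIZATION"),
    ("Haven", "ORGANIZATION"),
    (",", "O"),
    ("San", "LOCATION"),
    ("Francisco", "LOCATION"),
    (",", "O"),
    ("California", "LOCATION"),
]


def group_entities(tagged_tokens, skip_label="O"):
    """Merge runs of adjacent tokens that share a label into (phrase, label) pairs."""
    entities = []
    for label, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != skip_label:
            entities.append((" ".join(token for token, _ in run), label))
    return entities


entities = group_entities(tagged)
for phrase, label in entities:
    print(label, phrase, sep="\t")
```

One limitation of this approach: two different entities with the same label that sit directly next to each other, with no other token in between, would be merged into one phrase.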

What does the 'O' mean?

In [None]:
# TODO: Write code to tag all the 'O' tokens with more descriptive labels. 

<details><summary>
Click here for a hint.
</summary>
Conditionally combine both sets of tags
</details>

------
Compare NER tags for nltk default vs. Stanford. 
----

Use "History of the United States" by Charles A. Beard and Mary R. Beard as the corpus.

Load the file, then perform NER twice:

1. First with nltk default
2. Second with Stanford

Compare performance on:

1. Correctness
2. Runtime

Is there a pattern to the misses?

Which tagger would you use? Why? When?

Hints: 
- Start with a small sample to test your code
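For the runtime and correctness comparison, one possible harness is sketched below. It assumes each tagger can be wrapped as a function that takes a list of tokens and returns `(token, label)` pairs. The names `time_tagger`, `disagreements`, `toy_tagger_a`, and `toy_tagger_b` are all hypothetical; the two toy taggers are stand-ins so the sketch runs without any models, and they are not real NER systems.

```python
import time


def time_tagger(tag_fn, tokens, repeats=3):
    """Return (best wall-clock seconds, last result) for tag_fn(tokens)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = tag_fn(tokens)
        best = min(best, time.perf_counter() - start)
    return best, result


def disagreements(tags_a, tags_b):
    """List (token, label_a, label_b) where two taggings of the same tokens differ."""
    return [
        (token, a, b)
        for (token, a), (_, b) in zip(tags_a, tags_b)
        if a != b
    ]


# Hypothetical stand-in taggers: capitalization heuristics, not real NER.
def toy_tagger_a(tokens):
    return [(t, "ENTITY" if t[0].isupper() else "O") for t in tokens]


def toy_tagger_b(tokens):
    return [(t, "ENTITY" if t.istitle() and len(t) > 2 else "O") for t in tokens]


sample = "Charles Beard wrote about the United States in NY".split()
seconds_a, tags_a = time_tagger(toy_tagger_a, sample)
seconds_b, tags_b = time_tagger(toy_tagger_b, sample)
print(f"a: {seconds_a:.6f}s  b: {seconds_b:.6f}s")
print(disagreements(tags_a, tags_b))
```

When swapping in the real taggers, note that their label sets differ (for example `GPE` in the nltk output versus `LOCATION` in the Stanford output above), so labels need to be mapped to a common scheme before disagreements are counted. Both taggers also need to receive the same token list for the pairwise comparison to line up.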

----
Challenge Exercises
-----

Try NER with spaCy

In [1]:
# TODO: load and setup spacy

In [3]:
# Test
tokens = nlp("My name is Brian")

for token in tokens:
    print(token, token.tag_, token.ent_type_, sep="\t| ")

# Output:
"""
My	| PRP$	| 
name	| NN	| 
is	| VBZ	| 
Brian	| NNP	| PERSON
"""

NameError: name 'nlp' is not defined

In [4]:
# TODO: Repeat NER for the same text used above

In [5]:
# TODO: Benchmark spacy speed

<br>
<br>
----