Named entity recognition
=========

The *Editor* demo uses named entity recognition (NER) to identify people, places, and organizations in text.

Here we'll try to run the Stanford NER system ourselves.

To do this, we need to download the software and model files from here: http://nlp.stanford.edu/software/CRF-NER.shtml

In [21]:
from nltk.tag.stanford import StanfordNERTagger  # NLTK has a wrapper for Stanford NER
import nltk
import os

In [22]:
javapath = '/usr/bin/java'

In [23]:
# NLTK apparently wants to use /usr/lib/jvm/default-java, which is out of date on this machine.
# Pointing NLTK at the Java binary explicitly fixed that.
nltk.internals.config_java(bin=javapath)
os.environ['JAVAHOME'] = javapath

[Found /usr/bin/java: /usr/bin/java]


When constructing the tagger, the first argument is the model file and the second is the jar file that contains the Stanford NER code.

In [24]:
tagger = StanfordNERTagger('/home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz',
                          '/home/jacobe/stanford-ner-2015-12-09/stanford-ner-3.6.0.jar')
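
Both arguments point into the same install directory, so a small helper can build them and fail early if either file is missing. This is only a sketch: `ner_paths` is a hypothetical name, not part of NLTK, and the default file names are simply the ones used in the cell above.

```python
import os


def ner_paths(install_dir,
              model='english.all.3class.distsim.crf.ser.gz',
              jar='stanford-ner-3.6.0.jar'):
    """Return (model_path, jar_path) under install_dir, checking both exist.

    Hypothetical helper; assumes the layout used in this notebook, with
    models in a 'classifiers' subdirectory and the jar at the top level.
    """
    model_path = os.path.join(install_dir, 'classifiers', model)
    jar_path = os.path.join(install_dir, jar)
    for path in (model_path, jar_path):
        if not os.path.isfile(path):
            raise IOError('Stanford NER file not found: ' + path)
    return model_path, jar_path


# Example (not executed here):
# tagger = StanfordNERTagger(*ner_paths('/home/jacobe/stanford-ner-2015-12-09'))
```

A mistyped path then surfaces as a Python error at construction time rather than as a Java failure when the tagger is first called.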

Let's try this example paragraph from http://www.scientificamerican.com/article/apple-fears-court-order-will-open-pandora-s-box-for-iphone-security-video11/

In [25]:
para = 'As U.S. law enforcement escalates its battle to keep criminals from concealing their communication on digital devices or “going dark,” Apple CEO Tim Cook is digging in his heels in resisting government directives to support their investigations. A federal judge in California on Tuesday ordered Apple to step up efforts to help the FBI search the locked iPhone 5c used by Syed Rizwan Farook, who, along with wife Tashfeen Malik, is suspected of a mass shooting at a December 2 holiday party last year in San Bernardino, Calif., that killed 14 people and injured 22. Cook quickly countered the court action with an open letter posted to his company’s site suggesting that the FBI’s request could open a Pandora’s box that undermines security on all iPhones.'

In [30]:
print(para)

As U.S. law enforcement escalates its battle to keep criminals from concealing their communication on digital devices or “going dark,” Apple CEO Tim Cook is digging in his heels in resisting government directives to support their investigations. A federal judge in California on Tuesday ordered Apple to step up efforts to help the FBI search the locked iPhone 5c used by Syed Rizwan Farook, who, along with wife Tashfeen Malik, is suspected of a mass shooting at a December 2 holiday party last year in San Bernardino, Calif., that killed 14 people and injured 22. Cook quickly countered the court action with an open letter posted to his company’s site suggesting that the FBI’s request could open a Pandora’s box that undermines security on all iPhones.


In [27]:
tagger.tag(para.split(' '))

CRFClassifier invoked on Tue Feb 23 14:02:21 EST 2016 with arguments:
   -loadClassifier /home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz -textFile /tmp/tmprcL0b_ -outputFormat slashTags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerOptions "tokenizeNLs=false" -encoding utf8
tokenizerFactory=edu.stanford.nlp.process.WhitespaceTokenizer
tokenizerOptions="tokenizeNLs=false"
loadClassifier=/home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz
encoding=utf8
textFile=/tmp/tmprcL0b_
outputFormat=slashTags
Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
	at edu.stanford.nlp.io.IOUtils.<clinit>(IOUtils.java:42)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifier(AbstractSequenceClassifier.java:1484)
	at edu.stanford.nlp.ie.AbstractSequenceClassifier.loadClassifierNoExceptions(AbstractSequenceClassifier.java:1497)
	at edu.stanford.nlp.ie.crf.CRFClass

OSError: Java command failed : ['/usr/bin/java', '-mx1000m', '-cp', '/home/jacobe/stanford-ner-2015-12-09/stanford-ner-3.6.0.jar', 'edu.stanford.nlp.ie.crf.CRFClassifier', '-loadClassifier', '/home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz', '-textFile', '/tmp/tmprcL0b_', '-outputFormat', 'slashTags', '-tokenizerFactory', 'edu.stanford.nlp.process.WhitespaceTokenizer', '-tokenizerOptions', '"tokenizeNLs=false"', '-encoding', 'utf8']
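
The traceback complains about a missing class, `org/slf4j/LoggerFactory`, and the failing command passes only the NER jar to `-cp`. One possible workaround, which I have not verified against this install, is to build the same Java command by hand with extra jars on the classpath. In the sketch below, `build_ner_command` is a hypothetical helper, and the SLF4J jar path and `input.txt` are placeholders; the remaining flags are copied from the failing command above.

```python
import os


def build_ner_command(java, ner_dir, model, text_file, extra_jars=()):
    """Return the argument list for running CRFClassifier on text_file.

    extra_jars are appended to the classpath after the NER jar. Whether a
    given jar supplies the missing class is an assumption to check locally.
    """
    jars = [os.path.join(ner_dir, 'stanford-ner-3.6.0.jar')] + list(extra_jars)
    classpath = os.pathsep.join(jars)  # ':' on Linux and macOS, ';' on Windows
    return [java, '-mx1000m', '-cp', classpath,
            'edu.stanford.nlp.ie.crf.CRFClassifier',
            '-loadClassifier', os.path.join(ner_dir, 'classifiers', model),
            '-textFile', text_file,
            '-outputFormat', 'slashTags']


# Build (but do not run) a command; the SLF4J path is a placeholder.
cmd = build_ner_command('/usr/bin/java',
                        '/home/jacobe/stanford-ner-2015-12-09',
                        'english.all.3class.distsim.crf.ser.gz',
                        'input.txt',
                        extra_jars=['/path/to/slf4j-api.jar'])
print(' '.join(cmd))
```

Passing `cmd` to `subprocess.check_output` would run the tagger and return its slash-tagged output, assuming Java and all the listed jars are actually present.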

## Shell version ##

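I couldn't get NLTK's wrapper to work in time for class: the Java process dies with a `NoClassDefFoundError` for `org/slf4j/LoggerFactory`, and the command NLTK builds puts only `stanford-ner-3.6.0.jar` on the classpath. So here's how to run the tagger on the command line instead, using the `ner.sh` script in the Stanford NER directory.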

In [28]:
with open('sample.txt', 'w') as fout:
    fout.write(para + '\n')  # same effect as print >>fout, para, and also valid in Python 3

In [29]:
%%bash

/home/jacobe/stanford-ner-2015-12-09/ner.sh sample.txt

As/O U.S./LOCATION law/O enforcement/O escalates/O its/O battle/O to/O keep/O criminals/O from/O concealing/O their/O communication/O on/O digital/O devices/O or/O ``/O going/O dark/O ,/O ''/O Apple/O CEO/O Tim/PERSON Cook/PERSON is/O digging/O in/O his/O heels/O in/O resisting/O government/O directives/O to/O support/O their/O investigations/O ./O 
A/O federal/O judge/O in/O California/LOCATION on/O Tuesday/O ordered/O Apple/ORGANIZATION to/O step/O up/O efforts/O to/O help/O the/O FBI/ORGANIZATION search/O the/O locked/O iPhone/O 5c/O used/O by/O Syed/PERSON Rizwan/PERSON Farook/PERSON ,/O who/O ,/O along/O with/O wife/O Tashfeen/PERSON Malik/PERSON ,/O is/O suspected/O of/O a/O mass/O shooting/O at/O a/O December/O 2/O holiday/O party/O last/O year/O in/O San/LOCATION Bernardino/LOCATION ,/O Calif./LOCATION ,/O that/O killed/O 14/O people/O and/O injured/O 22/O ./O 
Cook/PERSON quickly/O countered/O the/O court/O action/O with/O an/O open/O letter/O posted/O to/O his/O company/O 's/

CRFClassifier invoked on Tue Feb 23 14:02:23 EST 2016 with arguments:
   -loadClassifier /home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz -textFile sample.txt
loadClassifier=/home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz
textFile=sample.txt
Loading classifier from /home/jacobe/stanford-ner-2015-12-09/classifiers/english.all.3class.distsim.crf.ser.gz ... done [2.3 sec].
CRFClassifier tagged 140 words in 3 documents at 1206.90 words per second.
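
The tagger prints one `word/TAG` pair per token, so multi-word names such as `Tim/PERSON Cook/PERSON` still have to be stitched back together. Here is a minimal sketch of one way to do that in Python. The function names `parse_slash_tags` and `extract_entities` are made up for this example, and the approach simply merges adjacent tokens that share a tag, so two different entities of the same type that sit next to each other would be merged into one.

```python
from itertools import groupby


def parse_slash_tags(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash, in case the word itself contains one.
        word, _, tag = item.rpartition('/')
        pairs.append((word, tag))
    return pairs


def extract_entities(pairs):
    """Merge runs of adjacent tokens with the same tag into one entity.

    Tokens tagged 'O' (outside any entity) or with an empty tag are skipped.
    """
    entities = []
    for tag, group in groupby(pairs, key=lambda pair: pair[1]):
        if tag not in ('O', ''):
            entities.append((' '.join(word for word, _ in group), tag))
    return entities


sample = 'Apple/O CEO/O Tim/PERSON Cook/PERSON is/O digging/O in/O'
print(extract_entities(parse_slash_tags(sample)))
# prints [('Tim Cook', 'PERSON')]
```

Run on the second line of the output above, the same functions would return "San Bernardino" and "Calif." as two separate locations, because the comma between them is tagged `O`.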
