# Named Entity Recognition (off-the-shelf tools: NLTK, spaCy)

Named Entity Recognition (NER) has become such a common task that many tools provide it out of the box. In this tutorial, we use NLTK and spaCy to perform NER.

## Import required packages

In [9]:
import nltk
import spacy

from nltk import sent_tokenize, word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

from newspaper import Article

We also need the English model for spaCy:

In [10]:
nlp = spacy.load('en_core_web_sm')

## Loading a document

For this tutorial, we use news articles as documents and fetch them with the `newspaper` package; see the "Data Collection" tutorial for more details.

In [11]:
# Candidate articles; only the uncommented URL is used
#url = 'http://www.straitstimes.com/asia/east-asia/now-its-japans-turn-to-brace-for-a-monster-storm-as-typhoon-lan-nears'
url = 'http://www.straitstimes.com/singapore/ammonia-leak-in-food-factory-at-fishery-port-road-3-taken-to-hospital'
#url = 'http://www.straitstimes.com/singapore/police-car-mounts-divider-in-accident-in-kampong-bahru-road-no-injuries'

article = Article(url)

The methods `download()` and `parse()` fetch and process the current article. For the rest of this tutorial, we only consider the title and the main content (text) of an article.

In [12]:
article.download()
article.parse()

title = article.title
text = article.text

## Named Entity Recognition with NLTK

We first show how NLTK performs NER with a simple example sentence.

In [13]:
example_sentence = "Straits Times interviewed PM Lee last week in Japan."

For preprocessing, we need to tokenize and POS-tag the sentence, since POS tags are required for the NER extraction.

In [14]:
# Tokenize sentence
token_list = word_tokenize(example_sentence)
# POS tag token list
token_pos_list = pos_tag(token_list)

NLTK provides a method `ne_chunk` (ne = named entity) which labels words or phrases as named entities. The result is a token tree, an internal representation used in NLTK.

In [15]:
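# Perform named entity recognition through chunking;
# binary=False keeps typed labels (PERSON, ORGANIZATION, GPE, ...),
# whereas binary=True would tag every entity simply as NE
ne_chunk_tree = ne_chunk(token_pos_list, binary=False)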

print(ne_chunk_tree)

(S
  Straits/NNS
  (PERSON Times/NNP)
  interviewed/VBD
  (ORGANIZATION PM/NNP Lee/NNP)
  last/JJ
  week/NN
  in/IN
  (GPE Japan/NNP)
  ./.)


The following auxiliary method goes through the tree, extracts the named entities, and puts them together with their labels into a list.

In [16]:
def extract_named_entities(tree):
    chunk_list = []
    for node in tree:
        # Named entities are subtrees; plain (token, POS) tuples are skipped
        if isinstance(node, Tree):
            label = node.label()
            name = " ".join(token for token, pos in node.leaves())
            chunk_list.append((name, label))
    return chunk_list

In [17]:
print(extract_named_entities(ne_chunk_tree))

[('Times', 'PERSON'), ('PM Lee', 'ORGANIZATION'), ('Japan', 'GPE')]
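To try the extraction step without downloading any NLTK models, the tree can also be built by hand. The following is only a minimal sketch: `hand_built_tree` is a hypothetical stand-in that mirrors the tree printed above and is not produced by `ne_chunk`, and the helper is repeated so that the block runs on its own.

```python
from nltk.tree import Tree


def extract_named_entities(tree):
    """Collect (name, label) pairs from the entity subtrees of a chunk tree."""
    chunk_list = []
    for node in tree:
        # Entities are Tree nodes; ordinary tokens are (token, POS) tuples
        if isinstance(node, Tree):
            name = " ".join(token for token, pos in node.leaves())
            chunk_list.append((name, node.label()))
    return chunk_list


# Hand-built stand-in (assumption): same shape as the printed tree above
hand_built_tree = Tree("S", [
    ("Straits", "NNS"),
    Tree("PERSON", [("Times", "NNP")]),
    ("interviewed", "VBD"),
    Tree("ORGANIZATION", [("PM", "NNP"), ("Lee", "NNP")]),
    ("last", "JJ"),
    ("week", "NN"),
    ("in", "IN"),
    Tree("GPE", [("Japan", "NNP")]),
    (".", "."),
])

entities = extract_named_entities(hand_built_tree)
print(entities)
```

Because the helper only inspects the direct children of the root, it relies on the flat tree shape shown above, where each entity is a single subtree under `S`.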


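In the example sentence, NLTK tags `Times` as PERSON and `PM Lee` as ORGANIZATION, so the labels are not always reliable. Now we can repeat these steps with the news article. First we need to tokenize the document into sentences.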

In [18]:
sentences = sent_tokenize(text)

For each sentence, we perform the required steps as shown above:

* Tokenize the sentence into words/tokens
* POS-tag the token list
* Perform NER using `ne_chunk()`
* Extract found named entities and put them into a list

In [20]:
for sent in sentences:
    # Tokenize sentence
    token_list = word_tokenize(sent)
    # POS tag token list
    token_pos_list = pos_tag(token_list)
    # Perform named entity recognition through chunking
    ne_chunk_tree = ne_chunk(token_pos_list, binary=False)
    # Extract the named entities from tree into a simple list
    named_entities_list = extract_named_entities(ne_chunk_tree)
    # Print found named entities
    for ne in named_entities_list:
        print("-- {} ({})".format(ne[0], ne[1]))

-- SINGAPORE (GPE)
-- Jurong (GPE)
-- Singapore Civil Defence Force (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- Fishery Port Road (PERSON)
-- Ben Foods (PERSON)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- Ng Teng Fong (PERSON)
-- SCDF (ORGANIZATION)
-- Salim Anwar (PERSON)
-- NCS Cold Stores (ORGANIZATION)
-- Ben Foods (PERSON)
-- Straits Times (ORGANIZATION)
-- Fauzan Tahir (PERSON)
-- Ben Foods (PERSON)
-- ST (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- HazMat (ORGANIZATION)
-- Hazardous (ORGANIZATION)
-- HazMat Specialists (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- SCDF (ORGANIZATION)
-- Ben Foods (PERSON)
-- Kallang Way (PERSON)
-- Pioneer (GPE)


## Named Entity Recognition with spaCy

When spaCy analyzes a document, it performs a whole series of steps including tokenizing, POS tagging, lemmatization and even named entity recognition. That means that after running the following command, the `doc` object already contains the information about the named entities found.

In [21]:
doc = nlp(text)

The auxiliary method `show_named_entities()` simply prints all found named entities in a nice layout. The only additional, and optional, parameter is `valid_labels`, which lets you specify a list of named entity types. Labels are compared in lowercase, so, for example, we can print all persons by setting `valid_labels=['person']`.

In [22]:
def show_named_entities(doc, valid_labels=None):
    print('{:35} {:25} {:6} {:6}'.format("NAME", "LABEL", "START", "END"))
    for e in doc.ents:
        label = e.label_.lower()
        if valid_labels is not None:
            if label not in valid_labels:
                continue
        # Print name, label, start and end (in a nice way)
        print('{:35} {:25} {:5} {:5}'.format(e.text, label, e.start_char, e.end_char))
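Since `show_named_entities()` compares lowercased labels, passing spaCy's uppercase form (such as `'GPE'`) would filter out everything. A variant that returns rows instead of printing, and that normalizes the case of `valid_labels`, might look like the sketch below. The name `collect_named_entities` and the `fake_doc` stand-in are assumptions for illustration; only the attributes `ents`, `text`, `label_`, `start_char` and `end_char` are taken from the code above.

```python
from types import SimpleNamespace


def collect_named_entities(doc, valid_labels=None):
    """Return (text, label, start_char, end_char) tuples for doc.ents."""
    # Normalize the filter so 'GPE' and 'gpe' behave the same
    wanted = None
    if valid_labels is not None:
        wanted = {label.lower() for label in valid_labels}
    rows = []
    for e in doc.ents:
        label = e.label_.lower()
        if wanted is not None and label not in wanted:
            continue
        rows.append((e.text, label, e.start_char, e.end_char))
    return rows


# Stand-in for a spaCy Doc (assumption), so the sketch runs without a model
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Jurong", label_="GPE", start_char=81, end_char=87),
    SimpleNamespace(text="Friday", label_="DATE", start_char=91, end_char=97),
    SimpleNamespace(text="SCDF", label_="ORG", start_char=144, end_char=148),
])

gpe_rows = collect_named_entities(fake_doc, valid_labels=["GPE"])
print(gpe_rows)
```

With a real document, the call would be `collect_named_entities(doc, valid_labels=['GPE'])`.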
    

Let's print the found named entities for the current document. You can filter the list by specifying a list of valid labels.

In [23]:
show_named_entities(doc)
#show_named_entities(doc, valid_labels=['gpe', 'loc'])

NAME                                LABEL                     START  END   
Jurong                              gpe                          81    87
Friday                              date                         91    97
Jan                                 person                       99   102
The Singapore Civil Defence Force   org                         109   142
SCDF                                org                         144   148
Facebook                            gpe                         160   168
1                                   cardinal                    231   232
Fishery Port Road                   fac                         234   251
about 11.40am.                      percent                     256   270
Ben Foods                           person                      292   301
first                               ordinal                     479   484
SCDF                                org                         570   574
SCDF                                
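
The START and END columns are character offsets into the analyzed text, so `text[start:end]` recovers the entity string. As a rough illustration, the sketch below uses such offsets to mark entities inline. The function `annotate`, the sample sentence, and the hand-written spans are assumptions for this example and not actual spaCy output; the spans are also assumed not to overlap.

```python
def annotate(text, spans):
    """Wrap each (start, end, label) span of text in [entity|label] markers."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Copy the text between the previous entity and this one unchanged
        pieces.append(text[cursor:start])
        pieces.append("[{}|{}]".format(text[start:end], label))
        cursor = end
    # Copy whatever follows the last entity
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-written spans (assumption); with spaCy they could be built as
# [(e.start_char, e.end_char, e.label_.lower()) for e in doc.ents]
example_text = "Straits Times interviewed PM Lee last week in Japan."
example_spans = [(0, 13, "org"), (26, 32, "person"), (46, 51, "gpe")]

annotated = annotate(example_text, example_spans)
print(annotated)
```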

### List of entity types supported by spaCy

![title](images/ner-spacy-entity-labels.png "Test")