spaCy is known for industrial-grade NLP in Python. Its core is written in Cython. Here we use it for Named Entity Recognition (NER) tagging.

In [1]:
article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a 
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped 
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2 
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in 
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight 
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''

In [2]:
import spacy

In [4]:
# 'en' is a spaCy 2.x shortcut; in spaCy 3+ load a named package such as 'en_core_web_sm'
spacy_nlp = spacy.load('en')

In [5]:
document = spacy_nlp(article)
# document.ents holds the entity spans found by the model
for entity in document.ents:
    print("Type: {}, Value: {}".format(entity.label_, entity))

Type: GPE, Value: 

Type: NORP, Value: Asian
Type: DATE, Value: Tuesday
Type: GPE, Value: 

Type: LOC, Value: Europe
Type: CARDINAL, Value: 16-month
Type: GPE, Value: 

Type: ORG, Value: MSCI
Type: LOC, Value: Asia-Pacific
Type: GPE, Value: Japan
Type: PERCENT, Value: 1.7 percent
Type: DATE, Value: a 1-1/2 
week
Type: NORP, Value: Australian
Type: PERCENT, Value: 1.6 percent
Type: GPE, Value: Japan
Type: PERCENT, Value: 3.1 percent
Type: GPE, Value: 

Type: ORG, Value: Apple
Type: MONEY, Value: 1.286
Type: CARDINAL, Value: three
Type: GPE, Value: 

Type: ORG, Value: the
European Union
Type: GPE, Value: Brexit
Type: NORP, Value: British
Type: PERSON, Value: Theresa May
Type: DATE, Value: Monday
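
Several of the GPE entries above have a value that is only a line break, because the article string itself contains newlines. The following is a minimal sketch of one way to clean this up, assuming the entities have been collected as (label, text) pairs. The `raw_entities` list is a hand-copied subset of the output above, and `group_entities` is a hypothetical helper, not part of spaCy.

```python
from collections import defaultdict

# Hand-copied subset of the (label, text) pairs printed above.
# With spaCy available, a similar list could be built as
# [(ent.label_, ent.text) for ent in document.ents].
raw_entities = [
    ("GPE", "\n"),
    ("NORP", "Asian"),
    ("DATE", "Tuesday"),
    ("GPE", "\n"),
    ("LOC", "Europe"),
    ("ORG", "MSCI"),
    ("GPE", "Japan"),
    ("DATE", "a 1-1/2 \nweek"),
    ("GPE", "Japan"),
    ("ORG", "the\nEuropean Union"),
    ("PERSON", "Theresa May"),
]

def group_entities(pairs):
    """Group entity texts by label, skipping whitespace-only values and duplicates."""
    grouped = defaultdict(list)
    for label, text in pairs:
        cleaned = " ".join(text.split())  # collapse newlines and repeated spaces
        if cleaned and cleaned not in grouped[label]:
            grouped[label].append(cleaned)
    return dict(grouped)

entities_by_label = group_entities(raw_entities)
for label, values in entities_by_label.items():
    print(label, values)
```

Stripping the newlines from the article before calling the model would likely avoid the empty entities in the first place, but that changes the input, so it is left as an option here.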


The supported entity types are:
PERSON - people, including fictional
NORP - nationalities or religious or political groups
FAC - buildings, airports, highways, bridges, etc.
ORG - companies, agencies, institutions
GPE - countries, cities, states
LOC - non-GPE locations, mountain ranges, bodies of water
PRODUCT - objects, vehicles, foods, etc. (not services)
EVENT - named hurricanes, battles, wars, sports events
WORK_OF_ART - titles of books, songs, etc.
LAW - named documents made into laws
LANGUAGE - any named language
DATE - absolute or relative dates or periods
TIME - time smaller than a day
PERCENT - percentage, including "%"
MONEY - monetary values, including unit
QUANTITY - measurements, weight or distance
ORDINAL - "first", "second"
CARDINAL - numerals that do not fall under another type
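
The list above can be kept next to the code as a small lookup table, which makes questionable labels easier to spot. This is a sketch only: `ENTITY_DESCRIPTIONS` holds a subset of the descriptions copied from the list, and `describe` is a hypothetical helper. spaCy also ships `spacy.explain(label)`, which returns a similar description for a label.

```python
# Descriptions copied from the list above (subset, for illustration).
ENTITY_DESCRIPTIONS = {
    "PERSON": "people, including fictional",
    "NORP": "nationalities or religious or political groups",
    "ORG": "companies, agencies, institutions",
    "GPE": "countries, cities, states",
    "LOC": "non-GPE locations, mountain ranges, bodies of water",
    "DATE": "absolute or relative dates or periods",
    "PERCENT": 'percentage, including "%"',
    "MONEY": "monetary values, including unit",
    "CARDINAL": "numerals that do not fall under another type",
}

def describe(label):
    """Return a readable description for an entity label (hypothetical helper)."""
    return ENTITY_DESCRIPTIONS.get(label, "unknown label")

# A few (label, value) pairs taken from the spaCy output above.
for label, value in [("ORG", "Apple"), ("PERSON", "Theresa May"), ("GPE", "Brexit")]:
    print("{:<12} {:<8} {}".format(value, label, describe(label)))
```

Printed this way, "Brexit" appears next to "countries, cities, states", which shows at a glance that the model mislabeled it.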

NLTK (Natural Language Toolkit) is a Python package that provides a set of natural language corpora and APIs for a wide variety of NLP algorithms. Named Entity Recognition with NLTK is done in three stages:
1. Word Tokenization
2. Parts of Speech (POS) tagging
3. Named Entity Recognition

In [7]:
import nltk

from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')

[nltk_data] Downloading package words to /home/joseph/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/joseph/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /home/joseph/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /home/joseph/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


True

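Note that we had to download several NLTK data packages first: the tokenizer models (punkt), the part-of-speech tagger (averaged_perceptron_tagger), the named entity chunker (maxent_ne_chunker), and the word list it relies on (words).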

In [9]:
def preprocess(sentence):
    # Split the raw text into word tokens, then tag each token with its part of speech.
    tokens = nltk.word_tokenize(sentence)
    return nltk.pos_tag(tokens)

sentence_processed = preprocess(article)
sentence_processed

[('Asian', 'JJ'),
 ('shares', 'NNS'),
 ('skidded', 'VBN'),
 ('on', 'IN'),
 ('Tuesday', 'NNP'),
 ('after', 'IN'),
 ('a', 'DT'),
 ('rout', 'NN'),
 ('in', 'IN'),
 ('tech', 'JJ'),
 ('stocks', 'NNS'),
 ('put', 'VBD'),
 ('Wall', 'NNP'),
 ('Street', 'NNP'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('sword', 'NN'),
 (',', ','),
 ('while', 'IN'),
 ('a', 'DT'),
 ('sharp', 'JJ'),
 ('drop', 'NN'),
 ('in', 'IN'),
 ('oil', 'NN'),
 ('prices', 'NNS'),
 ('and', 'CC'),
 ('political', 'JJ'),
 ('risks', 'NNS'),
 ('in', 'IN'),
 ('Europe', 'NNP'),
 ('pushed', 'VBD'),
 ('the', 'DT'),
 ('dollar', 'NN'),
 ('to', 'TO'),
 ('16-month', 'JJ'),
 ('highs', 'NNS'),
 ('as', 'IN'),
 ('investors', 'NNS'),
 ('dumped', 'VBD'),
 ('riskier', 'JJR'),
 ('assets', 'NNS'),
 ('.', '.'),
 ('MSCI', 'NNP'),
 ('’', 'NNP'),
 ('s', 'VBD'),
 ('broadest', 'JJS'),
 ('index', 'NN'),
 ('of', 'IN'),
 ('Asia-Pacific', 'NNP'),
 ('shares', 'NNS'),
 ('outside', 'IN'),
 ('Japan', 'NNP'),
 ('dropped', 'VBD'),
 ('1.7', 'CD'),
 ('percent', 'NN'),
 ('to', 'T

The codes are labeled as follows:

CC - coordinating conjunction

CD - cardinal digit

DT - determiner

EX - existential there ("there is", as in "there exists")

FW - foreign word

IN - preposition/subordinating conjunction

JJ - adjective 'big'

JJR - adjective comparative 'bigger'

JJS - adjective superlative 'biggest'

LS - list marker

MD - modal could, will

NN - noun, singular 'desk'

NNS - noun plural 'desks'

NNP - proper noun, singular 'Harrison'

NNPS - proper noun, plural "Americans"

PDT - predeterminer 'all the kids'

POS - possessive ending parent's

PRP - personal pronoun I, he, she

PRP$ - possessive pronoun my, his, hers

RB - adverb very, silently

RBR - adverb, comparative better

RBS - adverb, superlative best

RP - particle give up

TO - to go 'to' the store

VB - verb, base form take

VBD - verb, past tense took

VBG - verb, gerund/present participle taking

VBN - verb, past participle taken

VBP - verb, sing. present, non-3rd person take

VBZ - verb, 3rd person sing. present takes

WDT - wh-determiner which

WP - wh-pronoun, who, what

WP$ - possessive wh-pronoun, whose

WRB - wh-adverb where, when
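
Because related tags share a prefix (NN, NNS, NNP and NNPS are all nouns), the fine-grained tags can be collapsed into coarse word classes with a prefix check. The sketch below assumes only the tag names listed above; `COARSE_CLASSES` and `coarse_class` are hypothetical names, not part of NLTK.

```python
# Map Penn Treebank tag prefixes to coarse word classes.
COARSE_CLASSES = [
    ("NN", "noun"),       # NN, NNS, NNP, NNPS
    ("VB", "verb"),       # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "adjective"),  # JJ, JJR, JJS
    ("RB", "adverb"),     # RB, RBR, RBS
]

def coarse_class(tag):
    """Return the coarse word class for a tag, or 'other' if no prefix matches."""
    for prefix, name in COARSE_CLASSES:
        if tag.startswith(prefix):
            return name
    return "other"

# The first few (word, tag) pairs from the tagger output above.
tagged = [("Asian", "JJ"), ("shares", "NNS"), ("skidded", "VBN"),
          ("on", "IN"), ("Tuesday", "NNP")]
coarse = [(word, coarse_class(tag)) for word, tag in tagged]
print(coarse)
```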

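After parts-of-speech tagging, we do **chunking**, which groups the tagged words into "chunks" and so adds more structure to the sentence. NLTK's ne_chunk groups tokens into named-entity chunks such as PERSON, ORGANIZATION and GPE.

Here we print only the lines of the chunk tree that contain a noun tag ("/NN"), since named entities are built from nouns.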

In [10]:
chunks = ne_chunk( sentence_processed )

for x in str(chunks).split('\n'):
    if '/NN' in x:
        print(x)

  shares/NNS
  Tuesday/NNP
  rout/NN
  stocks/NNS
  (FACILITY Wall/NNP Street/NNP)
  sword/NN
  drop/NN
  oil/NN
  prices/NNS
  risks/NNS
  (GPE Europe/NNP)
  dollar/NN
  highs/NNS
  investors/NNS
  assets/NNS
  (ORGANIZATION MSCI/NNP)
  ’/NNP
  index/NN
  Asia-Pacific/NNP
  shares/NNS
  (GPE Japan/NNP)
  percent/NN
  week/NN
  trough/NN
  shares/NNS
  percent/NN
  (PERSON Japan/NNP)
  ’/NNP
  (PERSON Nikkei/NNP)
  percent/NN
  losses/NNS
  machinery/NN
  makers/NNS
  suppliers/NNS
  (PERSON Apple/NNP)
  ’/NNP
  iphone/NN
  parts/NNS
  (PERSON Sterling/NN)
  sessions/NNS
  losses/NNS
  Nov.1/NNP
  issues/NNS
  (ORGANIZATION European/NNP Union/NNP)
  (GPE Brexit/NNP)
  (GPE British/NNP)
  Prime/NNP
  Minister/NNP
  (PERSON Theresa/NNP May/NNP)
  Monday/NNP
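
The printed tree mixes plain tokens with bracketed entity chunks. To keep only the entities as (label, text) pairs, the bracketed lines can be parsed with a regular expression. This is a sketch under stated assumptions: `tree_lines` is a hand-copied subset of the output above, and `extract_entities` is a hypothetical helper. With the tree object itself, iterating over its children and checking `hasattr(node, 'label')` would avoid parsing strings.

```python
import re

# A few lines copied from the printed tree above.
tree_lines = [
    "  shares/NNS",
    "  (FACILITY Wall/NNP Street/NNP)",
    "  (GPE Europe/NNP)",
    "  (ORGANIZATION European/NNP Union/NNP)",
    "  (PERSON Theresa/NNP May/NNP)",
    "  Monday/NNP",
]

# Matches lines of the form "(LABEL word/TAG word/TAG ...)".
CHUNK_LINE = re.compile(r"^\s*\((\w+) (.+)\)\s*$")

def extract_entities(lines):
    """Return (label, text) pairs for every bracketed chunk line."""
    entities = []
    for line in lines:
        match = CHUNK_LINE.match(line)
        if not match:
            continue  # plain token line, not an entity chunk
        label, body = match.groups()
        # Each token looks like word/TAG; keep the part before the last slash.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((label, " ".join(words)))
    return entities

named_entities = extract_entities(tree_lines)
print(named_entities)
```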


This looks okay, but not great: Japan, Nikkei, Apple and Sterling are all tagged PERSON. We can also write a chunking pattern by hand. Note that this is noun-phrase chunking rather than named entity recognition. We can say that a noun phrase (NP) is formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), and then a noun (NN).
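
To see what such a rule matches, it can be mimicked with Python's `re` module by writing the tag sequence as one string and scanning it. This is an illustrative sketch, not NLTK's implementation; `NP_PATTERN` and `find_noun_phrases` are hypothetical names, and the sample tokens are taken from the tagger output above.

```python
import re

# Optional determiner, any number of adjectives, then exactly one NN.
# "<NN>" does not match "<NNS>" because the closing ">" must follow "NN".
NP_PATTERN = re.compile(r"(<DT>)?(<JJ>)*<NN>")

def find_noun_phrases(tagged_tokens):
    """Return the word sequences whose tags match DT? JJ* NN."""
    # Encode the tags as "<IN><DT><JJ>..." so a regex can scan them.
    tag_string = "".join("<{}>".format(tag) for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Convert character offsets back to token indices by counting "<".
        start = tag_string[:match.start()].count("<")
        length = match.group(0).count("<")
        words = [word for word, _ in tagged_tokens[start:start + length]]
        phrases.append(" ".join(words))
    return phrases

tagged = [("while", "IN"), ("a", "DT"), ("sharp", "JJ"), ("drop", "NN"),
          ("in", "IN"), ("oil", "NN"), ("prices", "NNS")]
noun_phrases = find_noun_phrases(tagged)
print(noun_phrases)
```

Because the pattern ends in NN only, plural nouns such as "prices" (NNS) and proper nouns (NNP) are left outside the chunks.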

In [15]:
# NP = optional determiner, any number of adjectives, then a singular noun
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(pattern)
cs = cp.parse(sentence_processed)
print(cs)

(S
  Asian/JJ
  shares/NNS
  skidded/VBN
  on/IN
  Tuesday/NNP
  after/IN
  (NP a/DT rout/NN)
  in/IN
  tech/JJ
  stocks/NNS
  put/VBD
  Wall/NNP
  Street/NNP
  to/TO
  (NP the/DT sword/NN)
  ,/,
  while/IN
  (NP a/DT sharp/JJ drop/NN)
  in/IN
  (NP oil/NN)
  prices/NNS
  and/CC
  political/JJ
  risks/NNS
  in/IN
  Europe/NNP
  pushed/VBD
  (NP the/DT dollar/NN)
  to/TO
  16-month/JJ
  highs/NNS
  as/IN
  investors/NNS
  dumped/VBD
  riskier/JJR
  assets/NNS
  ./.
  MSCI/NNP
  ’/NNP
  s/VBD
  broadest/JJS
  (NP index/NN)
  of/IN
  Asia-Pacific/NNP
  shares/NNS
  outside/IN
  Japan/NNP
  dropped/VBD
  1.7/CD
  (NP percent/NN)
  to/TO
  (NP a/DT 1-1/2/JJ week/NN)
  (NP trough/NN)
  ,/,
  with/IN
  Australian/JJ
  shares/NNS
  sinking/VBG
  1.6/CD
  (NP percent/NN)
  ./.
  Japan/NNP
  ’/NNP
  s/VBD
  Nikkei/NNP
  dived/VBD
  3.1/CD
  (NP percent/NN)
  led/VBN
  by/IN
  losses/NNS
  in/IN
  (NP electric/JJ machinery/NN)
  makers/NNS
  and/CC
  suppliers/NNS
  of/IN
  Apple/NNP
  ’/NNP
  s/

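The output is a tree whose root, "S", stands for the sentence. A more widely used format for chunk annotations is IOB tagging (Inside, Outside, Beginning): B- marks the first token of a chunk, I- marks a token inside a chunk, and O marks a token outside any chunk.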
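
The IOB scheme itself is simple enough to sketch in plain Python. The function below is a hypothetical illustration, not NLTK code: it assumes chunks are given as (start, end, label) spans over the token list, with `end` exclusive.

```python
def to_iob(tokens, chunks):
    """Tag each token with B-label, I-label or O.

    tokens: list of words
    chunks: list of (start, end, label) spans, end exclusive
    """
    tags = ["O"] * len(tokens)  # every token starts outside any chunk
    for start, end, label in chunks:
        tags[start] = "B-" + label  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label  # remaining tokens of the chunk
    return list(zip(tokens, tags))

tokens = ["while", "a", "sharp", "drop", "in", "oil", "prices"]
chunks = [(1, 4, "NP"), (5, 6, "NP")]
iob = to_iob(tokens, chunks)
for word, tag in iob:
    print(word, tag)
```

A single-token chunk such as "oil" gets only a B- tag, which is why B-NP can appear without a following I-NP.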

In [16]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('Asian', 'JJ', 'O'),
 ('shares', 'NNS', 'O'),
 ('skidded', 'VBN', 'O'),
 ('on', 'IN', 'O'),
 ('Tuesday', 'NNP', 'O'),
 ('after', 'IN', 'O'),
 ('a', 'DT', 'B-NP'),
 ('rout', 'NN', 'I-NP'),
 ('in', 'IN', 'O'),
 ('tech', 'JJ', 'O'),
 ('stocks', 'NNS', 'O'),
 ('put', 'VBD', 'O'),
 ('Wall', 'NNP', 'O'),
 ('Street', 'NNP', 'O'),
 ('to', 'TO', 'O'),
 ('the', 'DT', 'B-NP'),
 ('sword', 'NN', 'I-NP'),
 (',', ',', 'O'),
 ('while', 'IN', 'O'),
 ('a', 'DT', 'B-NP'),
 ('sharp', 'JJ', 'I-NP'),
 ('drop', 'NN', 'I-NP'),
 ('in', 'IN', 'O'),
 ('oil', 'NN', 'B-NP'),
 ('prices', 'NNS', 'O'),
 ('and', 'CC', 'O'),
 ('political', 'JJ', 'O'),
 ('risks', 'NNS', 'O'),
 ('in', 'IN', 'O'),
 ('Europe', 'NNP', 'O'),
 ('pushed', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('dollar', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('16-month', 'JJ', 'O'),
 ('highs', 'NNS', 'O'),
 ('as', 'IN', 'O'),
 ('investors', 'NNS', 'O'),
 ('dumped', 'VBD', 'O'),
 ('riskier', 'JJR', 'O'),
 ('assets', 'NNS', 'O'),
 ('.', '.', 'O'),
 ('MSCI', 'NNP'

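Here each line of output holds a word, its part-of-speech tag, and its chunk tag. The third column marks the noun-phrase chunks produced by the regular-expression chunker, not named entities.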

In [17]:
for word, pos, chunk_tag in iob_tagged:
    print(word, pos, chunk_tag)

Asian JJ O
shares NNS O
skidded VBN O
on IN O
Tuesday NNP O
after IN O
a DT B-NP
rout NN I-NP
in IN O
tech JJ O
stocks NNS O
put VBD O
Wall NNP O
Street NNP O
to TO O
the DT B-NP
sword NN I-NP
, , O
while IN O
a DT B-NP
sharp JJ I-NP
drop NN I-NP
in IN O
oil NN B-NP
prices NNS O
and CC O
political JJ O
risks NNS O
in IN O
Europe NNP O
pushed VBD O
the DT B-NP
dollar NN I-NP
to TO O
16-month JJ O
highs NNS O
as IN O
investors NNS O
dumped VBD O
riskier JJR O
assets NNS O
. . O
MSCI NNP O
’ NNP O
s VBD O
broadest JJS O
index NN B-NP
of IN O
Asia-Pacific NNP O
shares NNS O
outside IN O
Japan NNP O
dropped VBD O
1.7 CD O
percent NN B-NP
to TO O
a DT B-NP
1-1/2 JJ I-NP
week NN I-NP
trough NN B-NP
, , O
with IN O
Australian JJ O
shares NNS O
sinking VBG O
1.6 CD O
percent NN B-NP
. . O
Japan NNP O
’ NNP O
s VBD O
Nikkei NNP O
dived VBD O
3.1 CD O
percent NN B-NP
led VBN O
by IN O
losses NNS O
in IN O
electric JJ B-NP
machinery NN I-NP
makers NNS O
and CC O
suppliers NNS O
of IN O
Apple NNP O
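
Going the other way, the IOB triples can be folded back into phrases by starting a new chunk at every B- tag and extending it on every I- tag. This is a minimal sketch; `iob_to_chunks` is a hypothetical helper, and the sample triples are copied from the output above.

```python
def iob_to_chunks(triples):
    """Group (word, pos, iob_tag) triples into (label, phrase) pairs."""
    chunks = []
    current_words, current_label = [], None

    def flush():
        # Store the chunk collected so far, if any.
        if current_words:
            chunks.append((current_label, " ".join(current_words)))

    for word, _pos, tag in triples:
        if tag.startswith("B-"):
            flush()
            current_words, current_label = [word], tag[2:]
        elif tag.startswith("I-") and current_words:
            current_words.append(word)
        else:
            # "O", or a stray I- tag with no open chunk: treated as outside.
            flush()
            current_words, current_label = [], None
    flush()
    return chunks

sample = [
    ("to", "TO", "O"),
    ("a", "DT", "B-NP"),
    ("1-1/2", "JJ", "I-NP"),
    ("week", "NN", "I-NP"),
    ("trough", "NN", "B-NP"),
    (",", ",", "O"),
]
phrases = iob_to_chunks(sample)
print(phrases)
```

The B- tag on "trough" is what keeps it separate from "a 1-1/2 week", even though the two chunks are adjacent.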