In [None]:
import nltk
nltk.download("popular")
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

Let's take a sentence from an article published in the New York Times as the test data.
Test data = 'United States tech giant Google on Wednesday launched a new cloud data hub in Warsaw its first in Central and Eastern Europe with an investment of nearly $2.0 billion i.e. €1.7 billion'

In [None]:
data = 'India reported 36,571 new Covid cases and 540 deaths in last 24 hours, according to health ministry bulletin on Friday. Active caseload stands at 3,63,605; lowest in 150 days. Meanwhile, the recovery rate has increased to 97.54%. Stay with TOI for all updates'

Then let's apply word tokenization and part-of-speech tagging to the data.
Below is the function used to preprocess the data.

In [None]:
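def preprocess(sent):
    # Split the raw text into word and punctuation tokens.
    sent = nltk.word_tokenize(sent)
    # Attach a Penn Treebank part-of-speech tag to every token.
    sent = nltk.pos_tag(sent)
    return sent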

By passing our test data to the preprocess() function, we get back a list of tuples, each containing a word and its part-of-speech tag.

In [None]:
sent = preprocess(data)
sent

[('India', 'NNP'),
 ('reported', 'VBD'),
 ('36,571', 'CD'),
 ('new', 'JJ'),
 ('Covid', 'NNP'),
 ('cases', 'NNS'),
 ('and', 'CC'),
 ('540', 'CD'),
 ('deaths', 'NNS'),
 ('in', 'IN'),
 ('last', 'JJ'),
 ('24', 'CD'),
 ('hours', 'NNS'),
 (',', ','),
 ('according', 'VBG'),
 ('to', 'TO'),
 ('health', 'NN'),
 ('ministry', 'NN'),
 ('bulletin', 'NN'),
 ('on', 'IN'),
 ('Friday', 'NNP'),
 ('.', '.'),
 ('Active', 'NNP'),
 ('caseload', 'NN'),
 ('stands', 'VBZ'),
 ('at', 'IN'),
 ('3,63,605', 'CD'),
 (';', ':'),
 ('lowest', 'JJS'),
 ('in', 'IN'),
 ('150', 'CD'),
 ('days', 'NNS'),
 ('.', '.'),
 ('Meanwhile', 'RB'),
 (',', ','),
 ('the', 'DT'),
 ('recovery', 'NN'),
 ('rate', 'NN'),
 ('has', 'VBZ'),
 ('increased', 'VBN'),
 ('to', 'TO'),
 ('97.54', 'CD'),
 ('%', 'NN'),
 ('.', '.'),
 ('Stay', 'NNP'),
 ('with', 'IN'),
 ('TOI', 'NNP'),
 ('for', 'IN'),
 ('all', 'DT'),
 ('updates', 'NNS')]
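
A quick way to sanity-check tagger output like this is to count how often each tag occurs. The sketch below hard-codes the first nine tuples from the output above so that it runs without NLTK; the names tagged_sample and tag_counts are illustrative and not part of the original notebook.

```python
from collections import Counter

# First nine (word, tag) pairs copied from the tagger output above.
tagged_sample = [
    ("India", "NNP"), ("reported", "VBD"), ("36,571", "CD"),
    ("new", "JJ"), ("Covid", "NNP"), ("cases", "NNS"),
    ("and", "CC"), ("540", "CD"), ("deaths", "NNS"),
]

# Count how many tokens carry each part-of-speech tag.
tag_counts = Counter(tag for _, tag in tagged_sample)

for tag, count in tag_counts.most_common():
    print(f"{tag}: {count}")
```

In a live session the same Counter could be built directly from the list returned by preprocess().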

Our next step is chunking. Chunking groups the tagged tokens into phrases according to a chunk pattern. Here the pattern consists of a single rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN. Chunks are built from words, and the kinds of words allowed in a chunk are specified using part-of-speech tags. Words that are left out of every chunk are known as chinks. Below is the pattern for the rule just described.

In [None]:
pattern = 'NP: {<DT>?<JJ>*<NN>}'

The next step is to create a chunk parser and run it on our test data. The output is a tree whose root, S, denotes the sentence.

In [None]:
chun_p = nltk.RegexpParser(pattern)
chun_sol = chun_p.parse(sent)
print(chun_sol)

(S
  India/NNP
  reported/VBD
  36,571/CD
  new/JJ
  Covid/NNP
  cases/NNS
  and/CC
  540/CD
  deaths/NNS
  in/IN
  last/JJ
  24/CD
  hours/NNS
  ,/,
  according/VBG
  to/TO
  (NP health/NN)
  (NP ministry/NN)
  (NP bulletin/NN)
  on/IN
  Friday/NNP
  ./.
  Active/NNP
  (NP caseload/NN)
  stands/VBZ
  at/IN
  3,63,605/CD
  ;/:
  lowest/JJS
  in/IN
  150/CD
  days/NNS
  ./.
  Meanwhile/RB
  ,/,
  (NP the/DT recovery/NN)
  (NP rate/NN)
  has/VBZ
  increased/VBN
  to/TO
  97.54/CD
  (NP %/NN)
  ./.
  Stay/NNP
  with/IN
  TOI/NNP
  for/IN
  all/DT
  updates/NNS)
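
In the tree above, consecutive nouns such as health, ministry and bulletin each land in their own NP, and plural or proper nouns (NNS, NNP) are not chunked at all, because the rule matches exactly one NN. One possible refinement, offered as a sketch rather than as the notebook's own grammar, is to match tag families (<NN.*> covers NN, NNS and NNP) and to allow one or more nouns in a row. The name broad_pattern and the short hand-tagged sample are assumptions made for illustration.

```python
import nltk

# Hypothetical broader rule: optional determiner, any adjectives,
# then one or more nouns of any kind (NN, NNS, NNP, NNPS).
broad_pattern = "NP: {<DT>?<JJ.*>*<NN.*>+}"
broad_parser = nltk.RegexpParser(broad_pattern)

# Hand-tagged tokens copied from the tagger output; no model download needed.
sample = [
    ("the", "DT"), ("recovery", "NN"), ("rate", "NN"),
    ("has", "VBZ"), ("increased", "VBN"),
]

broad_tree = broad_parser.parse(sample)

# Collect the words of every NP chunk in the tree.
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in broad_tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)
```

With this rule, "the recovery rate" comes out as one noun phrase instead of two. Whether that grouping is desirable depends on the downstream task.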


Before moving on to the next part of the code, let's discuss IOB tags, a format for representing chunks. These tags are similar to part-of-speech tags, but they mark whether a token is inside (I), outside (O), or at the beginning (B) of a chunk. The format is not limited to noun phrases; several chunk types are allowed. IOB tags have become the standard way to represent chunk structures in files, and we will use this format too.
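
To make the tagging rule concrete, here is a minimal pure-Python sketch of how chunks map to these tags. It is not how NLTK implements the conversion; the to_iob function and the chunked list layout (plain tuples for outside tokens, a label plus a token list for chunks) are assumptions chosen for illustration.

```python
def to_iob(chunked):
    """Flatten tokens and labelled chunks into (word, pos, iob) triples."""
    triples = []
    for item in chunked:
        if isinstance(item[1], list):
            # Labelled chunk: (label, [(word, pos), ...]).
            label, tokens = item
            for i, (word, pos) in enumerate(tokens):
                # The first token of a chunk gets B, the rest get I.
                prefix = "B" if i == 0 else "I"
                triples.append((word, pos, f"{prefix}-{label}"))
        else:
            # Plain (word, pos) tuple: the token is outside every chunk.
            word, pos = item
            triples.append((word, pos, "O"))
    return triples


chunked = [
    ("Meanwhile", "RB"),
    (",", ","),
    ("NP", [("the", "DT"), ("recovery", "NN")]),
    ("NP", [("rate", "NN")]),
    ("has", "VBZ"),
]
iob = to_iob(chunked)
for triple in iob:
    print(triple)
```

Because every chunk opens with a B tag, two adjacent chunks of the same type stay distinguishable, which is why "rate" is tagged B-NP rather than I-NP.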

In [None]:
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(chun_sol)
pprint(iob_tagged)

[('India', 'NNP', 'O'),
 ('reported', 'VBD', 'O'),
 ('36,571', 'CD', 'O'),
 ('new', 'JJ', 'O'),
 ('Covid', 'NNP', 'O'),
 ('cases', 'NNS', 'O'),
 ('and', 'CC', 'O'),
 ('540', 'CD', 'O'),
 ('deaths', 'NNS', 'O'),
 ('in', 'IN', 'O'),
 ('last', 'JJ', 'O'),
 ('24', 'CD', 'O'),
 ('hours', 'NNS', 'O'),
 (',', ',', 'O'),
 ('according', 'VBG', 'O'),
 ('to', 'TO', 'O'),
 ('health', 'NN', 'B-NP'),
 ('ministry', 'NN', 'B-NP'),
 ('bulletin', 'NN', 'B-NP'),
 ('on', 'IN', 'O'),
 ('Friday', 'NNP', 'O'),
 ('.', '.', 'O'),
 ('Active', 'NNP', 'O'),
 ('caseload', 'NN', 'B-NP'),
 ('stands', 'VBZ', 'O'),
 ('at', 'IN', 'O'),
 ('3,63,605', 'CD', 'O'),
 (';', ':', 'O'),
 ('lowest', 'JJS', 'O'),
 ('in', 'IN', 'O'),
 ('150', 'CD', 'O'),
 ('days', 'NNS', 'O'),
 ('.', '.', 'O'),
 ('Meanwhile', 'RB', 'O'),
 (',', ',', 'O'),
 ('the', 'DT', 'B-NP'),
 ('recovery', 'NN', 'I-NP'),
 ('rate', 'NN', 'B-NP'),
 ('has', 'VBZ', 'O'),
 ('increased', 'VBN', 'O'),
 ('to', 'TO', 'O'),
 ('97.54', 'CD', 'O'),
 ('%', 'NN', 'B-NP'),
 

In [None]:
ne_tree = nltk.ne_chunk(pos_tag(word_tokenize(data)))
print(ne_tree)

(S
  (GPE India/NNP)
  reported/VBD
  36,571/CD
  new/JJ
  Covid/NNP
  cases/NNS
  and/CC
  540/CD
  deaths/NNS
  in/IN
  last/JJ
  24/CD
  hours/NNS
  ,/,
  according/VBG
  to/TO
  health/NN
  ministry/NN
  bulletin/NN
  on/IN
  Friday/NNP
  ./.
  Active/NNP
  caseload/NN
  stands/VBZ
  at/IN
  3,63,605/CD
  ;/:
  lowest/JJS
  in/IN
  150/CD
  days/NNS
  ./.
  Meanwhile/RB
  ,/,
  the/DT
  recovery/NN
  rate/NN
  has/VBZ
  increased/VBN
  to/TO
  97.54/CD
  %/NN
  ./.
  Stay/NNP
  with/IN
  (ORGANIZATION TOI/NNP)
  for/IN
  all/DT
  updates/NNS)
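
To pull only the named entities out of such a tree, one option is to walk the children of the root and keep the subtrees. The sketch below rebuilds a small fragment of the tree by hand so that it runs without downloading any NLTK models; ne_fragment and entities are illustrative names.

```python
from nltk import Tree

# Hand-built fragment of the tree printed above.
ne_fragment = Tree("S", [
    Tree("GPE", [("India", "NNP")]),
    ("reported", "VBD"),
    ("Stay", "NNP"),
    ("with", "IN"),
    Tree("ORGANIZATION", [("TOI", "NNP")]),
])

# Children of the root are either plain (word, tag) tuples
# or Tree objects whose label is the entity type.
entities = [
    (" ".join(word for word, _ in child.leaves()), child.label())
    for child in ne_fragment
    if isinstance(child, Tree)
]
print(entities)
```

The same comprehension should work on the ne_tree object from the cell above.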


So far we have performed named entity recognition using NLTK. In the output above it tags only India (GPE) and TOI (ORGANIZATION), and on the New York Times sentence quoted earlier Google was identified as a person, which is a disappointing result. We will therefore use the named entity recognition module from spaCy, which has been trained on the OntoNotes 5 corpus and supports a wide range of entity types.

In [None]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = spacy.load("en_core_web_sm")

The same input used in the NLTK section is reused here, this time with spaCy's NER module. For the given data, it identifies and labels the spans it recognises as entities, which are mostly noun phrases. The meanings of the labels are listed after the output.

In [None]:
doc = nlp(data)  # same text that was passed to NLTK above
pprint([(X.text, X.label_) for X in doc.ents])


[('India', 'GPE'),
 ('36,571', 'CARDINAL'),
 ('Covid', 'PRODUCT'),
 ('540', 'CARDINAL'),
 ('last 24 hours', 'TIME'),
 ('Friday', 'DATE'),
 ('3,63,605', 'CARDINAL'),
 ('150 days', 'DATE'),
 ('97.54%', 'PERCENT')]


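GPE : Countries, states, cities, etc.
NORP : Nationalities or religious or political groups
ORG : Organizations
MONEY : Monetary values, including the unit
DATE : Relative or absolute dates or periods
TIME : Times smaller than a day
CARDINAL : Numerals that do not fall under another type
PERCENT : Percentages, including "%"
PRODUCT : Objects, vehicles, foods, etc. (not services)
LOC : Non-GPE locations
ORDINAL : first, second, etc.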
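
To see at a glance which labels occur in the data, the entities can be grouped by label. The sketch below copies the (text, label) pairs from the spaCy output above so that it runs without spaCy installed; ents and by_label are illustrative names.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
ents = [
    ("India", "GPE"), ("36,571", "CARDINAL"), ("Covid", "PRODUCT"),
    ("540", "CARDINAL"), ("last 24 hours", "TIME"), ("Friday", "DATE"),
    ("3,63,605", "CARDINAL"), ("150 days", "DATE"), ("97.54%", "PERCENT"),
]

# Map each label to the list of entity texts carrying it.
by_label = defaultdict(list)
for text, label in ents:
    by_label[label].append(text)

for label, texts in sorted(by_label.items()):
    print(f"{label:<9} {texts}")
```

In a live session the list could be built from doc.ents instead, and spacy.explain(label) returns a short description of a label.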

In the previous segment we identified the entities and labelled them. Now let's look at token-level entity annotation, where every token in the data carries its own tag. The full BILUO tagging scheme is listed below; note that the ent_iob_ attribute used in the next cell reports only the simpler IOB codes (B, I and O), so a single-token entity appears there as B.
B : BEGIN (the first token of a multi-token entity)
I : IN (an inner token of a multi-token entity)
L : LAST (the final token of a multi-token entity)
U : UNIT (a single-token entity)
O : OUT (a non-entity token)
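
The token-level output below contains only B, I and O codes, so the following sketch shows how the extra L and U tags relate to them by converting a sequence of IOB tags to BILUO. The function iob_to_biluo_tags is a hypothetical helper written for this illustration and assumes well-formed input in which every I tag follows a B or I tag of the same type.

```python
def iob_to_biluo_tags(tags):
    """Convert IOB tags such as 'B-TIME', 'I-TIME', 'O' to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # The entity continues only if the next tag is I- with the same label.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A B tag that is not continued is a single-token entity (U).
            biluo.append(f"B-{label}" if continues else f"U-{label}")
        else:
            # An I tag that is not continued is the last token (L).
            biluo.append(f"I-{label}" if continues else f"L-{label}")
    return biluo


iob_tags = ["B-GPE", "O", "B-TIME", "I-TIME", "I-TIME", "O"]
biluo_tags = iob_to_biluo_tags(iob_tags)
print(biluo_tags)
```

Here the single-token entity becomes U-GPE, and the three-token TIME entity becomes B-TIME, I-TIME, L-TIME.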

In [None]:
pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

[(India, 'B', 'GPE'),
 (reported, 'O', ''),
 (36,571, 'B', 'CARDINAL'),
 (new, 'O', ''),
 (Covid, 'B', 'PRODUCT'),
 (cases, 'O', ''),
 (and, 'O', ''),
 (540, 'B', 'CARDINAL'),
 (deaths, 'O', ''),
 (in, 'O', ''),
 (last, 'B', 'TIME'),
 (24, 'I', 'TIME'),
 (hours, 'I', 'TIME'),
 (,, 'O', ''),
 (according, 'O', ''),
 (to, 'O', ''),
 (health, 'O', ''),
 (ministry, 'O', ''),
 (bulletin, 'O', ''),
 (on, 'O', ''),
 (Friday, 'B', 'DATE'),
 (., 'O', ''),
 (Active, 'O', ''),
 (caseload, 'O', ''),
 (stands, 'O', ''),
 (at, 'O', ''),
 (3,63,605, 'B', 'CARDINAL'),
 (;, 'O', ''),
 (lowest, 'O', ''),
 (in, 'O', ''),
 (150, 'B', 'DATE'),
 (days, 'I', 'DATE'),
 (., 'O', ''),
 (Meanwhile, 'O', ''),
 (,, 'O', ''),
 (the, 'O', ''),
 (recovery, 'O', ''),
 (rate, 'O', ''),
 (has, 'O', ''),
 (increased, 'O', ''),
 (to, 'O', ''),
 (97.54, 'B', 'PERCENT'),
 (%, 'I', 'PERCENT'),
 (., 'O', ''),
 (Stay, 'O', ''),
 (with, 'O', ''),
 (TOI, 'O', ''),
 (for, 'O', ''),
 (all, 'O', ''),
 (updates, 'O', '')]
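
The token-level tags carry enough information to rebuild the entity spans. As a rough sketch, the function below groups (text, iob, type) rows back into entities; collect_entities is a hypothetical helper, and joining tokens with a space is a simplification (it would render the percentage as "97.54 %").

```python
def collect_entities(token_rows):
    """Group (text, iob, ent_type) rows into (entity_text, ent_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for text, iob, ent_type in token_rows:
        if iob == "B":
            # A new entity starts; close the previous one if it is still open.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [text], ent_type
        elif iob == "I" and current_words:
            # Continue the entity that is currently open.
            current_words.append(text)
        else:
            # Outside token: close any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# Rows copied from the token-level output above.
rows = [
    ("in", "O", ""), ("last", "B", "TIME"), ("24", "I", "TIME"),
    ("hours", "I", "TIME"), (",", "O", ""), ("Friday", "B", "DATE"),
    (".", "O", ""),
]
found = collect_entities(rows)
print(found)
```

In practice doc.ents already provides these spans; the sketch only illustrates what the B and I codes encode.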


Let's take a few examples and analyse them, using the New York Times sentence quoted at the start.
(i) United States - B - means it begins an entity of type GPE.
(ii) billion - I - means it is inside an entity of type MONEY.
(iii) the, and, to - O - these are common words that are not part of any entity.

In [None]:
displacy.render(doc, jupyter=True, style="ent")