In [33]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True
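
Downloading the `all` collection fetches every corpus and model NLTK distributes, which is far more than this notebook uses. A lighter alternative is sketched below. The helper `download_ner_resources` and the list `NER_RESOURCES` are hypothetical names, and the resource names themselves are an assumption: they differ between NLTK versions (newer releases use the `_tab` and `_eng` variants), so treat the list as a starting point rather than a definitive set.

```python
import nltk

# Resource names are an assumption: older NLTK releases use the first name in
# each pair, newer ones the "_tab"/"_eng" variant. Names that do not exist in
# the installed version's index fail to download and are reported as False.
NER_RESOURCES = [
    "punkt", "punkt_tab",                          # used by word_tokenize
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",              # used by pos_tag
    "maxent_ne_chunker", "maxent_ne_chunker_tab",  # used by ne_chunk
    "words",                                       # word list used by ne_chunk
]

def download_ner_resources(quiet=True):
    """Try each resource and return {name: True/False} (hypothetical helper)."""
    return {name: nltk.download(name, quiet=quiet) for name in NER_RESOURCES}
```

Calling `download_ner_resources()` once per environment returns a dictionary showing which names were fetched.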

Named Entity Recognition with NLTK

---


This cell uses the NLTK (Natural Language Toolkit) library to identify named entities in the text. This process involves a few distinct steps.

import nltk, from nltk.tokenize import ..., from nltk.tag import ...: These lines import the necessary functions. word_tokenize splits text into tokens, pos_tag assigns parts of speech (like noun, verb), and ne_chunk, called here as nltk.ne_chunk, performs the entity recognition.

text = "...": Defines the sample sentence to be analyzed.

words = word_tokenize(text): The sentence is broken down into a list of individual words and punctuation, a process called tokenization.

postags = pos_tag(words): Each word (token) is tagged with its part of speech (POS). For example, "Modi" is tagged as a proper noun (NNP). This step is crucial because NLTK's NER relies on these grammatical tags to identify entities.

ner=nltk.ne_chunk(postags, binary=False): This is the main NER step. It takes the POS-tagged words and "chunks" them together to form named entities. The binary=False argument tells it to classify entities into specific types (like PERSON, GPE for Geo-Political Entity, ORGANIZATION) rather than just marking them as a generic "Named Entity."

print(ner): The output displays a tree-like structure where the identified entities are labeled. You'll see "Narendra Modi" identified as a PERSON and "India" as a GPE.

In [34]:
import nltk  # needed for nltk.ne_chunk below
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = """Prime Minister Narendra Modi on Tuesday announced 20 lakh crore package for the India to fight against coronavirus pandemic"""
words = word_tokenize(text)
postags = pos_tag(words)

ner=nltk.ne_chunk(postags, binary=False)
print(ner)

(S
  Prime/NNP
  Minister/NNP
  (PERSON Narendra/NNP Modi/NNP)
  on/IN
  Tuesday/NNP
  announced/VBD
  20/CD
  lakh/NN
  crore/NN
  package/NN
  for/IN
  the/DT
  (GPE India/NNP)
  to/TO
  fight/VB
  against/IN
  coronavirus/NN
  pandemic/NN)
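
The printed tree can also be flattened into plain (text, label) pairs, which is often easier to work with than the nested structure. A minimal sketch, assuming the tree shape shown above: the tree is rebuilt by hand with `nltk.Tree` (abbreviated to a few tokens) so the snippet runs without any downloaded models, and `extract_entities` is a hypothetical helper name.

```python
from nltk import Tree

# Hand-built, abbreviated copy of the ne_chunk output shown above.
sample = Tree("S", [
    ("Prime", "NNP"), ("Minister", "NNP"),
    Tree("PERSON", [("Narendra", "NNP"), ("Modi", "NNP")]),
    ("on", "IN"), ("Tuesday", "NNP"),
    ("the", "DT"),
    Tree("GPE", [("India", "NNP")]),
])

def extract_entities(tree):
    """Return (entity text, label) for each labelled chunk under the root."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entities are nested Tree objects.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities

print(extract_entities(sample))
```

In the notebook, `extract_entities(ner)` would apply the same traversal to the real `ne_chunk` result.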


Named Entity Recognition with spaCy

---


This cell performs the same task as the NLTK cell above but uses spaCy, an NLP library that runs tokenization, POS tagging, and NER in a single pretrained pipeline. 🚀

import spacy: Imports the spaCy library.

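nlp=spacy.load('en_core_web_sm'): Loads a small pre-trained English language model. This single model contains everything needed for tokenization, POS tagging, and NER, making the process much simpler than NLTK's multi-step approach. The model is installed separately from the library; if it is missing, spacy.load raises an OSError, and running python -m spacy download en_core_web_sm installs it.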

doc=nlp(text): The text is processed by the nlp model pipeline. The result is a doc object which contains a wealth of information about the text, including the identified named entities.

for ent in doc.ents:: This loop iterates directly over the entities found in the doc. spaCy automatically finds and labels them.

print(ent.text, ent.label_): For each entity found, its text (e.g., "Narendra Modi") and its label (e.g., "PERSON") are printed. This approach is more direct than NLTK's. In the output, spaCy also labels "20" as CARDINAL (a number) and "Tuesday" as a DATE. Note that only "20" is tagged, not the full phrase "20 lakh crore".

In [35]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [36]:
doc=nlp(text)
for ent in doc.ents:
  print(ent.text, ent.label_)

Narendra Modi PERSON
Tuesday DATE
20 CARDINAL
India GPE


In [37]:
from spacy import displacy
displacy.render(doc, style='ent', jupyter=True)

In [None]:
# GPE means geopolitical entity (e.g. countries, cities, states)
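
spaCy ships a small glossary for its label names, so the meaning of a label such as GPE can be looked up from code instead of from memory. A brief sketch using `spacy.explain` on the four labels printed earlier; the `labels` list and `descriptions` dictionary are illustrative names, and no model needs to be loaded for this lookup.

```python
import spacy

# Labels taken from the entity output printed earlier in the notebook.
labels = ["PERSON", "DATE", "CARDINAL", "GPE"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```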