<a href="https://colab.research.google.com/github/goel4ever/machine-learning-notebooks/blob/main/nlp_name_entity_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Named Entity Recognition (NER)

`Named entities` are noun phrases that refer to specific locations, people, organizations, and so on. Named entity recognition (NER) lets you find the named entities in a text and determine what kind of entity each one is, for example `PERSON`, `ORGANIZATION`, or `GPE` (geo-political entity).

Reference: the [Named Entity Recognition section of the NLTK book, chapter 7](https://www.nltk.org/book/ch07.html#sec-ner)

In [1]:
# Required imports
import nltk

In [3]:
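# Download the pre-trained named entity chunker and the word list it relies on
# (on newer NLTK releases the lookup error may ask for a differently named
# resource, such as "punkt_tab"; download whichever name the error reports)
nltk.download("maxent_ne_chunker")
nltk.download("words")

# Download the punkt resource used for tokenization
nltk.download("punkt")

# Tokenizer used in the cells below
from nltk.tokenize import word_tokenize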

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
# We'll reuse the quote used in chunking
quote = "It's a dangerous business, Frodo, going out your door."

In [5]:
# Tokenize the string by word
words_in_quote = word_tokenize(quote)
words_in_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

In [6]:
# Download the tagger model, then tag each token with its part of speech
nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(words_in_quote)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]
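
The tags follow the Penn Treebank convention, where `NNP` marks a singular proper noun. Since named entities are mostly proper nouns, a quick sanity check before chunking is to list the tokens tagged `NNP` or `NNPS`. The sketch below hard-codes the tagged output shown above so that it runs without any downloads; `proper_nouns` and `tagged_quote` are hypothetical names for illustration, not part of NLTK.

```python
# Tagged tokens copied from the output above (what nltk.pos_tag returned).
tagged_quote = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

def proper_nouns(tagged_tokens):
    """Return the words tagged as proper nouns (NNP or NNPS), in order."""
    return [word for word, tag in tagged_tokens if tag in ("NNP", "NNPS")]

print(proper_nouns(tagged_quote))  # ['Frodo']
```

A proper-noun tag is only a hint: the chunker in the next cell decides which of these tokens actually form a named entity.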

In [8]:
# Use nltk.ne_chunk() to recognize named entities
tree = nltk.ne_chunk(pos_tags)

# tree.draw() opens a Tkinter window, so it fails in headless notebooks such as Colab
# tree.draw()

# pretty_print() prints the tree itself and returns None, so don't wrap it in print()
tree.pretty_print()

# Frodo has been tagged as a PERSON

                                                 S                                                  
   ______________________________________________|_____________________________________________      
  |      |     |        |            |       |   |      |       |        |        |     |    PERSON 
  |      |     |        |            |       |   |      |       |        |        |     |      |     
It/PRP 's/VBZ a/DT dangerous/JJ business/NN ,/, ,/, going/VBG out/RP your/PRP$ door/NN ./. Frodo/NNP
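
To work with the result in code rather than read it off the drawing, you can walk the top level of the tree: plain tokens are `(word, tag)` tuples, and each named entity is a subtree whose `label()` is the entity type. The following is a minimal sketch. `labeled_entities` is a hypothetical helper, and the tree is rebuilt by hand with `nltk.Tree` to mirror the output above, so the example runs without the downloaded models. In the notebook you would pass the `tree` returned by `nltk.ne_chunk(pos_tags)` instead.

```python
from nltk import Tree

# Hand-built copy of the tree printed above (assumption: same shape as ne_chunk's result).
example_tree = Tree("S", [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","),
    Tree("PERSON", [("Frodo", "NNP")]),
    (",", ","), ("going", "VBG"), ("out", "RP"), ("your", "PRP$"),
    ("door", "NN"), (".", "."),
])

def labeled_entities(tree):
    """Return (entity text, entity label) pairs for the top-level subtrees."""
    return [
        (" ".join(word for word, tag in node.leaves()), node.label())
        for node in tree
        if isinstance(node, Tree)
    ]

print(labeled_entities(example_tree))  # [('Frodo', 'PERSON')]
```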


In [9]:
# Pass binary=True to mark named entities without classifying them:
# every entity then gets the generic label NE
tree = nltk.ne_chunk(pos_tags, binary=True)

# pretty_print() already prints the tree; wrapping it in print() would also print None
tree.pretty_print()

                                                 S                                                  
   ______________________________________________|_____________________________________________      
  |      |     |        |            |       |   |      |       |        |        |     |      NE   
  |      |     |        |            |       |   |      |       |        |        |     |      |     
It/PRP 's/VBZ a/DT dangerous/JJ business/NN ,/, ,/, going/VBG out/RP your/PRP$ door/NN ./. Frodo/NNP


In [10]:
# Take this one step further and extract named entities directly from your text.
# Create a string from which to extract named entities.
quote = """
Men like Schiaparelli watched the red planet—it is odd, by-the-bye, that
for countless centuries Mars has been the star of war—but failed to
interpret the fluctuating appearances of the markings they mapped so well.
All that time the Martians must have been getting ready.

During the opposition of 1894 a great light was seen on the illuminated
part of the disk, first at the Lick Observatory, then by Perrotin of Nice,
and then by other observers. English readers heard of it first in the
issue of Nature dated August 2.
"""

In [11]:
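# Create a function to extract named entities
def extract_ne(quote):
    """Return the set of named entities found in `quote`, each as one string."""
    words = word_tokenize(quote, language="english")
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    # Top-level children are either (word, tag) tuples or subtrees;
    # only subtrees have label(), and with binary=True that label is "NE"
    return set(
        " ".join(word for word, tag in subtree)
        for subtree in tree
        if hasattr(subtree, "label") and subtree.label() == "NE"
    )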

In [12]:
# Gather all named entities, with no repeats
extract_ne(quote)

# The chunker missed the city of Nice, possibly because NLTK read it as the ordinary English adjective "nice". It still found the following:
# An institution: 'Lick Observatory'
# A planet: 'Mars'
# A publication: 'Nature'
# People: 'Perrotin', 'Schiaparelli'

{'Lick Observatory', 'Mars', 'Nature', 'Perrotin', 'Schiaparelli'}
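
One way to make the miss concrete is to compare the result against a hand-written list of the entities you expected. The sketch below computes precision and recall with plain set operations. The `expected` set is an assumption for illustration (the five entities found plus Nice), not a gold annotation from the NLTK book, and `found` is copied from the output above.

```python
# Entities returned by extract_ne(quote), copied from the output above.
found = {"Lick Observatory", "Mars", "Nature", "Perrotin", "Schiaparelli"}

# Hypothetical hand-made reference list: the found entities plus the missed city.
expected = found | {"Nice"}

correct = found & expected
missed = expected - found
precision = len(correct) / len(found)    # share of found entities that are right
recall = len(correct) / len(expected)    # share of expected entities that were found

print(sorted(missed))  # ['Nice']
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.83
```

A stricter reference list would lower these numbers, so treat them as a way to track changes between runs rather than as a measure of the chunker's quality.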