
Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP) that identifies named entities in a text and assigns each one a category. Named entities are objects or concepts that are referred to by name, such as persons, organizations, and locations; most NER schemes also cover dates and numerical expressions such as monetary amounts.

NER systems typically use machine learning models to recognize and classify entities based on their context and characteristics. NER supports a wide range of applications, such as information extraction, question answering, text classification, and sentiment analysis.

The output of NER is a structured representation of the text in which each entity span is tagged with one of a predefined set of categories. This turns unstructured text into structured information that is easier to process and analyze, which is why NER is a component of many NLP applications.
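
As a rough illustration of what "structured representation" means here, the sketch below stores entities as plain records and groups them by category. The `Entity` class and `group_by_label` helper are hypothetical names for this example and are not part of any NER library; the two records are copied by hand from the spaCy output for the Tesla sentence later in this notebook.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One tagged span: the surface text plus its predefined category."""
    text: str
    label: str


def group_by_label(entities):
    """Return {label: [entity texts]}, keeping the input order."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent.label].append(ent.text)
    return dict(grouped)


# Hand-copied from the notebook's spaCy output, not computed here.
entities = [Entity("Tesla Inc", "ORG"), Entity("$45 billion", "MONEY")]
print(group_by_label(entities))
```

Once entities are in this form, ordinary data tooling (filters, counters, joins against a database) can be applied to them.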


In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [3]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
#print(doc.ents)
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [4]:
from spacy import displacy

displacy.render(doc, style="ent")

## Listing all entity labels


In [5]:
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [6]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
doc.ents

(Michael Bloomberg, Bloomberg, 1982)

In [7]:
doc = nlp("Michael Bloomberg founded Bloomberg in 1982")
for ent in doc.ents:
    print(ent.text, "|", ent.label_, "|", spacy.explain(ent.label_))

Michael Bloomberg | PERSON | People, including fictional
Bloomberg | GPE | Countries, cities, states
1982 | DATE | Absolute or relative dates or periods


In [8]:
doc = nlp("Tesla Inc is going to acquire Twitter Inc for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", ent.start_char, "|", ent.end_char)

Tesla Inc  |  ORG  |  0 | 9
Twitter Inc  |  PERSON  |  30 | 41
$45 billion  |  MONEY  |  46 | 57
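
The `start_char` and `end_char` values are character offsets into the original string, with the end offset exclusive, so slicing the text with them recovers each entity. The sketch below uses them to mark up the sentence inline, a plain-text stand-in for what `displacy` renders. The `annotate` helper is a hypothetical name, and the offsets and labels are copied by hand from the output above (including the model's PERSON label for Twitter Inc).

```python
def annotate(text, spans):
    """Wrap each (start, end, label) character span as [text](LABEL).

    Assumes spans do not overlap; end offsets are exclusive.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # plain text before the entity
        pieces.append(f"[{text[start:end]}]({label})")
        cursor = end
    pieces.append(text[cursor:])  # trailing plain text, if any
    return "".join(pieces)


text = "Tesla Inc is going to acquire Twitter Inc for $45 billion"
# Offsets and labels copied from the cell output above, not computed here.
spans = [(0, 9, "ORG"), (30, 41, "PERSON"), (46, 57, "MONEY")]
print(annotate(text, spans))
```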


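## Setting custom entities

In the outputs above, the small model labels Twitter as PERSON. The cells below overwrite that label manually using `Span` and `doc.set_ents`.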


In [9]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  PERSON
$45 billion  |  MONEY


In [10]:
s = doc[2:5]
s

going to acquire

In [11]:
type(s)

spacy.tokens.span.Span

In [12]:
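from spacy.tokens import Span

# Spans are built from token indices: doc[0:1] is "Tesla", doc[5:6] is "Twitter"
s1 = Span(doc, 0, 1, label="ORG")
s2 = Span(doc, 5, 6, label="ORG")

# default="unmodified" leaves entities outside these spans (here, MONEY) as they were
doc.set_ents([s1, s2], default="unmodified")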

In [13]:
for ent in doc.ents:
    print(ent.text, " | ", ent.label_)

Tesla  |  ORG
Twitter  |  ORG
$45 billion  |  MONEY


In [19]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [20]:
# Define input text
input_text = "Steve Jobs was the CEO of Apple Corp. in California."

# Tokenize input text
tokens = word_tokenize(input_text)

# Perform Part-of-Speech (POS) tagging
pos_tags = pos_tag(tokens)
print('pos_tags',pos_tags)

# Perform Named Entity Recognition (NER)
ne_tree = ne_chunk(pos_tags)
print('ne_tree',ne_tree)

# Extract named entities and their labels
# Note: 'GPE' is not in the filter below, so the California chunk is skipped
named_entities = []
for subtree in ne_tree.subtrees():
    if subtree.label() in ['PERSON', 'ORGANIZATION', 'LOCATION']:
        named_entity = ' '.join(word for word, tag in subtree.leaves())
        print(named_entity)
        named_entities.append((named_entity, subtree.label()))

# Print named entities and their labels
print(named_entities)


pos_tags [('Steve', 'NNP'), ('Jobs', 'NNP'), ('was', 'VBD'), ('the', 'DT'), ('CEO', 'NNP'), ('of', 'IN'), ('Apple', 'NNP'), ('Corp.', 'NNP'), ('in', 'IN'), ('California', 'NNP'), ('.', '.')]
ne_tree (S
  (PERSON Steve/NNP)
  (PERSON Jobs/NNP)
  was/VBD
  the/DT
  (ORGANIZATION CEO/NNP)
  of/IN
  (ORGANIZATION Apple/NNP Corp./NNP)
  in/IN
  (GPE California/NNP)
  ./.)
Steve
Jobs
CEO
Apple Corp.
[('Steve', 'PERSON'), ('Jobs', 'PERSON'), ('CEO', 'ORGANIZATION'), ('Apple Corp.', 'ORGANIZATION')]


In [21]:
# `subtree` still refers to the last subtree from the loop above (the GPE chunk)
for word, tag in subtree.leaves():
    print(word)

California
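
In the NLTK output above, "Steve" and "Jobs" come out as two separate PERSON chunks. One possible post-processing step, sketched here as an assumption rather than anything NLTK does itself, is to merge adjacent chunks that share a label. The `merge_adjacent` helper is a hypothetical name, and the `chunks` list is a hand-flattened copy of the top-level nodes of `ne_tree`, with label `None` for tokens outside any chunk.

```python
from itertools import groupby


def merge_adjacent(chunks):
    """Merge runs of neighbouring chunks that carry the same non-None label."""
    merged = []
    for label, run in groupby(chunks, key=lambda chunk: chunk[1]):
        if label is None:
            continue  # tokens outside any named-entity chunk
        merged.append((" ".join(text for text, _ in run), label))
    return merged


# Hand-flattened from the ne_tree output above, not computed here.
chunks = [
    ("Steve", "PERSON"), ("Jobs", "PERSON"), ("was", None), ("the", None),
    ("CEO", "ORGANIZATION"), ("of", None), ("Apple Corp.", "ORGANIZATION"),
    ("in", None), ("California", "GPE"), (".", None),
]
print(merge_adjacent(chunks))
```

This heuristic has a cost: two genuinely distinct entities of the same type that happen to sit next to each other would also be joined, so it is worth checking against real data before relying on it.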
