<a href="https://colab.research.google.com/github/guilhermelaviola/NaturalLanguageProcessing/blob/main/Class08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Named Entity Recognition**
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing that aims to identify and classify entities such as people, places, dates, and organizations in unstructured text, transforming it into structured data for computational analysis. A key component of this process is manual entity annotation, which, despite being time-consuming, is essential for training accurate machine learning models. Tools such as NLTK and spaCy support NER by offering efficient text processing and pre-trained models for multiple languages. NER has wide-ranging applications, including question answering, information extraction, machine translation, and intelligent assistants, where it enhances accuracy, preserves meaning, and enables systems to understand and act on user input. Overall, NER plays a crucial role in advancing NLP by enabling deeper understanding and effective use of textual data.

In [1]:
! pip3 install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=6252f895b46fa05c9b663a5987d25dc0c94ef7dcb20abfd39d6c4a47becd46cd
  Stored in directory: /root/.cache/pip/wheels/63/47/7c/a9688349aa74d228ce0a9023229c6c0ac52ca2a40fe87679b8
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [2]:
# Importing all the necessary libraries and resources:
import spacy
import re
import nltk
import wikipedia
import os

## **Example: Named Entity Recognition in Texts**
Named Entity Recognition (NER) is an essential task in the field of Natural Language Processing (NLP) that consists of identifying and classifying significant semantic elements in texts, such as names of people, organizations, places, and dates. The importance of NER lies in its ability to assign meaning and structure to raw textual data, facilitating the understanding and automatic analysis of such data.

In [3]:
# Loading the English language model:
nlp = spacy.load('en_core_web_sm')

# Example text for entity recognition:
text = 'Barack Obama was born in Honolulu, Hawaii, on August 4, 1961.'

# Text processing:
doc = nlp(text)

# Displaying the recognized entities:
for entity in doc.ents:
  print(entity.text, entity.label_)

Barack Obama PERSON
Honolulu GPE
Hawaii GPE
August 4, 1961 DATE
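
The pairs printed above are already close to structured data. As a minimal sketch, assuming the entities have been collected as `(text, label)` tuples (here copied by hand from the output above), they can be grouped by label. The helper name `group_by_label` is hypothetical and not part of spaCy.

```python
from collections import defaultdict

def group_by_label(pairs):
    """Group (text, label) pairs into a {label: [texts]} dictionary."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Pairs copied from the output printed above.
pairs = [
    ('Barack Obama', 'PERSON'),
    ('Honolulu', 'GPE'),
    ('Hawaii', 'GPE'),
    ('August 4, 1961', 'DATE'),
]
structured = group_by_label(pairs)
print(structured)
```

With a spaCy `doc`, the same list could be built as `[(entity.text, entity.label_) for entity in doc.ents]`.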


## **Example: Named Entity Annotation**
Named entity annotation is a critical process in Natural Language Processing (NLP), involving the identification and classification of text segments as meaningful entities, such as names of people, places, organizations, and others. This step is fundamental for training machine learning models capable of automatically processing and interpreting large volumes of text.

In [4]:
text = 'Cristiano Ronaldo played for Real Madrid.'
rules = {'player': ['Cristiano Ronaldo'], 'club': ['Real Madrid']}

def annotate_text(text, rules):
  # Wrap every listed term in a tag named after its entity type.
  for entity, terms in rules.items():
    for term in terms:
      text = text.replace(term, f'<{entity}>{term}</{entity}>')
  # Return only after all rules have been applied.
  return text

annotated_text = annotate_text(text, rules)
print(annotated_text)

<player>Cristiano Ronaldo</player> played for <club>Real Madrid</club>.
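
Inline tags are easy to read, but annotations are often stored as character offsets so that the original text stays untouched. The following is a minimal sketch of that idea, assuming the same `rules` dictionary as above. The function name `find_entity_spans` and the `(start, end, label)` layout are illustrative choices, not a format required by any particular library.

```python
import re

def find_entity_spans(text, rules):
    """Return sorted (start, end, label) character spans for each term in rules."""
    spans = []
    for label, terms in rules.items():
        for term in terms:
            # Word boundaries stop a term from matching inside a longer word.
            pattern = r'\b' + re.escape(term) + r'\b'
            for match in re.finditer(pattern, text):
                spans.append((match.start(), match.end(), label))
    return sorted(spans)

text = 'Cristiano Ronaldo played for Real Madrid.'
rules = {'player': ['Cristiano Ronaldo'], 'club': ['Real Madrid']}
spans = find_entity_spans(text, rules)
for start, end, label in spans:
    print(start, end, label, text[start:end])
```

Slicing the text with each span, as in the loop, is a quick way to check that the offsets point at the intended term.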


## **Example: NLTK and spaCy Libraries for NER**
In the context of Named Entity Recognition (NER) in Natural Language Processing (NLP), the NLTK (Natural Language Toolkit) and spaCy libraries are essential tools. Both offer pre-trained models that identify and classify entities in text. However, these generic models may not cover the vocabulary and entity types of every domain, so specialized applications sometimes require manual annotation and the training of domain-specific models.

In [7]:
nltk.download('punkt_tab') # Tokenizer data used by word_tokenize
nltk.download('averaged_perceptron_tagger_eng') # English part-of-speech tagger used by pos_tag
nltk.download('maxent_ne_chunker_tab') # Named entity chunker used by ne_chunk
nltk.download('words') # Word list required by the chunker

text = 'Henrikh Mkhitaryan was born in Yerevan.'
tokens = nltk.word_tokenize(text)
tags = nltk.pos_tag(tokens)
entities = nltk.chunk.ne_chunk(tags)

print(entities)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


(S
  (PERSON Henrikh/NNP)
  (PERSON Mkhitaryan/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Yerevan/NNP)
  ./.)
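
In the tree above, the chunker labels `Henrikh` and `Mkhitaryan` as two separate PERSON chunks. One possible post-processing step is to merge neighbouring chunks that share a label. The sketch below uses plain `(label, tokens)` tuples typed in by hand from the tree, rather than NLTK `Tree` objects, so it runs without any downloads. The helper `merge_adjacent` and the tuple layout are assumptions made for illustration, and merging would be wrong wherever two distinct entities of the same type happen to stand next to each other.

```python
def merge_adjacent(chunks):
    """Merge neighbouring (label, tokens) chunks that share a label.

    Items whose label is None are ordinary tokens and are never merged.
    """
    merged = []
    for label, tokens in chunks:
        if label is not None and merged and merged[-1][0] == label:
            merged[-1] = (label, merged[-1][1] + list(tokens))
        else:
            merged.append((label, list(tokens)))
    return merged

# Hand-written copy of the tree printed above.
chunks = [
    ('PERSON', ['Henrikh']),
    ('PERSON', ['Mkhitaryan']),
    (None, ['was']),
    (None, ['born']),
    (None, ['in']),
    ('GPE', ['Yerevan']),
    (None, ['.']),
]
merged_entities = [(' '.join(tokens), label)
                   for label, tokens in merge_adjacent(chunks)
                   if label is not None]
print(merged_entities)
```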


## **Example: Named Entity Recognition**
In the following example, the script downloads a Wikipedia article and analyzes it with two NLP frameworks, NLTK and spaCy, to identify named entities and linguistic structure such as tokens and part-of-speech tags.

In [10]:
wikipedia.set_lang('en')

page = wikipedia.page('Roy Ayers')
page.content

# Named entity recognition with NLTK:
! pip3 install svgling # Lets the notebook render NLTK trees as SVG

# NLTK resources for tokenizing, tagging, and chunking:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('maxent_ne_chunker_tab')

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

tokens = word_tokenize(page.content)
print(tokens)

tags = pos_tag(tokens)
print(tags)

entities = ne_chunk(tags)
entities

# Named entity recognition with spaCy:

# Downloading the small English model:
!python -m spacy download en_core_web_sm

nlp = spacy.load('en_core_web_sm')
doc = nlp(page.content)

# Inspecting the first token: its text, part-of-speech tag, and entity type:
print(doc[0], doc[0].pos_, doc[0].ent_type_)

# Printing the IOB tag and entity type of the first 60 tokens:
for token in doc[:60]:
  print((token, token.ent_iob_, token.ent_type_))
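
The loop above prints one IOB tag per token: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. As a minimal sketch of how such tags map back to whole entities, the helper below (a hypothetical `iob_to_spans`, not a spaCy function) rebuilds `(text, label)` pairs from `(token, iob, label)` triples. The sample triples are hand-written to agree with the entities printed in the earlier Barack Obama example and are not actual model output. Tokens are joined with spaces, so punctuation spacing can differ from the original text.

```python
def iob_to_spans(triples):
    """Collect (entity_text, label) pairs from (token, iob, label) triples."""
    spans, current, current_label = [], [], None
    for token, iob, label in triples:
        if iob == 'B':
            # A new entity starts; close any entity that was still open.
            if current:
                spans.append((' '.join(current), current_label))
            current, current_label = [token], label
        elif iob == 'I' and current:
            current.append(token)
        else:
            # Outside any entity; close the open one, if there is one.
            if current:
                spans.append((' '.join(current), current_label))
            current, current_label = [], None
    if current:
        spans.append((' '.join(current), current_label))
    return spans

# Hand-written triples, consistent with the earlier Barack Obama example.
triples = [
    ('Barack', 'B', 'PERSON'), ('Obama', 'I', 'PERSON'),
    ('was', 'O', ''), ('born', 'O', ''), ('in', 'O', ''),
    ('Honolulu', 'B', 'GPE'), (',', 'O', ''),
    ('Hawaii', 'B', 'GPE'), (',', 'O', ''), ('on', 'O', ''),
    ('August', 'B', 'DATE'), ('4', 'I', 'DATE'),
    (',', 'I', 'DATE'), ('1961', 'I', 'DATE'), ('.', 'O', ''),
]
print(iob_to_spans(triples))
```

With a spaCy `doc`, the triples could be built as `[(token.text, token.ent_iob_, token.ent_type_) for token in doc]`, although `doc.ents` already provides the assembled entities directly.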



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!


['Roy', 'Edward', 'Ayers', 'Jr.', '(', 'September', '10', ',', '1940', '–', 'March', '4', ',', '2025', ')', 'was', 'an', 'American', 'vibraphonist', ',', 'record', 'producer', ',', 'and', 'composer', '.', 'Ayers', 'began', 'his', 'career', 'as', 'a', 'post-bop', 'jazz', 'artist', ',', 'releasing', 'several', 'studio', 'albums', 'with', 'Atlantic', 'Records', ',', 'before', 'his', 'tenure', 'at', 'Polydor', 'Records', 'beginning', 'in', 'the', '1970s', ',', 'during', 'which', 'he', 'helped', 'to', 'pioneer', 'jazz-funk', '.', 'He', 'was', 'a', 'key', 'figure', 'in', 'the', 'acid', 'jazz', 'movement', ',', 'and', 'has', 'been', 'described', 'as', '``', 'The', 'Godfather', 'of', 'Neo', 'Soul', "''", '.', 'He', 'was', 'best', 'known', 'for', 'his', 'compositions', '``', 'Everybody', 'Loves', 'the', 'Sunshine', "''", ',', '``', 'Running', 'Away', "''", ',', 'and', '``', 'Freaky', 'Deaky', "''", 'and', 'others', 'that', 'charted', 'in', 'the', '1970s', '.', 'At', 'one', 'time', ',', 'Ayers',