# **Natural Language Processing**
## *Practice 4 - Entity Recognition*

## Objectives

* Learn to identify and classify the entities of a text using the ***spaCy*** library.

## Entity recognition

The goal of an entity recognition system, also known as a "NER system" (*Named Entity Recognition system*), is to identify the entities found in a text and classify them into predefined categories (person, organization, etc.). Entity recognition is usually divided into two subtasks: detecting the entities and identifying the type of each detected entity.

This task is useful, among other things, for improving the accuracy of question answering systems, which can then return the text fragment that contains the answer to the user's question instead of the whole text.

For this practice we will use the [***spaCy***](https://spacy.io/usage/linguistic-features#named-entities) library and its automatic entity recognizer with, as always, a pre-trained model:

In [1]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

In [2]:
sentence = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices.'

In [3]:
doc = nlp(sentence)

We can see the type of each entity by navigating through the *doc*:

In [4]:
for word in doc:
    print(word.text, word.ent_type_)

European NORP
authorities 
fined 
Google ORG
a 
record 
$ MONEY
5.1 MONEY
billion MONEY
on 
Wednesday DATE
for 
abusing 
its 
power 
in 
the 
mobile 
phone 
market 
and 
ordered 
the 
company 
to 
alter 
its 
practices 
. 


As we can see, the entity *European* is of type **NORP** (nationalities or religious or political groups), *Google* is an organization (**ORG**), *$5.1 billion* is a monetary value (**MONEY**) and *Wednesday* is a date (**DATE**).

However, this token-by-token output does not make entity boundaries clear, since an entity can be composed of several tokens ($5.1 billion).

In entity recognition there is a word-level tagging scheme called BIO-tagging (Begin-Inside-Outside). So, "B" means the token starts an entity, "I" means it is inside an entity, and "O" means it is outside an entity. Let's see an example:

In [5]:
for word in doc:
    print(word.text, word.ent_iob_, word.ent_type_)

European B NORP
authorities O 
fined O 
Google B ORG
a O 
record O 
$ B MONEY
5.1 I MONEY
billion I MONEY
on O 
Wednesday B DATE
for O 
abusing O 
its O 
power O 
in O 
the O 
mobile O 
phone O 
market O 
and O 
ordered O 
the O 
company O 
to O 
alter O 
its O 
practices O 
. O 
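
The BIO tags above are enough to rebuild multi-token entities by hand. The following is a minimal sketch, in plain Python, of how the tags could be grouped into spans; the function name `bio_to_spans` is hypothetical, and the token list is a fragment copied from the output above. Joining tokens with a space is a simplification, since spaCy's own `ent.text` preserves the original whitespace.

```python
def bio_to_spans(tokens, iob_tags, ent_types):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, iob, ent_type in zip(tokens, iob_tags, ent_types):
        if iob == "B":
            # A new entity starts: close the previous one, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], ent_type
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray "I"): close the open entity, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Fragment of the tagged sentence shown above.
tokens = ["Google", "a", "record", "$", "5.1", "billion", "on", "Wednesday"]
iob_tags = ["B", "O", "O", "B", "I", "I", "O", "B"]
ent_types = ["ORG", "", "", "MONEY", "MONEY", "MONEY", "", "DATE"]

print(bio_to_spans(tokens, iob_tags, ent_types))
# [('Google', 'ORG'), ('$ 5.1 billion', 'MONEY'), ('Wednesday', 'DATE')]
```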


To display only the entities and also to see the entities composed of several words, we can use the **ents** attribute of the doc:

In [6]:
for ent in doc.ents:
    print(ent.text, ent.label_)

European NORP
Google ORG
$5.1 billion MONEY
Wednesday DATE


We can also visualize the entities with *displacy*:

In [7]:
from spacy import displacy

In [8]:
displacy.render(doc, jupyter=True, style='ent')


The available entity categories depend on the model used. For example, with the "*en_core_web_sm*" model, the existing entity types are:

*   PERSON People, including fictional.
*   NORP Nationalities or religious or political groups.
*   FAC Buildings, airports, highways, bridges, etc.
*   ORG Companies, agencies, institutions, etc.
*   GPE  Geopolitical entities including countries, cities, states.
*   LOC Non-GPE locations, mountain ranges, bodies of water.
*   PRODUCT Objects, vehicles, foods, etc. (Not services.)
*   EVENT Named hurricanes, battles, wars, sports events, etc.
*   WORK_OF_ART Titles of books, songs, etc.
*   LAW Named documents made into laws.
*   LANGUAGE Any named language.
*   DATE Absolute or relative dates or periods.
*   TIME Times smaller than a day.
*   etc...


## Exercises

The result of this practice must be submitted via PLATEA, and the deadline is **23:59 hours on March 17th, 2025**. Submit this same notebook, with the *.ipynb* extension, renamed as follows: pr5_user1_user2.ipynb. Replace "user1" and "user2" with your email aliases.

Download the file "nyt.txt" which is available in Docencia Virtual (Material Complementario folder) and carry out the following tasks:

### Exercise 1

For each category, extract the entities present in the text and their frequency of occurrence (taking multi-word entities into account), and answer the following questions:
- From the entities extracted, indicate in brief what the news seems to be about.
- Which two people are the most mentioned in the news?
- What are the three most referenced geopolitical entities?

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [11]:
file_path = '/content/drive/MyDrive/NLP/nyt.txt'

In [59]:
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

with open(file_path, 'r', encoding='cp1252') as file:
    text = file.read()

doc = nlp(text)

All entities in the text:

In [60]:
for ent in doc.ents:
    print(ent.text, ent.label_)

WASHINGTON GPE
nearly four years ago DATE
Abdel Fattah PERSON
Egypt GPE
the White House ORG
Trump PERSON
Monday DATE
Sisi PERSON
Trump ORG
Sisi PERSON
the Oval Office FAC
el-Sisi PERSON
Egypt GPE
Egypt GPE
The United States GPE
one CARDINAL
Trump ORG
American NORP
Sisi PERSON
Trump ORG
Egypt GPE
Trump ORG
the United States GPE
Sisi NORP
Cairo GPE
the six minutes TIME
the Oval Office FAC
Barack Obama PERSON
the White House ORG
American NORP
Sisi PERSON
Eric Trager PERSON
Egypt GPE
the Washington Institute for Near East Policy ORG
the White House ORG
Sisi’s ORG
Egypt GPE
hundreds CARDINAL
Cairo GPE
first ORDINAL
Egyptian NORP
Washington GPE
2009 DATE
Hosni Mubarak PERSON
the waning years DATE
Egyptians NORP
Mubarak PERSON
2011 DATE
the Muslim Brotherhood’s ORG
Mohamed Morsi ORG
Morsi PERSON
two years later DATE
Sisi PERSON
Sisi NORP
97 percent PERCENT
Trump ORG
America GPE
Middle Eastern LOC
Monday DATE
Trump ORG
the United States GPE
Trump ORG
Sisi PERSON
Sisi PERSON
Trumpian NORP
Egypt

Number of entities in each category:

In [61]:
entity_frequencies = {}
for ent in doc.ents:
    if ent.label_ in entity_frequencies:
        entity_frequencies[ent.label_].append(ent.text)
    else:
        entity_frequencies[ent.label_] = [ent.text]

for entity_type, entities in entity_frequencies.items():
    print(f"Entity Type: {entity_type} - {len(entities)}")


Entity Type: GPE - 38
Entity Type: DATE - 19
Entity Type: PERSON - 29
Entity Type: ORG - 41
Entity Type: FAC - 2
Entity Type: CARDINAL - 8
Entity Type: NORP - 34
Entity Type: TIME - 1
Entity Type: ORDINAL - 3
Entity Type: PERCENT - 1
Entity Type: LOC - 5
Entity Type: LANGUAGE - 1
Entity Type: MONEY - 1


Frequency of each entity

In [62]:
from collections import Counter

for entity_type, entities in entity_frequencies.items():
    print(f"Entity Type: {entity_type}")
    entity_counts = Counter(entities)
    for entity, frequency in entity_counts.items():
        print(f"  {entity}: {frequency}")

Entity Type: GPE
  WASHINGTON: 1
  Egypt: 15
  The United States: 1
  the United States: 3
  Cairo: 6
  Washington: 2
  America: 1
  Russia: 1
  Rome: 1
  Mexico: 1
  China: 1
  Turkey: 1
  United States: 1
  Sinai: 1
  U.S.: 2
Entity Type: DATE
  nearly four years ago: 1
  Monday: 3
  2009: 1
  the waning years: 1
  2011: 1
  two years later: 1
  the century: 2
  September: 1
  December: 1
  five years: 1
  January 2015: 1
  last year: 3
  recent months: 1
  2015: 1
Entity Type: PERSON
  Abdel Fattah: 1
  Trump: 2
  Sisi: 12
  el-Sisi: 1
  Barack Obama: 1
  Eric Trager: 1
  Hosni Mubarak: 1
  Mubarak: 1
  Morsi: 1
  Vladimir V. Putin: 1
  Giulio Regeni: 1
  Alessandra Ballerini: 1
  Aya Hijazi: 1
  Hijazi: 1
  Wade McMullen: 1
  Amy Hawthorne: 1
  Hawthorne: 1
Entity Type: ORG
  the White House: 3
  Trump: 21
  the Washington Institute for Near East Policy: 1
  Sisi’s: 1
  the Muslim Brotherhood’s: 1
  Mohamed Morsi: 1
  the United Nations General Assembly: 1
  Fox Business Network: 1
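
In the output above, the same entity is split across several surface forms ("The United States", "the United States", "United States"). A minimal sketch of one way to merge such variants before ranking follows. The `normalize` helper is hypothetical, and the counts are a subset of the GPE counts printed above; stripping a leading "the" and ignoring case is an assumption that happens to suit this text, not a general rule.

```python
from collections import Counter

# Subset of the GPE counts printed above.
gpe_counts = {
    "WASHINGTON": 1,
    "Egypt": 15,
    "The United States": 1,
    "the United States": 3,
    "Cairo": 6,
    "Washington": 2,
    "America": 1,
    "United States": 1,
    "U.S.": 2,
}


def normalize(name):
    """Lowercase the name and drop a leading article (hypothetical helper)."""
    key = name.casefold()
    if key.startswith("the "):
        key = key[len("the "):]
    return key


merged = Counter()
for name, count in gpe_counts.items():
    merged[normalize(name)] += count

print(merged.most_common(3))
# [('egypt', 15), ('cairo', 6), ('united states', 5)]
```

Abbreviations and synonyms such as "U.S." or "America" are not merged by this rule; they would need an explicit alias table.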

In [63]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('wordnet')
nltk.download('stopwords')

nlp = en_core_web_sm.load()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [64]:
def preprocess_text(text):
  tokens = [lemmatizer.lemmatize(token) for token in text.split() if token not in stop_words]  # Lemmatization and stop word removal
  return " ".join(tokens)

preprocessed_text = preprocess_text(text)

doc = nlp(preprocessed_text)

entity_frequencies = {}

for ent in doc.ents:
    if ent.label_ in entity_frequencies:
        entity_frequencies[ent.label_].append(ent.text)
    else:
        entity_frequencies[ent.label_] = [ent.text]


for entity_type, entities in entity_frequencies.items():
    print(f"Entity Type: {entity_type}")
    entity_counts = Counter(entities)
    for entity, frequency in entity_counts.most_common(5):  # Get top 5
        print(f"  {entity}: {frequency}")

Entity Type: GPE
  Egypt: 15
  Cairo: 6
  United States: 4
  Washington: 2
  U.S.: 2
Entity Type: DATE
  Monday: 3
  last year: 3
  nearly four year ago: 1
  2009: 1
  year: 1
Entity Type: PERSON
  Sisi: 8
  Trump: 2
  Abdel Fattah: 1
  el-Sisi: 1
  Barack Obama: 1
Entity Type: ORG
  Trump: 19
  Sisi: 6
  White House: 4
  Islam: 3
  Muslim Brotherhood: 2
Entity Type: CARDINAL
  one: 2
  two: 2
  hundred: 1
  17: 1
  dozen: 1
Entity Type: NORP
  American: 7
  Sisi: 5
  Egyptians: 5
  Egyptian: 3
  jihadist: 2
Entity Type: TIME
  six minute: 1
Entity Type: ORDINAL
  first: 2
  fifth: 1
Entity Type: PERCENT
  97 percent: 1
Entity Type: LOC
  Middle East: 2
  Middle Eastern: 1
  Middle East Democracy: 1
Entity Type: FAC
  part street: 1
Entity Type: MONEY
  $1.3 billion: 1


- From these entities, the news appears to be about the relationship between Egypt (under President Sisi) and the United States (under President Trump). There may be mentions of economic aid or political strategies involving the Middle East. The presence of "Islam" and "Muslim Brotherhood" suggests a possible focus on religious or political groups within Egypt.

- Two most mentioned people: **Egyptian President Abdel Fattah el-Sisi** and **U.S. President Donald Trump**.

- Top 3 GPE: Egypt, Cairo, United States.
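
The model often labels "Trump" as ORG rather than PERSON, so the per-category counts alone understate how often each person is mentioned. The sketch below adds up mentions across labels. The `aliases` table is a hand-written assumption, and the counts are copied from the visible part of the unpreprocessed frequency output earlier in this exercise.

```python
from collections import Counter

# (label, surface form, count), copied from the frequency output.
label_counts = [
    ("PERSON", "Trump", 2),
    ("PERSON", "Sisi", 12),
    ("PERSON", "el-Sisi", 1),
    ("PERSON", "Barack Obama", 1),
    ("ORG", "Trump", 21),
    ("ORG", "Sisi’s", 1),
]

# Hand-written mapping from surface forms to one name per person (assumption).
aliases = {
    "Trump": "Trump",
    "Sisi": "Sisi",
    "el-Sisi": "Sisi",
    "Sisi’s": "Sisi",
    "Barack Obama": "Obama",
}

mentions = Counter()
for label, surface, count in label_counts:
    person = aliases.get(surface)
    if person is not None:
        # Count the mention regardless of the label the model assigned.
        mentions[person] += count

print(mentions.most_common(2))
# [('Trump', 23), ('Sisi', 14)]
```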

### Exercise 2

Considering the BIO labeling system, indicate how many entities of type PERSON appear with only one word.

In [65]:
for word in doc:
  if word.ent_type_ == "PERSON":
    print(word.text, word.ent_iob_)

Abdel B
Fattah I
Trump B
Sisi B
el B
- I
Sisi I
Sisi B
Barack B
Obama I
Eric B
Trager I
Sisi B
Hosni B
Mubarak I
Mubarak B
Morsi B
Sisi B
pro B
forma I
Sisi B
Vladimir B
V. I
Putin I
Russia I
Sisi B
Giulio B
Regeni I
Alessandra B
Ballerini I
Suez B
Canal I
Sisi I
Sisi B
Aya B
Hijazi I
Hijazi B
Wade B
McMullen I
Robert B
F. I
Kennedy I
Obama B
2015 I
warplane B
large I
- I
Trump B
Amy B
Hawthorne I
Sisi B
Hawthorne B


In [66]:
single_word_person_count = 0

for ent in doc.ents:
    if ent.label_ == "PERSON":
        if len(ent) == 1:
            single_word_person_count += 1

print(f"Number of single-word PERSON entities: {single_word_person_count}")

Number of single-word PERSON entities: 14
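
The same count can be derived from the BIO tags alone: a single-word entity is a "B" tag that is not followed by an "I" tag. The sketch below assumes the tags come from a list filtered to PERSON tokens, as printed above. The function name is hypothetical, and the sample holds only the first eight tags of that output.

```python
def count_single_token_entities(iob_tags):
    """Count 'B' tags that are not followed by an 'I' tag."""
    count = 0
    for i, tag in enumerate(iob_tags):
        if tag != "B":
            continue
        # Treat the end of the list like an "O" tag.
        next_tag = iob_tags[i + 1] if i + 1 < len(iob_tags) else "O"
        if next_tag != "I":
            count += 1
    return count


# First eight PERSON tokens above: Abdel Fattah | Trump | Sisi | el - Sisi | Sisi
sample_tags = ["B", "I", "B", "B", "B", "I", "I", "B"]

print(count_single_token_entities(sample_tags))
# 3
```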


### Exercise 3

Select all the sentences containing the PERSON entities that you extracted in **Exercise 2** and perform the following actions:

1. Identify the main verb.
2. Count the number of verbs, adjectives and adverbs.
3. Disambiguate the verbs, adjectives and adverbs with WordNet, showing the synsets and their glosses.

In [67]:
with open(file_path, 'r', encoding='cp1252') as file:
    text = file.read()

doc = nlp(text)

In [70]:
sentences_with_person_entities = []
for sentence in doc.sents:
  for ent in sentence.ents:
    if ent.label_ == "PERSON":
      sentences_with_person_entities.append(sentence)
      break

In [71]:
from spacy import displacy
displacy.render(sentences_with_person_entities, jupyter=True, style='ent')

- Identify the main verb.
- Count the number of verbs, adjectives and adverbs.
- Disambiguate the verbs, adjectives and adverbs with WordNet, showing the synsets and their glosses.

In [72]:
import spacy
spacy.cli.download("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [80]:
import re

filtered_sentences = []
for sentence in sentences_with_person_entities:
    cleaned_sentence = re.sub(r'[\x00-\x1f\x7f-\x9f]+', '', sentence.text)
    if cleaned_sentence.strip():
        filtered_sentences.append(cleaned_sentence)

In [87]:
for sentence in filtered_sentences:
  doc = nlp(sentence)
  print(f"Sentence: {sentence}")
  main_verb = None
  for token in doc:
    if token.pos_ == "VERB" and token.dep_ == "ROOT":
      main_verb = token.text
      break
  if main_verb:
    print(f"Main Verb: {main_verb}")
  else:
    print("No main verb found\n")


Sentence: WASHINGTON — Ever since he seized power in a military takeover nearly four years ago, President Abdel Fattah el-Sisi of Egypt has been barred from the White House.
Main Verb: barred
Sentence: But President Trump made clear on Monday that the period of ostracism was over as he hosted  Sisi and pledged unstinting support for the autocratic ruler.
Main Verb: made
Sentence: Trump said as he sat beside  Sisi in the Oval Office.
Main Verb: said
Sentence: “I just want to let everybody know in case there was any doubt that we are very much behind President el-Sisi.
Main Verb: want
Sentence: While his predecessors considered authoritarians like  Sisi to be distasteful and at times shied away from them,  Trump signaled that he sees international relations through a transactional lens.
Main Verb: signaled
Sentence: He arrived from Cairo with a list of financial, security and political requests, but effectively he got what he really wanted in the six minutes that news media photographers

In [95]:
counter = 0
for sentence in filtered_sentences:
  counter += 1

  doc = nlp(sentence)
  print(f"Sentence {counter}: {sentence}")
  verb_count = 0
  adjective_count = 0
  adverb_count = 0
  for token in doc:
    if token.pos_ == "VERB":
      verb_count += 1
    elif token.pos_ == "ADJ":
      adjective_count += 1
    elif token.pos_ == "ADV":
      adverb_count += 1
  print(f"Verbs: {verb_count}, Adjectives: {adjective_count}, Adverbs: {adverb_count}")

Sentence 1: WASHINGTON — Ever since he seized power in a military takeover nearly four years ago, President Abdel Fattah el-Sisi of Egypt has been barred from the White House.
Verbs: 2, Adjectives: 1, Adverbs: 3
Sentence 2: But President Trump made clear on Monday that the period of ostracism was over as he hosted  Sisi and pledged unstinting support for the autocratic ruler.
Verbs: 3, Adjectives: 3, Adverbs: 1
Sentence 3: Trump said as he sat beside  Sisi in the Oval Office.
Verbs: 2, Adjectives: 0, Adverbs: 0
Sentence 4: “I just want to let everybody know in case there was any doubt that we are very much behind President el-Sisi.
Verbs: 4, Adjectives: 0, Adverbs: 3
Sentence 5: While his predecessors considered authoritarians like  Sisi to be distasteful and at times shied away from them,  Trump signaled that he sees international relations through a transactional lens.
Verbs: 4, Adjectives: 3, Adverbs: 1
Sentence 6: He arrived from Cairo with a list of financial, security and politic

Disambiguate the verbs, adjectives and adverbs with WordNet, showing the synsets and their glosses.

In [94]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [107]:
from nltk.corpus import wordnet as wn
def disambiguate_word(word, pos=None):

    # Map POS tags to WordNet POS tags
    pos_map = {
        'VERB': 'v',
        'ADJ': 'a',
        'ADV': 'r'
    }

    if pos:
        synsets = wn.synsets(word, pos=pos_map.get(pos))
    else:
        synsets = wn.synsets(word)

    if not synsets:
        print(f"\tNo synsets found for the word: {word}")
        return

    print(f"\n\tSynsets for '{word}' (POS: {pos if pos else 'any'}):")
    for i, synset in enumerate(synsets, 1):
        print(f"\t{i}. {synset.name()}: {synset.definition()}")
        # print(f"\t   Examples: {synset.examples()}\n")
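
The function above lists every candidate synset but does not choose between them. One simple way to pick a sense is a simplified Lesk heuristic: take the sense whose gloss shares the most words with the surrounding context. The sketch below is a plain-Python illustration, not the method used in this notebook. The glosses are three of the 'seized' glosses printed in the output, while `simplified_lesk`, the small stop-word set and the context sentence are assumptions made for the example. NLTK also provides `nltk.wsd.lesk`, which works directly on WordNet synsets.

```python
import re

# Small illustrative stop-word set (assumption, not NLTK's list).
STOP_WORDS = {"a", "as", "by", "in", "of", "or", "the", "to"}


def content_words(text):
    """Lowercase the text and return its words minus the stop words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOP_WORDS


def simplified_lesk(context, senses):
    """Return the sense whose gloss overlaps most with the context.

    Ties are resolved in favour of the sense listed first.
    """
    context_words = content_words(context)
    best_sense, best_overlap = None, -1
    for name, gloss in senses.items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = name, overlap
    return best_sense


# Three glosses for 'seized', copied from the WordNet output.
senses = {
    "seize.v.01": "take hold of; grab",
    "seize.v.02": "take or capture by force",
    "impound.v.01": "take temporary possession of as a security, by legal authority",
}

# Hypothetical context sentence, chosen so that the overlap is visible.
context = "The army tried to capture the city by force"

print(simplified_lesk(context, senses))
# seize.v.02
```

When no gloss overlaps with the context, the first listed sense is returned. Real sentences often share few words with short glosses, so this heuristic is best treated as a baseline.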

In [108]:
counter = 0
for sentence in filtered_sentences:
  counter += 1

  doc = nlp(sentence)
  print(f"Sentence {counter}:")

  for token in doc:
    if token.pos_ == "VERB":
      disambiguate_word(token.text, "VERB")
    elif token.pos_ == "ADJ":
      disambiguate_word(token.text, "ADJ")
    elif token.pos_ == "ADV":
      disambiguate_word(token.text, "ADV")


Sentence 1:

	Synsets for 'Ever' (POS: ADV):
	1. ever.r.01: at any time
	2. always.r.01: at all times; all the time and on every occasion
	3. ever.r.03: (intensifier for adjectives) very

	Synsets for 'seized' (POS: VERB):
	1. seize.v.01: take hold of; grab
	2. seize.v.02: take or capture by force
	3. appropriate.v.02: take possession of by force, as after an invasion
	4. impound.v.01: take temporary possession of as a security, by legal authority
	5. assume.v.06: seize and take control without authority and possibly with force; take as one's right or possession
	6. seize.v.06: hook by a pull on the line
	7. seize.v.07: affect
	8. grab.v.06: capture the attention or imagination of

	Synsets for 'military' (POS: ADJ):
	1. military.a.01: of or relating to the study of the principles of warfare
	2. military.a.02: characteristic of or associated with soldiers or the military
	3. military.a.03: associated with or performed by members of the armed services as contrasted with civilians

	Syns