<a href="https://colab.research.google.com/github/griisnc/PLN_Python/blob/main/Entity_Recognition_using_NLTK_and_spaCy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This code was made by Griselda Navarrete

# Description:

This notebook performs basic Natural Language Processing (NLP) on a Spanish text file. It mounts Google Drive, picks a file at random from a Drive folder, and tokenizes its text with the NLTK library. It then strips punctuation and special characters from the tokens, keeps only alphabetic tokens, and removes Spanish stop words. Finally, it uses the spaCy library to identify named entities in the filtered text, printing each entity and its label.

In short, the notebook tokenizes and cleans a Spanish text with NLTK and then extracts its named entities with spaCy.
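
The preprocessing steps described above can be sketched end to end without any external library. This is a simplified illustration, not the notebook's own code: the regex tokenizer, the tiny `STOP_WORDS` set, and the `preprocess` helper are assumptions that stand in for NLTK's `word_tokenize` and its Spanish stop word list.

```python
import re

# Hypothetical, tiny stop word set; NLTK's Spanish list is much longer.
STOP_WORDS = {"de", "los", "y", "uno", "se", "con", "el", "la", "en"}

def preprocess(text):
    """Tokenize, keep alphabetic tokens, and drop stop words (case-insensitive)."""
    # \w+ matches runs of letters, digits, and underscores, including accented letters.
    tokens = re.findall(r"\w+", text)
    words = [t for t in tokens if t.isalpha()]
    return [w for w in words if w.lower() not in STOP_WORDS]

sample = "Mark Zuckerberg, uno de los fundadores de Facebook y presidente de Meta"
print(preprocess(sample))
```

A real tokenizer handles abbreviations, hyphenated names, and punctuation far more carefully than this regex, which is why the notebook relies on NLTK.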

In [1]:
# Import necessary libraries
from IPython import get_ipython # Imports the get_ipython function from the IPython module, which is used to interact with the current IPython session.
from IPython.display import display # Imports the display function from the IPython.display module, which is used to display objects in the output area of a notebook cell.

In [2]:
import os # Imports the os module, which provides functions for interacting with the operating system, such as file and directory manipulation.
import nltk # Imports the nltk (Natural Language Toolkit) library, which is used for natural language processing tasks.
import random # Imports the random module, which provides functions for generating random numbers and making random choices.


In [3]:
from google.colab import drive # Imports the drive module from the google.colab library, which is used to mount Google Drive in a Colab notebook.
drive.mount('/content/drive') # Mounts Google Drive to the '/content/drive' directory in the Colab environment. This allows access to files stored in your Google Drive.

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
nltk.download('punkt_tab') # Downloads the 'punkt_tab' resource from NLTK, which is used for tokenizing text.

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [5]:
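archivo = random.sample(os.listdir('/content/drive/MyDrive/ia2_3p/'), 1) # Picks one entry at random from the Drive folder; random.sample returns a one-element list.
archivo = archivo[0] # Extracts the file name from that list.
archivo # Displays the selected file name.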

'resenia.csv'
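
Because `random.sample(..., 1)` returns a list, the cell above has to index it with `[0]`. A possible alternative, sketched below, uses `random.choice` and filters by extension so that subfolders or unrelated files are never selected. The helper name `pick_random_file`, the `.csv` filter, and the demo file names are assumptions made for illustration; the demo runs on a temporary directory rather than the Drive folder.

```python
import os
import random
import tempfile

def pick_random_file(directory, extension=".csv"):
    """Return the name of one randomly chosen file with the given extension."""
    candidates = sorted(
        name for name in os.listdir(directory)
        if name.endswith(extension) and os.path.isfile(os.path.join(directory, name))
    )
    if not candidates:
        raise FileNotFoundError(f"No {extension} files in {directory}")
    return random.choice(candidates)

# Demonstration on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.csv", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf8") as fh:
            fh.write("texto de ejemplo")
    chosen = pick_random_file(tmp)
    print(chosen)
```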

In [6]:
with open('/content/drive/MyDrive/ia2_3p/'+archivo,"r", encoding="utf8") as entrada: # Opens the selected file in read mode ('r') with UTF-8 encoding and assigns it to the variable 'entrada'.
    texto = entrada.read() # Reads the content of the file and assigns it to the variable 'texto'.

In [7]:
tokens = nltk.word_tokenize(texto,"spanish") # Tokenizes the text using NLTK's word_tokenize function, specifying Spanish as the language.
tokens # Prints the tokens.

['Mark',
 'Zuckerberg',
 ',',
 'uno',
 'de',
 'los',
 'fundadores',
 'de',
 'Facebook',
 'y',
 'presidente',
 'de',
 'Meta',
 ',',
 'se',
 'reunió',
 'este',
 'miércoles',
 'con',
 'el',
 'presidente',
 'electo',
 'de',
 'Estados',
 'Unidos',
 ',',
 'Donald',
 'Trump',
 ',',
 'en',
 'la',
 'residencia',
 'de',
 'este',
 'último',
 'en',
 'Mar-a-Lago',
 '(',
 'Florida',
 ')',
 ',',
 'según',
 'informaron',
 'los',
 'diarios',
 'The',
 'New',
 'York',
 'Times',
 'y',
 'The',
 'New',
 'York',
 'Post',
 '.',
 'De',
 'acuerdo',
 'con',
 'The',
 'New',
 'York',
 'Times',
 ',',
 'que',
 'cita',
 'a',
 'tres',
 'personas',
 'con',
 'conocimiento',
 'del',
 'encuentro',
 ',',
 'Zuckerberg',
 'buscó',
 'la',
 'reunión',
 'con',
 'Trump',
 'como',
 'un',
 'intento',
 'de',
 'mejorar',
 'la',
 'relación',
 'entre',
 'ambos',
 'tras',
 'una',
 'década',
 'marcada',
 'por',
 'tensiones',
 '.',
 'Trump',
 'ha',
 'acusado',
 'en',
 'repetidas',
 'ocasiones',
 'a',
 'Meta',
 'de',
 'censurar',
 'injust

In [8]:
from nltk.corpus import stopwords # Imports the stopwords module from NLTK's corpus package, which contains lists of common stop words for various languages.

In [10]:
nltk.download('stopwords') # Downloads the 'stopwords' resource from NLTK, which contains lists of stop words for various languages.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
stop_words = set(stopwords.words('spanish')) # Creates a set of Spanish stop words using NLTK's stopwords corpus.


In [12]:
for i,token in enumerate(tokens): # Iterates through the tokens list, using enumerate to get both the index (i) and the token value.
    if token.startswith("_") or token.startswith("—"): # Checks if the token starts with an underscore or an em dash.
        tokens[i] = tokens[i][1:] # If it does, removes the first character of the token.
    if token.endswith("_") or token.endswith("—"): # Checks if the token ends with an underscore or an em dash.
        tokens[i] = tokens[i][:-1] # If it does, removes the last character of the token.
texto = " ".join(tokens) # Joins the tokens back into a string, separated by spaces.
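
The same edge cleanup can be written more compactly with `str.strip`, which accepts a set of characters to remove from both ends. This is a sketch, not a drop-in replacement: the helper name `strip_edge_marks` and the sample tokens are assumptions, and unlike the loop above it removes repeated marks (for example both underscores in `__x`) rather than a single character per side.

```python
def strip_edge_marks(tokens, marks="_—"):
    """Remove leading and trailing underscore or em dash characters from each token."""
    return [token.strip(marks) for token in tokens]

example = ["_hola", "mundo—", "—sí—", "Mar-a-Lago", "__x"]
print(strip_edge_marks(example))
```

Characters inside a token, such as the hyphens in `Mar-a-Lago`, are left untouched because `strip` only works on the ends of the string.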


In [13]:
def tokenizar(texto): # Defines a function called 'tokenizar' that takes a text string as input.
    puntuacion = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~¿¡' # Defines a string containing punctuation characters.
    tokens = nltk.word_tokenize(texto,"spanish") # Tokenizes the input text using NLTK's word_tokenize function, specifying Spanish as the language.
    for i,token in enumerate(tokens): # Iterates through the tokens list, using enumerate to get both the index (i) and the token value.
        tokens[i] = token.strip(puntuacion) # Removes any punctuation characters from the beginning and end of each token using the strip method.
    texto = " ".join(tokens) # Joins the tokens back into a string, separated by spaces.
    tokens = nltk.word_tokenize(texto,"spanish") # Tokenizes the text again after removing punctuation.
    return tokens # Returns the final list of tokens.
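
To see why `tokenizar` rejoins and re-tokenizes the text, it helps to look at what `strip` does to individual tokens: a token made only of punctuation becomes an empty string, and the second tokenization pass is what discards it. The sketch below shows the same effect with an explicit filter. The helper name `strip_punctuation` and the sample tokens are assumptions for illustration.

```python
PUNCTUATION = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~¿¡'

def strip_punctuation(tokens):
    """Strip punctuation from both ends of each token and drop tokens left empty."""
    stripped = (token.strip(PUNCTUATION) for token in tokens)
    return [token for token in stripped if token]

raw = ["Meta", ",", "(", "Florida", ")", "Mar-a-Lago", "¿qué?", "."]
print(strip_punctuation(raw))
```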

In [14]:
tokens = tokenizar(texto)
tokens # Prints the tokens.

['Mark',
 'Zuckerberg',
 'uno',
 'de',
 'los',
 'fundadores',
 'de',
 'Facebook',
 'y',
 'presidente',
 'de',
 'Meta',
 'se',
 'reunió',
 'este',
 'miércoles',
 'con',
 'el',
 'presidente',
 'electo',
 'de',
 'Estados',
 'Unidos',
 'Donald',
 'Trump',
 'en',
 'la',
 'residencia',
 'de',
 'este',
 'último',
 'en',
 'Mar-a-Lago',
 'Florida',
 'según',
 'informaron',
 'los',
 'diarios',
 'The',
 'New',
 'York',
 'Times',
 'y',
 'The',
 'New',
 'York',
 'Post',
 'De',
 'acuerdo',
 'con',
 'The',
 'New',
 'York',
 'Times',
 'que',
 'cita',
 'a',
 'tres',
 'personas',
 'con',
 'conocimiento',
 'del',
 'encuentro',
 'Zuckerberg',
 'buscó',
 'la',
 'reunión',
 'con',
 'Trump',
 'como',
 'un',
 'intento',
 'de',
 'mejorar',
 'la',
 'relación',
 'entre',
 'ambos',
 'tras',
 'una',
 'década',
 'marcada',
 'por',
 'tensiones',
 'Trump',
 'ha',
 'acusado',
 'en',
 'repetidas',
 'ocasiones',
 'a',
 'Meta',
 'de',
 'censurar',
 'injustamente',
 'sus',
 'opiniones',
 'y',
 'las',
 'de',
 'otras',
 'vo

In [15]:
word_tokens = [word for word in tokens if word.isalpha()] # Creates a new list called 'word_tokens' containing only the tokens that are alphabetic. This is done using a list comprehension.
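
One side effect of `isalpha()` is that hyphenated tokens are discarded: `Mar-a-Lago` appears in the token list above but not in `word_tokens`. If such names should be kept, a pattern that allows internal hyphens is one option. The following is a sketch under that assumption; `WORD_PATTERN` and `is_word` are hypothetical names, and the character class is only an approximation of "letters".

```python
import re

# Word characters other than digits and underscore (roughly: letters,
# including accented ones), optionally joined by single internal hyphens.
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def is_word(token):
    """Return True for alphabetic tokens, allowing internal hyphens."""
    return WORD_PATTERN.fullmatch(token) is not None

sample_tokens = ["Mar-a-Lago", "Florida", "2024", "miércoles", "-", "a-"]
only_alpha = [t for t in sample_tokens if t.isalpha()]
with_hyphens = [t for t in sample_tokens if is_word(t)]
print(only_alpha)
print(with_hyphens)
```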


In [16]:
filteres_sentence = [w for w in word_tokens if w.lower() not in stop_words] # Builds a list of the word tokens that are not Spanish stop words, lowercasing each token before the check. The following cells use filtered_sentence instead of this list.

In [17]:
filtered_sentence = [] # Initializes an empty list called 'filtered_sentence'.

In [18]:
for w in word_tokens: # Iterates through the word_tokens list.
    if w not in stop_words: # Checks whether the token is absent from the stop_words set. The check is case-sensitive, so a capitalized stop word such as 'De' is kept.
        filtered_sentence.append(w) # If it is not a stop word, appends it to the filtered_sentence list.
print(word_tokens) # Prints the word_tokens list (before stop word removal).

['Mark', 'Zuckerberg', 'uno', 'de', 'los', 'fundadores', 'de', 'Facebook', 'y', 'presidente', 'de', 'Meta', 'se', 'reunió', 'este', 'miércoles', 'con', 'el', 'presidente', 'electo', 'de', 'Estados', 'Unidos', 'Donald', 'Trump', 'en', 'la', 'residencia', 'de', 'este', 'último', 'en', 'Florida', 'según', 'informaron', 'los', 'diarios', 'The', 'New', 'York', 'Times', 'y', 'The', 'New', 'York', 'Post', 'De', 'acuerdo', 'con', 'The', 'New', 'York', 'Times', 'que', 'cita', 'a', 'tres', 'personas', 'con', 'conocimiento', 'del', 'encuentro', 'Zuckerberg', 'buscó', 'la', 'reunión', 'con', 'Trump', 'como', 'un', 'intento', 'de', 'mejorar', 'la', 'relación', 'entre', 'ambos', 'tras', 'una', 'década', 'marcada', 'por', 'tensiones', 'Trump', 'ha', 'acusado', 'en', 'repetidas', 'ocasiones', 'a', 'Meta', 'de', 'censurar', 'injustamente', 'sus', 'opiniones', 'y', 'las', 'de', 'otras', 'voces', 'conservadoras', 'en', 'sus', 'plataformas']


In [19]:
filtered_sentence # Prints the filtered_sentence list.

['Mark',
 'Zuckerberg',
 'fundadores',
 'Facebook',
 'presidente',
 'Meta',
 'reunió',
 'miércoles',
 'presidente',
 'electo',
 'Estados',
 'Unidos',
 'Donald',
 'Trump',
 'residencia',
 'último',
 'Florida',
 'según',
 'informaron',
 'diarios',
 'The',
 'New',
 'York',
 'Times',
 'The',
 'New',
 'York',
 'Post',
 'De',
 'acuerdo',
 'The',
 'New',
 'York',
 'Times',
 'cita',
 'tres',
 'personas',
 'conocimiento',
 'encuentro',
 'Zuckerberg',
 'buscó',
 'reunión',
 'Trump',
 'intento',
 'mejorar',
 'relación',
 'ambos',
 'tras',
 'década',
 'marcada',
 'tensiones',
 'Trump',
 'acusado',
 'repetidas',
 'ocasiones',
 'Meta',
 'censurar',
 'injustamente',
 'opiniones',
 'voces',
 'conservadoras',
 'plataformas']
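
The list above still contains `'De'` because the loop compares tokens to the stop word set without lowercasing them. The sketch below contrasts the two checks on a few tokens taken from the text. The small `stop_words` set here is a hypothetical stand-in for NLTK's Spanish list, assumed to hold lowercase entries only.

```python
# Hypothetical, tiny stop word set (lowercase entries only).
stop_words = {"de", "con", "la"}

word_tokens = ["De", "acuerdo", "con", "The", "New", "York", "Times"]

# Case-sensitive check: 'De' does not equal 'de', so it is kept.
case_sensitive = [w for w in word_tokens if w not in stop_words]

# Case-insensitive check: each token is lowercased before the lookup.
case_insensitive = [w for w in word_tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing only for the comparison keeps the original capitalization in the output, which may matter for the entity recognition step that follows.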

In [20]:
!pip install spacy # Installs the spaCy library using pip. This library is used for natural language processing tasks.




In [21]:
!python -m spacy download es_core_news_sm # Downloads the Spanish language model for spaCy called 'es_core_news_sm'. This model is required for processing Spanish text.

Collecting es-core-news-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_sm-3.7.0/es_core_news_sm-3.7.0-py3-none-any.whl (12.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 36.3 MB/s eta 0:00:00
Installing collected packages: es-core-news-sm
Successfully installed es-core-news-sm-3.7.0
✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [22]:
import spacy # Imports the spaCy library.

# Load the Spanish language model
nlp = spacy.load("es_core_news_sm")

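# Process the filtered tokens, rejoined into a single string, with spaCy.
# Stop words and punctuation are already removed at this point, which can merge
# neighboring entities (see 'Facebook presidente Meta' in the output).
doc = nlp(" ".join(filtered_sentence))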

# Print the identified entities and their labels
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

Entity: Mark Zuckerberg, Label: PER
Entity: Facebook presidente Meta, Label: MISC
Entity: Estados Unidos, Label: LOC
Entity: Donald Trump, Label: PER
Entity: Florida, Label: LOC
Entity: The New York Times The New York Post, Label: ORG
Entity: The New York Times, Label: ORG
Entity: Zuckerberg, Label: PER
Entity: Trump, Label: LOC
Entity: Trump, Label: LOC
Entity: Meta, Label: PER
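
The flat list printed above is easier to inspect when grouped by label, which also makes questionable assignments stand out, such as `Trump` tagged as `LOC` or `Meta` tagged as `PER`. The sketch below groups the pairs copied from that output. The helper name `group_by_label` is an assumption; with a spaCy `Doc`, the pairs could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output printed above.
entities = [
    ("Mark Zuckerberg", "PER"),
    ("Facebook presidente Meta", "MISC"),
    ("Estados Unidos", "LOC"),
    ("Donald Trump", "PER"),
    ("Florida", "LOC"),
    ("The New York Times The New York Post", "ORG"),
    ("The New York Times", "ORG"),
    ("Zuckerberg", "PER"),
    ("Trump", "LOC"),
    ("Trump", "LOC"),
    ("Meta", "PER"),
]

def group_by_label(pairs):
    """Map each label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

grouped_entities = group_by_label(entities)
for label, texts in grouped_entities.items():
    print(label, texts)
```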
