# NLP Basics Assessment
## Tweet Sentiment Extraction

## ICESI
### Master's in Applied Artificial Intelligence


#### Angelica Maria Mayor
#### Freddy Mauricio Gutierrez
#### Wilman Quiñonez
#### Carlos Alberto Martinez Ramirez



[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cam2149/icesi-nlp/blob/main/Sesion1/8-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost the brand of a company or person by going viral (positive), or devastate profits because of its negative tone. Capturing sentiment in language matters at a time when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition, you will need to identify the part of the tweet (word or phrase) that reflects the sentiment.

"My ridiculous dog is amazing." [sentiment: positive]

Build your skills in this important area with this broad dataset of tweets. Refine your technique to reach the top spot in this competition. Which words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?

The dataset is titled "Sentiment Analysis: Emotion in Text tweets with existing sentiment labels", used here under the Creative Commons Attribution 4.0 International license. The objective in this competition is to build a model that can do the same: look at the labeled sentiment of a given tweet and determine which word or phrase best supports it.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## References
* [Tweet Sentiment Extraction](https://www.kaggle.com/competitions/tweet-sentiment-extraction/overview)


In [None]:
!pip install kaggle
!pip install vaderSentiment
!pip install tqdm
!pip uninstall -y nltk numpy scikit-learn
!pip install nltk
!pip install --upgrade nltk
!pip install GingerIt


Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2
Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Found existing installation: numpy 2.0.2
Uninstalling numpy-2.0.2:
  Successfully uninstalled numpy-2.0.2
Found existing installation: scikit-learn 1.6.1
Uninstalling scikit-learn-1.6.1:
  Successfully uninstalled scikit-learn-1.6.1
Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nlt

In [None]:
!pip install numpy==1.24.4 scikit-learn==1.2.2

Collecting numpy==1.24.4
  Downloading numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting scikit-learn==1.2.2
  Downloading scikit_learn-1.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
INFO: pip is looking at multiple versions of scipy to determine which version is compatible with other requirements. This could take a while.
Collecting scipy>=1.3.2 (from scikit-learn==1.2.2)
  Downloading scipy-1.16.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (61 kB)
  Downloading scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Downloading numpy-1.24.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.

In [None]:
import pkg_resources
import warnings
import spacy
import pandas as pd
import os
import nltk

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm  # progress bar for pandas apply
warnings.filterwarnings('ignore')

installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages

# google.colab only exists inside Colab, so import it after the environment check
if IN_COLAB:
    from google.colab import files

In [None]:
!python -m spacy download en_core_web_trf
nltk.download('vader_lexicon')

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.1-py2.py3-none-any.whl (237 kB)

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

In [None]:
# Download and load the Kaggle dataset conditionally
#%%bash

#if [ ! -d "tweet_data" ]; then
#  echo "Downloading and extracting dataset..."
#  mkdir -p ~/.kaggle
#  # Assuming kaggle.json is already uploaded to Colab's files
#  test -f "kaggle.json" && mv kaggle.json ~/.kaggle/
#  chmod 600 ~/.kaggle/kaggle.json
#  kaggle competitions download -c tweet-sentiment-extraction
#  unzip -o tweet-sentiment-extraction.zip -d tweet_data
#else
#  echo "Dataset already exists in tweet_data directory."
#fi

In [None]:
!test '{IN_COLAB}' = 'True' && wget -O requirements.txt https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

--2025-08-13 02:08:17--  https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt
Resolving github.com (github.com)... 140.82.114.3
Connecting to github.com (github.com)|140.82.114.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt [following]
--2025-08-13 02:08:18--  https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/plain]
Saving to: ‘requirements.txt’


2025-08-13 02:08:18 (30.3 MB/s) - ‘requirements.txt’ saved [349/349]

Collecting pandas==2.1.1 (from -r requirements.txt (line 1))
  Downloading pandas-2.1.1-cp311-cp311-manylinux_2_17_

In [None]:
# Initialize SpaCy and VADER
# This cell initializes the two main libraries used in this notebook for natural
# language processing and sentiment analysis: SpaCy and VADER.
#
# nlp = spacy.load("en_core_web_sm") loads a pre-trained English pipeline from SpaCy.
# "en_core_web_sm" is a small English model that provides tokenization,
# part-of-speech (POS) tagging, dependency parsing, and more. The loaded pipeline
# is assigned to `nlp` and used to process text throughout the notebook.
#
# analyzer = SentimentIntensityAnalyzer() creates an instance of VADER's
# SentimentIntensityAnalyzer (Valence Aware Dictionary and sEntiment Reasoner).
# VADER is a lexicon- and rule-based sentiment analysis tool specifically tuned
# to sentiment expressed in social media. The `analyzer` object is used later
# to obtain sentiment scores for the text.
#
# In short, this cell sets up the tools needed to analyze the tweets: SpaCy for
# linguistic processing and VADER for sentiment scoring.
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_trf")  # optional alternative: the transformer pipeline downloaded above
analyzer = SentimentIntensityAnalyzer()

In [None]:
# Load the dataset
TRAIN_URL = "https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/Sesion1/train.csv"
try:
    train_df = pd.read_csv(TRAIN_URL)
except OSError as err:
    # Reading from a URL fails with urllib's HTTPError/URLError (both OSError subclasses),
    # not FileNotFoundError. Stop here so later cells do not hit an undefined train_df.
    raise RuntimeError(f"Could not load train.csv from {TRAIN_URL}") from err


# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())


Dataset Preview:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
4  358bd9e861   Sons of ****, why couldn`t they put them on t...   

                         selected_text sentiment  
0  I`d have responded, if I were going   neutral  
1                             Sooo SAD  negative  
2                          bullying me  negative  
3                       leave me alone  negative  
4                        Sons of ****,  negative  

Columns: ['textID', 'text', 'selected_text', 'sentiment']
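
The preview suggests that each `selected_text` is a contiguous span of its `text`. A minimal sketch, using three rows copied from the preview above and a hypothetical helper named `span_offsets`, checks this with plain string search:

```python
# Rows copied from the dataset preview: (text, selected_text, sentiment)
rows = [
    ("Sooo SAD I will miss you here in San Diego!!!", "Sooo SAD", "negative"),
    ("my boss is bullying me...", "bullying me", "negative"),
    ("what interview! leave me alone", "leave me alone", "negative"),
]

def span_offsets(text, selected):
    """Return (start, end) character offsets of `selected` in `text`, or None."""
    start = text.find(selected)
    if start == -1:
        return None
    return start, start + len(selected)

for text, selected, sentiment in rows:
    print(sentiment, span_offsets(text, selected), repr(selected))
```

Framing the target as a pair of character offsets is one common way to turn this task into a span-prediction problem; whether every row of the full dataset satisfies the substring property is not verified here.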


In [None]:
train_df.dropna(inplace=True)

In [None]:
# Count initial number of rows
initial_rows = len(train_df)
# Filter out rows containing either " ****" or "http"
train_df = train_df[~train_df['text'].astype(str).str.contains(r" \*\*\*\*|http", regex=True)]
# Count remaining rows
rows_after_removal = len(train_df)
# Display results
print(f"Removed {initial_rows - rows_after_removal} rows containing ' ****' or 'http'.")
print(f"Remaining rows: {rows_after_removal}")


Removed 2066 rows containing ' ****' or 'http'.
Remaining rows: 25414
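
The filter pattern has two alternatives: a literal `****` preceded by a space, and the substring `http`. The sketch below uses Python's `re` module on made-up sample strings (not taken from the dataset) to show what does and does not match. Note that a string beginning with `****` is kept, because the pattern requires the leading space.

```python
import re

# Same pattern as in the filtering cell above
PATTERN = re.compile(r" \*\*\*\*|http")

# Made-up sample strings for illustration
samples = [
    "what the **** happened",     # censored word after a space -> removed
    "check http://example.com",   # contains a link -> removed
    "****, that hurt",            # censored word at the start, no leading space -> kept
    "my boss is bullying me...",  # neither alternative matches -> kept
]

kept = [s for s in samples if not PATTERN.search(s)]
print(kept)
```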


In [None]:
# Count whitespace-separated tokens in the `text` column
# (astype(str) guards against non-string values; rows with NaN were already dropped)
token_count = train_df["text"].astype(str).apply(lambda x: len(x.split())).sum()

print(f"\nTotal number of tokens in the dataset: {token_count}")


Total number of tokens in the dataset: 326104
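
The total above counts whitespace-separated chunks, so punctuation stays attached to words. As a rough illustration of how much the choice of tokenizer matters, the sketch below compares a whitespace split with a simple punctuation-aware regex on one tweet from the preview. The regex is an assumption made for illustration and is not spaCy's tokenizer.

```python
import re

text = "Sooo SAD I will miss you here in San Diego!!!"

# Whitespace split, as in the cell above: punctuation stays attached to words
whitespace_tokens = text.split()

# Rough punctuation-aware split (illustrative regex, not spaCy's tokenizer)
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(len(whitespace_tokens), whitespace_tokens[-1])
print(len(regex_tokens), regex_tokens[-4:])
```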


In [None]:
import random
# Get the number of records in the DataFrame
num_records = len(train_df)
# Generate a random integer between 0 and num_records-1 (inclusive)
random_index = random.randint(0, num_records - 1)
print(f"A random index based on the number of records is: {random_index}")

A random index based on the number of records is: 9758
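
The index above changes on every run. If a repeatable choice is preferred, one option is a dedicated seeded generator. In this sketch the seed value 42 and the name `rng` are arbitrary assumptions.

```python
import random

num_records = 25414  # number of rows reported after filtering

# A seeded generator yields the same index on every run
rng = random.Random(42)
random_index = rng.randrange(num_records)  # integer in [0, num_records - 1]

# Re-creating the generator with the same seed reproduces the choice
assert random_index == random.Random(42).randrange(num_records)
print(random_index)
```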


In [None]:
# Select the row at the random position drawn above (iloc is positional, not label-based)
selected_row = train_df.iloc[random_index]
text = selected_row["text"]
# Process the text with SpaCy
doc = nlp(text)
# Print information for each token
print(f"Analyzing row {random_index} in the dataset:\n")
print(f"Row Info:\n{selected_row}\n")
print(f"Analyzing text: '{text}'\n")
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in doc:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Analyzing row 9758 in the dataset:

Row Info:
textID                                         2efcec9326
text                   Five o`clock can`t come any faster
selected_text          Five o`clock can`t come any faster
sentiment                                         neutral
extracted_text                                           
processed_text                    o`clock can`t come fast
neg                                                   0.0
neu                                                   1.0
pos                                                   0.0
compound                                              0.0
predicted_sentiment                               neutral
Name: 10487, dtype: object

Analyzing text: 'Five o`clock can`t come any faster'

Text                POS                 dep                 lemma               
Five                NUM                 nummod              five                
o`clock             NOUN                compound            o`clock     

In [None]:
# Count spaCy sentences across every field of the selected row.
# Each value is cast to str and parsed, so the total is not limited to the tweet text;
# len(list(nlp(text).sents)) would count the sentences of the tweet alone.
sentence_count = train_df.iloc[random_index].astype(str).apply(lambda x: len(list(nlp(x).sents))).sum()

print(f"\nTotal number of sentences in the selected row: {sentence_count}")


Total number of sentences in the selected row: 10


In [None]:
from spacy import displacy

doc = nlp(text)
# dep for syntactic dependency
# displacy visualizes the syntactic dependencies of a sentence.
# `doc` is the spaCy Doc object that holds the processed sentence.
# style='dep' selects the dependency visualization.
# jupyter=True renders the visualization inline in a Jupyter or Colab environment.
# options={'distance': 110} sets the spacing between tokens to improve readability.
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [None]:
from tqdm import tqdm  # progress bar for pandas apply

# Register tqdm with pandas so that progress_apply is available
tqdm.pandas(desc="Processing Text Justification")

# Text Justification Extraction with SpaCy Matcher
matcher = Matcher(nlp.vocab)

# Define positive and negative patterns.
# Each pattern matches a single token by its lowercase form, so multi-word
# entries such as "leave me alone" can never match here.
positive_pattern = [{"LOWER": {"IN": ["good", "great", "excellent", "love", "amazing"]}}]
negative_pattern = [{"LOWER": {"IN": ["bad", "poor", "terrible", "hate", "sad", "bullying",
                                     "leave me alone", "sons of", "son of", "boring", "aggressive",
                                     "anxiety", "angst", "gross", "chaos", "collapse", "confusion",
                                     "cringe", "critical", "damage", "disappointment", "deficient",
                                     "unpleasant", "disastrous", "desperate", "disillusion", "pain",
                                     "sick", "angry", "error", "stupid", "failure", "frustration",
                                     "horrible", "unacceptable", "incompetent", "ineffective", "unfair",
                                     "slow", "awful", "annoying", "negative", "danger", "loss",
                                     "problem", "rejection", "ridiculous", "risky", "toxic",
                                     "shame", "wtf", "fail", "ew", "meh", "so gross", "nooo", "ugh",
                                     "lame", "trash", "cancelled", "cancel him", "cancel her", "worst",
                                     "fatal", "disgusting", "why tho", "nah", "not cool", "dead",
                                     "over it", "fake", "phony", "drama", "messy"]}}]

matcher.add("PositiveWords", [positive_pattern])
matcher.add("NegativeWords", [negative_pattern])

# Extract the justification (the first matched term) from a text
def extract_justification(text):
    if isinstance(text, str):  # make sure the input is a string
        doc = nlp(text)
        matches = matcher(doc)
        if matches:
            match_id, start, end = matches[0]
            return doc[start:end].text
    return ""

# Apply the extraction to the DataFrame, with tqdm showing progress.
# Note that it runs on the labeled `selected_text` span, not on the raw `text`.
train_df["extracted_text"] = train_df["selected_text"].progress_apply(extract_justification)

print("\nText justification extraction complete!")


Processing Text Justification: 100%|██████████| 25414/25414 [02:12<00:00, 191.29it/s]


Text justification extraction complete!
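
The `Matcher` patterns in the cell above compare one token at a time, so multi-word entries such as "leave me alone" never fire. One possible alternative, sketched here on the assumption that spaCy v3 is installed, is `PhraseMatcher` with `attr="LOWER"`, which matches whole token sequences case-insensitively. The phrase list is a small subset chosen for illustration, `spacy.blank("en")` is used so the sketch needs no model download, and `extract_phrase` is a hypothetical helper.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Tokenizer-only English pipeline; no trained model is required
nlp = spacy.blank("en")

# Illustrative subset of the negative terms, including multi-word phrases
negative_phrases = ["sad", "bullying", "leave me alone", "not cool"]

# attr="LOWER" compares lowercase token text, so "SAD" matches "sad"
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_matcher.add("NegativePhrases", [nlp.make_doc(p) for p in negative_phrases])

def extract_phrase(text):
    """Return the earliest matched phrase (longest on overlap), or ''."""
    doc = nlp.make_doc(text)
    spans = filter_spans(phrase_matcher(doc, as_spans=True))
    return spans[0].text if spans else ""

print(extract_phrase("what interview! leave me alone"))
print(extract_phrase("Sooo SAD I will miss you here in San Diego!!!"))
```

`filter_spans` resolves overlapping matches by preferring the longest span, which matters once the list contains entries like "gross" and "so gross".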





In [None]:
# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())

Dataset Preview:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
6  6e0c6d75b1  2am feedings for the baby are fun when he is a...   

                         selected_text sentiment extracted_text  
0  I`d have responded, if I were going   neutral                 
1                             Sooo SAD  negative            SAD  
2                          bullying me  negative       bullying  
3                       leave me alone  negative                 
6                                  fun  positive                 

Columns: ['textID', 'text', 'selected_text', 'sentiment', 'extracted_text']
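
To quantify how close `extracted_text` is to the labeled `selected_text`, one option is the word-level Jaccard score, which is the metric the Kaggle competition uses. The sketch below applies it to three rows from the preview. The function name `jaccard` and the convention for two empty strings are assumptions of this sketch.

```python
def jaccard(str1, str2):
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 1.0  # convention chosen here: two empty strings count as identical
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))

# (selected_text, extracted_text) pairs from the preview above
print(jaccard("Sooo SAD", "SAD"))          # 0.5
print(jaccard("bullying me", "bullying"))  # 0.5
print(jaccard("leave me alone", ""))       # 0.0
```

Averaging this score over all rows would give a single number for the quality of the extraction step, separate from the accuracy of the sentiment label.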


In [None]:
# Preprocessing function using SpaCy
def preprocess_text(text):
    if pd.isna(text):
        return ""
    doc = nlp(str(text))
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

train_df["processed_text"] = train_df["selected_text"].apply(preprocess_text)

In [None]:
from tqdm import tqdm  # progress bar for pandas apply
# Note: this re-import replaces the vaderSentiment class imported earlier with NLTK's VADER
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER analyzer
analyzer = SentimentIntensityAnalyzer()

# Return the VADER sentiment scores for a text
def analyze_sentiment_scores(text):
    if isinstance(text, str):  # make sure the input is a string
        return analyzer.polarity_scores(text)
    else:
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

# Add a progress bar to the apply operation
tqdm.pandas(desc="Applying VADER Sentiment Analysis")

# Apply the sentiment analysis function with a progress bar
sentiment_scores_df = train_df['text'].progress_apply(analyze_sentiment_scores).apply(pd.Series)

# Concatenate the new columns to the original DataFrame
train_df = pd.concat([train_df, sentiment_scores_df], axis=1)

# Preview the first records with the new score columns
print("Dataset Preview with VADER Scores:")
print(train_df.head())

# Print the columns, including the newly added ones
print("\nColumns:", train_df.columns.tolist())


Applying VADER Sentiment Analysis: 100%|██████████| 25414/25414 [00:02<00:00, 8720.41it/s]


Dataset Preview with VADER Scores:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
6  6e0c6d75b1  2am feedings for the baby are fun when he is a...   

                         selected_text sentiment extracted_text  \
0  I`d have responded, if I were going   neutral                  
1                             Sooo SAD  negative            SAD   
2                          bullying me  negative       bullying   
3                       leave me alone  negative                  
6                                  fun  positive                  

   processed_text    neg    neu    pos  compound  
0  I`d respond go  0.000  1.000  0.000    0.0000  
1        Sooo SAD  0.474  0.526  0.000   -0.7437  
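
For context on the `compound` column: in the VADER reference implementation, the summed valence of a text is squashed into the interval [-1, 1] with x / sqrt(x² + α), where α defaults to 15. A minimal sketch of that normalization step follows. The input values are arbitrary examples, not scores from the dataset.

```python
import math

def normalize(score, alpha=15):
    """Map an unbounded valence sum to the interval [-1, 1]."""
    norm_score = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm_score))

for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"{raw:+.1f} -> {normalize(raw):+.4f}")
```

Because the curve is steep near zero, even a single mildly polar word can move `compound` past a small threshold such as ±0.05.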


In [None]:
# Predict sentiment based on VADER compound score
# Define thresholds for sentiment prediction
def predict_vader_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

train_df['predicted_sentiment'] = train_df['compound'].apply(predict_vader_sentiment)

# Evaluation against provided labels
print("\nSentiment Prediction Evaluation:")
accuracy = accuracy_score(train_df["sentiment"], train_df["predicted_sentiment"])
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(train_df["sentiment"], train_df["predicted_sentiment"]))


Sentiment Prediction Evaluation:
Accuracy: 0.6324
              precision    recall  f1-score   support

    negative       0.69      0.60      0.64      7096
     neutral       0.71      0.47      0.57     10326
    positive       0.56      0.87      0.68      7992

    accuracy                           0.63     25414
   macro avg       0.65      0.65      0.63     25414
weighted avg       0.66      0.63      0.62     25414
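
As a sanity check on the report above, F1 is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. The sketch below recomputes both from the rounded precision and recall values in the table, so agreement is only expected to two decimals.

```python
# Rounded (precision, recall) per class, copied from the report above
report = {
    "negative": (0.69, 0.60),
    "neutral": (0.71, 0.47),
    "positive": (0.56, 0.87),
}

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1_by_class = {label: f1_score(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for label, f1 in f1_by_class.items():
    print(f"{label}: {f1:.2f}")
print(f"macro avg: {macro_f1:.2f}")
```

The positive class combines high recall (0.87) with low precision (0.56), which indicates that the thresholded VADER score labels many neutral or negative tweets as positive.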



In [None]:
# Display results
print("\nSample Results with Extracted Justification:")
print(train_df[["text", "sentiment", "predicted_sentiment", "extracted_text"]].head())


Sample Results with Extracted Justification:
                                                text sentiment  \
0                I`d have responded, if I were going   neutral   
1      Sooo SAD I will miss you here in San Diego!!!  negative   
2                          my boss is bullying me...  negative   
3                     what interview! leave me alone  negative   
6  2am feedings for the baby are fun when he is a...  positive   

  predicted_sentiment extracted_text  
0             neutral                 
1            negative            SAD  
2            negative       bullying  
3            negative                 
6            positive                 


In [None]:
print(train_df.columns)
print([type(col) for col in train_df.columns])


Index(['textID', 'text', 'selected_text', 'sentiment', 'extracted_text',
       'processed_text', 'neg', 'neu', 'pos', 'compound',
       'predicted_sentiment'],
      dtype='object')
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]


In [None]:
# Calculate the percentage of rows where 'sentiment' and 'predicted_sentiment' are equal
percentage_equal = (train_df['sentiment'] == train_df['predicted_sentiment']).mean() * 100

print(f"Percentage of rows where 'sentiment' and 'predicted_sentiment' are equal: {percentage_equal:.2f}%")

Percentage of rows where 'sentiment' and 'predicted_sentiment' are equal: 63.24%
