# NLP Basics Assessment
## Tweet Sentiment Extraction

## ICESI
### Master's in Applied Artificial Intelligence


#### Angelica Maria Mayor
#### Freddy Mauricio Gutierrez
#### Wilman Quiñonez
#### Carlos Alberto Martinez Ramirez
#### Diego Agudelo




[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cam2149/icesi-nlp/blob/main/Sesion1/8-practice.ipynb)

In this notebook we put into practice some of the concepts covered in the previous notebooks, applied to a specific corpus:

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company's or a person's brand by going viral (positive) or hurt profits because of its negative tone. Capturing sentiment in language matters at a time when decisions and reactions are created and updated in seconds. But which words actually drive the description of the sentiment? In this competition, you need to identify the part of the tweet (a word or phrase) that reflects the sentiment.

"My ridiculous dog is amazing." [sentiment: positive]

Build your skills in this important area with this broad dataset of tweets, and refine your technique to reach the top spot in this competition. Which words in tweets support a positive, negative, or neutral sentiment? How can you help make that determination using machine learning tools?

The dataset is titled "Sentiment Analysis: Emotion in Text tweets with existing sentiment labels" and is used here under the Creative Commons Attribution 4.0 International license. The goal in this competition is to build a model that does the same thing: look at the labeled sentiment of a given tweet and determine which word or phrase best supports it.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## References
* [Tweet Sentiment Extraction](https://www.kaggle.com/competitions/tweet-sentiment-extraction/overview)


In [7]:
!pip install kaggle
!pip install vaderSentiment
!pip install tqdm
!pip uninstall -y nltk numpy scikit-learn
!pip install nltk
!pip install --upgrade nltk
!pip install GingerIt


Found existing installation: nltk 3.9.1
Uninstalling nltk-3.9.1:
  Successfully uninstalled nltk-3.9.1
Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Found existing installation: scikit-learn 1.7.1
Uninstalling scikit-learn-1.7.1:
  Successfully uninstalled scikit-learn-1.7.1
Collecting nltk
  Using cached nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Using cached nltk-3.9.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
Successfully installed nltk-3.9.1


In [1]:
!pip install numpy==1.24.4 scikit-learn==1.2.2



In [2]:
import pkg_resources
import warnings
import spacy
import pandas as pd
import os
import nltk

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from spacy.matcher import Matcher
from sklearn.metrics import accuracy_score, classification_report
from tqdm import tqdm  # progress bars for long-running apply() calls
warnings.filterwarnings('ignore')

# Detect whether the notebook is running on Google Colab
installed_packages = [package.key for package in pkg_resources.working_set]
IN_COLAB = 'google-colab' in installed_packages

if IN_COLAB:
    # google.colab exists only inside Colab; importing it elsewhere raises ImportError
    from google.colab import files
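
`pkg_resources` is deprecated in recent setuptools releases, so the Colab check above may emit warnings in newer environments. A minimal sketch of the same check with the standard-library `importlib.metadata` follows. The helper name `is_installed` is hypothetical and not part of the notebook.

```python
from importlib import metadata


def is_installed(dist_name: str) -> bool:
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(dist_name)
        return True
    except metadata.PackageNotFoundError:
        return False


# Same signal the notebook relies on: the 'google-colab' distribution is present on Colab.
IN_COLAB = is_installed("google-colab")
print("Running on Colab:", IN_COLAB)
```

This looks up a single distribution instead of listing every installed package, which is usually enough for an environment check.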

In [3]:
!python -m spacy download en_core_web_trf
nltk.download('vader_lexicon')

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 457.4/457.4 MB 2.3 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [4]:
# Download and load the Kaggle dataset conditionally
#%%bash

#if [ ! -d "tweet_data" ]; then
#  echo "Downloading and extracting dataset..."
#  mkdir -p ~/.kaggle
#  # Assuming kaggle.json is already uploaded to Colab's files
#  test -f "kaggle.json" && mv kaggle.json ~/.kaggle/
#  chmod 600 ~/.kaggle/kaggle.json
#  kaggle competitions download -c tweet-sentiment-extraction
#  unzip -o tweet-sentiment-extraction.zip -d tweet_data
#else
#  echo "Dataset already exists in tweet_data directory."
#fi

In [5]:
!test '{IN_COLAB}' = 'True' && wget -O requirements.txt https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt && pip install -r requirements.txt

--2025-08-13 23:56:30--  https://github.com/cam2149/icesi-nlp/raw/refs/heads/main/requirements.txt
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt [following]
--2025-08-13 23:56:30--  https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/requirements.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 349 [text/plain]
Saving to: ‘requirements.txt’


2025-08-13 23:56:30 (34.1 MB/s) - ‘requirements.txt’ saved [349/349]

Collecting scikit-learn==1.3.0 (from -r requirements.txt (line 4))
  Using cached scikit_learn-1.3.0-cp311-cp311-ma

In [6]:
# Initialize SpaCy and VADER
# This cell sets up the two main libraries used in this notebook for natural language
# processing and sentiment analysis.
#
# nlp = spacy.load("en_core_web_sm") loads a small pre-trained English SpaCy pipeline that
# provides tokenization, part-of-speech (POS) tagging, dependency parsing, and more.
# The loaded pipeline is stored in `nlp` and used to process text throughout the notebook.
#
# analyzer = SentimentIntensityAnalyzer() creates a VADER (Valence Aware Dictionary and
# sEntiment Reasoner) analyzer. VADER is a lexicon- and rule-based sentiment tool tuned
# specifically for sentiment expressed on social media. The analyzer is stored in
# `analyzer` and used later to score text.
nlp = spacy.load("en_core_web_sm")
# nlp = spacy.load("en_core_web_trf")  # optional transformer-based pipeline downloaded above
analyzer = SentimentIntensityAnalyzer()

In [7]:
# Load the dataset
DATA_URL = "https://raw.githubusercontent.com/cam2149/icesi-nlp/refs/heads/main/Sesion1/train.csv"
try:
    train_df = pd.read_csv(DATA_URL)
except Exception as exc:
    # Reading from a URL fails with network errors rather than FileNotFoundError.
    # Re-raise so that later cells do not fail with a confusing NameError on train_df.
    raise RuntimeError(f"Could not load train.csv from {DATA_URL}") from exc


# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())


Dataset Preview:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
4  358bd9e861   Sons of ****, why couldn`t they put them on t...   

                         selected_text sentiment  
0  I`d have responded, if I were going   neutral  
1                             Sooo SAD  negative  
2                          bullying me  negative  
3                       leave me alone  negative  
4                        Sons of ****,  negative  

Columns: ['textID', 'text', 'selected_text', 'sentiment']


In [8]:
train_df.dropna(inplace=True)

In [9]:
# Count initial number of rows
initial_rows = len(train_df)
# Filter out rows containing either " ****" or "http"
train_df = train_df[~train_df['text'].astype(str).str.contains(r" \*\*\*\*|http", regex=True)]
# Count remaining rows
rows_after_removal = len(train_df)
# Display results
print(f"Removed {initial_rows - rows_after_removal} rows containing ' ****' or 'http'.")
print(f"Remaining rows: {rows_after_removal}")


Removed 2066 rows containing ' ****' or 'http'.
Remaining rows: 25414
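
To make the filter above concrete, here is a small standard-library sketch of the same regular expression. The sample tweets are hypothetical and only illustrate which strings the pattern catches.

```python
import re

# Same pattern as the pandas filter: a space followed by four literal asterisks, or "http".
PATTERN = re.compile(r" \*\*\*\*|http")

# Hypothetical sample tweets, used only for illustration.
samples = [
    "Sons of ****, why couldn't they wait",  # censored word preceded by a space
    "check this out http://example.com",     # contains a link
    "****ing traffic",                        # asterisks at the start, no leading space
    "my boss is bullying me...",             # nothing to filter
]

kept = [s for s in samples if not PATTERN.search(s)]
dropped = [s for s in samples if PATTERN.search(s)]

print("kept:", kept)
print("dropped:", dropped)
```

Note that the pattern requires a space before the asterisks, so a censored word at the very start of a tweet is not removed by this filter.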


In [10]:
# Count whitespace-delimited tokens in the "text" column
# Values are cast to strings first so that non-string entries do not break split()
token_count = train_df["text"].astype(str).apply(lambda x: len(x.split())).sum()

print(f"\nTotal number of tokens in the dataset: {token_count}")


Total number of tokens in the dataset: 326104
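
The total above counts whitespace-delimited tokens, so punctuation attached to a word is not counted separately. The sketch below contrasts that with a simple punctuation-aware split on one tweet from the dataset preview. The regular expression is an illustrative assumption, not the tokenizer SpaCy uses.

```python
import re

tweet = "Sooo SAD I will miss you here in San Diego!!!"

# Whitespace tokens: what the notebook's x.split() counts.
whitespace_tokens = tweet.split()

# Illustrative alternative: each punctuation mark becomes its own token.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", tweet)

print(len(whitespace_tokens), whitespace_tokens)
print(len(punct_aware_tokens), punct_aware_tokens)
```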


In [11]:
import random
# Get the number of records in the DataFrame
num_records = len(train_df)
# Generate a random integer between 0 and num_records-1 (inclusive)
random_index = random.randint(0, num_records - 1)
print(f"A random index based on the number of records is: {random_index}")

A random index based on the number of records is: 1418


In [12]:
# Select the row at the random position chosen above
selected_row = train_df.iloc[random_index]
text = selected_row["text"]
# Process the text with SpaCy
doc = nlp(text)
# Print information for each token
print(f"Analyzing row {random_index} in the dataset:\n")
print(f"Row Info:\n{selected_row}\n")
print(f"Analyzing text: '{text}'\n")
print("{:20}{:20}{:20}{:20}".format("Text", "POS", "dep", "lemma"))
for token in doc:
    print(f"{token.text:{20}}{token.pos_:{20}}{token.dep_:{20}}{token.lemma_:{20}}")

Analyzing row 1418 in the dataset:

Row Info:
textID                          2f9f2f2a9a
text              next year will be sweeet
selected_text                       sweeet
sentiment                         positive
Name: 1520, dtype: object

Analyzing text: ' next year will be sweeet'

Text                POS                 dep                 lemma               
                    SPACE               dep                                     
next                ADJ                 amod                next                
year                NOUN                npadvmod            year                
will                AUX                 aux                 will                
be                  AUX                 ROOT                be                  
sweeet              ADJ                 acomp               sweeet              


In [13]:
# Count sentences across every field of the selected row (textID, text, selected_text,
# sentiment). Each field is cast to a string and parsed separately, so the total covers
# the whole row, not only the tweet.
# To count sentences in the tweet alone, use len(list(nlp(text).sents)).
sentence_count = train_df.iloc[random_index].astype(str).apply(lambda x: len(list(nlp(x).sents))).sum()

print(f"\nTotal number of sentences in the selected row: {sentence_count}")


Total number of sentences in the selected row: 4


In [14]:
from spacy import displacy

doc = nlp(text)
# dep for syntactic dependency
# This code uses displacy to visualize the syntactic dependencies of a sentence.
# 'doc' is the SpaCy Doc object that holds the processed sentence.
# style='dep' selects the dependency visualization.
# jupyter=True renders the visualization directly in a Jupyter or Colab environment.
# options={'distance': 110} sets the spacing between tokens to improve readability.
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [15]:
from tqdm import tqdm  # progress bar support

# Register tqdm with pandas so that progress_apply is available
tqdm.pandas(desc="Processing Text Justification")

# Text Justification Extraction with SpaCy Matcher
matcher = Matcher(nlp.vocab)

# Define positive and negative patterns.
# Each pattern matches ONE token whose lowercase form is in the list, so multi-word
# entries such as "leave me alone" or "sons of" can never match as written.
positive_pattern = [{"LOWER": {"IN": ["good", "great", "excellent", "love", "amazing"]}}]
negative_pattern = [{"LOWER": {"IN": ["bad", "poor", "terrible", "hate", "sad", "bullying",
                                     "leave me alone", "sons of", "son of", "boring", "aggressive",
                                     "anxiety", "angst", "gross", "chaos", "collapse", "confusion",
                                     "cringe", "critical", "damage", "disappointment", "deficient",
                                     "unpleasant", "disastrous", "desperate", "disillusion", "pain",
                                     "sick", "angry", "error", "stupid", "failure", "frustration",
                                     "horrible", "unacceptable", "incompetent", "ineffective", "unfair",
                                     "slow", "awful", "annoying", "negative", "danger", "loss",
                                     "problem", "rejection", "ridiculous", "risky", "toxic",
                                     "shame", "wtf", "fail", "ew", "meh", "so gross", "nooo", "ugh",
                                     "lame", "trash", "cancelled", "cancel him", "cancel her", "worst",
                                     "fatal", "disgusting", "why tho", "nah", "not cool", "dead",
                                     "over it", "fake", "phony", "drama", "messy"]}}]

matcher.add("PositiveWords", [positive_pattern])
matcher.add("NegativeWords", [negative_pattern])

# Return the text of the first match reported by the matcher, or "" if there is none
def extract_justification(text):
    if isinstance(text, str):  # make sure the input is a string
        doc = nlp(text)
        matches = matcher(doc)
        if matches:
            match_id, start, end = matches[0]
            return doc[start:end].text
    return ""

# Apply the function to the DataFrame, showing progress with tqdm.
# Note: the input is selected_text (the labeled span), not the raw tweet.
train_df["extracted_text"] = train_df["selected_text"].progress_apply(extract_justification)

print("\nText justification extraction complete!")


Processing Text Justification: 100%|██████████| 25414/25414 [02:21<00:00, 179.82it/s]


Text justification extraction complete!





In [16]:
# Data Exploration
print("Dataset Preview:")
print(train_df.head())
print("\nColumns:", train_df.columns.tolist())

Dataset Preview:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
6  6e0c6d75b1  2am feedings for the baby are fun when he is a...   

                         selected_text sentiment extracted_text  
0  I`d have responded, if I were going   neutral                 
1                             Sooo SAD  negative            SAD  
2                          bullying me  negative       bullying  
3                       leave me alone  negative                 
6                                  fun  positive                 

Columns: ['textID', 'text', 'selected_text', 'sentiment', 'extracted_text']
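
In the preview above, row 3 has the selected text "leave me alone", which is listed among the negative entries, yet `extracted_text` is empty. The `LOWER`/`IN` pattern compares one token at a time, so an entry that contains spaces never equals a single token. SpaCy provides `PhraseMatcher` for multi-token terms. The sketch below avoids SpaCy and uses a naive whitespace tokenizer (an assumption, not SpaCy's tokenizer) purely to illustrate the difference between token lookup and sequence lookup. All function names are hypothetical.

```python
def tokenize(text):
    """Naive lowercase whitespace tokenizer (stand-in for SpaCy's tokenizer)."""
    return text.lower().split()


def first_token_match(tokens, lexicon):
    """Token-level lookup: each token is compared to whole lexicon entries."""
    for tok in tokens:
        if tok in lexicon:
            return tok
    return ""


def first_phrase_match(tokens, lexicon):
    """Sequence lookup: compare n-grams of the text to tokenized lexicon entries."""
    phrases = {tuple(entry.split()) for entry in lexicon}
    max_len = max(len(p) for p in phrases)
    for start in range(len(tokens)):
        # Prefer the longest phrase that starts at this position.
        for size in range(min(max_len, len(tokens) - start), 0, -1):
            candidate = tuple(tokens[start:start + size])
            if candidate in phrases:
                return " ".join(candidate)
    return ""


lexicon = {"sad", "bullying", "leave me alone"}  # small subset of the notebook's list
tokens = tokenize("leave me alone")

print(repr(first_token_match(tokens, lexicon)))
print(repr(first_phrase_match(tokens, lexicon)))
```

The token-level lookup returns an empty string for this input, while the sequence lookup recovers the full phrase.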


In [17]:
# Preprocessing function using SpaCy
def preprocess_text(text):
    if pd.isna(text):
        return ""
    doc = nlp(str(text))
    tokens = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return " ".join(tokens)

train_df["processed_text"] = train_df["selected_text"].apply(preprocess_text)

In [18]:
from tqdm import tqdm  # progress bar support
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER analyzer (NLTK implementation; replaces the earlier instance)
analyzer = SentimentIntensityAnalyzer()

# Return the VADER sentiment scores for a text
def analyze_sentiment_scores(text):
    if isinstance(text, str):  # make sure the input is a string
        return analyzer.polarity_scores(text)
    else:
        return {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0}

# Register the progress bar for pandas apply operations
tqdm.pandas(desc="Applying VADER Sentiment Analysis")

# Apply the sentiment analysis function with a progress bar
sentiment_scores_df = train_df['text'].progress_apply(analyze_sentiment_scores).apply(pd.Series)

# Concatenate the new score columns to the original DataFrame
train_df = pd.concat([train_df, sentiment_scores_df], axis=1)

# Preview the first rows with the new score columns
print("Dataset Preview with VADER Scores:")
print(train_df.head())

# Print the column list, including the newly added columns
print("\nColumns:", train_df.columns.tolist())


Applying VADER Sentiment Analysis: 100%|██████████| 25414/25414 [00:02<00:00, 8659.73it/s]


Dataset Preview with VADER Scores:
       textID                                               text  \
0  cb774db0d1                I`d have responded, if I were going   
1  549e992a42      Sooo SAD I will miss you here in San Diego!!!   
2  088c60f138                          my boss is bullying me...   
3  9642c003ef                     what interview! leave me alone   
6  6e0c6d75b1  2am feedings for the baby are fun when he is a...   

                         selected_text sentiment extracted_text  \
0  I`d have responded, if I were going   neutral                  
1                             Sooo SAD  negative            SAD   
2                          bullying me  negative       bullying   
3                       leave me alone  negative                  
6                                  fun  positive                  

   processed_text    neg    neu    pos  compound  
0  I`d respond go  0.000  1.000  0.000    0.0000  
1        Sooo SAD  0.474  0.526  0.000   -0.7437  


In [19]:
# Predict sentiment based on VADER compound score
# Define thresholds for sentiment prediction
def predict_vader_sentiment(compound_score):
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

train_df['predicted_sentiment'] = train_df['compound'].apply(predict_vader_sentiment)

# Evaluation against provided labels
print("\nSentiment Prediction Evaluation:")
accuracy = accuracy_score(train_df["sentiment"], train_df["predicted_sentiment"])
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(train_df["sentiment"], train_df["predicted_sentiment"]))


Sentiment Prediction Evaluation:
Accuracy: 0.6324
              precision    recall  f1-score   support

    negative       0.69      0.60      0.64      7096
     neutral       0.71      0.47      0.57     10326
    positive       0.56      0.87      0.68      7992

    accuracy                           0.63     25414
   macro avg       0.65      0.65      0.63     25414
weighted avg       0.66      0.63      0.62     25414
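
The ±0.05 cut-offs decide how wide the neutral band is. The sketch below restates the mapping with a configurable threshold to show how labels shift when the band widens. The parameter `threshold` and the function name are hypothetical additions. The scores 0.0 and -0.7437 come from the preview above, and the remaining scores are made up for illustration.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# 0.0 and -0.7437 appear in the notebook's preview; the other scores are hypothetical.
scores = [0.0, -0.7437, 0.05, 0.049, -0.3]

for threshold in (0.05, 0.5):
    labels = [label_from_compound(s, threshold) for s in scores]
    print(threshold, labels)
```

A wider band moves borderline tweets into the neutral class, which might raise neutral recall at the expense of the other two classes. Whether that helps on this dataset would need to be measured.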



In [20]:
# Display results
print("\nSample Results with Extracted Justification:")
print(train_df[["text", "sentiment", "predicted_sentiment", "extracted_text"]].head())


Sample Results with Extracted Justification:
                                                text sentiment  \
0                I`d have responded, if I were going   neutral   
1      Sooo SAD I will miss you here in San Diego!!!  negative   
2                          my boss is bullying me...  negative   
3                     what interview! leave me alone  negative   
6  2am feedings for the baby are fun when he is a...  positive   

  predicted_sentiment extracted_text  
0             neutral                 
1            negative            SAD  
2            negative       bullying  
3            negative                 
6            positive                 


In [21]:
print(train_df.columns)
print([type(col) for col in train_df.columns])


Index(['textID', 'text', 'selected_text', 'sentiment', 'extracted_text',
       'processed_text', 'neg', 'neu', 'pos', 'compound',
       'predicted_sentiment'],
      dtype='object')
[<class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>, <class 'str'>]


In [22]:
# Calculate the percentage of rows where 'sentiment' and 'predicted_sentiment' are equal
percentage_equal = (train_df['sentiment'] == train_df['predicted_sentiment']).mean() * 100

print(f"Percentage of rows where 'sentiment' and 'predicted_sentiment' are equal: {percentage_equal:.2f}%")

Percentage of rows where 'sentiment' and 'predicted_sentiment' are equal: 63.24%


In [25]:

import re, random, warnings
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

warnings.filterwarnings("ignore")

# ---------------------------
# Config
# ---------------------------
SEED = 42
RUN_TRANSFORMER = True  # set to False to skip the Hugging Face model
MODEL_NAME = "cardiffnlp/twitter-roberta-base-sentiment-latest"
MAX_LENGTH = 256
BATCH = 32

LABELS = ["negative", "neutral", "positive"]
label2id = {k:i for i,k in enumerate(LABELS)}
id2label = {v:k for k,v in label2id.items()}

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


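# ---------------------------
# Data validation
# ---------------------------
assert "text" in train_df.columns and "sentiment" in train_df.columns, \
    "train_df must contain the 'text' and 'sentiment' columns"

df = train_df.dropna(subset=["text","sentiment"]).copy()
df["sentiment"] = df["sentiment"].str.lower().str.strip()
# Keep train_df's original index so predictions can be joined back by index in step 3
df = df[df["sentiment"].isin(LABELS)]

print("Ejemplos:", len(df))
print("Clases:", df["sentiment"].value_counts().to_dict())

# ---------------------------
# Light cleaning
# ---------------------------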
def clean_en(s: str) -> str:
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " ", s)
    s = re.sub(r"@\w+|#\w+", " ", s)
    s = re.sub(r"[^a-z0-9\s']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_clean"] = df["text"].astype(str).map(clean_en)

X_train, X_val, y_train, y_val = train_test_split(
    df["text_clean"], df["sentiment"],
    test_size=0.2, random_state=SEED, stratify=df["sentiment"]
)

# ==================================================
# 1) TF-IDF + LinearSVC
# ==================================================
print("\n=== TF-IDF + LinearSVC ===")
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    ngram_range=(1,3),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
    max_features=80000
)
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)

svm = LinearSVC(C=1.0, class_weight="balanced")
svm.fit(Xtr, y_train)
svm_preds = svm.predict(Xva)

svm_acc = accuracy_score(y_val, svm_preds)
svm_f1  = f1_score(y_val, svm_preds, average="macro")
print(f"Accuracy: {svm_acc:.4f} | F1-macro: {svm_f1:.4f}")
print(classification_report(y_val, svm_preds, digits=4))
print("Confusion matrix (SVM):\n", confusion_matrix(y_val, svm_preds, labels=LABELS))


df.loc[X_val.index, "pred_svm"] = svm_preds

# ==================================================
# 2) Transformer (direct inference, no pipelines)
# ==================================================
tr_acc = np.nan
tr_f1  = np.nan
if RUN_TRANSFORMER:
    print("\n=== RoBERTa (inferencia directa) ===")
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    def load_transformer_safely(model_name):
        tok = AutoTokenizer.from_pretrained(model_name)
        mdl = AutoModelForSequenceClassification.from_pretrained(model_name)
        return tok, mdl

    try:
        tokenizer, model = load_transformer_safely(MODEL_NAME)
        model.to(device)
        model.eval()

        # Try to read the label mapping from the model config if it is available
        if hasattr(model.config, "id2label") and model.config.id2label:
            id2label_model = {int(k): v.lower() for k, v in model.config.id2label.items()}
            # Normalize to our label names
            remap = {"negative":"negative","neutral":"neutral","positive":"positive"}
            id2label_model = {i: remap.get(lbl, lbl) for i, lbl in id2label_model.items()}
        else:
            id2label_model = {0:"negative", 1:"neutral", 2:"positive"}

        def predict_batch(texts, max_length=MAX_LENGTH, batch_size=BATCH):
            preds = []
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i+batch_size]
                enc = tokenizer(
                    batch, padding=True, truncation=True, max_length=max_length, return_tensors="pt"
                )
                for k in enc:
                    enc[k] = enc[k].to(device)
                with torch.no_grad():
                    logits = model(**enc).logits
                    y = torch.argmax(logits, dim=1).cpu().numpy().tolist()
                preds.extend([id2label_model.get(int(ix), id2label[int(ix)]) for ix in y])
            return preds

        tr_texts = X_val.astype(str).tolist()
        tr_preds = predict_batch(tr_texts)

        tr_acc = accuracy_score(y_val, tr_preds)
        tr_f1  = f1_score(y_val, tr_preds, average="macro")
        print(f"Accuracy: {tr_acc:.4f} | F1-macro: {tr_f1:.4f}")
        print(classification_report(y_val, tr_preds, digits=4))
        print("Confusion matrix (Transformer):\n", confusion_matrix(y_val, tr_preds, labels=LABELS))

        df.loc[X_val.index, "pred_transformer"] = tr_preds

    except Exception as e:
        # Any failure (download, memory, device) skips the transformer comparison
        print("⚠️ Transformer step skipped because of an error:", repr(e))

# ==================================================
# 3) Sync predictions back to train_df by index
# ==================================================

cols_to_merge = [c for c in ["pred_svm", "pred_transformer"] if c in df.columns]
train_df = train_df.join(df[cols_to_merge].reindex(train_df.index))

# ==================================================
# 4) Comparative summary
# ==================================================
summary = [{"model":"TFIDF+LinearSVC","accuracy":round(svm_acc,4),"f1_macro":round(svm_f1,4)}]
if not np.isnan(tr_acc):
    summary.append({"model":"RoBERTa (inference)","accuracy":round(tr_acc,4),"f1_macro":round(tr_f1,4)})

print("\n=== Summary ===")
print(pd.DataFrame(summary))

Ejemplos: 25414
Clases: {'neutral': 10326, 'positive': 7992, 'negative': 7096}

=== TF-IDF + LinearSVC ===
Accuracy: 0.6711 | F1-macro: 0.6726
              precision    recall  f1-score   support

    negative     0.6441    0.6427    0.6434      1419
     neutral     0.6465    0.6358    0.6411      2065
    positive     0.7249    0.7417    0.7332      1599

    accuracy                         0.6711      5083
   macro avg     0.6718    0.6734    0.6726      5083
weighted avg     0.6705    0.6711    0.6707      5083

Confusion matrix (SVM):
 [[ 912  408   99]
 [ 401 1313  351]
 [ 103  310 1186]]

=== RoBERTa (inferencia directa) ===



Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Accuracy: 0.7167 | F1-macro: 0.7185
              precision    recall  f1-score   support

    negative     0.6678    0.8062    0.7305      1419
     neutral     0.7476    0.5608    0.6408      2065
    positive     0.7364    0.8386    0.7842      1599

    accuracy                         0.7167      5083
   macro avg     0.7173    0.7352    0.7185      5083
weighted avg     0.7218    0.7167    0.7110      5083

Confusion matrix (Transformer):
 [[1144  209   66]
 [ 493 1158  414]
 [  76  182 1341]]

=== Summary ===
                 model  accuracy  f1_macro
0      TFIDF+LinearSVC    0.6711    0.6726
1  RoBERTa (inference)    0.7167    0.7185


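The code cleans the text and trains a classical model that combines TF-IDF (a technique that turns words into numbers according to their importance in the text) with a linear SVM (an algorithm that looks for the best boundary separating the classes). It also runs a pre-trained modern model (RoBERTa) in inference mode, without fine-tuning. It then evaluates the accuracy and macro-F1 of both, stores their predictions, and prints a comparison table showing which one classifies sentiment better. On this validation split RoBERTa reaches 0.7167 accuracy versus 0.6711 for TF-IDF + LinearSVC.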
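
To make the TF-IDF weighting concrete, here is a minimal standard-library sketch on a hypothetical toy corpus. It assumes the formulas documented for scikit-learn's defaults with `sublinear_tf=True`: tf = 1 + ln(count), idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. It is simplified to unigrams with whitespace tokenization and no stop-word removal, so it does not reproduce the notebook's vectorizer exactly.

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-cleaned tweets.
docs = [
    "so sad so sad today",
    "so happy today",
    "love this song",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed idf: terms that occur in fewer documents receive larger weights.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(tokens):
    """Sublinear tf times idf, scaled to unit L2 norm."""
    counts = Counter(tokens)
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}: {weight:.3f}")
```

In the first document "sad" and "so" occur equally often, but "sad" appears in fewer documents and therefore ends up with the larger weight.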

In [29]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sentence_transformers import SentenceTransformer
warnings.filterwarnings("ignore")

SEED = 42
LABELS = ["negative", "neutral", "positive"]
RUN_SBERT = True
C_GRID = [0.5, 1.0, 2.0]
MAX_ITERS = 2000
NGRAMS = (1,3)
MAX_FEATS = 80000
SBERT_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
BATCH = 256

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# ---------------------------
# Data validation
# ---------------------------
assert "text" in train_df.columns and "sentiment" in train_df.columns

df = train_df.dropna(subset=["text","sentiment"]).copy()
df["sentiment"] = df["sentiment"].str.lower().str.strip()
df = df[df["sentiment"].isin(LABELS)].reset_index(drop=True)

print("Ejemplos:", len(df))
print("Clases:", df["sentiment"].value_counts().to_dict())

# ---------------------------
# Light cleaning
# ---------------------------
def clean_en(s: str) -> str:
    s = s.lower()
    s = re.sub(r"http\S+|www\.\S+", " ", s)
    s = re.sub(r"@\w+|#\w+", " ", s)
    s = re.sub(r"[^a-z0-9\s']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_clean"] = df["text"].astype(str).map(clean_en)

X_train, X_val, y_train, y_val = train_test_split(
    df["text_clean"], df["sentiment"],
    test_size=0.2, random_state=SEED, stratify=df["sentiment"]
)

# ==================================================
# Model A: TF-IDF + LogisticRegression
# ==================================================
print("\n=== Modelo A: TF-IDF + LogisticRegression ===")
vectorizer = TfidfVectorizer(
    lowercase=True, stop_words="english",
    ngram_range=NGRAMS, min_df=2, max_df=0.9,
    sublinear_tf=True, max_features=MAX_FEATS
)
Xtr = vectorizer.fit_transform(X_train)
Xva = vectorizer.transform(X_val)

best_A_scores = (-1.0, -1.0)
best_A_preds = None
best_A_C = None

for C in C_GRID:
    clf = LogisticRegression(
        solver="saga", penalty="l2",
        multi_class="multinomial", class_weight="balanced",
        C=C, max_iter=MAX_ITERS, n_jobs=-1, random_state=SEED
    )
    clf.fit(Xtr, y_train)
    preds = clf.predict(Xva)
    acc = accuracy_score(y_val, preds)
    f1m = f1_score(y_val, preds, average="macro")
    print(f"C={C} -> Acc: {acc:.4f} | F1-macro: {f1m:.4f}")
    # Tuple comparison: accuracy decides first, F1-macro only breaks ties.
    if (acc, f1m) > best_A_scores:
        best_A_scores = (acc, f1m)
        best_A_preds = preds
        best_A_C = C

print(f"\nMejor C={best_A_C} -> Acc: {best_A_scores[0]:.4f} | F1-macro: {best_A_scores[1]:.4f}")
print(classification_report(y_val, best_A_preds, digits=4))
print("Confusion matrix (A):\n", confusion_matrix(y_val, best_A_preds, labels=LABELS))

# Store validation predictions; rows outside the validation split stay NaN.
df.loc[X_val.index, "pred_tfidf_logreg"] = best_A_preds

# ==================================================
# Model B: SBERT + LogisticRegression
# ==================================================
if RUN_SBERT:
    print("\n=== Modelo B: SBERT + LogisticRegression ===")
    device = "cuda" if torch.cuda.is_available() else "cpu"
    sbert = SentenceTransformer(SBERT_MODEL, device=device)

    def embed_texts(texts, model, batch_size=BATCH):
        return model.encode(
            texts, batch_size=batch_size,
            convert_to_numpy=True, normalize_embeddings=True,
            show_progress_bar=True
        )

    Xtr_dense = embed_texts(X_train.tolist(), sbert, batch_size=BATCH)
    Xva_dense = embed_texts(X_val.tolist(), sbert, batch_size=BATCH)

    best_B_scores = (-1.0, -1.0)
    best_B_preds = None
    best_B_C = None

    for C in C_GRID:
        clf = LogisticRegression(
            solver="lbfgs", penalty="l2",
            multi_class="multinomial", class_weight="balanced",
            C=C, max_iter=MAX_ITERS, n_jobs=-1, random_state=SEED
        )
        clf.fit(Xtr_dense, y_train)
        preds = clf.predict(Xva_dense)
        acc = accuracy_score(y_val, preds)
        f1m = f1_score(y_val, preds, average="macro")
        print(f"C={C} -> Acc: {acc:.4f} | F1-macro: {f1m:.4f}")
        # Same selection rule as Model A: accuracy first, F1-macro as tie-breaker.
        if (acc, f1m) > best_B_scores:
            best_B_scores = (acc, f1m)
            best_B_preds = preds
            best_B_C = C

    print(f"\nMejor C={best_B_C} -> Acc: {best_B_scores[0]:.4f} | F1-macro: {best_B_scores[1]:.4f}")
    print(classification_report(y_val, best_B_preds, digits=4))
    print("Confusion matrix (B):\n", confusion_matrix(y_val, best_B_preds, labels=LABELS))

    df.loc[X_val.index, "pred_sbert_logreg"] = best_B_preds

# ==================================================
# Summary
# ==================================================
summary = [
    {"model": "TFIDF+LogReg", "accuracy": round(best_A_scores[0],4), "f1_macro": round(best_A_scores[1],4)}
]
if RUN_SBERT:
    summary.append({"model": "SBERT+LogReg", "accuracy": round(best_B_scores[0],4), "f1_macro": round(best_B_scores[1],4)})

print("\n=== Summary ===")
print(pd.DataFrame(summary))

Ejemplos: 25414
Clases: {'neutral': 10326, 'positive': 7992, 'negative': 7096}

=== Modelo A: TF-IDF + LogisticRegression ===
C=0.5 -> Acc: 0.6941 | F1-macro: 0.6954
C=1.0 -> Acc: 0.6907 | F1-macro: 0.6921
C=2.0 -> Acc: 0.6842 | F1-macro: 0.6858

Mejor C=0.5 -> Acc: 0.6941 | F1-macro: 0.6954
              precision    recall  f1-score   support

    negative     0.6780    0.6603    0.6690      1419
     neutral     0.6632    0.6809    0.6719      2065
    positive     0.7495    0.7411    0.7453      1599

    accuracy                         0.6941      5083
   macro avg     0.6969    0.6941    0.6954      5083
weighted avg     0.6945    0.6941    0.6942      5083

Confusion matrix (A):
 [[ 937  387   95]
 [ 358 1406  301]
 [  87  327 1185]]

=== Modelo B: SBERT + LogisticRegression ===


C=0.5 -> Acc: 0.6945 | F1-macro: 0.6973
C=1.0 -> Acc: 0.6913 | F1-macro: 0.6942
C=2.0 -> Acc: 0.6911 | F1-macro: 0.6939

Mejor C=0.5 -> Acc: 0.6945 | F1-macro: 0.6973
              precision    recall  f1-score   support

    negative     0.6658    0.7385    0.7003      1419
     neutral     0.6949    0.6111    0.6503      2065
    positive     0.7206    0.7630    0.7412      1599

    accuracy                         0.6945      5083
   macro avg     0.6938    0.7042    0.6973      5083
weighted avg     0.6949    0.6945    0.6929      5083

Confusion matrix (B):
 [[1048  293   78]
 [ 408 1262  395]
 [ 118  261 1220]]

=== Summary ===
          model  accuracy  f1_macro
0  TFIDF+LogReg    0.6941    0.6954
1  SBERT+LogReg    0.6945    0.6973


We tried several ways of classifying sentiment into three classes. We started with VADER, which relies on a lexicon and hand-written rules: it is fast but does not capture context well, and it topped out at 0.63 F1-macro. We then moved to classical models on top of TF-IDF, which turns text into numeric features weighted by how informative each term is. LinearSVC raised F1-macro by about 4 points, and LogisticRegression with n-grams up to trigrams added about 2 more, with a better balance between precision and recall. Next we tried SBERT, which maps whole sentences to dense semantic vectors; it landed very close to TF-IDF+LogReg (0.6973 vs. 0.6954 F1-macro on the validation split). The best performer was RoBERTa, a transformer model that captures context and nuance, reaching 0.72 F1-macro, although it is heavier and slower than the other approaches.
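
Since the comparison above rests on F1-macro, it may help to see how that number follows from a confusion matrix. The sketch below is not part of the original pipeline: it copies the two validation confusion matrices printed earlier (rows are true classes, columns are predicted classes) and recomputes accuracy, per-class F1 and F1-macro with plain NumPy. The helper name `scores_from_confusion` and the constants `CM_TFIDF` and `CM_SBERT` are hypothetical and introduced only for illustration.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]

# Confusion matrices copied from the validation output
# (rows = true class, columns = predicted class).
CM_TFIDF = np.array([[937, 387, 95], [358, 1406, 301], [87, 327, 1185]])
CM_SBERT = np.array([[1048, 293, 78], [408, 1262, 395], [118, 261, 1220]])


def scores_from_confusion(cm):
    """Return (accuracy, per-class F1 array, macro F1) for a square confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                    # correct predictions per class
    precision = tp / cm.sum(axis=0)     # column sums = predicted counts
    recall = tp / cm.sum(axis=1)        # row sums = true counts (support)
    f1 = 2 * precision * recall / (precision + recall)
    return tp.sum() / cm.sum(), f1, f1.mean()


for name, cm in [("TFIDF+LogReg", CM_TFIDF), ("SBERT+LogReg", CM_SBERT)]:
    acc, f1, macro = scores_from_confusion(cm)
    per_class = ", ".join(f"{label}={value:.4f}" for label, value in zip(LABELS, f1))
    print(f"{name}: acc={acc:.4f} | F1-macro={macro:.4f} | {per_class}")
```

F1-macro is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its support. That explains why the two models end up almost tied overall while behaving differently per class: in the reports above, SBERT+LogReg has higher recall on negative and positive tweets but lower recall on neutral ones (0.6111 vs. 0.6809).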