# NLP Practical

This work presents a binary sentiment classification system applied to Amazon book reviews. The goal is to predict whether a review is positive or negative from the text written by users.


## 3. Training and testing

In [1]:
import pandas as pd
import pickle
import numpy as np

# Load the data from the previous notebook
with open('data_para_ejecutar_ejercicio3.pkl', 'rb') as f:
    df = pickle.load(f)

### Selected models

* [**Logistic Regression**](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression): a simple, fast linear model that works well on vectorized text (TF-IDF) and serves as a baseline for binary classification problems.
* [**LinearSVC (SVM)**](https://scikit-learn.org/stable/modules/svm.html): a linear classifier that often performs on par with, and sometimes better than, logistic regression on sparse, high-dimensional text features.
* [**DistilBERT**](https://huggingface.co/distilbert/distilbert-base-uncased) (optional, DL): a pretrained Transformer model distilled to be lightweight, included for comparison with the ML models above. Selected after reading [*Getting Started with Sentiment Analysis using Python*](https://huggingface.co/blog/sentiment-analysis-python#getting-started-with-sentiment-analysis-using-python).
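
The two classical baselines above share the same recipe: vectorize the text with TF-IDF, then fit a linear classifier. The following is a minimal sketch on a tiny made-up corpus, assuming scikit-learn is installed. The names `toy_texts`, `toy_labels`, and `baselines` are illustrative and not part of the notebook. It wraps both models in a scikit-learn pipeline so that the vectorizer is always fitted together with the classifier:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, only for illustration (not the Amazon reviews).
toy_texts = [
    "great book, loved every page",
    "wonderful story and great characters",
    "loved the writing, wonderful read",
    "boring plot, terrible pacing",
    "terrible book, waste of time",
    "boring and dull, waste of money",
]
toy_labels = ["Positivo", "Positivo", "Positivo",
              "Negativo", "Negativo", "Negativo"]

# One pipeline per baseline: TF-IDF features followed by a linear classifier.
baselines = {
    "LogisticRegression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "LinearSVC": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

new_reviews = ["wonderful great story", "terrible boring waste"]
toy_predictions = {}
for name, pipeline in baselines.items():
    pipeline.fit(toy_texts, toy_labels)
    toy_predictions[name] = list(pipeline.predict(new_reviews))
    print(name, toy_predictions[name])
```

Keeping the vectorizer inside the pipeline also makes it straightforward to cross-validate both models later without refitting TF-IDF by hand.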

### 3.1. TF-IDF

In [2]:
df['sentiment_binary'] = df['sentiment'].apply(lambda x: 'Positivo' if x >= 3 else 'Negativo')

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['processed_review'],
    df['sentiment_binary'],
    train_size=0.80,
    test_size=0.20,
    random_state=42,
    stratify=df['sentiment_binary']
)

print(f"Train: {len(X_train)} reviews")
print(f"Test: {len(X_test)} reviews")

# TF-IDF Vectorizer
vectorizer = TfidfVectorizer(
    max_df=0.95,        # Ignore terms that appear in more than 95% of documents
    min_df=5,           # Ignore terms that appear in fewer than 5 documents
    max_features=2500,  # Keep the 2500 most frequent terms
    ngram_range=(1, 2)  # Use unigrams and bigrams
)

# Fit on train only, then transform both train and test
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"\nShape TF-IDF: {X_train_vec.shape}")
print(f"Vocabulario: {len(vectorizer.vocabulary_)} términos")

Train: 4000 reviews
Test: 1000 reviews

Shape TF-IDF: (4000, 2500)
Vocabulario: 2500 términos
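
The cell above fits the vectorizer on the training split only and then reuses it to transform the test split, so no vocabulary or IDF statistics leak from the test reviews. A minimal sketch with made-up documents shows the consequence: a word seen only at test time is simply dropped. The names `train_docs`, `test_docs`, and `toy_vectorizer` are illustrative, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, only for illustration.
train_docs = ["good book", "bad book", "good story"]
test_docs = ["good plot"]  # "plot" never appears in the training documents

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary + IDF
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(toy_vectorizer.vocabulary_))    # vocabulary comes from train only
print(train_matrix.shape, test_matrix.shape)  # same number of columns
print(test_matrix.nnz)                        # non-zero entries in the test row
```

This is also why both matrices in the notebook have exactly 2500 columns: the column count is fixed by the vocabulary learned on the training split, capped by `max_features`.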


### 3.2. Machine Learning Model 1: Logistic Regression

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Train Logistic Regression
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42
)

lr_model.fit(X_train_vec, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test_vec)

# Metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))


Accuracy: 0.8240

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.82      0.72      0.77       400
    Positivo       0.83      0.89      0.86       600

    accuracy                           0.82      1000
   macro avg       0.82      0.81      0.81      1000
weighted avg       0.82      0.82      0.82      1000


Confusion Matrix:
[[288 112]
 [ 64 536]]
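
The figures in the classification report can be recovered by hand from the confusion matrix. The sketch below assumes scikit-learn's convention (rows are true classes, columns are predicted classes, labels sorted alphabetically, so `Negativo` comes first). The variable names are illustrative.

```python
# Confusion matrix reported above for Logistic Regression.
# Rows = true class, columns = predicted class, order = [Negativo, Positivo].
cm = [[288, 112],
      [64, 536]]

neg_as_neg, neg_as_pos = cm[0]  # true Negativo reviews
pos_as_neg, pos_as_pos = cm[1]  # true Positivo reviews
total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

accuracy = (neg_as_neg + pos_as_pos) / total
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)     # share of Negativo found
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)  # share of "Negativo" calls that are right
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)

print(f"accuracy      = {accuracy:.3f}")
print(f"Negativo: precision = {precision_neg:.2f}, recall = {recall_neg:.2f}")
print(f"Positivo: precision = {precision_pos:.2f}, recall = {recall_pos:.2f}")
```

The 112 negative reviews predicted as positive are the main source of error here, which is why recall on `Negativo` (0.72) is noticeably lower than on `Positivo` (0.89).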


### 3.3. Machine Learning Model 2: SVM

In [5]:
from sklearn.svm import LinearSVC

# Train SVM
svm_model = LinearSVC(
    max_iter=1000,
    random_state=42
)

svm_model.fit(X_train_vec, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_vec)

# Metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_svm))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))


Accuracy: 0.8130

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.79      0.73      0.76       400
    Positivo       0.83      0.87      0.85       600

    accuracy                           0.81      1000
   macro avg       0.81      0.80      0.80      1000
weighted avg       0.81      0.81      0.81      1000


Confusion Matrix:
[[291 109]
 [ 78 522]]


### 3.4. Deep Learning Model: DistilBERT

In [6]:
!pip install transformers datasets torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support



In [7]:
# Convert labels to integers
label_map = {'Negativo': 0, 'Positivo': 1}
y_train_encoded = y_train.map(label_map)
y_test_encoded = y_test.map(label_map)

train_dataset = Dataset.from_dict({
    'text': X_train.tolist(),
    'label': y_train_encoded.tolist()
})

test_dataset = Dataset.from_dict({
    'text': X_test.tolist(),
    'label': y_test_encoded.tolist()
})

In [8]:
# Load model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
# Tokenize (pad or truncate every review to 128 tokens)
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

In [10]:
# Metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    acc = accuracy_score(labels, predictions)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}
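
To see what `compute_metrics` receives and returns, it can be called directly on a small array of made-up logits. This is a self-contained sketch that repeats the function so it runs on its own. The names `fake_logits` and `fake_labels` are hypothetical stand-ins for what the `Trainer` passes in at evaluation time.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair; argmax picks the predicted class.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    acc = accuracy_score(labels, predictions)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}


# Hypothetical logits for 4 reviews and 2 classes (0 = Negativo, 1 = Positivo).
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
fake_labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

With `average='weighted'`, recall is the support-weighted mean of the per-class recalls, which always equals accuracy. That is why the Accuracy and Recall columns in the training log show the same values.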

In [11]:
# Trainer arguments
# Reference: https://huggingface.co/docs/transformers/main_classes/trainer
training_args = TrainingArguments(
    output_dir='./results',  # path where checkpoints are saved
    num_train_epochs=3,  # number of passes over the training data
    per_device_train_batch_size=16,  # reviews per batch during training
    per_device_eval_batch_size=16,  # reviews per batch during evaluation
    eval_strategy="epoch",  # evaluate at the end of each epoch
    save_strategy="epoch",  # save a checkpoint at the end of each epoch
    logging_dir='./logs',  # path where logs are written
    load_best_model_at_end=True,  # reload the best checkpoint after training (lowest validation loss by default)
    report_to="none",  # avoids the wandb API key prompt
)

In [12]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # note: the test split also serves as the validation set here
    compute_metrics=compute_metrics,
)

In [13]:
# Train
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.413425 | 0.804 | 0.805876 | 0.817012 | 0.804 |
| 2 | 0.378400 | 0.461192 | 0.815 | 0.810578 | 0.817314 | 0.815 |
| 3 | 0.378400 | 0.553353 | 0.829 | 0.828664 | 0.828466 | 0.829 |


TrainOutput(global_step=750, training_loss=0.30266086832682293, metrics={'train_runtime': 179.4352, 'train_samples_per_second': 66.876, 'train_steps_per_second': 4.18, 'total_flos': 397402195968000.0, 'train_loss': 0.30266086832682293, 'epoch': 3.0})
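
Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` ranks checkpoints by validation loss, which is its documented default. The sketch below replays that selection on the per-epoch values from the training log, using plain Python. The `eval_log` structure is illustrative, not something the `Trainer` exposes in this exact form.

```python
# Per-epoch evaluation values copied from the training log above.
eval_log = [
    {"epoch": 1, "eval_loss": 0.413425, "eval_accuracy": 0.804},
    {"epoch": 2, "eval_loss": 0.461192, "eval_accuracy": 0.815},
    {"epoch": 3, "eval_loss": 0.553353, "eval_accuracy": 0.829},
]

# Default behaviour: the "best" checkpoint is the one with the lowest loss.
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

# Alternative criterion: the checkpoint with the highest accuracy.
best_by_accuracy = max(eval_log, key=lambda row: row["eval_accuracy"])

print("best by loss:    ", best_by_loss)
print("best by accuracy:", best_by_accuracy)
```

Ranking by loss selects epoch 1, which is consistent with the evaluation cell reporting an accuracy of 0.8040 rather than the 0.829 reached at epoch 3. If accuracy is the metric of interest, passing `metric_for_best_model="accuracy"` to `TrainingArguments` should make the `Trainer` keep the epoch-3 checkpoint instead. The rising validation loss across epochs is nonetheless a sign that the model starts to overfit after the first epoch.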

In [14]:
# Evaluate
results = trainer.evaluate()
print(f"\nAccuracy final: {results['eval_accuracy']:.4f}")
print(f"F1-score final: {results['eval_f1']:.4f}")

# Predictions for the classification report
predictions = trainer.predict(test_dataset)
y_pred_bert = np.argmax(predictions.predictions, axis=1)

# Detailed metrics
inverse_label_map = {0: 'Negativo', 1: 'Positivo'}
y_pred_bert_labels = [inverse_label_map[pred] for pred in y_pred_bert]

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_bert_labels))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_bert_labels))


Accuracy final: 0.8040
F1-score final: 0.8059

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.71      0.85      0.78       400
    Positivo       0.89      0.77      0.83       600

    accuracy                           0.80      1000
   macro avg       0.80      0.81      0.80      1000
weighted avg       0.82      0.80      0.81      1000


Confusion Matrix:
[[340  60]
 [136 464]]
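
The three confusion matrices reported in this section can be put side by side to compare how each model distributes its errors. The sketch below uses plain Python and only the numbers printed above. The `confusion` and `summary` names are illustrative.

```python
# Confusion matrices reported above (rows = true, columns = predicted,
# order = [Negativo, Positivo]).
confusion = {
    "Logistic Regression": [[288, 112], [64, 536]],
    "LinearSVC": [[291, 109], [78, 522]],
    "DistilBERT": [[340, 60], [136, 464]],
}

summary = {}
for name, ((neg_ok, neg_miss), (pos_miss, pos_ok)) in confusion.items():
    total = neg_ok + neg_miss + pos_miss + pos_ok
    recall_neg = neg_ok / (neg_ok + neg_miss)
    recall_pos = pos_ok / (pos_ok + pos_miss)
    summary[name] = {
        "accuracy": (neg_ok + pos_ok) / total,
        "recall_neg": recall_neg,
        "recall_pos": recall_pos,
        "macro_recall": (recall_neg + recall_pos) / 2,
    }
    row = summary[name]
    print(f"{name:<20} acc={row['accuracy']:.3f}  "
          f"recall_neg={row['recall_neg']:.3f}  "
          f"recall_pos={row['recall_pos']:.3f}  "
          f"macro_recall={row['macro_recall']:.3f}")
```

On these numbers, DistilBERT has the lowest accuracy but the highest recall on `Negativo`, while the two linear models favour the majority `Positivo` class. Which trade-off is preferable depends on whether missing a negative review or a positive one is more costly.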


In [15]:
# Save each model's results for comparison in the next exercise.
# Note: lr_results, svm_results and bert_results are never assigned in this
# notebook, so all three entries are stored as None.
# The dict is named `export` so it does not overwrite the `results` returned
# by trainer.evaluate() above.
export = {
    'lr_results': lr_results if 'lr_results' in locals() else None,
    'svm_results': svm_results if 'svm_results' in locals() else None,
    'bert_results': bert_results if 'bert_results' in locals() else None,
    'df': df
}

with open('data_para_ejecutar_ejercicio4.pkl', 'wb') as f:
    pickle.dump(export, f)

print("  Modelos disponibles:")
for name, result in export.items():
    if result is not None and name != 'df':
        print(f"    - {name}")

  Modelos disponibles:
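
The list printed above is empty because no per-model result objects exist when the save cell runs. One possible way to fill them in is sketched below. It assumes each result should bundle the same metrics printed earlier. The `summarize` helper and the toy labels are hypothetical and not part of the notebook.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def summarize(y_true, y_pred):
    """Hypothetical helper: bundle the metrics printed earlier into one dict."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "report": classification_report(y_true, y_pred, output_dict=True),
        "confusion_matrix": confusion_matrix(y_true, y_pred).tolist(),
    }


# Toy labels standing in for y_test and a model's predictions.
toy_true = ["Negativo", "Positivo", "Positivo", "Negativo"]
toy_pred = ["Negativo", "Positivo", "Negativo", "Negativo"]

toy_results = summarize(toy_true, toy_pred)
print(toy_results["accuracy"])
print(toy_results["confusion_matrix"])
```

In the notebook, the equivalent calls would be along the lines of `lr_results = summarize(y_test, y_pred_lr)`, `svm_results = summarize(y_test, y_pred_svm)`, and `bert_results = summarize(y_test, y_pred_bert_labels)`, placed before the save cell.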
