# NLP Assignment

This project presents a binary sentiment classification system applied to Amazon book reviews. The goal is to predict whether a review is positive or negative from the text written by users.


## 3. Training and testing

In [None]:
import pandas as pd
import pickle
import numpy as np

# Load the data produced by the previous notebook
with open('data_para_ejecutar_ejercicio3.pkl', 'rb') as f:
    df = pickle.load(f)

### Selected models

* [**Logistic Regression**](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression): a simple, fast linear model that works well on vectorized text (TF-IDF) and serves as a baseline for binary classification problems.
* [**LinearSVC (SVM)**](https://scikit-learn.org/stable/modules/svm.html): a linear classifier that is often competitive with logistic regression on sparse text features, although in this notebook it scores slightly lower (0.8100 vs. 0.8250 accuracy).
* [**DistilBERT**](https://huggingface.co/distilbert/distilbert-base-uncased) (optional, DL): a pretrained Transformer model optimized to be lightweight, included for comparison with the ML models above. Chosen after reading [*Getting Started with Sentiment Analysis using Python*](https://huggingface.co/blog/sentiment-analysis-python#getting-started-with-sentiment-analysis-using-python).

### 3.1. TF-IDF

In [None]:
# Binarize the label: values >= 3 become 'Positivo', so a middle value of 3 counts as positive
df['sentiment_binary'] = df['sentiment'].apply(lambda x: 'Positivo' if x >= 3 else 'Negativo')

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['processed_review'],
    df['sentiment_binary'],
    train_size=0.80,
    test_size=0.20,
    random_state=42,
    stratify=df['sentiment_binary']
)

print(f"Train: {len(X_train)} reviews")
print(f"Test: {len(X_test)} reviews")

# TF-IDF Vectorizer
vectorizer = TfidfVectorizer(
    max_df=0.95,        # Ignore terms that appear in more than 95% of documents
    min_df=5,           # Ignore terms that appear in fewer than 5 documents
    max_features=2500,  # Keep the 2500 most frequent terms
    ngram_range=(1, 2)  # Use unigrams and bigrams
)

# Fit on train only, then transform both train and test
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print(f"\nShape TF-IDF: {X_train_vec.shape}")
print(f"Vocabulario: {len(vectorizer.vocabulary_)} términos")

Train: 4000 reviews
Test: 1000 reviews

Shape TF-IDF: (4000, 2500)
Vocabulario: 2500 términos
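
To make the weighting concrete, here is a minimal, dependency-free sketch of how TF-IDF scores are formed under scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document vector. The three sample sentences and the helper names (`doc_freq`, `idf`, `tfidf_vector`) are made up for illustration; the real `TfidfVectorizer` also handles tokenization, n-grams and the `min_df`/`max_df`/`max_features` filters.

```python
import math
from collections import Counter

# Toy corpus (hypothetical sentences, not taken from the dataset)
docs = [
    "great book great story",
    "boring book",
    "great characters",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency (scikit-learn's default formula)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0


def tfidf_vector(tokens):
    """Raw term counts times IDF, then L2-normalized."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(w, 3) for term, w in sorted(vec.items())})
```

In the second toy document, the rarer term `boring` receives a larger weight than `book`, which appears in two of the three documents.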


### 3.2. Machine learning model 1: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Train Logistic Regression
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42
)

lr_model.fit(X_train_vec, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test_vec)

# Metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_lr))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_lr))


Accuracy: 0.8250

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.82      0.72      0.77       400
    Positivo       0.83      0.89      0.86       600

    accuracy                           0.82      1000
   macro avg       0.82      0.81      0.81      1000
weighted avg       0.82      0.82      0.82      1000


Confusion Matrix:
[[289 111]
 [ 64 536]]
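
The per-class figures in the report can be recovered directly from the confusion matrix. The sketch below does this by hand, assuming scikit-learn's convention that rows are true classes and columns are predicted classes, in sorted label order (`'Negativo'`, `'Positivo'`). `per_class_metrics` is a hypothetical helper written for this illustration.

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
cm = [[289, 111],
      [64, 536]]
labels = ["Negativo", "Positivo"]


def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # true k, predicted as another class
    fp = sum(row[k] for row in cm) - tp   # another class, predicted as k
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print(f"accuracy = {accuracy:.4f}")
for k, name in enumerate(labels):
    p, r, f = per_class_metrics(cm, k)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Read this way, the weaker number is the recall on `Negativo`: 111 of the 400 negative reviews are predicted as positive.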


In [None]:
# Store the Logistic Regression results
from sklearn.metrics import precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_lr, average='weighted')

lr_results = {
    'model_name': 'Logistic Regression',
    'accuracy': accuracy_score(y_test, y_pred_lr),
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'y_pred': y_pred_lr,
    'y_test': y_test,
    'classification_report': classification_report(y_test, y_pred_lr, output_dict=True),
    'confusion_matrix': confusion_matrix(y_test, y_pred_lr)  # same key as in svm_results
}

### 3.3. Machine learning model 2: SVM

In [None]:
from sklearn.svm import LinearSVC

# Train the SVM
svm_model = LinearSVC(
    max_iter=1000,
    random_state=42
)

svm_model.fit(X_train_vec, y_train)

# Predictions
y_pred_svm = svm_model.predict(X_test_vec)

# Metrics
print(f"\nAccuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_svm))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_svm))


Accuracy: 0.8100

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.78      0.73      0.75       400
    Positivo       0.83      0.86      0.85       600

    accuracy                           0.81      1000
   macro avg       0.80      0.80      0.80      1000
weighted avg       0.81      0.81      0.81      1000


Confusion Matrix:
[[291 109]
 [ 81 519]]
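
The 1.5-point gap between the two accuracies is small relative to the sampling noise of a 1,000-review test set. As a rough check, the sketch below computes a normal-approximation (Wald) 95% interval for each accuracy. This is a back-of-the-envelope illustration rather than part of the original analysis: `wald_interval` is a hypothetical helper, and because both models are scored on the same reviews, a paired test such as McNemar's would be the more appropriate comparison.

```python
import math


def wald_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    std_err = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * std_err, accuracy + z * std_err


n_test = 1000  # size of the test split
lr_low, lr_high = wald_interval(0.8250, n_test)    # Logistic Regression
svm_low, svm_high = wald_interval(0.8100, n_test)  # LinearSVC

print(f"Logistic Regression: [{lr_low:.3f}, {lr_high:.3f}]")
print(f"LinearSVC:           [{svm_low:.3f}, {svm_high:.3f}]")

intervals_overlap = lr_low <= svm_high and svm_low <= lr_high
print("Intervals overlap:", intervals_overlap)
```

The two intervals overlap substantially, so this split alone does not show that one linear model is clearly better than the other.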


In [None]:
# Store the SVM results
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred_svm, average='weighted')

svm_results = {
    'model_name': 'SVM (LinearSVC)',
    'accuracy': accuracy_score(y_test, y_pred_svm),
    'precision': precision,
    'recall': recall,
    'f1': f1,
    'y_pred': y_pred_svm,
    'y_test': y_test,
    'classification_report': classification_report(y_test, y_pred_svm, output_dict=True),
    'confusion_matrix': confusion_matrix(y_test, y_pred_svm)
}

### 3.4. Deep learning model: DistilBERT

In [None]:
!pip install transformers datasets torch

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import Dataset
import torch
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support



In [None]:
# Convert the labels to integers
label_map = {'Negativo': 0, 'Positivo': 1}
y_train_encoded = y_train.map(label_map)
y_test_encoded = y_test.map(label_map)

train_dataset = Dataset.from_dict({
    'text': X_train.tolist(),
    'label': y_train_encoded.tolist()
})

test_dataset = Dataset.from_dict({
    'text': X_test.tolist(),
    'label': y_test_encoded.tolist()
})

In [None]:
# Load the model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Tokenize
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)
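
The effect of `padding='max_length'` together with `truncation=True` is that every review ends up as a sequence of exactly `max_length` token ids, plus an attention mask that marks which positions are real tokens. The following is a simplified sketch of that idea, not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the ids are toy values, and the real tokenizer also inserts special tokens and keeps the closing one when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]            # truncation: drop the overflow
    n_pad = max_length - len(kept)           # padding: fill up to max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is cut (toy ids, max_length=6)
short_ids, short_mask = pad_or_truncate([101, 2307, 2338, 102], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, any text beyond the first 128 tokens of a review is discarded, which may remove sentiment cues that appear late in long reviews.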


In [None]:
# Metrics
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='weighted')
    acc = accuracy_score(labels, predictions)
    return {'accuracy': acc, 'f1': f1, 'precision': precision, 'recall': recall}
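
`compute_metrics` receives raw logits, one row per review and one column per class, and `np.argmax` turns each row into a class id. No softmax is needed for this step because softmax preserves the ordering of the scores. A small pure-Python sketch with made-up logits (the values and the helper names `softmax` and `argmax` are illustrative assumptions):

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)


# Hypothetical logits: column 0 = 'Negativo', column 1 = 'Positivo'
logits = [[1.2, -0.3], [-0.8, 0.5], [0.1, 0.2]]

pred_from_logits = [argmax(row) for row in logits]
pred_from_probs = [argmax(softmax(row)) for row in logits]

print(pred_from_logits)
print(pred_from_probs)
```

Both lists are identical, which is why the metric function can work on logits directly.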

In [None]:
# Trainer arguments
# Reference: https://huggingface.co/docs/transformers/main_classes/trainer
training_args = TrainingArguments(
    output_dir='./results',  # where checkpoints are saved
    num_train_epochs=3,  # number of passes over the training data
    per_device_train_batch_size=16,  # reviews per batch (per device) during training
    per_device_eval_batch_size=16,  # reviews per batch (per device) during evaluation
    eval_strategy="epoch",  # evaluate at the end of each epoch
    save_strategy="epoch",  # save a checkpoint at the end of each epoch
    logging_dir='./logs',  # where logs are written
    load_best_model_at_end=True,  # reload the best checkpoint after training (lowest eval loss by default)
    report_to="none",  # disable external loggers so no wandb API key is requested
)

In [None]:
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,  # the test split is also used for per-epoch evaluation and checkpoint selection
    compute_metrics=compute_metrics,
)

In [None]:
# Train
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | No log | 0.413198 | 0.791 | 0.792922 | 0.802354 | 0.791 |
| 2 | 0.394000 | 0.463898 | 0.805 | 0.799187 | 0.808875 | 0.805 |
| 3 | 0.394000 | 0.549907 | 0.822 | 0.822 | 0.822 | 0.822 |


TrainOutput(global_step=750, training_loss=0.31912750244140625, metrics={'train_runtime': 174.1345, 'train_samples_per_second': 68.912, 'train_steps_per_second': 4.307, 'total_flos': 397402195968000.0, 'train_loss': 0.31912750244140625, 'epoch': 3.0})
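
Because `load_best_model_at_end=True` is set without `metric_for_best_model`, the `Trainer` ranks checkpoints by validation loss. In the log above, validation loss is lowest at epoch 1 while accuracy is highest at epoch 3, so the two criteria pick different checkpoints. The sketch below reproduces that selection on the logged values; it is only an illustration of the ranking logic, not the `Trainer`'s internal code.

```python
# Per-epoch evaluation values copied from the training log above
eval_log = [
    {"epoch": 1, "eval_loss": 0.413198, "eval_accuracy": 0.791},
    {"epoch": 2, "eval_loss": 0.463898, "eval_accuracy": 0.805},
    {"epoch": 3, "eval_loss": 0.549907, "eval_accuracy": 0.822},
]

# Default criterion: lowest validation loss
best_by_loss = min(eval_log, key=lambda row: row["eval_loss"])

# Alternative criterion: highest accuracy
best_by_accuracy = max(eval_log, key=lambda row: row["eval_accuracy"])

print("Best by loss:    ", best_by_loss)
print("Best by accuracy:", best_by_accuracy)
```

Passing `metric_for_best_model="accuracy"` to `TrainingArguments` would make the `Trainer` keep the epoch 3 checkpoint instead. Either way, selecting a checkpoint on the test split makes the reported score optimistic; a separate validation split would avoid that.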

In [None]:
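# Evaluate the checkpoint restored by load_best_model_at_end. No
# metric_for_best_model is set, so the checkpoint with the lowest validation
# loss (epoch 1) is used; the figures match the epoch 1 row, not epoch 3.
results = trainer.evaluate()
print(f"\nAccuracy final: {results['eval_accuracy']:.4f}")
print(f"F1-score final: {results['eval_f1']:.4f}")

# Predictions for the classification report
predictions = trainer.predict(test_dataset)
y_pred_bert = np.argmax(predictions.predictions, axis=1)

# Map class ids back to the original labels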
inverse_label_map = {0: 'Negativo', 1: 'Positivo'}
y_pred_bert_labels = [inverse_label_map[pred] for pred in y_pred_bert]

print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_bert_labels))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_bert_labels))


y_test_bert_array = (
    y_test_encoded.values if hasattr(y_test_encoded, 'values')
    else y_test_encoded
)

# Store the DistilBERT results
bert_results = {
    'model_name': 'DistilBERT',
    'accuracy': results['eval_accuracy'],
    'precision': results['eval_precision'],
    'recall': results['eval_recall'],
    'f1': results['eval_f1'],
    'y_pred': y_pred_bert,
    'y_test': y_test_bert_array,
    'classification_report': classification_report(
        y_test_bert_array,
        y_pred_bert,
        output_dict=True,
        target_names=['Negativo', 'Positivo']
    ),
    'confusion_matrix': confusion_matrix(y_test_bert_array, y_pred_bert)  # same key as in svm_results
}


Accuracy final: 0.7910
F1-score final: 0.7929

Classification Report:
              precision    recall  f1-score   support

    Negativo       0.70      0.82      0.76       400
    Positivo       0.87      0.77      0.82       600

    accuracy                           0.79      1000
   macro avg       0.79      0.80      0.79      1000
weighted avg       0.80      0.79      0.79      1000


Confusion Matrix:
[[330  70]
 [139 461]]
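
The `weighted` averages reported by `compute_metrics` weight each class score by its support (400 negative and 600 positive reviews). As a consistency check, the sketch below recomputes them from the confusion matrix above, assuming rows are true classes and columns are predicted classes; `class_scores` is a hypothetical helper.

```python
# DistilBERT confusion matrix printed above (rows = true, columns = predicted)
cm = [[330, 70],
      [139, 461]]


def class_scores(cm, k):
    """Precision, recall and F1 for class index k."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp
    fn = sum(cm[k]) - tp
    return tp / (tp + fp), tp / (tp + fn), 2 * tp / (2 * tp + fp + fn)


support = [sum(row) for row in cm]   # true examples per class
total = sum(support)
scores = [class_scores(cm, k) for k in range(len(cm))]

# Support-weighted average of each metric across the two classes
weighted_precision, weighted_recall, weighted_f1 = (
    sum(support[k] / total * scores[k][i] for k in range(len(cm)))
    for i in range(3)
)

print(f"precision={weighted_precision:.6f}")
print(f"recall={weighted_recall:.6f}")
print(f"f1={weighted_f1:.6f}")
```

The results agree with the epoch 1 row of the training log (precision 0.802354, recall 0.791, F1 0.792922), which confirms that the evaluated model is the epoch 1 checkpoint. Weighted recall always equals accuracy, so it carries no extra information here.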


In [None]:
# Collect each model's results so the next exercise can compare them
results = {}

results['lr_results'] = lr_results
results['svm_results'] = svm_results
results['bert_results'] = bert_results
results['df'] = df

# Save to disk
with open('data_para_ejecutar_ejercicio4.pkl', 'wb') as f:
    pickle.dump(results, f)

print(f"\nModelos disponibles para el siguiente notebook:")
for model_name, result in results.items():
    if result is not None and model_name != 'df':
        if isinstance(result, dict) and 'accuracy' in result:
            print(f"  - {result['model_name']}: Accuracy = {result['accuracy']:.4f}")
        else:
            print(f"  - {model_name}")


Modelos disponibles para el siguiente notebook:
  - Logistic Regression: Accuracy = 0.8250
  - SVM (LinearSVC): Accuracy = 0.8100
  - DistilBERT: Accuracy = 0.7910
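
For reference, this is roughly how a follow-up notebook could read the pickle back and rank the models. It is a self-contained sketch under stated assumptions: it writes a reduced copy of the dictionary to a temporary file named `results_demo.pkl` (a hypothetical name, used here so the real output file is not touched), keeps only the accuracies printed above, and replaces the DataFrame with a placeholder.

```python
import os
import pickle
import tempfile

# Reduced stand-in for the saved dictionary (accuracies taken from the output above)
saved = {
    "lr_results": {"model_name": "Logistic Regression", "accuracy": 0.8250},
    "svm_results": {"model_name": "SVM (LinearSVC)", "accuracy": 0.8100},
    "bert_results": {"model_name": "DistilBERT", "accuracy": 0.7910},
    "df": None,  # placeholder for the DataFrame stored alongside the results
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results_demo.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(saved, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Keep only the per-model entries and sort by accuracy, best first
ranking = sorted(
    (entry for entry in loaded.values()
     if isinstance(entry, dict) and "accuracy" in entry),
    key=lambda entry: entry["accuracy"],
    reverse=True,
)

for entry in ranking:
    print(f"{entry['model_name']}: {entry['accuracy']:.4f}")
```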
