# Phase 3b: Deep Learning Models (BiLSTM)

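In this notebook, we train `Word2Vec` embeddings and classify with a `BiLSTM`. This baseline is compared against MLP classifiers built on pretrained sentence embeddings (configured in section 2).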
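For orientation, here is a rough sketch of how tokenized text could be turned into the padded tensor of vectors that a BiLSTM consumes. The actual preprocessing lives in `dl_models` and is not shown in this notebook, so the helper names, the `<pad>`/`<unk>` convention, and `max_len` are assumptions. Random vectors stand in for trained Word2Vec embeddings (100 dimensions, matching the dimension reported in the training log).

```python
import numpy as np

PAD, UNK = 0, 1  # reserved ids (assumed convention, not taken from dl_models)

def build_vocab(token_lists):
    """Assign an integer id to every token seen in the training texts."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(token_lists, vocab, max_len):
    """Map tokens to ids, truncating or right-padding each row to max_len."""
    batch = np.full((len(token_lists), max_len), PAD, dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        ids = [vocab.get(tok, UNK) for tok in tokens[:max_len]]
        batch[row, :len(ids)] = ids
    return batch

docs = [["good", "movie"], ["bad"], ["good", "bad", "plot", "twist"]]
vocab = build_vocab(docs)
batch = encode(docs, vocab, max_len=3)

# Stand-in embedding matrix: one 100-dimensional row per vocabulary entry.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 100))
embedding_matrix[PAD] = 0.0  # padding positions carry no signal

# Shape (n_docs, max_len, 100): the sequence input a recurrent layer expects.
vectors = embedding_matrix[batch]
```

With a trained Word2Vec model, `embedding_matrix` would be filled from the learned word vectors instead of random numbers.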

In [1]:
%load_ext autoreload
%autoreload 2
import sys
import os
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Add ../src to the import path so dl_models can be found
sys.path.append(os.path.abspath("../src"))
from dl_models import AdvancedDLManager



## 1. Load Data

In [None]:
data_path = Path("../data/processed_corpus.csv")
df_full = pd.read_csv(data_path)
df_full = df_full.dropna(subset=['clean_text', 'sentiment_score'])

# Columns to test as input features
input_columns = ['clean_text', 'lemmas_text']

# clean_text is used here only to obtain the split indices; each experiment later selects its own column
X_indices = df_full['clean_text']
y = df_full['sentiment_score']

# Hold-out test split, fixed once and shared by every experiment
X_train_raw, X_test_real, y_train_raw, y_test_real = train_test_split(
    X_indices, y, test_size=0.2, random_state=42, stratify=y
)
train_idx = X_train_raw.index
test_idx = X_test_real.index

print(f"Indices fijados -> Train Total: {len(train_idx)}, Test Intocable: {len(test_idx)}")

Indices fijados -> Train Total: 72679, Test Intocable: 18170
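The cell above splits once and keeps only the index labels, so every feature column is later evaluated on exactly the same rows. A minimal sketch of that pattern on toy data (the values and the helper name `split_indices` are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the processed corpus (values are invented).
toy = pd.DataFrame({
    "clean_text": [f"text {i}" for i in range(20)],
    "lemmas_text": [f"lemma {i}" for i in range(20)],
    "sentiment_score": [0, 1] * 10,
})

def split_indices(df, target_col, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) index labels from one stratified split."""
    train_part, test_part = train_test_split(
        df[target_col],
        test_size=test_size,
        random_state=seed,
        stratify=df[target_col],
    )
    return train_part.index, test_part.index

train_idx, test_idx = split_indices(toy, "sentiment_score")

# Any column can now be sliced with the same labels.
clean_test = toy.loc[test_idx, "clean_text"]
lemma_test = toy.loc[test_idx, "lemmas_text"]
```

Because both slices use the same labels, differences between `clean_text` and `lemmas_text` results cannot come from a different train/test partition.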




## 2. Training the deep learning models

In [3]:
experiments_config = [
    {
        'name': 'Baseline: Word2Vec + BiLSTM',
        'strategy': 'w2v',
        'model': None
    },
    {
        'name': 'SOTA: All-MiniLM + MLP',
        'strategy': 'transformer',
        'model': 'sentence-transformers/all-MiniLM-L6-v2' 
    },
    {
        'name': 'SOTA: BGE-Small + MLP',
        'strategy': 'transformer',
        'model': 'BAAI/bge-small-en-v1.5'
    },
    {
        'name': 'GenAI: Gemma-Embed + MLP',
        'strategy': 'ollama',
        'model': 'embeddinggemma:latest'
    }
]
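Each entry pairs a strategy key with an optional model identifier. A small, hypothetical validator (not part of `dl_models`) could catch typos in the list before a long run starts. The set of allowed strategies and the rule that only `'w2v'` runs without a model name are inferred from the list above and are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Strategies seen in experiments_config; treating this as the full set is an assumption.
KNOWN_STRATEGIES = {"w2v", "transformer", "ollama"}

@dataclass(frozen=True)
class ExperimentSpec:
    name: str
    strategy: str
    model: Optional[str] = None

    def __post_init__(self):
        if self.strategy not in KNOWN_STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy!r}")
        # Assumption: only 'w2v' trains its own embeddings and needs no model name.
        if self.strategy != "w2v" and not self.model:
            raise ValueError(f"{self.name}: strategy {self.strategy!r} needs a model")

def validate(config):
    """Turn a list of config dicts into validated ExperimentSpec objects."""
    return [ExperimentSpec(**entry) for entry in config]

specs = validate([
    {"name": "Baseline: Word2Vec + BiLSTM", "strategy": "w2v", "model": None},
    {"name": "SOTA: BGE-Small + MLP", "strategy": "transformer",
     "model": "BAAI/bge-small-en-v1.5"},
])
```

Calling `validate(experiments_config)` before the training loop would fail fast on a misspelled strategy instead of surfacing the error mid-run.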

In [None]:
results_dl = []

for col in input_columns:
    print(f"\n{'='*60}")
    print(f">>> PROCESANDO FEATURE: {col.upper()} <<<")
    print(f"{'='*60}")
    
    X_full_col = df_full[col].astype(str) 
    
    X_train_curr = X_full_col.loc[train_idx]
    y_train_curr = y.loc[train_idx]
    
    X_test_curr = X_full_col.loc[test_idx]
    y_test_curr = y.loc[test_idx]
    
    # Temporary DataFrame to make per-class sampling easier
    train_df_temp = pd.DataFrame({'feature': X_train_curr, 'target': y_train_curr})
    min_c = train_df_temp['target'].value_counts().min()
    
    print(f"Balanceando Train a {min_c} muestras por clase...")
    
    balanced_train = train_df_temp.groupby('target').apply(
        lambda x: x.sample(min_c, random_state=42)
    ).reset_index(drop=True)
    
    X_train_bal = balanced_train['feature']
    y_train_bal = balanced_train['target']
    
    X_tr_final, X_val_final, y_tr_final, y_val_final = train_test_split(
        X_train_bal, y_train_bal, test_size=0.1, random_state=42, stratify=y_train_bal
    )
    
    print(f"   Datos Finales DL -> Train: {len(X_tr_final)}, Val: {len(X_val_final)}")
    
    for exp in experiments_config:
        exp_id = f"{exp['name']} ({col})"
        print(f"\n   >>> Entrenando: {exp_id}")
        
        try:
            # Instantiate the manager
            dl_man = AdvancedDLManager(strategy=exp['strategy'], model_name=exp['model'])

            # Train Word2Vec (only for the 'w2v' strategy)
            if exp['strategy'] == 'w2v':
                # Train W2V on the WHOLE balanced training set (validation included) for better vocabulary coverage
                dl_man.train_w2v(X_train_bal)

            history = dl_man.train(X_tr_final, y_tr_final, X_val_final, y_val_final,
                                   epochs=5, batch_size=32)

            # Evaluate on the held-out test set
            print("Evaluando en Test Set...")
            rep = dl_man.evaluate(X_test_curr, y_test_curr)

            # Store the results
            results_dl.append({
                'Feature': col,
                'Model': exp['name'],
                'Report_Raw': rep,
                'History': history
            })

            # Quick printout: accuracy and macro-avg rows of the report
            lines = rep.split('\n')
            print(f"RESULTADO: {lines[-4].strip()} | {lines[-3].strip()}")
            
        except Exception as e:
            print(f"ERROR en {exp_id}: {e}")


>>> PROCESANDO FEATURE: CLEAN_TEXT <<<
Balanceando Train a 3898 muestras por clase...
   Datos Finales DL -> Train: 10524, Val: 1170

   >>> Entrenando: Baseline: Word2Vec + BiLSTM (clean_text)
Estrategia: w2v | Dimensión Vectores: 100
Entrenando BiLSTM en cuda...
Ep 1 - Loss: 0.5649 - Val Acc: 0.7547
Ep 2 - Loss: 0.5137 - Val Acc: 0.7547
Ep 3 - Loss: 0.5061 - Val Acc: 0.7675
Ep 4 - Loss: 0.4972 - Val Acc: 0.7504
Ep 5 - Loss: 0.4935 - Val Acc: 0.7761
Evaluando en Test Set...
RESULTADO: accuracy                           0.91     18170 | macro avg       0.66      0.76      0.69     18170

   >>> Entrenando: SOTA: All-MiniLM + MLP (clean_text)
Cargando Sentence Transformer: sentence-transformers/all-MiniLM-L6-v2
Estrategia: transformer | Dimensión Vectores: 384
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 329/329 [00:05<00:00, 62.96it/s]
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 37/37 [00:00<00:00, 66.03it/s]
Entrenando SemanticMLP en cuda...
Ep 1 - Loss: 0.4817 - Val Acc: 0.8068
Ep 2 - Loss: 0.4072 - Val Acc: 0.8068
Ep 3 - Loss: 0.3836 - Val Acc: 0.8197
Ep 4 - Loss: 0.3565 - Val Acc: 0.8128
Ep 5 - Loss: 0.3327 - Val Acc: 0.8051
Evaluando en Test Set...
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 568/568 [00:05<00:00, 99.06it/s]
RESULTADO: accuracy                           0.92     18170 | macro avg       0.69      0.81      0.73     18170

   >>> Entrenando: SOTA: BGE-Small + MLP (clean_text)
Cargando Sentence Transformer: BAAI/bge-small-en-v1.5
Estrategia: transformer | Dimensión Vectores: 384
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 329/329 [00:08<00:00, 36.89it/s]
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 37/37 [00:01<00:00, 31.48it/s]
Entrenando SemanticMLP en cuda...
Ep 1 - Loss: 0.4148 - Val Acc: 0.8419
Ep 2 - Loss: 0.3434 - Val Acc: 0.8342
Ep 3 - Loss: 0.3134 - Val Acc: 0.8453
Ep 4 - Loss: 0.2864 - Val Acc: 0.8479
Ep 5 - Loss: 0.2639 - Val Acc: 0.8470
Evaluando en Test Set...
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 568/568 [00:09<00:00, 57.36it/s]
RESULTADO: accuracy                           0.93     18170 | macro avg       0.73      0.85      0.77     18170

   >>> Entrenando: GenAI: Gemma-Embed + MLP (clean_text)
Conectando a Ollama: embeddinggemma:latest
Estrategia: ollama | Dimensión Vectores: 768
Pre-computando embeddings (ollama)... esto puede tardar.
Ollama Embedding: 100%|██████████| 10524/10524 [21:31<00:00,  8.15it/s]
Pre-computando embeddings (ollama)... esto puede tardar.
Ollama Embedding: 100%|██████████| 1170/1170 [02:23<00:00,  8.16it/s]
Entrenando SemanticMLP en cuda...
Ep 1 - Loss: 0.3560 - Val Acc: 0.8684
Ep 2 - Loss: 0.2901 - Val Acc: 0.8855
Ep 3 - Loss: 0.2562 - Val Acc: 0.8829
Ep 4 - Loss: 0.2267 - Val Acc: 0.8838
Ep 5 - Loss: 0.2045 - Val Acc: 0.8632
Evaluando en Test Set...
Pre-computando embeddings (ollama)... esto puede tardar.
Ollama Embedding: 100%|██████████| 18170/18170 [37:35<00:00,  8.05it/s]
RESULTADO: accuracy                           0.95     18170 | macro avg       0.77      0.86      0.81     18170

>>> PROCESANDO FEATURE: LEMMAS_TEXT <<<
Balanceando Train a 3898 muestras por clase...
   Datos Finales DL -> Train: 10524, Val: 1170

   >>> Entrenando: Baseline: Word2Vec + BiLSTM (lemmas_text)
Estrategia: w2v | Dimensión Vectores: 100
Entrenando BiLSTM en cuda...
Ep 1 - Loss: 0.5660 - Val Acc: 0.7444
Ep 2 - Loss: 0.5290 - Val Acc: 0.7581
Ep 3 - Loss: 0.5152 - Val Acc: 0.7684
Ep 4 - Loss: 0.5030 - Val Acc: 0.7761
Ep 5 - Loss: 0.4975 - Val Acc: 0.7462
Evaluando en Test Set...
RESULTADO: accuracy                           0.91     18170 | macro avg       0.63      0.75      0.67     18170

   >>> Entrenando: SOTA: All-MiniLM + MLP (lemmas_text)
Cargando Sentence Transformer: sentence-transformers/all-MiniLM-L6-v2
Estrategia: transformer | Dimensión Vectores: 384
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 329/329 [00:05<00:00, 62.05it/s]
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 37/37 [00:00<00:00, 60.75it/s]
Entrenando SemanticMLP en cuda...
Ep 1 - Loss: 0.4725 - Val Acc: 0.8103
Ep 2 - Loss: 0.4088 - Val Acc: 0.8188
Ep 3 - Loss: 0.3740 - Val Acc: 0.8265
Ep 4 - Loss: 0.3506 - Val Acc: 0.8256
Ep 5 - Loss: 0.3271 - Val Acc: 0.8162
Evaluando en Test Set...
Pre-computando embeddings (transformer)... esto puede tardar.
Batches: 100%|██████████| 568/568 [00:06<00:00, 90.02it/s]
RESULTADO: accuracy                           0.92     18170 | macro avg       0.68      0.82      0.74     18170

   >>> Entrenando: SOTA: BGE-Small + MLP (lemmas_text)
Cargando Sentence Transformer: BAAI/bge-small-en-v1.5
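The training loop balances the training split by downsampling every class to the size of the rarest one (3898 rows per class in the log above). A compact sketch of the same idea using `DataFrameGroupBy.sample`, which avoids going through `apply`; the helper name `undersample` and the toy data are illustrative. Even with the same seed, this variant may not pick the same rows as the `apply`-based code in the loop.

```python
import pandas as pd

def undersample(df, target_col, seed=42):
    """Downsample every class to the size of the rarest class."""
    n_min = df[target_col].value_counts().min()
    # Draw n_min rows from each class group, then drop the original index.
    return (
        df.groupby(target_col)
          .sample(n=n_min, random_state=seed)
          .reset_index(drop=True)
    )

# Toy data with a 6 / 3 / 1 class imbalance (values are invented).
toy = pd.DataFrame({
    "feature": [f"doc {i}" for i in range(10)],
    "target": [0] * 6 + [1] * 3 + [2] * 1,
})
balanced = undersample(toy, "target")
```

Only the training rows are balanced this way; the test indices fixed in section 1 keep their original class distribution.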


## 3. Evaluation

In [None]:
for res in results_dl:
    print(f"\n[{res['Feature']}] {res['Model']}")
    lines = res['Report_Raw'].split('\n')
    print(f"   Accuracy: {lines[-4].split()[1]}")
    print(f"   Macro F1: {lines[-3].split()[-2]}")
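Indexing the report by line position (`lines[-4]`, `lines[-3]`) depends on the exact layout of the text report. A slightly more defensive sketch looks rows up by their label instead; `parse_report` is a hypothetical helper, and the sample text reuses two rows printed earlier in this notebook.

```python
def parse_report(report_text):
    """Extract accuracy and macro-average metrics from a text classification report."""
    metrics = {}
    for line in report_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "accuracy":
            # Row layout: accuracy <value> <support>
            metrics["accuracy"] = float(parts[1])
        elif parts[:2] == ["macro", "avg"]:
            # Row layout: macro avg <precision> <recall> <f1> <support>
            precision, recall, f1 = (float(p) for p in parts[2:5])
            metrics["macro_precision"] = precision
            metrics["macro_recall"] = recall
            metrics["macro_f1"] = f1
    return metrics

sample = (
    "    accuracy                           0.91     18170\n"
    "   macro avg       0.66      0.76      0.69     18170\n"
)
parsed = parse_report(sample)
print(f"Accuracy: {parsed['accuracy']} | Macro F1: {parsed['macro_f1']}")
```

If the true labels and predictions are available, scikit-learn's `classification_report(..., output_dict=True)` returns the same figures as a dictionary and removes the need for text parsing. Whether `AdvancedDLManager.evaluate` can expose them is not shown in this notebook.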