# 11 — BERT Fine-Tuning (BETO)

Fine-tuning **BETO** (`dccuchile/bert-base-spanish-wwm-cased`), a Spanish BERT model,
for text classification using **transfer learning**.

Unlike all previous models:
- Uses BERT's own **contextual embeddings** (not Word2Vec)
- Each word's representation depends on its **full context** in the sentence
- Pre-trained on massive Spanish text, then **fine-tuned** on our small corpus

This covers Transfer Learning, LLMs, and Transformers in one model.

## ⚠️ Important: Training Timeout

**BERT fine-tuning is too slow to run inside a Jupyter notebook on CPU.**
Training 3 epochs on both pipelines takes ~1.5–2 hours on CPU, which exceeds
the default notebook execution timeout.

### How to train

Run the standalone training script from the **project root**:

```bash
.venv/bin/python scripts/train_bert_base.py
```

The script trains Standard, Irony, and Obfuscated pipelines and saves models to
`models/bert_base/{standard,irony,obfuscated}/`.

The code below documents the architecture and approach for reference.

In [None]:
import numpy as np
import pandas as pd
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import Dataset
from sklearn.metrics import accuracy_score, classification_report
import os

## 1. Configuration

In [None]:
MODEL_NAME = 'dccuchile/bert-base-spanish-wwm-cased'
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5
LABEL_MAP = {'NEGATIVE': 0, 'POSITIVE': 1}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

## 2. Data Helpers

In [None]:
def load_dataset(data_dir):
    """Load CSV and convert to HuggingFace Dataset."""
    train_df = pd.read_csv(f'{data_dir}/train.csv').fillna('')
    test_df  = pd.read_csv(f'{data_dir}/test.csv').fillna('')
    
    train_df['label'] = train_df['label'].map(LABEL_MAP)
    test_df['label']  = test_df['label'].map(LABEL_MAP)
    
    train_ds = Dataset.from_pandas(train_df[['text_clean', 'label']])
    test_ds  = Dataset.from_pandas(test_df[['text_clean', 'label']])
    
    return train_ds, test_ds

def tokenize_fn(examples):
    return tokenizer(
        examples['text_clean'],
        padding='max_length',
        truncation=True,
        max_length=MAX_LEN
    )

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc}

## 3. Model & Training

**Architecture**: BETO (12 layers, 768 hidden, 12 heads) → Classification Head (2 classes)

**Training**: 3 epochs, AdamW (lr=2e-5), weight_decay=0.01, batch_size=16

In [None]:
def train_bert(variation_name, data_dir, output_dir):
    print(f"\n{'='*20} BERT: {variation_name} {'='*20}")
    
    # Load and tokenize
    train_ds, test_ds = load_dataset(data_dir)
    train_ds = train_ds.map(tokenize_fn, batched=True)
    test_ds  = test_ds.map(tokenize_fn, batched=True)
    
    train_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    test_ds.set_format('torch', columns=['input_ids', 'attention_mask', 'label'])
    
    print(f"Train: {len(train_ds)}, Test: {len(test_ds)}")
    
    # Fresh model
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=2
    )
    
    training_args = TrainingArguments(
        output_dir=f'{output_dir}/checkpoints',
        num_train_epochs=EPOCHS,
        per_device_train_batch_size=BATCH_SIZE,
        per_device_eval_batch_size=BATCH_SIZE,
        evaluation_strategy='epoch',
        save_strategy='epoch',
        logging_steps=50,
        learning_rate=LEARNING_RATE,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
        seed=42,
        report_to='none',
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_ds,
        eval_dataset=test_ds,
        compute_metrics=compute_metrics,
    )
    
    trainer.train()
    
    # Evaluate
    results = trainer.evaluate()
    acc = results['eval_accuracy']
    print(f"\nBERT ({variation_name}) Accuracy: {acc:.4f}")
    
    preds_output = trainer.predict(test_ds)
    y_pred = np.argmax(preds_output.predictions, axis=-1)
    y_true = preds_output.label_ids
    print(classification_report(y_true, y_pred))
    
    # Save
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(f'{output_dir}/model')
    tokenizer.save_pretrained(f'{output_dir}/tokenizer')
    print(f"Model saved to {output_dir}")
    
    return acc

## 4. Results

After running `scripts/train_bert_base.py`:

| Pipeline | Accuracy |
|:---|:---:|
| Standard | **86.22%** |
| Irony | **85.33%** |
| Obfuscated | **TBD** |
| Delta | -0.89% |

BERT is the best-performing model, surpassing Naive Bayes (83.78%) by +2.44%.