<h1>BERT Fine-Tuning for Multiclass Classification</h1>

BERT was chosen as the next option because it has shown excellent classification performance, thanks to an attention layer that is full (bidirectional) rather than causal. In addition, its stacked layers help capture complex relationships in text data. We can therefore feed in the whole text and let BERT classify it, replacing the output layer with a 4-neuron layer and encoding <i>y_test</i> with a binarizer (one 0/1 column per class).

In [None]:
## Uncomment if you are working in Google Colab
#from google.colab import drive
#drive.mount('/content/drive')

Mounted at /content/drive


<h2>Loading the dataset</h2>

In [None]:
import pandas as pd
df = pd.read_csv("/content/drive/MyDrive/techespere/dataset.csv", sep=";")
#df = pd.read_csv("../data/dataset.csv", sep=";") ## Uncomment when working locally
df["text"] = (df["title"].fillna("") + " " + df["abstract"].fillna("")).str.strip()

X = df["text"].values
y = df["group"].values

In [None]:
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Define the fixed order of the classes (columns):
CLASSES = ['cardiovascular', 'neurological', 'hepatorenal', 'oncological']

# Simple parsing: split on '|'
y_lists = [s.split('|') if isinstance(s, str) and s else [] for s in y]

mlb = MultiLabelBinarizer(classes=CLASSES)
y_bin = mlb.fit_transform(y_lists)  # shape: (n_samples, 4)

# Optional: verify the column order
print("Orden de columnas:", mlb.classes_)

Orden de columnas: ['cardiovascular' 'neurological' 'hepatorenal' 'oncological']
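
To make the binarization step concrete, here is a minimal, library-free sketch of what a multi-label binarizer with a fixed class order produces. The helper `to_multi_hot` and the example label strings are hypothetical and only illustrate the idea; the notebook itself relies on `MultiLabelBinarizer`.

```python
# Library-free sketch of multi-hot encoding with a fixed class order.
CLASSES = ['cardiovascular', 'neurological', 'hepatorenal', 'oncological']

def to_multi_hot(label_string, classes=CLASSES):
    """Turn a '|'-separated label string into a 0/1 vector aligned with `classes`."""
    present = set(label_string.split('|')) if label_string else set()
    return [1 if c in present else 0 for c in classes]

# Hypothetical label strings, made up for illustration (not taken from the dataset).
examples = ["neurological", "cardiovascular|oncological", ""]
encoded = [to_multi_hot(s) for s in examples]

for s, row in zip(examples, encoded):
    print(repr(s), "->", row)
```

Each row has one column per class, in the same order as `CLASSES`, so a sample can switch on several columns at once.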


<h2>Creating the BERT Multiclass Classifier class</h2>

In [None]:
# Imports
import os
import random
import math
from typing import List, Optional
from math import ceil

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
from tqdm.auto import tqdm

In [None]:
# Configuration / hyperparameters
MODEL_NAME = "bert-base-uncased" 
NUM_LABELS = 4
MAX_LENGTH = 128
BATCH_SIZE = 16
EPOCHS = 3
LEARNING_RATE = 2e-5
WEIGHT_DECAY = 0.01
WARMUP_STEPS = 0
SEED = 42
THRESHOLD = 0.5
OUTPUT_DIR = "./bert_multilabel_model"

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", DEVICE)

# reproducibility
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

Device: cuda


We create the TextDataset class, a subclass of Dataset, so that data is fed to the model dynamically and training can run in an environment with medium or low resources.

In [None]:
# Dataset and collate function with dynamic padding
class TextDataset(Dataset):
    def __init__(self, texts: List[str], labels: np.ndarray, tokenizer, max_length: int = 128):
        assert len(texts) == labels.shape[0]
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        return {"text": text, "label": torch.tensor(label, dtype=torch.float)}

# Function that tokenizes the data in batches; the DataLoader calls it during training
def collate_fn(batch, tokenizer, max_length):
    texts = [example["text"] for example in batch]
    labels = torch.stack([example["label"] for example in batch])
    enc = tokenizer(texts,
                    padding=True,
                    truncation=True,
                    max_length=max_length,
                    return_tensors="pt")
    enc["labels"] = labels
    return enc
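
The idea behind dynamic padding can be sketched without the tokenizer: each batch is padded only up to its own longest sequence instead of always up to `max_length`. The helper `pad_batch` and the token ids below are hypothetical, and the truncation is simplified (a real tokenizer also keeps the special tokens in place).

```python
def pad_batch(token_id_lists, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    width = max(len(ids) for ids in truncated)
    # Real tokens are followed by pad_id up to the batch width.
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# Hypothetical token ids for two texts of different length.
batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102]]
input_ids, attention_mask = pad_batch(batch, max_length=128)

print(input_ids)
print(attention_mask)
```

Because the width follows the batch rather than `max_length`, batches of short texts use less memory and compute.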

We define the BERT model in PyTorch, loading the pretrained model and adding a Dropout and a dense layer at the end.

In [None]:
# Model definition
class BertForMultiLabel(nn.Module):
    def __init__(self, model_name: str, num_labels: int, dropout_prob: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden_size = self.bert.config.hidden_size
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            return_dict=True)
        pooled = outputs.pooler_output
        x = self.dropout(pooled)
        logits = self.classifier(x)
        return logits
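
The classification head can be inspected in isolation. The sketch below replaces the BERT encoder with a random tensor standing in for `pooler_output`, so it runs without downloading any weights; the batch size of 2 is an arbitrary choice for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden_size, num_labels = 768, 4       # 768 is the hidden size of bert-base models
pooled = torch.randn(2, hidden_size)   # random stand-in for outputs.pooler_output

# Same structure as the head above: Dropout followed by a dense layer.
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden_size, num_labels))
head.eval()                            # disables dropout for inference

with torch.no_grad():
    logits = head(pooled)              # one raw score per class
    probs = torch.sigmoid(logits)      # independent probability per class

print(tuple(logits.shape))
```

The head returns raw logits, one per class; the sigmoid is applied later, either inside the loss or at prediction time.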

We create the training and evaluation functions.

In [None]:
# train / evaluate
def train_epoch(model, dataloader, optimizer, scheduler, device, scaler=None):
    model.train()
    total_loss = 0.0
    criterion = nn.BCEWithLogitsLoss()
    for batch in tqdm(dataloader, desc="Train", leave=False):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        token_type_ids = batch.get("token_type_ids")
        if token_type_ids is not None:
            token_type_ids = token_type_ids.to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        if scaler is not None:
            with torch.cuda.amp.autocast():
                logits = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
                loss = criterion(logits, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            logits = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()

        if scheduler is not None:
            scheduler.step()

        total_loss += loss.item() * input_ids.size(0)
    avg_loss = total_loss / len(dataloader.dataset)
    return avg_loss

def evaluate(model, dataloader, device, threshold=0.5):
    model.eval()
    criterion = nn.BCEWithLogitsLoss()
    total_loss = 0.0
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Eval", leave=False):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            token_type_ids = batch.get("token_type_ids")
            if token_type_ids is not None:
                token_type_ids = token_type_ids.to(device)
            labels = batch["labels"].to(device)

            logits = model(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            loss = criterion(logits, labels)
            total_loss += loss.item() * input_ids.size(0)

            probs = torch.sigmoid(logits).cpu().numpy()
            preds = (probs >= threshold).astype(int)
            all_preds.append(preds)
            all_labels.append(labels.cpu().numpy())

    avg_loss = total_loss / len(dataloader.dataset)
    all_preds = np.vstack(all_preds)
    all_labels = np.vstack(all_labels)

    f1_micro = f1_score(all_labels, all_preds, average="micro", zero_division=0)
    f1_macro = f1_score(all_labels, all_preds, average="macro", zero_division=0)

    return avg_loss, f1_micro, f1_macro
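
As a rough illustration of how `evaluate` turns logits into predictions and why micro and macro F1 can differ, here is a small pure-Python sketch. The logits and labels are made up, and the F1 helpers are a simplified re-implementation for illustration, not the `sklearn` code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0   # same convention as zero_division=0

def micro_macro_f1(y_true, y_pred):
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        counts.append((tp, fp, fn))
    # Micro: pool the counts of all labels, then compute a single F1.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro: compute F1 per label, then average with equal weight.
    macro = sum(f1(*c) for c in counts) / n_labels
    return micro, macro

# Made-up logits and ground truth for 3 samples and 4 labels.
logits = [[2.0, -1.0, -3.0, 0.5], [-0.2, 1.5, -2.0, -1.0], [0.3, -0.5, 2.2, -4.0]]
y_true = [[1, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
y_pred = [[int(sigmoid(z) >= 0.5) for z in row] for row in logits]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.4f} macro={macro:.4f}")
```

Micro F1 weights every individual label decision equally, while macro F1 weights every class equally, so the two diverge when errors concentrate in one class.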

We create the fine-tuning function, which uses all of the functions above.

In [None]:
# Main fine-tuning function
def fine_tune_bert(X, y,
                   model_name=MODEL_NAME,
                   num_labels=NUM_LABELS,
                   max_length=MAX_LENGTH,
                   batch_size=BATCH_SIZE,
                   epochs=EPOCHS,
                   lr=LEARNING_RATE,
                   output_dir=OUTPUT_DIR,
                   threshold=THRESHOLD):
    Y = y

    # Split into training and validation data
    X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, random_state=SEED, stratify=None)

    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Load the data into the TextDatasets
    train_ds = TextDataset(X_train, y_train, tokenizer, max_length=max_length)
    val_ds = TextDataset(X_val, y_val, tokenizer, max_length=max_length)

    # Use the DataLoader to load the data dynamically
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True,
                              collate_fn=lambda b: collate_fn(b, tokenizer, max_length), num_workers=2)
    val_loader = DataLoader(val_ds, batch_size=batch_size, shuffle=False,
                            collate_fn=lambda b: collate_fn(b, tokenizer, max_length), num_workers=2)

    # Load the model and the training optimizer
    model = BertForMultiLabel(model_name, num_labels).to(DEVICE)

    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY)
    total_steps = len(train_loader) * epochs
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=total_steps)

    scaler = torch.cuda.amp.GradScaler() if torch.cuda.is_available() else None

    best_val_f1 = 0.0
    os.makedirs(output_dir, exist_ok=True)

    # Training loop over epochs
    for epoch in range(1, epochs + 1):
        print(f"\n=== Epoch {epoch}/{epochs} ===")
        train_loss = train_epoch(model, train_loader, optimizer, scheduler, DEVICE, scaler=scaler)
        val_loss, val_f1_micro, val_f1_macro = evaluate(model, val_loader, DEVICE, threshold=threshold)
        print(f"train_loss: {train_loss:.4f}  val_loss: {val_loss:.4f}  val_f1_micro: {val_f1_micro:.4f}  val_f1_macro: {val_f1_macro:.4f}")

        if val_f1_micro > best_val_f1:
            best_val_f1 = val_f1_micro
            torch.save({
                "model_state_dict": model.state_dict(),
                "config": {"model_name": model_name, "num_labels": num_labels, "max_length": max_length}
            }, os.path.join(output_dir, "best_model.pt"))
            tokenizer.save_pretrained(output_dir)
            print(f"[saved] Mejor modelo guardado con F1_micro={best_val_f1:.4f} en {output_dir}")

    print("Entrenamiento finalizado.")
    return model, tokenizer
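
The learning-rate schedule used above can be approximated with a few lines of plain Python. To my understanding, the function below mirrors the multiplier that `get_linear_schedule_with_warmup` applies to the base learning rate; the name `linear_schedule_factor` is hypothetical, and the 179 steps per epoch is read from the training progress bars of this run.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

BASE_LR = 2e-5          # LEARNING_RATE in the configuration cell
WARMUP = 0              # WARMUP_STEPS in the configuration cell
TOTAL_STEPS = 179 * 3   # batches per epoch (assumed from this run) x EPOCHS

for step in (0, 179, 358, 537):
    lr = BASE_LR * linear_schedule_factor(step, WARMUP, TOTAL_STEPS)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With no warmup, the learning rate starts at its full value and decays linearly to zero at the last training step.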

Model training (ignore the errors shown during execution)

In [None]:
model, tokenizer = fine_tune_bert(X, y_bin,
                                  model_name=MODEL_NAME,
                                  num_labels=NUM_LABELS,
                                  max_length=MAX_LENGTH,
                                  batch_size=BATCH_SIZE,
                                  epochs=EPOCHS,
                                  lr=LEARNING_RATE,
                                  output_dir=OUTPUT_DIR,
                                  threshold=THRESHOLD)




=== Epoch 1/3 ===


  scaler = torch.cuda.amp.GradScaler() if torch.cuda.is_available() else None


Train:   0%|          | 0/179 [00:00<?, ?it/s]

  with torch.cuda.amp.autocast():


Eval:   0%|          | 0/45 [00:00<?, ?it/s]

train_loss: 0.4645  val_loss: 0.2865  val_f1_micro: 0.8479  val_f1_macro: 0.8465
[saved] Mejor modelo guardado con F1_micro=0.8479 en ./bert_multilabel_model

=== Epoch 2/3 ===


Train:   0%|          | 0/179 [00:00<?, ?it/s]

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7a9c52918040>Exception ignored in: 
Traceback (most recent call last):
<function _MultiProcessingDataLoaderIter.__del__ at 0x7a9c52918040>
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1664, in __del__
Traceback (most recent call last):
      File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1664, in __del__
self._shutdown_workers()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1647, in _shutdown_workers
        if w.is_alive():self._shutdown_workers()

  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1647, in _shutdown_workers
       if w.is_alive(): 
    ^^^^^^^^ ^  ^^   ^^^^
^^  File "/usr/lib/python3.12/multiprocessing/process.py", line 160, in is_alive
^    assert self._parent_pid == os.getpid(), 'can only test a child process'^
^ ^ ^  ^ ^
   File "/usr/lib/

Eval:   0%|          | 0/45 [00:00<?, ?it/s]

train_loss: 0.2277  val_loss: 0.2185  val_f1_micro: 0.8862  val_f1_macro: 0.8896
[saved] Mejor modelo guardado con F1_micro=0.8862 en ./bert_multilabel_model

=== Epoch 3/3 ===


Train:   0%|          | 0/179 [00:00<?, ?it/s]

  with torch.cuda.amp.autocast():


Eval:   0%|          | 0/45 [00:00<?, ?it/s]

train_loss: 0.1690  val_loss: 0.2044  val_f1_micro: 0.8978  val_f1_macro: 0.9020
[saved] Mejor modelo guardado con F1_micro=0.8978 en ./bert_multilabel_model
Entrenamiento finalizado.


<h2>The model reached a validation micro F1 score of 89.78%, an excellent result for this classification problem</h2>

We save the model to a .pth file (which, because of its size, must be managed with Git LFS on GitHub).

In [None]:
output_dir = "./bert_multilabel_model"  # Already defined above; change it here if needed
os.makedirs(output_dir, exist_ok=True)

# Save the tokenizer (recommended but not required)
tokenizer.save_pretrained(output_dir)   # creates files in output_dir (tokenizer_config.json, vocab, etc.)

# Save state_dict + metadata
save_path = os.path.join(output_dir, "model_state.pth")
torch.save({
    "model_state_dict": model.state_dict(),
    "config": {
        "model_name": "bert-base-uncased",   # the backbone that was used
        "num_labels": 4,
        "max_length": 128
    }
}, save_path)

print("Guardado:", save_path)
print("Tokenizer guardado en:", output_dir)

Guardado: ./bert_multilabel_model/model_state.pth
Tokenizer guardado en: ./bert_multilabel_model
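
A checkpoint saved in this format (a dict with `model_state_dict` and `config`) could be restored roughly as sketched below. To keep the example runnable without downloading BERT, a tiny `nn.Linear` stands in for `BertForMultiLabel` and a temporary directory replaces `output_dir`; both are assumptions made only for illustration.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model (hypothetical) so the sketch runs without downloading BERT.
model = nn.Linear(8, 4)
config = {"model_name": "bert-base-uncased", "num_labels": 4, "max_length": 128}

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, "model_state.pth")
    # Same checkpoint layout as in the cell above.
    torch.save({"model_state_dict": model.state_dict(), "config": config}, save_path)
    # map_location="cpu" lets the file load on machines without a GPU.
    checkpoint = torch.load(save_path, map_location="cpu")

# Rebuild the architecture from the stored config, then load the weights.
restored = nn.Linear(8, checkpoint["config"]["num_labels"])
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()  # switch to inference mode

print(checkpoint["config"])
```

With the real model, the rebuild step would presumably be `BertForMultiLabel(config["model_name"], config["num_labels"])`, followed by the same `load_state_dict` call.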
