# Classification with BERT

**Author: Inmaculada García Moreno**

This notebook tackles fake news classification from text using BERT (Bidirectional Encoder Representations from Transformers).

The text is processed with `DistilBERT`, a lighter version of BERT trained on large corpora such as Wikipedia and the Toronto Book Corpus. BERT produces deep semantic representations of text in the form of embeddings, here taken from the `[CLS]` token.

These representations are fed to a neural network that acts as a binary classifier. The final goal is to predict whether a news item is real or fake based on its textual content.



## Importing libraries

In [None]:
import pandas as pd
from google.colab import drive
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, DistilBertModel
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import torch.nn as nn
import numpy as np
import torch.optim.lr_scheduler as lr_scheduler
from sklearn.utils.class_weight import compute_class_weight


## Loading and cleaning the data

Google Drive is mounted to access the dataset, which is stored in a Drive folder.

In [None]:
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


The `.csv` file with the full dataset (split into training, validation, and test sets further on) is read directly from its Google Drive path. `keep_default_na=False` stops pandas from turning strings such as "NA" or "null" into missing values. The cell output shows the first and last rows of the dataset.

In [None]:
csv_path = "/content/drive/MyDrive/TFG/listo.csv"
# Load the CSV file
df = pd.read_csv(csv_path, keep_default_na=False)

df

Unnamed: 0,title,text,label,news_type,lang,Cluster,category,title_length,text_length,total_length,sentences
0,law enforcement high alert following threat co...,no comment expected barack obama member fyf fu...,1,Real,en,3,Tensión social,18,871,889,law enforcement high alert following threat co...
1,unbelievable exclamationtoken obamas attorney ...,most demonstrator gathered last night exercisi...,1,Real,en,3,Tensión social,18,34,52,unbelievable exclamationtoken obamas attorney ...
2,bobby jindal raised hindu us story christian c...,dozen politically active pastor came private d...,0,Falsa,en,3,Tensión social,16,1321,1337,bobby jindal raised hindu us story christian c...
3,satan russia unvelis image terrifying new supe...,r sarmat missile dubbed satan replace s fly mi...,1,Real,en,4,Geopolítica,16,329,345,satan russia unvelis image terrifying new supe...
4,time exclamationtoken christian group sue amaz...,all say one time someone sued southern poverty...,1,Real,en,3,Tensión social,13,244,257,time exclamationtoken christian group sue amaz...
...,...,...,...,...,...,...,...,...,...,...,...
61844,wikileaks email show clinton foundation fund u...,email released wikileaks sunday appears show f...,1,Real,en,6,Clinton y escándalos,15,205,220,wikileaks email show clinton foundation fund u...
61845,russian steal research trump hack u democratic...,washington reuters hacker believed working rus...,0,Falsa,en,6,Clinton y escándalos,11,735,746,russian steal research trump hack u democratic...
61846,watch giuliani demand democrat apologize trump...,know fantasyland republican never questioned c...,1,Real,en,0,Trump y elecciones,10,604,614,watch giuliani demand democrat apologize trump...
61847,migrant refuse leave train refugee camp hungary,migrant refuse leave train refugee camp hungar...,0,Falsa,en,4,Geopolítica,10,477,487,migrant refuse leave train refugee camp hungar...


In [None]:
df.shape

(61849, 11)

In [None]:
df.isnull().sum()

Unnamed: 0,0
title,0
text,0
label,0
news_type,0
lang,0
Cluster,0
category,0
title_length,0
text_length,0
total_length,0


## Embedding extraction with BERT

To represent text as numeric vectors, the pretrained `DistilBERT` model is used; it maps each input text to a 768-dimensional vector.

The `DistilBERT` tokenizer is used, and the vector associated with the `[CLS]` token serves as an aggregate representation of the sequence. This representation is the input to the classifier.
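
As a minimal sketch of what "taking the `[CLS]` vector" means, the snippet below uses a random tensor as a stand-in for DistilBERT's last hidden state (the sizes 4 and 64 are illustrative assumptions; 768 is the hidden size stated above). No model is downloaded; only the indexing step is shown.

```python
import torch

# Dummy stand-in for DistilBERT's last hidden state:
# a batch of 4 sequences, 64 tokens each, hidden size 768.
batch_size, seq_len, hidden_size = 4, 64, 768
last_hidden_state = torch.randn(batch_size, seq_len, hidden_size)

# [CLS] is the first token of every sequence, so its vector sits at position 0.
cls_embedding = last_hidden_state[:, 0, :]

print(cls_embedding.shape)  # torch.Size([4, 768])
```

Each row of `cls_embedding` is the single 768-dimensional vector that represents one input text.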



In [None]:
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
bert_model = DistilBertModel.from_pretrained(model_name, output_hidden_states=True)

bert_model.eval()


DistilBertModel(
  (embeddings): Embeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (transformer): Transformer(
    (layer): ModuleList(
      (0-5): 6 x TransformerBlock(
        (attention): DistilBertSdpaAttention(
          (dropout): Dropout(p=0.1, inplace=False)
          (q_lin): Linear(in_features=768, out_features=768, bias=True)
          (k_lin): Linear(in_features=768, out_features=768, bias=True)
          (v_lin): Linear(in_features=768, out_features=768, bias=True)
          (out_lin): Linear(in_features=768, out_features=768, bias=True)
        )
        (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (ffn): FFN(
          (dropout): Dropout(p=0.1, inplace=False)
          (lin1): Linear(in_features=768, out_features=3072, bias=True)
          (lin2): L

In [None]:
def get_bert_embedding(text):
    # Despite its name, this helper only tokenizes the text: it returns the token IDs
    # and the attention mask, padded or truncated to 80 tokens. No embedding is computed here.
    inputs = tokenizer(text, add_special_tokens=True, return_tensors='pt', max_length=80, truncation=True, padding='max_length')

    # Drop the batch dimension added by return_tensors='pt'
    return inputs['input_ids'].squeeze(0), inputs['attention_mask'].squeeze(0)

# Quick check of the tokenizer output shapes
text = "This is an example Reddit submission title."
input_ids, attention_mask = get_bert_embedding(text)
print(input_ids.shape)
print(attention_mask.shape)

torch.Size([80])
torch.Size([80])


In [None]:
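# 80/10/10 split: hold out 20% of the data, then halve it into test and validation sets.
# Stratifying on the label keeps the class proportions equal across the three subsets.
df_train, df_test = train_test_split(df, test_size=0.2, stratify=df["label"])
df_test, df_val = train_test_split(df_test, test_size=0.5, stratify=df_test["label"])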

## Defining the custom Dataset

A `BertDataset` class is defined, inheriting from `torch.utils.data.Dataset`. This class is responsible for:

- Tokenizing the text with the BERT tokenizer.
- Returning the `input_ids`, the `attention_mask`, and the label for each sample.

It structures the data so that it can be consumed by a PyTorch `DataLoader`.
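
The effect of `padding='max_length'` and `truncation=True` can be pictured with a small pure-Python sketch. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the tokenizer API, and the token IDs are illustrative. The real tokenizer also keeps the closing `[SEP]` token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_length."""
    ids = list(token_ids[:max_length])   # truncation: cut anything past max_length
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    ids += [pad_id] * padding            # padding: fill up with the pad ID
    mask += [0] * padding                # 0 marks a padding position
    return ids, mask

# Illustrative token IDs; 101 and 102 play the role of [CLS] and [SEP].
ids, mask = pad_or_truncate([101, 2023, 2003, 102], max_length=6)
print(ids)   # [101, 2023, 2003, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, so every sample in a batch can share the same length.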


In [None]:
class BertDataset(Dataset):
    def __init__(self, df, text_field="sentences", label_field="label"):
        self.df = df.reset_index(drop=True)
        self.text_field = text_field
        self.label_field = label_field
        self.tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")  # tokenizer matching the DistilBERT checkpoint

    def __getitem__(self, index):
        text = str(self.df.loc[index, self.text_field])
        label = self.df.loc[index, self.label_field]

        encoding = self.tokenizer(
            text,
            padding='max_length',
            truncation=True,
            max_length=64,
            return_tensors="pt"
        )

        return (
            encoding["input_ids"].squeeze(0),       # [seq_len]
            encoding["attention_mask"].squeeze(0),  # [seq_len]
            torch.tensor(label)
        )

    def __len__(self):
        return len(self.df)

In [None]:
train_dataset = BertDataset(df_train, text_field="title")
val_dataset = BertDataset(df_val, text_field="title")
test_dataset = BertDataset(df_test, text_field="title")


In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=True)

print(len(train_loader))

3093
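
The batch count printed above can be checked by hand from the dataset size and the split proportions. The sketch assumes that scikit-learn rounds a fractional `test_size` up and that the `DataLoader` keeps the last, smaller batch (its default `drop_last=False`).

```python
import math

n_samples = 61849   # rows in the DataFrame (first value of df.shape)
batch_size = 16

# First split: test_size=0.2, with the held-out size rounded up.
n_holdout = math.ceil(0.2 * n_samples)
n_train = n_samples - n_holdout

# Second split: test_size=0.5 of the held-out part becomes the validation set.
n_val = math.ceil(0.5 * n_holdout)
n_test = n_holdout - n_val

# The last, incomplete batch still counts as one batch.
n_batches = math.ceil(n_train / batch_size)

print(n_train, n_val, n_test)  # 49479 6185 6185
print(n_batches)               # 3093
```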


## Defining the classification model

A neural network is defined that takes the embedding produced by BERT for the `[CLS]` token and passes it through a linear layer to obtain the final prediction (2 classes: real or fake).

The model includes a dropout layer to reduce overfitting. No softmax is applied to the output, because `CrossEntropyLoss` already applies log-softmax internally.
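
To illustrate why the model returns raw logits, the following pure-Python sketch computes cross-entropy the way the loss does conceptually (softmax, then negative log-probability of the target class) and compares it with what happens if softmax is applied twice. The logits are hypothetical values chosen for the example.

```python
import math

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    return -math.log(softmax(logits)[target])

logits = [2.0, 0.0]   # hypothetical raw outputs for one news item
target = 0

loss_from_logits = cross_entropy(logits, target)
# Applying softmax first and passing the result to the loss squashes the scores.
loss_double_softmax = cross_entropy(softmax(logits), target)

print(round(loss_from_logits, 4))   # 0.1269
print(loss_double_softmax > loss_from_logits)  # True
```

With a second softmax the confident prediction is penalized more heavily than it should be, which is why the extra layer is left out of the model.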


In [None]:


class BERTTextClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super(BERTTextClassifier, self).__init__()
        self.bert = DistilBertModel.from_pretrained("distilbert-base-uncased")  # DistilBert
        self.drop = nn.Dropout(0.3)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token
        x = self.drop(cls_output)
        return self.classifier(x)


## Training the model

The model is trained with a standard PyTorch loop and the following configuration:

- **Loss function**: `CrossEntropyLoss` with per-class weights to compensate for class imbalance in the dataset.
- **Optimizer**: Adam.
- **Scheduler**: `ReduceLROnPlateau`, which lowers the learning rate when the validation loss stalls.
- **Early stopping**: training stops if the validation loss fails to improve for several consecutive epochs.
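
The per-class weights follow the "balanced" heuristic: each class receives `n_samples / (n_classes * class_count)`, so the rarer class weighs more in the loss. A minimal sketch with a hypothetical label distribution (`balanced_class_weights` and `toy_labels` are illustrative names, not part of the notebook):

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Hypothetical distribution: 6 items of class 0 and 4 of class 1.
toy_labels = [0] * 6 + [1] * 4
weights = balanced_class_weights(toy_labels)

print(weights[1])  # 1.25
print(weights[0] < 1.0 < weights[1])  # True
```

With perfectly balanced classes every weight equals 1, and the weighted loss reduces to the ordinary one.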


In [None]:
class EarlyStopping:
    def __init__(self, patience=4, verbose=False, delta=0):
        self.patience = patience
        self.verbose = verbose
        self.counter = 0
        self.best_loss = None
        self.early_stop = False
        self.delta = delta

    def __call__(self, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
        elif val_loss > self.best_loss + self.delta:
            self.counter += 1
            if self.verbose:
                print(f"EarlyStopping counter: {self.counter} out of {self.patience}")
            if self.counter >= self.patience:
                self.early_stop = True
        else:
            self.best_loss = val_loss
            self.counter = 0

In [None]:
labels = df_train['label'].to_numpy()

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = BERTTextClassifier(num_classes=2)
model= model.to(device)

In [None]:
# Define loss function and optimizer


# `labels` holds the training-set labels extracted from df_train above
class_weights = compute_class_weight(class_weight='balanced', classes=np.unique(labels), y=labels)
class_weights = torch.tensor(class_weights, dtype=torch.float).to(device)

# Define the loss function with class weights
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
scheduler = lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode='min',        # 'min' for loss, 'max' for accuracy
    factor=0.5,
    patience=1,
    min_lr=1e-6
)
num_epochs = 20

import time

def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs):
    early_stopping = EarlyStopping(patience=5, verbose=True)

    for epoch in range(num_epochs):
        start_time = time.time()
        model.train()
        running_loss = 0.0

        print(f"\n⏳ Epoch {epoch+1}/{num_epochs} ------------------------")

        for step, (input_ids, attention_mask, label) in enumerate(train_loader):
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            label = label.to(device)

            optimizer.zero_grad()
            outputs = model(input_ids, attention_mask)
            loss = criterion(outputs, label)
            loss.backward()
            optimizer.step()

            running_loss += loss.item() * input_ids.size(0)

        # Evaluation on the validation set
        model.eval()
        val_loss = 0.0
        correct_preds = 0
        with torch.no_grad():
            for input_ids, attention_mask, label in val_loader:
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                label = label.to(device)

                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs, label)
                val_loss += loss.item() * input_ids.size(0)

                _, preds = torch.max(outputs, 1)
                correct_preds += torch.sum(preds == label)

        val_loss /= len(val_loader.dataset)
        accuracy = correct_preds.double() / len(val_loader.dataset)
        scheduler.step(val_loss)

        duration = time.time() - start_time
        print(f"\n✅ Epoch {epoch+1} completed in {duration:.2f} sec")
        print(f"   🔹 Training Loss: {running_loss/len(train_loader.dataset):.4f}")
        print(f"   🔹 Validation Loss: {val_loss:.4f}")
        print(f"   🔹 Accuracy: {accuracy:.4f}")

        # Early stopping
        early_stopping(val_loss)
        if early_stopping.early_stop:
            print("\n🛑 Early stopping triggered. Stopping training.")
            break





In [None]:
from sklearn.metrics import precision_score, recall_score

def evaluate_model(model, test_loader, criterion):
    model.eval()
    val_losses = []
    correct_preds = 0

    all_preds = []
    all_labels = []

    with torch.no_grad():
        for input_ids, attention_mask, label in test_loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            label = label.to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)

            _, preds = torch.max(outputs, dim=1)
            val_loss = criterion(outputs, label)

            correct_preds += torch.sum(preds == label)

            val_losses.append(val_loss.item())
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(label.cpu().numpy())

    accuracy = float((correct_preds.double() / len(test_loader.dataset)) * 100)
    precision = precision_score(all_labels, all_preds, average='weighted')
    recall = recall_score(all_labels, all_preds, average='weighted')

    print("\nAccuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)


In [None]:
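criterion = nn.CrossEntropyLoss(weight=class_weights)
# A new optimizer with a lower learning rate replaces the one defined earlier.
# Note: `scheduler` is still bound to that earlier optimizer, so it does not
# adjust the learning rate of this one.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
num_epochs = 5

train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs)
evaluate_model(model, test_loader, criterion)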


⏳ Epoch 1/5 ------------------------

✅ Epoch 1 completed in 103.55 sec
   🔹 Training Loss: 0.2746
   🔹 Validation Loss: 0.2159
   🔹 Accuracy: 0.9125

⏳ Epoch 2/5 ------------------------

✅ Epoch 2 completed in 102.94 sec
   🔹 Training Loss: 0.1823
   🔹 Validation Loss: 0.1971
   🔹 Accuracy: 0.9237

⏳ Epoch 3/5 ------------------------

✅ Epoch 3 completed in 103.10 sec
   🔹 Training Loss: 0.1203
   🔹 Validation Loss: 0.2385
   🔹 Accuracy: 0.9156
EarlyStopping counter: 1 out of 5

⏳ Epoch 4/5 ------------------------

✅ Epoch 4 completed in 103.02 sec
   🔹 Training Loss: 0.0738
   🔹 Validation Loss: 0.2488
   🔹 Accuracy: 0.9206
EarlyStopping counter: 2 out of 5

⏳ Epoch 5/5 ------------------------

✅ Epoch 5 completed in 102.94 sec
   🔹 Training Loss: 0.0447
   🔹 Validation Loss: 0.3049
   🔹 Accuracy: 0.9240
EarlyStopping counter: 3 out of 5

Accuracy:  91.96443007275667
Precision:  0.9196305774231779
Recall:  0.9196443007275666
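
In the output above, recall matches accuracy to many decimal places. With `average='weighted'` this is expected: each class's recall is weighted by its share of the samples, and the sum collapses to the fraction of correct predictions. A small sketch with hypothetical labels and predictions (both helper functions are written for illustration):

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_recall(y_true, y_pred):
    """Per-class recall, averaged with weights equal to each class's support."""
    total = 0.0
    for c in set(y_true):
        support = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        total += (support / len(y_true)) * (hits / support)
    return total

# Hypothetical labels and predictions for ten items.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

print(round(accuracy(y_true, y_pred), 4))         # 0.7
print(round(weighted_recall(y_true, y_pred), 4))  # 0.7
```

Weighted recall therefore adds no information beyond accuracy; per-class or macro-averaged recall would show how each class fares on its own.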


In [None]:
complete_model_path = '/content/drive/MyDrive/TFG/bert_title.pth'
# Only the state_dict (the learned weights) is saved, not the model object itself.
torch.save(model.state_dict(), complete_model_path)
print(f"Full model saved to: {complete_model_path}")


Full model saved to: /content/drive/MyDrive/TFG/bert_title.pth


In [None]:
train_dataset = BertDataset(df_train, text_field="text")
val_dataset = BertDataset(df_val, text_field="text")
test_dataset = BertDataset(df_test, text_field="text")


In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=True)

print(len(train_loader))

3093


In [None]:
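criterion = nn.CrossEntropyLoss(weight=class_weights)
# Note: `model` is not re-created, so this run continues from the weights
# already fine-tuned on titles instead of starting from the pretrained checkpoint.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
num_epochs = 5

train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs)
evaluate_model(model, test_loader, criterion)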


⏳ Epoch 1/5 ------------------------

✅ Epoch 1 completed in 335.55 sec
   🔹 Training Loss: 0.1690
   🔹 Validation Loss: 0.1176
   🔹 Accuracy: 0.9539

⏳ Epoch 2/5 ------------------------

✅ Epoch 2 completed in 335.20 sec
   🔹 Training Loss: 0.0759
   🔹 Validation Loss: 0.1036
   🔹 Accuracy: 0.9602

⏳ Epoch 3/5 ------------------------

✅ Epoch 3 completed in 336.36 sec
   🔹 Training Loss: 0.0383
   🔹 Validation Loss: 0.1127
   🔹 Accuracy: 0.9614
EarlyStopping counter: 1 out of 5

⏳ Epoch 4/5 ------------------------

✅ Epoch 4 completed in 336.60 sec
   🔹 Training Loss: 0.0195
   🔹 Validation Loss: 0.1630
   🔹 Accuracy: 0.9617
EarlyStopping counter: 2 out of 5

⏳ Epoch 5/5 ------------------------

✅ Epoch 5 completed in 335.16 sec
   🔹 Training Loss: 0.0133
   🔹 Validation Loss: 0.1609
   🔹 Accuracy: 0.9638
EarlyStopping counter: 3 out of 5

Accuracy:  96.29749393694422
Precision:  0.962971280162684
Recall:  0.9629749393694422


In [None]:
complete_model_path = '/content/drive/MyDrive/TFG/bert_text.pth'
# Only the state_dict (the learned weights) is saved, not the model object itself.
torch.save(model.state_dict(), complete_model_path)
print(f"Full model saved to: {complete_model_path}")

Full model saved to: /content/drive/MyDrive/TFG/bert_text.pth


In [None]:
train_dataset = BertDataset(df_train)
val_dataset = BertDataset(df_val)
test_dataset = BertDataset(df_test)


In [None]:
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=True)

print(len(train_loader))

3093


In [None]:
criterion = nn.CrossEntropyLoss(weight=class_weights)
# Note: `model` is not re-created, so this run continues from the weights
# fine-tuned in the previous runs (titles, then article text).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
num_epochs = 5

train_model(model, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs)
evaluate_model(model, test_loader, criterion)


⏳ Epoch 1/5 ------------------------

✅ Epoch 1 completed in 344.26 sec
   🔹 Training Loss: 0.0294
   🔹 Validation Loss: 0.0901
   🔹 Accuracy: 0.9727

⏳ Epoch 2/5 ------------------------

✅ Epoch 2 completed in 343.26 sec
   🔹 Training Loss: 0.0098
   🔹 Validation Loss: 0.0902
   🔹 Accuracy: 0.9769
EarlyStopping counter: 1 out of 5

⏳ Epoch 3/5 ------------------------

✅ Epoch 3 completed in 344.05 sec
   🔹 Training Loss: 0.0077
   🔹 Validation Loss: 0.1015
   🔹 Accuracy: 0.9745
EarlyStopping counter: 2 out of 5

⏳ Epoch 4/5 ------------------------

✅ Epoch 4 completed in 343.34 sec
   🔹 Training Loss: 0.0052
   🔹 Validation Loss: 0.1051
   🔹 Accuracy: 0.9756
EarlyStopping counter: 3 out of 5

⏳ Epoch 5/5 ------------------------

✅ Epoch 5 completed in 343.29 sec
   🔹 Training Loss: 0.0037
   🔹 Validation Loss: 0.1387
   🔹 Accuracy: 0.9728
EarlyStopping counter: 4 out of 5

Accuracy:  97.42926434923203
Precision:  0.9742959413599204
Recall:  0.9742926434923201


In [None]:
complete_model_path = '/content/drive/MyDrive/TFG/bert_sentences.pth'
# Only the state_dict (the learned weights) is saved, not the model object itself.
torch.save(model.state_dict(), complete_model_path)
print(f"Full model saved to: {complete_model_path}")

Full model saved to: /content/drive/MyDrive/TFG/bert_sentences.pth


In [None]:
import os
print(os.path.exists('/content/drive/MyDrive/TFG/bert_sentences.pth'))

True
