<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="250" align="center">

**CEIA - LARGE LANGUAGE MODELS (LLM) AND GENERATIVE AI**

*FINAL PROJECT - FAKE NEWS DETECTION - ING. JUAN I. MUNAR*


*PART (2 of 2): TRAINING AND VALIDATION*

In [None]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Libraries
import pandas as pd
import re
import os
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
# AdamW is taken from PyTorch: the transformers version is deprecated.
from torch.optim import AdamW
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Paths
train_path = '/content/drive/MyDrive/Colab Notebooks/LLM/Dataset/train_V1.csv'
test_path = '/content/drive/MyDrive/Colab Notebooks/LLM/Dataset/test_V1.csv'

# Load the train and test splits
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

train_df.sample(3, random_state=42)

Unnamed: 0,text,true
26843,"#disruptj20 protesters on the move, many leavi...",0
25656,WHAT S WRONG WITH THESE PEOPLE? Clinton Crony ...,0
5343,VIENNA (Reuters) - Russian Foreign Minister Se...,1


In [None]:
# Clean the text and strip the "CITY (Reuters) -" header
def preprocess_clean_text(text):
    # Remove the Reuters dateline header (raw string avoids invalid-escape warnings)
    pattern = r'^.*\(Reuters\) - '
    text = re.sub(pattern, '', text)
    # Remove special characters (A-Z rather than A-z, so [ \ ] ^ _ ` are stripped too)
    pattern = r'[^a-zA-Z0-9.,!?’/:;\"\'\s]'
    text = re.sub(pattern, '', text)
    return text
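
A note on the character class used for cleaning: in ASCII, the range `A-z` spans code points 65 to 122, which includes the six punctuation characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), whereas `A-Z` covers only the uppercase letters. The sketch below illustrates the difference on a made-up sample string (not taken from the dataset).

```python
import re

# Hypothetical sample string containing the characters that sit between 'Z' and 'a' in ASCII
sample = "snake_case [tag] ^caret `tick` #hash"

# Range A-z: keeps letters, whitespace and also _ [ ] ^ `
loose = re.sub(r"[^a-zA-z\s]", "", sample)

# Range A-Z: keeps only letters and whitespace
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

print(loose)
print(strict)
```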

In [None]:
# Apply preprocess_clean_text to both splits
train_df['text'] = [preprocess_clean_text(s) for s in train_df['text']]
test_df['text'] = [preprocess_clean_text(s) for s in test_df['text']]

In [None]:
# Convert to lists
train_texts = train_df['text'].to_list()
val_texts = test_df['text'].to_list()
train_labels = train_df['true'].to_list()
val_labels = test_df['true'].to_list()

In [None]:
# Dataset class.
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, return_tensors='pt',
                                  max_length=self.max_length,
                                  padding='max_length',
                                  truncation=True)
        return {'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
                'label': torch.tensor(label)}

In [None]:
# Build the classifier.
class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits

In [None]:
# Training function.
def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

In [None]:
# Evaluation function.
def evaluate(model, data_loader, device):
    model.eval()
    predictions = []
    actual_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)
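
For reference, the accuracy and the per-class precision and recall that `accuracy_score` and `classification_report` summarize can be computed by hand. This is a minimal sketch with made-up labels (not taken from the dataset); `binary_metrics` is a hypothetical helper name.

```python
def binary_metrics(actual, predicted, positive=1):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    correct = sum(1 for a, p in pairs if a == p)
    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical labels: 1 = true news, 0 = fake news
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1]
accuracy, precision, recall = binary_metrics(actual, predicted)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```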

In [None]:
# Prediction helper
# Note: the default max_length here (128) is shorter than the max_length (256) used for training.
def predict_fake(text, model, tokenizer, device, max_length=128):
    model.eval()
    encoding = tokenizer(text, return_tensors='pt',
                         max_length=max_length,
                         padding='max_length',
                         truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return "true" if preds.item() == 1 else "fake"
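
`predict_fake` returns only the arg-max label. If a confidence score is also wanted, the logits can be passed through a softmax. The following is a framework-free sketch of that computation on hypothetical logit values; inside the notebook, `torch.softmax(outputs, dim=1)` would be the natural equivalent.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes: index 0 = fake, index 1 = true
logits = [-2.0, 3.0]
probs = softmax(logits)
label = "true" if probs.index(max(probs)) == 1 else "fake"
print(label, [round(p, 4) for p in probs])
```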

In [None]:
# Model hyperparameters
bert_model_name = 'bert-base-uncased'
num_classes = 2
max_length = 256
batch_size = 32
num_epochs = 5
learning_rate = 2e-5

In [None]:
# Initialize the tokenizer, datasets and data loaders
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = TextClassificationDataset(train_texts,
                                          train_labels,
                                          tokenizer,
                                          max_length)
val_dataset = TextClassificationDataset(val_texts,
                                        val_labels,
                                        tokenizer,
                                        max_length)
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True)
val_dataloader = DataLoader(val_dataset,
                            batch_size=batch_size)


In [None]:
# Move the model to the GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier(bert_model_name, num_classes).to(device)


In [None]:
# Set up the optimizer and the learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
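
With `num_warmup_steps=0`, the scheduler decays the learning rate linearly from `learning_rate` down to zero over `total_steps`. Below is a plain-Python sketch of that multiplier; `linear_schedule_factor` is a hypothetical name, and the step count is an assumption, since `len(train_dataloader)` is not shown in the notebook.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1000  # hypothetical value for illustration only
lrs = [base_lr * linear_schedule_factor(s, 0, total_steps) for s in (0, 500, 1000)]
print(lrs)
```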



In [None]:
# Run training.
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

Epoch 1/5
Validation Accuracy: 0.9940
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      3429
           1       0.99      1.00      0.99      3429

    accuracy                           0.99      6858
   macro avg       0.99      0.99      0.99      6858
weighted avg       0.99      0.99      0.99      6858

Epoch 2/5
Validation Accuracy: 0.9977
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3429
           1       1.00      1.00      1.00      3429

    accuracy                           1.00      6858
   macro avg       1.00      1.00      1.00      6858
weighted avg       1.00      1.00      1.00      6858

Epoch 3/5
Validation Accuracy: 0.9968
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3429
           1       1.00      1.00      1.00      3429

    accuracy                           1.00      6858
   macro avg  
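
The validation accuracies visible in this output peak at epoch 2 (0.9977) and dip slightly at epoch 3 (0.9968). One possible way to keep track of the best epoch is sketched here, using only the three values that appear in the output.

```python
# Validation accuracies copied from the printed output (epochs 1 to 3)
val_accuracies = [0.9940, 0.9977, 0.9968]

# Pair each accuracy with its 1-based epoch number and pick the highest
best_epoch, best_accuracy = max(
    enumerate(val_accuracies, start=1),
    key=lambda pair: pair[1],
)
print(f"Best epoch: {best_epoch} (accuracy {best_accuracy:.4f})")
```

In the training loop, the same comparison could gate the checkpoint save so that the stored weights come from the best epoch rather than the last one.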

In [None]:
# Save the model parameters
torch.save(model.state_dict(), "bert_classifier_V1.pth")
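
To reuse saved weights later, the state dict has to be loaded into a freshly constructed model of the same architecture. This is a minimal round-trip sketch that uses a small `nn.Linear` as a hypothetical stand-in for `BERTClassifier` and a temporary file path, so that it runs without downloading BERT.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in model; the notebook would use BERTClassifier(bert_model_name, num_classes)
model = nn.Linear(4, 2)

# Temporary path used only for this illustration
path = os.path.join(tempfile.gettempdir(), "demo_classifier.pth")
torch.save(model.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch to inference mode before predicting

# Check that every parameter tensor survived the round trip
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```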

In [None]:
# Prediction test.
test_text = test_df['text'].iloc[5]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[5]==1 else 'fake'}")

Proposals from U.S. Republicans to repeal or restrict a popular deduction on federal income tax for state and local tax SALT payments would hit Americans in hightax states. The 10 states with the highest taxes are represented in the Senate by Democrats, and by Vermont’s Bernie Sanders, an independent who votes with Democrats. A Senate tax plan about to be unveiled on Thursday was expected to propose ending the SALT deduction entirely. An earlier House plan would sharply curtail the deduction. If left unchanged, the Senate plan’s SALT provision would make it harder to attract what could be necessary support for the tax bill from Democratic senators, as it would disproportionately hurt their constituents. Republicans have a 5248 majority in the chamber, meaning they can afford to lose only two Republican votes and so could be looking for some Democratic support for passage.  Here are the U.S. states with the highest annual combined state and local income taxes and property taxes, measure

In [None]:
# Prediction test.
test_text = test_df['text'].iloc[55]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[55]==1 else 'fake'}")

Prediction: true
Actual: true


In [None]:
# Prediction test.
test_text = test_df['text'].iloc[800]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[800]==1 else 'fake'}")

In his baseball cap and baggy yellow tshirt, the rap star Li Yijie  better known by his stage name  Pissy   is an unlikely face of China s straitlaced ruling Communist Party.  His group, Tianfu Shibian, has won fans and the support of the party s youth league with songs like  Force of Red  and  This is China  that chime with President Xi Jinping s nationalist vision of China and its place in the world.  Under Xi, set to begin a second fiveyear term at a key party congress next month, the oncehidebound Communist Party has sought to revitalise its role in society amid challenges to its traditional authority as the country gets richer, more mobile and more digitally connected.  The party's modernising push also comes as a significant number of educated Chinese millennials, faced with a tough job market and high housing costs in big cities, have grown disillusioned about their career and life prospects.  The party s effort extends increasingly to coopting swathes of Chinese popular culture

In [None]:
# Prediction test.
test_text = test_df['text'].iloc[880]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[880]==1 else 'fake'}")

A majority of Americans, including a growing number of Republicans, want to see an independent investigation sort out any connections between Russia and President Donald Trump during the 2016 election campaign, according to a Reuters/Ipsos opinion poll released on Monday.  The May 1014 poll, which was conducted after Trump fired FBI Director James Comey, suggests the public is increasingly uneasy with allegations of meddling by the Russians in the U.S. election. Trump’s dismissal of Comey, who was leading the Federal Bureau of Investigation’s probe into ties between the White House and Russia, intensified calls by Democrats for an independent probe.  According to the poll, 59 percent of adults, including 41 percent of Republicans and 79 percent of Democrats, agreed that Congress should launch an independent investigation into communications between the Russian government and the Trump campaign during the 2016 election.  That compares with 54 percent of all adults, including 30 percent 