<img src="https://github.com/hernancontigiani/ceia_memorias_especializacion/raw/master/Figures/logoFIUBA.jpg" width="250" align="center">

**CEIA - LARGE LANGUAGE MODELS (LLM) AND GENERATIVE AI**

*FINAL PROJECT - FAKE NEWS DETECTION - ING. JUAN I. MUNAR*


*PART (2 of 2): TRAINING AND VALIDATION*

In [1]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# Libraries
import pandas as pd
import os
import torch
from torch import nn
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch optimizer instead
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertModel, get_linear_schedule_with_warmup
from sklearn.metrics import accuracy_score, classification_report

In [87]:
# Paths
train_path = '/content/drive/MyDrive/Colab Notebooks/LLM/Dataset/train.csv'
test_path = '/content/drive/MyDrive/Colab Notebooks/LLM/Dataset/test.csv'

# Load the train and test splits
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

train_df.sample(3, random_state=42)

Unnamed: 0,text,true
555,WASHINGTON (Reuters) - In less than three mont...,1
3491,Donald Trump has never been Senate Majority Le...,0
527,WASHINGTON (Reuters) - U.S. House Armed Servic...,1


In [88]:
# Convert texts and labels to lists (the test split is used as the validation set)
train_texts = train_df['text'].to_list()
val_texts = test_df['text'].to_list()
train_labels = train_df['true'].to_list()
val_labels = test_df['true'].to_list()

In [55]:
# Dataset class: tokenizes each text on access and returns tensors for BERT.
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, return_tensors='pt',
                                  max_length=self.max_length,
                                  padding='max_length',
                                  truncation=True)
        return {'input_ids': encoding['input_ids'].flatten(),
                'attention_mask': encoding['attention_mask'].flatten(),
                'label': torch.tensor(label)}
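
The `padding='max_length'` and `truncation=True` arguments above make every example come out with exactly `max_length` token ids plus a matching attention mask. The following is a minimal plain-Python sketch of that idea, not the Hugging Face implementation: the function name `pad_or_truncate`, the `pad_id=0` default and the token ids are illustrative assumptions, and the real tokenizer also preserves the special `[CLS]`/`[SEP]` tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), each with exactly max_length entries.

    Simplified illustration only: real tokenizers keep special tokens
    such as [SEP] at the end when they truncate.
    """
    ids = list(token_ids)[:max_length]       # drop tokens beyond max_length
    mask = [1] * len(ids)                    # 1 marks a real token
    padding = max_length - len(ids)          # 0 if the text was long enough
    return ids + [pad_id] * padding, mask + [0] * padding


# Hypothetical token ids, chosen only for the example
ids, mask = pad_or_truncate([11, 22, 33, 44], max_length=6)
print(ids)
print(mask)
```

The attention mask is what lets the model ignore the padded positions, which is why the dataset returns it alongside `input_ids`.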

In [56]:
# Classifier: pretrained BERT encoder + dropout + linear layer over the pooled output.
class BERTClassifier(nn.Module):
    def __init__(self, bert_model_name, num_classes):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(bert_model_name)
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        x = self.dropout(pooled_output)
        logits = self.fc(x)
        return logits

In [57]:
# Training function: runs one epoch over the data loader.
def train(model, data_loader, optimizer, scheduler, device):
    model.train()
    for batch in data_loader:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step()

In [58]:
# Evaluation function: returns accuracy and a per-class classification report.
def evaluate(model, data_loader, device):
    model.eval()
    predictions = []
    actual_labels = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            actual_labels.extend(labels.cpu().tolist())
    return accuracy_score(actual_labels, predictions), classification_report(actual_labels, predictions)
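
`torch.max(outputs, dim=1)` in the function above picks the class with the largest logit. If a confidence value is also wanted, the logits can be passed through a softmax first. The sketch below shows this in plain Python for a single example; the helper names `softmax` and `predict_with_confidence` are assumptions introduced for illustration, and the class order (index 0 = fake, index 1 = true) follows the convention used by `predict_fake` in this notebook.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits, class_names=("fake", "true")):
    """Return the argmax class name and its softmax probability."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)   # argmax
    return class_names[idx], probs[idx]


label, confidence = predict_with_confidence([-1.0, 2.0])
print(label, round(confidence, 4))
```

The argmax of the probabilities is the same as the argmax of the raw logits, so this does not change the predictions; it only attaches a score to each one.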

In [59]:
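# Prediction helper for a single text.
# Note: max_length defaults to 128 here, while training below uses max_length=256;
# pass max_length explicitly if the two should match.
def predict_fake(text, model, tokenizer, device, max_length=128):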
    model.eval()
    encoding = tokenizer(text, return_tensors='pt',
                         max_length=max_length,
                         padding='max_length',
                         truncation=True)
    input_ids = encoding['input_ids'].to(device)
    attention_mask = encoding['attention_mask'].to(device)

    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        _, preds = torch.max(outputs, dim=1)
    return "true" if preds.item() == 1 else "fake"

In [66]:
# Model and training hyperparameters
bert_model_name = 'bert-base-uncased'
num_classes = 2
max_length = 256
batch_size = 32
num_epochs = 4
learning_rate = 2e-5

In [67]:
# Initialize the tokenizer, datasets and data loaders
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
train_dataset = TextClassificationDataset(train_texts,
                                          train_labels,
                                          tokenizer,
                                          max_length)
val_dataset = TextClassificationDataset(val_texts,
                                        val_labels,
                                        tokenizer,
                                        max_length)
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              shuffle=True)
val_dataloader = DataLoader(val_dataset,
                            batch_size=batch_size)

In [68]:
# Move the model to the GPU (falls back to the CPU if none is available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BERTClassifier(bert_model_name, num_classes).to(device)

In [69]:
# Set up the optimizer and the learning-rate scheduler
optimizer = AdamW(model.parameters(), lr=learning_rate)
total_steps = len(train_dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
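
With `num_warmup_steps=0`, the schedule above starts at the full `learning_rate` and decays linearly to zero over `total_steps`. The following plain-Python sketch reproduces the multiplier such a schedule applies to the base learning rate; the function name `linear_schedule_factor` and the value `total_steps = 100` are assumptions used only for illustration (the real `total_steps` depends on the dataset size).

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 after warmup
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5        # same value as learning_rate in this notebook
total_steps = 100     # hypothetical, for illustration only
for step in (0, 50, 100):
    print(step, base_lr * linear_schedule_factor(step, 0, total_steps))
```

Because `scheduler.step()` is called once per batch in `train`, the learning rate changes after every optimizer update rather than once per epoch.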

In [70]:
# Run training, evaluating on the validation set after every epoch.
for epoch in range(num_epochs):
    print(f"Epoch {epoch + 1}/{num_epochs}")
    train(model, train_dataloader, optimizer, scheduler, device)
    accuracy, report = evaluate(model, val_dataloader, device)
    print(f"Validation Accuracy: {accuracy:.4f}")
    print(report)

Epoch 1/4
Validation Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000

Epoch 2/4
Validation Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000

Epoch 3/4
Validation Accuracy: 1.0000
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      1000
   macro avg  
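
A validation accuracy of 1.0000 from the first epoch is worth a sanity check, since it can indicate that a surface cue separates the classes. The sampled rows shown earlier with label 1 both start with a "(Reuters)" dateline, so one possible check is how often such a marker appears per label. The sketch below is a minimal version of that check under the assumption that the marker is a relevant cue; the function name `marker_rate_by_label` and the toy data are illustrative, and whether this actually explains the score has not been verified here.

```python
def marker_rate_by_label(texts, labels, marker="(Reuters)"):
    """Fraction of texts containing `marker`, computed separately per label."""
    counts, hits = {}, {}
    for text, label in zip(texts, labels):
        counts[label] = counts.get(label, 0) + 1
        hits[label] = hits.get(label, 0) + (marker in text)
    return {label: hits[label] / counts[label] for label in counts}


# Toy data for illustration only (not taken from the dataset)
toy_texts = ["CITY (Reuters) - first story", "CITY (Reuters) - second story", "An opinion piece"]
toy_labels = [1, 1, 0]
print(marker_rate_by_label(toy_texts, toy_labels))
```

In this notebook it could be called as `marker_rate_by_label(train_texts, train_labels)`. If the rates differ sharply between labels, stripping the dateline before tokenizing would give a more meaningful measure of what the model has learned.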

In [71]:
# Save the trained weights
torch.save(model.state_dict(), "bert_classifier.pth")

In [92]:
# Prediction test.
test_text = test_df['text'].iloc[5]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[5]==1 else 'fake'}")

VIENNA (Reuters) - Austria s center-right People s Party (OVP) led by Sebastian Kurz and the anti-immigration Freedom Party (FPO) have agreed to form a coalition government. The deal marks a major victory for a European far-right party after a flurry of elections this year, in which right-wing parties have made gains but failed to enter coalitions elsewhere in Western Europe. Here are the main figures in the new government:  Kurz became a conservative junior minister at 23, Europe s youngest foreign minister at 27 and leader of the People s Party at 30, moving it further to the right of the political spectrum. He has yet to complete his law degree while he pursues politics. An early critic of German Chancellor Angela Merkel s open-border policy as Europe s migration crisis escalated in 2015, he infuriated Berlin when he spearheaded the closure of the Balkan route into Europe.  Known for his slicked-back hair, he speaks eloquently but often eschews discussing policy details in public. H

In [93]:
# Prediction test.
test_text = test_df['text'].iloc[55]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[55]==1 else 'fake'}")

WASHINGTON/PORTLAND, Me. (Reuters) - Supporters of Republican Donald Trump urged him to get back on message on Thursday after a week of dropping opinion poll numbers and a war of words with ranking Republicans over his U.S. presidential campaign. In response to the criticism, Trump pledged to focus more on Democratic candidate Hillary Clinton, who emerged from last week’s Democratic National Convention with a lead in the polls and who has been consistently attacking him as temperamentally unfit for the presidency. At a rally in Portland, Maine, on Thursday, Trump kept his attention on trying to undermine Clinton’s candidacy. He said the fact that she has moved past a scandal over her use of a private email server as President Barack Obama’s secretary of state was “probably the greatest accomplishment that she has ever had in politics.” Since formally accepting the Republican nomination two weeks ago, Trump has exasperated many supporters by getting bogged down in a public spat with the

In [94]:
# Prediction test.
test_text = test_df['text'].iloc[800]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[800]==1 else 'fake'}")

We all remember the absolutely disastrous visit Donald Trump had with German Chancellor Angela Merkel. He infamously made her look at him like he had lost his mind when joked about her allegedly being wiretapped by the Obama Administration while he was busy trying to defend his own false claims that Obama wiretapped Trump Tower, refused to shake Merkel s hand during a photo op, and even tried to give her a bill for NATO worth billions of dollars.Trump, however, seems to have a very different memory of his visit with the German leader. While speaking of the visit with the Associated Press, Trump said: Yeah, it s funny: One of the best chemistries I had was with Merkel. As if that weren t ridiculous enough, considering that the whole world saw how that visit actually went, Trump continued, this time trying to excuse his rudeness with the handshake debacle: And I guess somebody shouted out,  Shake her hand, shake her hand,  you know. But I never heard it. But I had already shaken her hand

In [95]:
# Prediction test.
test_text = test_df['text'].iloc[880]
true_fake = predict_fake(test_text, model, tokenizer, device)
print(test_text)
print(f"Prediction: {true_fake}")
print(f"Actual: {'true' if test_df['true'].iloc[880]==1 else 'fake'}")

Maybe you ve finally heard by now that a mosque in Bloomington, Minnesota was bombed early Saturday morning. Maybe. It seemed confusing as it was being reported, because nobody was calling it what it actually was: an act of terrorism aimed at a church. In fact, you were hard-pressed to find anyone even calling it a hate crime, despite the fact that it clearly was. Everything about the incident added up to exactly what America would have called a terrorist attack   if it had been committed by Muslims, instead of against them.In the New York Times report, details included the fact that congregants has just begun to gather for morning prayers, the fact that the bomb was thrown into the Imam s (pastor s) room, and the fact that one worshiper ran outside at the sound of the explosion just in time to see a truck speeding out of the parking lot. Just everything you might need for a terrorist attack: maximum presence, a specific target, and a fleeing suspect.When news hit the wire that it was 