# CS: Natural Language Processing

## Hands-on Workshop - Third Session

### Fake News Detection
["Fake News Detection is a *Natural Language Processing* task that involves identifying and classifying news articles or other types of text as Real or Fake. The goal of Fake News Detection is to develop algorithms that can automatically identify and flag fake news articles, which can be used to combat misinformation and promote the dissemination of accurate information."](https://paperswithcode.com/task/fake-news-detection)
<br><br/>
This part of the notebook covers the following topics, in order:
- [Load & Prepare Data and Setup Datasets](#Load-&-Prepare-Data-and-Setup-Datasets)

- [Config the Model and Optimizer](#Config-the-Model-and-Optimizer)

- [Trainer](#Trainer)

- [Fit](#Fit)

- [TensorBoard Logs](#TensorBoard-Logs)

- [Inference](#Inference)

---

In [1]:
import pandas as pd
import numpy as np
import torch, evaluate, time, os

from datasets import DatasetDict, Dataset
from transformers import BertTokenizer, DataCollatorWithPadding, BertForSequenceClassification, TrainingArguments, Trainer, get_linear_schedule_with_warmup, pipeline



In [2]:
# @title Hyperparameters
SEED = 42   # @param {type:"integer"}

CASING = "bert-base-uncased"    # @param ["bert-base-uncased", "bert-large-uncased"]

MAX_LENGTH = 192  # @param {type:"slider", min:64, max:256, step:64}

EPOCHS = 6.1    # @param {type:"slider", min:1, max:7, step:0.1}

NUM_LABELS = 2
BATCH_SIZE = 16

In [3]:
LABEL2ID = {"fake":0,"true":1}
ID2LABEL = {0:"fake",1:"true"}

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#### Load & Prepare Data and Setup Datasets

In [4]:
# class DataModule(LightningDataModule):
class DataModule():
    def __init__(self, data_dir:str="../../data/", label2id=LABEL2ID, random_state:int=SEED, tr_ratio:float=0.75,
                 model_name_or_path:str=CASING, max_length:int=MAX_LENGTH):#, batch_size:int=BATCH_SIZE):
        super().__init__()
        self.data_dir = data_dir
        self.label2id = label2id
        self.random_state = random_state
        self.tr_ratio = tr_ratio
        self.tokenizer = BertTokenizer.from_pretrained(model_name_or_path, do_lower_case=True)
        self.data_collator = DataCollatorWithPadding(tokenizer=self.tokenizer, padding=True, return_tensors="pt")
        self.max_length = max_length
        # self.batch_size = batch_size
        self.prepare_data()
        self.setup()


    def prepare_data(self):
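        # Read both CSV files; each is expected to provide a "text" column.
        fake_df, true_df = pd.read_csv(self.data_dir+"fake.csv"), pd.read_csv(self.data_dir+"true.csv")
        df = pd.concat([fake_df, true_df])
        # One label per row: len(df) counts rows, whereas df.size counts rows*columns.
        df["labels"] = len(fake_df)*[self.label2id["fake"]] + len(true_df)*[self.label2id["true"]]
        # Shuffle so that the train/validate split in setup() mixes both classes.
        df = df.sample(frac=1, replace=False, random_state=self.random_state)#.reset_index(drop=True)
        self.__text, self.__labels = df.text.to_list(), df.labels.to_list()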


    def __text_encoder(self, batch):
        # encoded_batch = self.tokenizer(text=batch["text"], padding="max_length", truncation=True, max_length=self.max_length,
        #                                return_token_type_ids=False, return_attention_mask=True)
        encoded_batch = self.tokenizer(text=batch["text"], truncation=True, max_length=self.max_length,
                                       return_token_type_ids=False, return_attention_mask=True)
        return encoded_batch


    def setup(self, stage:str="validate"):
        n_tr_samples, n_samples = int(self.tr_ratio*len(self.__text)), len(self.__text)
        self.dataset = DatasetDict()
        f_idx = 0
        for split, l_idx in zip(("train","validate"), (n_tr_samples,n_samples)):
            self.dataset[split] = Dataset.from_dict({"text":self.__text[f_idx:l_idx], "labels":self.__labels[f_idx:l_idx]})
            self.dataset[split] = self.dataset[split].map(function=self.__text_encoder, batched=True, batch_size=100,
                                                          drop_last_batch=False)#, remove_columns=["text"])
            # self.dataset[split].set_format(type="torch", columns=self.dataset[split].column_names)
            f_idx = l_idx


    # def train_dataloader(self):
    #     return DataLoader(self.dataset["train"], batch_size=self.batch_size, shuffle=True, drop_last=False)


    # def val_dataloader(self):
    #     return DataLoader(self.dataset["validate"], batch_size=self.batch_size, shuffle=False, drop_last=False)

In [5]:
dm = DataModule()
data_collator, train_dataset, eval_dataset, tokenizer = dm.data_collator, dm.dataset["train"], dm.dataset["validate"], dm.tokenizer
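
The tokenizer call in `__text_encoder` truncates but does not pad, so padding is left to `DataCollatorWithPadding`, which pads each batch to its longest sequence when the batch is assembled. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_id=0` is an assumption matching the `[PAD]` id of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to a common length and build attention masks.

    If max_length is None, pad to the longest sequence in the batch
    (dynamic padding); otherwise pad every sequence to max_length.
    """
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:target]  # truncate if a sequence exceeds the target length
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, for illustration only.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
dynamic = pad_batch(batch)                 # padded to the longest sequence (5)
fixed = pad_batch(batch, max_length=192)   # padded to MAX_LENGTH
print(len(dynamic["input_ids"][0]), len(fixed["input_ids"][0]))
```

With short articles, dynamic padding keeps batches much narrower than padding everything to `MAX_LENGTH`, which is presumably why the `padding="max_length"` variant is commented out in `__text_encoder`.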

                                                                

#### Config the Model and Optimizer

In [6]:
warmup_ratio = 0.05
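# Steps per epoch are floored here, whereas Trainer counts a final partial batch as a step,
# so the schedule can reach zero slightly before training ends.
num_training_steps = EPOCHS*(train_dataset.num_rows//BATCH_SIZE)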

model = BertForSequenceClassification.from_pretrained(CASING, num_labels=NUM_LABELS, id2label=ID2LABEL)#.to(DEVICE)
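# Freeze the embeddings and the first four encoder layers (indices 0-3).
# Encoder parameter names look like "bert.encoder.layer.<idx>...", so split('.')[3] is the layer index.
for name, prm in model.named_parameters():
    if ("embeddings" in name) or ("encoder" in name and int(name.split('.')[3])<4):
        prm.requires_grad = False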

optimizer = torch.optim.Adam(params=filter(lambda prm:prm.requires_grad, model.parameters()), lr=2e-5, eps=1e-8)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(warmup_ratio*num_training_steps),
                                            num_training_steps=num_training_steps, last_epoch=-1)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at
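
The freezing loop in the cell above decides purely from parameter names. As a small sketch of that rule in isolation, the hypothetical helper `is_frozen` below applies the same test to a handful of names written in the style of `BertForSequenceClassification`; the names are assumed for illustration and no model is loaded.

```python
def is_frozen(param_name, n_frozen_layers=4):
    """Return True if the name-based rule would freeze this parameter."""
    if "embeddings" in param_name:
        return True
    if "encoder" in param_name:
        # e.g. "bert.encoder.layer.3.output.dense.bias" -> index 3 holds "3"
        return int(param_name.split(".")[3]) < n_frozen_layers
    return False


# Parameter names assumed to follow the BERT naming scheme.
names = [
    "bert.embeddings.word_embeddings.weight",
    "bert.encoder.layer.0.attention.self.query.weight",
    "bert.encoder.layer.3.output.dense.bias",
    "bert.encoder.layer.4.attention.self.query.weight",
    "bert.pooler.dense.weight",
    "classifier.weight",
]
for name in names:
    print(f"{name:55s} frozen={is_frozen(name)}")
```

Under this rule, encoder layers from index 4 upward, the pooler, and the classification head remain trainable, and only those parameters reach the optimizer through the `requires_grad` filter.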

In [7]:
metric = evaluate.load("accuracy")
def compute_metrics(val_outs):
    logits, targets = val_outs
    preds = np.argmax(logits, axis=1)
    return metric.compute(predictions=preds, references=targets)
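
To make the arithmetic inside `compute_metrics` concrete, here is a dependency-free sketch of the same computation: take the index of the largest logit per row, then compare with the targets. The helper name `accuracy_from_logits` and the logits are made up for illustration; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
def accuracy_from_logits(logits, targets):
    """Fraction of rows whose highest-scoring class equals the target."""
    # Pure-Python argmax over each row of logits.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return {"accuracy": correct / len(targets)}


# Made-up logits for four samples; columns are (fake, true).
logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
targets = [0, 1, 0, 0]
print(accuracy_from_logits(logits, targets))  # three of four predictions match
```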

#### Trainer

In [8]:
LOGS_DIR_NAME, version = "HuggingFace", time.strftime("%y-%m-%d_%H-%M-%S")
logging_dir = os.path.join("../../logs/", LOGS_DIR_NAME, version)
ckpt_dir = os.path.join(logging_dir, "checkpoints")

args = TrainingArguments(
                         output_dir=ckpt_dir,
                         evaluation_strategy="epoch",
                         prediction_loss_only=True,    # only eval_loss is returned, so compute_metrics is not called during evaluation
                         per_device_train_batch_size=BATCH_SIZE,
                         per_device_eval_batch_size=BATCH_SIZE,
                         gradient_accumulation_steps=1,
                         eval_delay=1.5,    # wait 1.5 epochs before the first evaluation (hence no eval entry at epoch 1.0)
                        #  max_grad_norm=1,
                         num_train_epochs=EPOCHS,
                         logging_dir=logging_dir,
                         logging_strategy="epoch",
                        #  logging_nan_inf_filter=False,
                         save_strategy="epoch",
                         save_total_limit=1,
                        #  no_cuda=False,
                         seed=SEED,
                        #  data_seed=SEED,
                        #  dataloader_drop_last=False,
                         disable_tqdm=True,
                         load_best_model_at_end=True,
                         metric_for_best_model="eval_loss",
                         group_by_length=True,
                         report_to=["tensorboard"]
                         )

trainer = Trainer(
                  model=model,
                  args=args,
                  data_collator=data_collator,
                  train_dataset=train_dataset,
                  eval_dataset=eval_dataset,
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics,
                  optimizers=(optimizer,scheduler)
                  )
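
The `scheduler` handed to `Trainer` through `optimizers` applies a linear warmup followed by a linear decay to zero. The sketch below re-implements that factor in plain Python, as I understand the behaviour of `get_linear_schedule_with_warmup`, and evaluates it at epoch boundaries. Two values are assumptions inferred from this run's logs rather than stated in the notebook: 2970 training rows, and 186 optimizer steps per epoch as counted by `Trainer`.

```python
def linear_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer steps: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        factor = max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# Assumed: 2970 training rows (inferred from the logs, not given in the notebook).
EPOCHS, BATCH_SIZE, N_TRAIN_ROWS = 6.1, 16, 2970
num_training_steps = EPOCHS * (N_TRAIN_ROWS // BATCH_SIZE)   # 6.1 * 185
num_warmup_steps = int(0.05 * num_training_steps)            # 56

# Assumed: 186 optimizer steps per epoch, i.e. ceil(2970 / 16).
for epoch in (1, 2, 6):
    lr = linear_warmup_lr(186 * epoch, 2e-5, num_warmup_steps, num_training_steps)
    print(epoch, lr)
```

Under these assumptions the sketch gives roughly 1.7576e-05, 1.4107e-05 and 2.33e-07, in line with the `learning_rate` values logged at epochs 1, 2 and 6 during training.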

#### Fit

In [9]:
trainer.train()


# ckpt_path = os.path.join(ckpt_dir, "checkpoint-*")

# trainer.train(resume_from_checkpoint=ckpt_path)

{'loss': 0.1734, 'learning_rate': 1.7575757575757576e-05, 'epoch': 1.0}
{'loss': 0.0092, 'learning_rate': 1.4107226107226108e-05, 'epoch': 2.0}
{'eval_loss': 0.02150891162455082, 'eval_runtime': 3.266, 'eval_samples_per_second': 303.121, 'eval_steps_per_second': 18.983, 'epoch': 2.0}
{'loss': 0.0066, 'learning_rate': 1.063869463869464e-05, 'epoch': 3.0}
{'eval_loss': 0.016314147040247917, 'eval_runtime': 3.2931, 'eval_samples_per_second': 300.628, 'eval_steps_per_second': 18.827, 'epoch': 3.0}
{'loss': 0.0033, 'learning_rate': 7.17016317016317e-06, 'epoch': 4.0}
{'eval_loss': 0.020150141790509224, 'eval_runtime': 3.3094, 'eval_samples_per_second': 299.144, 'eval_steps_per_second': 18.734, 'epoch': 4.0}
{'loss': 0.0006, 'learning_rate': 3.7016317016317023e-06, 'epoch': 5.0}
{'eval_loss': 0.023844188079237938, 'eval_runtime': 3.3165, 'eval_samples_per_second': 298.507, 'eval_steps_per_second': 18.694, 'epoch': 5.0}
{'loss': 0.0003, 'learning_rate': 2.3310023310023313e-07, 'epoch': 6.0}
{

TrainOutput(global_step=1135, training_loss=0.03171119853558782, metrics={'train_runtime': 165.6194, 'train_samples_per_second': 109.279, 'train_steps_per_second': 6.853, 'train_loss': 0.03171119853558782, 'epoch': 6.1})
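
With `load_best_model_at_end=True` and `metric_for_best_model="eval_loss"`, the checkpoint with the lowest evaluation loss is the one kept as best. The sketch below repeats that selection by hand on the evaluation entries from the log above (losses rounded). The value of 186 steps per epoch is an assumption inferred from `global_step=1135` over 6.1 epochs.

```python
# Evaluation entries copied from the training log above (eval_loss rounded).
eval_log = [
    {"epoch": 2.0, "eval_loss": 0.02151},
    {"epoch": 3.0, "eval_loss": 0.01631},
    {"epoch": 4.0, "eval_loss": 0.02015},
    {"epoch": 5.0, "eval_loss": 0.02384},
]

# Lower is better for a loss, so pick the entry with the smallest eval_loss.
best = min(eval_log, key=lambda entry: entry["eval_loss"])

# Assumed: 186 optimizer steps per epoch (inferred from the run, not stated in the notebook).
STEPS_PER_EPOCH = 186
best_step = int(best["epoch"] * STEPS_PER_EPOCH)
print(f"best epoch: {best['epoch']}, checkpoint-{best_step}")
```

The result, epoch 3.0 and step 558, agrees with the `checkpoint-558` directory loaded in the Inference section.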

#### TensorBoard Logs

In [None]:
# %reload_ext tensorboard
%load_ext tensorboard
# %tensorboard --logdir ../../logs

#### Inference

In [10]:
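# Checkpoint with the lowest eval_loss (epoch 3.0 in the log above); the step number is specific to this run.
ckpt_path = os.path.join(ckpt_dir, "checkpoint-558")

# pipeline expects a device index: 0 for the first GPU, -1 for CPU.
detector = pipeline(task="text-classification", model=ckpt_path, device=0 if DEVICE.type=="cuda" else -1)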

In [11]:
f_idx, l_idx = 40, 45
for text, label in zip(eval_dataset["text"][f_idx:l_idx],eval_dataset["labels"][f_idx:l_idx]):
    pred = detector(text)
    print(f'The news "{" ".join(text.split()[:9]):.<64s}" having a `{ID2LABEL[label]:s}` label seems to be `{pred[0]["label"]:s}`.\n')

The news "White House press briefings have gone completely dark. Sean....." having a `fake` label seems to be `fake`.

The news "PRAGUE (Reuters) - The Czech far-right Freedom and Direct......." having a `true` label seems to be `true`.

The news "21st Century Wire says As 21WIRE reported yesterday, imperial..." having a `fake` label seems to be `fake`.

The news "(Reuters) - Richard Cordray, a Democrat whose resignation as...." having a `true` label seems to be `true`.

The news "MEXICO CITY (Reuters) - Mexican leftist Andres Manuel Lopez....." having a `true` label seems to be `true`.
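
The dotted padding in the printed lines comes from the format spec `:.<64s` used in the loop above. As a small standalone illustration, with a made-up placeholder sentence instead of a real article:

```python
# Placeholder text, for illustration only.
text = "one two three four five six seven eight nine ten eleven"

# Keep the first nine whitespace-separated words, as in the loop above.
head = " ".join(text.split()[:9])

# `.<64s`: left-align in a field of width 64 and fill the remainder with dots.
padded = f"{head:.<64s}"
print(padded)
print(len(padded))
```

The width is a minimum, not a limit: a nine-word prefix longer than 64 characters would be printed in full, without dots.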

