## Bachelor Thesis
## "Exploring the Efficacy of Diverse Classification Techniques In Detecting Disinformation In News."
Ilia Sokolovskiy
HTW SS23

Notebook 4/5 - BERT Predictions
(This notebook was executed in Google Colab because of its faster hardware and the lack of local compute resources.)

**Installing all necessary dependencies**

In [17]:
!pip install -q peft transformers datasets evaluate seqeval bertviz jupyterlab ipywidgets

**Importing all necessary libraries**

In [1]:
import os

import pandas as pd
import torch

from transformers import (
    AutoModelForSequenceClassification,
    BertTokenizerFast,
    TrainingArguments,
    Trainer,
    BertForSequenceClassification,
)
from peft import (
    PeftModel,
    PeftConfig,
    get_peft_model,
    LoraConfig,
    TaskType,
)

from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from bertviz import model_view

**Loading the prepared data frame from a pickle**

In [5]:
# Only needed when working with directories in Google Colab
# from google.colab import drive
# drive.mount('/content/drive')

In [7]:
# Load the pickle with the df
base_dir = "Data"
pickle_folder = "Pickles"
filename_pickle = "pickle_lg_df_2.pkl"

full_path_pickle = os.path.join(base_dir, pickle_folder, filename_pickle)

df = pd.read_pickle(full_path_pickle)

In [5]:
df.head()

,text,label,label_encoded,norp_count,gpe_count,vader_compound,cleaned_text,original_text_vector,cleaned_text_vector
0,Donald Trump just couldn t wish all Americans ...,FAKE,0,3,3,-0.8681,donald trump couldn t wish americans happy new...,"[-1.6619356, -0.0073223817, -1.6303111, -0.190...","[-0.17023614, 1.1278214, -2.2035916, -1.195557..."
1,House Intelligence Committee Chairman Devin Nu...,FAKE,0,10,5,-0.7141,house intelligence committee chairman devin nu...,"[-2.008067, 0.6831929, -1.9811207, 0.52264357,...","[-0.3486856, 0.6266792, -1.7451725, 0.01966631..."
2,"On Friday, it was revealed that former Milwauk...",FAKE,0,1,4,-0.9953,friday reveal milwaukee sheriff david clarke c...,"[-1.9425699, 0.0044210483, -1.7258451, 0.00323...","[-0.34773135, 0.7257386, -1.7822778, 0.2710289..."
3,"On Christmas day, Donald Trump announced that ...",FAKE,0,0,2,-0.9176,christmas day donald trump announce work follo...,"[-1.6670086, 0.23368433, -0.6346163, 0.1001595...","[-0.18105617, 0.730818, -0.28500575, -0.608257..."
4,Pope Francis used his annual Christmas Day mes...,FAKE,0,2,5,0.3134,pope francis annual christmas day message rebu...,"[-2.141846, 1.1239394, -2.4791837, 0.000615673...","[-0.003997393, 1.3095359, -2.2471106, -0.14267..."


In [8]:
# Set device to GPU, if present
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device set to -> {device}")

Device set to -> cuda


In [10]:
# Create a list with labels
labels = df['label'].unique().tolist()
labels = [s.strip() for s in labels]
labels

['FAKE', 'TRUE']

In [11]:
# Create the two label mappings expected by BERT: id2label and label2id
num_labels = len(labels)
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}

In [12]:
label2id

{'FAKE': 0, 'TRUE': 1}

In [13]:
id2label

{0: 'FAKE', 1: 'TRUE'}

In [14]:
# Download the BERT base model and the fast version of BERT's tokenizer
model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels, id2label=id2label, label2id=label2id,
    output_attentions=True,  # also return attention weights (used for BertViz later)
).to(device)
# model_max_length caps truncation at BERT's 512-token limit
tokenizer = BertTokenizerFast.from_pretrained(model_name, model_max_length=512)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [13]:
# Splitting data into 80-10-10 for train-val-test
X = df['text'].tolist()
y = df['label_encoded'].tolist()

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=1)

X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

print("Training set:", len(X_train))
print("Validation set:", len(X_val))
print("Test set:", len(X_test))

Training set: 51543
Validation set: 6443
Test set: 6443
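
The printed split sizes can be checked by hand. The following is a minimal sketch, assuming that `train_test_split` rounds the held-out share up and gives the remainder to the other part. The total of 64,429 rows is inferred from the three printed counts, and `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional hold-out share."""
    n_second = math.ceil(test_size * n_samples)  # held-out part, rounded up
    n_first = n_samples - n_second               # remainder stays in the first part
    return n_first, n_second

n_total = 51543 + 6443 + 6443                 # rows in df, inferred from the printed counts
n_train, n_temp = split_sizes(n_total, 0.2)   # 80 % train, 20 % temporary hold-out
n_val, n_test = split_sizes(n_temp, 0.5)      # hold-out halved into validation and test
print(n_train, n_val, n_test)
```

Under these assumptions the sketch reproduces the counts reported by the cell above.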


## For further performance testing: Split with pre-processed texts (optional)

In [None]:
# Same 80-10-10 train-val-test split, this time on the pre-processed texts
X = df['cleaned_text'].tolist()
y = df['label_encoded'].tolist()

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=1)

X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=1)

print("Training set:", len(X_train))
print("Validation set:", len(X_val))
print("Test set:", len(X_test))

Training set: 51543
Validation set: 6443
Test set: 6443


---

In [14]:
# Tokenize the data splits to get the encodings
train_encodings = tokenizer(X_train, truncation=True, padding=True)
val_encodings  = tokenizer(X_val, truncation=True, padding=True)
test_encodings = tokenizer(X_test, truncation=True, padding=True)
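
To make the effect of `truncation=True` and `padding=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the token ids are arbitrary placeholders, a padding id of 0 is an assumption, and unlike the real tokenizer it simply cuts the list without re-appending a closing special token.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each id list to max_length, then pad to the longest remaining list."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy "sentences" with placeholder ids: one short, one longer than max_length
toy_batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27]]
enc = pad_and_truncate(toy_batch, max_length=5)
print(enc["input_ids"])
print(enc["attention_mask"])
```

Because each split is tokenized in a single call, every sequence in that split is padded to the length of its longest (truncated) member, which costs memory but keeps the dataset class simple.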

In [15]:
# Custom PyTorch Dataset for sequence classification (named DataLoader here, although it subclasses Dataset)
class DataLoader(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Retrieve tokenized data for the given index
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # Add the label for the given index to the item dictionary
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [16]:
# Wrap the tokenized encodings and the label splits in dataset objects
train_dataloader = DataLoader(train_encodings, y_train)
val_dataloader = DataLoader(val_encodings, y_val)
test_dataset = DataLoader(test_encodings, y_test)

In [17]:
def compute_metrics(pred):

    labels = pred.label_ids
    # Predicted class = column index of the largest logit
    preds = pred.predictions.argmax(-1)

    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='macro')
    accuracy = accuracy_score(labels, preds)

    return {
        'Accuracy': accuracy,
        'F1': f1,
        'Precision': precision,
        'Recall': recall
    }
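
As an illustration of what `average='macro'` means, the sketch below computes per-class precision, recall and F1 in plain Python and then takes their unweighted mean. The helper `macro_prf` and the toy labels are hypothetical and only mirror the idea; the notebook itself relies on scikit-learn.

```python
def macro_prf(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class precision, recall and F1."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy example: 0 = FAKE, 1 = TRUE, with one mistake in each direction
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]
macro_p, macro_r, macro_f1 = macro_prf(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro_p, macro_r, macro_f1, accuracy)
```

Macro averaging weights both classes equally regardless of how many samples each has, so it can differ from accuracy when the classes are imbalanced, as in this toy example.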

In [18]:
# Local path for saving checkpoints
base_dir = "Models"
sub_dir = "Adapters"
sub_dir_2 = "LoRA-BERT-checkpoints"

full_path = os.path.join(base_dir, sub_dir, sub_dir_2)

In [19]:
# LoRA configuration!
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS, inference_mode=False, r=16, lora_alpha=16, lora_dropout=0.1, bias='all'
)

In [20]:
# Wrap the base model in a PEFT model using the LoRA configuration. The printed summary shows how few parameters remain trainable.
# Only about 0.63 % of the roughly 110 million parameters are trained, thanks to the low-rank decomposition of the weight updates.
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

trainable params: 695,812 || all params: 110,075,140 || trainable%: 0.6321245650925359
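
The reported share can be partly reconstructed with simple arithmetic. This is a rough sketch that assumes LoRA is applied to the query and value projections of each of the 12 encoder layers (believed to be the library default for BERT, but not confirmed in this notebook). The remainder is attributed to bias terms and the classification head without breaking it down further.

```python
hidden_size = 768       # BERT-base hidden dimension
num_layers = 12         # BERT-base encoder layers
rank = 16               # r from the LoRA configuration above
adapted_per_layer = 2   # assumption: LoRA on the query and value projections only

# Each adapted projection gets two small matrices: A (rank x hidden) and B (hidden x rank)
lora_params = num_layers * adapted_per_layer * (rank * hidden_size + hidden_size * rank)

trainable, total = 695_812, 110_075_140     # values printed by print_trainable_parameters()
other_trainable = trainable - lora_params   # presumably bias terms (bias='all') and the classifier head
trainable_pct = 100 * trainable / total
full_matrix = hidden_size * hidden_size     # weights in a single full 768 x 768 projection
print(lora_params, other_trainable, round(trainable_pct, 4), full_matrix)
```

Under this assumption, all LoRA matrices together hold as many weights as one single full 768 x 768 projection matrix, which illustrates why the trainable share is so small.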


In [21]:
# Set all the corresponding training arguments
training_args = TrainingArguments(
    output_dir=full_path,
    do_train=True,
    do_eval=True,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    warmup_steps=100,
    weight_decay=0.01,
    learning_rate=1e-3,
    logging_strategy='steps',
    logging_dir='Logs',
    logging_steps=300,
    evaluation_strategy='steps',
    eval_steps=300,
    save_steps=900,
    save_strategy='steps',
    fp16=True,
    load_best_model_at_end=True
)

In [22]:
# Initialise trainer and set pre-defined attributes
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataloader,
    eval_dataset=val_dataloader,
    compute_metrics=compute_metrics
)

In [23]:
# Time to fine-tune with LoRA!
trainer.train()



Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
300,0.2702,0.0972,0.974701,0.974696,0.974858,0.974647
600,0.115,0.140162,0.958094,0.958065,0.959963,0.958326
900,0.108,0.349682,0.930157,0.929893,0.937965,0.93063
1200,0.0951,0.075323,0.977805,0.977797,0.97817,0.977719
1500,0.0859,0.076701,0.980444,0.980443,0.980659,0.98053
1800,0.0887,0.067522,0.981375,0.981374,0.981631,0.981467
2100,0.073,0.046799,0.985411,0.98541,0.985405,0.985418
2400,0.0745,0.063894,0.981686,0.981684,0.981729,0.981661
2700,0.0733,0.08152,0.975788,0.975783,0.976502,0.975934
3000,0.068,0.046641,0.986342,0.986341,0.986336,0.98635


TrainOutput(global_step=9666, training_loss=0.05691123270184363, metrics={'train_runtime': 2800.0089, 'train_samples_per_second': 55.224, 'train_steps_per_second': 3.452, 'total_flos': 4.09655083268137e+16, 'train_loss': 0.05691123270184363, 'epoch': 3.0})
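
The `global_step=9666` reported above follows from the training arguments. A small sketch of the arithmetic, assuming a single GPU and no gradient accumulation (neither is set explicitly in the notebook):

```python
import math

n_train = 51543                   # training samples, as printed after the split
batch_size = 16                   # per_device_train_batch_size
epochs = 3                        # num_train_epochs
eval_every, save_every = 300, 900 # eval_steps and save_steps

steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch of an epoch may be smaller
total_steps = steps_per_epoch * epochs
n_evals = total_steps // eval_every                # step-triggered evaluations
n_checkpoints = total_steps // save_every          # step-triggered checkpoints
print(steps_per_epoch, total_steps, n_evals, n_checkpoints)
```

Note that `save_steps` is a multiple of `eval_steps` here, which `load_best_model_at_end=True` requires so that every checkpoint has an evaluation result to be ranked by.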

In [24]:
# Evaluate the model on the train, validation and test splits
results = [trainer.evaluate(eval_dataset=ds) for ds in [train_dataloader, val_dataloader, test_dataset]]

pd.DataFrame(results, index=['train', 'val', 'test']).iloc[:, :5]

,eval_loss,eval_Accuracy,eval_F1,eval_Precision,eval_Recall
train,0.011378,0.996469,0.996469,0.996468,0.996469
val,0.027358,0.991774,0.991774,0.991768,0.991786
test,0.020916,0.994102,0.994102,0.994109,0.9941


In [25]:
def predict(text):
    # Tokenize the input and move the tensors to the selected device (GPU, if available)
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt').to(device)

    outputs = model(**inputs)
    probs = outputs[0].softmax(1)

    # Get index with the highest probability
    pred_label_idx = probs.argmax()

    # Map the predicted class using id2label
    pred_label = model.config.id2label[pred_label_idx.item()]

    return probs, pred_label_idx, pred_label

In [26]:
# Evaluation test run
text = "Today Donald Trump traveled to Mars on a SpaceX rocket and got shot by an alien as soon as he got there."
predict(text)

(tensor([[0.9964, 0.0036]], device='cuda:0', grad_fn=<SoftmaxBackward0>),
 tensor(0, device='cuda:0'),
 'FAKE')
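
The post-processing inside `predict` (softmax, argmax, label lookup) can be illustrated without a model. This is a minimal sketch in plain Python; the logits are hypothetical values chosen for illustration and are not the model's actual output.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

id2label = {0: "FAKE", 1: "TRUE"}   # same mapping as defined earlier in the notebook
logits = [2.8, -2.8]                # hypothetical logits for one input text

probs = softmax(logits)
pred_idx = max(range(len(probs)), key=probs.__getitem__)  # index of the highest probability
pred_label = id2label[pred_idx]
print([round(p, 4) for p in probs], pred_idx, pred_label)
```

Since softmax is monotonic, taking the argmax of the probabilities gives the same class as taking the argmax of the raw logits; the probabilities are only needed when a confidence value is reported.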

In [None]:
# Local path for saving adapters
base_dir = "Models"
sub_dir = "Adapters"
file_name = "BERT_LoRA_v1"

full_path = os.path.join(base_dir, sub_dir, file_name)

In [None]:
# Save model and tokenizer locally
model_path = full_path
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/Colab-Notebooks/Models/Adapters/BERT_LoRA_v1/tokenizer_config.json',
 '/content/drive/MyDrive/Colab-Notebooks/Models/Adapters/BERT_LoRA_v1/special_tokens_map.json',
 '/content/drive/MyDrive/Colab-Notebooks/Models/Adapters/BERT_LoRA_v1/vocab.txt',
 '/content/drive/MyDrive/Colab-Notebooks/Models/Adapters/BERT_LoRA_v1/added_tokens.json',
 '/content/drive/MyDrive/Colab-Notebooks/Models/Adapters/BERT_LoRA_v1/tokenizer.json')

In [None]:
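# Load the locally stored LoRA adapter and tokenizer.
# Assumption: the folder holds only the adapter weights, so the base model is loaded first.
model_path = full_path
base_model = BertForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels, id2label=id2label, label2id=label2id, output_attentions=True
)
model = PeftModel.from_pretrained(base_model, model_path).to(device)
tokenizer = BertTokenizerFast.from_pretrained(model_path)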

***

**Log in to Hugging Face 🤗 to save or load the adapter**

In [27]:
from huggingface_hub import notebook_login

notebook_login()


In [28]:
# Save the adapter on HF
model.push_to_hub("il1a/BERT_Fake_News_Classification_LoRA_v2")


CommitInfo(commit_url='https://huggingface.co/il1a/BERT_Fake_News_Classification_LoRA_v2/commit/5fb03a582df7cb330911c693c30cab319dfb2682', commit_message='Upload model', commit_description='', oid='5fb03a582df7cb330911c693c30cab319dfb2682', pr_url=None, pr_revision=None, pr_num=None)

In [40]:
# Load the adapter from HF, when needed
peft_model_id = "il1a/BERT_Fake_News_Classification_LoRA_v2"
peft_config = PeftConfig.from_pretrained(peft_model_id)
bert_inference = AutoModelForSequenceClassification.from_pretrained(
    peft_config.base_model_name_or_path, num_labels=num_labels, id2label=id2label, label2id=label2id
)
bert_tokenizer = BertTokenizerFast.from_pretrained(peft_config.base_model_name_or_path)
bert = PeftModel.from_pretrained(bert_inference, peft_model_id)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [15]:
# Function for visualising BERT's attention across all layers and heads
def show_attention(text):
    inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=True)
    inputs = {name: tensor.to(device) for name, tensor in inputs.items()}
    input_ids = inputs['input_ids']
    attention = model(input_ids)[-1]
    input_id_list = input_ids[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(input_id_list)
    model_view(attention, tokens)

In [18]:
# Visualise BERT's attention using a fake-sounding sample text
text = "Today Donald Trump traveled to Mars on a SpaceX rocket and got shot by an alien as soon as he got there."
show_attention(text)
