## Bachelor Thesis
## "Exploring the Efficacy of Diverse Classification Techniques In Detecting Disinformation In News."
Ilia Sokolovskiy
HTW SS23

Notebook 5/5 - Model Ensemble + Final Predictions

**Installing all necessary dependencies**

In [None]:
%%capture
!pip install spacy transformers peft

**Importing all necessary libraries**

In [1]:
import os
import spacy

import pandas as pd
import numpy as np
import torch
from torch.nn.functional import softmax


from transformers import (
    AutoModelForSequenceClassification,
    BertTokenizerFast
)
from peft import (
    PeftModel,
    PeftConfig
)

from utils import NewsClassifier

### Load the data and define num_labels, id2label and label2id for BERT

In [2]:
# Load the pickle with the df
base_dir = "Data"
pickle_folder = "Pickles"
filename_pickle = "pickle_lg_df_2.pkl"

full_path_pickle = os.path.join(base_dir, pickle_folder, filename_pickle)

df = pd.read_pickle(full_path_pickle)

In [3]:
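# Collect the distinct labels (in order of first appearance) and strip stray whitespace
labels = df['label'].unique().tolist()
labels = [s.strip() for s in labels]

# Label mappings expected by the BERT classification head
num_labels = len(labels)
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}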

print(f"num_labels: {num_labels}")
print(f"id2label: {id2label}")
print(f"label2id: {label2id}")

num_labels: 2
id2label: {0: 'FAKE', 1: 'TRUE'}
label2id: {'FAKE': 0, 'TRUE': 1}
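Because `unique()` returns labels in order of first appearance, the mapping above depends on the row order of the DataFrame. A minimal sketch, assuming one wants a mapping that does not depend on row order, sorts the stripped labels before numbering them. `build_label_maps` is a hypothetical helper and not part of the notebook or `utils`.

```python
def build_label_maps(raw_labels):
    """Return (num_labels, id2label, label2id) with ids assigned in sorted label order."""
    # Strip whitespace first so that "FAKE " and "FAKE" collapse into one label
    labels = sorted({s.strip() for s in raw_labels})
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return len(labels), id2label, label2id


# Hypothetical raw label column, deliberately out of order and with stray whitespace
num_labels_demo, id2label_demo, label2id_demo = build_label_maps(
    ["TRUE", "FAKE ", "FAKE", "TRUE"]
)
print(num_labels_demo, id2label_demo, label2id_demo)
```

Whichever rule is used, the mapping has to match the one used when the BERT adapter was fine-tuned, otherwise column 1 of the logits would no longer mean TRUE.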


### Loading three models - SVM (97.1% Accuracy), Bi-LSTM (97.35% Accuracy), BERT (99.4% Accuracy)

In [51]:
# Setting path parameters
base_dir = "Models"

svm_dir = "Pickles"
svm_model_pickle = "best_sklearn_model_3.pkl"
svm_scaler_pickle = "best_sklearn_model_scaler_3.pkl"
svm_path_model = os.path.join(base_dir, svm_dir, svm_model_pickle)
svm_path_scaler = os.path.join(base_dir, svm_dir, svm_scaler_pickle)

bi_lstm_dir = "Torches"
bi_lstm_weights_file = "bi_lstm_weights_1.pth"
bi_lstm_path_weights = os.path.join(base_dir, bi_lstm_dir, bi_lstm_weights_file)

peft_model_id = "il1a/BERT_Fake_News_Classification_LoRA_v2"

**Loading SVM**

In [52]:
svm = pd.read_pickle(svm_path_model)
svm_scaler = pd.read_pickle(svm_path_scaler)

**Loading Bi-LSTM**

In [53]:
bi_lstm = NewsClassifier()
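# map_location lets the weights load on CPU-only machines even if they were saved from a GPU
bi_lstm.load_state_dict(torch.load(bi_lstm_path_weights, map_location=torch.device('cpu')))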
bi_lstm.eval()

NewsClassifier(
  (lstm): LSTM(300, 50, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=100, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

**Loading BERT Adapter from HuggingFace 🤗**

In [54]:
peft_config = PeftConfig.from_pretrained(peft_model_id)
bert_inference = AutoModelForSequenceClassification.from_pretrained(
    peft_config.base_model_name_or_path, num_labels=num_labels, id2label=id2label, label2id=label2id
)
bert_tokenizer = BertTokenizerFast.from_pretrained(peft_config.base_model_name_or_path)
bert = PeftModel.from_pretrained(bert_inference, peft_model_id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly i

In [55]:
# Moving BERT to the GPU, if possible, and switching to evaluation mode (disables dropout)
bert = bert.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu')).eval()

In [56]:
# Loading the same spaCy model as before, with unused pipeline components disabled
nlp = spacy.load('en_core_web_lg', disable=['textcat', 'parser', 'custom'])

# Vectorization function
def vectorize_texts(texts):
    return np.array([nlp(text).vector for text in texts])
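By default, spaCy's `Doc.vector` is the average of the token vectors, so `vectorize_texts` turns each text into one fixed-size vector regardless of its length. The sketch below reproduces that averaging in plain Python on made-up 3-dimensional embeddings. `mean_vector` and the toy values are illustrative assumptions, not spaCy code.

```python
def mean_vector(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Hypothetical 3-dimensional embeddings for a two-token text
# (the vectors fed to the Bi-LSTM in this notebook have 300 dimensions)
toy_tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
doc_vector = mean_vector(toy_tokens)
print(doc_vector)
```

One consequence of this design is that word order is lost before the models see the text: the SVM and the Bi-LSTM both receive a single averaged vector per article.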

In [57]:
# Initialising weights for weighted ensemble (defines 'authority' of each model)
weights = {
    "svm": 0.3306,
    "bi_lstm": 0.3314,
    "bert": 0.3380
}

# Probability function for SVM
def svm_probabilities(texts, svm_model, scaler):
    vectors = vectorize_texts(texts)
    scaled_vectors = scaler.transform(vectors)
    return svm_model.predict_proba(scaled_vectors)[:, 1]

# Probability function for Bi-LSTM
def bi_lstm_probabilities(texts, model):
    vectors = vectorize_texts(texts)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    vectors_tensor = torch.tensor(vectors).float().to(device)
    vectors_tensor = vectors_tensor.unsqueeze(1)
    with torch.no_grad():
        # reshape(-1) keeps a 1-D result even for a single text (squeeze() would return a 0-d array)
        probabilities = model(vectors_tensor).reshape(-1)
    return probabilities.cpu().numpy()

# Probability function for BERT
def bert_probabilities(texts, model, tokenizer, max_length=512):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    inputs = tokenizer(
        texts, return_tensors='pt', max_length=max_length, truncation=True, padding=True
    ).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
        probabilities = softmax(outputs.logits, dim=1)[:, 1]
    return probabilities.cpu().numpy()

# Probability function for the weighted model ensemble
def ensemble_predict_with_probabilities(texts):
    # Get each model's probability for the TRUE class (label 1)
    svm_probs = svm_probabilities(texts, svm, svm_scaler)
    bi_lstm_probs = bi_lstm_probabilities(texts, bi_lstm)
    bert_probs = bert_probabilities(texts, bert, bert_tokenizer)

    # Calculate ensemble prediction using weights
    ensemble_probs = (svm_probs * weights['svm'] + bi_lstm_probs * weights['bi_lstm'] + bert_probs * weights['bert']) / sum(weights.values())
    ensemble_preds = [1 if prob > 0.5 else 0 for prob in ensemble_probs]

    results = {
        'svm_probs': svm_probs,
        'bi_lstm_probs': bi_lstm_probs,
        'bert_probs': bert_probs,
        'ensemble_probs': ensemble_probs,
        'ensemble_preds': ensemble_preds
    }

    return results
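`bert_probabilities` takes column 1 of a two-class softmax as the probability of TRUE. For two classes this is the same as applying a sigmoid to the difference of the two logits, which is why it can be averaged directly with the Bi-LSTM's sigmoid output. The sketch below checks that identity in plain Python. The logit values are made up, and the local `softmax` and `sigmoid` helpers stand in for the torch functions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical [FAKE, TRUE] logits for one text
logits = [2.0, -1.0]
p_true = softmax(logits)[1]
p_true_via_sigmoid = sigmoid(logits[1] - logits[0])
print(p_true, p_true_via_sigmoid)
```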

In [58]:
# Function for final result formatting; every probability is the probability of the TRUE class
def interpret_results(results, test_texts):
    interpretations = []
    for i, text in enumerate(test_texts):
        interpretation = {
            "Text": text,
            "SVM Probability (%)": round(results['svm_probs'][i] * 100, 2),
            "Bi-LSTM Probability (%)": round(results['bi_lstm_probs'][i] * 100, 2),
            "BERT Probability (%)": round(results['bert_probs'][i] * 100, 2),
            "Ensemble Probability (%)": round(results['ensemble_probs'][i] * 100, 2),
            "Ensemble Prediction": "True" if results['ensemble_preds'][i] == 1 else "Fake"
        }
        interpretations.append(interpretation)
    return interpretations

In [59]:
# Fresh sample texts from the web; expected labels in order: FAKE, FAKE, FAKE, TRUE, TRUE, TRUE
test_texts = [
    "Al-Harir base in Erbil, Iraq is overcrowded after the arrival of a new batch of troops from the US",
    "Excessive use of earphones or earbuds can lead to deafness, ear infection and pain and stress, but the claim that it causes facial paralysis is false.",
    "Sea level rise over the last 20,000 years shows climate change is a 'scam'",
    "The total number of Ukrainian and Russian troops killed or wounded since the war in Ukraine began 18 months ago is nearing 500,000, U.S. officials said, a staggering toll as Russia assaults its next-door neighbor and tries to seize more territory.",
    "President Biden welcomed his counterparts from Japan and South Korea to Camp David on Friday morning as he seeks to cement a newly fortified three-way alliance, bridging generations of friction between the two Asian powers to forge mutual security arrangements in the face of an increasingly assertive China.",
    "U.S. officials say Chinese and Russian spy agencies are trying to steal technology from private American space companies and preparing cyberattacks that could disable satellites in a conflict.",
]
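Since the expected labels of the six sample texts are known, the ensemble's output can be checked programmatically rather than by eye. The sketch below shows one way to do that. `score_predictions` is a hypothetical helper, and the predictions passed to it are made up for illustration; they are not results from this notebook.

```python
def score_predictions(preds, expected):
    """Return (accuracy, indices of misclassified samples)."""
    if len(preds) != len(expected):
        raise ValueError("preds and expected must have the same length")
    hits = [p == e for p, e in zip(preds, expected)]
    misses = [i for i, ok in enumerate(hits) if not ok]
    return sum(hits) / len(hits), misses


# 0 = FAKE, 1 = TRUE, following the order stated for the sample texts
expected_labels = [0, 0, 0, 1, 1, 1]
# Made-up predictions, only to exercise the helper
hypothetical_preds = [0, 0, 0, 1, 0, 1]

accuracy, misses = score_predictions(hypothetical_preds, expected_labels)
print(f"accuracy: {accuracy:.2f}, misclassified indices: {misses}")
```

In the notebook itself, the first argument would be `results['ensemble_preds']` from `ensemble_predict_with_probabilities(test_texts)`.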

In [60]:
# Ensemble with fine-tuned SVM [97.1%]
results = ensemble_predict_with_probabilities(test_texts)
interpretations = interpret_results(results, test_texts)

for interpretation in interpretations:
    for key, value in interpretation.items():
        print(f"{key}: {value}")
    print("\n")

Text: Al-Harir base in Erbil, Iraq is overcrowded after the arrival of a new batch of troops from the US
SVM Probability (%): 8.04
Bi-LSTM Probability (%): 0.02
BERT Probability (%): 3.27
Ensemble Probability (%): 3.77
Ensemble Prediction: Fake


Text: Excessive use of earphones or earbuds can lead to deafness, ear infection and pain and stress, but the claim that it causes facial paralysis is false.
SVM Probability (%): 4.89
Bi-LSTM Probability (%): 2.2
BERT Probability (%): 0.39
Ensemble Probability (%): 2.47
Ensemble Prediction: Fake


Text: Sea level rise over the last 20,000 years shows climate change is a 'scam'
SVM Probability (%): 9.47
Bi-LSTM Probability (%): 0.1
BERT Probability (%): 0.41
Ensemble Probability (%): 3.3
Ensemble Prediction: Fake


Text: The total number of Ukrainian and Russian troops killed or wounded since the war in Ukraine began 18 months ago is nearing 500,000, U.S. officials said, a staggering toll as Russia assaults its next-door neighbor and tries to se
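The ensemble probability is a weighted average of the three per-model probabilities. As a sanity check, the sketch below recomputes it by hand for the first sample text, using the rounded percentages printed above and the weights from the ensemble cell. The variable names are illustrative only.

```python
# Weights as defined for the run with the fine-tuned SVM
demo_weights = {"svm": 0.3306, "bi_lstm": 0.3314, "bert": 0.3380}

# Rounded TRUE-class percentages printed for the first sample text
probs_percent = {"svm": 8.04, "bi_lstm": 0.02, "bert": 3.27}

# Weighted average, normalised by the sum of the weights
weighted_sum = sum(probs_percent[name] * demo_weights[name] for name in demo_weights)
ensemble_percent = weighted_sum / sum(demo_weights.values())

# Same 0.5 decision threshold as in the ensemble function
prediction = "True" if ensemble_percent / 100 > 0.5 else "Fake"
print(round(ensemble_percent, 2), prediction)
```

This reproduces the 3.77 % shown for that text. Because the inputs here are already rounded to two decimals, the last digit can differ slightly for other samples.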

In [41]:
# Ensemble with standard SVM (default parameters) [96.4%]
# Ensemble weights adjusted accordingly --> {"svm": 0.329, "bi_lstm": 0.332, "bert": 0.339}
results = ensemble_predict_with_probabilities(test_texts)
interpretations = interpret_results(results, test_texts)

for interpretation in interpretations:
    for key, value in interpretation.items():
        print(f"{key}: {value}")
    print("\n")

Text: Al-Harir base in Erbil, Iraq is overcrowded after the arrival of a new batch of troops from the US
SVM Probability (%): 7.59
Bi-LSTM Probability (%): 0.02
BERT Probability (%): 3.27
Ensemble Probability (%): 3.61
Ensemble Prediction: Fake


Text: Excessive use of earphones or earbuds can lead to deafness, ear infection and pain and stress, but the claim that it causes facial paralysis is false.
SVM Probability (%): 3.43
Bi-LSTM Probability (%): 2.2
BERT Probability (%): 0.39
Ensemble Probability (%): 1.99
Ensemble Prediction: Fake


Text: Sea level rise over the last 20,000 years shows climate change is a 'scam'
SVM Probability (%): 6.91
Bi-LSTM Probability (%): 0.1
BERT Probability (%): 0.41
Ensemble Probability (%): 2.44
Ensemble Prediction: Fake


Text: The total number of Ukrainian and Russian troops killed or wounded since the war in Ukraine began 18 months ago is nearing 500,000, U.S. officials said, a staggering toll as Russia assaults its next-door neighbor and tries to s
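The notebook does not state how the ensemble weights were chosen. One reading, offered here as an assumption, is that each weight is the model's accuracy divided by the sum of the three accuracies. The sketch below applies that rule to the accuracies quoted for the run with the default SVM. `normalize_accuracies` is a hypothetical helper.

```python
def normalize_accuracies(accuracies):
    """Scale accuracies so that the resulting weights sum to 1."""
    total = sum(accuracies.values())
    return {name: acc / total for name, acc in accuracies.items()}


# Accuracies quoted in the notebook: default SVM 96.4%, Bi-LSTM 97.35%, BERT 99.4%
default_svm_weights = normalize_accuracies(
    {"svm": 0.964, "bi_lstm": 0.9735, "bert": 0.994}
)
rounded_weights = {name: round(w, 3) for name, w in default_svm_weights.items()}
print(rounded_weights)
```

Rounded to three decimals, this matches the adjusted weights quoted in the cell comment (0.329, 0.332, 0.339). Applying the same rule to the fine-tuned SVM's 97.1% gives roughly 0.3304, 0.3313 and 0.3383, which is close to but not identical with the weights 0.3306, 0.3314 and 0.3380 used in that run, so those may have been obtained differently.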