# 📰 Fake News Detection Using Ensemble Learning

---

| S. No |    ID No.     | Name             |
| ----: | :-----------: | :--------------- |
|    1. | 2022A7PS0003U | Yusra Hakim      |
|    2. | 2022A7PS0019U | Joseph Cijo      |
|    3. | 2022A7PS0031U | Ritvik Bhatnagar |

This Jupyter Notebook is for the project in the Data Mining (CS F415) course. It contains the code used to prepare the ensemble model using the [dataset](#dataset-details) described below.

### Dataset Details

**Dataset introduced in:**
P. K. Verma, P. Agrawal, I. Amorim, and R. Prodan, "WELFake: Word Embedding Over Linguistic Features for Fake News Detection," _IEEE Transactions on Computational Social Systems_, vol. 8, no. 4, pp. 881-893, 2021.
[Kaggle | WELFake Dataset](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification)

### Setting Up an Environment for the Jupyter notebook

First, install Miniconda from [here](https://docs.anaconda.com/miniconda/install/).

Then open a command prompt in this directory and run the following commands. They create and activate an environment called `proj`.

```bash
conda create -n proj python=3.12
conda activate proj
```

After running this, your command prompt should be prefixed with "`(proj)`".

Run the following cell to install the required packages, including [PyTorch](https://pytorch.org/get-started/locally/). This will take some time.


In [None]:
%conda install -n proj ipykernel ipywidgets --update-deps --force-reinstall
%conda install -n proj nltk
%conda install -n proj conda-forge::textblob
%pip install scikit-learn matplotlib pandas pyperclip contractions scipy numpy
%conda install -n proj conda-forge::transformers
%pip install torch --index-url https://download.pytorch.org/whl/cu124


Remember to select the `proj` environment as the notebook kernel at the bottom-right.

---

# 1. 📚 Outline

To train the model, the following steps must be followed:

<img src="assets/archdiag.png" size=64/>

The model will accept two inputs:
1. The headline of the article
2. Content of the article (Optional)

Its output states whether the article is fake news or factual (real) news.

---

# 2. 📑 Dataset Tomfoolery

The dataset is a CSV file with the following columns:

- _title_: the title of the article
- _text_: the text of the article
- _label_: the label of the article (0 for fake, 1 for real)

The dataset used is the WELFake dataset, [available here](https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification).

This dataset is a collection of English-language news articles, each labeled as either fake or real. According to its Kaggle description, WELFake contains 72,134 articles (35,028 real and 37,106 fake), merged from four news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political).

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np
import scipy as sp

import nltk
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from IPython.display import display, HTML
import ipywidgets as widgets

import random
import re
import contractions

from transformers.models.distilbert import DistilBertTokenizer, DistilBertModel

import torch
import pickle


In [None]:
df = pd.read_csv("dataset_and_corpora/WELFake_Dataset.csv", index_col=0)

df.head()


## ✂️ Splitting Dataset

The dataset is split as
-  80% → Training + Validation Set
-  20% → Testing Set

In [None]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)


In [None]:
print(f"Training set size: {train_df.shape}")
train_df.to_csv("dataset_and_corpora/train.csv", index=False)
train_df.head()


In [None]:
print(f"\nTesting set size: {test_df.shape}")
test_df.to_csv("dataset_and_corpora/test.csv", index=False)
test_df.head()


## 👾 Data Preprocessing

The data is preprocessed using the following steps
1. [Text Cleaning](#text-cleaning)
   1. [Lowercasing & URL Removal](#lowercasing-url-removal)
   2. [Contractions Expansion](#contractions-expansion)
   3. [Tokenization](#tokenization)
   4. [Lemmatization](#lemmatization)
   5. [Stopword Removal](#stopword-removal)
2. [Data Augmentation](#data-augmentation)
   1. [Synonym Replacement](#synonym-replacement)
   2. [Random Insertion](#random-insertion)
   3. [Random Swap](#random-swap)
   4. [Random Deletion](#random-deletion)

---

### 🧹 Data Cleaning

The text in the dataset has to be cleaned before it can be used for training the model. The following steps are used for cleaning the text.

[Kaggle | Getting started with Text Preprocessing](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing)


#### Merging Text Values

In [None]:
def merge_title_and_text(df: pd.DataFrame) -> pd.DataFrame:
    """
    Merge the 'title' and 'text' fields into a single 'content' field.

    :param df: Input DataFrame
    :return: DataFrame with a new 'content' field
    """

    df["content"] = df["title"].fillna("") + " " + df["text"].fillna("")
    return df


#### Handling Missing Values

In [None]:
def handle_missing_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Handle missing data in the DataFrame by removing rows with any empty, NaN values, or rows containing only whitespaces.

    :param df: DataFrame to process
    :return: DataFrame with rows containing any empty, NaN values, or whitespaces-only content removed
    """

    missing_values = df.isnull().sum()
    print("Missing values in each column:\n", missing_values)

    # Treat empty and whitespace-only strings as missing, as the docstring promises
    df = df.replace(r"^\s*$", np.nan, regex=True)

    df = df.dropna()

    return df


#### Lowercasing & URL Removal

The text is converted to lowercase and URLs are removed from the text.

In [None]:
def lowercase_and_remove_urls(text: str) -> str:
    """
    Convert text to lowercase and remove URLs.

    :param text: Input text
    :return: Processed text with URLs removed
    """

    text = text.lower()

    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)
    return text
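
As a quick check of what this step does, the standalone sketch below applies the same lowercase-then-strip-URLs idea to a made-up sentence. The pattern is a slightly tightened variant (an assumption, not the pattern used above), and `strip_urls` is a hypothetical helper name.

```python
import re

# Assumed pattern: matches http://, https:// and www. links up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text: str) -> str:
    """Lowercase the text, drop URLs, and collapse the leftover whitespace."""
    text = URL_PATTERN.sub("", text.lower())
    return " ".join(text.split())


sample = "BREAKING: Read more at https://example.com/story?id=1 or www.example.org today"
print(strip_urls(sample))
```

Collapsing whitespace after the substitution avoids the double spaces that removing a URL would otherwise leave behind.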


#### Contractions Expansion

Contractions are expanded to their full form. For example, "I'm" is expanded to "I am". The [`contractions`](https://github.com/kootenpv/contractions/tree/master) library is used for this purpose.

In [None]:
def expand_contractions(text: str) -> str:
    """
    Expand contractions in the text using the `contractions` library.

    :param text: Input text
    :return: Text with contractions expanded
    """

    expanded_words = [contractions.fix(word) for word in text.split()]
    expanded_text = " ".join(expanded_words)  # type: ignore
    return expanded_text


#### Tokenization

The text is tokenized into words. This also removes punctuation and any special characters.

The [`RegexpTokenizer`](https://www.nltk.org/api/nltk.tokenize.RegexpTokenizer.html) is used to split the text into alphanumeric words and to ignore all other characters.

In [None]:
def tokenize_text(text: str) -> list:
    """
    Tokenize the input text using NLTK's RegexpTokenizer.

    :param text: Input text
    :return: List of tokens
    """

    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(text)
    return tokens
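
To see what the `\w+` pattern keeps and discards, here is a minimal sketch that uses only the standard library. To my understanding, `re.findall` with the same pattern should yield the same tokens as `RegexpTokenizer(r"\w+")`; the `tokenize` helper and the sample sentence are made up for illustration.

```python
import re


def tokenize(text: str) -> list[str]:
    """Return maximal runs of word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


sample = "u.s. economy grew 3.5% in q2_2020, they're saying!"
tokens = tokenize(sample)
print(tokens)
```

Note how the apostrophe splits "they're" into two tokens, which is one reason contractions are expanded before tokenization, and how decimal numbers and abbreviations are broken at the dots.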


#### Lemmatization

The words are lemmatized to their root form. This is done using the [`WordNetLemmatizer`](https://www.nltk.org/api/nltk.stem.html#nltk.stem.wordnet.WordNetLemmatizer) from the NLTK library.

In [None]:
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("wordnet")


In [None]:
lemmatizer = WordNetLemmatizer()


def nltkToWordnet(nltk_tag: str) -> str:
    """
    Convert NLTK POS tags to WordNet POS tags.

    :param nltk_tag: NLTK POS tag
    :return: WordNet POS tag
    """

    if nltk_tag.startswith("J"):
        return wordnet.ADJ
    elif nltk_tag.startswith("V"):
        return wordnet.VERB
    elif nltk_tag.startswith("N"):
        return wordnet.NOUN
    elif nltk_tag.startswith("R"):
        return wordnet.ADV
    else:
        return None  # type: ignore


def lemmatize_text(tokens: list) -> list:
    """
    Lemmatize the tokens using NLTK's WordNetLemmatizer.

    :param tokens: List of tokens
    :return: List of lemmatized words
    """

    pos_tags = nltk.pos_tag(tokens)
    res_words = []
    for word, tag in pos_tags:
        tag = nltkToWordnet(tag)
        if tag is None:
            res_words.append(word)
        else:
            res_words.append(lemmatizer.lemmatize(word, tag))
    return res_words


#### Stopword Removal

Stopwords are removed from the text. Stopwords are common words that do not add much meaning to the text. The [`stopwords`](https://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.wordlist.WordListCorpusReader) from the NLTK library are used for this purpose.


In [None]:
nltk.download("words")
nltk.download("stopwords")


In [None]:
englishWords = set(nltk.corpus.words.words())
stop_words = set(stopwords.words("english"))


def remove_stopwords(tokens: list) -> str:
    """
    Remove stop words from the list of tokens.

    :param tokens: List of tokens
    :return: Space-joined string of the tokens that remain after filtering
    """

    filtered_words = [w for w in tokens if (w in englishWords and w not in stop_words)]
    filtered_text = " ".join(filtered_words)
    return filtered_text
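
The filter above does two things at once: it drops stop words, and it drops every token that is not in the English word list. The sketch below illustrates the combined effect with two tiny, hypothetical sets standing in for the NLTK corpora; the real corpora are much larger and may treat these particular tokens differently.

```python
# Hypothetical stand-ins for the NLTK `words` and `stopwords` corpora.
toy_vocabulary = {"the", "president", "sign", "a", "new", "bill", "in"}
toy_stop_words = {"the", "a", "in"}


def filter_tokens(tokens: list[str]) -> list[str]:
    """Keep tokens that are in the vocabulary and are not stop words."""
    return [t for t in tokens if t in toy_vocabulary and t not in toy_stop_words]


tokens = ["the", "president", "sign", "a", "new", "bill", "in", "2016", "obama"]
filtered = filter_tokens(tokens)
print(filtered)  # "2016" and "obama" are removed by the vocabulary check
```

With a dictionary check like this, numbers and any names missing from the word list disappear from the text, which is worth keeping in mind when interpreting the features later.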


---

#### Performing Text Cleaning


Function to perform text cleaning on the dataset.

In [None]:
def clean_text(text: str) -> str:
    """
    Clean the input text by applying various preprocessing steps.

    :param text: Input text
    :return: Cleaned text
    """
    text = lowercase_and_remove_urls(text)
    text = expand_contractions(text)
    tokens = tokenize_text(text)
    tokens = lemmatize_text(tokens)
    text = remove_stopwords(tokens)
    return text


In [None]:
progress = widgets.IntProgress(
    value=0, min=0, max=len(train_df), description="Progress:"
)


In [None]:
display(progress)

cleaned_data = []
train_df = merge_title_and_text(train_df)
train_df = handle_missing_data(train_df)

for index, row in train_df.iterrows():
    content = row["content"] if pd.notnull(row["content"]) else ""

    cleaned_content = clean_text(content)

    cleaned_data.append({"content": cleaned_content, "label": row["label"]})

    progress.value += 1

cleaned_train_df = pd.DataFrame(cleaned_data)
cleaned_train_df = handle_missing_data(cleaned_train_df)

cleaned_train_df.head()


---

### 🥀 Data Augmentation

Data augmentation is a technique used to increase the size of the dataset by adding slightly modified copies of the data. This is done to improve the performance of the model by providing more data for training.

Jason Wei, Kai Zou, "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks"
[Available here](https://arxiv.org/abs/1901.11196)

#### Synonym Replacement

In this technique, we replace n words in the sentence with synonyms from WordNet. We choose n random words from the sentence that have WordNet synonyms (stop words were already removed during cleaning) and replace each of them with one of its synonyms chosen at random.

In [None]:
def get_synonyms(word: str) -> list:
    """
    Get synonyms for a given word using WordNet.

    :param word: Input word
    :return: List of synonyms
    """

    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():  # type: ignore
            synonyms.add(lemma.name())
    if word in synonyms:
        synonyms.remove(word)
    return list(synonyms)


def synonym_replacement(text: str, n: int) -> str:
    """
    Randomly replace words in the text with their synonyms.

    :param text: Input text
    :param n: Number of words to replace
    :return: Text with words replaced by synonyms
    """

    words = text.split()
    new_words = words.copy()
    random_word_list = list(set([word for word in words if wordnet.synsets(word)]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:
            break

    sentence = " ".join(new_words)

    return sentence
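
The sketch below shows the same idea without needing the WordNet corpus. `TOY_SYNONYMS` is a hypothetical synonym table and `replace_with_synonyms` is an illustrative helper, not the function used in this notebook; a seeded `random.Random` makes the augmentation repeatable.

```python
import random

# Hypothetical synonym table standing in for WordNet lookups.
TOY_SYNONYMS = {
    "claim": ["assert", "allege"],
    "official": ["functionary"],
    "report": ["account", "study"],
}


def replace_with_synonyms(text: str, n: int, rng: random.Random) -> str:
    """Replace up to n distinct words (all their occurrences) with a random synonym."""
    words = text.split()
    # Sort before shuffling so the result depends only on the seed.
    candidates = sorted({w for w in words if w in TOY_SYNONYMS})
    rng.shuffle(candidates)
    for target in candidates[:n]:
        synonym = rng.choice(TOY_SYNONYMS[target])
        words = [synonym if w == target else w for w in words]
    return " ".join(words)


rng = random.Random(0)  # seeded so repeated runs give the same augmentation
original = "official claim report contradict earlier claim"
augmented = replace_with_synonyms(original, n=2, rng=rng)
print(augmented)
```

As in the function above, every occurrence of a chosen word is replaced by the same synonym, so the sentence length does not change.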


#### Random Insertion

In random insertion, we randomly insert synonyms of a word into the sentence n times. This is done by first finding a synonym of a word and then inserting it into the sentence at a random position.

In [None]:
def add_word(word: str, sentence: str) -> str:
    """
    Add a word at a random position in the sentence.

    :param word: Word to add
    :param sentence: Original sentence
    :return: Sentence with the word added
    """

    words = sentence.split()
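    # randint is inclusive on both ends: len(words) allows appending at the end
    # and avoids a ValueError when the sentence is empty.
    random_idx = random.randint(0, len(words))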
    words.insert(random_idx, word)
    return " ".join(words)


def random_insertion(text: str, n: int) -> str:
    """
    Randomly insert words into the text.

    :param text: Input text
    :param n: Number of words to insert
    :return: Text with words inserted
    """

    words = text.split()
    new_words = words.copy()
    for _ in range(n):
        random_word = random.choice(words)
        synonyms = get_synonyms(random_word)
        if synonyms:
            synonym = random.choice(synonyms)
            new_words = add_word(synonym, " ".join(new_words)).split()

    return " ".join(new_words)
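
A minimal, WordNet-free sketch of random insertion follows. The synonym table and the `insert_synonyms` helper are hypothetical; the point is only to show that each step adds one synonym at a random position and leaves the original words in order.

```python
import random

# Hypothetical synonym table standing in for WordNet lookups.
TOY_SYNONYMS = {"vote": ["ballot"], "law": ["statute", "legislation"]}


def insert_synonyms(text: str, n: int, rng: random.Random) -> str:
    """Insert n synonyms of randomly chosen words at random positions."""
    words = text.split()
    source_words = [w for w in words if w in TOY_SYNONYMS]
    if not source_words:
        return text  # nothing to draw synonyms from
    for _ in range(n):
        synonym = rng.choice(TOY_SYNONYMS[rng.choice(source_words)])
        # randint is inclusive, so len(words) means "append at the end".
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


original = "senate vote new law"
out = insert_synonyms(original, n=2, rng=random.Random(1))
print(out)
```

Unlike the function above, this sketch only samples words that have an entry in the table, so every iteration inserts something.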


#### Random Swap

In random swap, we randomly swap two words in the sentence n times. This is done by randomly choosing two words in the sentence and then swapping their positions.

In [None]:
def random_swap(text: str, n: int) -> str:
    """
    Randomly swap words in the text.

    :param text: Input text
    :param n: Number of swaps to perform
    :return: Text with words swapped
    """

    words = text.split()
    new_words = words.copy()
    for _ in range(n):
        idx1, idx2 = random.sample(range(len(words)), 2)
        new_words[idx1], new_words[idx2] = new_words[idx2], new_words[idx1]

    return " ".join(new_words)
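
The following sketch restates random swap with an explicit guard for very short inputs. `swap_words` is an illustrative name, and the seeded generator is an assumption made so the example is repeatable.

```python
import random


def swap_words(text: str, n: int, rng: random.Random) -> str:
    """Swap two randomly chosen positions, n times."""
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


original = "governor deny report about budget cut"
swapped = swap_words(original, n=2, rng=random.Random(0))
print(swapped)
```

Swapping only reorders the sentence, so the bag of words is unchanged. A bag-of-words representation such as TF-IDF therefore sees a swapped sentence as identical to the original.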


#### Random Deletion

In random deletion, we randomly delete each word in the sentence with a probability p.

In [None]:
def random_deletion(text: str, p: float) -> str:
    """
    Randomly delete words from the text with a given probability.

    :param text: Input text
    :param p: Probability of deletion
    :return: Text with words deleted
    """

    words = text.split()
    if len(words) == 1:
        return text

    new_words = []
    for word in words:
        r = random.uniform(0, 1)
        if r > p:
            new_words.append(word)

    if len(new_words) == 0:
        return random.choice(words)

    return " ".join(new_words)
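
Here is a compact sketch of the same deletion rule, using a seeded generator so the result can be reproduced. `delete_words` is an illustrative helper; on average it keeps a fraction of about `1 - p` of the words.

```python
import random


def delete_words(text: str, p: float, rng: random.Random) -> str:
    """Drop each word independently with probability p, keeping at least one word."""
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [w for w in words if rng.random() > p]
    # If every word was dropped, fall back to a single random word.
    return " ".join(kept) if kept else rng.choice(words)


sentence = "court block travel ban after appeal from state"
shortened = delete_words(sentence, p=0.1, rng=random.Random(0))
print(shortened)
```

With `p = 0.1`, as used later in this notebook, most sentences lose only a word or two, so the label is unlikely to be affected.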


---

#### Performing Data Augmentation

In [None]:
synonymReplaced_data = []
randomInserted_data = []
randomSwapped_data = []
randomDeleted_data = []


In [None]:
progress = widgets.IntProgress(
    value=0, min=0, max=len(cleaned_train_df), description="Progress:"
)


In [None]:
display(progress)

for index, row in cleaned_train_df.iterrows():
    content = row["content"]  # may be NaN; the notnull check below handles that case

    try:
        if pd.notnull(content):
            try:
                synonymReplaced_content = synonym_replacement(
                    content, (len(content.split()) // 3)
                )
            except Exception as e:
                print(f"Synonym replacement error for content: {e}")
                synonymReplaced_content = content

            try:
                randomInserted_content = random_insertion(
                    content, (len(content.split()) // 3)
                )
            except Exception as e:
                print(f"Random insertion error for content: {e}")
                randomInserted_content = content

            try:
                randomSwapped_content = random_swap(
                    content, (len(content.split()) // 3)
                )
            except Exception as e:
                print(f"Random swap error for content: {e}")
                randomSwapped_content = content

            try:
                randomDeleted_content = random_deletion(content, 0.1)
            except Exception as e:
                print(f"Random deletion error for content: {e}")
                randomDeleted_content = content
        else:
            synonymReplaced_content = content
            randomInserted_content = content
            randomSwapped_content = content
            randomDeleted_content = content
    except Exception as e:
        print(f"Augmentation error for content: {e}")
        synonymReplaced_content = content
        randomInserted_content = content
        randomSwapped_content = content
        randomDeleted_content = content

    synonymReplaced_data.append(
        {
            "content": synonymReplaced_content,
            "label": row["label"],
        }
    )

    randomInserted_data.append(
        {
            "content": randomInserted_content,
            "label": row["label"],
        }
    )

    randomSwapped_data.append(
        {
            "content": randomSwapped_content,
            "label": row["label"],
        }
    )

    randomDeleted_data.append(
        {
            "content": randomDeleted_content,
            "label": row["label"],
        }
    )

    progress.value += 1


In [None]:
synonymReplaced_df = pd.DataFrame(synonymReplaced_data)
randomInserted_df = pd.DataFrame(randomInserted_data)
randomSwapped_df = pd.DataFrame(randomSwapped_data)
randomDeleted_df = pd.DataFrame(randomDeleted_data)

augmented_df = pd.concat(
    [synonymReplaced_df, randomInserted_df, randomSwapped_df, randomDeleted_df]
)

final_train_df = (
    pd.concat([cleaned_train_df, augmented_df]).sample(frac=1).reset_index(drop=True)
)


In [None]:
print(f"Final training set size: {final_train_df.shape}")
final_train_df.to_csv("dataset_and_corpora/augmented_train.csv", index=False)
final_train_df.head()


---

### 🎭 Sentiment Analysis

TextBlob is a Python library for processing textual data. It provides a simple API for common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and classification. Here it is used to score each article's polarity (from -1 to 1) and subjectivity (from 0 to 1), which are added to the dataset as extra columns.

In [None]:
from textblob import TextBlob


def analyze_sentiment(text: str) -> TextBlob.sentiment:  # type: ignore
    """
    Analyze the sentiment of the given text using TextBlob.

    :param text: Input text
    :return: Sentiment analysis result
    """

    blob = TextBlob(text)
    return blob.sentiment


In [None]:
def add_sentiment(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add sentiment analysis results (polarity and subjectivity) to the DataFrame.

    :param df: DataFrame to process
    :return: DataFrame with sentiment analysis results
    """

    progress = widgets.IntProgress(value=0, min=0, max=len(df), description="Progress:")
    display(progress)

    polarity = []
    subjectivity = []

    for _index, row in df.iterrows():
        sentiment = analyze_sentiment(row["content"])
        polarity.append(sentiment.polarity)
        subjectivity.append(sentiment.subjectivity)
        progress.value += 1

    df["polarity"] = polarity
    df["subjectivity"] = subjectivity
    df = df[["content", "polarity", "subjectivity", "label"]]

    return df


In [None]:
final_train_df = add_sentiment(final_train_df)
final_train_df.to_csv("dataset_and_corpora/augmented_train_senti.csv", index=False)
final_train_df.head()


---

## 🔢 Feature Extraction

Extracting features for model training
- TF-IDF (for SVM)
- Word2Vec + GloVe (for RF and Gradient Boosting)
- DistilBERT, a small BERT model (for LR)

In [3]:
final_train_df = pd.read_csv("dataset_and_corpora/augmented_train_senti.csv")


### TF-IDF Feature Extraction

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in natural language processing and information retrieval to evaluate the importance of a word in a document relative to a collection of documents (corpus).
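
To make the weighting concrete, the sketch below computes TF-IDF by hand for three made-up documents. It assumes the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization; the documents and helper names are hypothetical.

```python
import math
from collections import Counter

docs = [
    "senate pass budget bill",
    "senate reject budget",
    "alien endorse senate candidate",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf(doc: list[str]) -> dict[str, float]:
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(tokenized[2])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

The word "senate" occurs in every document, so it receives the lowest weight in the third document, while the words unique to that document score highest.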


In [None]:
def extract_tfidf_features(
    df: pd.DataFrame, max_features: int = 5000
) -> sp.sparse.csr_matrix:  # the `scipy.sparse.csr` namespace is deprecated
    """
    Extract TF-IDF features from the 'content' column of the DataFrame.

    :param df: Input DataFrame with a 'content' column
    :param max_features: Maximum number of features to extract
    :return: TF-IDF features as a sparse matrix
    """

    progress = widgets.IntProgress(value=0, min=0, max=1, description="TF-IDF:")
    display(progress)

    df["content"] = df["content"].fillna("")
    vectorizer = TfidfVectorizer(max_features=max_features)
    tfidf_features = vectorizer.fit_transform(df["content"])

    progress.value = 1
    return tfidf_features


### GloVe Feature Extraction

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

**Introduced in** Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf).

Before running the cell, please download the pre-trained word vectors (Wikipedia 2014 + Gigaword 5) from [here](https://nlp.stanford.edu/data/glove.6B.zip) and move them to the `dataset_and_corpora` folder.

In [None]:
def extract_glove_features(
    df: pd.DataFrame,
    glove_path: str = "dataset_and_corpora/glove.6B.100d.txt",
    embedding_dim: int = 100,
) -> np.ndarray:
    """
    Extract GloVe features from the 'content' column of the DataFrame.

    :param df: Input DataFrame with a 'content' column
    :param glove_path: Path to the GloVe embeddings file
    :param embedding_dim: Dimension of the GloVe embeddings
    :return: GloVe features as a NumPy array
    """
    progress = widgets.IntProgress(value=0, min=0, max=len(df), description="GloVe:")
    display(progress)

    # Load GloVe embeddings
    glove_embeddings = {}
    with open(glove_path, "r", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype="float32")
            glove_embeddings[word] = vector

    # Compute sentence embeddings
    sentences = [content.split() for content in df["content"]]
    glove_features = np.array(
        [
            np.mean(
                [
                    glove_embeddings[word]
                    for word in sentence
                    if word in glove_embeddings
                ]
                or [np.zeros(embedding_dim)],
                axis=0,
            )
            for sentence in sentences
        ]
    )

    for _ in range(len(df)):
        progress.value += 1

    return glove_features
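
The sentence embedding computed above is simply the mean of the word vectors that are found in the GloVe vocabulary, with a zero vector as the fallback. A minimal sketch with a hypothetical three-dimensional embedding table (real GloVe vectors here have 100 dimensions):

```python
import numpy as np

embedding_dim = 3
# Hypothetical word vectors standing in for the GloVe file.
toy_embeddings = {
    "senate": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "vote": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}


def mean_embedding(sentence: str) -> np.ndarray:
    """Average the vectors of known words; return zeros if none are known."""
    vectors = [toy_embeddings[w] for w in sentence.split() if w in toy_embeddings]
    if not vectors:
        return np.zeros(embedding_dim, dtype=np.float32)
    return np.mean(vectors, axis=0)


print(mean_embedding("senate vote zzz"))  # averages the two known words
print(mean_embedding("zzz qqq"))          # no known words, so all zeros
```

Averaging discards word order, so sentences produced by random swap map to exactly the same vector as the original.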


### DistilBERT Feature Extraction

[DistilBERT](https://huggingface.co/docs/transformers/en/model_doc/distilbert) is a smaller version of BERT pretrained by knowledge distillation, which gives faster inference and requires less compute to train. Its pretraining uses a triple loss (language modeling loss, distillation loss, and cosine-distance loss), and it reaches performance close to that of the larger BERT model.

**Introduced in** Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. [DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter](https://arxiv.org/abs/1910.01108).

In [None]:
def extract_bert_features(
    df: pd.DataFrame,
    model_name: str = "distilbert-base-uncased",
    chunk_size: int = 10,
    output_file: str = "features/distilbert_vectorizer.pkl",
) -> str:
    """
    Extract BERT features from the 'content' column of the DataFrame.

    :param df: Input DataFrame with a 'content' column
    :param model_name: Name of the pre-trained BERT model
    :param chunk_size: Number of samples to process at once
    :param output_file: Path to save the extracted features
    :return: Path to the saved features file
    """

    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    model = DistilBertModel.from_pretrained(model_name)
    model.eval()
    progress = widgets.IntProgress(value=0, min=0, max=len(df), description="BERT:")
    display(progress)

    with open(output_file, "wb") as f:
        for i in range(0, len(df), chunk_size):
            chunk = df.iloc[i : i + chunk_size]
            chunk_features = []

            outputs = None
            inputs = None

            for content in chunk["content"]:
                inputs = tokenizer(
                    content,
                    return_tensors="pt",
                    truncation=True,
                    padding=True,
                    max_length=512,
                )
                with torch.no_grad():
                    outputs = model(**inputs)
                cls_embedding = (
                    outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
                )
                chunk_features.append(cls_embedding)
                progress.value += 1

            # Save chunk features to pickle file
            pickle.dump(chunk_features, f)
            del inputs, outputs, chunk_features
            torch.cuda.empty_cache()

    print(f"Features saved to {output_file}")
    return output_file
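
Because the function above calls `pickle.dump` once per chunk on the same file handle, the output file holds a sequence of pickled lists rather than a single object, and one `pickle.load` call returns only the first chunk. The sketch below shows one way to read such a stream back; `load_pickle_stream` is a hypothetical helper, and an in-memory buffer stands in for the real features file.

```python
import io
import pickle


def load_pickle_stream(f):
    """Yield every object that was written with successive pickle.dump calls."""
    while True:
        try:
            yield pickle.load(f)
        except EOFError:  # raised once the end of the stream is reached
            break


# Demonstration with an in-memory buffer instead of the real features file.
buffer = io.BytesIO()
for chunk in ([[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6]]):
    pickle.dump(chunk, buffer)
buffer.seek(0)

# Flatten the chunks back into one list of per-sample feature vectors.
features = [vec for chunk in load_pickle_stream(buffer) for vec in chunk]
print(len(features))
```

For the real file, the same generator could be used inside `with open(path, "rb") as f:`, and the flattened list stacked into a single array if it fits in memory.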


### Performing Feature Extraction

In [None]:
tfidf_features = extract_tfidf_features(final_train_df)
with open("features/tfidf_vectorizer.pkl", "wb") as file:
    pickle.dump(tfidf_features, file)
del tfidf_features


In [None]:
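# The GloVe file is expected in the dataset_and_corpora folder, as described above
glove_features = extract_glove_features(
    final_train_df, glove_path="dataset_and_corpora/glove.6B.100d.txt"
)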
with open("features/glove_vectorizer.pkl", "wb") as file:
    pickle.dump(glove_features, file)
del glove_features


In [None]:
opfile = extract_bert_features(final_train_df)
print(f"Features saved to {opfile}\n")
# The function already saves the features to a file



In [None]:
# with open("dataset/tfidf_vectorizer.pkl", "rb") as file:
#     loaded_vectorizer_pickle = pickle.load(file)
# with open("dataset/GloVe.pkl", "rb") as file:
#     loaded_glove_pickle = pickle.load(file)


---

## 📱 Display Sample Data

In [None]:
def displayRandomSample(df: pd.DataFrame, dataType: int) -> None:
    """
    Display a random sample from the DataFrame.

    :param df: DataFrame to sample from
    :param dataType: Type of DataFrame to sample from (0 for original, 1 for augmented)
    """

    if dataType == 0:
        sample = df.sample(n=1)
        print(f"idx: {sample.index[0]}")
        text = sample.iloc[0]["text"]
        if isinstance(text, str):
            text = f"{text[:200]}..."
        else:
            text = "N/A"
        display(
            HTML(
                f"<b>Original:</b><br>"
                f"<b>Title:</b> {sample.iloc[0]['title']}<br>"
                f"<b>Text:</b> {text}<br>"
                f"<b>Label:</b> {sample.iloc[0]['label']}<br>"
            )
        )
    if dataType == 1:
        sample = df.sample(n=1)
        print(f"idx: {sample.index[0]}")
        text = sample.iloc[0]["content"]
        if isinstance(text, str):
            text = f"{text[:200]}..."
        else:
            text = "N/A"
        display(
            HTML(
                f"<b>Augmented:</b><br>"
                f"<b>Content:</b> {text}<br>"
                f"<b>Polarity:</b> {sample.iloc[0]['polarity']}<br>"
                f"<b>Subjectivity:</b> {sample.iloc[0]['subjectivity']}<br>"
                f"<b>Label:</b> {sample.iloc[0]['label']}"
            )
        )


In [None]:
displayRandomSample(final_train_df, 1)
print(final_train_df.shape)


---

# 3. 🧠 Model Building




The ensemble is built from five base models:
1. Random Forest (RF)
2. Support Vector Machine (SVM)
3. Gradient Boosting (XGBoost/LightGBM)
4. Logistic Regression (LR)
5. Small BERT (DistilBERT)

## Random Forest

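A Random Forest with 100 trees is trained on the GloVe features using 5-fold cross-validation. The out-of-fold class probabilities are saved so that they can be used later by the ensemble.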

In [None]:
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import KFold
import numpy as np
import pickle

# Load GloVe features and labels
with open("features/glove_vectorizer.pkl", "rb") as file:
    glove_features = pickle.load(file)

labels = final_train_df["label"].to_numpy()

# Convert data to PyTorch tensors
glove_features_tensor = torch.tensor(glove_features, dtype=torch.float32)
labels_tensor = torch.tensor(labels, dtype=torch.long)

# Initialize k-fold cross-validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize variables to store results
fold_accuracies = []
ensemble_predictions = np.zeros((len(labels), len(np.unique(labels))))

# Perform k-fold cross-validation
for fold, (train_idx, test_idx) in enumerate(kf.split(glove_features)):
    print(f"Fold {fold + 1}/{k}")

    # Split the data into training and testing sets for this fold
    X_train, X_test = glove_features_tensor[train_idx], glove_features_tensor[test_idx]
    y_train, y_test = labels_tensor[train_idx], labels_tensor[test_idx]

    # Create DataLoader for training and testing
    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    # Initialize the Random Forest model (using sklearn for simplicity)
    from sklearn.ensemble import RandomForestClassifier

    rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

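    # Train the model on the whole training fold in a single call.
    # (Calling fit() once per mini-batch would refit from scratch each time,
    # leaving a forest trained only on the last batch.)
    rf_model.fit(X_train.numpy(), y_train.numpy())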

    # Make predictions
    y_pred = []
    y_pred_proba = []
    for X_batch, _ in test_loader:
        preds = rf_model.predict(X_batch.numpy())
        preds_proba = rf_model.predict_proba(X_batch.numpy())
        y_pred.extend(preds)
        y_pred_proba.extend(preds_proba)

    # Store predictions for ensemble learning
    ensemble_predictions[test_idx] = np.array(y_pred_proba)

    # Evaluate the model
    accuracy = accuracy_score(y_test.numpy(), y_pred)
    fold_accuracies.append(accuracy)
    print(f"Accuracy for fold {fold + 1}: {accuracy:.4f}")
    print("\nClassification Report:\n", classification_report(y_test.numpy(), y_pred))

# Calculate and print the average accuracy across all folds
average_accuracy = np.mean(fold_accuracies)
print(f"\nAverage Accuracy across {k} folds: {average_accuracy:.4f}")

# Save the ensemble predictions for later use
with open("features/rf_ensemble_predictions.pkl", "wb") as file:
    pickle.dump(ensemble_predictions, file)


## SVM

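A linear classifier is trained on the TF-IDF features in PyTorch using 5-fold cross-validation, running on the GPU when one is available.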
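
What distinguishes a linear SVM from logistic regression is mainly the loss function: an SVM uses the hinge loss, while cross-entropy corresponds to the logistic loss. The cell that follows pairs a linear layer with `nn.CrossEntropyLoss`, so the sketch below compares the two losses for a single example as a function of the margin `y * score` with `y` in {-1, +1}. The helper names are illustrative only.

```python
import math


def hinge_loss(margin: float) -> float:
    """SVM loss for one example; zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin: float) -> float:
    """Binary cross-entropy for one example, written in terms of the margin."""
    return math.log1p(math.exp(-margin))


for margin in (-1.0, 0.0, 0.5, 1.0, 3.0):
    print(
        f"margin={margin:+.1f}  "
        f"hinge={hinge_loss(margin):.3f}  "
        f"logistic={logistic_loss(margin):.3f}"
    )
```

The hinge loss is exactly zero for examples classified with a margin of at least 1, whereas the logistic loss keeps pushing those examples further from the boundary. In PyTorch, `nn.MultiMarginLoss` is one hinge-style option if an SVM objective is wanted.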


In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import DataLoader, TensorDataset

import torch.nn as nn
import torch.optim as optim

# Load TF-IDF features and labels
with open("features/tfidf_vectorizer.pkl", "rb") as file:
    tfidf_features = pickle.load(file)

labels = final_train_df["label"].to_numpy()

# Convert data to PyTorch tensors
tfidf_features = torch.tensor(tfidf_features.toarray(), dtype=torch.float32)
labels = torch.tensor(labels, dtype=torch.long)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


# Define the SVM model
class SVMModel(nn.Module):
    def __init__(self, input_dim):
        super(SVMModel, self).__init__()
        self.fc = nn.Linear(input_dim, 2)  # Binary classification (2 classes)

    def forward(self, x):
        return self.fc(x)


# Initialize k-fold cross-validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize variables to store results
fold_accuracies = []
ensemble_predictions = torch.zeros((len(labels), 2), dtype=torch.float32)

# Perform k-fold cross-validation
for fold, (train_idx, test_idx) in enumerate(kf.split(tfidf_features.numpy())):
    print(f"Fold {fold + 1}/{k}")

    # Split the data into training and testing sets for this fold
    X_train, X_test = tfidf_features[train_idx], tfidf_features[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]

    # Create DataLoaders for batch processing
    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    # Initialize the model, loss function, and optimizer
    model = SVMModel(input_dim=X_train.shape[1]).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss:.4f}")

        # Save the model after each epoch
        torch.save(
            model.state_dict(), f"models/svm_model_fold{fold + 1}_epoch{epoch + 1}.pth"
        )

    # Evaluate the model
    model.eval()
    all_outputs = []
    all_y_test = []
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X)
            all_outputs.append(outputs)
            all_y_test.append(batch_y)

    # Concatenate outputs and labels
    outputs = torch.cat(all_outputs, dim=0)
    y_test = torch.cat(all_y_test, dim=0)
    _, y_pred = torch.max(outputs, 1)
    y_pred_proba = torch.softmax(outputs, dim=1)

    # Store predictions for ensemble learning (moved to CPU to match ensemble_predictions)
    ensemble_predictions[test_idx] = y_pred_proba.cpu()

    # Evaluate the model
    accuracy = accuracy_score(y_test.cpu(), y_pred.cpu())
    fold_accuracies.append(accuracy)
    print(f"Accuracy for fold {fold + 1}: {accuracy:.4f}")
    print(
        "\nClassification Report:\n", classification_report(y_test.cpu(), y_pred.cpu())
    )

# Calculate and print the average accuracy across all folds
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"\nAverage Accuracy across {k} folds: {average_accuracy:.4f}")

# Save the ensemble predictions for later use
with open("features/svm_ensemble_predictions.pkl", "wb") as file:
    pickle.dump(ensemble_predictions.cpu().numpy(), file)
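

Each fold writes its held-out probabilities into `ensemble_predictions[test_idx]`. Because `KFold` partitions the row indices, every row ends up with exactly one prediction, and that prediction comes from a model that never saw the row during training. The sketch below checks this property on toy data. The sample count and the random stand-in probabilities are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 23  # deliberately not divisible by k
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

rng = np.random.default_rng(0)
oof_predictions = np.zeros((n_samples, 2), dtype=np.float32)
times_written = np.zeros(n_samples, dtype=int)

for train_idx, test_idx in kf.split(np.zeros(n_samples)):
    # Stand-in for a model's softmax output on the held-out fold
    logits = rng.normal(size=(len(test_idx), 2))
    proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

    oof_predictions[test_idx] = proba
    times_written[test_idx] += 1

    # Train and test indices never overlap within a fold
    assert np.intersect1d(train_idx, test_idx).size == 0

print("Rows written exactly once:", bool((times_written == 1).all()))
print("Rows sum to 1:", bool(np.allclose(oof_predictions.sum(axis=1), 1.0)))
```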


## Logistic regression

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import DataLoader, TensorDataset

import torch.nn as nn
import torch.optim as optim

# Load DistilBERT features and labels
with open("features/distilbert_vectorizer.pkl", "rb") as file:
    distilbert_features = pickle.load(file)

labels = final_train_df["label"].to_numpy()

# Convert data to PyTorch tensors
distilbert_features = torch.tensor(distilbert_features, dtype=torch.float32)
labels = torch.tensor(labels, dtype=torch.long)

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


# Define the Logistic Regression model
class LogisticRegressionModel(nn.Module):
    def __init__(self, input_dim):
        super(LogisticRegressionModel, self).__init__()
        self.fc = nn.Linear(input_dim, 2)  # Binary classification (2 classes)

    def forward(self, x):
        return self.fc(x)


# Initialize k-fold cross-validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize variables to store results
fold_accuracies = []
ensemble_predictions = torch.zeros((len(labels), 2), dtype=torch.float32)

# Perform k-fold cross-validation
for fold, (train_idx, test_idx) in enumerate(kf.split(distilbert_features.numpy())):
    print(f"Fold {fold + 1}/{k}")

    # Split the data into training and testing sets for this fold
    X_train, X_test = distilbert_features[train_idx], distilbert_features[test_idx]
    y_train, y_test = labels[train_idx], labels[test_idx]

    # Create DataLoaders for batch processing
    train_dataset = TensorDataset(X_train, y_train)
    test_dataset = TensorDataset(X_test, y_test)
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    # Initialize the model, loss function, and optimizer
    model = LogisticRegressionModel(input_dim=X_train.shape[1]).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    num_epochs = 10
    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss:.4f}")

        # Save the model after each epoch
        torch.save(
            model.state_dict(),
            f"models/logistic_regression_fold{fold + 1}_epoch{epoch + 1}.pth",
        )

    # Evaluate the model
    model.eval()
    all_outputs = []
    all_y_test = []
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)
            outputs = model(batch_X)
            all_outputs.append(outputs)
            all_y_test.append(batch_y)

    # Concatenate outputs and labels
    outputs = torch.cat(all_outputs, dim=0)
    y_test = torch.cat(all_y_test, dim=0)
    _, y_pred = torch.max(outputs, 1)
    y_pred_proba = torch.softmax(outputs, dim=1)

    # Store predictions for ensemble learning (moved to CPU to match ensemble_predictions)
    ensemble_predictions[test_idx] = y_pred_proba.cpu()

    # Evaluate the model
    accuracy = accuracy_score(y_test.cpu(), y_pred.cpu())
    fold_accuracies.append(accuracy)
    print(f"Accuracy for fold {fold + 1}: {accuracy:.4f}")
    print(
        "\nClassification Report:\n", classification_report(y_test.cpu(), y_pred.cpu())
    )

# Calculate and print the average accuracy across all folds
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"\nAverage Accuracy across {k} folds: {average_accuracy:.4f}")

# Save the ensemble predictions for later use
with open("features/logistic_regression_ensemble_predictions.pkl", "wb") as file:
    pickle.dump(ensemble_predictions.cpu().numpy(), file)


## Smallbert

In [None]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, classification_report
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# The tokenizer needs raw strings, not the precomputed DistilBERT embeddings.
# ASSUMPTION: the article text is stored in final_train_df["text"]; adjust the
# column name if the preprocessed text lives elsewhere.
texts = final_train_df["text"].astype(str).tolist()
labels = torch.as_tensor(final_train_df["label"].to_numpy(), dtype=torch.long)

# Initialize the tokenizer once; it is shared by all folds
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Initialize k-fold cross-validation
k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize variables to store results
fold_accuracies = []
ensemble_predictions = torch.zeros((len(labels), 2), dtype=torch.float32)

# Perform k-fold cross-validation
for fold, (train_idx, test_idx) in enumerate(kf.split(labels.numpy())):
    print(f"Fold {fold + 1}/{k}")

    # Create DataLoaders over (row index, label) pairs; texts are looked up per batch
    train_dataset = TensorDataset(torch.as_tensor(train_idx), labels[train_idx])
    test_dataset = TensorDataset(torch.as_tensor(test_idx), labels[test_idx])
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

    # Initialize a fresh classifier and optimizer so no weights leak between folds
    model = DistilBertForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=2
    ).to(device)
    # ASSUMPTION: AdamW with lr=2e-5, a common fine-tuning choice; tune as needed
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # Train the model
    num_epochs = 3
    gradient_accumulation_steps = 2
    model.train()
    for epoch in range(num_epochs):
        epoch_loss = 0
        optimizer.zero_grad()
        for step, (batch_idx, batch_y) in enumerate(train_loader):
            # Tokenize the raw text and move it to the selected device
            encoded = tokenizer(
                [texts[i] for i in batch_idx.tolist()],
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            input_ids = encoded["input_ids"].to(device)
            attention_mask = encoded["attention_mask"].to(device)
            batch_y = batch_y.to(device)

            # Forward pass; the loss is scaled so accumulated gradients average out
            outputs = model(
                input_ids=input_ids, attention_mask=attention_mask, labels=batch_y
            )
            loss = outputs.loss / gradient_accumulation_steps
            loss.backward()
            epoch_loss += loss.item()

            # Step every few batches, and once more on the final partial group
            is_last_step = (step + 1) == len(train_loader)
            if (step + 1) % gradient_accumulation_steps == 0 or is_last_step:
                optimizer.step()
                optimizer.zero_grad()

        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss:.4f}")

    # Evaluate the model
    model.eval()
    all_outputs = []
    all_y_test = []
    with torch.no_grad():
        for batch_idx, batch_y in test_loader:
            # Tokenize the raw text and move it to the selected device
            encoded = tokenizer(
                [texts[i] for i in batch_idx.tolist()],
                padding=True,
                truncation=True,
                return_tensors="pt",
            )
            input_ids = encoded["input_ids"].to(device)
            attention_mask = encoded["attention_mask"].to(device)
            batch_y = batch_y.to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            all_outputs.append(outputs.logits)
            all_y_test.append(batch_y)

    # Concatenate outputs and labels
    outputs = torch.cat(all_outputs, dim=0)
    y_test = torch.cat(all_y_test, dim=0)
    _, y_pred = torch.max(outputs, 1)
    y_pred_proba = torch.softmax(outputs, dim=1)

    # Store predictions for ensemble learning (moved to CPU to match ensemble_predictions)
    ensemble_predictions[test_idx] = y_pred_proba.cpu()

    # Evaluate the model
    accuracy = accuracy_score(y_test.cpu(), y_pred.cpu())
    fold_accuracies.append(accuracy)
    print(f"Accuracy for fold {fold + 1}: {accuracy:.4f}")
    print(
        "\nClassification Report:\n", classification_report(y_test.cpu(), y_pred.cpu())
    )

# Calculate and print the average accuracy across all folds
average_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"\nAverage Accuracy across {k} folds: {average_accuracy:.4f}")

# Save the ensemble predictions for later use
with open("features/smallbert_ensemble_predictions.pkl", "wb") as file:
    pickle.dump(ensemble_predictions.cpu().numpy(), file)
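

Each of the four base models saves an `(n_samples, 2)` array of out-of-fold class probabilities. How they are combined is not shown in this section, so the following is only a sketch of two common options: soft voting (averaging the probabilities) and stacking (concatenating them as meta-features for a second-stage model). The toy arrays and the `soft_vote` and `stack_features` helper names are hypothetical.

```python
import numpy as np


def soft_vote(prob_arrays):
    """Average class probabilities from several models and pick the argmax."""
    mean_proba = np.mean(np.stack(prob_arrays, axis=0), axis=0)
    return mean_proba, mean_proba.argmax(axis=1)


def stack_features(prob_arrays):
    """Concatenate per-model probabilities column-wise as meta-features."""
    return np.hstack(prob_arrays)


# Toy out-of-fold probabilities for 3 samples from 3 hypothetical base models
rf_proba = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
svm_proba = np.array([[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]])
lr_proba = np.array([[0.7, 0.3], [0.3, 0.7], [0.1, 0.9]])

mean_proba, voted_labels = soft_vote([rf_proba, svm_proba, lr_proba])
meta_features = stack_features([rf_proba, svm_proba, lr_proba])

print(voted_labels)         # predicted class per sample
print(meta_features.shape)  # (3, 6): two columns per base model
```

With real data, the arrays would come from the saved `.pkl` files, and a stacked meta-model should itself be evaluated with cross-validation so that it is not scored on rows it was fitted on.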
