# Overview
The goal of this project is to classify written media as either misinformation or accurate information.


## Fetch Data
The [notebook](./index.ipynb) uses a [script](./download_datasets.sh) to fetch the project data automatically. To inspect the data beforehand, run the script manually:
```bash
bash ./download_datasets.sh
```

In [60]:
# fetch data using ./download_datasets.sh
!bash ./download_datasets.sh

Checking and downloading datasets...
Downloading Kaggle datasets...
✔ fake-news-classification.zip already exists. Skipping download.
✔ fake-and-real-news-dataset.zip already exists. Skipping download.
Extracting Kaggle datasets...
Downloading additional datasets...
✔ liar_dataset.zip already exists. Skipping download.
✔ All datasets are ready!


## Setup for NLP Tools
To make full use of our NLP tooling, we install the following resources:
- `punkt` for tokenization 
- `stopwords` for removing common words
- `wordnet` for lemmatization
- `averaged_perceptron_tagger` for part of speech tagging
- `maxent_ne_chunker` for named entity recognition
- `words` for named entity recognition
- `spacy` for named entity recognition
- `en_core_web_sm` for named entity recognition

In [None]:
!python -m nltk.downloader punkt stopwords wordnet averaged_perceptron_tagger maxent_ne_chunker words
!python -m spacy download en_core_web_sm

[nltk_data] Downloading package punkt to /Users/rob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/rob/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 41.2 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


# Project

In [None]:
# project imports
import os
import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir)))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import spacy

import re

from nltk.corpus import stopwords

from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer


# Data Preparation
The data is loaded, inspected, and assembled into a single DataFrame. The text is then preprocessed by lowercasing it, removing stopwords and punctuation, and lemmatizing the remaining words. Finally, the data is split into training and testing sets.

## Loading Data
The data is sourced from three different datasets:
- [Fake & Real News]()
- [Fake News Classification]()
- [Liar Dataset]()


### Define Data Paths

In [63]:
# define datasets paths
datasets = {
    "fake_and_real_news": {
        "fake": "datasets/fake-and-real-news/Fake.csv",
        "real": "datasets/fake-and-real-news/True.csv"
    },
    "fake_news_classification": {
        "train": "datasets/fake-news-classification/train (2).csv",
        "test": "datasets/fake-news-classification/test (1).csv",
        "evaluation": "datasets/fake-news-classification/evaluation.csv"
    },
    "liar_data": {
        "train": "datasets/liar_data/train.tsv",
        "test": "datasets/liar_data/test.tsv",
        "valid": "datasets/liar_data/valid.tsv"
    }
}
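
Before loading, it can be useful to confirm that every listed file actually exists, since a failed download otherwise only surfaces later as a `FileNotFoundError`. The sketch below is one way to do that. `find_missing_files` is a hypothetical helper (not part of the notebook), and `example_paths` is a stand-in with the same two-level shape as `datasets`.

```python
from pathlib import Path


def find_missing_files(dataset_paths, root="."):
    """Return the (dataset, split, path) entries whose file is not on disk."""
    missing = []
    for dataset_name, splits in dataset_paths.items():
        for split_name, relative_path in splits.items():
            if not (Path(root) / relative_path).is_file():
                missing.append((dataset_name, split_name, relative_path))
    return missing


# Stand-in dictionary with the same shape as `datasets` (illustration only).
example_paths = {
    "liar_data": {
        "train": "datasets/liar_data/train.tsv",
        "test": "datasets/liar_data/test.tsv",
    }
}

for dataset_name, split_name, path in find_missing_files(example_paths):
    print(f"missing: {dataset_name}/{split_name} -> {path}")
```

An empty result means every file is in place and the loading cells can run.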



### Loading the Fake & Real News Dataset

In [64]:

# Load Fake & Real News Dataset
df_fake = pd.read_csv(datasets["fake_and_real_news"]["fake"])
df_real = pd.read_csv(datasets["fake_and_real_news"]["real"])

# Assign labels
df_fake["label"] = "fake"
df_real["label"] = "real"

# Standardize column names
df_fake.rename(columns={"title": "headline", "text": "content"}, inplace=True)
df_real.rename(columns={"title": "headline", "text": "content"}, inplace=True)

# Merge Fake & Real News
df_news = pd.concat([df_fake, df_real], ignore_index=True)

# print columns
print(df_news.columns)

Index(['headline', 'content', 'subject', 'date', 'label'], dtype='object')


In [65]:
# drop columns
df_news.drop(columns=["subject", "date", "headline"], inplace=True)

# rename content to text
df_news.rename(columns={"content": "text"}, inplace=True)

In [66]:
# print dataset info
print(df_news.head())
print(df_news["label"].value_counts())

                                                text label
0  Donald Trump just couldn t wish all Americans ...  fake
1  House Intelligence Committee Chairman Devin Nu...  fake
2  On Friday, it was revealed that former Milwauk...  fake
3  On Christmas day, Donald Trump announced that ...  fake
4  Pope Francis used his annual Christmas Day mes...  fake
label
fake    23481
real    21417
Name: count, dtype: int64


### Loading the Fake News Classification Dataset

In [67]:
# Load Fake News Classification Dataset with explicit delimiter
df_train = pd.read_csv(datasets["fake_news_classification"]["train"], delimiter=';')
df_test = pd.read_csv(datasets["fake_news_classification"]["test"], delimiter=';')
df_evaluation = pd.read_csv(datasets["fake_news_classification"]["evaluation"], delimiter=';')

# Merge train, test, and evaluation datasets
df_fake_news_class = pd.concat([df_train, df_test, df_evaluation], ignore_index=True)

# print columns
print(df_fake_news_class.columns)

Index(['Unnamed: 0', 'title', 'text', 'label'], dtype='object')


In [68]:
# drop columns
df_fake_news_class.drop(columns=['Unnamed: 0', 'title'], inplace=True)

In [69]:
# print dataset info
print(df_fake_news_class.head())
print(df_fake_news_class["label"].value_counts())
print(df_fake_news_class.columns)

                                                text  label
0  RAMALLAH, West Bank (Reuters) - Palestinians s...      1
1  BEIJING (Reuters) - U.S. President-elect Donal...      1
2  While the controversy over Trump s personal ta...      0
3  BEIJING (Reuters) - A trip to Beijing last wee...      1
4  There has never been a more UNCOURAGEOUS perso...      0
label
1    21924
0    18663
Name: count, dtype: int64
Index(['text', 'label'], dtype='object')


### Loading the Liar Dataset

In [70]:
# Define column names (the LIAR dataset has 14 columns; "context" is the venue/location column)
columns = ["id", "label", "statement", "subject", "speaker", "job", "state", "party",
           "barely-true", "false", "half-true", "mostly-true", "pants-fire", "context"]

# Load datasets with correct delimiter and column assignment
df_liar_train = pd.read_csv(datasets["liar_data"]["train"], delimiter='\t', names=columns, header=None)
df_liar_test = pd.read_csv(datasets["liar_data"]["test"], delimiter='\t', names=columns, header=None)
df_liar_valid = pd.read_csv(datasets["liar_data"]["valid"], delimiter='\t', names=columns, header=None)

# Combine datasets
df_liar = pd.concat([df_liar_train, df_liar_test, df_liar_valid], ignore_index=True)

# Map multi-class labels to binary labels
label_mapping = {
    "pants-fire": "fake",
    "false": "fake",
    "barely-true": "fake",
    "half-true": "real",
    "mostly-true": "real",
    "true": "real"
}
df_liar["label"] = df_liar["label"].map(label_mapping)

# Rename "statement" → "content" to match other datasets
df_liar.rename(columns={"statement": "content"}, inplace=True)

# Drop unnecessary columns
columns_to_drop = ["id", "speaker", "job", "state", "subject", "party",
                   "barely-true", "false", "half-true", "mostly-true", "pants-fire", "context"]
df_liar.drop(columns=columns_to_drop, inplace=True)

# standardize column names
df_liar.rename(columns={"content": "text"}, inplace=True)

# Print dataset info
print(df_liar.head())
print(df_liar["label"].value_counts())

  label                                               text
0  fake  Says the Annies List political group supports ...
1  real  When did the decline of coal start? It started...
2  real  Hillary Clinton agrees with John McCain "by vo...
3  fake  Health care reform legislation is likely to ma...
4  real  The economic turnaround started at the end of ...
label
real    7134
fake    5657
Name: count, dtype: int64


In [71]:
print(df_liar.columns)

Index(['label', 'text'], dtype='object')


### Assemble Data
Here we assemble the data into a single DataFrame. The columns were already renamed and trimmed above, so what remains is to put the labels on a common scale. Our target variable is binary, with 1 representing misinformation and 0 representing accurate information.

In [None]:
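# standardize labels in datasets -- target convention: 1 (`fake`) and 0 (`real`)
df_liar["label"] = df_liar["label"].map({"fake": 1, "real": 0})
df_news["label"] = df_news["label"].map({"fake": 1, "real": 0})

# the Fake News Classification labels are already integers, with 1 marking real
# articles (e.g. the Reuters rows above) and 0 marking fake ones, so invert them
df_fake_news_class["label"] = df_fake_news_class["label"].map({1: 0, 0: 1})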

# merge datasets
df = pd.concat([df_news, df_fake_news_class, df_liar], ignore_index=True)

# check standardized label distribution
print(df["label"].value_counts())

label
0    50475
1    47801
Name: count, dtype: int64
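
Because the three sources encode their labels differently (strings in two of them, integers in the third), it helps to make each convention explicit before merging. The following library-free sketch shows the idea. `to_binary` and the `source` names are hypothetical, and the integer convention for the Fake News Classification data (1 = real) is an assumption inferred from the sample rows and label counts above.

```python
# Target convention: 1 = misinformation, 0 = accurate information.
STRING_LABELS = {"fake": 1, "real": 0}
# Assumed convention of the integer-labelled source: 1 = real, 0 = fake.
INTEGER_LABELS = {1: 0, 0: 1}


def to_binary(label, source):
    """Map a raw label from `source` onto the shared 0/1 target."""
    mapping = INTEGER_LABELS if source == "fake_news_classification" else STRING_LABELS
    if label not in mapping:
        # Fail loudly instead of silently producing a missing value.
        raise ValueError(f"unexpected label {label!r} for source {source!r}")
    return mapping[label]


raw = [("fake", "liar"), ("real", "fake_and_real_news"), (1, "fake_news_classification")]
print([to_binary(label, source) for label, source in raw])  # [1, 0, 0]
```

Raising on unknown labels is a deliberate choice here: `Series.map` with a dictionary returns `NaN` for any value missing from the dictionary, which is easy to overlook.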


# Data Preprocessing
Our preprocessing steps are applied in the following order:
- converting text to lowercase
- removing URLs, numbers, punctuation, and other special characters
- removing stopwords (extra spaces are collapsed when the remaining words are rejoined)
- lemmatization (converting words to their base form)

## Removing Special Characters (numbers, punctuation, etc.)

In [None]:
def clean_text(text):
    """Lowercase the text, then remove URLs, numbers, and punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # Remove URLs ("http\S+" also covers https)
    text = re.sub(r"\d+", "", text)  # Remove numbers
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    return text

# apply text cleaning function to text column
df["text"] = df["text"].apply(clean_text)
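
To see what these steps do in practice, the sketch below repeats the same three substitutions on a made-up sentence. Two side effects are worth noting: deleting digits and URLs can leave extra spaces behind, and deleting apostrophes turns contractions such as "didn't" into "didnt".

```python
import re


def clean_text(text):
    """Lowercase the text, then strip URLs, digits, and punctuation."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # URLs
    text = re.sub(r"\d+", "", text)  # digits
    text = re.sub(r"[^\w\s]", "", text)  # punctuation
    return text


# Made-up input, for illustration only.
sample = "BREAKING: 3 senators didn't vote! Details at https://example.com/story"
print(repr(clean_text(sample)))
# 'breaking  senators didnt vote details at '
```

The leftover double space and trailing space are harmless in this pipeline because the later steps split on whitespace and rejoin the words with single spaces.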

## Remove Stopwords
Stopwords are common words (e.g., articles and prepositions) that add little meaning to a sentence and can usually be removed safely.

In [None]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in text.split() if word not in stop_words])

# apply remove_stopwords function to text column
df["text"] = df["text"].apply(remove_stopwords)
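
One interaction to be aware of: because punctuation is removed before this step, a contraction such as "didn't" has already become "didnt" and is only filtered if that exact form is in the stop list. The sketch below illustrates this with a tiny hand-picked stop set standing in for NLTK's English list (the set contents are an assumption for illustration only).

```python
# Tiny stand-in for the NLTK stop list, chosen for illustration only.
example_stop_words = {"the", "at", "did", "didn't", "not"}


def remove_stopwords(text, stop_words):
    """Drop every whitespace-separated token that appears in `stop_words`."""
    return " ".join(word for word in text.split() if word not in stop_words)


cleaned = "breaking  senators didnt vote details at "
print(remove_stopwords(cleaned, example_stop_words))
# breaking senators didnt vote details
```

If such leftovers matter, one option is to filter stopwords before stripping punctuation, or to add the apostrophe-free forms to the stop set.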

## Lemmatization
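This process reduces words to their base form (e.g., "cars" → "car"), which can shrink the vocabulary and improve the performance of our models. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "running" are left unchanged by the code below.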

In [None]:
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

# apply lemmatize_text function to text column
df["text"] = df["text"].apply(lemmatize_text)

## Shuffle Data
The merged DataFrame is still ordered by source dataset, so we shuffle the rows to remove any ordering effects before splitting.

In [None]:
# Shuffle the data
df_final = df.sample(frac=1, random_state=42).reset_index(drop=True)

## Train-Test Split
We split the data into training and testing sets. The training set will be used to train the model, while the testing set will be used to evaluate its performance.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_final["text"], df_final["label"], test_size=0.2, random_state=42)
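
For intuition, the following library-free sketch mimics what a shuffled 80/20 split does. `simple_split` is a hypothetical helper and not how scikit-learn implements `train_test_split`. In the notebook itself, `train_test_split` already shuffles by default, and it also accepts a `stratify` argument that keeps the class ratio approximately equal in both sets.

```python
import random


def simple_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for the text column.
texts = [f"document {i}" for i in range(10)]
train, test = simple_split(texts)
print(len(train), len(test))  # 8 2
```

Fixing the seed, as `random_state=42` does above, makes the split reproducible across runs.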

# Feature Extraction
After preprocessing the data, we need to convert the text into a numerical format that machine learning algorithms can use. We will use the TF-IDF vectorizer for this, and the resulting features will serve as the input to our baseline model. Later, we will explore more advanced feature extraction techniques with word embeddings.

## TF-IDF Vectorization
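TF-IDF (Term Frequency-Inverse Document Frequency) is a statistic that reflects how important a word is to a document within a corpus: a word's weight grows with how often it appears in the document and shrinks with the number of documents that contain it. The vectorizer is fit on the training set only and then applied to the test set, so no test-set statistics leak into the features.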

In [None]:
# Convert text into TF-IDF features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Keep only the 5000 most frequent terms
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
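
To make the weighting concrete, here is a small library-free sketch of the computation on a toy corpus. It follows the smoothed variant that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf_rows`, the toy documents, and the plain whitespace tokenization are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(value * value for value in row)) or 1.0
        rows.append([value / norm for value in row])
    return vocabulary, rows


# Toy corpus, for illustration only.
vocabulary, rows = tfidf_rows(["senator vote tax", "senator vote vote", "alien tax hoax"])
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in fewer documents (here `alien` and `hoax`) receive a larger idf, so they weigh more within their document than a term shared across documents, such as `tax`.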