# TruthLens Modelling - Phase 2: Multi-class Classification
The aim of Phase 2 is to further classify text that has already been flagged as "fake" into one of four types of fake news. These four classes - Fabricated, Polarised, Satire and Commentary - are a reduced adaptation of the Molina et al. Disinformation Taxonomy.

The dataset used in this phase is the custom dataset I created, which has already been cleaned and preprocessed (see the "TruthLens Data Collection" and "TruthLens Data Cleaning" notebooks).


In [None]:
!pip install tensorflow==2.12.0

In [1]:
#required imports
import pandas as pd
import numpy as np
import spacy
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel
from textblob import TextBlob
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report, f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
import time
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

#set a seed value for reproducibility
np.random.seed(999)

In [None]:
#spaCy's small English model download
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

In [3]:
#load data
df = pd.read_csv('Data/phase2_final_clean.csv')
df = df.reset_index(drop=True)
print(df.head(3))
print("-" * 50)
print("Class distribution:")
print(df['label'].value_counts(), "\n")
print("-" * 50)
print("Dataset Information:")
print(df.info(), "\n")

                                             content  label  word_count  \
0  Perdue Announces Initiative To Even The Playin...      2         207   
1  Met Police just BLOCKED a pro-Palestine protes...      1         591   
2  Here's the moment Mark Zuckerberg gave away th...      1         515   

   sentence_count  flesch_reading_ease  \
0               8                45.19   
1              22                35.91   
2              25                50.67   

                                       content_lemma  \
0  Perdue Announces Initiative To Even The Playin...   
1  Met Police just BLOCKED a pro-Palestine protes...   
2  Here 's the moment Mark Zuckerberg give away t...   

                                content_lemma_nostop  
0  perdue announces initiative even playing field...  
1  met police blocked propalestine protest march ...  
2  moment mark zuckerberg give away game like res...  
--------------------------------------------------
Class distribution:
label
2    400

In [4]:
X = df['content_lemma']
y = df['label']

#Stratified train-test split helps maintain class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
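
As a quick sanity check on the stratified split, here is a minimal sketch on toy labels. The `toy_` names and the four balanced classes of 400 rows are assumptions for illustration only; this is not the TruthLens data itself.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 4 balanced classes, 400 rows each (assumed)
toy_y = np.repeat([0, 1, 2, 3], 400)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratify, each class keeps the same share in both partitions
train_counts = np.bincount(toy_y_train)
test_counts = np.bincount(toy_y_test)
print("train:", train_counts)
print("test: ", test_counts)
```

Running the same `np.bincount` check on `y_train` and `y_test` should show whether the real split is balanced in the same way.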

### Generate baseline
We will generate a simple baseline using Logistic Regression and TF-IDF features.

In [5]:
start_time = time.time()
#pipeline - creates TF-IDF features then creates the logistic regression model
baseline_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression(max_iter=1000, random_state=42))
])

#stratified k-fold cross-validation - this ensures each fold has a similar class distribution
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

#generate predictions
predicted_labels = cross_val_predict(baseline_pipeline, X_train, y_train, cv=skf, method="predict")

#calculate f1 score
f1_macro = f1_score(y_train, predicted_labels, average="macro")
print("Logistic Regression Macro F1 Score:", f1_macro)

#get classification report
report = classification_report(y_train, predicted_labels)
print("\nClassification Report for Logistic Regression:\n", report)

print("Run time: {:.4f} seconds".format(time.time() - start_time))

Logistic Regression Macro F1 Score: 0.858466261722639

Classification Report for Logistic Regression:
               precision    recall  f1-score   support

           0       0.74      0.86      0.79       320
           1       0.83      0.72      0.77       320
           2       0.92      0.87      0.89       320
           3       0.96      0.98      0.97       320

    accuracy                           0.86      1280
   macro avg       0.86      0.86      0.86      1280
weighted avg       0.86      0.86      0.86      1280

Run time: 16.1295 seconds


### Choose best model
Next we will test three different models to see which performs the best.
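
Each model in this section is evaluated with the same steps: TF-IDF features, stratified cross-validated predictions, macro F1, a classification report and a run time. One possible way to wrap those steps is sketched below. `evaluate_classifier` and the `demo_` variables are hypothetical names I am introducing here, and the tiny corpus is made up purely to show the call.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import Pipeline


def evaluate_classifier(name, clf, X, y, cv, max_features=5000):
    """Cross-validate a TF-IDF + classifier pipeline; print and return macro F1."""
    start = time.time()
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=max_features)),
        ("clf", clf),
    ])
    # Out-of-fold predictions, so every row is scored by a model that never saw it
    preds = cross_val_predict(pipeline, X, y, cv=cv, method="predict")
    macro_f1 = f1_score(y, preds, average="macro")
    print(f"{name} Macro F1 Score: {macro_f1}")
    print(f"\nClassification Report for {name}:\n", classification_report(y, preds))
    print("Run time: {:.4f} seconds".format(time.time() - start))
    return macro_f1


# Tiny made-up corpus purely to show the call; not the TruthLens data
demo_X = ["breaking shock claim"] * 6 + ["opinion piece argues"] * 6
demo_y = [0] * 6 + [1] * 6
demo_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
demo_f1 = evaluate_classifier(
    "Logistic Regression", LogisticRegression(max_iter=1000), demo_X, demo_y, demo_cv
)
```

The cells below keep the explicit version so that each model's code can be read on its own.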

#### Multinomial Naive Bayes

In [6]:
start_time = time.time()

mnb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", MultinomialNB())
])

#generate predictions
predicted_labels_mnb = cross_val_predict(mnb_pipeline, X_train, y_train, cv=skf, method="predict")

#print results
f1_macro_mnb = f1_score(y_train, predicted_labels_mnb, average="macro")
print("Multinomial Naive Bayes Macro F1 Score:", f1_macro_mnb)
print("\nClassification Report for Multinomial Naive Bayes:\n", classification_report(y_train, predicted_labels_mnb))

print("Run time: {:.4f} seconds".format(time.time() - start_time))

Multinomial Naive Bayes Macro F1 Score: 0.6195257542552863

Classification Report for Multinomial Naive Bayes:
               precision    recall  f1-score   support

           0       0.47      0.88      0.62       320
           1       0.61      0.59      0.60       320
           2       0.82      0.64      0.72       320
           3       1.00      0.37      0.54       320

    accuracy                           0.62      1280
   macro avg       0.73      0.62      0.62      1280
weighted avg       0.73      0.62      0.62      1280

Run time: 9.1392 seconds


#### XGBoost

In [7]:
start_time = time.time()
xgb_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", XGBClassifier(eval_metric='mlogloss', random_state=42))
])

#generate predictions
predicted_labels_xgb = cross_val_predict(xgb_pipeline, X_train, y_train, cv=skf, method="predict")

#print results
f1_macro_xgb = f1_score(y_train, predicted_labels_xgb, average="macro")
print("XGBoost Macro F1 Score:", f1_macro_xgb)
print("\nClassification Report for XGBoost:\n", classification_report(y_train, predicted_labels_xgb))

print("Run time: {:.4f} seconds".format(time.time() - start_time))

XGBoost Macro F1 Score: 0.8723158805931358

Classification Report for XGBoost:
               precision    recall  f1-score   support

           0       0.79      0.85      0.82       320
           1       0.82      0.75      0.78       320
           2       0.89      0.91      0.90       320
           3       0.99      0.98      0.99       320

    accuracy                           0.87      1280
   macro avg       0.87      0.87      0.87      1280
weighted avg       0.87      0.87      0.87      1280

Run time: 125.7894 seconds


#### Feedforward neural network

In [9]:
start_time = time.time()


#transformer to convert a sparse matrix (TFIDF) to a dense array because neural networks need dense arrays
class DenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        #toarray() returns a plain ndarray, whereas todense() returns an np.matrix
        return X.toarray()

#feed forward neural network
def create_ffnn_model(input_dim):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_dim=input_dim))
    model.add(Dropout(0.5))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(4, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model

nn_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("to_dense", DenseTransformer()),
    ("clf", KerasClassifier(build_fn=lambda: create_ffnn_model(5000),
                              epochs=5, batch_size=32, verbose=0))
])

#generate predictions
predicted_labels_nn = cross_val_predict(nn_pipeline, X_train, y_train, cv=skf, method="predict")

#print results
f1_macro_nn = f1_score(y_train, predicted_labels_nn, average="macro")
print("Feedforward Neural Network Macro F1 Score:", f1_macro_nn)
print("\nClassification Report for Feedforward Neural Network:\n", classification_report(y_train, predicted_labels_nn))
print("Run time: {:.4f} seconds".format(time.time() - start_time))

Feedforward Neural Network Macro F1 Score: 0.8497909187063192

Classification Report for Feedforward Neural Network:
               precision    recall  f1-score   support

           0       0.71      0.87      0.78       320
           1       0.83      0.70      0.76       320
           2       0.90      0.85      0.88       320
           3       0.99      0.98      0.99       320

    accuracy                           0.85      1280
   macro avg       0.86      0.85      0.85      1280
weighted avg       0.86      0.85      0.85      1280

Run time: 16.0026 seconds


#### Conclusion
The baseline macro F1 score of 0.86 with Logistic Regression set a high bar. Class 3 is the strongest class (F1 of 0.97), while classes 0 and 1 lag behind with F1 scores of 0.79 and 0.77 respectively, though both are still well above our success metric of 0.6 for each class.
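
To make the comparison against the success metric explicit, here is a small sketch using the per-class F1 scores from the Logistic Regression report. The values are the rounded figures from the report, so the recomputed macro F1 differs slightly from the unrounded 0.858. The variable names are my own.

```python
# Per-class F1 scores copied from the Logistic Regression report (rounded to 2 d.p.)
lr_class_f1 = {0: 0.79, 1: 0.77, 2: 0.89, 3: 0.97}
SUCCESS_THRESHOLD = 0.6  # per-class success metric

# Macro F1 is the unweighted mean of the per-class F1 scores
lr_macro_f1 = sum(lr_class_f1.values()) / len(lr_class_f1)

# Classes that fall below the success threshold, and the weakest class overall
failing = [c for c, f1 in lr_class_f1.items() if f1 < SUCCESS_THRESHOLD]
weakest = min(lr_class_f1, key=lr_class_f1.get)

print(f"Macro F1 from rounded scores: {lr_macro_f1:.3f}")
print("Classes below threshold:", failing)
print("Weakest class:", weakest)
```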

The worst performing model by far was Multinomial Naive Bayes with a macro F1 score of 0.62. Precision and recall are badly mismatched across classes: class 0 has high recall (0.88) but low precision (0.47), while class 3 is the opposite (precision 1.00, recall 0.37). This suggests that the naive assumption that each feature is independent of other features doesn't hold well here.
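
F1 is the harmonic mean of precision and recall, so a large gap between the two pulls the score down even when one of them is very high. A minimal sketch using the rounded figures from the Multinomial Naive Bayes report (`f1_from_pr` is a helper name I made up; results can differ from the report in the second decimal place because the inputs are rounded):

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the Multinomial Naive Bayes report
mnb_pr = {0: (0.47, 0.88), 3: (1.00, 0.37)}
for cls, (p, r) in mnb_pr.items():
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} F1~{f1_from_pr(p, r):.2f}")
```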

The feedforward neural network is competitive with the others with a macro F1 of 0.85, but given that it shows no real advantage, and the dataset is relatively small with only 1,600 rows, the added complexity of a neural network isn't justified.

XGBoost had the best macro F1 score of 0.87 with fairly balanced performance across the classes. However, its runtime is significantly higher than the other models tested, coming in at roughly 126 seconds while the others all finished in under 17 seconds.
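
To put the trade-off between score and run time side by side, here is a rough sketch using the macro F1 scores and run times printed in the cell outputs above (rounded to four decimal places). Run times will vary between machines and runs, so the ratios are only indicative.

```python
# (macro F1, run time in seconds) copied from the cell outputs above
results = {
    "Logistic Regression": (0.8585, 16.1295),
    "Multinomial Naive Bayes": (0.6195, 9.1392),
    "XGBoost": (0.8723, 125.7894),
    "Feedforward NN": (0.8498, 16.0026),
}

# Compare every model against the Logistic Regression baseline
base_f1, base_time = results["Logistic Regression"]
for name, (f1, seconds) in results.items():
    print(f"{name:<24} F1 diff vs baseline: {f1 - base_f1:+.4f}  "
          f"run time: {seconds / base_time:.1f}x baseline")
```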

For the next phase I will take both Logistic Regression and XGBoost and experiment with some feature engineering to see if I can get improvements in the weaker classes (0 and 1).