# TruthLens Modelling - Phase 1: Binary Classification
The aim of Phase 1 is to classify text as real or fake. Content classified as fake moves on to Phase 2 for multiclass classification, so this stage essentially filters out real news.

The dataset used is the Misinformation & Fake News text dataset, which has already been cleaned and preprocessed (see "TruthLens Data Cleaning" notebook).

In [35]:
#imports
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy.sparse import hstack, csr_matrix
import time
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import csv
import random
import pickle
pd.set_option('display.max_colwidth', None)
from lime.lime_text import LimeTextExplainer

### Feature Extraction Using TF-IDF, n-grams and readability metrics

In [36]:
#load data
df = pd.read_csv('Data/phase1_final_clean.csv')
df = df.reset_index(drop=True)
print(df.head(3))


In [37]:
#TF-IDF feature extraction with n-grams
start_time = time.time()
#replace NaN values with an empty string to resolve NaN ValueError
df['content_lemma_nostop'] = df['content_lemma_nostop'].fillna('')
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_tfidf = vectorizer.fit_transform(df['content_lemma_nostop'])
#get pre-calculated readability features
readability_features = df[['word_count', 'sentence_count', 'flesch_reading_ease']].values
#standardise readability features
scaler = StandardScaler()
readability_scaled = scaler.fit_transform(readability_features)
#convert to a sparse matrix
readability_sparse = csr_matrix(readability_scaled)
#combine TF-IDF features with the readability metrics
#request CSR format explicitly so that rows can be indexed when splitting
X_combined = hstack([X_tfidf, readability_sparse], format='csr')

y = df['label']
print("Feature extraction: {:.4f} seconds".format(time.time() - start_time))

Feature extraction: 471.0503 seconds
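
To make the resulting column layout concrete, here is a minimal sketch of the same combination step on toy data. The documents and numeric values are made up for illustration, and `max_features=20` is chosen only to keep the example small. The combined matrix has the TF-IDF columns first, followed by the three scaled readability columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy documents and made-up readability values (hypothetical, for illustration only)
docs = ["senate pass budget bill", "alien secretly run senate", "budget bill delay vote"]
numeric = np.array([[4, 1, 60.0], [5, 2, 45.0], [6, 1, 70.0]])

# Unigrams, bigrams and trigrams, capped at a small vocabulary
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=20)
X_text = vec.fit_transform(docs)

# Standardise the numeric columns and convert them to a sparse matrix
X_num = csr_matrix(StandardScaler().fit_transform(numeric))

# Stack side by side: text columns first, numeric columns last
X_all = hstack([X_text, X_num], format="csr")
print(X_text.shape, X_num.shape, X_all.shape)
```

With the notebook's settings this layout gives 5000 TF-IDF columns plus 3 readability columns, which is worth keeping in mind whenever new data is transformed for the model.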


### Split dataset

In [38]:
#retain the indices as we need these for looking up explanations later
train_indices, test_indices = train_test_split(df.index, test_size=0.2, random_state=999)
#split X and y using the train/test indices
X_train = X_combined[train_indices]
X_test = X_combined[test_indices]
y_train = y.iloc[train_indices]
y_test = y.iloc[test_indices]
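
Note that the vectorizer and scaler above are fitted on the full dataset before the split, so the test rows contribute to the vocabulary, IDF weights and scaling statistics. If a stricter evaluation is wanted, one option is to split first and fit the transformers on the training rows only. A minimal sketch on toy data follows; the frame, the column names `text` and `n_words`, and the values are all hypothetical.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the cleaned dataset
toy = pd.DataFrame({
    "text": ["real story one", "fake claim two", "real report three",
             "fake rumour four", "real update five", "fake hoax six"],
    "n_words": [3, 5, 4, 6, 3, 7],
    "label": [0, 1, 0, 1, 0, 1],
})

# Split the row indices first, as in the cell above
train_idx, test_idx = train_test_split(toy.index, test_size=0.33, random_state=999)
train, test = toy.loc[train_idx], toy.loc[test_idx]

# Fit the transformers on the training rows only
vec = TfidfVectorizer()
sc = StandardScaler()
X_train = hstack([vec.fit_transform(train["text"]),
                  csr_matrix(sc.fit_transform(train[["n_words"]]))], format="csr")

# Reuse the fitted transformers on the test rows (transform, not fit_transform)
X_test = hstack([vec.transform(test["text"]),
                 csr_matrix(sc.transform(test[["n_words"]]))], format="csr")
print(X_train.shape, X_test.shape)
```

Whether this changes the reported accuracy on the real data is not known here; the sketch only shows the ordering of the steps.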

### Phase 1 Modelling
I will test three different models on the data to see which is best. The models to be tested are:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)

#### Logistic Regression

In [39]:
start_time = time.time()
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Fit Logistic Regression model: {:.4f} seconds".format(time.time() - start_time))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Fit Logistic Regression model: 23.1854 seconds
Accuracy: 0.9419773508079908
Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.93      0.93      6931
           1       0.94      0.95      0.95      8787

    accuracy                           0.94     15718
   macro avg       0.94      0.94      0.94     15718
weighted avg       0.94      0.94      0.94     15718



#### Random Forest

In [9]:
start_time = time.time()
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Fit Random Forest model: {:.4f} seconds".format(time.time() - start_time))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_pred))

Fit Random Forest model: 618.2777 seconds
Random Forest Accuracy: 0.9505026084743606
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94      6931
           1       0.95      0.96      0.96      8787

    accuracy                           0.95     15718
   macro avg       0.95      0.95      0.95     15718
weighted avg       0.95      0.95      0.95     15718



#### Support Vector Machine (SVM)

In [34]:
start_time = time.time()
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("Fit SVM model: {:.4f} seconds".format(time.time() - start_time))
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("SVM Classification Report:\n", classification_report(y_test, svm_pred))

Fit SVM model: 4321.4944 seconds
SVM Accuracy: 0.9504389871484922
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.93      0.94      6931
           1       0.95      0.96      0.96      8787

    accuracy                           0.95     15718
   macro avg       0.95      0.95      0.95     15718
weighted avg       0.95      0.95      0.95     15718



#### Conclusion
For Phase 1 to be successful according to our success metrics, we want an accuracy of at least 90%. All three algorithms exceeded that threshold.

Random Forest and SVM both reached an accuracy of 0.95, while Logistic Regression was slightly lower at 0.94. However, once computational efficiency is taken into account, Logistic Regression is the preferred model. Fitting it and predicting on the test set took just over 23 seconds, compared to 618 seconds (about 10 minutes) for Random Forest and a whopping 4321 seconds (about 72 minutes) for SVM.
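
The accuracy and run time figures discussed here could also be collected in a single loop, which makes the trade-off easier to compare side by side. The following is a sketch on synthetic data from `make_classification`, not on the TruthLens dataset; the sample sizes and `n_estimators=50` are illustrative choices only.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the real features)
X, y = make_classification(n_samples=300, n_features=20, random_state=999)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=999)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=999),
    "SVM": SVC(),
}

results = {}
for name, clf in models.items():
    # Time fitting and prediction together, as the cells above do
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    elapsed = time.perf_counter() - start
    results[name] = {"accuracy": accuracy_score(y_te, pred), "seconds": elapsed}
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, time={elapsed:.3f}s")
```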

Next, we tune the hyperparameters of the logistic regression model to try to increase its accuracy.

### Tweak model

In [40]:
start_time = time.time()
param_grid = {'C': [0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print("Grid Search: {:.4f} seconds".format(time.time() - start_time))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best parameters: {'C': 10}
Grid Search: 660.9683 seconds


In [None]:
start_time = time.time()
#we're going to test for the best combo of regularisation type (l1 or l2), C values, and solvers
param_grid = [
    {
        #l1 regularisation
        'penalty': ['l1'],
        'C': [0.1, 1, 10],
        'solver': ['liblinear', 'saga'] 
    },
    {
        #l2 regularisation
        'penalty': ['l2'],
        'C': [0.1, 1, 10],
        'solver': ['newton-cg', 'lbfgs', 'sag']
    }
]

#set up the grid search
grid = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=5)
grid.fit(X_train, y_train)

#print the best parameters and best score, and the time it took
print("Best parameters:", grid.best_params_)
print("Best cross-validation score:", grid.best_score_)
print("Grid Search: {:.4f} seconds".format(time.time() - start_time))
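
Beyond `best_params_` and `best_score_`, the per-combination scores of a fitted grid search are available in `grid.cv_results_`, which can help when writing up the tuning results. A minimal sketch on synthetic data, using a deliberately smaller grid than the one above (only `C` and two solvers) so that it runs quickly:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the real features)
X, y = make_classification(n_samples=200, n_features=10, random_state=999)

# Smaller illustrative grid: 3 values of C x 2 solvers = 6 combinations
param_grid = {"C": [0.1, 1, 10], "solver": ["lbfgs", "liblinear"]}
grid = GridSearchCV(LogisticRegression(max_iter=2000), param_grid, cv=5)
grid.fit(X, y)

# One row per parameter combination, best-ranked first
columns = ["param_solver", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
summary = pd.DataFrame(grid.cv_results_)[columns].sort_values("rank_test_score")
print(summary.to_string(index=False))
```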



#### Tuning Results
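The first grid search selected C=10 as the best value, although some fits stopped at the max_iter=1000 limit before converging. Results for the second, wider search over regularisation type and solver are still to be added.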

### Create and save final Phase 1 model

In [11]:
#instantiate the final model with the best hyperparameters as found by the grid search
final_model = LogisticRegression(C=10, max_iter=1000)

#fit the model
final_model.fit(X_train, y_train)

#evaluate on the test set
final_predictions = final_model.predict(X_test)
print("Final Model Accuracy:", accuracy_score(y_test, final_predictions))
print("Final Model Classification Report:\n", classification_report(y_test, final_predictions))

Final Model Accuracy: 0.9487212113500445
Final Model Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.94      0.94      6931
           1       0.95      0.96      0.95      8787

    accuracy                           0.95     15718
   macro avg       0.95      0.95      0.95     15718
weighted avg       0.95      0.95      0.95     15718



In [None]:
#save our model
with open('final_model.pkl', 'wb') as file:
    pickle.dump(final_model, file)
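
The model on its own is not enough to score new text later, because the fitted `TfidfVectorizer` and `StandardScaler` are needed to rebuild the same features. One option is to pickle all three together. A sketch with toy objects follows; the file name `phase1_bundle.pkl` and the dictionary keys are hypothetical, not part of the notebook.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data (made up for illustration)
docs = ["real story", "fake claim", "real report", "fake hoax"]
lengths = np.array([[3.0], [5.0], [4.0], [6.0]])
labels = [0, 1, 0, 1]

# Fit the transformers and the model
vec = TfidfVectorizer().fit(docs)
sc = StandardScaler().fit(lengths)
X = hstack([vec.transform(docs), csr_matrix(sc.transform(lengths))], format="csr")
clf = LogisticRegression().fit(X, labels)

# Save the model together with the fitted transformers
path = os.path.join(tempfile.mkdtemp(), "phase1_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump({"vectorizer": vec, "scaler": sc, "model": clf}, f)

# Load the bundle and rebuild the features with the restored transformers
with open(path, "rb") as f:
    bundle = pickle.load(f)
X_again = hstack([bundle["vectorizer"].transform(docs),
                  csr_matrix(bundle["scaler"].transform(lengths))], format="csr")
restored_pred = bundle["model"].predict(X_again)
print(restored_pred)
```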

### Validate model
Extra data from https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k

In [33]:
#load in validation dataset
external_df = pd.read_csv('Data/extra_data_final_clean.csv')
external_df = external_df.dropna(subset=['content_lemma_nostop']).reset_index(drop=True)
external_text = external_df['content_lemma_nostop']
external_labels = external_df['label']

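# Transform the external text using the already fitted vectorizer
X_external_tfidf = vectorizer.transform(external_text)

# The model was trained on 5000 TF-IDF features plus 3 scaled readability features.
# Passing the TF-IDF matrix alone raised:
# ValueError: X has 5000 features, but LogisticRegression is expecting 5003 features as input.
# Assumption: the external file has the same readability columns as the training data.
external_readability = external_df[['word_count', 'sentence_count', 'flesch_reading_ease']].values
external_readability_scaled = scaler.transform(external_readability)
X_external = hstack([X_external_tfidf, csr_matrix(external_readability_scaled)], format='csr')

# Predict using the trained model
external_predictions_lr = final_model.predict(X_external)

# Evaluate the performance on the external dataset
print("External Dataset Classification Report:\n", classification_report(external_labels, external_predictions_lr))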
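
The model was trained on TF-IDF columns plus scaled readability columns, so inputs at prediction time need the same column layout. A quick comparison of the matrix width against the model's `n_features_in_` attribute catches a mismatch before `predict` is called. A minimal sketch on toy data, with a single made-up numeric column standing in for the readability features:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy training data (made up for illustration)
train_docs = ["real story", "fake claim", "real report", "fake hoax"]
train_extra = np.array([[3.0], [5.0], [4.0], [6.0]])
labels = [0, 1, 0, 1]

vec = TfidfVectorizer().fit(train_docs)
sc = StandardScaler().fit(train_extra)
X_train = hstack([vec.transform(train_docs),
                  csr_matrix(sc.transform(train_extra))], format="csr")
clf = LogisticRegression().fit(X_train, labels)

# New data: text-only features are one column short of what the model expects
new_docs = ["fake story", "real claim"]
new_extra = np.array([[4.0], [5.0]])
X_text_only = vec.transform(new_docs)
X_full = hstack([X_text_only, csr_matrix(sc.transform(new_extra))], format="csr")

text_only_matches = X_text_only.shape[1] == clf.n_features_in_
full_matches = X_full.shape[1] == clf.n_features_in_
print(clf.n_features_in_, X_text_only.shape[1], X_full.shape[1])

# Only the full matrix can be passed to predict
preds = clf.predict(X_full)
```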