# Project Approach and Description

In this project, we compare two machine learning algorithms, Naïve Bayes (NB) and Logistic Regression (LR), for classifying news articles as "Fake" or "Real" using a fake/real news dataset.

Dataset link: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset/data

The goal is to assess the performance of both models in terms of their accuracy, precision, recall and F1-score.

### 1. Naïve Bayes
Naïve Bayes is a supervised machine learning algorithm based on Bayes' Theorem: it estimates the probability of each class given the input and predicts the most probable one. We chose this model because the Naïve Bayes classifier was taught in the lectures, and we wanted to start from something we were familiar with and compare it with a model that is not taught in the course.

We used the MultinomialNB variant of Naïve Bayes for this project, as it works well with word frequency representations, which is what we want for text classification. This model treats each word as a token that appears with some probability in each class. It uses the word frequencies to compute the class probabilities, and the class with the highest probability is assigned to the news article (Real or Fake).
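
To make the idea concrete, here is a minimal hand-written sketch of the multinomial Naïve Bayes decision rule on raw word counts. The tiny training set, the `log_posterior` and `classify` helpers and the smoothing value `alpha=1.0` are our own illustrative assumptions; the notebook itself relies on scikit-learn's `MultinomialNB` and feeds it TF-IDF values rather than raw counts.

```python
import math
from collections import Counter

# Toy training data (hypothetical, not taken from the Kaggle dataset)
train_docs = [
    ("shocking secret exposed", "Fake"),
    ("shocking hoax exposed", "Fake"),
    ("senate passes budget", "Real"),
    ("senate debates budget plan", "Real"),
]

# Count how often each token appears in each class
token_counts = {"Fake": Counter(), "Real": Counter()}
doc_counts = Counter()
for text, label in train_docs:
    token_counts[label].update(text.split())
    doc_counts[label] += 1

vocabulary = {tok for counts in token_counts.values() for tok in counts}


def log_posterior(text, label, alpha=1.0):
    """Unnormalised log P(label | text) with add-alpha (Laplace) smoothing."""
    log_prob = math.log(doc_counts[label] / sum(doc_counts.values()))  # log prior
    total = sum(token_counts[label].values())
    for tok in text.split():
        if tok not in vocabulary:
            continue  # skip tokens never seen during training
        likelihood = (token_counts[label][tok] + alpha) / (total + alpha * len(vocabulary))
        log_prob += math.log(likelihood)
    return log_prob


def classify(text):
    # The class with the highest (log) probability is assigned to the text
    return max(token_counts, key=lambda label: log_posterior(text, label))


print(classify("shocking hoax"))  # prints Fake
print(classify("senate budget"))  # prints Real
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for long articles.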


### 2. Logistic Regression
Logistic Regression is a statistical model that is mainly used for binary classification problems, which makes it suitable for our project. The model learns a weight for each word (feature). For a given news article, the weighted sum of its features is passed through the sigmoid function to produce a probability between 0 and 1, and the article is classified as real or fake depending on whether that probability exceeds a threshold (0.5 by default).
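
The weighted-sum-then-sigmoid step can be sketched in a few lines. The weights and bias below are made up purely for illustration (a trained model learns them from data), and the helper names are hypothetical. For simplicity the sketch adds a word's weight whenever the word is present, whereas in the notebook each weight is multiplied by the word's TF-IDF value.

```python
import math

# Hypothetical weights, invented for illustration (positive pushes towards "Real")
weights = {"reuters": 2.0, "said": 0.5, "shocking": -1.5, "hoax": -2.0}
bias = 0.0


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_real_probability(tokens):
    # Weighted sum of the tokens that are present, then the sigmoid
    z = bias + sum(weights.get(tok, 0.0) for tok in tokens)
    return sigmoid(z)


def predict_label(tokens, threshold=0.5):
    # Probabilities above the threshold are classified as Real
    return "Real" if predict_real_probability(tokens) > threshold else "Fake"


print(predict_label(["reuters", "said"]))   # prints Real
print(predict_label(["shocking", "hoax"]))  # prints Fake
```

Raising the threshold above 0.5 would make the classifier more cautious about labelling an article as Real, at the cost of flagging more real articles as Fake.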

We chose Logistic Regression as our second model partly out of curiosity about how the model works, and partly because, unlike Naïve Bayes, it does not assume that words are conditionally independent given the class, so it can account for correlated words when learning its weights. It also handles sparse matrices (of words) well and tends to perform better than Naïve Bayes when there is a large amount of training data (which we will see later in the model performance results).

### Reference Links (Learning about the models and their implementation)
- Scikit-learn (Naïve Bayes): https://scikit-learn.org/stable/modules/naive_bayes.html
- GeeksforGeeks (Naïve Bayes): https://www.geeksforgeeks.org/naive-bayes-classifiers/
- Scikit-learn (Logistic Regression): https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
- GeeksforGeeks (Logistic Regression): https://www.geeksforgeeks.org/understanding-logistic-regression/

In [1]:
# Installing required libraries - numpy, pandas and scikit-learn
!pip install --user pandas
!pip install --user numpy
!pip install --user scikit-learn



In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score

import string

## Data Preparation/Cleaning

In [3]:
# Loading in our datasets in the form of DataFrames via pandas
fake_df = pd.read_csv("dataset/Fake.csv") 
real_df = pd.read_csv("dataset/Real.csv")

fake_df.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


In [4]:
# In order to train the model, we need to combine the 2 datasets, and label them as 0 (Fake) or 1 (Real) news.
fake_df["label"] = 0 # for Fake news
real_df["label"] = 1 # for Real news

data = pd.concat([fake_df, real_df], axis=0)

data

| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 | 0 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 | 0 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 | 0 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 | 0 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 | 0 |
| ... | ... | ... | ... | ... | ... |
| 21412 | 'Fully committed' NATO backs new U.S. approach... | BRUSSELS (Reuters) - NATO allies on Tuesday we... | worldnews | August 22, 2017 | 1 |
| 21413 | LexisNexis withdrew two products from Chinese ... | LONDON (Reuters) - LexisNexis, a provider of l... | worldnews | August 22, 2017 | 1 |
| 21414 | Minsk cultural hub becomes haven from authorities | MINSK (Reuters) - In the shadow of disused Sov... | worldnews | August 22, 2017 | 1 |
| 21415 | Vatican upbeat on possibility of Pope Francis ... | MOSCOW (Reuters) - Vatican Secretary of State ... | worldnews | August 22, 2017 | 1 |
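
Note that the row labels of the real articles restart from their own numbering, because `pd.concat` keeps each DataFrame's original index by default. This does not affect the training below, but the following sketch shows the behaviour and one optional way to renumber the rows. The two small DataFrames are hypothetical stand-ins for `fake_df` and `real_df`.

```python
import pandas as pd

# Tiny stand-ins for fake_df and real_df (hypothetical rows)
fake_small = pd.DataFrame({"text": ["fake a", "fake b"], "label": 0})
real_small = pd.DataFrame({"text": ["real a", "real b", "real c"], "label": 1})

# Default behaviour: each frame keeps its own row labels, so 0 and 1 appear twice
kept = pd.concat([fake_small, real_small], axis=0)
print(kept.index.tolist())  # prints [0, 1, 0, 1, 2]

# ignore_index=True renumbers the combined rows from 0 to n-1
renumbered = pd.concat([fake_small, real_small], axis=0, ignore_index=True)
print(renumbered.index.tolist())  # prints [0, 1, 2, 3, 4]

# Quick check of the class balance after combining
label_counts = renumbered["label"].value_counts().to_dict()
print(label_counts)
```

Unique row labels matter mainly if we later look rows up with `.loc`, since a duplicated label would return more than one article.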


In [5]:
# TODO: Ask if we want to take the title into consideration, or just the contents of the news article
# data["content"] = data["title"] + " " + data["text"]  # Uncomment if yes
# data["content"] = data["title"]

data = data[["text", "label"]] # We only need the text and label, so we drop everything else
data

| | text | label |
|---|---|---|
| 0 | Donald Trump just couldn t wish all Americans ... | 0 |
| 1 | House Intelligence Committee Chairman Devin Nu... | 0 |
| 2 | On Friday, it was revealed that former Milwauk... | 0 |
| 3 | On Christmas day, Donald Trump announced that ... | 0 |
| 4 | Pope Francis used his annual Christmas Day mes... | 0 |
| ... | ... | ... |
| 21412 | BRUSSELS (Reuters) - NATO allies on Tuesday we... | 1 |
| 21413 | LONDON (Reuters) - LexisNexis, a provider of l... | 1 |
| 21414 | MINSK (Reuters) - In the shadow of disused Sov... | 1 |
| 21415 | MOSCOW (Reuters) - Vatican Secretary of State ... | 1 |


In [6]:
# Now, let's clean the dataset. We only want to use the words in the text column, so we clean it as follows:

# Work on an explicit copy so that the assignments below do not trigger pandas' SettingWithCopyWarning
data = data.copy()

# 1. Making all words lower case
data["text"] = data["text"].str.lower()

# 2. Removing all punctuation
data['text'] = data['text'].str.translate(str.maketrans('', '', string.punctuation))

# 3. Removing leading and trailing whitespace using .strip()
data['text'] = data['text'].str.strip()

data["text"]

0        donald trump just couldn t wish all americans ...
1        house intelligence committee chairman devin nu...
2        on friday it was revealed that former milwauke...
3        on christmas day donald trump announced that h...
4        pope francis used his annual christmas day mes...
                               ...                        
21412    brussels reuters  nato allies on tuesday welco...
21413    london reuters  lexisnexis a provider of legal...
21414    minsk reuters  in the shadow of disused soviet...
21415    moscow reuters  vatican secretary of state car...
21416    jakarta reuters  indonesia will buy 11 sukhoi ...
Name: text, Length: 44898, dtype: object
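
To see what each cleaning step does, here is the same sequence applied to a single made-up string (the sample sentence is hypothetical, not a row from the dataset). It also shows why double spaces remain in the output above: `.strip()` only trims the ends of the string. The final whitespace-collapsing step is an optional extra that the notebook does not apply.

```python
import string

# Hypothetical sample text with padding, capitals and punctuation
sample = "  LONDON (Reuters) - Officials said: 'No comment.'  "

lowered = sample.lower()
no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
stripped = no_punct.strip()
print(repr(stripped))  # prints 'london reuters  officials said no comment'

# Optional extra step (not part of the notebook): collapse repeated internal spaces
collapsed = " ".join(stripped.split())
print(repr(collapsed))  # prints 'london reuters officials said no comment'
```

The leftover double spaces are harmless here, because the vectorizer used later splits the text into word tokens and ignores the amount of whitespace between them.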

In [7]:
# Splitting the cleaned dataset into training and test datasets

# Explanation:
# X: data["text"], and y: the labels (0: Fake, 1: Real)
# X_train: Training set of the article text. We will feed this into our models so that they can use the data to learn text patterns.
# y_train: Training set of the target labels (0 or 1). The models will use this to learn which text corresponds to real or fake news.
# X_test: Testing set of the article text. We will use this to test our models for accuracy, precision, recall and F1-score.
# y_test: Testing set of the target labels. We will compare the models' predictions with these actual labels during the evaluation stage.

X_train, X_test, y_train, y_test = train_test_split(data["text"], data["label"], test_size=0.2, random_state=30)

# As suggested in Scikit-learn's documentation, we need to vectorize the text data, as ML models only work with numerical data and not text data.
# Vectorization transforms the text into a numerical format that the model can understand and work with.
# This will make the data more structured for the models to process.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000) # We set max_features to 5000 as a default, but it can be smaller. Fewer features = less computational time!
X_train_vectorised = vectorizer.fit_transform(X_train)
X_test_vectorised = vectorizer.transform(X_test)
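
As a small illustration of what the vectorizer produces, the sketch below runs `TfidfVectorizer` with the same settings on a toy corpus. The sentences and the `toy_` variable names are hypothetical. It shows that stop words are dropped, that the vocabulary is learned from the training texts only, and that words first seen at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical sentences, already lower-cased)
toy_train_texts = [
    "the senate votes on the budget",
    "the senate delays the budget",
    "aliens visit the moon",
]
toy_test_texts = ["the senate visits the moon"]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)

# fit_transform learns the vocabulary and IDF weights from the training texts only
toy_train_matrix = toy_vectorizer.fit_transform(toy_train_texts)

# transform reuses that vocabulary; the unseen word "visits" is ignored
toy_test_matrix = toy_vectorizer.transform(toy_test_texts)

print(sorted(toy_vectorizer.vocabulary_))              # learned words, without stop words
print(toy_train_matrix.shape, toy_test_matrix.shape)   # (documents, vocabulary size)
print(toy_test_matrix.nnz)                             # non-zero entries in the test row
```

Each row is a sparse vector with one column per vocabulary word, which is the numerical format that both models are trained on.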

## Naïve Bayes and Logistic Regression Models

In [8]:
# Training the Naïve Bayes Model - via Scikit-learn
nb_model = MultinomialNB()
nb_model.fit(X_train_vectorised, y_train)

MultinomialNB()

In [9]:
# Training the Logistic Regression model - via Scikit-learn
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_vectorised, y_train)

LogisticRegression(max_iter=1000)

In [10]:
# Evaluating the models!
nb_preds = nb_model.predict(X_test_vectorised)
lr_preds = lr_model.predict(X_test_vectorised)

In [11]:
# Let's see the performance of our models

print("Naïve Bayes Accuracy:", accuracy_score(y_test, nb_preds))
print("Logistic Regression Accuracy:", accuracy_score(y_test, lr_preds))
print("")
print("Naïve Bayes Precision:", precision_score(y_test, nb_preds))
print("Logistic Regression Precision:", precision_score(y_test, lr_preds))
print("")
print("Naïve Bayes Recall:", recall_score(y_test, nb_preds))
print("Logistic Regression Recall:", recall_score(y_test, lr_preds))
print("")
print("Naïve Bayes F1 Score:", f1_score(y_test, nb_preds))
print("Logistic Regression F1 Score:", f1_score(y_test, lr_preds))
print("")
print("Naïve Bayes Report:\n", classification_report(y_test, nb_preds))
print("Logistic Regression Report:\n", classification_report(y_test, lr_preds))

Naïve Bayes Accuracy: 0.9237193763919822
Logistic Regression Accuracy: 0.9858574610244989

Naïve Bayes Precision: 0.923696682464455
Logistic Regression Precision: 0.9804741980474198

Naïve Bayes Recall: 0.9148087303449894
Logistic Regression Recall: 0.9899084721896269

Naïve Bayes F1 Score: 0.9192312227331684
Logistic Regression F1 Score: 0.9851687492701157

Naïve Bayes Report:
               precision    recall  f1-score   support

           0       0.92      0.93      0.93      4719
           1       0.92      0.91      0.92      4261

    accuracy                           0.92      8980
   macro avg       0.92      0.92      0.92      8980
weighted avg       0.92      0.92      0.92      8980

Logistic Regression Report:
               precision    recall  f1-score   support

           0       0.99      0.98      0.99      4719
           1       0.98      0.99      0.99      4261

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99 

### Remarks
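Based on the metrics above, Logistic Regression outperforms Naïve Bayes on every measure (for example, an accuracy of about 98.6% versus 92.4%), so we conclude that it is the better option for classifying real and fake news on this dataset.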
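
For readers who want to see where these numbers come from, the sketch below computes accuracy, precision, recall and F1-score by hand from a small set of made-up labels. With their default settings, scikit-learn's `precision_score`, `recall_score` and `f1_score` treat label 1 (Real) as the positive class, and the sketch follows that convention. The toy labels are hypothetical and unrelated to our results.

```python
# Toy labels (hypothetical): 1 = Real, 0 = Fake
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # Real predicted as Real
fp = sum(t == 0 and p == 1 for t, p in pairs)  # Fake predicted as Real
fn = sum(t == 1 and p == 0 for t, p in pairs)  # Real predicted as Fake
tn = sum(t == 0 and p == 0 for t, p in pairs)  # Fake predicted as Fake

accuracy = (tp + tn) / len(y_true)                   # share of all predictions that are correct
precision = tp / (tp + fp)                           # of the articles predicted Real, how many are Real
recall = tp / (tp + fn)                              # of the Real articles, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)  # prints 0.8 0.75 0.75 0.75
```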

## Making this prediction model/function usable!

In [12]:
# Making a function to clean a single piece of text, using the same steps as the dataset cleaning above
def clean_text(text):
    # 1. Making all words lower case
    text = text.lower()

    # 2. Removing all punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # 3. Removing leading and trailing whitespace using .strip()
    text = text.strip()
    
    return text

In [13]:
# Let's make the prediction function which takes in a text (article)
def predict_news(article):
    article_clean = clean_text(article)
    article_vectorised = vectorizer.transform([article_clean])
    nb_result = nb_model.predict(article_vectorised)[0]
    lr_result = lr_model.predict(article_vectorised)[0]
    
    if nb_result == 1:
        print("Naïve Bayes Prediction: Real")
    else:
        print("Naïve Bayes Prediction: Fake")
        
    if lr_result == 1:
        print("Logistic Regression Prediction: Real")
    else:
        print("Logistic Regression Prediction: Fake")
    
    return

In [14]:
predict_news("WASHINGTON (Reuters) - The special counsel investigation of links between Russia and President Trump’s 2016 election campaign should continue without interference in 2018, despite calls from some Trump administration allies and Republican lawmakers to shut it down, a prominent Republican senator said on Sunday. Lindsey Graham, who serves on the Senate armed forces and judiciary committees, said Department of Justice Special Counsel Robert Mueller needs to carry on with his Russia investigation without political interference. “This investigation will go forward. It will be an investigation conducted without political influence,” Graham said on CBS’s Face the Nation news program. “And we all need to let Mr. Mueller do his job. I think he’s the right guy at the right time.”  The question of how Russia may have interfered in the election, and how Trump’s campaign may have had links with or co-ordinated any such effort, has loomed over the White House since Trump took office in January. It shows no sign of receding as Trump prepares for his second year in power, despite intensified rhetoric from some Trump allies in recent weeks accusing Mueller’s team of bias against the Republican president. Trump himself seemed to undercut his supporters in an interview last week with the New York Times in which he said he expected Mueller was “going to be fair.”    Russia’s role in the election and the question of possible links to the Trump campaign are the focus of multiple inquiries in Washington. Three committees of the Senate and the House of Representatives are investigating, as well as Mueller, whose team in May took over an earlier probe launched by the U.S. Federal Bureau of Investigation (FBI). Several members of the Trump campaign and administration have been convicted or indicted in the investigation.  Trump and his allies deny any collusion with Russia during the campaign, and the Kremlin has denied meddling in the election. Graham said he still wants an examination of the FBI’s use of a dossier on links between Trump and Russia that was compiled by a former British spy, Christopher Steele, which prompted Trump allies and some Republicans to question Mueller’s inquiry.   On Saturday, the New York Times reported that it was not that dossier that triggered an early FBI probe, but a tip from former Trump campaign foreign policy adviser George Papadopoulos to an Australian diplomat that Russia had damaging information about former Trump rival Hillary Clinton.  “I want somebody to look at the way the Department of Justice used this dossier. It bothers me greatly the way they used it, and I want somebody to look at it,” Graham said. But he said the Russia investigation must continue. “As a matter of fact, it would hurt us if we ignored it,” he said. ")

Naïve Bayes Prediction: Real
Logistic Regression Prediction: Real


Note: The news article above is sourced from the Real.csv dataset for testing purposes.

In [15]:
predict_news("The special counsel investigation")

Naïve Bayes Prediction: Real
Logistic Regression Prediction: Fake


Note: The short text above is an arbitrary phrase we fed into the models for testing purposes.

### Learning Point
As the model performance metrics above show, the Logistic Regression model performs better than Naïve Bayes on our test set. Hence, in a situation like this one, where Naïve Bayes and Logistic Regression give different predictions, we choose to trust the Logistic Regression model's result.
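
When the two models disagree, it can also help to look at how confident each one is. The sketch below uses `predict_proba`, which both `MultinomialNB` and `LogisticRegression` provide in scikit-learn, to report a probability next to each label. It is a self-contained illustration on a tiny made-up training set: the `toy_` objects and the `predict_with_confidence` helper are hypothetical and separate from the models trained in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical); 0 = Fake, 1 = Real as in the notebook
toy_texts = [
    "shocking hoax exposed by insiders",
    "you will not believe this shocking secret",
    "senate votes on budget plan",
    "officials said the senate will debate the budget",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_features = toy_vectorizer.fit_transform(toy_texts)

toy_nb = MultinomialNB().fit(toy_features, toy_labels)
toy_lr = LogisticRegression(max_iter=1000).fit(toy_features, toy_labels)


def predict_with_confidence(article):
    """Return {model name: (label, probability of that label)} for one text."""
    vector = toy_vectorizer.transform([article])
    results = {}
    for name, model in [("Naïve Bayes", toy_nb), ("Logistic Regression", toy_lr)]:
        probabilities = model.predict_proba(vector)[0]  # ordered like model.classes_
        best = probabilities.argmax()
        label = "Real" if model.classes_[best] == 1 else "Fake"
        results[name] = (label, float(probabilities[best]))
    return results


print(predict_with_confidence("senate budget"))
```

A probability close to 0.5 suggests that the input gives the model very little to go on, which is likely for very short texts.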