# **Project on Fake and Real News Detection using Naive Bayes**

## Import Libraries
In this step, we import the libraries needed for data manipulation, text vectorization, modeling, and evaluation: pandas and NumPy for data handling, and Scikit-learn for vectorization, the classifiers, and the metrics.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

## Load dataset
We load the dataset with `pd.read_csv()`. Make sure the path points to your dataset file. The dataset needs at least one column of text and one column of labels; here these are `text` and `label`, with label values 'FAKE' or 'REAL'.

In [3]:
df = pd.read_csv("fake_or_real_news_data.csv")
df

,id,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [4]:
df.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [5]:
df.shape

(6335, 4)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6335 entries, 0 to 6334
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      6335 non-null   int64 
 1   title   6335 non-null   object
 2   text    6335 non-null   object
 3   label   6335 non-null   object
dtypes: int64(1), object(3)
memory usage: 198.1+ KB


## Preprocess the Data
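We check for missing values and drop any rows that contain them (this dataset has none, as the output below shows). Then we define the features (`x`) and the target labels (`y`). Because the labels are strings, we encode them with `LabelEncoder`, which assigns integer codes in sorted label order, for compatibility with Scikit-learn models.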

In [8]:
# Check for missing values
print(df.isnull().sum())

# Drop or fill missing values if needed
df.dropna(inplace=True)

# Define features (x) and target (y)
x = df['text']  # Replace with the name of your text column
y = df['label']  # Replace with the name of your label column

# Encode the labels if they are not numerical
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

id       0
title    0
text     0
label    0
dtype: int64
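
To see which integer stands for which label, the fitted encoder can be inspected. The sketch below uses a small hand-written label list as a stand-in for `df['label']`; the variable names are illustrative and not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df['label'] (illustrative values only)
labels = ["FAKE", "REAL", "REAL", "FAKE"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; the label at position i is encoded as i
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)            # {'FAKE': 0, 'REAL': 1}
print(encoded.tolist())   # [0, 1, 1, 0]

# inverse_transform turns integer predictions back into readable labels
decoded = encoder.inverse_transform([1, 0])
print(decoded.tolist())   # ['REAL', 'FAKE']
```

Applied to this notebook, the same call on `label_encoder` would let predictions be reported as 'FAKE' or 'REAL' instead of 0 or 1.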


## Convert Text Data to Numeric Form
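We use the `CountVectorizer` from Scikit-learn to convert our text data into a bag-of-words representation: each document becomes a row of word counts over the vocabulary. This gives the Naive Bayes model a numerical input. Note that the vectorizer here is fit on the full dataset before the train/test split, so its vocabulary also includes words that occur only in test documents; fitting it on the training split alone would give a stricter evaluation.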

In [9]:
# Convert text data into a bag-of-words model
vectorizer = CountVectorizer()
x_vectorized = vectorizer.fit_transform(x)

## Split the Data
We split our dataset into training and testing subsets using `train_test_split`. This helps us train our model on one part of the data and test it on a separate part to evaluate its performance.

In [10]:
# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_vectorized, y, test_size=0.2, random_state=42)

In [11]:
x_train.shape

(5068, 67659)

In [12]:
x_test.shape

(1267, 67659)

In [13]:
x_test

<1267x67659 sparse matrix of type '<class 'numpy.int64'>'
	with 444948 stored elements in Compressed Sparse Row format>

In [14]:
x_train

<5068x67659 sparse matrix of type '<class 'numpy.int64'>'
	with 1713334 stored elements in Compressed Sparse Row format>

In [15]:
y_train

array([1, 0, 0, ..., 0, 1, 0])
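
One possible variation, not what this notebook does, is to split the raw text first and let a pipeline fit the vectorizer on the training portion only. The sketch below assumes a tiny invented corpus and invented labels; `make_pipeline` and the `stratify` argument are standard Scikit-learn features, while all variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented mini-corpus: 1 = REAL, 0 = FAKE (labels are illustrative)
texts = [
    "senate passes budget bill",
    "court upholds voting law",
    "governor signs education reform",
    "mayor announces transit plan",
    "aliens endorse secret candidate",
    "miracle cure hidden by elites",
    "shocking hoax exposed online",
    "celebrity clone runs for office",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw strings, keeping the class balance in both parts
raw_train, raw_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits CountVectorizer on raw_train only, then trains the classifier
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
pipe.fit(raw_train, lab_train)

preds = pipe.predict(raw_test)   # test text is transformed with the training vocabulary
learned_vocab = set(pipe.named_steps["countvectorizer"].vocabulary_)
print(len(learned_vocab), preds)
```

With this arrangement, words that appear only in the test documents are ignored at prediction time, which mirrors how the model would treat unseen articles.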

## Train the Naive Bayes Model
We initialize and train a `MultinomialNB` (Multinomial Naive Bayes) model using the training data. This model is suitable for classification with discrete features like word counts.

In [16]:
# Initialize and train the model
model = MultinomialNB()
model.fit(x_train, y_train)

# Output model details
print("Model trained successfully!")

Model trained successfully!
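
To make the word-count model more concrete, the following sketch recomputes the per-class word probabilities by hand on a tiny invented count matrix and compares them with what `MultinomialNB` stores in `feature_log_prob_`. It assumes the default additive smoothing `alpha=1.0`; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented counts: rows are documents, columns are three vocabulary words
X_toy = np.array([[2, 1, 0],
                  [1, 0, 0],
                  [0, 1, 3],
                  [0, 0, 2]])
y_toy = np.array([0, 0, 1, 1])

alpha = 1.0
nb = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

manual = []
for c in (0, 1):
    word_totals = X_toy[y_toy == c].sum(axis=0)   # count of each word in class c
    # Smoothed estimate of P(word | class): add alpha to every word count
    smoothed = (word_totals + alpha) / (word_totals.sum() + alpha * X_toy.shape[1])
    manual.append(np.log(smoothed))
manual_log_prob = np.vstack(manual)

matches = np.allclose(manual_log_prob, nb.feature_log_prob_)
print(matches)
```

The smoothing term keeps a word that never occurs in one class from driving that class's probability to zero.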


## Evaluate the Naive Bayes Model
After training, we use the model to make predictions on the test set. We then evaluate the model’s accuracy and generate a detailed classification report that includes precision, recall, and F1-score. Because `LabelEncoder` sorts the labels, class 0 in the report is FAKE and class 1 is REAL.

In [17]:
# Make predictions on the test set
y_pred = model.predict(x_test)

# Evaluate the predictions
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Display a detailed classification report
print(classification_report(y_test, y_pred))

Accuracy: 0.90
              precision    recall  f1-score   support

           0       0.92      0.87      0.89       628
           1       0.88      0.93      0.90       639

    accuracy                           0.90      1267
   macro avg       0.90      0.90      0.90      1267
weighted avg       0.90      0.90      0.90      1267
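
The precision and recall columns can be traced back to a confusion matrix. The sketch below uses ten invented true/predicted labels (not the notebook's test set) to show how the two metrics follow from the four cell counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration: 0 = FAKE, 1 = REAL
true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = (int(v) for v in confusion_matrix(true, pred).ravel())
print(tn, fp, fn, tp)   # 3 1 2 4

# Precision: of everything predicted as class 1, how much was right
manual_precision = tp / (tp + fp)
# Recall: of everything that truly is class 1, how much was found
manual_recall = tp / (tp + fn)

print(manual_precision, precision_score(true, pred))
print(manual_recall, recall_score(true, pred))
```

The same call, `confusion_matrix(y_test, y_pred)`, would give the corresponding counts for the model above.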



## Logistic Regression Model
Logistic Regression is a linear model commonly used for binary classification. It estimates the probability that a given input belongs to a particular class.

In [18]:
from sklearn.linear_model import LogisticRegression

# Initialize and train Logistic Regression
log_model = LogisticRegression(max_iter=1000)
log_model.fit(x_train, y_train)

# Predict and evaluate
y_pred_log = log_model.predict(x_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log))
print(classification_report(y_test, y_pred_log))

Logistic Regression Accuracy: 0.9171270718232044
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       628
           1       0.92      0.92      0.92       639

    accuracy                           0.92      1267
   macro avg       0.92      0.92      0.92      1267
weighted avg       0.92      0.92      0.92      1267
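
As a sketch of how the probability estimate arises, the example below fits a logistic regression on four invented points and checks that passing the linear score through the sigmoid function reproduces `predict_proba` for the positive class. The data and names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented two-feature data, two examples per class
X_toy = np.array([[3.0, 0.0], [2.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Linear score w.x + b for each example
scores = clf.decision_function(X_toy)

# The sigmoid maps the score to a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sk_proba = clf.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sk_proba))
```

A predicted label of 1 corresponds to a positive score, that is, a probability above 0.5.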



## Support Vector Classifier (Linear Kernel)
A Support Vector Machine (SVM) with a linear kernel finds the maximum-margin hyperplane that separates the two classes. Linear kernels are effective in high-dimensional, sparse spaces such as bag-of-words text features.

In [20]:
from sklearn.svm import SVC

# Initialize and train Linear SVM
svc_model = SVC(kernel='linear')
svc_model.fit(x_train, y_train)

# Predict and evaluate
y_pred_svc = svc_model.predict(x_test)
print("Linear SVC Accuracy:", accuracy_score(y_test, y_pred_svc))
print(classification_report(y_test, y_pred_svc))

Linear SVC Accuracy: 0.8863456985003947
              precision    recall  f1-score   support

           0       0.88      0.89      0.89       628
           1       0.89      0.89      0.89       639

    accuracy                           0.89      1267
   macro avg       0.89      0.89      0.89      1267
weighted avg       0.89      0.89      0.89      1267
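
Scikit-learn also provides `LinearSVC`, a separate linear SVM implementation that is often faster than `SVC(kernel='linear')` on large sparse matrices. It is not a drop-in equivalent (its default loss and its treatment of the intercept differ), so results may not match the numbers above. The sketch below only shows the API on an invented mini-corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Invented documents and labels, for illustration only (1 = REAL, 0 = FAKE)
docs = [
    "senate passes budget bill",
    "court upholds voting law",
    "aliens endorse secret candidate",
    "miracle cure hidden by elites",
]
doc_labels = [1, 1, 0, 0]

svm_vectorizer = CountVectorizer()
X_docs = svm_vectorizer.fit_transform(docs)

# LinearSVC accepts the sparse count matrix directly
linear_svc = LinearSVC().fit(X_docs, doc_labels)

svc_preds = linear_svc.predict(X_docs)
# One weight per vocabulary word; the sign shows which class a word pushes toward
print(linear_svc.coef_.shape, svc_preds)
```

Whether this is worth swapping in depends on training time and on how the accuracy compares on the real data.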



## Decision Tree Classifier
A Decision Tree splits the data into branches to make predictions. It's a non-linear model and interpretable, but can overfit on training data.

In [21]:
from sklearn.tree import DecisionTreeClassifier

# Initialize and train Decision Tree
tree_model = DecisionTreeClassifier()
tree_model.fit(x_train, y_train)

# Predict and evaluate
y_pred_tree = tree_model.predict(x_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_tree))
print(classification_report(y_test, y_pred_tree))

Decision Tree Accuracy: 0.7924230465666929
              precision    recall  f1-score   support

           0       0.79      0.80      0.79       628
           1       0.80      0.79      0.79       639

    accuracy                           0.79      1267
   macro avg       0.79      0.79      0.79      1267
weighted avg       0.79      0.79      0.79      1267
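
The overfitting tendency can be observed by comparing training and test accuracy, and limited with `max_depth`. The sketch below uses synthetic data from `make_classification` rather than the news dataset, and the depth of 3 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (flip_y randomly reassigns about 10% of labels)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Unrestricted tree: grows until every training example is classified correctly
deep_tree = DecisionTreeClassifier(random_state=0).fit(Xs_train, ys_train)
# Depth-limited tree: fewer splits, so less room to memorize noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xs_train, ys_train)

for name, tree in [("deep", deep_tree), ("shallow", shallow_tree)]:
    print(
        name,
        "depth:", tree.get_depth(),
        "train acc:", round(tree.score(Xs_train, ys_train), 3),
        "test acc:", round(tree.score(Xs_test, ys_test), 3),
    )
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting.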



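## Passive Aggressive Classifier
The Passive Aggressive algorithm is suited to large-scale and online learning tasks. It stays passive (makes no update) when an example is classified correctly with a sufficient margin, and makes an aggressive update to the weights when an example is misclassified or falls inside the margin.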

In [22]:
from sklearn.linear_model import PassiveAggressiveClassifier

# Initialize and train Passive Aggressive Classifier
pa_model = PassiveAggressiveClassifier(max_iter=1000)
pa_model.fit(x_train, y_train)

# Predict and evaluate
y_pred_pa = pa_model.predict(x_test)
print("Passive Aggressive Classifier Accuracy:", accuracy_score(y_test, y_pred_pa))
print(classification_report(y_test, y_pred_pa))

Passive Aggressive Classifier Accuracy: 0.9037095501183899
              precision    recall  f1-score   support

           0       0.91      0.89      0.90       628
           1       0.90      0.91      0.91       639

    accuracy                           0.90      1267
   macro avg       0.90      0.90      0.90      1267
weighted avg       0.90      0.90      0.90      1267
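
The passive and aggressive cases can be sketched as a single update step in NumPy. This is a simplified illustration of the hinge-loss update with a step-size cap `C`, assuming labels in {-1, +1}, a nonzero input vector, and no intercept; the function `pa_update` is hypothetical and is not Scikit-learn's implementation.

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))   # hinge loss on this example
    if loss == 0.0:
        return w                               # passive: margin already satisfied
    tau = min(C, loss / np.dot(x, x))          # aggressive: step size capped by C
    return w + tau * y * x

w0 = np.zeros(3)
x1, y1 = np.array([1.0, 2.0, 0.0]), 1

# Aggressive case: w0 gives a margin of 0, so the weights move toward x1
w1 = pa_update(w0, x1, y1, C=10.0)
margin_after = y1 * np.dot(w1, x1)
print(w1, margin_after)

# Passive case: this example already has a margin of 2, so nothing changes
x2, y2 = np.array([2.0, 4.0, 0.0]), 1
w2 = pa_update(w1, x2, y2, C=10.0)
print(np.array_equal(w1, w2))
```

When the cap `C` is not reached, the step is just large enough to bring the example's margin to 1, which is why a single update can correct a misclassified example.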

