# Fake News Detection with Machine Learning

Mini Project - Indraja Nambiar and Royce Geo


## Dataset Variables
    1. Text - the content of the news item
    2. Author - the source of the news item
    3. Title - the headline of the news item
    4. Label - marks the news as Fake or Real (Fake = 0, Real = 1)

## Machine Learning Models Used
    1. Logistic Regression
    2. Decision Tree
    3. Random Forest
    4. K-Nearest Neighbors (KNN)
    

## Import required packages

In [1]:
import pandas as pd
import numpy as np

import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
nltk.download('punkt')
nltk.download('wordnet')
from nltk import sent_tokenize, word_tokenize

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn import metrics


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/indrajanambiar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/indrajanambiar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
data= pd.read_csv("/Users/indrajanambiar/Downloads/raifinaldataset.csv")   # Load dataset

In [3]:
data

Unnamed: 0.1,Unnamed: 0,text,author,title,label
0,0,American football is not a real sport.,misbar.com,American Football is Considered a Sport | Misbar,0
1,1,नीता अंबानी को खेल रत्न पुरस्कार दिया गया,hindi.boomlive.in,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",0
2,2,U.S. President Joe Biden’s January executive o...,snopes.com,Did Biden's Exec Order Say Schools Should Incl...,1
3,3,The Lanka Premier League was the biggest sport...,boomlive.in,Was Lanka Premier League The Biggest Sporting ...,0
4,4,Badminton player PV Sindhu announces retirement,thequint.com,PV Sindhu Didn't Announce Retirement From Spor...,0
...,...,...,...,...,...
60679,60658,UNITED NATIONS (Reuters) - Two North Korean sh...,,North Korea shipments to Syria chemical arms a...,1
60680,60659,"LONDON (Reuters) - LexisNexis, a provider of l...",,LexisNexis withdrew two products from Chinese ...,1
60681,60660,MINSK (Reuters) - In the shadow of disused Sov...,,Minsk cultural hub becomes haven from authorities,1
60682,60661,MOSCOW (Reuters) - Vatican Secretary of State ...,,Vatican upbeat on possibility of Pope Francis ...,1


In [4]:
data.head()     # Head of data

Unnamed: 0.1,Unnamed: 0,text,author,title,label
0,0,American football is not a real sport.,misbar.com,American Football is Considered a Sport | Misbar,0
1,1,नीता अंबानी को खेल रत्न पुरस्कार दिया गया,hindi.boomlive.in,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",0
2,2,U.S. President Joe Biden’s January executive o...,snopes.com,Did Biden's Exec Order Say Schools Should Incl...,1
3,3,The Lanka Premier League was the biggest sport...,boomlive.in,Was Lanka Premier League The Biggest Sporting ...,0
4,4,Badminton player PV Sindhu announces retirement,thequint.com,PV Sindhu Didn't Announce Retirement From Spor...,0


In [5]:
data.shape   # shape of dataset

(60684, 5)

In [6]:
data.describe()   # Data description

Unnamed: 0.1,Unnamed: 0,label
count,60684.0,60684.0
mean,30320.159861,0.517385
std,17518.488604,0.499702
min,0.0,0.0
25%,15148.75,0.0
50%,30320.5,1.0
75%,45491.25,1.0
max,60662.0,1.0


In [7]:
data.info()   # info of dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60684 entries, 0 to 60683
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  60684 non-null  int64 
 1   text        60683 non-null  object
 2   author      38253 non-null  object
 3   title       60683 non-null  object
 4   label       60684 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 2.3+ MB


In [8]:
data["label"].value_counts()

1    31397
0    29287
Name: label, dtype: int64


 > The dataset has 60684 rows and 5 columns.

 > Of the 60684 labels, 31397 are Real and 29287 are Fake.

 > The text, author and title columns hold free text (object dtype); label is a numeric 0/1 column.
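The class counts above also give a reference point for the accuracy scores reported later: a model that always predicts the majority class would already be right a little over half the time. A minimal sketch of that arithmetic, using only the counts printed by `value_counts()` (the variable names are illustrative, not part of the notebook):

```python
# Counts copied from the value_counts() output above.
real_count = 31397
fake_count = 29287

total = real_count + fake_count
real_share = real_count / total
fake_share = fake_count / total

# Accuracy of a trivial classifier that always predicts the larger class.
majority_baseline = max(real_share, fake_share)

print(f"total rows:        {total}")
print(f"share of Real:     {real_share:.4f}")
print(f"share of Fake:     {fake_share:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
```

The share of Real news equals the mean of the `label` column shown by `data.describe()`, since the label is coded 0/1.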



In [9]:
data.isnull().sum()   # Finding number of null values in each column

Unnamed: 0        0
text              1
author        22431
title             1
label             0
dtype: int64

The author column has 22431 missing values; text and title each have 1 missing value.

In [10]:
data.drop('author',axis=1,inplace =True)

# Dropped the author column.

In [11]:
data.dropna(axis=0, inplace=True)  # Drop the rows with missing values

In [12]:
data.isnull().sum()

Unnamed: 0    0
text          0
title         0
label         0
dtype: int64

In [13]:
data

Unnamed: 0.1,Unnamed: 0,text,title,label
0,0,American football is not a real sport.,American Football is Considered a Sport | Misbar,0
1,1,नीता अंबानी को खेल रत्न पुरस्कार दिया गया,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",0
2,2,U.S. President Joe Biden’s January executive o...,Did Biden's Exec Order Say Schools Should Incl...,1
3,3,The Lanka Premier League was the biggest sport...,Was Lanka Premier League The Biggest Sporting ...,0
4,4,Badminton player PV Sindhu announces retirement,PV Sindhu Didn't Announce Retirement From Spor...,0
...,...,...,...,...
60679,60658,UNITED NATIONS (Reuters) - Two North Korean sh...,North Korea shipments to Syria chemical arms a...,1
60680,60659,"LONDON (Reuters) - LexisNexis, a provider of l...",LexisNexis withdrew two products from Chinese ...,1
60681,60660,MINSK (Reuters) - In the shadow of disused Sov...,Minsk cultural hub becomes haven from authorities,1
60682,60661,MOSCOW (Reuters) - Vatican Secretary of State ...,Vatican upbeat on possibility of Pope Francis ...,1


In [14]:
data.drop("Unnamed: 0", axis=1, inplace=True)  # Drop the unnamed column, which only holds a row ID

In [15]:
data

Unnamed: 0,text,title,label
0,American football is not a real sport.,American Football is Considered a Sport | Misbar,0
1,नीता अंबानी को खेल रत्न पुरस्कार दिया गया,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",0
2,U.S. President Joe Biden’s January executive o...,Did Biden's Exec Order Say Schools Should Incl...,1
3,The Lanka Premier League was the biggest sport...,Was Lanka Premier League The Biggest Sporting ...,0
4,Badminton player PV Sindhu announces retirement,PV Sindhu Didn't Announce Retirement From Spor...,0
...,...,...,...
60679,UNITED NATIONS (Reuters) - Two North Korean sh...,North Korea shipments to Syria chemical arms a...,1
60680,"LONDON (Reuters) - LexisNexis, a provider of l...",LexisNexis withdrew two products from Chinese ...,1
60681,MINSK (Reuters) - In the shadow of disused Sov...,Minsk cultural hub becomes haven from authorities,1
60682,MOSCOW (Reuters) - Vatican Secretary of State ...,Vatican upbeat on possibility of Pope Francis ...,1


# Text processing

Combine title and text as all_text



In [16]:
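# Join with a space so the last title word and the first text word do not merge into one token
data['all_text'] = data['title'] + ' ' + data['text']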

In [17]:
# Tokenization

def tokenize(text):
    """Tokenize one text string and keep only alphabetic tokens.

    Args:
        text (str): A single document, e.g. one value of data['all_text'].

    Returns:
        list: Alphabetic tokens, e.g. ['Donald', 'Trump', 'tweets'].

    """

    tokens = nltk.word_tokenize(text)
    # str.isalpha() drops numbers, punctuation and contraction pieces such as "n't"
    return [w for w in tokens if w.isalpha()]

In [18]:
data['tokenized'] = data['all_text'].apply(tokenize)
data[['title', 'tokenized']].head()

Unnamed: 0,title,tokenized
0,American Football is Considered a Sport | Misbar,"[American, Football, is, Considered, a, Sport,..."
1,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",[]
2,Did Biden's Exec Order Say Schools Should Incl...,"[Did, Biden, Exec, Order, Say, Schools, Should..."
3,Was Lanka Premier League The Biggest Sporting ...,"[Was, Lanka, Premier, League, The, Biggest, Sp..."
4,PV Sindhu Didn't Announce Retirement From Spor...,"[PV, Sindhu, Did, Announce, Retirement, From, ..."
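The empty list for the Hindi row above follows from the `isalpha()` filter: Devanagari words usually contain vowel signs and other combining marks, which Python does not count as alphabetic, so every such token is discarded. A small sketch of the effect, using hand-written token lists as a stand-in for the output of `nltk.word_tokenize` (the lists and the helper name `keep_alpha` are illustrative assumptions):

```python
# Hypothetical pre-split tokens; the notebook obtains them from nltk.word_tokenize.
english_tokens = ["Did", "n't", "Announce", "Retirement", "2021", "!"]
hindi_tokens = ["नीता", "अंबानी", "को", "खेल"]

def keep_alpha(tokens):
    """Same filter as in tokenize(): keep tokens made only of alphabetic characters."""
    return [t for t in tokens if t.isalpha()]

kept_english = keep_alpha(english_tokens)
kept_hindi = keep_alpha(hindi_tokens)

print(kept_english)  # contraction pieces, numbers and punctuation are dropped
print(kept_hindi)    # tokens containing combining marks are dropped
```

If Hindi rows should contribute features, one option would be a filter based on Unicode categories that also accepts combining marks; whether that is worthwhile depends on how much non-English text the dataset holds.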


In [19]:
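# Create punctuation features

def punctuation_to_features(data, column):
    # Note: Series.replace() only swaps cells whose ENTIRE value equals the
    # target, so these calls leave ordinary sentences unchanged (compare the
    # all_text output printed below). Replacing inside each string would need
    # data[column].str.replace('!', ' exclamation ', regex=False) instead.
    data[column] = data[column].replace('!', ' exclamation ')
    data[column] = data[column].replace('?', ' question ')
    data[column] = data[column].replace("'", ' quotation ')
    data[column] = data[column].replace('"', ' quotation ')

    return data[column]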
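The difference between `Series.replace` and `Series.str.replace` matters for the function above: the first matches whole cell values, the second matches substrings inside each string. A minimal sketch on a two-row toy Series (the data is made up for illustration):

```python
import pandas as pd

# Toy data: one ordinary sentence and one cell that is exactly "?".
s = pd.Series(["Is this real?", "?"])

# Series.replace with a plain string only matches cells equal to "?".
whole_value = s.replace("?", " question ")

# Series.str.replace works inside each string; regex=False treats "?" literally.
substring = s.str.replace("?", " question ", regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Note also that in this notebook the `tokenized` column is built before this step and `all_text` is later overwritten from the stemmed tokens, so the punctuation words would need to be inserted before tokenization to reach the model.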


In [20]:
data['all_text'] = punctuation_to_features(data, 'all_text')

In [21]:
data['all_text']

0        American Football is Considered a Sport | Misb...
1        वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...
2        Did Biden's Exec Order Say Schools Should Incl...
3        Was Lanka Premier League The Biggest Sporting ...
4        PV Sindhu Didn't Announce Retirement From Spor...
                               ...                        
60679    North Korea shipments to Syria chemical arms a...
60680    LexisNexis withdrew two products from Chinese ...
60681    Minsk cultural hub becomes haven from authorit...
60682    Vatican upbeat on possibility of Pope Francis ...
60683    Indonesia to buy $1.14 billion worth of Russia...
Name: all_text, Length: 60682, dtype: object

In [22]:
# Remove stopwords

nltk.download('stopwords');

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/indrajanambiar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
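STOPS = set(stopwords.words("english"))  # build the set once instead of once per row

def remove_stopwords(tokenized_column):
    # NLTK's English stopwords are lowercase and this comparison is case-sensitive,
    # so capitalized tokens such as "Did" or "The" are kept (see the output below).
    return [word for word in tokenized_column if word not in STOPS]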

In [24]:
data['stopwords_removed'] = data['tokenized'].apply(remove_stopwords)
data[['title', 'stopwords_removed']].head()

Unnamed: 0,title,stopwords_removed
0,American Football is Considered a Sport | Misbar,"[American, Football, Considered, Sport, Misbar..."
1,"वायरल पोस्ट का दावा, नीता अंबानी को दिया गया ख...",[]
2,Did Biden's Exec Order Say Schools Should Incl...,"[Did, Biden, Exec, Order, Say, Schools, Should..."
3,Was Lanka Premier League The Biggest Sporting ...,"[Was, Lanka, Premier, League, The, Biggest, Sp..."
4,PV Sindhu Didn't Announce Retirement From Spor...,"[PV, Sindhu, Did, Announce, Retirement, From, ..."
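In the output above, words such as "Did", "Was" and "The" survive stopword removal because the tokens still carry their original capitalization while the stopword list is lowercase. A sketch of the difference, using a small hand-picked stopword set as a stand-in for NLTK's list (the set `DEMO_STOPS` and the token list are assumptions for illustration):

```python
# Hypothetical subset of a lowercase stopword list (stand-in for NLTK's list).
DEMO_STOPS = {"did", "the", "was", "from", "is", "a"}

tokens = ["Did", "Biden", "Exec", "Order", "Say"]

# Case-sensitive check, as in remove_stopwords(): "Did" != "did", so it stays.
case_sensitive = [t for t in tokens if t not in DEMO_STOPS]

# Lowercasing only for the comparison removes it while keeping original casing.
case_insensitive = [t for t in tokens if t.lower() not in DEMO_STOPS]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing before the comparison (or lowercasing the tokens right after tokenization) would change the features fed to the models, so the accuracy figures would need to be regenerated after such a change.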


In [25]:
# Apply stemming

stemmer = PorterStemmer()  # create once and reuse for every row

def apply_stemming(tokenized_column):
    # Stem each token and make sure the result is lowercase
    return [stemmer.stem(word).lower() for word in tokenized_column]

In [None]:
data['porter_stemmed'] = data['stopwords_removed'].apply(apply_stemming)
data[['title', 'porter_stemmed']].head()

In [None]:
# Rejoin words

def rejoin_words(tokenized_column):
    return " ".join(tokenized_column)

In [None]:
data['all_text'] = data['porter_stemmed'].apply(rejoin_words)
data[['title', 'all_text']].head()

In [None]:
# Independent variable (X) and dependent variable (y)
X = data['all_text']
y = data['label']

## Feature Vectorization

In [None]:
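# Convert the textual data to numerical TF-IDF features

vectorizer = TfidfVectorizer()
# Note: the vectorizer is fitted on the full corpus here, before the train-test split,
# so the test documents also contribute to the vocabulary and IDF weights.
vectorizer.fit(X)

X = vectorizer.transform(X)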



In [None]:
print(X)

## Train-Test Split

In [None]:
# Split the dataset: 70% for training, 30% for testing

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
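Because the TF-IDF vectorizer in this notebook is fitted before the split, the test documents influence the vocabulary and IDF weights. One common way to avoid this is to split the raw text first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, which the notebook already imports. The sketch below is not the notebook's method; it runs on a tiny made-up corpus, and the names `toy_texts`, `toy_labels` and `pipe` are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (0 = fake, 1 = real), only to make the sketch runnable.
toy_texts = [
    "aliens built the pyramids", "moon landing was staged",
    "miracle cure hides from doctors", "celebrity secretly a robot",
    "parliament passed the budget", "court upheld the ruling",
    "central bank raised rates", "city council approved the plan",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the RAW text first, so the vectorizer never sees the test documents.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=1, stratify=toy_labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),                # fitted on training text only
    ("clf", LogisticRegression(random_state=1)),
])
pipe.fit(train_texts, train_labels)

# predict() reuses the training vocabulary; unseen test words are ignored.
test_predictions = pipe.predict(test_texts)
print(test_predictions)
```

With this layout the same `pipe` object can also be passed to cross-validation helpers, which refit the vectorizer inside every fold.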

# Training the models

## 1. Logistic Regression



In [None]:
from sklearn.linear_model import LogisticRegression


logistic_model = LogisticRegression(random_state=1)
logistic_model.fit(X_train,y_train)

pred_test_logistic=logistic_model.predict(X_test)



In [None]:
# Accuracy score on the test data

score_logistic = accuracy_score(y_test, pred_test_logistic) * 100
print("Accuracy score:", score_logistic)


# Confusion matrix (rows: true labels, columns: predicted labels)
print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_test_logistic))

# Classification report for the logistic regression model
print("\nClassification report:\n")
print(classification_report(y_test, pred_test_logistic))
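The numbers in the classification report can be derived directly from the confusion matrix. The sketch below works through that arithmetic for the Real class (label 1) on hypothetical counts; the matrix values are invented for illustration and are not results from this notebook. The layout follows scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
# Hypothetical confusion matrix, NOT output of this notebook.
# Rows: true label (0 = fake, 1 = real). Columns: predicted label.
cm = [[80, 20],
      [10, 90]]

tn, fp = cm[0]  # true fakes predicted fake / predicted real
fn, tp = cm[1]  # true reals predicted fake / predicted real

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision_real = tp / (tp + fp)   # of all "real" predictions, how many were right
recall_real = tp / (tp + fn)      # of all real news, how many were found
f1_real = 2 * precision_real * recall_real / (precision_real + recall_real)

print(f"accuracy:  {accuracy:.3f}")
print(f"precision: {precision_real:.3f}")
print(f"recall:    {recall_real:.3f}")
print(f"f1:        {f1_real:.3f}")
```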

## 2. Decision Tree

In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(random_state=1)
tree_model.fit(X_train,y_train)

pred_test_tree=tree_model.predict(X_test)


In [None]:
# Model accuracy on the test data
score_tree = accuracy_score(y_test, pred_test_tree) * 100
print("Accuracy score:", score_tree)


# Confusion matrix (rows: true labels, columns: predicted labels)
print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_test_tree))

# Classification report for the decision tree model
print("\nClassification report:\n")
print(classification_report(y_test, pred_test_tree))

## 3. Random Forest

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest_model = RandomForestClassifier(random_state=1,max_depth=10,n_estimators=50)
forest_model.fit(X_train,y_train)

pred_test_forest=forest_model.predict(X_test)


In [None]:
# Model accuracy on the test data
score_forest = accuracy_score(y_test, pred_test_forest) * 100
print("Accuracy score:", score_forest)


# Confusion matrix (rows: true labels, columns: predicted labels)
print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_test_forest))


# Classification report for the random forest model
print("\nClassification report:\n")
print(classification_report(y_test, pred_test_forest))


## 4. KNN

In [None]:
# KNN
from sklearn.neighbors import KNeighborsClassifier

#  create model instance 
knn= KNeighborsClassifier(n_neighbors=22)

#  Model Fitting
knn=knn.fit(X_train, y_train)

knn_pred = knn.predict(X_test)

In [None]:
# Model accuracy on the test data

knn_accuracy = accuracy_score(y_test, knn_pred) * 100
print("Accuracy score:", knn_accuracy)

# Confusion matrix (rows: true labels, columns: predicted labels)
print("Confusion matrix:\n")
print(confusion_matrix(y_test, knn_pred))

# Classification report for the KNN model
print("\nClassification report:\n")
print(classification_report(y_test, knn_pred))
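The four models above are trained and scored in separate cells with the same steps each time. A loop over a dictionary of models gives a side-by-side comparison with less repetition. The sketch below uses the same model settings as the notebook but runs on synthetic data from `make_classification`, so the printed scores say nothing about the fake-news dataset; `X_demo`, `y_demo` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the notebook these would be the TF-IDF features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, shuffle=True
)

models = {
    "Logistic Regression": LogisticRegression(random_state=1, max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1, max_depth=10, n_estimators=50),
    "KNN": KNeighborsClassifier(n_neighbors=22),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te)) * 100

# Print the models from best to worst test accuracy.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {score:6.2f}")
```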

# Making a predictive System

In [None]:
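X_new = X_test[0]  # first row of the test feature matrix


prediction = logistic_model.predict(X_new)
print(prediction)


# Label encoding from the dataset description: 0 = Fake, 1 = Real
if prediction[0] == 0:
    print("The news is fake")
else:
    print("The news is real")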
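The cell above classifies a row that is already vectorized. To classify new raw text, the text has to pass through the same fitted vectorizer (and, in this notebook, the same tokenizing, stopword and stemming steps) before it reaches the model. A minimal sketch of such a helper follows; `describe_prediction`, `LABEL_NAMES` and the toy training data are assumptions for illustration, and the preprocessing steps are left out to keep it short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Label encoding from the dataset description: 0 = fake, 1 = real.
LABEL_NAMES = {0: "fake", 1: "real"}

def describe_prediction(raw_text, fitted_vectorizer, fitted_model):
    """Vectorize one raw string with an already fitted vectorizer and classify it."""
    features = fitted_vectorizer.transform([raw_text])  # transform only, never fit
    label = int(fitted_model.predict(features)[0])
    return LABEL_NAMES[label]

# Tiny made-up training set, only to make the sketch runnable.
toy_texts = [
    "aliens built the pyramids", "aliens control the weather",
    "parliament passed the budget", "court upheld the budget ruling",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer().fit(toy_texts)
toy_model = LogisticRegression(random_state=1).fit(
    toy_vectorizer.transform(toy_texts), toy_labels
)

result = describe_prediction("aliens stole the budget", toy_vectorizer, toy_model)
print("The news is", result)
```

Calling `fit` or `fit_transform` on the new text instead of `transform` would build a different vocabulary and make the feature columns incompatible with the trained model.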

In [None]:
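# True label of the same row. y_test keeps the shuffled original index,
# so select by position with .iloc rather than by index label.
print(y_test.iloc[0])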
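After a shuffled split, `y_test` keeps the row labels of the original DataFrame, so the first test row is generally not the row labelled 0, and label 0 may not be in the test set at all. A small sketch of the difference between positional and label-based access on a toy Series (the data is made up):

```python
import pandas as pd

# Toy labels with the default 0..3 index.
y = pd.Series([0, 1, 1, 0], index=[0, 1, 2, 3])

# Simulate a shuffled test split that happens to contain rows 2 and 0, in that order.
test_part = y.loc[[2, 0]]

first_by_position = test_part.iloc[0]  # first row of the split (original row 2)
value_by_label = test_part.loc[0]      # the row whose index label is 0

print(first_by_position, value_by_label)
print(3 in test_part.index)  # label 3 is absent, so test_part.loc[3] would raise KeyError
```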