<a href="https://colab.research.google.com/github/naman1gupta/ML-projects/blob/main/Fake_News_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: marks whether the news article is real or fake:
           1: Fake news
           0: Real news





Importing the Dependencies

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [3]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing

In [4]:
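# loading the dataset to a pandas DataFrame
# quoting=3 is csv.QUOTE_NONE: quotes are treated as ordinary text, so commas and newlines inside quoted fields split rows
# on_bad_lines='warn' (pandas >= 1.3) replaces the deprecated error_bad_lines=False
news_dataset = pd.read_csv('/content/train.csv', quoting=3, on_bad_lines='warn')
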
Skipping line 13: expected 7 fields, saw 41
Skipping line 26: expected 7 fields, saw 9
Skipping line 30: expected 7 fields, saw 14
Skipping line 34: expected 7 fields, saw 10
Skipping line 55: expected 7 fields, saw 16
Skipping line 63: expected 7 fields, saw 59
Skipping line 64: expected 7 fields, saw 51
Skipping line 65: expected 7 fields, saw 115
Skipping line 66: expected 7 fields, saw 73
Skipping line 67: expected 7 fields, saw 27
Skipping line 68: expected 7 fields, saw 9
Skipping line 70: expected 7 fields, saw 12
Skipping line 71: expected 7 fields, saw 9
Skipping line 80: expected 7 fields, saw 11
Skipping line 81: expected 7 fields, saw 12
Skipping line 83: expected 7 fields, saw 13
Skipping line 127: expected 7 fields, saw 10
Skipping line 170: expected 7 fields, saw 15
Skipping line 171: expected 7 fields, saw 13
Skipping line 179: expected 7 fields, saw 9
Skipping line 188: expected 7 fie
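
The skipped lines above are consistent with how `quoting=3` (that is, `csv.QUOTE_NONE`) behaves: quote characters become ordinary text, so commas and newlines inside a quoted field start new fields and rows. A minimal sketch on a made-up CSV (the sample rows are hypothetical, not taken from train.csv) compares it with pandas' default quoting:

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the first record has a quoted title with a comma
# and a quoted, multi-line text field.
sample = (
    'id,title,author,text,label\n'
    '0,"Hello, world",Jane Doe,"First line\nsecond line, with a comma",1\n'
    '1,Plain title,John Roe,Short text,0\n'
)

# Default quoting: quoted commas and newlines stay inside their field.
default_df = pd.read_csv(io.StringIO(sample))

# QUOTE_NONE (the value 3): quotes are plain characters, so fields and rows split.
raw_df = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)

print(default_df.shape)                 # one row per record
print(raw_df.shape)                     # more rows than records
print(raw_df['label'].isnull().sum())   # the split row has no label
```

If the same effect applies to train.csv, reading it with the default quoting would be the first thing to try.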

In [5]:
news_dataset.shape

(5402, 5)

In [6]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0,Unnamed: 1,id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It,Darrell Lucus,"""House Dem Aide: We Didn’t Even See Comey’s Le...",2016 Subscribe Jason Chaffetz on the stump in...,Utah ( image courtesy Michael Jolley,available under a Creative Commons-BY license)
With apologies to Keith Olbermann,there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide,it looks like we also know who the second-wor...,the ranking Democrats on the relevant committ...,,,
As we now know,Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence,Judiciary,and Oversight committees that his agency was ...,Oversight Committee Chairman Jason Chaffetz s...,"""""The FBI has learned of the existence of ema...",
— Jason Chaffetz (@jasoninthehouse) October 28,2016,,,,,
Of course,we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of “an unrelated case”–which we now know to be Anthony Weiner’s sexting with a teenager. But apparently such little things as facts didn’t matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary wins–at least two years’ worth,and possibly an entire term’s worth of them. ...,,,,


In [7]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

id        2704
title     3538
author    4297
text      4819
label     5190
dtype: int64

In [8]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [9]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']
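
Filling the nulls first matters here because concatenating a string with a missing value yields a missing value. A small sketch with made-up rows (the column values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second author is missing.
df = pd.DataFrame({
    'author': ['Jane Doe', np.nan],
    'title': ['First headline', 'Second headline'],
})

# Without filling, the row with a missing author becomes null as a whole.
unfilled = df['author'] + ' ' + df['title']

# With empty-string filling, the title survives on its own.
filled_df = df.fillna('')
content = filled_df['author'] + ' ' + filled_df['title']

print(unfilled.isnull().tolist())
print(content.tolist())
```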

In [10]:
print(news_dataset['content'])

0                                                                                                                                                  House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                     2016 Subscribe Jason Chaffetz on the stump in...
With apologies to Keith Olbermann                                                                                                                   there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                       

In [11]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [12]:
print(X)
print(Y)

                                                                                                                                                      id  \
0                                                  House Dem Aide: We Didn’t Even See Comey’s Lett...                                      Darrell Lucus   
With apologies to Keith Olbermann                   there is no doubt who the Worst Person in The ...   it looks like we also know who the second-wor...   
As we now know                                      Comey notified the Republican chairmen and Dem...                                          Judiciary   
— Jason Chaffetz (@jasoninthehouse) October 28      2016                                                                                                   
Of course                                           we now know that this was not the case . Comey...   and possibly an entire term’s worth of them. ...   
...                                                             

Stemming:

Stemming is the process of reducing a word to its root form.

example:
acts, acted, acting --> act
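
A quick way to see what the stemmer does is to run it on a few words. The sketch below uses NLTK's `PorterStemmer`, the same class this notebook imports; the word list is only an illustration:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same verb collapse to one stem.
words = ['acts', 'acted', 'acting']
stems = [stemmer.stem(word) for word in words]
print(stems)

# Stems are not always dictionary words.
print(stemmer.stem('subscribe'))
```

Because the stemmer strips suffixes by rule rather than looking words up, related words that differ by more than a suffix may not share a stem.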

In [13]:
port_stem = PorterStemmer()

In [14]:
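# build the stopword set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content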

In [15]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [16]:
print(news_dataset['content'])

0                                                                                                                                                  House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                    subscrib jason chaffetz stump american fork ho...
With apologies to Keith Olbermann                                                                                                                   there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                       

In [17]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [18]:
print(X)

['subscrib jason chaffetz stump american fork hous dem aid even see comey letter jason chaffetz tweet darrel lucu octob'
 'rank democrat relev committe hear comey found via tweet one republican committe chairmen'
 'oversight committe chairman jason chaffetz set polit world ablaz tweet fbi dir inform oversight committe agenc review email recent discov order see contain classifi inform long letter went'
 ... '' '' '']


In [19]:
print(Y)

[' available under a Creative Commons-BY license) ' '' '' ... '' '' '']


In [20]:
Y.shape

(5402,)

In [21]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
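
To make the conversion concrete, here is a sketch of `TfidfVectorizer` on three made-up, already-stemmed documents. Each row of the result is one document, each column one vocabulary term, and with the default settings each row is scaled to unit length:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-stemmed documents.
docs = [
    'comey letter tweet',
    'committe hear comey',
    'tweet fbi email',
]

demo_vectorizer = TfidfVectorizer()
# fit_transform() is equivalent to fit() followed by transform() on the same data.
matrix = demo_vectorizer.fit_transform(docs)

print(matrix.shape)                          # (documents, vocabulary terms)
print(sorted(demo_vectorizer.vocabulary_))   # the learned terms
```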

In [22]:
print(X)

  (0, 5091)	0.1977748772888625
  (0, 4726)	0.23633771760547492
  (0, 4714)	0.23633771760547492
  (0, 4338)	0.16721454937729818
  (0, 3391)	0.16292650766148709
  (0, 2903)	0.2283352052004268
  (0, 2815)	0.2057773896939106
  (0, 2603)	0.4255365114627152
  (0, 2342)	0.17420541541474518
  (0, 1942)	0.247616625358733
  (0, 1690)	0.1559356416240401
  (0, 1273)	0.247616625358733
  (0, 1198)	0.2283352052004268
  (0, 923)	0.2090537850421206
  (0, 768)	0.4255365114627152
  (0, 178)	0.15752562918186308
  (0, 120)	0.2001952909691642
  (1, 5252)	0.2649954428873426
  (1, 5091)	0.2546899893963563
  (1, 4106)	0.246696458654994
  (1, 4070)	0.29404488471188883
  (1, 3971)	0.27399812119186856
  (1, 3416)	0.17767825928096206
  (1, 2244)	0.27399812119186856
  (1, 1961)	0.23138440122556259
  :	:
  (5385, 405)	0.34360635564448894
  (5387, 4914)	0.19476717508685715
  (5387, 4897)	0.21924061302742678
  (5387, 4692)	0.18775962243909095
  (5387, 4657)	0.2377540541406176
  (5387, 4653)	0.2132806162000479
  (5387,

Splitting the dataset into training & test data

In [23]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=2)
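
When the labels are clean class values, passing `stratify` keeps the class ratio the same in both splits. This is an optional variation that the call above does not use; the sketch below runs on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 30 of them labelled 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

# Both splits keep the same share of class 1.
print(y_tr.mean(), y_te.mean())
```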

Training the Model: Logistic Regression

In [24]:
model = LogisticRegression()

In [25]:
model.fit(X_train, Y_train)
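
One design alternative, not used in this notebook, is to wrap the vectorizer and the classifier in a scikit-learn pipeline. The TF-IDF vocabulary is then learned from the training texts only, and raw text can be passed straight to `predict`. A sketch on made-up stemmed headlines and labels (all data here is hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels (1 = fake, 0 = real).
train_texts = [
    'shock secret cure doctor hate',
    'alien land white hous secret',
    'senat pass budget bill',
    'court rule tax case',
]
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the classifier from the training texts only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# New text is vectorized with the training vocabulary before prediction.
demo_prediction = pipeline.predict(['secret alien cure'])
print('The news is Fake' if demo_prediction[0] == 1 else 'The news is Real')
```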

Evaluation

Accuracy score

In [26]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects (y_true, y_pred)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [27]:
print('Accuracy score of the training data : ', training_data_accuracy)

Accuracy score of the training data :  0.9592686878037492


In [28]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
# accuracy_score expects (y_true, y_pred)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [29]:
print('Accuracy score of the test data : ', test_data_accuracy)

Accuracy score of the test data :  0.9666975023126735
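
Accuracy is easier to interpret next to a baseline. In this run, 5190 of the 5402 label values were null before filling, so a model that always predicts the most frequent label would already score roughly 0.96 on the full dataset. The sketch below shows how such a baseline can be computed with scikit-learn's `DummyClassifier`; the labels in it are made up:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 95 of 100 samples belong to one class.
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((100, 1))  # the baseline ignores the features

# Always predict the most frequent class seen during fit().
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_true)
baseline_accuracy = accuracy_score(y_true, baseline.predict(X_dummy))

print(baseline_accuracy)  # share of samples in the majority class
```

A model is only informative to the extent that it beats this number.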


Making a Predictive System

In [30]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

# assumes integer labels as described above (0 = real, 1 = fake)
if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

['']
The news is Fake


In [32]:
print(Y_test[3])


