# Introduction


In this analysis, we will discuss how you can use NLP to determine whether the news is real or fake. Nowadays, fake news has become a common problem. Even respected media organizations are known to propagate fake news and are losing credibility. It can be difficult to trust news, because it can be difficult to know whether a news story is real or fake.

# Dataset
1.train.csv: A full training dataset with the following attributes                                         
2.id: unique id for a news article                                                                         
3.title: the title of a news article                                                                       
4.author: author of the news article                                                                       
5.text: the text of the article; could be incomplete                                                       
6.label: a label that marks the article as potentially unreliable. Where 0: reliable and 1: unreliable.

# Importing important libraries

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

In [2]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

# Reading dataset

In [3]:
train_df=pd.read_csv('../input/fake-news/train.csv')
test_df=pd.read_csv('../input/fake-news/test.csv')
sub_df=pd.read_csv('../input/fake-news/submit.csv')

In [4]:
# here we are printing first five lines of our train dataset
train_df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


# Data Pre-Processing

In [5]:
#filling nan values with space(' ')
train_df.fillna(' ',inplace=True)

In [6]:
#combining title and author,title and summary is formed
train_df['summary']=train_df['title']+' '+train_df['author']+' '+train_df['text']
train_df.head()

Unnamed: 0,id,title,author,text,label,summary
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,House Dem Aide: We Didn’t Even See Comey’s Let...
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"FLYNN: Hillary Clinton, Big Woman on Campus - ..."
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,Why the Truth Might Get You Fired Consortiumne...
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,15 Civilians Killed In Single US Airstrike Hav...
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1,Iranian woman jailed for fictional unpublished...


In [7]:
train_df['summary'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart Daniel J. Flynn Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination? [Hillary Clinton remains the big woman on campus in leafy, liberal Wellesley, Massachusetts. Everywhere else votes her most likely to don her inauguration dress for the remainder of her days the way Miss Havisham forever wore that wedding dress.  Speaking of Great Expectations, Hillary Rodham overflowed with them 48 years ago when she first addressed a Wellesley graduating class. The president of the college informed those gathered in 1969 that the students needed “no debate so far as I could ascertain as to who their spokesman was to be” (kind of the like the Democratic primaries in 2016 minus the   terms unknown then even at a Seven Sisters school). “I am very glad that Miss Adams made it clear that what I am speaking for today is all of us —  the 400 of us,” Miss Rodham told her classmates

In [8]:
train_df.isnull().sum()

id         0
title      0
author     0
text       0
label      0
summary    0
dtype: int64

In [9]:
train_df['summary']==' '

0        False
1        False
2        False
3        False
4        False
         ...  
20795    False
20796    False
20797    False
20798    False
20799    False
Name: summary, Length: 20800, dtype: bool

**Removel of stop words and Stemming the words**

In [10]:
# here we are importing nltk,stopwords and porterstemmer we are using stemming on the text 
# we have and stopwords will help in removing the stopwords in the text

#re is regular expressions used for identifying only words in the text and ignoring anything else
import nltk
import re
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
ps=PorterStemmer()

In [11]:
x=train_df['summary']
y=train_df['label']

In [12]:
x.head()

0    House Dem Aide: We Didn’t Even See Comey’s Let...
1    FLYNN: Hillary Clinton, Big Woman on Campus - ...
2    Why the Truth Might Get You Fired Consortiumne...
3    15 Civilians Killed In Single US Airstrike Hav...
4    Iranian woman jailed for fictional unpublished...
Name: summary, dtype: object

In [13]:
x[1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart Daniel J. Flynn Ever get the feeling your life circles the roundabout rather than heads in a straight line toward the intended destination? [Hillary Clinton remains the big woman on campus in leafy, liberal Wellesley, Massachusetts. Everywhere else votes her most likely to don her inauguration dress for the remainder of her days the way Miss Havisham forever wore that wedding dress.  Speaking of Great Expectations, Hillary Rodham overflowed with them 48 years ago when she first addressed a Wellesley graduating class. The president of the college informed those gathered in 1969 that the students needed “no debate so far as I could ascertain as to who their spokesman was to be” (kind of the like the Democratic primaries in 2016 minus the   terms unknown then even at a Seven Sisters school). “I am very glad that Miss Adams made it clear that what I am speaking for today is all of us —  the 400 of us,” Miss Rodham told her classmates

In [14]:
# here we are creating corpus for the test dataset exactly the same as we created for the 
# training dataset
corpus=[]
for i in range(0,len(train_df)):
    review=re.sub('[^a-zA-Z]',' ',x[i])
    review=review.lower()
    review=review.split()
    review=[ps.stem(word) for word in review if not word in stopwords.words('english')]
    review=' '.join(review)
    corpus.append(review)

In [15]:
corpus[1]

'flynn hillari clinton big woman campu breitbart daniel j flynn ever get feel life circl roundabout rather head straight line toward intend destin hillari clinton remain big woman campu leafi liber wellesley massachusett everywher els vote like inaugur dress remaind day way miss havisham forev wore wed dress speak great expect hillari rodham overflow year ago first address wellesley graduat class presid colleg inform gather student need debat far could ascertain spokesman kind like democrat primari minu term unknown even seven sister school glad miss adam made clear speak today us us miss rodham told classmat appoint edger bergen charli mccarthi mortim snerd attend bespectacl granni glass award matronli wisdom least john lennon wisdom took issu previou speaker despit becom first win elect seat u senat sinc reconstruct edward brook came critic call empathi goal protestor critic tactic though clinton senior thesi saul alinski lament black power demagogu elitist arrog repress intoler with

**Word Embedding — One hot encoding**

The machine cannot understand words and therefore it needs numerical values so as to make it easier for the machine to process the data. To apply any type of algorithm to the data, we need to convert the categorical data to numbers. To achieve this, one hot ending is one way as it converts categorical variables to binary vectors.

In [16]:
#vocabulary size
voc_size=10000

In [17]:
# TensorFlow has an operation for one-hot encoding
one_hot_reps1=[one_hot(word,voc_size) for word in corpus]
one_hot_reps1[1]

[2472,
 1082,
 8431,
 5908,
 9842,
 2932,
 6821,
 9913,
 6512,
 2472,
 2409,
 7172,
 6651,
 5097,
 5207,
 9691,
 5184,
 4289,
 1039,
 1079,
 7171,
 5939,
 9115,
 1082,
 8431,
 9242,
 5908,
 9842,
 2932,
 8770,
 9816,
 7610,
 3397,
 8500,
 5400,
 6443,
 5647,
 7815,
 5742,
 7637,
 3126,
 2580,
 1514,
 9733,
 6865,
 2322,
 4847,
 5742,
 2330,
 3167,
 5363,
 1082,
 6756,
 4530,
 7587,
 6844,
 1936,
 8372,
 7610,
 2368,
 6307,
 4548,
 3300,
 3506,
 401,
 49,
 9529,
 9620,
 6631,
 118,
 7691,
 9520,
 9728,
 5647,
 5024,
 9247,
 9262,
 4470,
 740,
 8912,
 2310,
 7871,
 7263,
 9978,
 1514,
 9627,
 7988,
 5301,
 2330,
 2470,
 8598,
 8598,
 1514,
 6756,
 5867,
 7187,
 6393,
 802,
 7152,
 2770,
 1465,
 7479,
 1137,
 4865,
 5537,
 8592,
 2894,
 4279,
 6921,
 422,
 957,
 5569,
 75,
 422,
 517,
 7003,
 5928,
 6059,
 628,
 9207,
 1936,
 7747,
 366,
 1611,
 728,
 702,
 855,
 8667,
 3738,
 1555,
 7942,
 7613,
 39,
 5537,
 7848,
 3270,
 7613,
 1657,
 7321,
 8431,
 9058,
 4262,
 1226,
 8285,
 4227,
 587

**Word Embedding**

In [18]:
# here we are specifying a sentence length so that every sentence in the corpus will be of same length
sent_length=500
#making all the sentence as equall size vector
#two types of padding pre and post
embedded_docs1=pad_sequences(one_hot_reps1,padding='pre',maxlen=sent_length)
embedded_docs1

array([[   0,    0,    0, ..., 4978, 6875, 2627],
       [   0,    0,    0, ...,  351, 9708, 7073],
       [9687, 7684, 9629, ..., 3391, 4128, 6540],
       ...,
       [   0,    0,    0, ..., 3176, 5673, 2252],
       [   0,    0,    0, ..., 8876, 4456, 1200],
       [7587, 5972, 3122, ..., 9661,  294,  665]], dtype=int32)

In [19]:
x=np.array(embedded_docs1)
y=np.array(y)

In [20]:
x

array([[   0,    0,    0, ..., 4978, 6875, 2627],
       [   0,    0,    0, ...,  351, 9708, 7073],
       [9687, 7684, 9629, ..., 3391, 4128, 6540],
       ...,
       [   0,    0,    0, ..., 3176, 5673, 2252],
       [   0,    0,    0, ..., 8876, 4456, 1200],
       [7587, 5972, 3122, ..., 9661,  294,  665]], dtype=int32)

# Data Pre-Processing for testing data

In [21]:
test_df.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


In [22]:
test_df.isnull().sum()

id          0
title     122
author    503
text        7
dtype: int64

In [23]:
#filling nan values with space(' ')
test_df=test_df.fillna(' ')
test_df.isnull().sum()

id        0
title     0
author    0
text      0
dtype: int64

In [24]:
#combining title and author,title and summary is formed
test_df['summary']=test_df['title']+' '+test_df['author']+' '+test_df['title']
test_df.head()

Unnamed: 0,id,title,author,text,summary
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning...","Specter of Trump Loosens Tongues, if Not Purse..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...,#NoDAPL: Native American Leaders Vow to Stay A...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different...","Tim Tebow Will Attempt Another Comeback, This ..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...,Keiser Report: Meme Wars (E995) Truth Broadcas...


In [25]:
test_df['summary'][0]

'Specter of Trump Loosens Tongues, if Not Purse Strings, in Silicon Valley - The New York Times David Streitfeld Specter of Trump Loosens Tongues, if Not Purse Strings, in Silicon Valley - The New York Times'

**Removel of stop words and Stemming the words**

In [26]:
# here we are creating corpus for the test dataset exactly the same as we created for the 
# testinging dataset
corpus_test=[]

for i in range(0,len(test_df)):
    review=re.sub('[^a-zA-Z]',' ',test_df['summary'][i])
    review=review.lower()
    review=review.split()
    
    review=[ps.stem(word) for word in review if not word in stopwords.words('english')]
    review=' '.join(review)
    corpus_test.append(review)
    

In [27]:
corpus_test[1]

'russian warship readi strike terrorist near aleppo russian warship readi strike terrorist near aleppo'

In [28]:
# TensorFlow has an operation for one-hot encoding
one_hot_reps2=[one_hot(word,voc_size) for word in corpus_test]
one_hot_reps2[1]

[1750,
 1835,
 5514,
 8086,
 5100,
 1393,
 2698,
 1750,
 1835,
 5514,
 8086,
 5100,
 1393,
 2698]

In [29]:
# here we are specifying a sentence length so that every sentence in the corpus will be of same length
sent_length=500
# here we are using padding for creating equal length sentences
embedded_docs2=pad_sequences(one_hot_reps2,padding='pre',maxlen=sent_length)
embedded_docs2

array([[   0,    0,    0, ..., 9796, 7396, 2040],
       [   0,    0,    0, ..., 5100, 1393, 2698],
       [   0,    0,    0, ..., 1244, 6425, 2652],
       ...,
       [   0,    0,    0, ..., 9796, 7396, 2040],
       [   0,    0,    0, ..., 1750, 6889,  937],
       [   0,    0,    0, ..., 9796, 7396, 2040]], dtype=int32)

In [30]:
embedded_docs2.shape

(5200, 500)

In [31]:
test_df=np.array(embedded_docs2)
test_df

array([[   0,    0,    0, ..., 9796, 7396, 2040],
       [   0,    0,    0, ..., 5100, 1393, 2698],
       [   0,    0,    0, ..., 1244, 6425, 2652],
       ...,
       [   0,    0,    0, ..., 9796, 7396, 2040],
       [   0,    0,    0, ..., 1750, 6889,  937],
       [   0,    0,    0, ..., 9796, 7396, 2040]], dtype=int32)

# LSTM

# Building Models

In [32]:
#Creating model
from tensorflow.keras.layers import Dropout
import warnings
warnings.filterwarnings('ignore')
embedded_feature_vector=300
nn=Sequential([
    
    Embedding(voc_size,embedded_feature_vector,input_length=sent_length),
    Dropout(0.5),
    LSTM(199),
    Dropout(0.4),
    Dense(399,activation='relu'),
    Dense(43,activation='relu'),
    Dense(1,activation='sigmoid')])

2022-10-01 16:21:13.906560: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-01 16:21:14.005517: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-01 16:21:14.006272: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-01 16:21:14.008064: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compil

In [33]:
nn.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 300)          3000000   
_________________________________________________________________
dropout (Dropout)            (None, 500, 300)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 199)               398000    
_________________________________________________________________
dropout_1 (Dropout)          (None, 199)               0         
_________________________________________________________________
dense (Dense)                (None, 399)               79800     
_________________________________________________________________
dense_1 (Dense)              (None, 43)                17200     
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 4

# Spiliting and Training

In [34]:
# here we are splitting the data for training and testing the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [35]:
nn.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
nn.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=20,batch_size=64)

Epoch 1/20


2022-10-01 16:21:17.053901: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2022-10-01 16:21:19.730656: I tensorflow/stream_executor/cuda/cuda_dnn.cc:369] Loaded cuDNN version 8005


Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f18cddf17d0>

In [36]:
y_pred=nn.predict(test_df)
y_pred

array([[0.06256308],
       [1.        ],
       [0.99999726],
       ...,
       [0.9747313 ],
       [0.99999917],
       [0.5257501 ]], dtype=float32)

In [37]:
y_pred=(y_pred>0.5)
y_pred

array([[False],
       [ True],
       [ True],
       ...,
       [ True],
       [ True],
       [ True]])

In [38]:
y_pred=y_pred.reshape(-1,)
y_pred

array([False,  True,  True, ...,  True,  True,  True])

# Submission File

In [39]:
submission=pd.DataFrame({'id':sub_df['id'],'label':y_pred})
submission

Unnamed: 0,id,label
0,20800,False
1,20801,True
2,20802,True
3,20803,True
4,20804,True
...,...,...
5195,25995,False
5196,25996,False
5197,25997,True
5198,25998,True


In [40]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
submission['label']=le.fit_transform(submission['label'])
submission.head()

Unnamed: 0,id,label
0,20800,0
1,20801,1
2,20802,1
3,20803,1
4,20804,1


In [41]:
#saving the submission file
submission.to_csv('submission.csv',index=None)

In [42]:
sent_length=500

# Bidirectional LSTM

In [43]:
#Creating model
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Bidirectional
embedding_vector_features=40
model1=Sequential()
model1.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model1.add(Bidirectional(LSTM(100)))
model1.add(Dropout(0.3))
model1.add(Dense(1,activation='sigmoid'))
model1.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model1.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 40)           400000    
_________________________________________________________________
bidirectional (Bidirectional (None, 200)               112800    
_________________________________________________________________
dropout_2 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 201       
Total params: 513,001
Trainable params: 513,001
Non-trainable params: 0
_________________________________________________________________
None


In [44]:
model1.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=20,batch_size=64)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7f1838409550>

In [45]:
y_pred=model1.predict(X_test)

y_pred = (y_pred > 0.5)

In [46]:
y_pred=y_pred.reshape(-1,)
y_pred

array([ True,  True, False, ..., False, False, False])