This notebook walks you through the process of detecting fake news on social media using classical machine learning algorithms and an LSTM network built with TensorFlow.

In [1]:
import numpy as np
import pandas as pd

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/fake-news/submit.csv
/kaggle/input/fake-news/test.csv
/kaggle/input/fake-news/train.csv


In [3]:
train=pd.read_csv('/kaggle/input/fake-news/train.csv')
test=pd.read_csv('/kaggle/input/fake-news/test.csv')
submit=pd.read_csv('/kaggle/input/fake-news/submit.csv')

In [None]:
train.head()

In [5]:
df=train.dropna() # Drop every row that contains a missing value

In [None]:
X=df['title'] # Input text: the 'title' column
y=df['label'] # Target: the 'label' column

We can see that the features ‘title’, ‘author’ and ‘text’ are important and all are in text form. So, we can combine these features to make one final feature which we will use to train the model. Let’s call the feature ‘total’.


In [7]:
# First, replace every missing value with a space so the string concatenation below works
train = train.fillna(' ')
train['total'] = train['title'] + ' ' + train['author'] + ' ' +  train['text']


In [9]:
pip install nltk 



# Machine learning experiments 

In [10]:
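# Cleaning our dataset
import nltk
from nltk.corpus import stopwords  # to remove stop words such as 'the', 'they', 'it', 'a', ...
from nltk.stem import WordNetLemmatizer  # for the lemmatization task
# Assumption: the NLTK data may not be installed yet; resource names can vary by NLTK version
for resource in ('stopwords', 'punkt', 'wordnet'):
    nltk.download(resource, quiet=True)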


In [11]:
stop_words = stopwords.words('english')


Tokenization: Word tokenization is the process of splitting a large sample of text into words.
For example:

In [13]:

word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)


['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']


In [14]:
lemmatizer = WordNetLemmatizer()


In [None]:
import re
stop_set = set(stop_words)  # set membership is much faster than a list lookup

def clean_text(sentence):
    # Cleaning the sentence with regex: drop everything except word characters and whitespace
    sentence = re.sub(r'[^\w\s]', '', sentence)
    # Tokenization (lowercase first so capitalized stop words such as 'The' are also removed)
    words = nltk.word_tokenize(sentence.lower())
    # Stopwords removal, then lemmatization of the remaining words
    return ' '.join(lemmatizer.lemmatize(w) for w in words if w not in stop_set)

train['total'] = train['total'].apply(clean_text)
train = train[['total', 'label']]

In [None]:
X_train = train['total']
Y_train = train['label']


**Vectorizer**

To convert this text data into numerical data, we will use two vectorizers.
* **Count Vectorizer**

To use textual data for predictive modelling, the text must first be split into individual words; this process is called tokenization. These words then need to be encoded as integers or floating-point values for use as inputs to machine learning algorithms. This process is called feature extraction (or vectorization). A count vectorizer does this by counting how often each vocabulary word occurs in each document.

* **TF-IDF Vectorizer**

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is one of the most widely used techniques in information retrieval for representing how important a specific word or phrase is to a given document.
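To make the two vectorization steps concrete, here is a minimal hand-computed sketch on a toy corpus. The three sentences and all variable names are made up for illustration. The weighting uses the smoothed formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's `TfidfTransformer` applies with its default settings; treat this as an illustration of the idea, not a replacement for the library.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the dataset)
docs = ["fake news spreads fast", "real news spreads", "fake fake claims"]

# Step 1: count vectorization, one column per vocabulary word
vocab = sorted({w for d in docs for w in d.split()})
counts = [[Counter(d.split())[w] for w in vocab] for d in docs]

# Step 2: smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# Step 3: weight the counts by idf, then scale each row to unit L2 norm
def tfidf_row(row):
    weighted = [c * w for c, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted]

tfidf = [tfidf_row(row) for row in counts]

print(vocab)      # ['claims', 'fake', 'fast', 'news', 'real', 'spreads']
print(counts[2])  # [1, 2, 0, 0, 0, 0]
print([round(v, 3) for v in tfidf[2]])
```

Words that appear in fewer documents (such as 'claims') receive a larger idf weight than words that appear in many (such as 'fake'), so rare but distinctive terms count for more per occurrence.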

In [None]:
# this could take a while.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# Learn the vocabulary and build the document-term count matrix in one pass
count_vectorizer = CountVectorizer()
freq_term_matrix = count_vectorizer.fit_transform(X_train)
# Re-weight the raw counts with TF-IDF and L2-normalize each row
tfidf = TfidfTransformer(norm = "l2")
tf_idf_matrix = tfidf.fit_transform(freq_term_matrix)


Models 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tf_idf_matrix,  Y_train, random_state=0)


Logistic Regression Classifier 

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Accuracy = logreg.score(X_test, y_test)
print( 'LogisticRegression Accuracy :  ',Accuracy )

Multinomial Naive Bayes Classifier 

In [None]:
from sklearn.naive_bayes import MultinomialNB
NB = MultinomialNB()
NB.fit(X_train, y_train)
Accuracy = NB.score(X_test, y_test)
print( 'Multinomial NB Accuracy :  ',Accuracy )

# Deep learning experiments 

In [4]:
import tensorflow as tf # TensorFlow 2.x; the next line shows the installed version
tf.__version__

'2.1.0'

In [None]:
from tensorflow.keras.layers import Embedding, LSTM, Dense ## Neural networks layers 
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot # maps each word to an integer index via hashing (collisions are possible)
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
voc_size=5000 # vocabulary size: upper bound on the word indices the model will see

In [None]:
X=[i.lower() for i in X] # lowercase each text 

In [None]:
onehot=[one_hot(words,voc_size) for words in X] 

In [None]:
sen_len=30
embedded_doc=pad_sequences(onehot, padding='pre', maxlen=sen_len) # left-pad (or truncate) every sequence to sen_len indices
print(embedded_doc)
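To see what the padding step does to each title, here is a small pure-Python sketch. `pad_pre` is a hypothetical helper written only for illustration; it mimics `pad_sequences` with `padding='pre'` and the default `truncating='pre'` on a single sequence, assuming `maxlen` is positive. The index values are made up.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate seq so it has exactly maxlen items (maxlen > 0)."""
    seq = list(seq)[-maxlen:]  # too long: keep only the last maxlen indices
    return [value] * (maxlen - len(seq)) + seq  # too short: fill with zeros on the left

short_title = [12, 7, 431]           # hypothetical word indices
long_title = [5, 9, 2, 8, 1, 3, 4]

print(pad_pre(short_title, 5))  # [0, 0, 12, 7, 431]
print(pad_pre(long_title, 5))   # [2, 8, 1, 3, 4]
```

Padding on the left keeps the real words at the end of the sequence, closest to the final LSTM step whose output feeds the classifier.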

Creating the model

In [None]:
embedding_vector_feature=40 
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_feature, input_length=sen_len))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
# sigmoid: squashes the output into (0, 1), which suits the binary case
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# binary_crossentropy: because we have a binary classification task
# Adam: an adaptive variant of stochastic gradient descent
model.summary() # prints the summary itself; wrapping it in print() would also print 'None'
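As a worked illustration of the output layer and loss chosen above, the sketch below computes a sigmoid activation and the binary cross-entropy for a single example in plain Python. The function names and the raw score of 2.0 are made up for illustration; Keras computes the same quantities over whole batches, with extra numerical safeguards.

```python
import math

def sigmoid(z):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example: -[y * ln(p) + (1 - y) * ln(1 - p)]."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)                               # about 0.881
loss_when_correct = binary_crossentropy(1, p)  # about 0.127
loss_when_wrong = binary_crossentropy(0, p)    # about 2.127

print(round(p, 3), round(loss_when_correct, 3), round(loss_when_wrong, 3))
```

A confident prediction costs little when the true label agrees with it and a lot when it does not, which is the behavior that pushes the network toward well-calibrated probabilities during training.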

In [None]:
X_final=np.array(embedded_doc)
y_final=np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X_final, y_final, test_size=0.20, random_state=0)

In [None]:
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)

References: https://medium.com/@ishantjuyal/fake-news-detector-nlp-project-9d67e0177075
and many other Kaggle kernels.