## This notebook contains a detailed analysis of fake and true news data

In [2]:
import pandas as pd

In [3]:
fake_df=pd.read_csv("../data/fake.csv")
true_df=pd.read_csv("../data/True.csv")

<p>After loading the data into pandas DataFrames, we add a label column: 0 marks fake news and 1 marks true news.</p>

In [4]:
fake_df["label"]=0
true_df["label"]=1

After adding the label column, we combine the two datasets into a single DataFrame so that a model can later be trained on both classes. We use the concat function from pandas to do that.

In [5]:
df=pd.concat([fake_df, true_df], ignore_index=True)

In [6]:
df.shape

(44898, 5)

In [7]:
df.head()

,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0


It's clear from the output above that our DataFrame has 44898 rows and that each row has the following 5 columns:
1. title
2. text
3. subject
4. date
5. label


The next step is to clean the data. We need all of the text in lowercase, and we need to remove punctuation such as !, @, #, $, etc.
The following steps should be taken to clean the data:
1. Lowercase the text (normalize the text data)
2. Remove punctuation
3. Remove stopwords (is, the, etc.). These are very common words that carry little information for classification.
4. Remove numbers and URLs
5. Tokenize (split into individual words)

We will use the nltk library, together with Python's built-in re and string modules, to achieve all this.
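
The steps listed above can be tried out on a single sentence before bringing in nltk. This is a minimal sketch using only the standard library, under two stated assumptions: `DEMO_STOP_WORDS` is a tiny hand-picked stop-word set (not NLTK's English list), and a plain whitespace split stands in for a real tokenizer. The name `clean_text_demo` and the sample sentence are made up for illustration.

```python
import re
import string

# Assumption: a tiny illustrative stop-word set, NOT NLTK's English list.
DEMO_STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to", "in", "at"}


def clean_text_demo(text):
    """Apply the five cleaning steps to one string and return the cleaned text."""
    # 1. Lowercase
    text = text.lower()
    # 4a. Remove URLs (done before punctuation removal so the URL is still intact)
    text = re.sub(r"http\S+|www\S+", "", text)
    # 2. Remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 5. Tokenize (assumption: whitespace split instead of a real tokenizer)
    tokens = text.split()
    # 3 + 4b. Drop stop words and anything that is not purely alphabetic (numbers)
    kept = [tok for tok in tokens if tok not in DEMO_STOP_WORDS and tok.isalpha()]
    return " ".join(kept)


sample = "The Senate voted 52 to 48 at https://example.com/vote, and the bill is DEAD!"
print(clean_text_demo(sample))
```

Note that URLs are removed before punctuation in this sketch: once the dots and slashes are stripped, a link can no longer be recognised as one.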


In [8]:
import string
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



stop_words=set(stopwords.words("english"))

Now, let's define a function to perform the text cleaning.

In [9]:
def clean_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Remove links: any run of non-space characters starting with http(s) or www
    text = re.sub(r'http\S+|www\S+', '', text)
    # Strip ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize: split the text into a list of words
    tokens = word_tokenize(text)
    # Keep tokens that are not stopwords and are purely alphabetic (drops numbers)
    filtered_tokens = [word for word in tokens if word not in stop_words and word.isalpha()]
    return " ".join(filtered_tokens)
    

In [10]:
df["clean_text"]=df["text"].apply(clean_text)
print(df.head())

                                               title  \
0   Donald Trump Sends Out Embarrassing New Year’...   
1   Drunk Bragging Trump Staffer Started Russian ...   
2   Sheriff David Clarke Becomes An Internet Joke...   
3   Trump Is So Obsessed He Even Has Obama’s Name...   
4   Pope Francis Just Called Out Donald Trump Dur...   

                                                text subject  \
0  Donald Trump just couldn t wish all Americans ...    News   
1  House Intelligence Committee Chairman Devin Nu...    News   
2  On Friday, it was revealed that former Milwauk...    News   
3  On Christmas day, Donald Trump announced that ...    News   
4  Pope Francis used his annual Christmas Day mes...    News   

                date  label                                         clean_text  
0  December 31, 2017      0  donald trump wish americans happy new year lea...  
1  December 31, 2017      0  house intelligence committee chairman devin nu...  
2  December 30, 2017      0  friday

Now that the text is clean, we need to vectorize it and take the labels from df['label']. The labels are already numeric (0 for fake news and 1 for real news), so no further encoding is needed.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer
verctorizer=TfidfVectorizer(max_features=5000)

In [12]:
x=verctorizer.fit_transform(df["clean_text"])

In [13]:
x.shape

(44898, 5000)
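
The shape above means one row per article and one column per retained term. To make the cell values less opaque, here is a minimal pure-Python sketch of the weighting, assuming scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). The helper `tfidf` and the three toy documents are hypothetical; the sketch splits on whitespace and keeps every term, whereas the notebook's vectorizer uses its own tokenizer and keeps only 5000 features.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalised TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return vocab, rows


docs = [
    "trump wish americans happy new year",
    "house intelligence committee chairman",
    "trump house",
]
vocab, rows = tfidf(docs)
print(len(rows), "documents x", len(vocab), "terms")
```

A term that appears in many documents (here "trump" or "house") receives a lower weight than a term that appears in only one, which is what lets the classifier focus on distinctive vocabulary.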

In [14]:
y=df["label"].values

Now that our data is ready, we can split it into training and test sets and train our model on x and y.

In [15]:
from sklearn.model_selection import  train_test_split
x_train, x_test, y_train, y_test=train_test_split(x,y,test_size=0.2, random_state=42, stratify=y) # type: ignore
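
The `stratify=y` argument asks for the same fake/true ratio in both the training and the test set. The idea can be sketched in plain Python as below. This is an illustration of the concept only, not scikit-learn's actual algorithm, so the exact counts it produces can differ slightly from `train_test_split`; `stratified_split` and the toy `labels` list are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 50 fake (0) and 30 true (1)
labels = [0] * 50 + [1] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Without stratification, a random split of an imbalanced dataset could leave one class under-represented in the test set, which would make the reported metrics less reliable.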

In [16]:
# Time to train the model

from sklearn.linear_model import LogisticRegression
model=LogisticRegression()

In [17]:
model.fit(x_train, y_train)

parameter,value
penalty,'l2'
dual,False
tol,0.0001
C,1.0
fit_intercept,True
intercept_scaling,1
class_weight,None
random_state,None
solver,'lbfgs'
max_iter,100


In [18]:
y_pred=model.predict(x_test)

In [20]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
accuracy= accuracy_score(y_test, y_pred)
print(accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9863028953229399
[[4613   83]
 [  40 4244]]
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      4696
           1       0.98      0.99      0.99      4284

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
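
The numbers in the report can be re-derived from the confusion matrix alone, which is a useful sanity check. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; `precision_recall` is a helper introduced here for illustration.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = [[4613, 83],
      [40, 4244]]

total = sum(sum(row) for row in cm)
# Accuracy: correctly classified articles (the diagonal) over all test articles
accuracy = (cm[0][0] + cm[1][1]) / total


def precision_recall(cm, cls):
    """Return (precision, recall) for one class of a square confusion matrix."""
    true_positive = cm[cls][cls]
    predicted_as_cls = sum(row[cls] for row in cm)  # column sum
    actually_cls = sum(cm[cls])                     # row sum (the "support")
    return true_positive / predicted_as_cls, true_positive / actually_cls


for cls in (0, 1):
    precision, recall = precision_recall(cm, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, the matrix says that of the 8980 test articles, 83 fake ones were predicted as true and 40 true ones were predicted as fake.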



In [21]:
from sklearn.ensemble import RandomForestClassifier
rf_model= RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42, n_jobs=-1)
rf_model.fit(x_train, y_train)

parameter,value
n_estimators,100
criterion,'gini'
max_depth,None
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,'sqrt'
max_leaf_nodes,None
min_impurity_decrease,0.0
bootstrap,True


In [22]:
y_pred_rf=rf_model.predict(x_test)

In [23]:
type(x_test)

scipy.sparse._csr.csr_matrix

In [24]:
print(accuracy_score(y_test,y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))

0.9975501113585746
[[4686   10]
 [  12 4272]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4696
           1       1.00      1.00      1.00      4284

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980



In [25]:
import joblib
import os

os.makedirs("../Models", exist_ok=True)

joblib.dump(rf_model,"../Models/randomforestmodel.pkl")
joblib.dump(verctorizer,"../Models/tfidf_vectorizer.pkl")

['../Models/tfidf_vectorizer.pkl']
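
Saving both artifacts matters because a new article has to pass through the same cleaning function and the same fitted vectorizer before the model can score it. The following is a minimal sketch of how the saved files might be used later. `predict_label` and `load_artifacts` are hypothetical helpers, and the sketch assumes the files written above exist and that `clean_text` from this notebook is available as the `cleaner` argument.

```python
def load_artifacts(model_dir="../Models"):
    """Load the fitted vectorizer and model saved by the notebook."""
    # Imported here so predict_label can be used without joblib installed
    import joblib

    model = joblib.load(f"{model_dir}/randomforestmodel.pkl")
    vectorizer = joblib.load(f"{model_dir}/tfidf_vectorizer.pkl")
    return vectorizer, model


def predict_label(raw_text, vectorizer, model, cleaner):
    """Clean one article, vectorize it, and return 'fake' or 'true'."""
    # transform (not fit_transform): reuse the vocabulary learned during training
    features = vectorizer.transform([cleaner(raw_text)])
    label = int(model.predict(features)[0])
    return "true" if label == 1 else "fake"


# Example usage (requires the saved files and the notebook's clean_text):
# vectorizer, model = load_artifacts()
# print(predict_label("Some article text to check", vectorizer, model, clean_text))
```

Calling `fit_transform` again at prediction time would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.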