<h1 style="text-align:center">   
      <font color = Black >
                Fake and Real News 
        </font>    
</h1>   
<hr style="width:100%;height:5px;border-width:0;color:gray;background-color:gray">
<center><img style = "height:450px;" src="https://www.txstate.edu/cache78a0c25d34508c9d84822109499dee61/imagehandler/scaler/gato-docs.its.txstate.edu/jcr:21b3e33f-31c9-4273-aeb0-5b5886f8bcc4/fake-fact.jpg?mode=fit&width=1600"></center>

# Introduction

This table of contents gives an overview of the sections in this notebook.

1. [Load Required Libraries](#1)
2. [Import the Dataset](#2)
3. [Exploratory Data Analysis](#3)
4. [Data Cleaning](#4)
    * [Removing Stopwords](#5)
    * [Lemmatization](#6)
    * [Word Cloud](#7)
5. [N-gram Analysis](#8)  
    * [Unigram Analysis](#9)
    * [Bigram Analysis](#10)
    * [Trigram Analysis](#11)
6. [Modeling](#12)
7. [Conclusion](#13)

<a id = "1" ></a>
# Load Required Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

from bs4 import BeautifulSoup
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from wordcloud import WordCloud, STOPWORDS
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

<a id = "2" ></a>
# Import the Dataset

The fake and real news articles come in two separate datasets. We will import both into the environment.

In [None]:
#import dataset
fake = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv")
true = pd.read_csv("../input/fake-and-real-news-dataset/True.csv")

In [None]:
#data exploration
fake.head()

In [None]:
true.head()

The columns in the datasets are:
* **title** - The title of the article
* **text** - The text of the article
* **subject** - The subject of the article
* **date** - The date at which the article was posted

The dataset contains no target variable, so we need to create one manually and add it to both datasets. We will create a binary variable called `label`, set to `0` for real news and `1` for fake news.

In [None]:
#adding label columns to both fake news and true news dataset
fake["label"] = 1
true["label"] = 0

We will combine the separate datasets into one for further analysis.

In [None]:
#combining both the datasets into one
df = pd.concat([fake, true], ignore_index = True)
df

<a id = "3" ></a>
# Exploratory Data Analysis

In [None]:
#EDA
#checking for missing values in the combined dataset
df.isnull().sum()

There are no null/missing values in the dataset.

In [None]:
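#checking for imbalance in the dataset
#value_counts() orders by frequency, so sort by label to match each bar to its class
count = df['label'].value_counts().sort_index()
sns.barplot(x = count.index, y = count.values)
plt.title('Target variable count')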

From the plot above, you can see there is no class imbalance in the target variable. We have almost equal instances for negative class ("0" - Real) and the class of interest ("1" - Fake).
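
The balance check can also be expressed as numbers rather than read off a plot. The sketch below uses a small made-up list of labels (not the real dataset) and only the standard library; the names `shares` and `balance_ratio` are introduced here for illustration.

```python
from collections import Counter

# Made-up labels for illustration only: 1 = fake, 0 = real.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the data, keyed by label.
shares = {label: counts[label] / total for label in sorted(counts)}

# Ratio of the smaller class to the larger one; values near 1.0 mean balanced.
balance_ratio = min(counts.values()) / max(counts.values())

print(shares)
print(round(balance_ratio, 2))
```

On the notebook's DataFrame, `df['label'].value_counts(normalize=True)` gives the same class shares directly.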

In [None]:
#distribution of fake and real news among subjects
plt.figure(figsize=(12,8))
sns.countplot(x = "subject", data=df, hue = "label")
plt.show()

<a id = "4" ></a>
# Data Cleaning
We will begin with the preprocessing steps before the text is fed into the model for prediction. 

In [None]:
#data cleaning
#combining the title and text columns
df['text'] = df['title'] + " " + df['text']
#deleting few columns from the data 
del df['title']
del df['subject']
del df['date']

<a id = "5" ></a>
## Removing stopwords
One of the major forms of pre-processing is filtering out words that carry little information. In NLP these are called stop words, and we will use the `nltk` library to get a list of them. Removing words that do not contribute to later steps makes the processed text more compact.

In [None]:
#English stop words from nltk
stop = set(stopwords.words('english'))
#list of punctuation marks
punctuation = list(string.punctuation)
#adding punctuation marks to the set of stop words
stop.update(punctuation)

#Removing text inside square brackets (raw string avoids invalid-escape warnings)
def remove_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Removing URLs
def remove_urls(text):
    return re.sub(r'http\S+', '', text)

#Removing the stopwords from text
def remove_stopwords(text):
    final_text = []
    text = text.lower()
    for i in text.split():
        if i.strip() not in stop:
            final_text.append(i.strip())
    return " ".join(final_text)

#Removing the noisy text
def clean_text(text):
    text = remove_brackets(text)
    text = remove_urls(text)
    text = remove_stopwords(text)
    return text

#Apply function on text column
df['text']=df['text'].apply(clean_text)
df['text']
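
To see what these cleaning steps do to a single string, here is a small self-contained sketch. It assumes a tiny hand-written stop word set in place of NLTK's English list, and the sample sentence and URL are made up, so treat it as an illustration rather than the exact output on the dataset.

```python
import re
import string

# Hypothetical, tiny stop word set; the notebook uses NLTK's full English list.
stop = {"the", "is", "a", "at", "this"} | set(string.punctuation)

def clean_text(text):
    text = re.sub(r'\[[^]]*\]', '', text)   # drop [bracketed] spans
    text = re.sub(r'http\S+', '', text)     # drop URLs
    # lowercase, split on whitespace, and keep tokens that are not stop words
    tokens = [t for t in text.lower().split() if t not in stop]
    return " ".join(tokens)

sample = "This is a report [VIDEO] posted at https://example.com - the end."
cleaned = clean_text(sample)
print(cleaned)  # report posted end.
```

Note that the standalone `-` is removed, but the period attached to `end.` survives: adding punctuation to the stop word set only filters tokens that consist of a single punctuation mark.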

<a id = "6" ></a>
## Lemmatization
The next step is lemmatization, the process of converting a word to its base form. For example: 'caring' -> 'care'; 'hanging' -> 'hang'.

In [None]:
#lemmatization
#initialize the WordNet lemmatizer
lemmatizer = WordNetLemmatizer()

#takes a sentence or document and returns its lemmatized version
def lemmatize_text(text):
    #tokenize first; lemmatizing the whole string at once would return it unchanged
    token_words = word_tokenize(text)
    #lemmatize each token and join with single spaces (no trailing space)
    return " ".join(lemmatizer.lemmatize(word) for word in token_words)

#Apply function on text column
df['text']=df['text'].apply(lemmatize_text)
df
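
One detail worth keeping in mind: `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun, so calling it with the word alone leaves verb forms such as 'caring' unchanged. The sketch below imitates that behaviour with a toy lookup table so it runs without NLTK data; `LEMMAS` and `toy_lemmatize` are hypothetical names, not part of NLTK.

```python
# Toy lookup table keyed by (word, part of speech); illustrative only.
LEMMAS = {
    ("caring", "v"): "care",
    ("hanging", "v"): "hang",
    ("feet", "n"): "foot",
}

def toy_lemmatize(word, pos="n"):
    """Return the base form for (word, pos), or the word unchanged if unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("feet"))             # foot
print(toy_lemmatize("caring"))           # caring (treated as a noun by default)
print(toy_lemmatize("caring", pos="v"))  # care
```

With the real lemmatizer, passing `pos='v'` for verbs (for example after part-of-speech tagging) would be one way to get 'caring' -> 'care'.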

<a id = "7" ></a>
## Word Cloud
### Fake News Word Cloud


In [None]:
#word cloud for fake news
cloud = WordCloud(max_words = 500, stopwords = STOPWORDS, background_color = "white").generate(" ".join(df[df.label == 1].text))
plt.figure(figsize=(40, 30))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

### Real News Word Cloud

In [None]:
#word cloud for real news
cloud = WordCloud(max_words = 500, stopwords = STOPWORDS, background_color = "white").generate(" ".join(df[df.label == 0].text))
plt.figure(figsize=(40, 30))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

<a id = "8" ></a>
# N-gram Analysis

In [None]:
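#finding n-grams
#join all articles into one string separated by spaces
#(str() on the list would leave brackets, quotes and commas in the tokens)
texts = ' '.join(df['text'].tolist())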

# first get individual words
tokenized = texts.split()
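
An n-gram is a run of n consecutive tokens, so a unigram is a single word, a bigram a pair, and a trigram a triple. As a rough sketch of what `nltk.ngrams` produces, the snippet below builds the same sliding windows with the standard library on a made-up token list; the helper name `ngrams` is defined here and is not the NLTK function.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    return zip(*(tokens[i:] for i in range(n)))

# Made-up tokens for illustration.
tokens = "white house said white house press".split()

bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('white', 'house'), 2)]
```

Counting the tuples and keeping the most common ones is the same idea as the `value_counts()[:20]` calls in the cells that follow.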

<a id = "9" ></a>
## Unigram Analysis

In [None]:
#unigram
unigram = (pd.Series(nltk.ngrams(tokenized, 1)).value_counts())[:20]
unigram.sort_values().plot.barh(width=.9, figsize=(12, 8))
plt.title('20 Most Frequently Occurring Unigrams')
plt.ylabel('Unigram')
plt.xlabel('# of Occurrences')

<a id = "10" ></a>
## Bigram Analysis

In [None]:
#bigrams
bigram = (pd.Series(nltk.ngrams(tokenized, 2)).value_counts())[:20]
bigram.sort_values().plot.barh(width=.9, figsize=(12, 8))
plt.title('20 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')

<a id = "11" ></a>
## Trigram Analysis

In [None]:
#trigrams
trigram = (pd.Series(nltk.ngrams(tokenized, 3)).value_counts())[:20]
trigram.sort_values().plot.barh(width=.9, figsize=(12, 8))
plt.title('20 Most Frequently Occurring Trigrams')
plt.ylabel('Trigram')
plt.xlabel('# of Occurrences')

<a id = "12" ></a>
# Modeling
In this step, I use several classification models for prediction. The models are trained on the cleaned text data.

#### Using TF-IDF Vectorizer
TF-IDF stands for "Term Frequency – Inverse Document Frequency", the two components of the score assigned to each word. The `TfidfVectorizer` tokenizes documents, learns the vocabulary and the inverse document frequency weights, and lets you encode new documents.
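
To make the scoring concrete, here is a minimal pure-Python sketch of the weighting on three made-up headlines. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each document vector) and uses plain whitespace splitting instead of the vectorizer's tokenizer, so it is an approximation of the idea, not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Made-up documents for illustration.
docs = [
    "senate passes budget bill",
    "shocking budget secret revealed",
    "senate vote on bill delayed",
]
tokenized_docs = [d.split() for d in docs]
n_docs = len(tokenized_docs)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for doc in tokenized_docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(doc):
    tf = Counter(doc)                                  # raw term counts
    weights = {t: tf[t] * idf(t) for t in tf}          # tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}   # L2-normalised vector

vec = tfidf(tokenized_docs[1])
print({t: round(w, 3) for t, w in vec.items()})
```

In the second headline, 'budget' also appears in another document, so it ends up with a lower weight than 'shocking', 'secret' or 'revealed', which occur only once in the collection.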

In [None]:
#modeling
def get_prediction(vectorizer, classifier, X_train, X_test, y_train, y_test):
    pipe = Pipeline([('vector', vectorizer),
                    ('model', classifier)])
    model = pipe.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print("Accuracy: {}".format(round(accuracy_score(y_test, y_pred)*100,2)))
    cm = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix: \n", cm)
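
The confusion matrix returned by scikit-learn has one row per true label and one column per predicted label, in sorted label order. The sketch below rebuilds such a matrix by hand from hypothetical labels (not results from this notebook) to show how accuracy, and also precision and recall for the fake class, can be read from it.

```python
# Hypothetical true and predicted labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# Rows index the true label, columns the predicted label.
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

tn, fp = cm[0]   # real articles: correctly kept / wrongly flagged as fake
fn, tp = cm[1]   # fake articles: missed / correctly flagged

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the articles flagged fake, how many were fake
recall = tp / (tp + fn)      # of the fake articles, how many were flagged

print(cm, accuracy, precision, recall)  # [[3, 1], [1, 3]] 0.75 0.75 0.75
```

Since `classification_report` is already imported, calling it inside `get_prediction` would print precision and recall next to the accuracy.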


In [None]:
#pipeline implementation
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size = 0.3, random_state= 0)
classifiers = [LogisticRegression(),KNeighborsClassifier(n_neighbors=5), DecisionTreeClassifier(), GradientBoostingClassifier(), 
               RandomForestClassifier()]
for classifier in classifiers:
    print("\n\n", classifier)
    get_prediction(TfidfVectorizer(), classifier, X_train, X_test, y_train, y_test)
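
The `test_size = 0.3` and `random_state = 0` arguments ask for a shuffled 70/30 hold-out split that comes out the same on every run. As a rough illustration of that idea only (this is not scikit-learn's algorithm, and `holdout_split` is a hypothetical helper), the same behaviour can be sketched with the standard library:

```python
import random

def holdout_split(items, test_size=0.3, seed=0):
    """Shuffle a copy of items with a fixed seed and split off a test fraction."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train_items, test_items = holdout_split(range(10))
print(len(train_items), len(test_items))  # 7 3
```

Every item lands in exactly one of the two parts, and calling the function again with the same seed returns the same split.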

<a id = "13" ></a>
# Conclusion
The Decision Tree, Gradient Boosting and Random Forest models reach an accuracy above 99%, which is a very good score. Such a high score may also indicate overfitting, which can be explored with a validation curve. I will explore overfitting further.

**Upvote if you like this notebook. Happy Learning!**