#### Importing Libraries

In [None]:
!python -m nltk.downloader stopwords
!python -m nltk.downloader punkt
!python -m nltk.downloader wordnet

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
import numpy as np, pandas as pd
import seaborn as sns, matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

In [None]:
from google.colab import drive
drive.mount('/content/drive')

ValueError: mount failed

#### Importing Data

In [None]:
df_fake = pd.read_csv("/content/Fake.csv")
df_true = pd.read_csv("/content/True.csv")
print("Original 'Fake' and 'True' dataframes have the shapes:", df_fake.shape, " and ", df_true.shape, "respectively.")

ParserError: Error tokenizing data. C error: EOF inside string starting at row 11600

In [None]:
df_fake.head(2)


#### Inserting a column "label" as target feature and Combining DataFrames
As there is no specific column holding the label for fake and true texts, we'll add one based on the names of the data sets. Since we are trying to detect fake text, we'll label fake and true news with "1" and "0", respectively. Then we'll concatenate the two data sets and shuffle the rows randomly.

In [None]:
df_fake["label"] = "1"
df_true["label"] = "0"
df = pd.concat([df_fake, df_true])
df = df.sample(frac=1).reset_index(drop=True)

print("Combined dataframe has shape of ", df.shape)
df.head(3)

In [None]:
df.at[2,'text']

#### Keeping necessary columns

To classify the text we will also consider its title. So we will combine the 'title' and 'text' columns into one, and drop the 'date' and 'subject' columns.

In [None]:
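# join title and text with a space so the last title word does not fuse with the first text word
df["text"] = df["title"] + " " + df["text"]
df.drop(columns=["title", "subject", "date"], inplace=True)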
df.head(3)

In [None]:
#There is no missing data
df.isnull().sum()

In [None]:
X = df["text"]
y = df["label"]
print(X.shape, y.shape)

#### Creating a function to clean and Lemmatize the texts
To clean the data we use lemmatization, the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. In this process we also remove stop words (very common words such as “the”, “a”, “an”, “in” that carry little information for classification), and keep only letters, converted to lower case.

In [None]:
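stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()  # create once and reuse for every sentence

def LemmSentence(sentence):
    lemma_words = []
    word_tokens = word_tokenize(sentence)
    for word in word_tokens:
        # keep letters only and lower-case before the stop-word check,
        # so that capitalized stop words such as "The" are removed too
        new_word = re.sub('[^a-zA-Z]', '', word).lower()
        # skip stop words and tokens that became empty (punctuation, numbers)
        if new_word and new_word not in stop_words:
            lemma_words.append(wordnet_lemmatizer.lemmatize(new_word))
    return " ".join(lemma_words)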

X = [LemmSentence(i) for i in X]

#### Defining dependent and independent variables and splitting them into training and test sets

In [None]:
X = pd.DataFrame(X)
y = pd.DataFrame(y)
X.shape, y.shape

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=7)
x_train.shape, x_test.shape, y_train.shape, y_test.shape

#### Converting text to vectors using Tfidf vectorizer

Term Frequency (TF) = (Frequency of a term in the document)/(Total number of terms in the document)

Inverse Document Frequency(IDF) = log( (total number of documents)/(number of documents with term t))
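
As a rough illustration of the two formulas above, here is a minimal sketch that computes TF, IDF and their product by hand. The toy corpus and the helper names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for this example and are not part of the notebook. Note that `TfidfVectorizer` with default settings uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this textbook version.

```python
import math

# Toy corpus, already tokenized (hypothetical data, for illustration only)
docs = [
    ["trump", "said", "fake", "news"],
    ["reuters", "said", "senate", "vote"],
    ["fake", "fake", "story", "viral"],
]

def term_frequency(term, doc):
    # (frequency of the term in the document) / (total number of terms in the document)
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, docs):
    # log((total number of documents) / (number of documents containing the term))
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

# "trump" appears in one document only, "said" in two,
# so "trump" gets the higher weight in the first document.
print("trump:", tf_idf("trump", docs[0], docs))
print("said: ", tf_idf("said", docs[0], docs))
```

A term that is frequent in one document but rare across the corpus gets a high weight, which is the property the classifier relies on.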

In [None]:
# create the transform
vectorizer = TfidfVectorizer()

# transforming
tfidf_train = vectorizer.fit_transform(x_train.iloc[:,0])
tfidf_test = vectorizer.transform(x_test.iloc[:,0])

In [None]:
tfidf_train.shape, tfidf_test.shape

#### Applying PassiveAggressiveClassifier and visualizing heatmap for Confusion matrix

Passive-Aggressive algorithms get their name from how they react to each training example:

Passive: If the prediction is correct, keep the model and do not make any changes, i.e., the example does not carry enough information to change the model.

Aggressive: If the prediction is incorrect, change the model, i.e., adjust the weights just enough to correct the mistake on this example.

C : The regularization parameter. It controls how aggressive each update is: a larger C allows bigger corrections after an incorrect prediction, a smaller C gives stronger regularization.

max_iter : The maximum number of passes the model makes over the training data.

tol : The stopping criterion. If it is not None, training stops when (loss > previous_loss - tol). By default, it is set to 1e-3.
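
To make the passive/aggressive behaviour concrete, here is a minimal sketch of a single update step in plain Python. It follows the PA-II rule from the original Passive-Aggressive paper, which the scikit-learn documentation describes as equivalent to `loss='squared_hinge'`. The helper names `dot` and `pa_update`, the use of labels -1/+1 instead of "0"/"1", and the missing intercept are simplifications assumed for this example; it is not the scikit-learn implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pa_update(w, x, y, C=0.16):
    """One Passive-Aggressive (PA-II) step for a label y in {-1, +1}."""
    # hinge loss: zero when the example is classified correctly with margin >= 1
    loss = max(0.0, 1.0 - y * dot(w, x))
    if loss == 0.0:
        return w  # passive: keep the model unchanged
    # aggressive: step size grows with the loss and is damped by C
    tau = loss / (dot(x, x) + 1.0 / (2.0 * C))
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Hypothetical 2-feature examples, for illustration only
w = [0.0, 0.0]
for x, y in [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]:
    w = pa_update(w, x, y)
    print(w)
```

With a small C such as 0.16, the term 1/(2C) dominates the denominator, so each correction is deliberately cautious.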

In [None]:
pac = PassiveAggressiveClassifier(random_state = 7,loss = 'squared_hinge',  max_iter = 50, C = 0.16)
pac.fit(tfidf_train, y_train.values.ravel())

#Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)

print(f'Accuracy: {round(score*100, 2)}%')

Finally, let's visualize the result with a confusion matrix, which breaks the predictions down into True Positives, False Positives, True Negatives and False Negatives.

In [None]:
ax = sns.heatmap(confusion_matrix(y_test,y_pred), annot=True, fmt="d")
ax.set(xlabel='Prediction', ylabel='Actual')
plt.show()
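
To read the heatmap above: scikit-learn's `confusion_matrix` puts actual labels in rows and predicted labels in columns, with labels in sorted order, so with labels "0" and "1" the layout is [[TN, FP], [FN, TP]] when "1" (fake) is treated as the positive class. The sketch below counts the same four cells by hand; `confusion_counts` and the demo label lists are hypothetical names and data introduced only for this illustration.

```python
def confusion_counts(y_true, y_pred, positive="1"):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Made-up labels, for illustration only ("1" = fake, "0" = true)
y_true_demo = ["1", "1", "0", "0", "1", "0"]
y_pred_demo = ["1", "0", "0", "1", "1", "0"]

counts = confusion_counts(y_true_demo, y_pred_demo)
print(counts)

# Precision and recall for the "fake" class follow directly from the counts
precision = counts["TP"] / (counts["TP"] + counts["FP"])
recall = counts["TP"] / (counts["TP"] + counts["FN"])
print(precision, recall)
```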

#### Most frequent words in a WordCloud

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud

word = X.iloc[0, 0] # the first news (X is a DataFrame here, so X[0] would select the whole column)
wc = WordCloud(background_color="black", max_words=3000, max_font_size=256,
               random_state=13, width=1500, height=1500, prefer_horizontal=0.5)
wc.generate(word) # generate() expects a single string
plt.imshow(wc)
plt.axis('off')
plt.show()

#### Checking the model with random Text

In [None]:
def output_lable(n):
    if n == '1':
        return "\n\tFake News (((("
    if n == '0':
        return "\n\tNot A Fake News))))"

def manual_testing(news):
    lnews = LemmSentence(news)

    df = pd.DataFrame([lnews])

    x = df.iloc[:,0]
    x = vectorizer.transform(x)

    x_pred = pac.predict(x)

    # predict() returns an array; pass its single element so the string comparison works
    return output_lable(x_pred[0])


In [None]:
news = 'Turkish hunger striker found guilty of militant links, freedISTANBUL (Reuters) - A Turkish professor who has been on hunger strike since losing her job in a purge following last year s failed coup was convicted on Friday of belonging to a banned far-left group but the court ordered her release pending an appeal. Nuriye Gulmen, 35, was sentenced to six years and three months in jail for being a member of the militant leftist DHKP-C group, deemed a terrorist organization by Turkey, defense lawyers told Reuters. She was found not guilty of lesser charges including organizing illegal rallies.  The literature professor had been hospitalized before the trial began due to her worsening health after seven months of surviving on water, herbal tea and sugar and salt solutions. Primary school teacher Semih Ozakca, 28, who has also been on hunger strike since losing his job in the crackdown, was acquitted on similar charges. The Ankara court had ordered his release on Oct. 21 for the remainder of the trial, on condition that he wear an ankle monitor. Both deny any links to DHKP-C. A third defendant, Acun Karadag, was acquitted on a lesser charge of participating in illegal rallies.  The teachers have said their hunger strike aimed to highlight the plight of some 150,000 state employees   including academics, civil servants, judges and soldiers   suspended or sacked since the abortive coup in July 2016. The pair were detained in May and jailed pending the start of the trial in September. On Sept. 12, days before the teachers were due in court, Turkey issued detention warrants for the lawyers who were set to defend them. Turkish authorities blame the coup attempt on U.S.-based Muslim cleric Fethullah Gulen and his supporters. Gulen condemned the coup and denies involvement. Human rights groups and the European Union have said President Tayyip Erdogan is using the crackdown to stifle dissent in Turkey, an assertion that he denies.'
print(manual_testing(news))

I hope you found this post interesting and useful.

If so, please upvote. Your **upvote** is highly **appreciated**, and any comments are always welcome)))

Thanks!