# Description
### Do you trust all the news you see on social media? Not all of it is real, so how can fake news be detected? One answer is machine learning. This project builds a classifier that separates real news from fake news. Before moving ahead, get familiar with the related terms: fake news, TfidfVectorizer, and PassiveAggressiveClassifier.

# Columns:
### Title: The title of the article
### Text: The text of the article
### Subject: The subject of the article
### Date: The date on which the article was posted

# Problem Statement
### We will use these news records to detect whether a news article is fake or real.

# Constraints:
### The cost of a misclassification can be high: it can cause chaos.

# Benefits:
### Can help prevent the spread of misleading information that may cause political problems.

## Importing Libraries

In [50]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

## Loading dataset

In [19]:
fake_test = pd.read_csv('Fake.csv')
fake_test['label'] = 0 # Label fake news as 0
fake_test = fake_test[['text', 'label']] # Keep only the text and label columns

In [20]:
true_test = pd.read_csv('True.csv')
true_test['label'] = 1 # Label real news as 1
true_test = true_test[['text', 'label']] # Keep only the text and label columns

In [21]:
data = pd.concat([true_test, fake_test]) # Stack real and fake articles into one frame
data = data.sample(frac = 1) # Shuffle the rows so the two classes are mixed
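
The shuffle above uses no seed, so the row order changes on every run. A minimal sketch of a repeatable shuffle, assuming toy frames (`true_demo`, `fake_demo`, and `demo` are hypothetical names, and the rows are made up):

```python
import pandas as pd

# Toy stand-ins for the real and fake frames (hypothetical rows, not from the dataset)
true_demo = pd.DataFrame({'text': ['a', 'b', 'c'], 'label': 1})
fake_demo = pd.DataFrame({'text': ['x', 'y', 'z'], 'label': 0})

# ignore_index=True avoids duplicate index values after concatenation
demo = pd.concat([true_demo, fake_demo], ignore_index=True)

# A fixed random_state makes the shuffle repeatable;
# reset_index(drop=True) renumbers the shuffled rows from 0
demo = demo.sample(frac=1, random_state=100).reset_index(drop=True)

print(demo['label'].value_counts())
```

Whether a fixed seed is wanted depends on the goal; it mainly helps when results need to be compared across runs.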

## Performing simple EDA

In [22]:
data.columns #Checking columns

Index(['text', 'label'], dtype='object')

In [23]:
data.isnull().sum() #Checking for null values

text     0
label    0
dtype: int64

## To solve this problem we will focus on Text and Label columns

In [27]:
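X = data.text
y = data.label
# train_size = 0.30 keeps 30% of the rows for training
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.30, random_state = 100)
# Caution: this call splits X_train again, so X_cv and the new X_test are both
# subsets of the training data and the scores reported below are optimistic
X_cv, X_test, y_cv, y_test = train_test_split(X_train, y_train, train_size = 0.30, random_state = 100)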
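
Because the second split above takes its rows from `X_train`, the CV and test sets overlap the training set. One way to avoid this is to split the held-out portion instead. The sketch below assumes a 70/15/15 ratio and stratified sampling, neither of which comes from this notebook, and uses hypothetical stand-in data (`X_demo`, `y_demo`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with balanced labels
X_demo = np.arange(100)
y_demo = np.array([0, 1] * 50)

# First split: 70% for training, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, train_size=0.70, random_state=100, stratify=y_demo)

# Second split: divide the held-out part (not the training part) into CV and test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, train_size=0.50, random_state=100, stratify=y_hold)

print(len(X_tr), len(X_val), len(X_te))
```

With this layout no sample appears in more than one of the three sets, so CV and test scores measure performance on unseen articles.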

In [None]:
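tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7) # Drop English stop words and terms found in more than 70% of documents
tfidf_train = tfidf_vectorizer.fit_transform(X_train) # Learn the vocabulary and convert the training text to TF-IDF vectors
tfidf_cv = tfidf_vectorizer.transform(X_cv) # Reuse the fitted vocabulary for the CV text
tfidf_test = tfidf_vectorizer.transform(X_test) # Reuse the fitted vocabulary for the test text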
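
To see what `fit_transform` and `transform` do with these settings, here is a small sketch on a made-up corpus (`train_docs`, `new_docs`, and `vec` are hypothetical names, and the sentences are not from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset
train_docs = ['the senate passed the budget vote',
              'shocking secret the senate hides',
              'budget talks continue in the senate']
new_docs = ['secret budget vote shocks aliens']

vec = TfidfVectorizer(stop_words='english', max_df=0.7)
train_matrix = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
new_matrix = vec.transform(new_docs)          # reuses them; unseen words are ignored

# 'senate' occurs in all 3 documents (more than 70%), so max_df = 0.7 drops it;
# 'the' is removed as an English stop word
print(sorted(vec.vocabulary_))
print(train_matrix.shape, new_matrix.shape)
```

Both matrices have the same number of columns because the vocabulary is fixed at fit time, which is why the CV and test text only go through `transform`.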

## Using PassiveAggressiveClassifier

In [42]:
pac = PassiveAggressiveClassifier(max_iter = 50)
pac.fit(tfidf_train, y_train)
cv_pred = pac.predict(tfidf_cv)
score = accuracy_score(y_cv, cv_pred)
print(f'Accuracy: {(score*100)}%')
print(confusion_matrix(y_cv, cv_pred))

Accuracy: 99.97524752475248%
[[2120    0]
 [   1 1919]]


In [26]:
test_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, test_pred)
print(f'Accuracy: {(score*100)}%')
print(confusion_matrix(y_test, test_pred))

Accuracy: 100.0%
[[4956    0]
 [   0 4473]]


## Using Logistic Regression

In [46]:
lr = LogisticRegression()
lr.fit(tfidf_train, y_train)

cv_pred = lr.predict(tfidf_cv)
score = accuracy_score(y_cv, cv_pred)
print(f'Accuracy: {(score*100)}%')
print(confusion_matrix(y_cv, cv_pred))

test_pred = lr.predict(tfidf_test)
score = accuracy_score(y_test, test_pred)
print(f'Accuracy: {(score*100)}%')
print(confusion_matrix(y_test, test_pred))

Accuracy: 98.8118811881188%
[[2098   22]
 [  26 1894]]
Accuracy: 98.72733057588292%
[[4880   76]
 [  44 4429]]
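
The confusion matrices above contain more than the accuracy figure. As a worked example, the sketch below recomputes accuracy and derives precision and recall for the real-news class (label 1) from the logistic regression CV matrix. It relies on scikit-learn's layout, where rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
import numpy as np

# CV confusion matrix reported above for logistic regression
# (rows: true label 0 = fake, 1 = real; columns: predicted label)
cm = np.array([[2098, 22],
               [26, 1894]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the articles predicted real, the fraction that are real
recall = tp / (tp + fn)     # of the real articles, the fraction predicted real

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
```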


## Using KNN

In [51]:
k = list(range(1, 30, 2)) # Odd values of K from 1 to 29

In [52]:
k = {'n_neighbors': k}
k

{'n_neighbors': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29]}

In [None]:
grid = GridSearchCV(KNeighborsClassifier(), param_grid = k, scoring = 'accuracy', refit = True, cv = 10)
grid.fit(tfidf_train, y_train)
cv_pred = grid.predict(tfidf_cv)
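
After fitting, a `GridSearchCV` object exposes the selected setting through `best_params_` and its mean cross-validated score through `best_score_`. A minimal sketch on synthetic data, assuming a smaller grid and `cv = 3` to keep it fast (`X_demo`, `y_demo`, and `search` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the TF-IDF matrix
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 7]},
                      scoring='accuracy', cv=3, refit=True)
search.fit(X_demo, y_demo)

print(search.best_params_)  # the K with the best mean cross-validated accuracy
print(search.best_score_)   # that mean accuracy
```

With `refit = True`, calling `predict` on the search object uses a model retrained on all the fitting data with the selected K.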

In [49]:
for n in range(1, 30, 2): # Try each odd K from 1 to 29
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(tfidf_train, y_train)
    cv_pred = knn.predict(tfidf_cv)
    score = accuracy_score(y_cv, cv_pred)
    print('Accuracy: ',(score*100), 'K: ', n)
    print(confusion_matrix(y_cv, cv_pred))

Accuracy:  99.97524752475248 K:  1
[[2120    0]
 [   1 1919]]
Accuracy:  68.8118811881188 K:  3
[[2102   18]
 [1242  678]]
Accuracy:  62.227722772277225 K:  5
[[2105   15]
 [1511  409]]
Accuracy:  59.65346534653465 K:  7
[[2112    8]
 [1622  298]]
Accuracy:  57.89603960396039 K:  9
[[2112    8]
 [1693  227]]
Accuracy:  56.78217821782178 K:  11
[[2115    5]
 [1741  179]]
Accuracy:  56.188118811881196 K:  13
[[2116    4]
 [1766  154]]
Accuracy:  55.693069306930695 K:  15
[[2118    2]
 [1788  132]]
Accuracy:  55.02475247524753 K:  17
[[2118    2]
 [1815  105]]
Accuracy:  54.579207920792086 K:  19
[[2118    2]
 [1833   87]]
Accuracy:  54.10891089108911 K:  21
[[2118    2]
 [1852   68]]
Accuracy:  53.83663366336634 K:  23
[[2118    2]
 [1863   57]]
Accuracy:  53.71287128712871 K:  25
[[2118    2]
 [1868   52]]
Accuracy:  53.490099009900995 K:  27
[[2118    2]
 [1877   43]]
Accuracy:  53.31683168316832 K:  29
[[2118    2]
 [1884   36]]
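
One possible reading of the K = 1 row: the CV set was split out of `X_train`, so every CV article is also a training point, and a 1-nearest-neighbour model finds that article itself at distance zero. The sketch below shows the effect on synthetic, deliberately noisy data (`X_demo`, `y_demo`, and `knn_demo` are hypothetical names):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data with 30% of the labels flipped at random
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     flip_y=0.3, random_state=0)

# Fit 1-NN on all points, then score it on a subset of those same points
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo)
subset_score = accuracy_score(y_demo[:50], knn_demo.predict(X_demo[:50]))

print(subset_score)
```

Even with noisy labels the score on already-seen points is perfect, so the K = 1 accuracy above says little about unseen articles.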


In [58]:
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(tfidf_train, y_train)
test_pred = knn.predict(tfidf_test)
score = accuracy_score(y_test, test_pred)
print(f'Accuracy: {(score*100)}%')
print(confusion_matrix(y_test, test_pred))

Accuracy: 62.05324000424223%
[[4932   24]
 [3554  919]]
