##### Fake News Detection 

This project builds a model that classifies a piece of news as REAL or FAKE.

In this project:
  
1. We build a **TfidfVectorizer** on our dataset.
2. We build the **PassiveAggressiveClassifier** model.
3. We evaluate our model using the **accuracy score** and the **confusion matrix**.

**TfidfVectorizer:** It converts a collection of raw documents into a matrix of TF-IDF features.

**TF(Term Frequency):** The number of times a word appears in a document is called Term Frequency.

**IDF (Inverse Document Frequency):** A word that occurs many times in one document but also occurs in many other documents may carry little information. IDF measures how significant a term is across the entire corpus: the fewer documents a term appears in, the higher its IDF.
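
To make the two definitions concrete, here is a minimal sketch that computes TF-IDF by hand on a tiny made-up corpus. It uses the textbook form `tf * log(N / df)`; scikit-learn's TfidfVectorizer uses a smoothed IDF by default and normalizes each row, so its numbers will differ. The corpus and the function name `tf_idf` are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from news.csv).
docs = [
    "the senate passed the bill",
    "the bill shocked the senate",
    "aliens endorsed the bill",
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Words that appear in every document ("the", "bill") get a weight of 0, while a word confined to a single document ("aliens") gets the highest weight.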




*Let us begin with importing the required libraries*

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

*Let us read the data as a dataframe and understand it*

In [2]:
news=pd.read_csv('news.csv')
print(news)

      Unnamed: 0                                              title  \
0           8476                       You Can Smell Hillary’s Fear   
1          10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2           3608        Kerry to go to Paris in gesture of sympathy   
3          10142  Bernie supporters on Twitter erupt in anger ag...   
4            875   The Battle of New York: Why This Primary Matters   
...          ...                                                ...   
6330        4490  State Department says it can't find emails fro...   
6331        8062  The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...   
6332        8622  Anti-Trump Protesters Are Tools of the Oligarc...   
6333        4021  In Ethiopia, Obama seeks progress on peace, se...   
6334        4330  Jeb Bush Is Suddenly Attacking Trump. Here's W...   

                                                   text label  
0     Daniel Greenfield, a Shillman Journalism Fello...  FAKE  
1     Google Pinter

In [3]:
print(news.shape)

(6335, 4)


In [4]:
print(news.describe())

         Unnamed: 0
count   6335.000000
mean    5280.415627
std     3038.503953
min        2.000000
25%     2674.500000
50%     5271.000000
75%     7901.000000
max    10557.000000


In [5]:
labels=news.label
print(labels)

0       FAKE
1       FAKE
2       REAL
3       FAKE
4       REAL
        ... 
6330    REAL
6331    FAKE
6332    FAKE
6333    REAL
6334    REAL
Name: label, Length: 6335, dtype: object


*Let us split our dataset for training and testing*

In [6]:
# Hold out 20% of the articles for testing; random_state fixes the shuffle
xtrain, xtest, ytrain, ytest = train_test_split(news['text'], labels, test_size=0.2, random_state=1)

*Now we build a TfidfVectorizer that drops English stop words (our data is in English) and ignores terms that appear in more than 70% of the documents (`max_df=0.7`). We fit and transform the training data, then only transform the test data.*

In [7]:
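# Drop English stop words and terms that appear in more than 70% of documents
tfidf_vector = TfidfVectorizer(stop_words='english', max_df=0.7)
# Learn the vocabulary and IDF weights from the training text only
tfidf_train = tfidf_vector.fit_transform(xtrain)
# Reuse the learned vocabulary and weights on the test text
tfidf_test = tfidf_vector.transform(xtest)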
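
The sketch below illustrates, in plain Python, what `stop_words`, `max_df`, and the fit/transform split do conceptually: the vocabulary is learned from the training texts only, stop words and terms found in more than `max_df` of the training documents are dropped, and test-time words outside that vocabulary are ignored. The tiny stop-word set, the toy corpus, and the helper names `build_vocabulary` and `keep_known` are assumptions for illustration; this is not scikit-learn's implementation.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "in"}  # tiny stand-in for the English stop list

def build_vocabulary(train_docs, max_df=0.7):
    """Learn the vocabulary from training documents only."""
    tokenized = [set(doc.lower().split()) for doc in train_docs]
    df = Counter(term for tokens in tokenized for term in tokens)
    limit = max_df * len(train_docs)  # maximum allowed document count
    return sorted(
        term for term, count in df.items()
        if term not in STOP_WORDS and count <= limit
    )

def keep_known(doc, vocabulary):
    """Mimic transform(): words outside the learned vocabulary are ignored."""
    known = set(vocabulary)
    return [t for t in doc.lower().split() if t in known]

train = [
    "the election results shocked voters",
    "voters doubt the election poll",
    "the election of a new mayor",
    "a poll in the city",
]
vocab = build_vocabulary(train, max_df=0.7)
print(vocab)
print(keep_known("the mayor rigged the election poll", vocab))
```

Here "election" occurs in three of the four training documents, which exceeds the 70% limit, so it is excluded from the vocabulary. "rigged" never occurs in the training data, so it is dropped from the test sentence. Fitting on the training data alone keeps information about the test set from leaking into the features.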

*Next, let us build our PassiveAggressiveClassifier model on the vectorized data.*

In [8]:
model=PassiveAggressiveClassifier(max_iter=20).fit(tfidf_train,ytrain)
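
For intuition, the sketch below shows a passive-aggressive update rule (the PA-I variant with hinge loss) on dense toy vectors. The model stays passive when an example is already classified with a margin of at least 1; otherwise it shifts the weights just enough to correct the mistake, with the step capped by `C`. The function name `pa_update`, the toy data, and the `+1`/`-1` label encoding are assumptions for illustration; scikit-learn's implementation works on sparse matrices and differs in its details.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step. w and x are lists of floats; y is +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss
    if loss == 0.0:
        return w                           # passive: margin already satisfied
    norm_sq = sum(xi * xi for xi in x)
    if norm_sq == 0.0:
        return w                           # all-zero input carries no signal
    tau = min(C, loss / norm_sq)           # aggressive step, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Toy data: two features, label +1 (say REAL) or -1 (say FAKE).
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 0.2], 1), ([0.1, 1.0], -1)]
w = [0.0, 0.0]
for _ in range(5):                         # a few passes over the data
    for x, y in samples:
        w = pa_update(w, x, y)

predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
               for x, _ in samples]
print(w, predictions)
```

Because each example is processed one at a time, this family of algorithms suits large text collections where the data arrives as a stream.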

In [9]:
ypred=model.predict(tfidf_test)

*Finally, let us evaluate our model using the accuracy score and the confusion matrix*

In [10]:
confusion_mat=confusion_matrix(ytest,ypred)
print(confusion_mat)

[[615  36]
 [ 33 583]]


In [11]:
accuracy=accuracy_score(ytest,ypred)
print(accuracy)

0.9455406471981057
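
As a cross-check, the accuracy can be recomputed directly from the confusion matrix printed earlier. The sketch assumes scikit-learn's default layout, where rows are true labels and columns are predicted labels, both in sorted order (FAKE, then REAL); the variable names are illustrative.

```python
# Confusion matrix from the run above: rows = true, columns = predicted,
# label order assumed to be ["FAKE", "REAL"] (sorted label order).
confusion = [[615, 36],
             [33, 583]]

total = sum(sum(row) for row in confusion)      # test-set size
correct = confusion[0][0] + confusion[1][1]     # diagonal = correct predictions
accuracy = correct / total

# Per-class recall: share of each true class that was predicted correctly.
recall_fake = confusion[0][0] / sum(confusion[0])
recall_real = confusion[1][1] / sum(confusion[1])

print(total, correct, round(accuracy, 4))
print(round(recall_fake, 4), round(recall_real, 4))
```

The four cells sum to 1,267 articles, which is the 20% test split of the 6,335 rows, and 1,198 of them lie on the diagonal. Under the assumed label order, 36 fake articles were predicted as real and 33 real articles were predicted as fake.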


*Thus our model achieves an accuracy of about 94.6% on the test set.*