<a href="https://colab.research.google.com/github/navdeep-github/Python_Projects/blob/master/FakeVsRealNewsDetectionProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FAKE AND REAL NEWS DETECTION PROJECT**






### Importing Required Libraries



In [97]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer    # Converts a collection of raw documents to a matrix of TF-IDF (term frequency-inverse document frequency) features
from sklearn.linear_model import PassiveAggressiveClassifier   # Passive-aggressive algorithms are a family of online algorithms for large-scale learning
from sklearn.metrics import accuracy_score, confusion_matrix   # accuracy_score: fraction of predictions that match y_true; confusion_matrix: counts of true vs. predicted labels


### Read dataset into Pandas dataframe

In [98]:
news_dataset = pd.read_csv('news_dataset.csv')  # Read dataset into pandas dataframe using read_csv function

news_dataset.describe()     # Summary statistics; by default describe() covers only the numeric columns


Unnamed: 0.1,Unnamed: 0
count,6335.0
mean,5280.415627
std,3038.503953
min,2.0
25%,2674.5
50%,5271.0
75%,7901.0
max,10557.0
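
`describe()` summarizes only numeric columns by default, which is why the output above covers just the index-like `Unnamed: 0` column. A minimal sketch of how the text columns and the class balance could be inspected as well; the small frame below is made up for illustration and is not taken from `news_dataset.csv`:

```python
import pandas as pd

# Toy stand-in for news_dataset; these rows are invented for illustration.
toy = pd.DataFrame({
    "Unnamed: 0": [10, 11, 12, 13],
    "title": ["a", "b", "c", "d"],
    "text": ["first story", "second story", "third story", "fourth story"],
    "label": ["FAKE", "REAL", "REAL", "FAKE"],
})

# include="all" adds count/unique/top/freq rows for the non-numeric columns.
summary = toy.describe(include="all")
print(summary)

# value_counts() shows how many rows belong to each class.
label_counts = toy["label"].value_counts()
print(label_counts)
```

Checking the label counts early is a cheap way to see whether plain accuracy is a reasonable metric for the dataset.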


In [99]:
news_dataset.shape   # Size of the DataFrame (not displayed here: a notebook cell only shows its last expression)

news_dataset.head()  # Showing first five rows(by default) of dataset using head function

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [100]:
X = news_dataset.iloc[:,2].values   # Extracting the 'text' column from the dataset into X
y = news_dataset.iloc[:,-1].values  # Extracting the dependent variable 'label' into y
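
Positional indexing with `iloc` depends on the column order staying the same. As a hedged alternative, the same columns can be selected by name; the sketch below uses a made-up frame with the same column layout as the `head()` output above, not the real dataset:

```python
import pandas as pd

# Toy frame with the same column order as news_dataset (rows are invented).
toy = pd.DataFrame({
    "Unnamed: 0": [10, 11, 12],
    "title": ["t1", "t2", "t3"],
    "text": ["body one", "body two", "body three"],
    "label": ["FAKE", "REAL", "FAKE"],
})

# Positional selection, as in the cell above.
X_by_position = toy.iloc[:, 2].values

# Selection by column name gives the same values and survives column reordering.
X_by_name = toy["text"].values
y_by_name = toy["label"].values

print(list(X_by_name), list(y_by_name))
```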



### Splitting Dataset into training and test sets

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state = 6) # Splitting dataset into training and testing datasets

print(X_train)


['Hillary Clinton sought to minimize new disclosures that top secret\xa0 government information passed through the private email server she used when she was secretary of state, dismissing the controversy as an “inter-agency dispute” that\xa0 pales next to the larger issues on the minds of voters.\n\nIn an interview with NBC News on Saturday morning, two days before the Iowa caucuses, Mrs. Clinton said, “It’s the same story that’s been going on for months now. And I just don’t think most people are as concerned about that as they are about what we’re going to do to get the economy going.”'
 'Washington (CNN) Vermont Sen. Bernie Sanders on Saturday said he supports Democratic National Committee Chairwoman Debbie Wasserman Schultz\'s Democratic opponent in her August 30 primary, adding that if he is elected president, he would effectively terminate her chairmanship of the DNC.\n\nSanders, whose campaign has engaged in an increasingly bitter feud with the DNC chairwoman during his preside

In [103]:
print(X_test)

['The 2016 presidential election is shaping up as an unpopularity contest of unprecedented proportions.\n\nAssuming, as now appears most likely, that Hillary Clinton will win the Democratic nomination and that either Donald Trump or Ted Cruz becomes the Republican nominee, the general-election ballot is set to feature a choice between two candidates more negatively viewed than any major-party nominee in the history of polling.\n\nTrump is, by far, the furthest underwater: The latest Wall Street Journal-NBC poll puts his net favorability rating at minus-41. A breathtaking 65 percent of registered voters see him negatively, versus 24 percent with a positive view, making him the most unpopular major party presidential candidate ever recorded. Cruz is at minus-23, with 49\xa0percent viewing him negatively, 26 percent in a positive light.\n\nTo underscore the challenge facing the GOP, neither candidate has been viewed more positively than negatively by voters since the start of the campaign

In [104]:
print(y_train)

['REAL' 'REAL' 'FAKE' ... 'FAKE' 'FAKE' 'FAKE']


In [105]:
print(y_test)

['REAL' 'REAL' 'REAL' ... 'FAKE' 'REAL' 'REAL']
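
With 6335 rows and `test_size = 0.25`, the split leaves 4751 rows for training and 1584 for testing. The sketch below reproduces those sizes on synthetic data. The `stratify` argument is an addition that the notebook does not use; it is shown only as one way to keep the FAKE/REAL ratio the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6335  # row count reported by describe() earlier in the notebook

# Synthetic stand-ins: row ids as features, alternating labels.
X_demo = np.arange(n_rows)
y_demo = np.where(X_demo % 2 == 0, "FAKE", "REAL")

# stratify=y_demo is an assumption for illustration, not part of the original split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=6, stratify=y_demo
)

print(len(X_tr), len(X_te))  # 4751 1584
```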


### Fit and transform the vectorizer 

In [106]:
tdidf_obj = TfidfVectorizer(stop_words='english') # Creating TfidfVectorizer object (English stop words removed) to turn text into TF-IDF vectors

tdidf_train = tdidf_obj.fit_transform(X_train)                 # Learning the vocabulary and IDF weights from the training text, then vectorizing it

tdidf_test = tdidf_obj.transform(X_test)                        # Vectorizing the test text with the vocabulary learned from the training text
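
The vectorizer is fitted on the training text only, and the test text is transformed with that same vocabulary. A small sketch of what this means in practice, using made-up sentences rather than the news data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example documents.
train_docs = ["the senate passed the budget", "aliens endorse the budget"]
test_docs = ["the senate meets aliens tomorrow"]

vectorizer = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the training docs.
train_matrix = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; words unseen in training are ignored.
test_matrix = vectorizer.transform(test_docs)

print(train_matrix.shape, test_matrix.shape)
print("tomorrow" in vectorizer.vocabulary_)
```

Both matrices end up with the same number of columns, which is what lets a classifier trained on one score the other.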

In [107]:
print(tdidf_train)

  (0, 17250)	0.13199964210096007
  (0, 11737)	0.14679778769868168
  (0, 39113)	0.06286583064797176
  (0, 52595)	0.08330263325895607
  (0, 16364)	0.08158188283651771
  (0, 28802)	0.06642358974414087
  (0, 34650)	0.10630667459943405
  (0, 22678)	0.24213989809313427
  (0, 50314)	0.11775015641568287
  (0, 45737)	0.05853532433531772
  (0, 34976)	0.18096344138299011
  (0, 9599)	0.16995283080345885
  (0, 27712)	0.13740904813151972
  (0, 13953)	0.09930336749088668
  (0, 34741)	0.1258609334610067
  (0, 46083)	0.1382319132130149
  (0, 36011)	0.08227105065138604
  (0, 35695)	0.15971642316583848
  (0, 27505)	0.12135854963673877
  (0, 56527)	0.10095010573690297
  (0, 33953)	0.18004376018137952
  (0, 27935)	0.10921375876799505
  (0, 30266)	0.15402743438144637
  (0, 38291)	0.26973767390164416
  (0, 15934)	0.18884106620523128
  :	:
  (4750, 3578)	0.03291073441696575
  (4750, 33354)	0.0532713940066702
  (4750, 52872)	0.02623068845568102
  (4750, 57207)	0.04649963333534945
  (4750, 41476)	0.064856948293

In [108]:
print(tdidf_test)

  (0, 58084)	0.10490196444587989
  (0, 57639)	0.058033807138333136
  (0, 57638)	0.03313397924467706
  (0, 57575)	0.023414767977566368
  (0, 57011)	0.050491461328057284
  (0, 56978)	0.017015999821753794
  (0, 56725)	0.027488898312367664
  (0, 56527)	0.13173046414760706
  (0, 56466)	0.04829983583491358
  (0, 56211)	0.10201357793817131
  (0, 56206)	0.11538341005160205
  (0, 56204)	0.08307509536444964
  (0, 56064)	0.04231182529199775
  (0, 55904)	0.02107081074601598
  (0, 55396)	0.045806290631788
  (0, 55134)	0.0723687677612936
  (0, 55132)	0.1780324037692727
  (0, 55131)	0.08846504778746107
  (0, 55092)	0.0639257680403735
  (0, 55016)	0.02606246735950871
  (0, 54986)	0.03765097129803996
  (0, 54866)	0.049325210254193716
  (0, 54841)	0.16631391196612896
  (0, 54839)	0.04566342503733216
  (0, 54739)	0.0586637473396429
  :	:
  (1582, 2217)	0.04478953310431864
  (1582, 2094)	0.0285230792830654
  (1582, 952)	0.027057119010390467
  (1582, 835)	0.032276555672628784
  (1582, 816)	0.04515185722502

### Train the classifier, make predictions, and calculate the accuracy score

In [109]:

# Creating PassiveAggressiveClassifier object with a maximum of 50 passes over the training data
classifier_pac = PassiveAggressiveClassifier(max_iter=50)

# Training the classifier on the vectorized training set and the training labels
classifier_pac.fit(tdidf_train, y_train)

# Predicting labels for the vectorized test set 'tdidf_test'
y_prediction = classifier_pac.predict(tdidf_test)

# Calculating the accuracy score by comparing the predicted labels 'y_prediction' with the true labels 'y_test'
algo_score = accuracy_score(y_test, y_prediction)


print('Accuracy score of PassiveAggressiveClassifier is ---- ', algo_score*100)

Accuracy score of PassiveAggressiveClassifier is ----  93.93939393939394
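
For intuition about the name: a passive-aggressive learner leaves its weights alone when a sample is already classified with enough margin (passive) and otherwise moves them just far enough to fix the mistake (aggressive). The sketch below is the textbook PA-I update written in NumPy, not scikit-learn's implementation; the function name `pa_update`, the cap `C`, and the encoding of labels as -1/+1 are assumptions made for illustration:

```python
import numpy as np

def pa_update(w, x, y, C=1.0):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this sample
    if loss == 0.0:
        return w                              # passive: margin already satisfied
    tau = min(C, loss / np.dot(x, x))         # aggressive step size, capped by C
    return w + tau * y * x

x = np.array([1.0, 2.0])

# Starting from zero weights, the sample violates the margin, so w moves.
w = pa_update(np.zeros(2), x, y=+1)
print(w, np.dot(w, x))

# A sample that already has margin >= 1 leaves the weights unchanged.
w_confident = np.array([1.0, 1.0])
w_same = pa_update(w_confident, x, y=+1)
```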


In [110]:
confusion_matrix(y_test,y_prediction)

array([[746,  44],
       [ 52, 742]])
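
`confusion_matrix` puts true labels on the rows and predicted labels on the columns, ordered by sorted label, so index 0 is `FAKE` and index 1 is `REAL` here. A short sketch that recomputes the accuracy from the matrix printed above and derives precision and recall for the `FAKE` class (the variable names are illustrative):

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[746, 44],
               [52, 742]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Of all truly FAKE articles (row 0), the share predicted FAKE.
fake_recall = cm[0, 0] / cm[0].sum()

# Of all articles predicted FAKE (column 0), the share that are truly FAKE.
fake_precision = cm[0, 0] / cm[:, 0].sum()

print(f"accuracy={accuracy:.4f} recall={fake_recall:.4f} precision={fake_precision:.4f}")
```

The accuracy recovered this way, 1488 correct out of 1584, matches the 93.94% reported by `accuracy_score`.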