# All the steps to be followed
* **Problem Definition** - We want to predict whether a given news article is real or fake.

* **Data** - Check whether the dataset has any missing values, and whether any categorical data needs to be transformed.

* **Features** - Feature dictionary:
    * URLs     - The source of the news, i.e. where the article came from.
    * Headline - The headline at the top of the article.
    * Body     - The full text of the article.
    * Fake     - The target (named `Label` in the CSV): 1 = genuine, 0 = fake.
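
The two checks named under **Data** can be sketched on a toy frame. This is only an illustration: the rows below are made up, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical toy frame; only the column names match the real dataset.
toy = pd.DataFrame({
    "URLs": ["http://example.com/a", "http://example.com/b", "http://example.com/c"],
    "Headline": ["First headline", "Second headline", "Third headline"],
    "Body": ["Some article text", None, "More article text"],
    "Label": [1, 0, 1],
})

# Check 1: count missing values per column.
missing_per_column = toy.isna().sum()

# Check 2: list the non-numeric columns, which must be turned into
# numbers before a model can use them.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(text_columns)
```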


In [1]:
import pandas as pd
import numpy as np

In [2]:
fake_data=pd.read_csv("Fake_news_detection_dataset.csv")
fake_data.head()


Unnamed: 0,URLs,Headline,Body,Label
0,http://www.bbc.com/news/world-us-canada-414191...,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1
1,https://www.reuters.com/article/us-filmfestiva...,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1
2,https://www.nytimes.com/2017/10/09/us/politics...,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1
3,https://www.reuters.com/article/us-mexico-oil-...,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1
4,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1


In [3]:
fake_data.rename(columns={"Label": "Fake"},inplace=True)

In [4]:
fake_data.describe()

Unnamed: 0,Fake
count,4009.0
mean,0.466949
std,0.498969
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0


In [5]:
fake_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4009 entries, 0 to 4008
Data columns (total 4 columns):
URLs        4009 non-null object
Headline    4009 non-null object
Body        3988 non-null object
Fake        4009 non-null int64
dtypes: int64(1), object(3)
memory usage: 125.4+ KB


In [6]:
fake_data.isna().sum()

URLs         0
Headline     0
Body        21
Fake         0
dtype: int64

In [7]:
fake_data.Body[0]

'Image copyright Getty Images\nOn Sunday morning, Donald Trump went off on a Twitter tirade against a member of his own party.\nThis, in itself, isn\'t exactly huge news. It\'s far from the first time the president has turned his rhetorical cannons on his own ranks.\nThis time, however, his attacks were particularly biting and personal. He essentially called Tennessee Senator Bob Corker, the chair of the powerful Senate Foreign Relations Committee, a coward for not running for re-election.\nHe said Mr Corker "begged" for the president\'s endorsement, which he refused to give. He wrongly claimed that Mr Corker\'s support of the Iranian nuclear agreement was his only political accomplishment.\nUnlike some of his colleagues, Mr Corker - free from having to worry about his immediate political future - didn\'t hold his tongue.\nSkip Twitter post by @SenBobCorker It\'s a shame the White House has become an adult day care center. Someone obviously missed their shift this morning. — Senator Bo

In [8]:
# The Body column has missing values that cannot be filled in meaningfully, so we simply drop the rows where Body is missing.
fake_data.dropna(inplace=True)

In [9]:
fake_data.isna().sum()

URLs        0
Headline    0
Body        0
Fake        0
dtype: int64
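
Here only `Body` has missing values, so a plain `dropna()` removes exactly the rows intended. If other columns could also contain gaps, restricting the drop to `Body` makes the intent explicit. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical toy frame with one missing Headline and one missing Body.
toy = pd.DataFrame({
    "Headline": ["A", None, "C", "D"],
    "Body": ["text a", "text b", None, "text d"],
    "Fake": [1, 0, 1, 0],
})

# Drop only the rows whose Body is missing; the row with a missing Headline stays.
body_only = toy.dropna(subset=["Body"])

# Default behaviour: drop rows with a missing value in any column.
any_column = toy.dropna()

print(len(body_only), len(any_column))  # 3 2
```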

In [10]:
# Remove the URLs column because we don't need it here.
fake_data=fake_data.drop("URLs",axis=1)


In [11]:
fake_data.head()

Unnamed: 0,Headline,Body,Fake
0,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...,1
1,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed...",1
2,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...,1
3,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...,1
4,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin...",1


In [12]:
# There are a lot of rows; since we only want to learn and practice, we use just the first 1000 rows.
fake_data=fake_data[0:1000]

In [13]:
fake_data.shape

(1000, 3)

In [14]:
# The first ":" selects all rows (samples); ":-1" selects every column except the last one (Fake).
x=fake_data.iloc[:,:-1]  # Without .values, x is a DataFrame; the x[:,1] indexing used later needs a NumPy array.
y=fake_data.iloc[:,-1]

In [15]:
x

Unnamed: 0,Headline,Body
0,Four ways Bob Corker skewered Donald Trump,Image copyright Getty Images\nOn Sunday mornin...
1,Linklater's war veteran comedy speaks to moder...,"LONDON (Reuters) - “Last Flag Flying”, a comed..."
2,Trump’s Fight With Corker Jeopardizes His Legi...,The feud broke into public view last week when...
3,Egypt's Cheiron wins tie-up with Pemex for Mex...,MEXICO CITY (Reuters) - Egypt’s Cheiron Holdin...
4,Jason Aldean opens 'SNL' with Vegas tribute,"Country singer Jason Aldean, who was performin..."
...,...,...
1003,Puckin Hostile Shoutcast - Episode 90,Puckin Hostile Shoutcast – Episode 90\n% of re...
1004,John Green Tells a Story of Emotional Pain and...,"“Turtles All the Way Down,” published on Tuesd..."
1005,Disney's Tiffany Thornton defends remarrying t...,(CNN) A former Disney Channel star has struck ...
1007,North Korean Leader Hails Nuclear Arsenal as ‘...,"Photo\nSEOUL, South Korea — The North Korean l..."


In [16]:
y

0       1
1       1
2       1
3       1
4       1
       ..
1003    0
1004    1
1005    1
1007    1
1008    1
Name: Fake, Length: 1000, dtype: int64
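
The index above ends at 1008 although only 1000 rows remain. This is most likely because `dropna` removed rows without renumbering the index, and the `[0:1000]` slice is positional. The toy example below (made-up data) shows the same effect and one way to renumber, should contiguous labels be wanted:

```python
import pandas as pd

# Hypothetical toy frame with two missing Body values.
toy = pd.DataFrame({"Body": ["a", None, "c", "d", None, "f"],
                    "Fake": [1, 0, 1, 0, 1, 0]})

cleaned = toy.dropna()

# Positional slice: takes the first three remaining rows and keeps their old labels.
first_three = cleaned[0:3]
print(first_three.index.tolist())   # [0, 2, 3]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = first_three.reset_index(drop=True)
print(renumbered.index.tolist())    # [0, 1, 2]
```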

In [17]:
x=fake_data.iloc[:,:-1].values
y=fake_data.iloc[:,-1].values

In [18]:
x[0]

array(['Four ways Bob Corker skewered Donald Trump',
       'Image copyright Getty Images\nOn Sunday morning, Donald Trump went off on a Twitter tirade against a member of his own party.\nThis, in itself, isn\'t exactly huge news. It\'s far from the first time the president has turned his rhetorical cannons on his own ranks.\nThis time, however, his attacks were particularly biting and personal. He essentially called Tennessee Senator Bob Corker, the chair of the powerful Senate Foreign Relations Committee, a coward for not running for re-election.\nHe said Mr Corker "begged" for the president\'s endorsement, which he refused to give. He wrongly claimed that Mr Corker\'s support of the Iranian nuclear agreement was his only political accomplishment.\nUnlike some of his colleagues, Mr Corker - free from having to worry about his immediate political future - didn\'t hold his tongue.\nSkip Twitter post by @SenBobCorker It\'s a shame the White House has become an adult day care center. Som

In [19]:
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer(max_features=5000)
mat_body=cv.fit_transform(x[:,1]).todense()

In [20]:
mat_body[:10]

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [1, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 1, 0, ..., 0, 0, 0]], dtype=int64)

In [21]:
mat_head=cv.fit_transform(x[:,0]).todense()

In [22]:
mat_head[:2]

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [23]:
mat_head.shape

(1000, 3491)

In [24]:
mat_body.shape

(1000, 5000)
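
Calling `fit_transform` a second time on the same `cv` replaces the vocabulary learned from the bodies with the one learned from the headlines. That does not affect the matrices already computed, but it means `cv` can no longer encode a new body. One possible alternative, sketched on made-up text with the hypothetical names `cv_head` and `cv_body`, keeps one vectorizer per column:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts standing in for the Headline and Body columns.
headlines = ["Trump attacks senator", "Film opens festival", "Oil deal signed"]
bodies = ["The president attacked a senator on Sunday.",
          "A new comedy film opened the festival in London.",
          "An oil company signed a deal in Mexico."]

# One vectorizer per text column, so each keeps its own vocabulary.
cv_head = CountVectorizer(max_features=5000)
cv_body = CountVectorizer(max_features=5000)

mat_head = cv_head.fit_transform(headlines)
mat_body = cv_body.fit_transform(bodies)

# Unseen text is encoded with transform(), which reuses the fitted vocabulary.
new_head = cv_head.transform(["Senator signs oil deal"])
new_body = cv_body.transform(["The senator signed a deal on Sunday."])

print(mat_head.shape, new_head.shape)
print(mat_body.shape, new_body.shape)
```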

In [25]:
x_mat=np.hstack((mat_head,mat_body))  # Stack mat_head and mat_body side by side (column-wise): 3491 + 5000 = 8491 features per row.

In [26]:
x_mat

matrix([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [27]:
x_mat.shape

(1000, 8491)
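
The stacked matrix is almost entirely zeros, so it can also be kept sparse instead of being converted with `.todense()`. The sketch below uses two small made-up matrices (`head_counts` and `body_counts` are illustrative names) to show the column-wise stacking with `scipy.sparse.hstack`. Tree classifiers in scikit-learn accept sparse input, and recent scikit-learn releases may reject `numpy.matrix` objects, so this form can be the safer choice.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical small count matrices standing in for mat_head and mat_body.
head_counts = csr_matrix(np.array([[1, 0, 2], [0, 1, 0]]))
body_counts = csr_matrix(np.array([[0, 3], [4, 0]]))

# Column-wise stacking: rows stay aligned, column counts add up (3 + 2 = 5).
combined = hstack([head_counts, body_counts]).tocsr()

print(combined.shape)      # (2, 5)
print(combined.toarray())
```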

In [28]:
y.shape

(1000,)

In [29]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_mat,y,test_size=0.2,random_state=0)
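
In the cells above the vocabulary is learned from all 1000 rows before the split, so words that occur only in test articles still shape the feature columns. A stricter variant, sketched here on made-up sentences, splits the raw text first and fits the vectorizer on the training part only. The `stratify` argument is an addition not used above; it keeps the class proportions equal in both parts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 1 = genuine, 0 = fake.
texts = ["real news about oil", "fake story about aliens",
         "real report on election", "fake claim on vaccines",
         "real coverage of film", "fake rumor of scandal",
         "real article on trade", "fake post about moon"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, keeping the class balance with stratify.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Learn the vocabulary from the training texts only,
# then encode the test texts with that same vocabulary.
cv = CountVectorizer()
x_train = cv.fit_transform(text_train)
x_test = cv.transform(text_test)

print(x_train.shape, x_test.shape)
```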

In [30]:
from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier(criterion="entropy")
dtc.fit(x_train,y_train)
y_pred_dtr=dtc.predict(x_test)

In [31]:
y_pred_dtr

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1], dtype=int64)

In [32]:
y_test

array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1], dtype=int64)

In [33]:
y_pred_dtr.shape

(200,)

In [34]:
y_test.shape

(200,)

In [35]:
x_test.shape

(200, 8491)

In [36]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred_dtr)

array([[99,  7],
       [ 6, 88]], dtype=int64)

In [40]:
(99+88)/(99+7+6+88)

0.935
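
The same arithmetic can be read directly off the confusion matrix, along with precision and recall for class 1 (genuine). The sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order [0, 1].

```python
import numpy as np

# Confusion matrix reported above: rows = true labels, columns = predictions.
cm = np.array([[99, 7],
               [6, 88]])

# For a 2x2 matrix in label order [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the articles predicted as class 1, the share that truly are class 1
recall = tp / (tp + fn)      # of the true class-1 articles, the share that were predicted as class 1

print(accuracy)
print(round(precision, 3), round(recall, 3))
```

With the training arrays available, `sklearn.metrics.accuracy_score(y_test, y_pred_dtr)` should give the same accuracy without the manual sum.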