In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('data.csv')

In the given Kaggle dataset, Label = 1 means real news and Label = 0 means fake news.

- Preprocessing:

In [10]:
print("Features in the current dataset:", data.columns)

Features in the current dataset: Index(['URLs', 'Headline', 'Body', 'Label'], dtype='object')


Label is the binary target, not a feature. URLs can be treated as a categorical feature; Headline and Body are text data. There are no continuous features, so no preprocessing for continuous features is needed. Since URLs can be considered categorical data, we can use target encoding for it in the pipeline.
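As a rough illustration of what target encoding does to a categorical column, the sketch below replaces each URL with the mean label observed for that URL. The toy frame and the `URLs_encoded` column name are hypothetical, and this is the plain unsmoothed form; library encoders typically also blend each category mean with the global mean.

```python
import pandas as pd

# Hypothetical toy data: one URL appears twice with different labels
toy = pd.DataFrame({
    "URLs": ["a.example", "a.example", "b.example", "c.example"],
    "Label": [1, 0, 0, 1],
})

# Mean label per category, then map it back onto every row
category_means = toy.groupby("URLs")["Label"].mean()
toy["URLs_encoded"] = toy["URLs"].map(category_means)

print(toy)
```

In a real pipeline the means must be computed on the training fold only, otherwise the encoding leaks the target into validation data.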

In [11]:
#exploration of data
data.nunique(axis=0)

URLs        3352
Headline    2831
Body        2863
Label          2
dtype: int64

Every column contains repeated (non-unique) values.

In [12]:
len(data)

4009

In [14]:
len(data)-len(data.drop_duplicates())

0

However, there are no fully duplicated rows.
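The difference between repeated column values and duplicated rows can be sketched on a toy frame (hypothetical data, not taken from data.csv): two rows share a URL and headline, yet no row is a full duplicate because the bodies differ.

```python
import pandas as pd

# Hypothetical toy data mirroring the dataset's column names
toy = pd.DataFrame({
    "URLs": ["u1", "u1", "u2"],
    "Headline": ["h1", "h1", "h2"],
    "Body": ["b1", "b2", "b3"],
    "Label": [0, 0, 1],
})

# Rows identical across ALL columns
full_row_duplicates = toy.duplicated().sum()

# Rows whose URL also appears in some other row
url_duplicates = toy.duplicated("URLs", keep=False).sum()

print(full_row_duplicates, url_duplicates)
```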

In [23]:
#inspect rows whose URL also appears in at least one other row
df_sameURLs = data[data.duplicated('URLs', keep=False)].sort_values('URLs')

In [24]:
df_sameURLs

Unnamed: 0,URLs,Headline,Body,Label
2491,http://beforeitsnews.com/entertainment/2017/09...,10 Shocking Facts About Porn You Probably Didn...,Pink Floyd comfortably numb! Not for long but ...,0
1015,http://beforeitsnews.com/entertainment/2017/09...,10 Shocking Facts About Porn You Probably Didn...,An Embattled Pharmaceutical Company That Sells...,0
1021,http://beforeitsnews.com/entertainment/2017/09...,10 Shocking Facts About Porn You Probably Didn...,"Vietnam Is in Great Danger, You Must Publish a...",0
3699,http://beforeitsnews.com/entertainment/2017/09...,10 Shocking Facts About Porn You Probably Didn...,Shooter's Hotel Room: Bad News After What Was ...,0
3408,http://beforeitsnews.com/entertainment/2017/10...,10 Awful Moments Caught on Video!,10 Awful Moments Caught on Video!\n% of reader...,0
...,...,...,...,...
2180,https://www.reuters.com/article/us-usa-ruralam...,Trump's popularity is slipping in rural Americ...,(Reuters) - Outside the Morgan County fair in ...,1
1766,https://www.reuters.com/article/us-usa-turkey-...,Turkey urges U.S. to review visa suspension as...,ANKARA (Reuters) - Turkey urged the United Sta...,1
1736,https://www.reuters.com/article/us-usa-turkey-...,Turkey urges U.S. to review visa suspension as...,ANKARA (Reuters) - Turkey urged the United Sta...,1
2795,https://www.reuters.com/article/us-usa-vietnam...,"Highlighting Vietnam War's relevance, exhibit ...",An exhibit about the Vietnam War is seen at th...,1


In [25]:
#compare two rows from the repeated-URL set
df_sameURLs.loc[2491]

URLs        http://beforeitsnews.com/entertainment/2017/09...
Headline    10 Shocking Facts About Porn You Probably Didn...
Body        Pink Floyd comfortably numb! Not for long but ...
Label                                                       0
Name: 2491, dtype: object

In [26]:
df_sameURLs.loc[1015]

URLs        http://beforeitsnews.com/entertainment/2017/09...
Headline    10 Shocking Facts About Porn You Probably Didn...
Body        An Embattled Pharmaceutical Company That Sells...
Label                                                       0
Name: 1015, dtype: object

In [27]:
#inspect rows whose Headline also appears in at least one other row
df_sameHeadline = data[data.duplicated('Headline', keep=False)].sort_values('Headline')
df_sameHeadline

Unnamed: 0,URLs,Headline,Body,Label
2175,http://beforeitsnews.com/u-s-politics/2017/10/...,"""Everything was burned"" Rohingya Remain Doubtf...",A Potato Battery Can Light up a Room for Over ...,0
745,http://beforeitsnews.com/u-s-politics/2017/10/...,"""Everything was burned"" Rohingya Remain Doubtf...","Vietnam Is in Great Danger, You Must Publish a...",0
1825,http://beforeitsnews.com/u-s-politics/2017/10/...,"""Everything was burned"" Rohingya Remain Doubtf...",Red Flag Warning: These California Wildfires A...,0
2229,http://beforeitsnews.com/u-s-politics/2017/10/...,"""Everything was burned"" Rohingya Remain Doubtf...",Warning Something Big Is About to Happen in Am...,0
1680,http://beforeitsnews.com/u-s-politics/2017/10/...,"""Everything was burned"" Rohingya Remain Doubtf...",An Embattled Pharmaceutical Company That Sells...,0
...,...,...,...,...
794,http://dailybuzzlive.com/police-wont-release-i...,‘These people need to be protected’: Police wo...,A group of white teens attacked an 8-year-old ...,0
3943,http://beforeitsnews.com/u-s-politics/2017/09/...,“Racist Propaganda”: Librarian Rejects Books D...,“Racist Propaganda”: Librarian Rejects Books D...,0
828,http://beforeitsnews.com/u-s-politics/2017/09/...,“Racist Propaganda”: Librarian Rejects Books D...,Warning Something Big Is About to Happen in Am...,0
2233,http://beforeitsnews.com/u-s-politics/2017/09/...,“Racist Propaganda”: Librarian Rejects Books D...,An Embattled Pharmaceutical Company That Sells...,0


In [30]:
df_sameHeadline.loc[2175]

URLs        http://beforeitsnews.com/u-s-politics/2017/10/...
Headline    "Everything was burned" Rohingya Remain Doubtf...
Body        A Potato Battery Can Light up a Room for Over ...
Label                                                       0
Name: 2175, dtype: object

In [31]:
df_sameHeadline.loc[745]

URLs        http://beforeitsnews.com/u-s-politics/2017/10/...
Headline    "Everything was burned" Rohingya Remain Doubtf...
Body        Vietnam Is in Great Danger, You Must Publish a...
Label                                                       0
Name: 745, dtype: object

On inspection, the same URL and headline appear with different bodies, which suggests that dummy data was created. Since the focus of this work is on combining the PageRank algorithm with a classification model, we move on from this non-unique data issue.
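One way to quantify this, should it become relevant later, is to count how many distinct bodies each URL carries. The sketch below uses hypothetical toy data; on the real frame the same two lines would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy data: u1 is paired with two different bodies
toy = pd.DataFrame({
    "URLs": ["u1", "u1", "u2"],
    "Body": ["b1", "b2", "b3"],
})

# Number of distinct bodies per URL; more than one signals conflicting rows
bodies_per_url = toy.groupby("URLs")["Body"].nunique()
conflicting = bodies_per_url[bodies_per_url > 1]

print(conflicting)
```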

In [32]:
#checking for missing data
data.isna().sum()

URLs         0
Headline     0
Body        21
Label        0
dtype: int64

The Body column has 21 missing values. For the time being, I am dropping these 21 rows. If time remains, I will look into imputation, for example by finding the nearest sentence using its word2vec vector representation. In the interest of time, I am also skipping missing-data pattern exploration.
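The two simplest options can be sketched side by side on hypothetical toy data: dropping rows with a missing Body (what is done here), or keeping them with an empty string so the headline text is still usable. The empty-string fill is an alternative shown for comparison, not the word2vec imputation mentioned above.

```python
import pandas as pd

# Hypothetical toy data with one missing Body
toy = pd.DataFrame({
    "Headline": ["h1", "h2", "h3"],
    "Body": ["b1", None, "b3"],
    "Label": [1, 0, 1],
})

# Option 1: drop rows where Body is missing
dropped = toy.dropna(subset=["Body"])

# Option 2 (alternative): keep all rows and treat a missing Body as empty text
filled = toy.assign(Body=toy["Body"].fillna(""))

print(len(dropped), len(filled))
```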

In [34]:
#dropping rows with null values
len(data.dropna(axis=0))

3988

In [36]:
data = data.dropna(axis=0)
# len(data)

Baseline model using only the features already in the dataset. The model chosen here is a linear SVC, because in the literature review SVMs performed better for fake news detection.

In [46]:
X = data.drop(columns = "Label")
y = data['Label']

In [55]:
#Simple text-based model using a bag-of-words approach and a linear model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


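# Note: the strings are joined without a separator, so the last word of the
# headline fuses with the first word of the body (visible in the outputs below)
X['headline+body'] = X['Headline'] + X['Body']
# drop() returns a new frame; the result is not assigned, so Headline and Body stay in X
X.drop(['Headline','Body'], axis = 1)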
X_train, X_test, y_train, y_test = train_test_split(X, y)

# vect = CountVectorizer()
# X_train = vect.fit_transform(X_train['headline+body'])

In [57]:
#model
import numpy as np
from sklearn.svm import LinearSVC
from category_encoders import TargetEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer, make_column_transformer

preprocess = make_column_transformer(
    (TargetEncoder(), 'URLs'),(CountVectorizer(),'headline+body'))
model = make_pipeline(preprocess, LinearSVC(max_iter=1000, tol=0.0001, dual = False))
scores = cross_val_score(model, X_train, y_train)
np.mean(scores)

0.964894110027303

The model with the given features reaches a mean cross-validation accuracy of about 0.965 (96.49%) with the default 5-fold split. Next we check whether the added feature from the PageRank algorithm changes this.
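Because neither train_test_split nor cross_val_score is seeded above, the score will vary slightly between runs. A minimal sketch of a reproducible evaluation, assuming a text-only bag-of-words pipeline and hypothetical toy sentences, could look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 10 "real" and 10 "fake" sentences
real = [f"reuters reports official statement number {i}" for i in range(10)]
fake = [f"shocking secret they hide from you number {i}" for i in range(10)]
texts = real + fake
labels = np.array([1] * 10 + [0] * 10)

model = make_pipeline(CountVectorizer(), LinearSVC(dual=False))

# Fixed seed and explicit stratified folds make the score repeatable
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print(scores.mean())
```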

In [75]:
#Creating input dataset to pagerank algorithm
import pandas as pd

def create_page_rank_dummy_relations():
    """Build a dummy link graph between neighbouring rows and save it to result.csv."""
    df = pd.read_csv('data.csv')
    urls = df['URLs']
    labels = df['Label']
    n = len(urls)
    res = []  # (From, To) pairs; repeated pairs are kept

    for i in range(1, n):
        # Every row gets one link from the previous row
        res.append((urls[i - 1], urls[i]))
        if labels[i] == 0:
            # Fake rows get one more link from the previous row
            res.append((urls[i - 1], urls[i]))
        else:
            # Create more relations for genuine links
            if i < n - 4:
                # Links from the next three rows
                res.append((urls[i + 1], urls[i]))
                res.append((urls[i + 2], urls[i]))
                res.append((urls[i + 3], urls[i]))
            if i > 5:
                # Links from the previous four rows
                res.append((urls[i - 1], urls[i]))
                res.append((urls[i - 2], urls[i]))
                res.append((urls[i - 3], urls[i]))
                res.append((urls[i - 4], urls[i]))

    edges = pd.DataFrame(res, columns=['From', 'To'])
    # save result to result.csv
    edges.to_csv('result.csv', index=False)

create_page_rank_dummy_relations()

In [77]:
input_to_pagerank =  pd.read_csv('result.csv')
input_to_pagerank.head()

Unnamed: 0,From,To
0,http://www.bbc.com/news/world-us-canada-414191...,https://www.reuters.com/article/us-filmfestiva...
1,https://www.nytimes.com/2017/10/09/us/politics...,https://www.reuters.com/article/us-filmfestiva...
2,https://www.reuters.com/article/us-mexico-oil-...,https://www.reuters.com/article/us-filmfestiva...
3,http://www.cnn.com/videos/cnnmoney/2017/10/08/...,https://www.reuters.com/article/us-filmfestiva...
4,https://www.reuters.com/article/us-filmfestiva...,https://www.nytimes.com/2017/10/09/us/politics...
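Note that the generator decides how many relations to create from Label: a fake row receives 2 inbound links, a genuine row up to 8. Any rank computed from this graph therefore carries label information by construction. A quick way to see the effect is to count inbound links per target URL; the edge list below is hypothetical toy data in the same From/To layout.

```python
import pandas as pd

# Hypothetical toy edge list in the same layout as result.csv
edges = pd.DataFrame({
    "From": ["u1", "u2", "u3", "u1"],
    "To":   ["u2", "u3", "u2", "u2"],
})

# Inbound link count per target URL
in_degree = edges.groupby("To").size().rename("in_degree")

print(in_degree)
```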


In [88]:
#output from pagerank algorithm
feature = pd.read_csv('newfeature.csv', header = None)

In [89]:
feature.head()

Unnamed: 0,0,1
0,http://www.bbc.com/sport/tennis/41586968,0.182078
1,https://www.reuters.com/article/us-mexico-slim...,0.211446
2,http://www.disclose.tv/news/the_ancient_tomb_o...,0.211171
3,http://www.cnn.com/2017/10/03/health/abortion-...,0.188691
4,https://www.reuters.com/article/us-science-hum...,0.190198


In [90]:
len(feature)

436

In [99]:
feature.columns= ['URLs','rank']

In [100]:
feature

Unnamed: 0,URLs,rank
0,http://www.bbc.com/sport/tennis/41586968,0.182078
1,https://www.reuters.com/article/us-mexico-slim...,0.211446
2,http://www.disclose.tv/news/the_ancient_tomb_o...,0.211171
3,http://www.cnn.com/2017/10/03/health/abortion-...,0.188691
4,https://www.reuters.com/article/us-science-hum...,0.190198
...,...,...
431,http://www.cnn.com/2017/10/07/us/iraq-gay-sold...,0.220944
432,http://www.cnn.com/2017/10/10/politics/hillary...,0.303364
433,http://money.cnn.com/2017/10/12/investing/bitc...,0.180330
434,https://www.reuters.com/article/us-deloitte-cy...,0.242419


In [94]:
X_train

Unnamed: 0,URLs,Headline,Body,headline+body
1393,http://beforeitsnews.com/sports/2017/10/anothe...,Another Team-Wide Poor Effort,Another Team-Wide Poor Effort\n% of readers th...,Another Team-Wide Poor EffortAnother Team-Wide...
563,http://abcnews.go.com/Entertainment/wireStory/...,British novelist Ishiguro wins Nobel Literatur...,"Kazuo Ishiguro, the Japanese-born British nove...",British novelist Ishiguro wins Nobel Literatur...
782,https://www.nytimes.com/2017/10/08/movies/blad...,‘Blade Runner 2049’ Sputters at the Domestic B...,Photo\nLOS ANGELES — The expensive science-fic...,‘Blade Runner 2049’ Sputters at the Domestic B...
2586,http://beforeitsnews.com/u-s-politics/2017/10/...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...,Red Flag Warning: These California Wildfires A...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...
3929,https://www.reuters.com/article/us-people-harv...,Weinstein on indefinite leave as company inves...,FILE PHOTO: Harvey Weinstein poses on the Red ...,Weinstein on indefinite leave as company inves...
...,...,...,...,...
768,https://www.activistpost.com/2017/10/smart-met...,Smart Meter Data: Privacy and Cybersecurity,By Catherine J. Frompovich\nIn February of 201...,Smart Meter Data: Privacy and CybersecurityBy ...
2679,https://www.activistpost.com/2017/09/new-study...,New Study Provides Further Evidence of Low IQ ...,By Derrick Broze\nA new study has found that p...,New Study Provides Further Evidence of Low IQ ...
2299,https://www.reuters.com/article/us-mideast-cri...,"Nusra Front, Islamic State clash in Syria's Ha...",BEIRUT (Reuters) - Islamic State took control ...,"Nusra Front, Islamic State clash in Syria's Ha..."
1094,http://beforeitsnews.com/sports/2017/10/ravens...,Ravens Again Abandon Running Game,Ravens Again Abandon Running Game\n(Before It'...,Ravens Again Abandon Running GameRavens Again ...


In [103]:
type(X_train['URLs'])

pandas.core.series.Series

In [104]:
type(feature['URLs'])

pandas.core.series.Series

In [127]:
final = pd.merge(X_train, feature, on='URLs', how='left')

In [106]:
final.shape

(2991, 5)
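A left merge keeps every training row, so the row count alone does not show how many URLs actually found a rank. A small sketch with hypothetical toy frames, using the `indicator` and `validate` options of pandas merge:

```python
import pandas as pd

# Hypothetical toy frames standing in for X_train and the PageRank output
X_train_toy = pd.DataFrame(
    {"URLs": ["u1", "u2", "u3"], "headline+body": ["t1", "t2", "t3"]},
    index=[10, 42, 7],
)
feature_toy = pd.DataFrame({"URLs": ["u2", "u9"], "rank": [0.182, 0.211]})

merged = X_train_toy.merge(
    feature_toy,
    on="URLs",
    how="left",
    validate="many_to_one",  # raises if a URL has more than one rank
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged), unmatched)
```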

URLs with NaN ranks were not linked in the PageRank input (it is a small dummy dataset), so we replace their ranks with 0.15.
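The PageRank implementation that produced newfeature.csv is not shown in this notebook. The sketch below assumes the unnormalized formulation PR(p) = (1 - d) + d * sum(PR(q) / L(q)) with damping d = 0.85, under which a page with no inbound links ends up at exactly 0.15; that would be consistent with the fill value chosen here. The function name `pagerank` and the toy edges are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Unnormalized PageRank by repeated updates (assumed variant, see text)."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    inbound = {node: [] for node in nodes}
    for src, dst in edges:
        out_degree[src] += 1
        inbound[dst].append(src)

    rank = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Each page keeps a base of (1 - d) and receives a share from every inbound page
        rank = {
            node: (1 - damping)
            + damping * sum(rank[src] / out_degree[src] for src in inbound[node])
            for node in nodes
        }
    return rank


# Hypothetical toy graph: "a" has no inbound links
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
ranks = pagerank(toy_edges)
print(ranks)
```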

In [107]:
final

Unnamed: 0,URLs,Headline,Body,headline+body,rank
0,http://beforeitsnews.com/sports/2017/10/anothe...,Another Team-Wide Poor Effort,Another Team-Wide Poor Effort\n% of readers th...,Another Team-Wide Poor EffortAnother Team-Wide...,
1,http://abcnews.go.com/Entertainment/wireStory/...,British novelist Ishiguro wins Nobel Literatur...,"Kazuo Ishiguro, the Japanese-born British nove...",British novelist Ishiguro wins Nobel Literatur...,0.182221
2,https://www.nytimes.com/2017/10/08/movies/blad...,‘Blade Runner 2049’ Sputters at the Domestic B...,Photo\nLOS ANGELES — The expensive science-fic...,‘Blade Runner 2049’ Sputters at the Domestic B...,
3,http://beforeitsnews.com/u-s-politics/2017/10/...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...,Red Flag Warning: These California Wildfires A...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...,
4,https://www.reuters.com/article/us-people-harv...,Weinstein on indefinite leave as company inves...,FILE PHOTO: Harvey Weinstein poses on the Red ...,Weinstein on indefinite leave as company inves...,
...,...,...,...,...,...
2986,https://www.activistpost.com/2017/10/smart-met...,Smart Meter Data: Privacy and Cybersecurity,By Catherine J. Frompovich\nIn February of 201...,Smart Meter Data: Privacy and CybersecurityBy ...,
2987,https://www.activistpost.com/2017/09/new-study...,New Study Provides Further Evidence of Low IQ ...,By Derrick Broze\nA new study has found that p...,New Study Provides Further Evidence of Low IQ ...,
2988,https://www.reuters.com/article/us-mideast-cri...,"Nusra Front, Islamic State clash in Syria's Ha...",BEIRUT (Reuters) - Islamic State took control ...,"Nusra Front, Islamic State clash in Syria's Ha...",
2989,http://beforeitsnews.com/sports/2017/10/ravens...,Ravens Again Abandon Running Game,Ravens Again Abandon Running Game\n(Before It'...,Ravens Again Abandon Running GameRavens Again ...,


In [131]:
final["rank"] = final["rank"].fillna(0.15)

In [132]:
final

Unnamed: 0,URLs,headline+body,rank
0,http://beforeitsnews.com/sports/2017/10/anothe...,Another Team-Wide Poor EffortAnother Team-Wide...,0.150000
1,http://abcnews.go.com/Entertainment/wireStory/...,British novelist Ishiguro wins Nobel Literatur...,0.182221
2,https://www.nytimes.com/2017/10/08/movies/blad...,‘Blade Runner 2049’ Sputters at the Domestic B...,0.150000
3,http://beforeitsnews.com/u-s-politics/2017/10/...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...,0.150000
4,https://www.reuters.com/article/us-people-harv...,Weinstein on indefinite leave as company inves...,0.150000
...,...,...,...
2986,https://www.activistpost.com/2017/10/smart-met...,Smart Meter Data: Privacy and CybersecurityBy ...,0.150000
2987,https://www.activistpost.com/2017/09/new-study...,New Study Provides Further Evidence of Low IQ ...,0.150000
2988,https://www.reuters.com/article/us-mideast-cri...,"Nusra Front, Islamic State clash in Syria's Ha...",0.150000
2989,http://beforeitsnews.com/sports/2017/10/ravens...,Ravens Again Abandon Running GameRavens Again ...,0.150000


In [92]:
y_train.shape

(2991,)

In [129]:
final = final.drop(['Headline','Body'], axis = 1)

In [140]:
final.round(3)

Unnamed: 0,URLs,headline+body,rank
0,http://beforeitsnews.com/sports/2017/10/anothe...,Another Team-Wide Poor EffortAnother Team-Wide...,0.150
1,http://abcnews.go.com/Entertainment/wireStory/...,British novelist Ishiguro wins Nobel Literatur...,0.182
2,https://www.nytimes.com/2017/10/08/movies/blad...,‘Blade Runner 2049’ Sputters at the Domestic B...,0.150
3,http://beforeitsnews.com/u-s-politics/2017/10/...,Aung San Suu Kyi Stripped Of Oxford Honor Amid...,0.150
4,https://www.reuters.com/article/us-people-harv...,Weinstein on indefinite leave as company inves...,0.150
...,...,...,...
2986,https://www.activistpost.com/2017/10/smart-met...,Smart Meter Data: Privacy and CybersecurityBy ...,0.150
2987,https://www.activistpost.com/2017/09/new-study...,New Study Provides Further Evidence of Low IQ ...,0.150
2988,https://www.reuters.com/article/us-mideast-cri...,"Nusra Front, Islamic State clash in Syria's Ha...",0.150
2989,http://beforeitsnews.com/sports/2017/10/ravens...,Ravens Again Abandon Running GameRavens Again ...,0.150


In [144]:
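# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise on the text columns otherwise
final = final.fillna(final.mean(numeric_only=True))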

In [145]:
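#model
# Note: 'rank' is not listed in the transformer below, so it is dropped before the classifier (remainder='drop' is the default)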

preprocess = make_column_transformer(
    (TargetEncoder(), 'URLs'),(CountVectorizer(),'headline+body'))
model = make_pipeline(preprocess, LinearSVC(max_iter=1000, tol=0.0001, dual = False))
scores = cross_val_score(model, final, y_train)
np.mean(scores)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').



nan
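Independently of the error, a column has to be named in the column transformer to reach the classifier. A minimal sketch of a transformer that forwards rank unchanged is given below; it uses scikit-learn's OneHotEncoder as a stand-in for the TargetEncoder used in this notebook, and the toy frame is hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with the same three columns as `final`
toy = pd.DataFrame({
    "URLs": ["http://a.example/1", "http://b.example/2", "http://a.example/1"],
    "headline+body": [
        "markets rise today",
        "aliens built the pyramids",
        "markets fall today",
    ],
    "rank": [0.30, 0.15, 0.30],
})

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["URLs"]),  # stand-in for TargetEncoder
    (CountVectorizer(), "headline+body"),
    ("passthrough", ["rank"]),  # forward the PageRank feature as-is
)

features = preprocess.fit_transform(toy)
print(features.shape)
```

Here the output has 2 URL columns, 8 word-count columns and 1 rank column.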

In [137]:
final = final.replace([np.inf, -np.inf], np.nan)

In [138]:
final.isna().sum()

URLs             0
headline+body    0
rank             0
dtype: int64

In [139]:
final.notnull().values.all()

True

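There are no infinite or null values in final, so the NaN reported during cross-validation does not come from missing values in the input frame itself.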
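One plausible cause, which I have not verified against the installed category_encoders version, is index misalignment: pd.merge returns a frame with a fresh 0..n-1 index, while y_train keeps the original row labels from the split. Any step that aligns features and target by index would then find no matching labels. The sketch below reproduces that effect with hypothetical toy data and shows one way to restore the index.

```python
import pandas as pd

# Hypothetical toy data with a shuffled index, as after train_test_split
X_train_toy = pd.DataFrame({"URLs": ["u1", "u2", "u3"]}, index=[10, 42, 7])
y_train_toy = pd.Series([0, 1, 0], index=[10, 42, 7], name="Label")
feature_toy = pd.DataFrame({"URLs": ["u2"], "rank": [0.182]})

final_toy = X_train_toy.merge(feature_toy, on="URLs", how="left")
final_toy["rank"] = final_toy["rank"].fillna(0.15)

# The merged frame is indexed 0, 1, 2, so index alignment with y_train_toy yields only NaN
misaligned = final_toy.assign(Label=y_train_toy)

# A left merge on unique right-hand keys keeps the left row order,
# so the original index can be restored positionally
final_toy.index = X_train_toy.index
aligned = final_toy.assign(Label=y_train_toy)

print(misaligned["Label"].isna().all(), aligned["Label"].tolist())
```

Passing `y_train.to_numpy()` to cross_val_score would be another way to avoid index-based alignment altogether.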