# Fake news detection

**Authors:** Peter Mačinec, Simona Miková

## Data preprocessing

In [1]:
import pandas as pd
import sys
sys.path.append('..')

from src.data.preprocessing import preprocess_data

In [2]:
df = pd.read_json('../data/raw/dataset.json', orient='index')

In [3]:
len(df)

195764

In [4]:
df.head()

| | author | body | id | image | label | perex | source | title |
|---|---|---|---|---|---|---|---|---|
| 235036 | Mike Adams | (NaturalNews) The United States government cla... | 235036 | https://www.naturalnewsblogs.com/wp-content/up... | unreliable | `<p>(NaturalNews) The United States government ...` | naturalnewsblogs.com | US government claims 100% ownership over all y... |
| 235037 | by Ronica O&rsquo;Hara (info@www.naturalawaken... | DIGITAL KIDS: How to Click With Young Techies\... | 235037 | http://www.naturalawakeningsmag.com/Healthy-Ki... | unreliable | Many Silicon Valley executives that design dev... | naturalawakeningsmag.com | DIGITAL KIDS: How to Click With Young Techies |
| 235038 | by Kathleen Gould and Madalyn Johnson (info@ww... | Herbs: Nature’s Fountain of Youth\n\nby Kathle... | 235038 | http://www.naturalaz.com/ARIZ/September-2019/H... | unreliable | It seems aging is a two-edge sword. At the sam... | naturalawakeningsmag.com | Herbs: Nature’s Fountain of Youth |
| 235039 | Mike Adams | (NaturalNews) Beyond merely inspiring women to... | 235039 | https://www.naturalnewsblogs.com/wp-content/up... | unreliable | `<p>(NaturalNews) Beyond merely inspiring women...` | naturalnewsblogs.com | Angelina Jolie copied by men! Surgeons now cut... |
| 235040 | by Andrea Purcell (info@www.naturalawakeningsm... | Give Your Brain a Boost\n\nby Andrea Purcell\n... | 235040 | http://www.naturalaz.com/ARIZ/September-2019/G... | unreliable | In the United States, there are currently 9.4 ... | naturalawakeningsmag.com | Give Your Brain a Boost |


In [5]:
%%time
df_preprocessed = preprocess_data([df])[0]

ColumnsFilter transformation started.
ColumnsFilter transformation ended, took 0.0650336742401123 seconds.
EmptyValuesFilter transformation started.
EmptyValuesFilter transformation ended, took 0.20400023460388184 seconds.
DuplicatesFilter transformation started.
DuplicatesFilter transformation ended, took 6.681593894958496 seconds.
TextPreprocessor transformation started.
TextPreprocessor transformation ended, took 138.9123866558075 seconds.
ArticlesSizeFilter transformation started.
ArticlesSizeFilter transformation ended, took 9.332477569580078 seconds.
ArticlesSentenceLengthFilter transformation started.
ArticlesSentenceLengthFilter transformation ended, took 297.2376461029053 seconds.
ArticlesLanguageFilter transformation started.
ArticlesLanguageFilter transformation ended, took 3513.211053609848 seconds.
Wall time: 1h 6min 9s
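
The log above suggests that `preprocess_data` runs a sequence of named transformations and times each one, with `ArticlesLanguageFilter` accounting for most of the wall time. The implementation in `src.data.preprocessing` is not shown in this notebook, so the following is only a guess at the pattern. `run_timed`, `drop_empty_bodies` and `lowercase_bodies` are hypothetical names, and the two toy steps are far simpler than the real filters.

```python
import time

import pandas as pd


def drop_empty_bodies(df):
    # Hypothetical step: keep rows whose body is present and not blank.
    return df[df['body'].notna() & (df['body'].str.strip() != '')]


def lowercase_bodies(df):
    # Hypothetical step: lowercase the article text.
    return df.assign(body=df['body'].str.lower())


def run_timed(df, steps):
    """Apply (name, function) steps in order and record seconds per step."""
    timings = {}
    for name, step in steps:
        print(f'{name} transformation started.')
        start = time.time()
        df = step(df)
        timings[name] = time.time() - start
        print(f'{name} transformation ended, took {timings[name]} seconds.')
    return df, timings


# Made-up rows, not taken from the dataset.
toy = pd.DataFrame({'body': ['Hello', None, '   ', 'WORLD']})
result, timings = run_timed(toy, [
    ('DropEmptyBodies', drop_empty_bodies),
    ('LowercaseBodies', lowercase_bodies),
])
```

Keeping the per-step timings in a dictionary makes it easy to spot the slowest step without reading the log by eye.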


In [6]:
len(df_preprocessed)

140647

In [7]:
df_preprocessed.head()

| | body | label |
|---|---|---|
| 235036 | naturalnews the united states government clai... | unreliable |
| 235037 | digital kids how to click with young techies b... | unreliable |
| 235038 | herbs nature s fountain of youth by kathleen g... | unreliable |
| 235039 | naturalnews beyond merely inspiring women to ... | unreliable |
| 235040 | give your brain a boost by andrea purcell 123r... | unreliable |


## Data balancing

In [8]:
df_preprocessed.label.value_counts()

unreliable    106003
reliable       32856
Name: label, dtype: int64
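
The two counts above sum to 138,859, while `len(df_preprocessed)` returned 140,647. `value_counts()` skips missing values by default, so the remaining 1,788 rows presumably have no label. That is an inference from the numbers, not something the notebook confirms. A minimal sketch of how such rows could be surfaced, using a small made-up frame (`toy` is hypothetical) rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_preprocessed; the rows are made up.
toy = pd.DataFrame({
    'body': ['a', 'b', 'c', 'd'],
    'label': ['unreliable', 'reliable', np.nan, 'unreliable'],
})

# dropna=False adds a separate bucket for missing labels.
counts_with_missing = toy['label'].value_counts(dropna=False)

# Number of rows that the default value_counts() leaves out.
n_missing = int(toy['label'].isna().sum())

print(counts_with_missing)
print('rows without a label:', n_missing)
```

If rows without a label do exist, they would also end up in `dataset_unbalanced.csv`, since that file is written before any filtering by label.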

In [9]:
df_preprocessed.to_csv('../data/preprocessed/dataset_unbalanced.csv')

In [10]:
# Undersample the majority class to the minority-class size (32856 reliable articles).
fakes_sample = df_preprocessed[df_preprocessed.label == 'unreliable'].sample(32856)

In [11]:
df_preprocessed = df_preprocessed[df_preprocessed.label == 'reliable']

In [12]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
df_preprocessed = pd.concat([df_preprocessed, fakes_sample])

In [13]:
df_preprocessed.label.value_counts()

unreliable    32856
reliable      32856
Name: label, dtype: int64
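
The balancing steps above hard-code the minority-class size and draw a different random sample on every run. A minimal sketch of the same random undersampling with the size computed from the data and a fixed seed follows. `undersample_to_minority` and the seed value are assumptions for illustration, not part of the project code, and the toy frame is made up.

```python
import pandas as pd


def undersample_to_minority(df, label_col='label', seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        # random_state makes the draw repeatable across runs.
        group.sample(n=n_min, random_state=seed)
        # groupby skips rows with a missing label by default.
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts)


# Made-up rows: 7 unreliable and 3 reliable articles.
toy = pd.DataFrame({
    'body': [f'text {i}' for i in range(10)],
    'label': ['unreliable'] * 7 + ['reliable'] * 3,
})

balanced = undersample_to_minority(toy)
print(balanced['label'].value_counts())
```

As in the cells above, the result is grouped by class rather than shuffled, so a later train/test split would need to shuffle or stratify.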

In [14]:
df_preprocessed.head()

| | body | label |
|---|---|---|
| 259882 | healthline media partners with wellness advoca... | reliable |
| 259883 | online patient communities are thriving and ha... | reliable |
| 259884 | email marketing is more critical than ever. in... | reliable |
| 259885 | industry veterans from netflix and ea join fas... | reliable |
| 259886 | the healthline property has risen to 1 in the ... | reliable |


In [15]:
df_preprocessed.tail()

| | body | label |
|---|---|---|
| 325321 | naturalhealth365 evidence of how harmful cert... | unreliable |
| 286399 | have you ever heard of human challenge studie... | unreliable |
| 424662 | if you are tired of all of retrogrades, eclips... | unreliable |
| 424204 | because it takes 17 years for basic science re... | unreliable |
| 282721 | ! cdata adsbygoogle window.adsbygoogle .push ... | unreliable |


In [16]:
df_preprocessed.to_csv('../data/preprocessed/dataset.csv')