# Data processing

## Background

The fake vs real news dataset was downloaded from [Kaggle](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset). It contains 21,417 articles labelled as 'real' news and 23,481 articles labelled as 'fake' news.

The stated task is: can you use this dataset to build an algorithm that determines whether an article is fake news?

## Import packages

In [1]:
import pandas as pd

## Read in datasets

In [2]:
real_df = pd.read_csv('~/documents/Data/Fake vs Real News/True.csv')
fake_df = pd.read_csv('~/documents/Data/Fake vs Real News/Fake.csv')

## Exploratory data analysis

#### Shape

In [12]:
print(real_df.shape)
print(fake_df.shape)

(21417, 3)
(23481, 3)

Both shapes show three columns because this cell was run (execution count 12) after the 'subject' column had been dropped. The raw files have four columns, as the column listing in the next section shows.


#### Column names

In [3]:
print(real_df.columns, '\n')
print(fake_df.columns)

Index(['title', 'text', 'subject', 'date'], dtype='object') 

Index(['title', 'text', 'subject', 'date'], dtype='object')


#### Data types

In [4]:
print(real_df.dtypes, '\n')
print(fake_df.dtypes)

title      object
text       object
subject    object
date       object
dtype: object 

title      object
text       object
subject    object
date       object
dtype: object


#### Drop irrelevant columns

In [5]:
print(pd.unique(real_df['subject']))
print(pd.unique(fake_df['subject']))

['politicsNews' 'worldnews']
['News' 'politics' 'Government News' 'left-news' 'US_News' 'Middle-east']


The 'subject' values in the real dataset do not overlap with those in the fake dataset. A model could therefore separate real from fake news in this dataset with 100% accuracy using 'subject' alone. That separation is not a meaningful distinction, and such a model would not generalise to other datasets. I therefore discard this column.
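
A minimal sketch of why the 'subject' column leaks the label. The rows in `real_toy` and `fake_toy` are hypothetical stand-ins, not the Kaggle data, and the `is_fake` column name is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-ins for real_df and fake_df (hypothetical rows, not the Kaggle data).
real_toy = pd.DataFrame({"subject": ["politicsNews", "worldnews", "politicsNews"]})
fake_toy = pd.DataFrame({"subject": ["News", "politics", "left-news"]})

# Subjects that appear in both sets; an empty set means the column gives the label away.
overlap = set(real_toy["subject"]) & set(fake_toy["subject"])

# A "classifier" that does nothing except look up the subject.
real_subjects = set(real_toy["subject"])
combined = pd.concat(
    [real_toy.assign(is_fake=0), fake_toy.assign(is_fake=1)],
    ignore_index=True,
)
predicted = (~combined["subject"].isin(real_subjects)).astype(int)
accuracy = (predicted == combined["is_fake"]).mean()

print("overlap:", overlap)
print("accuracy from subject alone:", accuracy)
```

When the overlap is empty, the lookup rule is always right, which is why a perfect score here would say nothing about the article text.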

In [7]:
# errors='ignore' lets this cell be re-run after the column has already been dropped
real_df = real_df.drop(columns='subject', errors='ignore')
fake_df = fake_df.drop(columns='subject', errors='ignore')

In [8]:
print(real_df.columns, '\n')
print(fake_df.columns)

Index(['title', 'text', 'date'], dtype='object') 

Index(['title', 'text', 'date'], dtype='object')


#### Rename columns

No need to rename columns in this case. Their meaning is clear.

#### Check for duplicate rows

In [28]:
print(real_df.duplicated().sum())
print(fake_df.duplicated().sum())

0
0

The zeros above come from re-running this check after the duplicates had already been dropped (note the execution count). Comparing the shapes before and after the drop shows 217 duplicate rows in the real dataset and 5,571 in the fake dataset.


#### Drop duplicate rows

In [None]:
real_df = real_df.drop_duplicates()
fake_df = fake_df.drop_duplicates()

In [24]:
print(real_df.shape)
print(fake_df.shape)

(21200, 3)
(17910, 3)
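
To see how `duplicated()` and `drop_duplicates()` relate, here is a small sketch on a hypothetical toy frame (the rows are invented for illustration). The `subset=["text"]` variant at the end is an assumption about what one might try next, not something done in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with the same three columns as real_df / fake_df.
toy = pd.DataFrame({
    "title": ["a", "a", "b", "c"],
    "text": ["x", "x", "y", "y"],
    "date": ["d1", "d1", "d2", "d3"],
})

n_before = len(toy)

# duplicated() marks every row that is identical to an earlier row.
n_flagged = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and removes the rest.
deduped = toy.drop_duplicates().reset_index(drop=True)
n_removed = n_before - len(deduped)

# Optional variant: treat rows as duplicates when only the text matches.
by_text = toy.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(n_flagged, n_removed, len(by_text))
```

Counting duplicates before dropping them keeps a record of how many rows were removed, which is lost if the check is only run afterwards.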


#### Check for null or NA values

In [30]:
print(real_df.isna().sum(), '\n')
print(fake_df.isna().sum(), '\n')

title    0
text     0
date     0
dtype: int64 

title    0
text     0
date     0
dtype: int64 
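
`isna()` only flags missing values, so empty or whitespace-only strings pass the check above. The sketch below shows one way to look for them, using a hypothetical toy frame; whether such rows exist in this dataset is not established here.

```python
import pandas as pd

# Hypothetical toy frame: no missing values, but two rows have no usable text.
toy = pd.DataFrame({
    "title": ["headline one", "headline two", "headline three"],
    "text": ["some body text", "   ", ""],
})

# isna() reports zero here, because blank strings are not NA.
na_counts = toy.isna().sum()

# Strip whitespace and compare against the empty string to find blank text.
blank_mask = toy["text"].str.strip().eq("")
n_blank = int(blank_mask.sum())

# Keep only rows that still have some text.
cleaned = toy[~blank_mask].reset_index(drop=True)

print(na_counts["text"], n_blank, len(cleaned))
```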

