# Data Cleaning and Data Splitting for Fake News Dataset

Instructions:
1. Upload `news_datasets.csv` to `/content/` in the Colab notebook using the 'Files' tab in the left panel. This step can be skipped if the `gdown` command in the extraction cell downloads the file for you.
2. Set the script parameters in the next cell.
3. Run all cells. The notebook zips the output .csv files and downloads the archive automatically.

## Global Parameters

In [90]:
# Fraction of each class assigned to the training set; the remainder forms the test set
train_test_split = 0.7 #@param {type:"number"}
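
The split fraction only makes sense strictly between 0 and 1. A minimal sanity check, assuming a hypothetical helper named `validate_split_fraction` (not part of the original notebook), might look like this:

```python
def validate_split_fraction(fraction):
    """Return the fraction unchanged if it lies strictly between 0 and 1."""
    if not 0.0 < fraction < 1.0:
        raise ValueError(f"train fraction must be in (0, 1), got {fraction}")
    return fraction


# Hypothetical usage with the notebook's default value
train_fraction = validate_split_fraction(0.7)
```

Calling the helper right after the parameter cell would surface a mistyped value before any data is loaded.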

## Libraries Import

In [91]:
# Import libraries (standard library first, then third-party)
import shutil
from datetime import datetime

import pandas as pd
from google.colab import files

## Extract Data from .csv File

In [92]:
output_train_csv_file = 'train_clean.csv'
output_test_csv_file = 'test_clean.csv'
input_file_path = '/content/news_datasets.csv'

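# Download the dataset from Google Drive into the working directory (/content/)
!gdown "19EuhnKnfrJHW2zRlynN_kmcaxdW_Vfga"

# Read `news_datasets.csv` into a pandas DataFrame
df = pd.read_csv(input_file_path)
# Drop the leftover index column and the article body; only 'title' and 'label' are used
df = df.drop(['Unnamed: 0', 'text'], axis=1)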
df.head(10)

Downloading...
From: https://drive.google.com/uc?id=19EuhnKnfrJHW2zRlynN_kmcaxdW_Vfga
To: /content/news_datasets.csv
100% 30.7M/30.7M [00:00<00:00, 142MB/s] 


index,title,label
0,You Can Smell Hillary’s Fear,FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,FAKE
2,Kerry to go to Paris in gesture of sympathy,REAL
3,Bernie supporters on Twitter erupt in anger ag...,FAKE
4,The Battle of New York: Why This Primary Matters,REAL
5,"Tehran, USA",FAKE
6,Girl Horrified At What She Watches Boyfriend D...,FAKE
7,‘Britain’s Schindler’ Dies at 106,REAL
8,Fact check: Trump and Clinton at the 'commande...,REAL
9,Iran reportedly makes new push for uranium con...,REAL


## Check for Data Imbalance

In [93]:
# Check for data imbalance by counting the labels
df['label'].value_counts(normalize=True)

label
REAL    0.500552
FAKE    0.499448
Name: proportion, dtype: float64
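
The proportions above are close to 50/50, so no rebalancing is applied. As a sketch of how such a check could be automated, the hypothetical helper `is_balanced` and its `min_share` threshold of 0.4 below are assumptions for illustration, not values taken from the notebook:

```python
import pandas as pd


def is_balanced(labels, min_share=0.4):
    """Return True if the rarest label makes up at least `min_share` of the rows."""
    shares = labels.value_counts(normalize=True)
    return bool(shares.min() >= min_share)


# Toy label columns (not the real dataset)
even_labels = pd.Series(["REAL"] * 3 + ["FAKE"] * 3)
skewed_labels = pd.Series(["REAL"] * 9 + ["FAKE"] * 1)

print(is_balanced(even_labels))    # rarest share is 0.5
print(is_balanced(skewed_labels))  # rarest share is 0.1
```

With the toy data, the first call reports a balanced set and the second does not.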

## Clean Dataset

In [94]:
# Replace hyphens and en dashes with spaces, strip all remaining punctuation,
# then convert 'title' to lowercase
df['title'] = df['title'].str.replace(r'[-–]', ' ', regex=True)
df['title'] = df['title'].str.replace(r'[^\w\s]+', '', regex=True)
df['title'] = df['title'].str.lower()
df

index,title,label
0,you can smell hillarys fear,FAKE
1,watch the exact moment paul ryan committed pol...,FAKE
2,kerry to go to paris in gesture of sympathy,REAL
3,bernie supporters on twitter erupt in anger ag...,FAKE
4,the battle of new york why this primary matters,REAL
...,...,...
6330,state department says it cant find emails from...,REAL
6331,the p in pbs should stand for plutocratic or p...,FAKE
6332,anti trump protesters are tools of the oligarc...,FAKE
6333,in ethiopia obama seeks progress on peace secu...,REAL
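
To see what the three cleaning steps do in isolation, here is a small sketch that applies the same regular expressions to a toy Series. The wrapper function `clean_titles` is a hypothetical name introduced only for this illustration:

```python
import pandas as pd


def clean_titles(titles):
    """Apply the notebook's three cleaning steps to a Series of strings."""
    titles = titles.str.replace(r'[-–]', ' ', regex=True)     # hyphen / en dash -> space
    titles = titles.str.replace(r'[^\w\s]+', '', regex=True)  # drop remaining punctuation
    return titles.str.lower()                                 # lowercase everything


sample = pd.Series([
    "You Can Smell Hillary’s Fear",
    "Anti-Trump Protesters",
    "Tehran, USA",
])
print(clean_titles(sample).tolist())
# ['you can smell hillarys fear', 'anti trump protesters', 'tehran usa']
```

Replacing hyphens with spaces first keeps compound words apart instead of fusing them. A hyphen that is already surrounded by spaces leaves a run of spaces behind; if that matters downstream, an extra `.str.replace(r'\s+', ' ', regex=True).str.strip()` would collapse it, though the original notebook does not do this.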


## Split Dataset into Train/Test Sets

In [95]:
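# Stratified split: sample the training fraction from each class separately
# so that both sets keep the original REAL/FAKE ratio
train_real_df = df[df['label'] == 'REAL'].sample(frac=train_test_split, random_state=42)
train_fake_df = df[df['label'] == 'FAKE'].sample(frac=train_test_split, random_state=42)

# The test set is whatever remains in each class after removing the training rows
test_real_df = df[df['label'] == 'REAL'].drop(train_real_df.index)
test_fake_df = df[df['label'] == 'FAKE'].drop(train_fake_df.index)

# Combine the per-class pieces; rows stay grouped by label (REAL first, then FAKE)
train_df = pd.concat([train_real_df, train_fake_df], ignore_index=True)
test_df = pd.concat([test_real_df, test_fake_df], ignore_index=True)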
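
The per-class sampling above can be checked on a toy DataFrame. The sketch below repeats the same sample-then-drop pattern and asserts that the two sets are disjoint and together cover every row. All names (`toy_df`, `toy_train`, `toy_test`) are made up for this illustration:

```python
import pandas as pd

# Toy data: 10 REAL and 10 FAKE rows with unique titles
toy_df = pd.DataFrame({
    "title": [f"title {i}" for i in range(20)],
    "label": ["REAL"] * 10 + ["FAKE"] * 10,
})
train_fraction = 0.7

train_parts, test_parts = [], []
for label in ["REAL", "FAKE"]:
    subset = toy_df[toy_df["label"] == label]
    train_part = subset.sample(frac=train_fraction, random_state=42)
    train_parts.append(train_part)
    # Rows not sampled for training form the test portion of this class
    test_parts.append(subset.drop(train_part.index))

toy_train = pd.concat(train_parts, ignore_index=True)
toy_test = pd.concat(test_parts, ignore_index=True)

# Every row lands in exactly one of the two sets
assert set(toy_train["title"]).isdisjoint(toy_test["title"])
assert len(toy_train) + len(toy_test) == len(toy_df)
print(len(toy_train), len(toy_test))
```

Because the per-class pieces are concatenated in order, each resulting set lists all REAL rows before all FAKE rows. If a downstream consumer reads the rows sequentially, shuffling with something like `train_df.sample(frac=1, random_state=42).reset_index(drop=True)` may be worth considering; the original notebook leaves the order as is.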

In [96]:
train_df

index,title,label
0,going back to the future in 2016,REAL
1,dem insiders sanders failed to dent clinton,REAL
2,fuming over ryan some conservative voices turn...,REAL
3,cruz trump and rubio win in iowa and now we kn...,REAL
4,trump will skip gop debate as feud with fox ne...,REAL
...,...,...
4430,valentin katasonov america is in agony and tru...,FAKE
4431,progressives find white trash more threatening...,FAKE
4432,they said what find out what paul krugman aret...,FAKE
4433,the loosening grip of dc,FAKE


In [97]:
test_df

index,title,label
0,the battle of new york why this primary matters,REAL
1,fact check trump and clinton at the commander ...,REAL
2,iran reportedly makes new push for uranium con...,REAL
3,with all three clintons in iowa a glimpse at t...,REAL
4,whats in that iran bill that obama doesnt like,REAL
...,...,...
1895,top aide to hillary clinton urges the fbi to d...,FAKE
1896,government confirm compensation for survivors ...,FAKE
1897,thousands of wild american bison appear from n...,FAKE
1898,hillary clinton fbi and the real november surp...,FAKE


## Generate Output .csv Files

In [98]:
# Convert dataframes to .csv files
!mkdir -p output
train_df.to_csv('output/' + output_train_csv_file, index=False)
test_df.to_csv('output/' + output_test_csv_file, index=False)
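
A quick round trip can confirm that the written files load back with the expected columns. This is a sketch on a one-row toy DataFrame written to a temporary directory, not the notebook's actual output folder:

```python
import os
import tempfile

import pandas as pd

toy_train = pd.DataFrame({"title": ["the loosening grip of dc"], "label": ["FAKE"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "train_clean.csv")
    # index=False keeps the row index out of the file, as in the cell above
    toy_train.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

print(list(reloaded.columns))
print(len(reloaded))
```

One thing to watch for when reloading: a title that became an empty string during cleaning is written as an empty field and comes back from `pd.read_csv` as `NaN` by default.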

In [99]:
# Zip the output folder under a timestamped name and download the archive
zip_file_name = 'output' + '_' + datetime.now().strftime('%Y%m%d%H%M%S')
shutil.make_archive(zip_file_name, 'zip', 'output/')
files.download(zip_file_name + '.zip')
