# Data Cup Fake News Data Profiling and Parsing

## 1. Import packages

In [1]:
import os
import pandas as pd
import numpy as np
import json
import pandas_profiling

## 2. Prepare training data

### 2.1 Load training data

Here, we load the train.json file and convert the data from JSON to a pandas DataFrame. A DataFrame makes the data easier to wrangle: converting data formats, creating new dimensions, and standardizing values.

In [3]:
with open("data/fakenews_datacup/train.json") as f:
    train_data = json.load(f)

df = pd.DataFrame.from_records(train_data)

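FileNotFoundError: [Errno 2] No such file or directory: 'data/fakenews_datacup/train.json'

Note: this error comes from a run in which the relative path did not resolve against the notebook's working directory. Run the notebook from a directory that contains data/fakenews_datacup/train.json, or adjust the path, before continuing.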
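The loading step can be tried without the real file. The following is a minimal sketch, assuming records shaped like the columns shown in this notebook; the two records and the name `sample_df` are made up for illustration.

```python
import json

import pandas as pd

# Hypothetical records that mimic the fields used in this notebook
raw = """[
  {"id": 0, "claim": "Example claim A", "claimant": "", "date": "2017-07-17",
   "label": 0, "related_articles": [1, 2]},
  {"id": 1, "claim": "Example claim B", "claimant": "Someone", "date": "2018-03-17",
   "label": 2, "related_articles": [3]}
]"""

records = json.loads(raw)  # a list of dicts, one per claim
sample_df = pd.DataFrame.from_records(records)  # one row per dict, one column per key

print(sample_df.shape)
print(sample_df.dtypes)
```

Each dictionary key becomes a column, and list-valued fields such as related_articles are kept as Python lists inside an object column.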

### 2.2 Convert 'date' column to date data type

Convert the 'date' column from strings to the pandas datetime type, which displays as YYYY-MM-DD. For more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html

In [13]:
df['date'] = pd.to_datetime(df['date'])

In [14]:
df.head()

| | claim | claimant | date | id | label | related_articles |
|---|---|---|---|---|---|---|
| 0 | A line from George Orwell's novel 1984 predict... | | 2017-07-17 | 0 | 0 | [122094, 122580, 130685, 134765] |
| 1 | Maine legislature candidate Leslie Gibson insu... | | 2018-03-17 | 1 | 2 | [106868, 127320, 128060] |
| 2 | A 17-year-old girl named Alyssa Carson is bein... | | 2018-07-18 | 4 | 1 | [132130, 132132, 149722] |
| 3 | In 1988 author Roald Dahl penned an open lette... | | 2019-02-04 | 5 | 2 | [123254, 123418, 127464] |
| 4 | When it comes to fighting terrorism, "Another ... | Hillary Clinton | 2016-03-22 | 6 | 2 | [41099, 89899, 72543, 82644, 95344, 88361] |
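The date conversion in this section can be checked on a small example. This is a sketch under the assumption that some date strings might be malformed; the series `dates` is invented, and `errors="coerce"` is an optional argument that the cell above does not use.

```python
import pandas as pd

# Hypothetical date strings; the last one is deliberately invalid
dates = pd.Series(["2017-07-17", "2018-03-17", "not a date"])

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed)
print(parsed.dt.year)  # datetime columns expose components through .dt
```

Once the column has a datetime type, components such as year or month are available through the `.dt` accessor, which is convenient for deriving new dimensions.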


### 2.3 Profile the training data

Use pandas_profiling to generate a profile report of the dataframe: https://github.com/pandas-profiling/pandas-profiling

In [16]:
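# Build the report once and keep a reference to it; a report that is not
# assigned and is not the last expression of the cell is simply discarded
profile = df.profile_report(title='Fake News Data Profile', style={'full_width': True})
# Uncomment to save the report as a standalone HTML file
# profile.to_file(output_file="data_profile.html")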

In [18]:
profile
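For a quick check that does not depend on pandas_profiling, a few per-column statistics can be computed with pandas alone. This is a minimal sketch, not a replacement for the report; the helper `quick_profile` and the frame `toy` are hypothetical names introduced here.

```python
import pandas as pd


def quick_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing count, and distinct count for each column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        # Lists are unhashable, so compare their string form;
        # missing values are dropped before counting
        "n_unique": frame.apply(lambda s: s.dropna().astype(str).nunique()),
    })


# Hypothetical frame with a missing claimant and a list-valued column
toy = pd.DataFrame({
    "claimant": ["A", None, "A"],
    "label": [0, 2, 2],
    "related_articles": [[1], [2, 3], [1]],
})

summary = quick_profile(toy)
print(summary)
```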



Information about the dataset:
- “label”: 0 = false, 1 = partly true, 2 = true
- “claimant”: the entity that made the claim
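The label encoding listed above can be made readable when inspecting class balance. The following is a small sketch; the label values in `labels` are invented and do not reflect the real distribution.

```python
import pandas as pd

# Encoding as described in the dataset information above
LABEL_NAMES = {0: "false", 1: "partly true", 2: "true"}

# Hypothetical label column, not taken from train.json
labels = pd.Series([0, 2, 1, 2, 2], name="label")

named = labels.map(LABEL_NAMES)  # replace each code with its name
counts = named.value_counts()    # number of claims per class

print(counts)
```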

## 3. Load and process articles to generate corpora

### 3.1 Build dataframe of article text and article id

In [25]:
path_to_articles = r"/media/bking/data/Datasets/fakenews_datacup/train_articles/"

# Build parallel lists of article ids and article texts
article_id_list = []
article_text_list = []
for article_file in os.scandir(path_to_articles):
    # Take the digits from the file name only, so that digits elsewhere
    # in the directory path cannot leak into the id
    article_id = int(''.join(filter(str.isdigit, article_file.name)))

    with open(article_file.path) as f:
        text = f.read()

    article_id_list.append(article_id)
    article_text_list.append(text)

# Construct dataframe with one row per article
article_df = pd.DataFrame({'id': article_id_list, 'text': article_text_list})
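The directory scan can be exercised on a throwaway folder. This is a sketch under two assumptions that the notebook does not confirm: that article files are named `<id>.txt`, and that they are UTF-8 encoded. The function `load_articles` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_articles(directory) -> pd.DataFrame:
    """Read every file whose stem is an integer id into an (id, text) frame."""
    rows = []
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and files whose stem is not purely numeric
        if not path.is_file() or not path.stem.isdigit():
            continue
        rows.append({"id": int(path.stem), "text": path.read_text(encoding="utf-8")})
    return pd.DataFrame(rows, columns=["id", "text"])


# Build a temporary directory with two articles and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "122094.txt").write_text("first article", encoding="utf-8")
    (Path(tmp) / "106868.txt").write_text("second article", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("not an article", encoding="utf-8")
    demo_df = load_articles(tmp)

print(demo_df)
```

Reading the id from the file stem avoids picking up digits that happen to appear in parent directory names.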

### 3.2 Store initial data for other steps 

In [26]:
%store article_df 

Stored 'article_df' (DataFrame)
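A later notebook can restore the variable with `%store -r article_df`. Outside IPython, one alternative is to write the frame to disk; the sketch below uses a pickle file, which is a swapped-in technique and not what this notebook does. The file name and the frame `frame` are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for article_df
frame = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "article_df.pkl"
    frame.to_pickle(target)            # serialize the frame to disk
    restored = pd.read_pickle(target)  # load it back, dtypes included

print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.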
