In [1]:
%%capture

%cd ..
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import pandas as pd
import numpy as np
import json

# Data Preparation

This notebook converts the dataset from many individual files into a single CSV file containing all the information we need. Before running it, set the data paths in the configuration cell below. We use the full-text version of the dataset (i.e. the texts that are not truncated).

In [3]:
RAW_DATA_FOLDER = Path('data/raw/')
INTERMEDIATE_DATA_FOLDER = Path('data/interim/')
REFERENCE_FOLDER = Path('references/')

FAKE_DATA_FOLDER = RAW_DATA_FOLDER / 'fake'
TRUE_DATA_FOLDER = RAW_DATA_FOLDER / 'true'
FAKE_META_FOLDER = RAW_DATA_FOLDER / 'fake-meta-information'
TRUE_META_FOLDER = RAW_DATA_FOLDER / 'true-meta-information'

## Text Data

We start with the text data. We will build a dataframe in which each text occupies one row of a single "text" column, indexed by the file name without its extension. The function below does this:

In [4]:
def create_text_df(folder):
    df_dict = {}
    for filepath in folder.glob("*.txt"):
        with open(filepath, 'r', encoding='utf-8') as f:
            df_dict[filepath.stem] = f.read() 
    return pd.DataFrame.from_dict(df_dict, orient='index', columns=['text'])
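
As a quick sanity check, the same reading logic can be tried on a throwaway folder. This is only a sketch: `read_texts`, the temporary folder, and the two sample files are hypothetical stand-ins so the example runs on its own, not part of the dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_texts(folder):
    # Hypothetical stand-in for the reader above: one row per .txt file,
    # indexed by the file name without its extension.
    texts = {p.stem: p.read_text(encoding="utf-8") for p in folder.glob("*.txt")}
    return pd.DataFrame.from_dict(texts, orient="index", columns=["text"])


with tempfile.TemporaryDirectory() as tmp:
    tmp_folder = Path(tmp)
    # Made-up sample files, only for illustration
    (tmp_folder / "1.txt").write_text("primeira notícia", encoding="utf-8")
    (tmp_folder / "10.txt").write_text("segunda notícia", encoding="utf-8")
    demo_df = read_texts(tmp_folder)

# The index holds the file stems as strings, not integers
print(sorted(demo_df.index))
```

Because the index values are strings, they do not sort numerically, which is why the merge step later casts the index to int.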

Now, let's create the text datasets:

In [5]:
fake_text_df = create_text_df(FAKE_DATA_FOLDER)
true_text_df = create_text_df(TRUE_DATA_FOLDER)

In [6]:
fake_text_df.head()

Unnamed: 0,text
1,Kátia Abreu diz que vai colocar sua expulsão e...
10,"Dr. Ray peita Bolsonaro, chama-o de conservad..."
100,Reinaldo Azevedo desmascarado pela Polícia Fed...
1000,Relatório assustador do BNDES mostra dinheiro ...
1001,"Radialista americano fala sobre o PT: ""Eles ve..."


## Metadata Dataset

As with the text data, we need a function that loads the metadata files into a dataframe.

In [7]:
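def create_metadata_df(folder, metadata_columns):
    # One list per metadata column, plus the file index taken from the file name
    df_dict = {k: [] for k in metadata_columns}
    df_dict["index"] = []

    for filepath in folder.glob("*.txt"):
        with open(filepath, 'r', encoding='utf-8') as f:
            df_dict["index"].append(filepath.stem.split("-")[0])
            # Each line holds one value, in the order given by metadata_columns
            for col, value in zip(metadata_columns, f.readlines()):
                # Strip only the line break, so a last line without one stays intact
                df_dict[col].append(value.rstrip("\n"))

    df = pd.DataFrame(df_dict)
    df.index.name = None
    return df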

Looking at the dataset folders, we can see that each metadata file holds one value per line, in an order defined in the README file. The order is:

In [8]:
metadata_columns = [
    "author", "link", "category", "date_of_publication",
    "tokens", "words_no_punctuation", "types", "links_inside", 
    "upper_words", "verbs", "subjuntive_imperative_verbs",
    "nouns", "adjectives", "adverbs", "modal_verbs", 
    "singular_first_second_personal_pronouns",
    "plural_first_personal_pronouns", "pronouns",
    "pausality", "characters", "average_sentence_length",
    "average_word_lenght", "percentage_spelling_errors",
    "emotiveness", "diversity"
]
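
To illustrate how this ordering is used, the sketch below pairs column names with the lines of a metadata file. The three-field layout and the sample values are hypothetical, chosen only to keep the example short; the real files have one line per entry in the list above.

```python
# Hypothetical three-field metadata file, represented as its list of lines
demo_columns = ["author", "link", "category"]
demo_lines = ["mrk\n", "https://example.com/news\n", "politica\n"]

# Pair each column name with the line at the same position
record = {col: line.rstrip("\n") for col, line in zip(demo_columns, demo_lines)}
print(record)
# {'author': 'mrk', 'link': 'https://example.com/news', 'category': 'politica'}

# zip stops at the shorter input, so a file with missing lines
# silently produces a shorter record
short_record = {col: line.rstrip("\n") for col, line in zip(demo_columns, demo_lines[:2])}
print(len(short_record))
```

Since `zip` truncates silently, a metadata file with fewer lines than expected would leave the column lists with different lengths, and building the dataframe would then fail.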

Now we can create the datasets:

In [9]:
fake_metadata_df = create_metadata_df(FAKE_META_FOLDER, metadata_columns)
true_metadata_df = create_metadata_df(TRUE_META_FOLDER, metadata_columns)

In [10]:
fake_metadata_df.head()

Unnamed: 0,author,link,category,date_of_publication,tokens,words_no_punctuation,types,links_inside,upper_words,verbs,...,plural_first_personal_pronouns,pronouns,pausality,characters,average_sentence_length,average_word_lenght,percentage_spelling_errors,emotiveness,diversity,index
0,mrk,https://ceticismopolitico.com/2017/11/30/katia...,politica,2017-11-30,211,185,120,0,6,30,...,0,26,2.0,815,14.2308,4.40541,0.0,0.263158,0.64864,1
1,,https://ceticismopolitico.com/2017/11/24/dr-ra...,politica,2017-11-24,289,254,163,0,0,56,...,0,20,2.5,1205,18.1429,4.74409,0.00787402,0.241667,0.64173,10
2,,https://afolhabrasil.com.br/politica/reinaldo-...,politica,2017-05-23,304,275,170,0,0,45,...,0,18,1.8125,1344,17.1875,4.88727,0.00363636,0.12782,0.61818,100
3,,https://www.diariodobrasil.org/relatorio-assus...,politica,24/07/2017,639,572,316,1,14,87,...,0,34,2.68,3122,22.88,5.45804,0.00174825,0.229008,0.55244,1000
4,,https://www.diariodobrasil.org/radialista-amer...,politica,25/07/2017,128,111,82,0,1,21,...,0,12,0.894737,515,5.84211,4.63964,0.0,0.269231,0.73873,1001


In [11]:
len(fake_metadata_df)

3600

Inspecting the data, we can see that some columns contain the string "None" instead of np.nan. The code below counts the affected rows in each dataset:

In [12]:
print(len(fake_metadata_df[fake_metadata_df.isin(["None"]).any(axis=1)]))
print(len(true_metadata_df[true_metadata_df.isin(["None"]).any(axis=1)]))

3528
1393


Let's fix this by replacing every "None" value with np.nan:

In [13]:
fake_metadata_df = fake_metadata_df.replace("None", np.nan)
true_metadata_df = true_metadata_df.replace("None", np.nan)
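
The counting and replacement steps can be checked on a small frame. The values below are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with the placeholder string "None" in some cells
demo = pd.DataFrame({
    "author": ["mrk", "None", "None"],
    "tokens": ["211", "289", "None"],
})

# Rows that contain "None" in at least one column
rows_with_none = demo.isin(["None"]).any(axis=1)
print(int(rows_with_none.sum()))

# Replace the placeholder with a real missing value
cleaned = demo.replace("None", np.nan)
print(cleaned.isna().sum().to_dict())
```

After the replacement, pandas treats these cells as missing, so methods such as `isna` and `dropna` see them.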

Next, let's correct the data types. We need to specify them manually, based on the information provided by the documentation.

**NOTE:** the variable *date_of_publication* comes in many different formats that pandas cannot identify automatically, so it will be corrected later. For now we define it as a string variable.

In [14]:
metadata_dtypes = {
    "author": "string", "link": "string", "category": "string",
    "date_of_publication": "string",
    "tokens": "float", "words_no_punctuation": "float",
    "types": "float","links_inside": "float", "upper_words": "float",
    "verbs": "float", "subjuntive_imperative_verbs": "float", "nouns": "float", 
    "adjectives": "float", "adverbs": "float","modal_verbs": "float", 
    "singular_first_second_personal_pronouns": "float",
    "plural_first_personal_pronouns": "float", "pronouns": "float","characters": "float",
    "pausality": "float", "average_sentence_length": "float",
    "average_word_lenght": "float", "percentage_spelling_errors": "float",
    "emotiveness": "float", "diversity": "float"
}

In [15]:
fake_metadata_df = fake_metadata_df.astype(metadata_dtypes, errors='raise').set_index("index", drop=True)
true_metadata_df = true_metadata_df.astype(metadata_dtypes, errors='raise').set_index("index", drop=True)
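
The order of the two previous steps matters: casting to float only works once the "None" strings are gone. The sketch below shows this on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd

# Casting a column that still contains the string "None" fails
try:
    pd.Series(["211", "None"]).astype("float")
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)

# With np.nan in place of "None", the same cast succeeds
demo = pd.DataFrame({
    "index": ["1", "10"],
    "author": ["mrk", np.nan],
    "tokens": ["211", np.nan],
})
demo = demo.astype({"author": "string", "tokens": "float"}).set_index("index")
print(demo.dtypes)
```

Missing values survive the cast as NaN in the float column and as `<NA>` in the string column.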

### Datetime transformation

In [16]:
# TODO: DATETIME TRANSFORMATION
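
One possible approach to this transformation, sketched here as an assumption rather than the method the notebook ends up using, is to try a list of explicit formats and keep the first one that parses. `parse_mixed_dates` and `ASSUMED_FORMATS` are hypothetical names, and the list covers only the two formats visible in the metadata preview above (2017-11-30 and 24/07/2017); any other format in the real column would remain NaT.

```python
import pandas as pd

# Assumption: only the two formats seen in the preview are handled
ASSUMED_FORMATS = ["%Y-%m-%d", "%d/%m/%Y"]


def parse_mixed_dates(values, formats=ASSUMED_FORMATS):
    values = pd.Series(values)
    # Parse with the first format; entries that do not match become NaT
    parsed = pd.to_datetime(values, format=formats[0], errors="coerce")
    for fmt in formats[1:]:
        # Fill the remaining gaps with the next format's result
        parsed = parsed.fillna(pd.to_datetime(values, format=fmt, errors="coerce"))
    return parsed


demo_dates = parse_mixed_dates(["2017-11-30", "24/07/2017", "not a date"])
print(demo_dates)
```

Listing the formats explicitly avoids ambiguous guesses between day-first and month-first dates, at the cost of having to extend the list whenever a new format turns up.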

## Merging Datasets

Now we just need to merge the datasets we created: first the metadata with the texts, and then the true and fake datasets, producing a single CSV file with all the information we need.

In [23]:
fake_df = pd.concat([fake_text_df, fake_metadata_df], axis=1, sort=False)
fake_df.index = fake_df.index.astype(int)
# Sort numerically while the file index is still the index;
# sorting after reset_index() would be a no-op
fake_df = fake_df.sort_index()
fake_df = fake_df.reset_index().rename(columns={"index": "file_index"})

true_df = pd.concat([true_text_df, true_metadata_df], axis=1, sort=False)
true_df.index = true_df.index.astype(int)
true_df = true_df.sort_index()
true_df = true_df.reset_index().rename(columns={"index": "file_index"})
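
The column-wise merge relies on `pd.concat` aligning rows by index label rather than by position. A small sketch with made-up values shows the alignment, with the sort applied while the integer labels are still the index:

```python
import pandas as pd

# Toy inputs whose rows are deliberately in different orders
texts = pd.DataFrame({"text": ["a", "b"]}, index=["10", "2"])
meta = pd.DataFrame({"tokens": [5.0, 7.0]}, index=["2", "10"])

# axis=1 places the frames side by side, matching rows by index label
merged = pd.concat([texts, meta], axis=1, sort=False)

# Cast the string labels to int so that 2 sorts before 10
merged.index = merged.index.astype(int)
merged = merged.sort_index().reset_index().rename(columns={"index": "file_index"})
print(merged)
```

Each text ends up next to its own metadata even though the two inputs listed the files in different orders.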

In [24]:
result = pd.concat([true_df, fake_df], keys=['True', 'Fake'])
result = result.reset_index(level=0)
result = result.rename(columns={"level_0": "class"})
result.index.name = None
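
The `keys` argument is what produces the class label. The sketch below repeats the same steps on two made-up frames.

```python
import pandas as pd

true_demo = pd.DataFrame({"file_index": [1, 2]})
fake_demo = pd.DataFrame({"file_index": [1, 2]})

# keys adds an outer index level that records which frame each row came from
stacked = pd.concat([true_demo, fake_demo], keys=["True", "Fake"])

# Move that level into a regular column and give it a meaningful name
stacked = stacked.reset_index(level=0).rename(columns={"level_0": "class"})
print(stacked)
```

The remaining row index repeats across the two classes. That is harmless when the file is exported with `index=False`.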

## Exporting Data

In [25]:
INTERMEDIATE_DATA_FOLDER.mkdir(exist_ok=True, parents=True)
result.to_csv(INTERMEDIATE_DATA_FOLDER / "fake_true_news.csv", index=False)