In [None]:
import numpy as np
import pandas as pd

pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 150

In [None]:
bahamas_edge_raw = pd.read_csv('../data/raw/bahamas_leaks/bahamas_leaks.edges.csv')
offshore_edge_raw = pd.read_csv('../data/raw/offshore_leaks/offshore_leaks.edges.csv')
panama_edge_raw = pd.read_csv('../data/raw/panama_papers/panama_papers.edges.csv')
paradise_edge_raw = pd.read_csv('../data/raw/paradise_papers/paradise_papers.edges.csv')

## To Do

- Standardize the column names
- Remove column `valid_until`
- Why are rel_type/TYPE and link columns different?
- Parse dates to a standard format (`datetime64[ns]`)
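
A minimal sketch of the first two items, assuming that `TYPE` and `rel_type` are the same column under different names in different files (that is a guess, not something checked against the raw CSVs). The `RENAME_MAP` and `standardize_edges` names and the toy frame are hypothetical.

```python
import pandas as pd

# Hypothetical rename map: only TYPE -> rel_type is suggested by the notes above.
RENAME_MAP = {"TYPE": "rel_type"}

def standardize_edges(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with shared column names and without `valid_until`."""
    out = df.rename(columns=RENAME_MAP)
    # errors="ignore" keeps this safe for frames that do not have the column.
    return out.drop(columns=["valid_until"], errors="ignore")

# Toy frame standing in for one of the raw edge files (made-up values).
toy = pd.DataFrame({
    "TYPE": ["officer_of"],
    "valid_until": ["placeholder"],
    "start_date": ["14-MAR-2015"],
})
clean = standardize_edges(toy)
print(list(clean.columns))
```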

## Date Formats

There are multiple date formats in the start_date and end_date columns. These differences include:

- the month represented as an abbreviated string or as a number (14-MAR-2015 and 03/14/2015)
- the year as two digits or as four (15 and 2015)
- the month and day written with or without zero padding (03 and 3)
- no way to tell whether the day or month comes first when both non-year parts are 12 or below (03/09/2015 could be March 9th or September 3rd)
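
To make the last point concrete, here is a small illustration using the example strings from the list above. It only shows that both readings parse cleanly when neither part exceeds 12, and that a value with a part above 12 can be read only one way.

```python
import pandas as pd

ambiguous = "03/09/2015"

# Both explicit formats succeed, so the string alone cannot settle the order.
as_mdy = pd.to_datetime(ambiguous, format="%m/%d/%Y")
as_dmy = pd.to_datetime(ambiguous, format="%d/%m/%Y")
print(as_mdy.date(), as_dmy.date())

# A value with a part above 12 only fits one order; the other coerces to NaT.
unambiguous = "03/14/2015"
wrong_order = pd.to_datetime(unambiguous, format="%d/%m/%Y", errors="coerce")
right_order = pd.to_datetime(unambiguous, format="%m/%d/%Y", errors="coerce")
```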

### Why would anyone record dates in multiple formats??

A couple of reasons come to mind. This is real-world data, after all.

- Dates probably come from underlying source data. As this comes from different sources, they may have been recorded in different formats and just collated without any cleaning
- Someone could have opened it in Excel, which tends to change things. It may have reformatted the dates it could interpret and left alone the ones it couldn't (where the day is > 12)

### Plan of attack

Regardless of the reason, this needs to be addressed.

The format with abbreviated month names is easy: every such value follows `%d-%b-%Y`, and these rows can be separated out using `df['date'].str.contains('-', na=False)`.
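
A minimal sketch of that split, using a toy frame with a `date` column (the column name and the values are made up for illustration). Rows without a `-` are left as `NaT` here so they can be handled separately.

```python
import pandas as pd

# Toy data: two month-name dates, one numeric date, one missing value.
df = pd.DataFrame({"date": ["14-MAR-2015", "03/14/2015", None, "02-JAN-1998"]})

# True only for the day-monthname-year rows; missing values count as False.
has_month_name = df["date"].str.contains("-", na=False)

# Blank out the other rows before parsing so they come back as NaT.
df["parsed"] = pd.to_datetime(df["date"].where(has_month_name), format="%d-%b-%Y")
print(df)
```

This relies on `%b` matching English month abbreviations regardless of case, which holds under the default C/English locale.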

The fully numeric format isn't as easy: there are 3 different formats and no indication of whether the month or day comes first. The only consistent piece is that the year comes last.

Since no real-world decisions depend on this analysis, we could just arbitrarily pick whether to parse the date as day first. However, we can get a little more accurate.

Using the Wikipedia page https://en.wikipedia.org/wiki/Date_format_by_country we can use where the data came from to make an educated guess as to the probable date format.

- Bahamas Leaks data came from the official corporate registry of the Bahamas. So we can guess they use the MDY format.
- Offshore Leaks data came from offshore entities incorporated through Portcullis Trustnet (based in the Cook Islands) and Commonwealth Limited (based in the British Virgin Islands). The islands both companies are based in typically use the DMY date format.
- Panama Papers data came from a Panama law firm. Panama uses the MDY format.
- Paradise Papers data comes from the most varied data sources. According to the ICIJ website, this data comes from the corporate registries of seven countries: Aruba, Bahamas, Barbados, Nevis, Cook Islands, Malta, and Samoa. However, looking at the unique sourceIDs, we can see that Lebanon is also included. Fortunately, all of those countries use the same date format except Samoa. Majority: DMY, Samoa: MDY

### Date Format Decision

- Bahamas Leaks: MDY
- Offshore Leaks: DMY
- Panama Papers: MDY
- Paradise Papers: separate by `df['sourceID'].str.contains('Samoa', na=False)`
  + contains Samoa: MDY
  + doesn't contain Samoa: DMY
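
The decision list above could be encoded roughly as follows. This is a sketch, not the notebook's final implementation: the dataset keys (`"bahamas"`, `"paradise"`, and so on), the `dayfirst_mask` helper, and the toy `sourceID` values are all hypothetical.

```python
import pandas as pd

# Decisions from the list above; the short dataset keys are hypothetical labels.
DAYFIRST_BY_DATASET = {"bahamas": False, "offshore": True, "panama": False}

def dayfirst_mask(df: pd.DataFrame, dataset: str) -> pd.Series:
    """Boolean Series: True where a numeric date should be read as DMY."""
    if dataset == "paradise":
        # Samoa rows are MDY; everything else (including missing IDs) is DMY.
        is_samoa = df["sourceID"].str.contains("Samoa", na=False)
        return ~is_samoa
    return pd.Series(DAYFIRST_BY_DATASET[dataset], index=df.index)

# Made-up sourceID values, only to exercise the mask.
toy = pd.DataFrame({"sourceID": ["Registry - Samoa", "Registry - Malta", None]})
paradise_mask = dayfirst_mask(toy, "paradise")
panama_mask = dayfirst_mask(toy, "panama")
print(paradise_mask.tolist(), panama_mask.tolist())
```

Returning a per-row mask rather than a single flag keeps the Paradise Papers case, where the format varies within one file, on the same code path as the others.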

### Numeric difference plan of attack

- Two vs Four digit year: split on '/' and look at length of 3rd segment
- Four digit year
    + If length of the string is < 10 month/day are not zero padded
    + If length of string = 10 (dd/mm/yyyy) month/day are zero padded
- Two digit year
    + If length of string is < 8 month/day are not zero padded
    + If length of string = 8 (dd/mm/yy) month/day are zero padded
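
The length checks above can be sketched on a toy Series of numeric dates (the values are made up). A fully padded string is 6 characters longer than its year segment: two for the day, two for the month, two for the separators.

```python
import pandas as pd

dates = pd.Series(["03/09/2015", "3/9/2015", "03/09/15", "3/9/15"])

# Length of the third '/'-separated segment: 4 or 2 digits.
year_digits = dates.str.split("/").str[2].str.len()

# Fully zero padded means length 10 for four-digit years, 8 for two-digit years.
zero_padded = dates.str.len() == year_digits + 6

summary = pd.DataFrame(
    {"date": dates, "year_digits": year_digits, "zero_padded": zero_padded}
)
print(summary)
```

A value with only one of the two parts padded (for example `03/9/2015`) is flagged as not zero padded by this check.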