# ICIJ analysis: prepare data

This notebook prepares data for loading into a graph database, based on the _ICIJ Offshore Leaks_ dataset: <https://offshoreleaks.icij.org/pages/database>

## Set up

Load the Python dependencies.

In [1]:
from collections import Counter
import csv
import pathlib
import re
import typing

from icecream import ic
import pandas as pd
import watermark

%load_ext watermark

In [2]:
%watermark
%watermark --iversions

Last updated: 2024-07-08T18:29:19.709059-07:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.26.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.30)
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

pandas   : 2.2.2
re       : 2.2.1
csv      : 1.0
watermark: 2.4.3



This notebook assumes that the ICIJ data has been 

  1. downloaded from <https://offshoreleaks-data.icij.org/offshoreleaks/csv/full-oldb.LATEST.zip>
  2. inflated into a local directory `"full_2023"`

In [3]:
!rm -rf temp
!mkdir temp

In [4]:
DATA_DIR: pathlib.Path = pathlib.Path("full_2023")
TEMP_DIR: pathlib.Path = pathlib.Path("temp")

Also, confirm that the ICIJ download has a timestamp generated on **2023-11-27** (latest release)

In [5]:
list(DATA_DIR.glob("GENERATED*"))

[PosixPath('full_2023/GENERATED_ON_20231127.txt')]

## Data discovery

### Entities

In [6]:
data_file: pathlib.Path = DATA_DIR / "nodes-entities.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,incorporation_date,...,struck_off_date,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note
0,10000001,"TIANSHENG INDUSTRY AND TRADING CO., LTD.","TIANSHENG INDUSTRY AND TRADING CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1001256.0,23-MAR-2006,...,15-FEB-2013,,Defaulted,Mossack Fonseca,25221,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,
1,10000002,"NINGBO SUNRISE ENTERPRISES UNITED CO., LTD.","NINGBO SUNRISE ENTERPRISES UNITED CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1001263.0,27-MAR-2006,...,15-FEB-2014,,Defaulted,Mossack Fonseca,25249,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,
2,10000003,"HOTFOCUS CO., LTD.","HOTFOCUS CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1000896.0,10-JAN-2006,...,15-FEB-2012,,Defaulted,Mossack Fonseca,24138,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,


In [7]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,incorporation_date,...,struck_off_date,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note
count,814344,814344.0,814344.0,814344.0,814344,814344,814344.0,814344.0,814344.0,814344.0,...,814344.0,814344.0,814344.0,814344.0,814344.0,814344.0,814344.0,814344,814344,814344.0
freq,1,29.0,424822.0,807507.0,209634,209713,675593.0,515021.0,424822.0,25874.0,...,470501.0,794137.0,456310.0,470258.0,251878.0,309353.0,309353.0,213634,213634,772569.0
unique,814344,781569.0,368821.0,6783.0,103,88,66.0,20583.0,343795.0,19009.0,...,9083.0,276.0,86.0,5.0,511469.0,1102.0,1103.0,22,65,207.0
top,10000001,,,,BAH,Bahamas,,,,,...,,,,,,,,Panama Papers,The Panama Papers data is current through 2015,


What are the values within the `company_type` column?  These will get coalesced later with a related `type` column from another CSV file.

In [8]:
set(df.company_type.values)

{'',
 'Antigua',
 'Audit Licence',
 'BVI Share Trust',
 'BVI Sundry Entities (one off transactions)',
 'BVI Trust',
 'Bahamas IBC',
 'Belize International Business Company',
 'Busines Company Limited by Shares & Guarantee',
 'Business Company Limited by Guarantee',
 'Business Company Limited by Shares',
 'Business Company Restricted Purposes',
 'Business Corporation',
 'Business Vehicle',
 'CAP 285',
 'Client Sundry Account',
 'Collective Investment Scheme',
 'Cook Islands Asset Protection Trust',
 'Cook Islands Asset Protection Trust - 3520A',
 'Cook Islands Trust',
 'Domestic Company',
 'EXEMPT INSURANCE HOLDING REGISTER',
 'EXEMPT INSURANCE REGISTER',
 'Foreign Company',
 'Foreign Company Transfer',
 'Holding Company',
 'Hong Kong',
 'Hong Kong Trust',
 'International Business Corporation',
 'International Company',
 'International Trust',
 'Limited Liability Company',
 'Limited Partnership',
 'Liquidator Licence',
 'Mail Forwarding Only',
 'Mauritius - Hybrid',
 'Mauritius - Intern

Keep track of the set of `node_id` values for this class.

In [9]:
id_entity: typing.Set[ str ] = set(df.node_id.values)

Reshape the dataframe to fit our inclusive `Entity` metadata, and store to a CSV file.

In [10]:
df.columns.values.tolist()

['node_id',
 'name',
 'original_name',
 'former_name',
 'jurisdiction',
 'jurisdiction_description',
 'company_type',
 'address',
 'internal_id',
 'incorporation_date',
 'inactivation_date',
 'struck_off_date',
 'dorm_date',
 'status',
 'service_provider',
 'ibcRUC',
 'country_codes',
 'countries',
 'sourceID',
 'valid_until',
 'note']

In [11]:
df.insert(1, "role", "Entity")
df["vague"] = False

df.head(3)

Unnamed: 0,node_id,role,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,...,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note,vague
0,10000001,Entity,"TIANSHENG INDUSTRY AND TRADING CO., LTD.","TIANSHENG INDUSTRY AND TRADING CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1001256.0,...,,Defaulted,Mossack Fonseca,25221,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,,False
1,10000002,Entity,"NINGBO SUNRISE ENTERPRISES UNITED CO., LTD.","NINGBO SUNRISE ENTERPRISES UNITED CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1001263.0,...,,Defaulted,Mossack Fonseca,25249,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,,False
2,10000003,Entity,"HOTFOCUS CO., LTD.","HOTFOCUS CO., LTD.",,SAM,Samoa,,ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 1...,1000896.0,...,,Defaulted,Mossack Fonseca,24138,HKG,Hong Kong,Panama Papers,The Panama Papers data is current through 2015,,False


The way that `pandas` handles CSV serialization is known to be buggy. Since all of the data is strings, force quoting.

In [12]:
df.to_csv(
    TEMP_DIR / "entity.1.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

Store a copy of these headers to use for validating the reshaping of related CSV files later.

In [13]:
EXPECTED_ENTITY_COLUMNS: typing.List[ str ] = df.columns.values.tolist()

### Officers

In [14]:
data_file: pathlib.Path = DATA_DIR / "nodes-officers.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id,name,countries,country_codes,sourceID,valid_until,note
0,12000001,KIM SOO IN,South Korea,KOR,Panama Papers,The Panama Papers data is current through 2015,
1,12000002,Tian Yuan,China,CHN,Panama Papers,The Panama Papers data is current through 2015,
2,12000003,GREGORY JOHN SOLOMON,Australia,AUS,Panama Papers,The Panama Papers data is current through 2015,


In [15]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,name,countries,country_codes,sourceID,valid_until,note
count,771315,771315,771315.0,771315.0,771315,771315,771315.0
freq,1,70873,302334.0,301057.0,238402,238402,767554.0
unique,771315,538360,4091.0,4772.0,22,81,24.0
top,12000001,THE BEARER,,,Panama Papers,The Panama Papers data is current through 2015,


Note: phrases such as "THE BEARER" or "EL PORTADOR" are contract placeholders for whomever holds the bearer shares, and we need to filter them from entity resolution or constructing relations.

In [16]:
names: typing.Dict[ str, int ] = df["name"].value_counts().to_dict()
names

{'THE BEARER': 70873,
 'EL PORTADOR': 9325,
 'Bearer 1': 2655,
 'CARMICHAEL TREVOR A.': 1196,
 'CLEMENTI LIMITED': 1111,
 'TANAH MERAH LIMITED': 1046,
 'BUKIT MERAH LIMITED': 963,
 'CST ADMINISTRATION (BAHAM': 835,
 'Bearer': 818,
 'The Bearer': 813,
 'THE CORPORATE SECRETARY LIMITED': 700,
 'Christopher Marcus GRADEL': 576,
 'BEARER': 537,
 'COURT ADMINISTRATION LIMI': 474,
 'PRIMARY MANAGEMENT LIMITE': 449,
 'BARNES DEBORAH J.': 416,
 'FIELDS JAMES A.': 408,
 'BLUE SEAS ADMINISTRATION': 389,
 'STANDARD NOMINEES (BAHAMA': 384,
 "TRIDENT CORPORATE SERVICES (B'DOS) LTD": 369,
 'CIRCLE CORPORATE SERVICES': 343,
 'CORPORATE SERVICES LIMITED': 319,
 'GUARDIAN NOMINEES (BARBADOS) LIMITED': 288,
 'IL SHIN CORPORATE CONSULTING LIMITED': 286,
 'KHAN ZARINA': 283,
 'CASCADO AG': 275,
 'OCTAGON MANAGEMENT LIMITE': 272,
 'Vanessa Marie-Antoine PAYET': 259,
 'AMICORP (BARBADOS) LTD.': 243,
 'Shirley Sabia Therese VAN KERKHOVE': 242,
 'MANEX LIMITED': 238,
 'BDG MANAGEMENT LTD.': 231,
 'SISNETT NAT

In [17]:
suspect: typing.Dict[ str, int ] = {
    name.lower(): count
    for name, count in names.items()
    if count >= 2 and ("bearer" in name.lower() or "portador" in name.lower())
}

suspect

{'the bearer': 4,
 'el portador': 25,
 'bearer 1': 13,
 'bearer': 16,
 'the  bearer': 114,
 'bearer 2': 10,
 'to the bearer': 6,
 'al portador': 39,
 'bearer 3': 4,
 'bearer 4': 5,
 'bearer 5': 2,
 'bearer 6': 16,
 'bearer02': 14,
 'bearer 8': 14,
 'bearer 7': 14,
 'to bearer': 14,
 'bearer 9': 12,
 'bearer  - bearer agent - liang chen yuen chi': 10,
 'bearer 10': 10,
 'bearer1': 6,
 'the bearer 50,000 shares': 7,
 'the bearer at 500 shares': 7,
 'the bearer (agent: mr. mak kin man)': 7,
 'bearer share': 5,
 'the bearer at 50,000 shares': 6,
 'the bearer of 50,000 shares': 6,
 'the bearer (agent: pro-corp nominees limited)': 6,
 'bearer: agent-wirja tanizar': 6,
 'bearer for immota foundation': 6,
 'bearer01': 2,
 'bearer.': 5,
 'bearer03': 5,
 'bearer 12': 5,
 'bearer (elena kyprianou)': 5,
 'bearer2': 4,
 'the bearer agent: wbc secretaries ltd.': 4,
 'the bearer agent: ann lay': 4,
 'the bearer1': 4,
 '-the bearer': 4,
 'bearer 14': 4,
 'bearer 13': 4,
 'bearer 15': 4,
 'bearer10': 4

In [18]:
len(suspect)

146

Identify regex patterns to use for filtering these legalese placeholders.

In [19]:
import re

PAT_LIST: typing.List[ str ] = [
    r"^\-?(to\s+)?([the]+\s+)?bearer\.?\s?(\d+)?(\w)?$",
    r"^.*bearer.*shares?$",
    r"^the\s+bearer\s+\([\d\,]+\)$",
    r"^[ae]l\s+portador$",
    r"^the\s?bearer$",
    r"^bearer\s?warrant$",
    r"^bearer\s?shareholder$",
    r"^the\,\s+bearer$",
    r"^bearer\s+\(reedeem\s+shares\)$",
    r"^the\s+bearer\s+\(lost\)$",
    r"^bearer\s+\-\s+[\w]$",
    r"^bearer\s+\"\w\"$",
    r"^bearer\s+[\d\-]+$",
    r"^bearer\s+no\.\s+\d+$",
    r"^the\s+bearer\s+at\s+[\d\,]+$",
    r"^nan$",
    r"^[\?]+$",
]


def filter_bearer (
    name: str,
    ) -> bool:
    name = name.lower()

    for pat in PAT_LIST:
        if re.search(pat, name) is not None:
            return False

    return True


total: int = 0

for name, count in suspect.items():
    if filter_bearer(name):
        ic(name, count)
        total += 1

total

ic| name: 'bearer  - bearer agent - liang chen yuen chi', count: 10
ic| name: 'the bearer (agent: mr. mak kin man)', count: 7
ic| name: 'the bearer (agent: pro-corp nominees limited)', count: 6
ic| name: 'bearer: agent-wirja tanizar', count: 6
ic| name: 'bearer for immota foundation', count: 6
ic| name: 'bearer (elena kyprianou)', count: 5
ic| name: 'the bearer agent: wbc secretaries ltd.', count: 4
ic| name: 'the bearer agent: ann lay', count: 4
ic| name: 'the bearer (mr. andrey komarov)', count: 4
ic| name: 'the bearer:agent davenhill assets limited', count: 4
ic| name: 'the bearer (mr. yuriy vasilievich schastlivyi)', count: 3
ic| name: 'the bearer agent-kan pui wai', count: 3
ic| name: 'bearer agent: toni juhani salmela', count: 3
ic| name: 'bearer (vinsonburg management corp.)', count: 3
ic| name: 'bearer agent:- pro-corp nominees limited', count: 3
ic| name: 'bearer pro-corp nominees limited', count: 3
ic| name: 'the bearer (agent pro-corp nominees limited', count: 3
ic| name: 't

57

Confirm whether this filter works…

In [20]:
counter: Counter = Counter()

for name, count in df["name"].value_counts().to_dict().items():
    if filter_bearer(name):
        counter[name] += count

counter.most_common(100)

[('CARMICHAEL TREVOR A.', 1196),
 ('CLEMENTI LIMITED', 1111),
 ('TANAH MERAH LIMITED', 1046),
 ('BUKIT MERAH LIMITED', 963),
 ('CST ADMINISTRATION (BAHAM', 835),
 ('THE CORPORATE SECRETARY LIMITED', 700),
 ('Christopher Marcus GRADEL', 576),
 ('COURT ADMINISTRATION LIMI', 474),
 ('PRIMARY MANAGEMENT LIMITE', 449),
 ('BARNES DEBORAH J.', 416),
 ('FIELDS JAMES A.', 408),
 ('BLUE SEAS ADMINISTRATION', 389),
 ('STANDARD NOMINEES (BAHAMA', 384),
 ("TRIDENT CORPORATE SERVICES (B'DOS) LTD", 369),
 ('CIRCLE CORPORATE SERVICES', 343),
 ('CORPORATE SERVICES LIMITED', 319),
 ('GUARDIAN NOMINEES (BARBADOS) LIMITED', 288),
 ('IL SHIN CORPORATE CONSULTING LIMITED', 286),
 ('KHAN ZARINA', 283),
 ('CASCADO AG', 275),
 ('OCTAGON MANAGEMENT LIMITE', 272),
 ('Vanessa Marie-Antoine PAYET', 259),
 ('AMICORP (BARBADOS) LTD.', 243),
 ('Shirley Sabia Therese VAN KERKHOVE', 242),
 ('MANEX LIMITED', 238),
 ('BDG MANAGEMENT LTD.', 231),
 ('SISNETT NATALIA B.', 230),
 ('WORME ROBERT C.', 220),
 ('CALLAGHAN DAVID 

Now add a `vague` column to flag the dubious names

In [21]:
df["vague"] = df.apply(lambda x: not filter_bearer(x["name"]), axis = 1)

Keep track of the set of `node_id` values for this class.

In [22]:
id_officer: typing.Set[ str ] = set(df[df.vague == False].node_id.values)

Reshape the dataframe to fit our inclusive `Entity` metadata.

In [23]:
df.columns.values.tolist()

['node_id',
 'name',
 'countries',
 'country_codes',
 'sourceID',
 'valid_until',
 'note',
 'vague']

In [24]:
df.insert(1, "role", "Officer")

df.insert(3, "original_name", "")
df.insert(4, "former_name", "")
df.insert(5, "jurisdiction", "")
df.insert(6, "jurisdiction_description", "")
df.insert(7, "company_type", "")
df.insert(8, "address", "")
df.insert(9, "internal_id", "")
df.insert(10, "incorporation_date", "")
df.insert(11, "inactivation_date", "")
df.insert(12, "struck_off_date", "")
df.insert(13, "dorm_date", "")
df.insert(14, "status", "")
df.insert(15, "service_provider", "")
df.insert(16, "ibcRUC", "")

In [25]:
columns: typing.List[ str ] = list(df.columns.values)

columns[17] = "country_codes"
columns[18] = "countries"

columns

['node_id',
 'role',
 'name',
 'original_name',
 'former_name',
 'jurisdiction',
 'jurisdiction_description',
 'company_type',
 'address',
 'internal_id',
 'incorporation_date',
 'inactivation_date',
 'struck_off_date',
 'dorm_date',
 'status',
 'service_provider',
 'ibcRUC',
 'country_codes',
 'countries',
 'sourceID',
 'valid_until',
 'note',
 'vague']

In [26]:
df = df.reindex(columns = columns)
df.head(3)

Unnamed: 0,node_id,role,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,...,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note,vague
0,12000001,Officer,KIM SOO IN,,,,,,,,...,,,,,KOR,South Korea,Panama Papers,The Panama Papers data is current through 2015,,False
1,12000002,Officer,Tian Yuan,,,,,,,,...,,,,,CHN,China,Panama Papers,The Panama Papers data is current through 2015,,False
2,12000003,Officer,GREGORY JOHN SOLOMON,,,,,,,,...,,,,,AUS,Australia,Panama Papers,The Panama Papers data is current through 2015,,False


Confirm that the reshaped dataframe has the proper column format.

In [27]:
assert df.columns.values.tolist() == EXPECTED_ENTITY_COLUMNS

Store this data as another CSV partition

In [28]:
df.to_csv(
    TEMP_DIR / "entity.2.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

### Intermediaries

In [29]:
data_file: pathlib.Path = DATA_DIR / "nodes-intermediaries.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id,name,status,internal_id,address,countries,country_codes,sourceID,valid_until,note
0,11000001,"MICHAEL PAPAGEORGE, MR.",ACTIVE,10001,MICHAEL PAPAGEORGE; MR. 106 NICHOLSON STREET B...,South Africa,ZAF,Panama Papers,The Panama Papers data is current through 2015,
1,11000002,CORFIDUCIA ANSTALT,ACTIVE,10004,,Liechtenstein,LIE,Panama Papers,The Panama Papers data is current through 2015,
2,11000003,"DAVID, RONALD",SUSPENDED,10014,,Monaco,MCO,Panama Papers,The Panama Papers data is current through 2015,


In [30]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,name,status,internal_id,address,countries,country_codes,sourceID,valid_until,note
count,26768,26768,26768.0,26768.0,26768.0,26768,26768,26768,26768,26768.0
freq,1,62,14147.0,12117.0,18125.0,4895,4895,14110,14110,26761.0
unique,26768,25590,10.0,14282.0,8640.0,286,286,9,11,7.0
top,11000001,HUTCHINSON GAYLE A.,,,,Hong Kong,HKG,Panama Papers,The Panama Papers data is current through 2015,


The `Intermediary` records defined by ICIJ has many overlapping records with `Officer`, although the instances in `Officer` provide more info. Remove the duplicates from `Intermediary` based on `node_id` values.

In [31]:
len(id_officer.intersection(set(df.node_id.values)))

1139

In [32]:
def check_row (
    node_id: int,
    ) -> bool:
    return node_id not in id_officer

df["keep"] = df.apply(lambda x: check_row(x["node_id"]), axis = 1)
df = df[df.keep]
del df["keep"]

assert len(id_officer.intersection(set(df.node_id.values))) == 0

In [33]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,name,status,internal_id,address,countries,country_codes,sourceID,valid_until,note
count,25629,25629,25629.0,25629.0,25629.0,25629,25629,25629,25629,25629.0
freq,1,62,13008.0,10978.0,16986.0,4719,4719,14110,14110,25622.0
unique,25629,24463,10.0,14282.0,8640.0,279,279,9,11,7.0
top,11000001,HUTCHINSON GAYLE A.,,,,Hong Kong,HKG,Panama Papers,The Panama Papers data is current through 2015,


Keep track of the set of `node_id` values for this class.

In [34]:
id_intermed: typing.Set[ str ] = set(df.node_id.values)

Reshape the dataframe to fit our inclusive `Entity` metadata.

In [35]:
df.columns.values.tolist()

['node_id',
 'name',
 'status',
 'internal_id',
 'address',
 'countries',
 'country_codes',
 'sourceID',
 'valid_until',
 'note']

In [36]:
df.insert(1, "role", "Intermediary")

df.insert(3, "original_name", "")
df.insert(4, "former_name", "")
df.insert(5, "jurisdiction", "")
df.insert(6, "jurisdiction_description", "")
df.insert(7, "company_type", "")

df.insert(11, "incorporation_date", "")
df.insert(12, "inactivation_date", "")
df.insert(13, "struck_off_date", "")
df.insert(14, "dorm_date", "")

df.insert(15, "service_provider", "")
df.insert(16, "ibcRUC", "")

df["vague"] = False

In [37]:
df.columns.values.tolist()

['node_id',
 'role',
 'name',
 'original_name',
 'former_name',
 'jurisdiction',
 'jurisdiction_description',
 'company_type',
 'status',
 'internal_id',
 'address',
 'incorporation_date',
 'inactivation_date',
 'struck_off_date',
 'dorm_date',
 'service_provider',
 'ibcRUC',
 'countries',
 'country_codes',
 'sourceID',
 'valid_until',
 'note',
 'vague']

In [38]:
columns: typing.List[ str ] = list(df.columns.values)

columns.remove("address")
columns.insert(9, "address")

columns.remove("status")
columns.insert(14, "status")

columns[17] = "country_codes"
columns[18] = "countries"

columns

['node_id',
 'role',
 'name',
 'original_name',
 'former_name',
 'jurisdiction',
 'jurisdiction_description',
 'company_type',
 'address',
 'internal_id',
 'incorporation_date',
 'inactivation_date',
 'struck_off_date',
 'dorm_date',
 'status',
 'service_provider',
 'ibcRUC',
 'country_codes',
 'countries',
 'sourceID',
 'valid_until',
 'note',
 'vague']

In [39]:
df = df.reindex(columns = columns)
df.head(3)

Unnamed: 0,node_id,role,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,...,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note,vague
0,11000001,Intermediary,"MICHAEL PAPAGEORGE, MR.",,,,,,MICHAEL PAPAGEORGE; MR. 106 NICHOLSON STREET B...,10001,...,,ACTIVE,,,ZAF,South Africa,Panama Papers,The Panama Papers data is current through 2015,,False
1,11000002,Intermediary,CORFIDUCIA ANSTALT,,,,,,,10004,...,,ACTIVE,,,LIE,Liechtenstein,Panama Papers,The Panama Papers data is current through 2015,,False
2,11000003,Intermediary,"DAVID, RONALD",,,,,,,10014,...,,SUSPENDED,,,MCO,Monaco,Panama Papers,The Panama Papers data is current through 2015,,False


Confirm that the reshaped dataframe has the proper column format.

In [40]:
assert df.columns.values.tolist() == EXPECTED_ENTITY_COLUMNS

Store this data as another CSV partition.

In [41]:
df.to_csv(
    TEMP_DIR / "entity.3.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

### Others

In [42]:
data_file: pathlib.Path = DATA_DIR / "nodes-others.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id,name,type,incorporation_date,struck_off_date,closed_date,jurisdiction,jurisdiction_description,countries,country_codes,sourceID,valid_until,note
0,85004929,ANTAM ENTERPRISES N.V.,LIMITED LIABILITY COMPANY,18-MAY-1983,,28-NOV-2012,AW,Aruba,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,Closed date stands for Cancelled date.
1,85008443,DEVIATION N.V.,LIMITED LIABILITY COMPANY,28-JUN-1989,31-DEC-2002,,AW,Aruba,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,
2,85008517,ARIAZI N.V.,LIMITED LIABILITY COMPANY,19-JUL-1989,,19-MAY-2004,AW,Aruba,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,Closed date stands for Cancelled date.


In [43]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,name,type,incorporation_date,struck_off_date,closed_date,jurisdiction,jurisdiction_description,countries,country_codes,sourceID,valid_until,note
count,2989,2989,2989.0,2989.0,2989.0,2989.0,2989.0,2989.0,2989.0,2989.0,2989,2989,2989.0
freq,1,3,2101.0,2101.0,2944.0,2872.0,2031.0,2031.0,2603.0,2603.0,2031,2031,2872.0
unique,2989,2976,4.0,835.0,40.0,113.0,7.0,7.0,64.0,64.0,7,10,3.0
top,85004929,Shampaign Investments Limited,,,,,,,,,Paradise Papers - Appleby,Appleby data is current through 2014,


Confirm that the `type` column values are comparable with values in `Entity.company_type`

In [44]:
set(df.type.values)

{'',
 'FOREIGN FORMED CORPORATION',
 'LIMITED LIABILITY COMPANY',
 'SOLE OWNERSHIP'}

Keep track of the set of `node_id` values for this class.

In [45]:
id_other: typing.Set[ str ] = set(df.node_id.values)

Reshape the dataframe to fit our inclusive `Entity` metadata.

In [46]:
df.columns.values.tolist()

['node_id',
 'name',
 'type',
 'incorporation_date',
 'struck_off_date',
 'closed_date',
 'jurisdiction',
 'jurisdiction_description',
 'countries',
 'country_codes',
 'sourceID',
 'valid_until',
 'note']

In [47]:
df.insert(1, "role", "Other")

df.insert(3, "original_name", "")
df.insert(4, "former_name", "")

df.rename(columns = { "type": "company_type" }, inplace = True)

df.insert(6, "address", "")
df.insert(7, "internal_id", "")

df.rename(columns = { "closed_date": "inactivation_date" }, inplace = True)
df.insert(13, "dorm_date", "")
df.insert(14, "status", "")
df.insert(15, "service_provider", "")
df.insert(16, "ibcRUC", "")

df["vague"] = False

In [48]:
columns: typing.List[ str ] = list(df.columns.values)

columns.remove("jurisdiction")
columns.remove("jurisdiction_description")
columns.insert(5, "jurisdiction")
columns.insert(6, "jurisdiction_description")

columns[11] = "inactivation_date"
columns[12] = "struck_off_date"

columns[17] = "country_codes"
columns[18] = "countries"

columns

['node_id',
 'role',
 'name',
 'original_name',
 'former_name',
 'jurisdiction',
 'jurisdiction_description',
 'company_type',
 'address',
 'internal_id',
 'incorporation_date',
 'inactivation_date',
 'struck_off_date',
 'dorm_date',
 'status',
 'service_provider',
 'ibcRUC',
 'country_codes',
 'countries',
 'sourceID',
 'valid_until',
 'note',
 'vague']

In [49]:
df = df.reindex(columns = columns)
df.head(3)

Unnamed: 0,node_id,role,name,original_name,former_name,jurisdiction,jurisdiction_description,company_type,address,internal_id,...,dorm_date,status,service_provider,ibcRUC,country_codes,countries,sourceID,valid_until,note,vague
0,85004929,Other,ANTAM ENTERPRISES N.V.,,,AW,Aruba,LIMITED LIABILITY COMPANY,,,...,,,,,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,Closed date stands for Cancelled date.,False
1,85008443,Other,DEVIATION N.V.,,,AW,Aruba,LIMITED LIABILITY COMPANY,,,...,,,,,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,,False
2,85008517,Other,ARIAZI N.V.,,,AW,Aruba,LIMITED LIABILITY COMPANY,,,...,,,,,,,Paradise Papers - Aruba corporate registry,Aruba corporate registry data is current throu...,Closed date stands for Cancelled date.,False


Confirm that the reshaped dataframe has the proper column format.

In [50]:
assert df.columns.values.tolist() == EXPECTED_ENTITY_COLUMNS

Store this data as another CSV partition.

In [51]:
df.to_csv(
    TEMP_DIR / "entity.4.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

### Registered Addresses

In [52]:
data_file: pathlib.Path = DATA_DIR / "nodes-addresses.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note
0,24000001,"ANNEX FREDERICK & SHIRLEY STS, P.O. BOX N-4805...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,
1,24000002,"SUITE E-2,UNION COURT BUILDING, P.O. BOX N-818...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,
2,24000003,"LYFORD CAY HOUSE, LYFORD CAY, P.O. BOX N-7785,...",,Bahamas,BHS,Bahamas Leaks,The Bahamas Leaks data is current through earl...,


In [53]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id,address,name,countries,country_codes,sourceID,valid_until,note
count,402246,402246.0,402246.0,402246.0,402246.0,402246,402246,402246.0
freq,1,19932.0,178898.0,125335.0,125335.0,123269,123269,402208.0
unique,402246,377861.0,222810.0,366.0,226.0,20,26,12.0
top,24000001,,,,,Paradise Papers - Malta corporate registry,Malta corporate registry data is current throu...,


Keep track of the set of `node_id` values for this class.

In [54]:
id_address: typing.Set[ str ] = set(df.node_id.values)

In [55]:
df.columns.values.tolist()

['node_id',
 'address',
 'name',
 'countries',
 'country_codes',
 'sourceID',
 'valid_until',
 'note']

Store this data as a CSV file.

In [56]:
df.to_csv(
    TEMP_DIR / "addr.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

## Connecting entities

This parts gets a bit iffy. In the ICIJ download, some entity resolution has already been performed. However, this appears to be simply fuzzy matching on strings for either names or addresses.

Also, the pre-defined relations are haphazard in terms of domain and range.
While Neo4j may be forgiving about inconsistent semantic modeling, this artifact may degrade inference downstream.

Consequently, the _semantic model_ used by ICIJ seems to be messy. While it may produce useful visualizations and graph query results, good luck trying to use this data for inference downstream!

We want to do better. See below for the transformations applied to relations in the ICIJ data.

### Relationships

In [57]:
data_file: pathlib.Path = DATA_DIR / "relationships.csv"

df: pd.DataFrame = pd.read_csv(
    data_file,
    header = 0,
    low_memory = False,
).astype(str).fillna("")

df.replace({"nan": ""}, regex = True, inplace = True)

df.head(3)

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID
0,10000035,14095990,registered_address,registered address,,,,Panama Papers
1,10000044,14091035,registered_address,registered address,,,,Panama Papers
2,10000055,14095990,registered_address,registered address,,,,Panama Papers


In [58]:
df.describe(include = "all").loc[[ "count", "freq", "unique", "top", ]]

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID
count,3339267,3339267,3339267,3339267,3339267.0,3339267.0,3339267.0,3339267
freq,36373,37338,1720357,589938,3164321.0,2392562.0,3070400.0,776335
unique,1131496,1321323,14,1041,3.0,24336.0,13561.0,12
top,54662,236724,officer_of,shareholder of,,,,Paradise Papers - Malta corporate registry


In [59]:
df.columns.values.tolist()

['node_id_start',
 'node_id_end',
 'rel_type',
 'link',
 'status',
 'start_date',
 'end_date',
 'sourceID']

Show which relations are named in ICIJ.

In [60]:
rels: typing.Dict[ str, int ] = df["rel_type"].value_counts().to_dict()
rels

{'officer_of': 1720357,
 'registered_address': 832721,
 'intermediary_of': 598546,
 'same_name_as': 104170,
 'similar': 46761,
 'same_company_as': 15523,
 'connected_to': 12145,
 'same_as': 4272,
 'same_id_as': 3120,
 'underlying': 1308,
 'similar_company_as': 203,
 'probably_same_officer_as': 132,
 'same_address_as': 5,
 'same_intermediary_as': 4}

For now we'll skip the questionable relations which were in the ICIJ download: 

  1. `same_name_as`: simple string matches associating `Officer` nodes, some rather questionable
  1. `similar`: simple string matches associating `Officer` nodes, some questionable
  1. `same_company_as`: simple string matches, where the domain and range sometimes link `Officer` nodes in error
  1. `same_as`: simple string matches associating `Entity` nodes
  1. `similar_company_as`: simple string matches associating `Entity` nodes
  1. `probably_same_officer_as`: simple string matches associating `Officer` nodes
  1. `same_intermediary_as`: simple string matches associating `Intermediary` nodes

Keep these:

  1. `officer_of`: source nodes with names which are legal placeholders, e.g., "THE BEARER" (filter)
  1. `registered_address`: note that some `Entity` nodes have more than one registered address
  1. `intermediary_of`: the domain and range are a jumble of `Entity`, `Officer`, `Intermediary`
  1. `connected_to`: connecting `Other` nodes with `Entity`, `Officer`, `Intermediary` -- akin to `intermediary_of` with weaker connotations
  1. `underlying`: holding corporations mapped to their underlying `Entity`, `Officer`, `Intermediary`, `Other`
  1. `same_id_as`: alias among `Officer` nodes, different `node_id` but same person
  1. `same_address_as`: connecting two `Address` nodes, seem legit

Double-check: which `node_id` values exist in the `Entity` node table, and can be used in constructing relations?

In [61]:
id_ents: typing.Set[ str ] = id_entity.union(id_officer).union(id_intermed).union(id_other)

len(id_ents)

1528152

#### OfficerOf

Link `Officer` nodes to `Entity` nodes using the `OfficerOf` relation

In [62]:
df[df.rel_type == "officer_of"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID
221749,12000001,10073324,officer_of,shareholder of,,19-NOV-1999,04-JUL-2000,Panama Papers
221751,12000002,10148386,officer_of,shareholder of,,30-MAR-2012,06-JUL-2012,Panama Papers
221753,12000003,10024966,officer_of,shareholder of,,14-JAN-2010,,Panama Papers
221755,12000004,10004763,officer_of,shareholder of,,23-JUL-2012,,Panama Papers
221757,12000005,10206741,officer_of,shareholder of,,13-SEP-2010,,Panama Papers


In [63]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "officer_of" and node_id_start in id_officer and node_id_end in id_entity


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
221749,12000001,10073324,shareholder of,,19-NOV-1999,04-JUL-2000,Panama Papers
221751,12000002,10148386,shareholder of,,30-MAR-2012,06-JUL-2012,Panama Papers
221753,12000003,10024966,shareholder of,,14-JAN-2010,,Panama Papers
221755,12000004,10004763,shareholder of,,23-JUL-2012,,Panama Papers
221757,12000005,10206741,shareholder of,,13-SEP-2010,,Panama Papers
...,...,...,...,...,...,...,...
3339244,240556390,240554206,Beneficiary of trust,,,,
3339245,240556391,240554207,Beneficial owner of the underlying company,,,,
3339246,240556392,240554207,Beneficial owner of the underlying company,,,,
3339247,240556393,240554208,Settlor,,,,


Store this data as a CSV partition.

In [64]:
df_todo.to_csv(
    TEMP_DIR / "rel_officer.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

#### RegisteredAddress

Link `Entity` nodes to `Address` nodes using the `RegisteredAddress` relation

In [65]:
df[df.rel_type == "registered_address"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
0,10000035,14095990,registered_address,registered address,,,,Panama Papers,False
1,10000044,14091035,registered_address,registered address,,,,Panama Papers,False
2,10000055,14095990,registered_address,registered address,,,,Panama Papers,False
4,10000064,14091429,registered_address,registered address,,,,Panama Papers,False
5,10000089,14098253,registered_address,registered address,,,,Panama Papers,False


In [66]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "registered_address" and node_id_start in id_ents and node_id_end in id_address


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
0,10000035,14095990,registered address,,,,Panama Papers
1,10000044,14091035,registered address,,,,Panama Papers
2,10000055,14095990,registered address,,,,Panama Papers
4,10000064,14091429,registered address,,,,Panama Papers
5,10000089,14098253,registered address,,,,Panama Papers
...,...,...,...,...,...,...,...
3339158,240461995,240450001,registered address,,,,
3339159,240461996,240450007,registered address,,,,
3339160,240461997,240450007,registered address,,,,
3339161,240461998,240450029,registered address,,,,


Store this data as a CSV partition.

In [67]:
df_todo.to_csv(
    TEMP_DIR / "rel_regaddr.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

#### IntermediaryOf

Link `Entity`, `Officer`, `Intermediary` nodes together using the `IntermediaryOf` relation

In [68]:
df[df.rel_type == "intermediary_of"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
8091,11000001,10208879,intermediary_of,intermediary of,,,,Panama Papers,False
8092,11000001,10198662,intermediary_of,intermediary of,,,,Panama Papers,False
8093,11000001,10159927,intermediary_of,intermediary of,,,,Panama Papers,False
8094,11000001,10165779,intermediary_of,intermediary of,,,,Panama Papers,False
8095,11000001,10152967,intermediary_of,intermediary of,,,,Panama Papers,False


In [69]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "intermediary_of" and node_id_start in id_ents and node_id_end in id_ents


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
8091,11000001,10208879,intermediary of,,,,Panama Papers
8092,11000001,10198662,intermediary of,,,,Panama Papers
8093,11000001,10159927,intermediary of,,,,Panama Papers
8094,11000001,10165779,intermediary of,,,,Panama Papers
8095,11000001,10152967,intermediary of,,,,Panama Papers
...,...,...,...,...,...,...,...
2412895,240090997,240083721,Intermediary of,,,,
2412896,240090998,240083722,Intermediary of,,,,
2412897,240090999,240083724,Intermediary of,,,,
2412898,240091000,240083741,Intermediary of,,,,


Store this data as a CSV partition.

In [70]:
df_todo.to_csv(
    TEMP_DIR / "rel_intermed.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

Link `Entity`, `Officer`, `Intermediary`, `Other` nodes together using the `ConnectedTo` relation

#### ConnectedTo

Connect `Other` nodes with `Entity`, `Officer`, `Intermediary` using the `ConnectedTo` relation

In [71]:
df[df.rel_type == "connected_to"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
1535555,85004929,85008101,connected_to,connected to,,,,Paradise Papers - Aruba corporate registry,False
1535556,85004929,85021444,connected_to,connected to,,,,Paradise Papers - Aruba corporate registry,False
1535571,85008443,85011025,connected_to,connected to,,,,Paradise Papers - Aruba corporate registry,False
1535649,85008517,85022984,connected_to,connected to,,,,Paradise Papers - Aruba corporate registry,False
1535674,85008542,85010050,connected_to,connected to,,,,Paradise Papers - Aruba corporate registry,False


In [72]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "connected_to" and node_id_start in id_ents and node_id_end in id_ents


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
1535555,85004929,85008101,connected to,,,,Paradise Papers - Aruba corporate registry
1535556,85004929,85021444,connected to,,,,Paradise Papers - Aruba corporate registry
1535571,85008443,85011025,connected to,,,,Paradise Papers - Aruba corporate registry
1535649,85008517,85022984,connected to,,,,Paradise Papers - Aruba corporate registry
1535674,85008542,85010050,connected to,,,,Paradise Papers - Aruba corporate registry
...,...,...,...,...,...,...,...
1980526,85049790,85049939,connected to,,,,Paradise Papers - Aruba corporate registry
1980538,85049801,85050315,connected to,,,,Paradise Papers - Aruba corporate registry
1980621,85049875,85050345,connected to,,,,Paradise Papers - Aruba corporate registry
1980739,85049985,85050168,connected to,,,,Paradise Papers - Aruba corporate registry


Store this data as a CSV partition.

In [73]:
df_todo.to_csv(
    TEMP_DIR / "rel_connect.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

#### Underlying

Map holding corporations to their underlying `Entity`, `Officer`, `Intermediary`, `Other` nodes using the `Underlying` relation

In [74]:
df[df.rel_type == "underlying"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
811684,51240,110010,underlying,Nominee Shareholder of,,,,Offshore Leaks,False
812311,51364,122604,underlying,Nominee Shareholder of,,,,Offshore Leaks,False
812592,51425,85812,underlying,Nominee Shareholder of,,,,Offshore Leaks,False
813084,51545,120215,underlying,Nominee Shareholder of,,,,Offshore Leaks,False
813796,55864,59201,underlying,Nominee Director of,,,,Offshore Leaks,False


In [75]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "underlying" and node_id_start in id_ents and node_id_end in id_ents


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
811684,51240,110010,Nominee Shareholder of,,,,Offshore Leaks
812311,51364,122604,Nominee Shareholder of,,,,Offshore Leaks
812592,51425,85812,Nominee Shareholder of,,,,Offshore Leaks
813084,51545,120215,Nominee Shareholder of,,,,Offshore Leaks
813796,55864,59201,Nominee Director of,,,,Offshore Leaks
...,...,...,...,...,...,...,...
3330666,240558066,240554203,underlying company of,,,,
3330669,240558067,240554203,underlying company of,,,,
3330672,240558068,240554203,underlying company of,,,,
3330675,240558069,240554207,underlying company of,,,,


Store this data as a CSV partition.

In [76]:
df_todo.to_csv(
    TEMP_DIR / "rel_underly.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

#### AliasOfficer

Alias `Officer` nodes (same person, different `node_id`) using the `AliasOfficer` relation

In [77]:
df[df.rel_type == "same_id_as"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
1538270,59178341,59190179,same_id_as,same id as,,,,Paradise Papers - Malta corporate registry,False
1555417,59181407,59108285,same_id_as,same id as,,,,Paradise Papers - Malta corporate registry,False
2455740,56031433,56031434,same_id_as,same id as,,,,Paradise Papers - Malta corporate registry,False
2455801,56031538,56031539,same_id_as,same id as,,,,Paradise Papers - Malta corporate registry,False
2455853,56031619,56031620,same_id_as,same id as,,,,Paradise Papers - Malta corporate registry,False


In [78]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "same_id_as" and node_id_start in id_ents and node_id_end in id_ents


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
1538270,59178341,59190179,same id as,,,,Paradise Papers - Malta corporate registry
1555417,59181407,59108285,same id as,,,,Paradise Papers - Malta corporate registry
2455740,56031433,56031434,same id as,,,,Paradise Papers - Malta corporate registry
2455801,56031538,56031539,same id as,,,,Paradise Papers - Malta corporate registry
2455853,56031619,56031620,same id as,,,,Paradise Papers - Malta corporate registry
...,...,...,...,...,...,...,...
3228483,56104646,56104647,same id as,,,,Paradise Papers - Malta corporate registry
3228685,56105060,56105061,same id as,,,,Paradise Papers - Malta corporate registry
3228738,56105126,56105127,same id as,,,,Paradise Papers - Malta corporate registry
3228807,56105302,56105303,same id as,,,,Paradise Papers - Malta corporate registry


In [79]:
df_todo.to_csv(
    TEMP_DIR / "rel_same_officer.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)

#### AliasAddress

Connect two `Address` nodes (same business?, different `node_id`) using the `AliasAddress` relation

In [80]:
df[df.rel_type == "same_address_as"].head()

Unnamed: 0,node_id_start,node_id_end,rel_type,link,status,start_date,end_date,sourceID,todo
227,24000030,14035591,same_address_as,same address as,,,,Bahamas Leaks,False
393,24000086,14077570,same_address_as,same address as,,,,Bahamas Leaks,False
403,24000090,14077931,same_address_as,same address as,,,,Bahamas Leaks,False
415,24000098,14037925,same_address_as,same address as,,,,Bahamas Leaks,False
2065,24000336,14049152,same_address_as,same address as,,,,Bahamas Leaks,False


In [81]:
def check_row (
    node_id_start: int,
    node_id_end: int,
    rel_type: str,
    ) -> bool:
    return rel_type == "same_address_as" and node_id_start in id_address and node_id_end in id_address


df["todo"] = df.apply(lambda x: check_row(x["node_id_start"], x["node_id_end"], x["rel_type"]), axis = 1)
df_todo: pd.DataFrame = df[df.todo]

del df_todo["rel_type"]
del df_todo["todo"]

df_todo

Unnamed: 0,node_id_start,node_id_end,link,status,start_date,end_date,sourceID
227,24000030,14035591,same address as,,,,Bahamas Leaks
393,24000086,14077570,same address as,,,,Bahamas Leaks
403,24000090,14077931,same address as,,,,Bahamas Leaks
415,24000098,14037925,same address as,,,,Bahamas Leaks
2065,24000336,14049152,same address as,,,,Bahamas Leaks


In [82]:
df_todo.to_csv(
    TEMP_DIR / "rel_same_address.csv",
    index = False,
    header = True,
    sep = ",",
    quoting = csv.QUOTE_ALL,
    encoding = "utf-8",
)