# ICIJ analysis: load KùzuDB

## Set up

Load the Python dependencies.

In [1]:
import pathlib
import typing

from icecream import ic
import kuzu
import pandas as pd
import watermark

%load_ext watermark

In [2]:
%watermark
%watermark --iversions

Last updated: 2024-07-08T09:09:19.730906-07:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 8.26.0

Compiler    : Clang 13.0.0 (clang-1300.0.29.30)
OS          : Darwin
Release     : 23.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

kuzu     : 0.4.2
watermark: 2.4.3
pandas   : 2.2.2



Create a KùzuDB database and establish a connection.

In [3]:
!rm -rf ./demo

In [4]:
db: kuzu.database.Database = kuzu.Database("./demo")
conn: kuzu.Connection = kuzu.Connection(db)

In [5]:
TEMP_DIR: pathlib.Path = pathlib.Path("temp")

## Schema definitions

### Entities

After a first pass through this analysis, we returned to this point and redefined the `Entity` node structure as a superset of the fields defined across all of the Entity-like node types in ICIJ.
See <https://docs.google.com/spreadsheets/d/1eSelhXhix_DtTZuzR2vfl_UdQlEwbwql6NZxqrtROxk/edit?usp=sharing>
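As a rough illustration of how such a superset could be derived, the sketch below takes the union of the header rows of several CSV files while preserving first-seen order. The helper name `header_superset` is hypothetical, and it assumes each source file starts with a header row; the fields actually chosen for `Entity` come from the spreadsheet linked above, not from this code.

```python
import csv
import pathlib


def header_superset(paths: list[pathlib.Path]) -> list[str]:
    """Return the union of CSV header fields, in first-seen order."""
    seen: dict[str, None] = {}  # a dict keeps insertion order

    for path in paths:
        with open(path, newline="", encoding="utf-8") as fp:
            # Assumption: the first row of each file is a header row.
            header = next(csv.reader(fp), [])

        for field in header:
            seen.setdefault(field, None)

    return list(seen)
```

Called with the node CSV files for each Entity-like type, this would return a candidate column list to compare against the table definition.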

In [6]:
conn.execute("""
  CREATE NODE TABLE Entity (
    node_id STRING,
    role STRING,
    name STRING,
    original_name STRING,
    former_name STRING,
    jurisdiction STRING,
    jurisdiction_description STRING,
    company_type STRING,
    address STRING,
    internal_id STRING,
    incorporation_date STRING,
    inactivation_date STRING,
    struck_off_date STRING,
    dorm_date STRING,
    status STRING,
    service_provider STRING,
    ibcRUC STRING,
    country_codes STRING,
    countries STRING,
    sourceID STRING,
    valid_until STRING,
    note STRING,
    vague BOOLEAN,
    PRIMARY KEY (node_id)
  )
""");

In [7]:
conn.execute("""
    COPY Entity FROM "./temp/entity.1.csv" (header=true, escape='"', parallel=False)
""");

In [8]:
conn.execute("""
    COPY Entity FROM "./temp/entity.2.csv" (header=true, escape='"', parallel=False)
""");

In [9]:
conn.execute("""
    COPY Entity FROM "./temp/entity.3.csv" (header=true, escape='"', parallel=False)
""");

In [10]:
conn.execute("""
    COPY Entity FROM "./temp/entity.4.csv" (header=true, escape='"', parallel=False)
""");

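The four `COPY` cells differ only in the file name, so they could be generated in a loop. The sketch below only builds the statement strings, using the same options as the cells above; `copy_statements` is a hypothetical helper, not part of the `kuzu` package.

```python
import pathlib


def copy_statements(table: str, csv_paths: list[pathlib.Path]) -> list[str]:
    """Build one COPY statement per CSV file, with the options used above."""
    return [
        f'COPY {table} FROM "{path.as_posix()}" '
        "(header=true, escape='\"', parallel=False)"
        for path in csv_paths
    ]


# Hypothetical usage with this notebook's `conn` and `TEMP_DIR`:
#
# for stmt in copy_statements("Entity", sorted(TEMP_DIR.glob("entity.*.csv"))):
#     conn.execute(stmt)
```

Building the statements separately from executing them makes it easy to print and inspect each one before it runs.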
In [11]:
results = conn.execute("""
  MATCH (n:Entity)
  RETURN *
  LIMIT 1;
""")

while results.has_next():
    ic(results.get_next())

ic| results.get_next(): [{'_id': {'offset': 2048, 'table': 0},
                          '_label': 'Entity',
                          'address': 'ORION HOUSE SERVICES (HK) LIMITED ROOM 1401; 14/F.; WORLD '
                                     'COMMERCE  CENTRE; HARBOUR CITY; 7-11 CANTON ROAD; TSIM SHA TSUI; '
                                     'KOWLOON; HONG KONG',
                          'company_type': None,
                          'countries': 'Hong Kong',
                          'country_codes': 'HKG',
                          'dorm_date': None,
                          'former_name': None,
                          'ibcRUC': '2322195',
                          'inactivation_date': None,
                          'incorporation_date': '05-SEP-2014',
                          'internal_id': '2003444.0',
                          'jurisdiction': 'ANG',
                          'jurisdiction_description': 'British Anguilla',
                          'name': 'NINGBO HAIYA

How many `Entity` nodes have been loaded?

In [12]:
results = conn.execute("""
  MATCH (n:Entity)
  RETURN COUNT(*)
""")

while results.has_next():
    row = results.get_next()
    ic(row[0])

ic| row[0]: 1613491


In [13]:
!wc -l temp/entity.*.csv

  814617 temp/entity.1.csv
  771369 temp/entity.2.csv
   25636 temp/entity.3.csv
    2990 temp/entity.4.csv
 1614612 total


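`(1614612 - 4) - 1613491 = 1117`: after subtracting the four header lines, the CSV files contain 1,117 more lines than there are `Entity` nodes loaded. Are these missing records?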
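One thing worth checking before concluding that records were dropped: `wc -l` counts physical lines, so a quoted field with an embedded newline is counted more than once. Whether that explains the gap here is not established. The sketch below compares the newline count with the number of records parsed by Python's `csv` module; `count_lines_and_records` is a hypothetical helper, and it assumes the files are UTF-8 and use doubled quotes for escaping, as the `COPY` options above suggest.

```python
import csv
import pathlib


def count_lines_and_records(path: pathlib.Path) -> tuple[int, int]:
    """Return (newline count as `wc -l` reports it, data records excluding the header)."""
    with open(path, "rb") as fp:
        # Read in 1 MiB chunks so large files are not loaded at once.
        n_lines = sum(
            chunk.count(b"\n") for chunk in iter(lambda: fp.read(1 << 20), b"")
        )

    with open(path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp)  # default dialect: '"' quoting, doubled-quote escape
        next(reader, None)  # skip the header row
        n_records = sum(1 for _ in reader)

    return n_lines, n_records
```

If the summed record counts across the four files match the node count, the difference comes from multi-line fields rather than from lost rows.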

### Registered Addresses

In [14]:
conn.execute("""
  CREATE NODE TABLE Address (
    node_id STRING,
    address STRING,
    name STRING,
    countries STRING,
    country_codes STRING,
    sourceID STRING,
    valid_until STRING,
    note STRING,
    PRIMARY KEY (node_id)
  )
""");

In [15]:
conn.execute("""
    COPY Address FROM "./temp/addr.csv" (header=true, escape='"', parallel=False)
""");

In [16]:
results = conn.execute("""
  MATCH (n:Address)
  RETURN *
  LIMIT 1;
""")

while results.has_next():
    ic(results.get_next())

ic| results.get_next(): [{'_id': {'offset': 4096, 'table': 1},
                          '_label': 'Address',
                          'address': 'NO.2 HUOYAOKU; GULOU DISTRICT; FUZHOU; FUJIAN; CHINA',
                          'countries': 'China',
                          'country_codes': 'CHN',
                          'name': None,
                          'node_id': '14055318',
                          'note': None,
                          'sourceID': 'Panama Papers',
                          'valid_until': 'The Panama Papers  data is current through 2015'}]


## Connecting relations

### OfficerOf

In [17]:
conn.execute("""
  CREATE REL TABLE OfficerOf (FROM Entity TO Entity,
    link STRING,
    status STRING,
    start_date STRING,
    end_date STRING,
    sourceID STRING,
    MANY_MANY
  )
""");

In [18]:
conn.execute("""
    COPY OfficerOf FROM "./temp/rel_officer.csv" (header=true, escape='"', parallel=False)
""");

RuntimeError: Runtime exception: Unable to find primary key value 12008661.
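The error indicates that a row in the relationship file refers to a primary key that is not present among the loaded nodes. A minimal sketch for listing every such dangling reference before running `COPY` follows. Both helper names are hypothetical. It assumes the node ID is the first column of each node CSV, matching the column order of the table definition, and that the first two columns of the relationship CSV hold the `FROM` and `TO` keys.

```python
import csv
import pathlib


def load_node_ids(paths: list[pathlib.Path]) -> set[str]:
    """Collect the first-column value (assumed to be `node_id`) of every data row."""
    ids: set[str] = set()
    for path in paths:
        with open(path, newline="", encoding="utf-8") as fp:
            reader = csv.reader(fp)
            next(reader, None)  # skip the header row
            ids.update(row[0] for row in reader if row)
    return ids


def dangling_endpoints(rel_path: pathlib.Path, node_ids: set[str]) -> set[str]:
    """Return endpoint IDs in the relationship CSV that match no known node."""
    missing: set[str] = set()
    with open(rel_path, newline="", encoding="utf-8") as fp:
        reader = csv.reader(fp)
        next(reader, None)  # skip the header row
        for row in reader:
            # Assumption: columns 0 and 1 are the FROM and TO keys.
            missing.update(key for key in row[:2] if key not in node_ids)
    return missing
```

For `OfficerOf`, the node IDs would come from the `entity.*.csv` files, since both endpoints are declared as `Entity`. Any IDs returned point to rows that need filtering out or to node files that still need loading.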

### RegisteredAddress

In [None]:
conn.execute("""
  CREATE REL TABLE RegisteredAddress (FROM Entity TO Address,
    link STRING,
    status STRING,
    start_date STRING,
    end_date STRING,
    sourceID STRING,
    MANY_MANY
  )
""");

In [None]:
conn.execute(
    "COPY RegisteredAddress FROM (LOAD FROM df_todo RETURN *)"
);