# Building High-Quality Knowledge Graphs

In [1]:
import pathlib
import shutil

import kuzu
import polars as pl
import watermark

import open_sanctions as os
import open_ownership as oo
import process_senzing as sz

# For visualizing the Kuzu graph in yFiles widget
from yfiles_jupyter_graphs_for_kuzu import KuzuGraphWidget

In [2]:
%load_ext watermark
%watermark
%watermark --iversions

Last updated: 2025-08-10T15:33:23.053315-07:00

Python implementation: CPython
Python version       : 3.13.3
IPython version      : 9.1.0

Compiler    : Clang 16.0.0 (clang-1600.0.26.6)
OS          : Darwin
Release     : 24.5.0
Machine     : arm64
Processor   : arm
CPU cores   : 14
Architecture: 64bit

watermark                     : 2.5.0
yfiles_jupyter_graphs_for_kuzu: 0.0.4
polars                        : 1.29.0
kuzu                          : 0.9.0



In [3]:
DB_PATH = "./db"
shutil.rmtree(DB_PATH, ignore_errors=True)

db = kuzu.Database(DB_PATH)
conn = kuzu.Connection(db)

Be sure to download the datasets as described in the `README.md`. Alternatively, the next two cells fetch them automatically if they are missing from `data/`:

In [4]:
!if [ ! -f "data/open-sanctions.json" ]; then wget https://raw.githubusercontent.com/DerwenAI/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-sanctions.json -O data/open-sanctions.json ; fi

In [5]:
!if [ ! -f "data/open-ownership.json" ]; then wget https://raw.githubusercontent.com/DerwenAI/senzing_starter_kit/refs/heads/main/senzing_rootfs/data/open-ownership.json -O data/open-ownership.json ; fi
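
If `wget` isn't available on your machine, the same download-if-missing check can be sketched in plain Python. The helper name `download_if_missing` is our own and not part of the workshop code; it uses only the standard library.

```python
import pathlib
import urllib.request


def download_if_missing(url: str, dest: pathlib.Path) -> bool:
    """Fetch `url` into `dest` unless the file already exists.

    Returns True when a download happened, False when it was skipped.
    """
    if dest.exists():
        return False
    # Create the parent directory (e.g. `data/`) if it is not there yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return True
```

Calling it once per dataset, with the URLs from the two cells above and a destination such as `pathlib.Path("data/open-sanctions.json")`, should leave the `data/` folder in the same state as the shell version.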

## OpenSanctions

In [6]:
data_path: pathlib.Path = pathlib.Path("data")

We'll load a slice of the [OpenSanctions](https://www.opensanctions.org/) dataset, which provides the "risk" category of data.
It describes people and organizations that represent known financial crime (FinCrime) risks.

In [7]:
df1 = pl.read_ndjson(data_path / "open-sanctions.json")
df1.head(3)

DATA_SOURCE,RECORD_ID,RECORD_TYPE,LAST_CHANGE,NAMES,GENDER,RISKS,ADDRESSES,DATES,COUNTRIES,IDENTIFIERS,SOURCE_LINKS,RELATIONSHIPS,URL,CONTACTS
str,str,str,str,list[struct[3]],str,list[struct[1]],list[struct[7]],list[struct[2]],list[struct[3]],list[struct[9]],list[struct[1]],list[struct[5]],str,list[struct[1]]
"""OPEN-SANCTIONS""","""NK-25vyVFzt8vdJGgAXMRTwTJ""","""PERSON""","""2024-07-30T16:41:14""","[{""PRIMARY"",null,""Abassin BADSHAH""}]",,"[{""corp.disqual""}]","[{""31 Quernmore Close, Bromley, Kent, United Kingdom, BR1 4EL"",null,null,null,null,null,null}]","[{null,""1985-05-12""}]","[{null,""gb"",null}]","[{""OPEN-SANCTIONS"",""NK-25vyVFzt8vdJGgAXMRTwTJ"",null,null,null,null,null,null,null}]","[{""https://find-and-update.company-information.service.gov.uk/disqualified-officers/natural/mGquuTbmESWiRmHJPz1ObUwfDgk""}]","[{null,null,""Directorship"",""OPEN-SANCTIONS"",""NK-SKAADAiqiZ78JsJjeg72Te""}, {null,null,""Directorship"",""OPEN-SANCTIONS"",""NK-3p3mmVWmjwVtTfKchz4kNE""}]","""https://www.opensanctions.org/…",
"""OPEN-SANCTIONS""","""NK-3p3mmVWmjwVtTfKchz4kNE""","""ORGANIZATION""","""2025-01-07T00:33:03""","[{""PRIMARY"",""LMAR (GB) LTD"",null}]",,,"[{""31 Quernmore Close, Bromley, Kent, United Kingdom, BR1 4EL"",null,null,null,null,null,""BUSINESS""}]",,"[{""gb"",null,null}]","[{""OPEN-SANCTIONS"",""NK-3p3mmVWmjwVtTfKchz4kNE"",null,null,null,null,null,null,null}]",,"[{""OPEN-SANCTIONS"",""NK-3p3mmVWmjwVtTfKchz4kNE"",null,null,null}]","""https://www.opensanctions.org/…",
"""OPEN-SANCTIONS""","""NK-auyPsLrBzRoxjCRWgjBvas""","""ORGANIZATION""","""2024-03-03T19:51:29""","[{""PRIMARY"",""WANDLE HOLDINGS LIMITED"",null}]",,"[{""sanction.linked""}]","[{""DEANA BEACH APTS, BLOCK A, Flat 212, Προμαχών Ελευθερίας, 33, 'Αγιος Αθανάσιος, 4103, Λεμεσός, Κύπρος"",null,null,null,null,null,""BUSINESS""}]","[{""2006-12-08"",null}]","[{""cy"",null,null}]","[{null,null,""C188266"",null,null,null,null,null,null}, {null,null,""HE188266"",null,null,null,null,null,null}, {""OPEN-SANCTIONS"",""NK-auyPsLrBzRoxjCRWgjBvas"",null,null,null,null,null,null,null}]",,"[{""OPEN-SANCTIONS"",""NK-auyPsLrBzRoxjCRWgjBvas"",null,null,null}]","""https://opensanctions.org/enti…",


Entities in OpenSanctions can carry one or more risk classifications. Associating an ID with a particular risk lets us narrow the candidates down to those relevant to a particular investigation.
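
The `extract_risks` helper lives in the workshop's `open_sanctions` module and isn't shown in this notebook. As a rough sketch of the idea, and not the actual implementation, flattening the nested `RISKS` column into `(id, topic)` pairs could look like this in plain Python. The struct field name `TOPIC` is an assumption, since the preview above only shows the values.

```python
from typing import Any


def extract_risk_pairs(
    records: list[dict[str, Any]], key: str = "TOPIC"
) -> list[tuple[str, str]]:
    """Flatten each record's nested RISKS list into (record id, topic) pairs."""
    pairs: list[tuple[str, str]] = []
    for rec in records:
        # RISKS is null for records that have no risk classification.
        for risk in rec.get("RISKS") or []:
            topic = risk.get(key)
            if topic is not None:
                pairs.append((rec["RECORD_ID"], topic))
    return pairs


# Three records modeled on the preview above; the `TOPIC` key is hypothetical.
records = [
    {"RECORD_ID": "NK-25vyVFzt8vdJGgAXMRTwTJ", "RISKS": [{"TOPIC": "corp.disqual"}]},
    {"RECORD_ID": "NK-3p3mmVWmjwVtTfKchz4kNE", "RISKS": None},
    {"RECORD_ID": "NK-auyPsLrBzRoxjCRWgjBvas", "RISKS": [{"TOPIC": "sanction.linked"}]},
]
risk_pairs = extract_risk_pairs(records)
```

Records without any risk simply contribute no pairs, which is why the second record is absent from the result.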

In [8]:
# Get risks from OpenSanctions
df_risk = os.extract_risks(df1)
df_risk.head(3)

id,topic
str,str
"""NK-25vyVFzt8vdJGgAXMRTwTJ""","""corp.disqual"""
"""NK-auyPsLrBzRoxjCRWgjBvas""","""sanction.linked"""
"""NK-cf4Q3KcmUnQbt8Cy7iTtwK""","""sanction.linked"""


We're now ready to extract the OpenSanctions data. The `extract_open_sanctions` function takes the raw data, processes its nested fields, and returns the columns we need for our graph.

In [9]:
df_os = os.extract_open_sanctions(df1)
df_os.head(3)

id,kind,name,addr,url
str,str,str,str,str
"""NK-25vyVFzt8vdJGgAXMRTwTJ""","""PERSON""","""Abassin BADSHAH""","""31 Quernmore Close, Bromley, K…","""https://www.opensanctions.org/…"
"""NK-3p3mmVWmjwVtTfKchz4kNE""","""ORGANIZATION""","""LMAR (GB) LTD""","""31 Quernmore Close, Bromley, K…","""https://www.opensanctions.org/…"
"""NK-L2UmsZtsyvYiaEmHSaiZ2t""","""PERSON""","""Gulnara Suleimanova KERIMOVA""","""MOSCOW, RUS, 123430""","""https://www.opensanctions.org/…"


Of particular interest for this workshop is the person ["Abassin Badshah"](https://find-and-update.company-information.service.gov.uk/disqualified-officers/natural/mGquuTbmESWiRmHJPz1ObUwfDgk), former owner of multiple Papa John's franchises in London, who is disqualified from being a corporate director until 2026, due to his [tax evasion conviction](https://londonnewsonline.co.uk/news/catford-papa-johns-pizza-boss-jailed-after-669000-tax-evasion/) in 2021.

## Open Ownership

[Open Ownership](https://www.openownership.org/) publishes _ultimate beneficial ownership_ (UBO) details, which provide the "link" category of data. In other words: who owns how much of what, and who actually has a controlling interest?

In [10]:
df2 = pl.read_ndjson(data_path / "open-ownership.json")
df2.head(3)

DATA_SOURCE,RECORD_ID,statementDate,RECORD_TYPE,NAMES,PRIMARY_NAME_FULL,personType,ATTRIBUTES,ADDRESSES,IDENTIFIERS,LINKS,RELATIONSHIPS,replaces_statements,REGISTRATION_DATE,dissolutionDate,REGISTRATION_COUNTRY,DATE_OF_BIRTH
str,str,str,str,list[struct[2]],str,str,list[struct[1]],list[struct[3]],list[struct[3]],list[struct[3]],list[struct[7]],list[struct[1]],str,str,str,str
"""OPEN-OWNERSHIP""","""10094521532396971848""","""2023-06-18""","""ORGANIZATION""","[{null,""GOLD WYNN UK HOLDINGS LIMITED""}]",,,,"[{""BUSINESS"",""C/O Fladgate Llp, 16 Great Queen Street, London, WC2B 5DG"",""GB""}]","[{""12524623"",""GB-COH"",""GBR""}]","[{null,""https://opencorporates.com/companies/gb/12524623"",null}, {null,null,""https://register.openownership.org/entities/18432059995972240708""}]","[{""OOR"",""10094521532396971848"",null,null,null,null,null}, {null,null,""OOR"",""7584591804488095167"",""shareholding 75% 100%"",""2020-03-18"",""2020-04-29""}, … {null,null,""OOR"",""7584591804488095167"",""appointment_of_board"",""2020-03-18"",""2020-04-29""}]",,"""2020-03-18""",,"""GB""",
"""OPEN-OWNERSHIP""","""10165632722354515453""","""2023-06-18""","""ORGANIZATION""","[{null,""UPSIDE TECHNOLOGY LIMITED""}]",,,,"[{""BUSINESS"",""Apt 52, 3 Whitehall Court, London, SW1A 2EL"",""GB""}]","[{""12165794"",""GB-COH"",""GBR""}]","[{null,""https://opencorporates.com/companies/gb/12165794"",null}, {null,null,""https://register.openownership.org/entities/15659422647652524790""}]","[{""OOR"",""10165632722354515453"",null,null,null,null,null}, {null,null,""OOR"",""598161773989218568"",""shareholding 75% 100%"",""2019-08-20"",null}, … {null,null,""OOR"",""598161773989218568"",""appointment_of_board"",""2019-08-20"",null}]",,"""2019-08-20""","""2022-10-11""","""GB""",
"""OPEN-OWNERSHIP""","""10165632722354515453""","""2023-06-18""","""ORGANIZATION""","[{null,""UPSIDE TECHNOLOGY LIMITED""}]",,,,"[{""BUSINESS"",""Apt 52, 3 Whitehall Court, London, SW1A 2EL"",""GB""}]","[{""12165794"",""GB-COH"",""GBR""}]","[{null,""https://opencorporates.com/companies/gb/12165794"",null}, {null,null,""https://register.openownership.org/entities/15659422647652524790""}]","[{""OOR"",""10165632722354515453"",null,null,null,null,null}, {null,null,""OOR"",""598161773989218568"",""shareholding 75% 100%"",""2019-08-20"",null}, … {null,null,""OOR"",""598161773989218568"",""appointment_of_board"",""2019-08-20"",null}]",,"""2019-08-20""","""2022-10-11""","""GB""",


Just like with the OpenSanctions data, we can use the `extract_open_ownership` function to process the nested JSON data and return the relevant columns that we need for our graph.

In [11]:
df_oo = oo.extract_open_ownership(df2)
df_oo.head(3)

id,kind,name,address,country
str,str,str,str,str
"""10094521532396971848""","""ORGANIZATION""","""GOLD WYNN UK HOLDINGS LIMITED""","""C/O Fladgate Llp, 16 Great Que…","""GB"""
"""10165632722354515453""","""ORGANIZATION""","""UPSIDE TECHNOLOGY LIMITED""","""Apt 52, 3 Whitehall Court, Lon…","""GB"""
"""10264459789712927869""","""PERSON""","""Kenneth Kurt Hansen""","""Finderupvej 61, Kastrup, 2770""","""DK"""


For the relationships in our graph, we'll need to select only the relationships that have **both** `src_id` and `dst_id` in the list of ids. This is done via the `extract_open_ownership_relationships` function.
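
The filtering rule itself is simple. Here's a minimal sketch over plain dictionaries, assuming rows with the `src_id` and `dst_id` fields described above. `keep_closed_relationships` is a hypothetical name, not the workshop function, and the second sample row is made up to show a dropped relationship.

```python
def keep_closed_relationships(
    relationships: list[dict[str, str]], known_ids: list[str]
) -> list[dict[str, str]]:
    """Keep only relationships whose two endpoints are both known node ids."""
    known = set(known_ids)  # set lookup keeps the filter linear in the row count
    return [
        rel
        for rel in relationships
        if rel["src_id"] in known and rel["dst_id"] in known
    ]


known_ids = ["10094521532396971848", "7584591804488095167"]
relationships = [
    {"src_id": "10094521532396971848", "dst_id": "7584591804488095167",
     "role": "appointment_of_board"},
    # Made-up row: its destination is not in `known_ids`, so it is dropped.
    {"src_id": "10094521532396971848", "dst_id": "not-a-known-id",
     "role": "appointment_of_board"},
]
closed = keep_closed_relationships(relationships, known_ids)
```

Dropping half-open relationships up front means every edge we later load has a node at both ends.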

In [12]:
ids = df_oo.select("id").to_series().to_list()
df_oa_relationships = oo.extract_open_ownership_relationships(df2, open_ownership_ids=ids)
df_oa_relationships.head(3)

src_id,dst_id,role,date
str,str,str,str
"""10094521532396971848""","""7584591804488095167""","""shareholding 75% 100%""","""2020-03-18"""
"""10094521532396971848""","""7584591804488095167""","""appointment_of_board""","""2020-03-18"""
"""10094521532396971848""","""7584591804488095167""","""voting_rights 75% 100%""","""2020-03-18"""


## ✋🏽 🛑 Pause here and run the Senzing workflow

To generate high-quality entity data, we'll use Senzing to process the OpenSanctions and Open Ownership data via the Senzing Python SDK. Senzing returns the resolved entities in `export.json`, which is again nested JSON that we can use to create our graph.

Because this data is deeply nested, we process it in a few steps using the `process_senzing_export()` function. Each resolved entity gets a unique identifier with the prefix `sz_` and is linked to the record IDs from the original data, along with each record's data source.

If you're short on time and need a shortcut, you can download this `export.json` file from <https://storage.googleapis.com/erkg/starterkit/export.json> instead.

In [13]:
sz_export = sz.process_senzing_export(data_path / "export.json")
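
The internals of `process_senzing_export()` aren't shown here. As a hedged sketch of the kind of flattening involved, the function below splits JSON-lines input into entity, record, and relation rows. The field names (`RESOLVED_ENTITY`, `RECORDS`, `RELATED_ENTITIES`, `MATCH_KEY`, `MATCH_LEVEL`) reflect our understanding of Senzing's export layout and should be treated as assumptions; `flatten_export` and the sample line are purely illustrative.

```python
import json


def flatten_export(lines):
    """Split Senzing-style export lines into entity, record, and relation rows."""
    entities, records, relations = [], [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        resolved = doc["RESOLVED_ENTITY"]
        ent_id = f"sz_{resolved['ENTITY_ID']}"  # prefix used throughout this notebook
        entities.append({"id": ent_id, "descrip": resolved.get("ENTITY_NAME")})
        for rec in resolved.get("RECORDS", []):
            records.append({
                "ent_id": ent_id,
                "rec_id": rec["RECORD_ID"],
                "source": rec["DATA_SOURCE"],
                "why": rec.get("MATCH_KEY", ""),
                "level": rec.get("MATCH_LEVEL", 0),
            })
        for rel in doc.get("RELATED_ENTITIES", []):
            relations.append({
                "ent_id": ent_id,
                "rel_id": f"sz_{rel['ENTITY_ID']}",
                "why": rel.get("MATCH_KEY", ""),
                "level": rel.get("MATCH_LEVEL", 0),
            })
    return entities, records, relations


# Hand-written sample line; the related entity's match key is a placeholder.
sample = json.dumps({
    "RESOLVED_ENTITY": {
        "ENTITY_ID": 1,
        "ENTITY_NAME": "Abassin Badshah",
        "RECORDS": [
            {"DATA_SOURCE": "OPEN-SANCTIONS", "RECORD_ID": "NK-25vyVFzt8vdJGgAXMRTwTJ",
             "MATCH_KEY": "", "MATCH_LEVEL": 0},
            {"DATA_SOURCE": "OPEN-OWNERSHIP", "RECORD_ID": "17207853441353212969",
             "MATCH_KEY": "+NAME+ADDRESS+NATIONALITY", "MATCH_LEVEL": 1},
        ],
    },
    "RELATED_ENTITIES": [{"ENTITY_ID": 2, "MATCH_KEY": "+ADDRESS", "MATCH_LEVEL": 11}],
})
entities, records, relations = flatten_export([sample])
```

The three lists correspond in spirit to the entity, record, and relationship dataframes that the rest of this section works with.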

This first dataframe `df_ent` lists the entities identified by Senzing _entity resolution_.

In [14]:
df_ent = sz_export.df_ent.sort("id")
df_ent.head(3)

id,descrip
str,str
"""sz_1""","""Abassin Badshah"""
"""sz_10""","""Nicholas Thomas Wright"""
"""sz_100001""","""Gold Wynn Uk Holdings Limited"""


The `df_rel` dataframe lists probabilistic relationships between entities, also identified by Senzing _entity resolution_. In other words, there isn't sufficient evidence _yet_ to merge these entities; however, there's enough evidence to suggest following these as closely related leads during an investigation.

In [15]:
df_rel = sz_export.df_rel
df_rel.head(3)

ent_id,rel_id,why,level
str,str,str,i64
"""sz_1""","""sz_2""","""+ADDRESS+OPEN-SANCTIONS(DIRECT…",11
"""sz_1""","""sz_9""","""+ADDRESS+OPEN-SANCTIONS(DIRECT…",11
"""sz_1""","""sz_100075""","""+ADDRESS+OOR(:APPOINTMENT_OF_B…",11


### Separate the Senzing entities by source

The final step to preprocess the data for our graph is to separate the entities by their source (whether they come from OpenSanctions or Open Ownership).

In [16]:
df_sz_oo = sz_export.df_rec.filter(pl.col("source") == "OPEN-OWNERSHIP").select("ent_id", "rec_id", "why", "level")
df_sz_oo.head(3)

ent_id,rec_id,why,level
str,str,str,i64
"""sz_1""","""17207853441353212969""","""+NAME+ADDRESS+NATIONALITY""",1
"""sz_1""","""6747548100436839873""","""+NAME+DOB+NATIONALITY""",1
"""sz_10""","""5927522753545014068""","""+NAME+DOB+NATIONALITY""",1


In [17]:
df_sz_os = sz_export.df_rec.filter(pl.col("source") == "OPEN-SANCTIONS").select("ent_id", "rec_id", "why", "level")
df_sz_os.head(3)

ent_id,rec_id,why,level
str,str,str,i64
"""sz_1""","""NK-25vyVFzt8vdJGgAXMRTwTJ""","""""",0
"""sz_2""","""NK-3p3mmVWmjwVtTfKchz4kNE""","""""",0
"""sz_3""","""NK-auyPsLrBzRoxjCRWgjBvas""","""""",0


## Copy data to Kuzu graph

Kuzu is an embedded, open source graph database that supports the Cypher query language. It uses a structured property graph model, which is similar to the labelled property graph model you may be familiar with from other systems -- the only difference being that Kuzu requires strict data types for properties in the schema.

The following steps will create the graph schema in Kuzu (node and relationship tables) and copy the data into them.

In [18]:
conn.execute("CREATE NODE TABLE IF NOT EXISTS OpenSanctions (id STRING PRIMARY KEY, kind STRING, name STRING, addr STRING, url STRING)")
conn.execute("CREATE NODE TABLE IF NOT EXISTS OpenOwnership (id STRING PRIMARY KEY, kind STRING, name STRING, addr STRING, country STRING)")
conn.execute("CREATE NODE TABLE IF NOT EXISTS Risk (topic STRING PRIMARY KEY)")
conn.execute("CREATE NODE TABLE IF NOT EXISTS Entity (id STRING PRIMARY KEY, descrip STRING)")
conn.execute("CREATE REL TABLE IF NOT EXISTS Role (FROM OpenOwnership TO OpenOwnership, role STRING, date DATE)")

<kuzu.query_result.QueryResult at 0x116e8a2c0>

In [19]:
conn.execute("COPY OpenSanctions FROM df_os")
conn.execute("COPY OpenOwnership FROM df_oo")
conn.execute("COPY Risk FROM (LOAD FROM df_risk RETURN DISTINCT topic)")
conn.execute("COPY Entity FROM (LOAD FROM df_ent RETURN id, descrip)")
conn.execute("COPY Role FROM df_oa_relationships")

<kuzu.query_result.QueryResult at 0x116ed5150>

In [20]:
# Create a yFiles graph widget so we can explore our graph as it's created
g = KuzuGraphWidget(conn)

# Add custom node configurations for clarity in visualization
g.add_node_configuration("Entity", color="red", text=lambda node: {"text": node["properties"]["descrip"]})  # type: ignore
g.add_node_configuration("OpenSanctions", color="yellow", text=lambda node: {"text": node["properties"]["name"]})  # type: ignore
g.add_node_configuration("OpenOwnership", color="green", text=lambda node: {"text": node["properties"]["name"]})  # type: ignore
g.add_node_configuration("Risk", color="blue", text=lambda node: {"text": node["properties"]["topic"]})  # type: ignore

In [21]:
g.show_cypher("MATCH (a:OpenSanctions:OpenOwnership)-[b]->(c:OpenSanctions:OpenOwnership) RETURN * LIMIT 50")

GraphWidget(layout=Layout(height='800px', width='100%'))

In [22]:
# Create Related table between entities
conn.execute("CREATE REL TABLE IF NOT EXISTS Related (FROM Entity TO Entity, why STRING, level INT8)");
conn.execute("COPY Related FROM df_rel");

In [23]:
g.show_cypher(
    """
    MATCH (a:Entity)-[b *1..3]->(c)
    WHERE a.descrip CONTAINS "Abassin"
    RETURN * LIMIT 50
    """
)

GraphWidget(layout=Layout(height='500px', width='100%'))
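
The `*1..3` in the query above is a variable-length pattern: it follows between one and three relationships out from the starting entity. As a rough mental model, and not a description of how Kuzu evaluates it internally, a breadth-first search with a hop limit visits the same set of nodes. The adjacency list below is partly made up for illustration.

```python
from collections import deque


def within_hops(adj: dict[str, list[str]], start: str, max_hops: int) -> dict[str, int]:
    """Return nodes reachable from `start` in 1..max_hops directed steps.

    Values are shortest hop counts. The start node itself is left out, even if
    a cycle leads back to it (a Cypher pattern could still match that case).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # hop budget used up, do not expand further
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]
    return dist


# sz_1 -> sz_2 and sz_1 -> sz_9 appear in `df_rel`; the other edges are hypothetical.
adj = {
    "sz_1": ["sz_2", "sz_9"],
    "sz_2": ["sz_3"],
    "sz_3": ["sz_4"],
    "sz_4": ["sz_5"],
}
nearby = within_hops(adj, "sz_1", max_hops=3)
```

Here `sz_5` sits four hops away, so it falls outside the limit, which mirrors what raising or lowering the upper bound in the Cypher pattern does.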

We'll need to create `Matched` relationships that link each resolved entity to its source records. Each relationship runs from a Senzing entity (`ent_id`) to the record it was resolved from (`rec_id`) in either OpenSanctions or Open Ownership.
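
Relationship rows can only point at nodes that exist. As far as we know, Kuzu rejects a `COPY` into a relationship table when an endpoint is missing, so it can be worth checking for dangling record ids first. This is a minimal sketch over plain Python values; `dangling_record_ids` is a hypothetical helper and the second sample row is made up.

```python
def dangling_record_ids(matches: list[dict], node_ids: list[str]) -> list[str]:
    """Return the sorted `rec_id` values that have no corresponding node id."""
    known = set(node_ids)
    return sorted({m["rec_id"] for m in matches if m["rec_id"] not in known})


node_ids = ["NK-25vyVFzt8vdJGgAXMRTwTJ", "NK-3p3mmVWmjwVtTfKchz4kNE"]
matches = [
    {"ent_id": "sz_1", "rec_id": "NK-25vyVFzt8vdJGgAXMRTwTJ"},
    # Made-up row whose record id has no node, to show what gets flagged.
    {"ent_id": "sz_2", "rec_id": "NK-missing"},
]
missing = dangling_record_ids(matches, node_ids)
```

With the notebook's dataframes, the inputs would come from something like `df_sz_os.to_dicts()` and the `id` column of `df_os`; an empty result suggests the load should go through cleanly.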

In [24]:
# Create Matched table between multiple sets of entities
conn.execute(
    """
    CREATE REL TABLE IF NOT EXISTS Matched (
        FROM Entity TO OpenSanctions,
        FROM Entity TO OpenOwnership,
        why STRING,
        level INT8
    )
"""
)
conn.execute("COPY Matched FROM df_sz_os (from='Entity', to='OpenSanctions')");
conn.execute("COPY Matched FROM df_sz_oo (from='Entity', to='OpenOwnership')");

In [25]:
# Add Risks to OpenSanctions
conn.execute("CREATE REL TABLE IF NOT EXISTS HasRisk (FROM OpenSanctions TO Risk)")
conn.execute("COPY HasRisk FROM df_risk")

<kuzu.query_result.QueryResult at 0x132a15bb0>

In [26]:
g.show_cypher(
    """
    MATCH (a:Entity)-[b *1..3]->(c)
    WHERE a.descrip CONTAINS "Abassin"
    RETURN * LIMIT 100;
    """,
    layout="radial"
)

GraphWidget(layout=Layout(height='650px', width='100%'))

We've successfully combined the data from OpenSanctions, Open Ownership, and resolved entities from Senzing to create a graph that's persisted in Kuzu!
This graph is of high enough quality that it can be used for a variety of investigative tasks downstream. Happy querying!

## Network analysis

Let's demonstrate a common design pattern for investigative graphs: using centrality measures to find the most connected elements of a subgraph, which are likely candidates for a controlling party or ultimate beneficial owner (UBO). Betweenness centrality is a good fit, since it highlights the bridge nodes in the graph and points us to where a deeper investigation is worthwhile.
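
To make "bridge node" concrete, here's a compact pure-Python sketch of betweenness centrality on a directed, unweighted graph, following Brandes' algorithm. It's for intuition only; the notebook itself relies on NetworkX. The scaling by `1 / ((n - 1) * (n - 2))` is what we understand NetworkX to apply by default for directed graphs, and the toy graph is made up.

```python
from collections import deque


def betweenness(adj: dict[str, list[str]], normalized: bool = True) -> dict[str, float]:
    """Betweenness centrality for a directed, unweighted graph (Brandes)."""
    # Collapse parallel edges so duplicate edges don't inflate path counts.
    graph = {u: set(vs) for u, vs in adj.items()}
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    score = {v: 0.0 for v in nodes}

    for s in nodes:
        sigma = {v: 0 for v in nodes}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}  # predecessors on shortest paths
        order = []                      # nodes in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, ()):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)

        # Accumulate dependencies from the farthest nodes back toward s.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]

    n = len(nodes)
    if normalized and n > 2:
        scale = 1.0 / ((n - 1) * (n - 2))
        score = {v: c * scale for v, c in score.items()}
    return score


# Toy graph: every path from {a, b} to {c, d} has to pass through m.
toy = {"a": ["m"], "b": ["m"], "m": ["c", "d"]}
toy_scores = betweenness(toy)
```

In the toy graph only `m` gets a non-zero score, because it is the single bridge between the two sides. That's the property we're looking for when hunting for a controlling party.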

Kuzu makes it simple to use graph algorithms via the NetworkX Python package.

First, let's convert the subgraph of entities, open ownership, and open sanctions nodes and relationships to a NetworkX graph.


In [27]:
subg = conn.execute(
    """
    MATCH (a:Entity:OpenOwnership:OpenSanctions)-[b]->(c:Entity:OpenOwnership:OpenSanctions) RETURN *
    """
)
subg_networkx = subg.get_as_networkx(directed=True)  # type: ignore

### Add a new column

Kuzu is a strictly typed system, so if we want to update our nodes with a new property `betweenness_centrality`, we need to first add a new column to the relevant tables.

In [28]:
conn.execute("ALTER TABLE Entity ADD betweenness_centrality FLOAT")
conn.execute("ALTER TABLE OpenOwnership ADD betweenness_centrality FLOAT")
conn.execute("ALTER TABLE OpenSanctions ADD betweenness_centrality FLOAT")

<kuzu.query_result.QueryResult at 0x177d9a090>

### Run the betweenness centrality algorithm

We can run the betweenness centrality algorithm in NetworkX.

In [29]:
import networkx as nx

bc = nx.betweenness_centrality(subg_networkx)

#  Transform the betweenness centrality results into a Polars dataframe
df = pl.DataFrame({"id": k, "betweenness_centrality": v} for k, v in bc.items())
df = (
    df.with_columns(
        pl.col("id").str.replace("Entity_", "")
        .str.replace("OpenOwnership_", "")
        .str.replace("OpenSanctions_", "")
    )
)

df.sort("betweenness_centrality", descending=True).head(5)

id,betweenness_centrality
str,f64
"""sz_100036""",0.002753
"""sz_100225""",0.002251
"""sz_100003""",0.001314
"""sz_100092""",0.001273
"""sz_100023""",0.001176


### Update the database with the betweenness centrality results

Now that we have the betweenness centrality results, we can update the database with the data in the Polars DataFrame's `betweenness_centrality` column.
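
The NetworkX node keys appear to carry the table name as a prefix (for example `Entity_sz_1`), which is why the previous cell strips it. Stripping alone loses track of which table each score belongs to. A small sketch that keeps the label alongside the id is shown below; `split_node_key` and `group_scores_by_label` are hypothetical helpers, and the zero scores in the sample are made up.

```python
LABELS = ("Entity", "OpenOwnership", "OpenSanctions")


def split_node_key(key: str) -> tuple[str, str]:
    """Split a key such as 'Entity_sz_1' into ('Entity', 'sz_1')."""
    for label in LABELS:
        prefix = f"{label}_"
        if key.startswith(prefix):
            return label, key[len(prefix):]
    raise ValueError(f"unrecognized node key: {key!r}")


def group_scores_by_label(scores: dict[str, float]) -> dict[str, dict[str, float]]:
    """Bucket centrality scores by the node table they belong to."""
    grouped: dict[str, dict[str, float]] = {label: {} for label in LABELS}
    for key, value in scores.items():
        label, node_id = split_node_key(key)
        grouped[label][node_id] = value
    return grouped


sample_scores = {
    "Entity_sz_100036": 0.002753,
    "OpenSanctions_NK-25vyVFzt8vdJGgAXMRTwTJ": 0.0,  # made-up score
    "OpenOwnership_10094521532396971848": 0.0,       # made-up score
}
grouped = group_scores_by_label(sample_scores)
```

Each bucket could then be turned into its own dataframe and written back only to the matching table, so ids never cross over between `Entity`, `OpenSanctions`, and `OpenOwnership`.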


In [30]:
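# Note: MERGE creates a node when none matches, so every id in `df` is looked up
# (and created, if absent) in all three tables, not only the one it came from.
conn.execute(
    """
    LOAD FROM df
    MERGE (s1:Entity {id: id})
    SET s1.betweenness_centrality = betweenness_centrality
    MERGE (s2:OpenSanctions {id: id})
    SET s2.betweenness_centrality = betweenness_centrality
    MERGE (s3:OpenOwnership {id: id})
    SET s3.betweenness_centrality = betweenness_centrality
    """
)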

<kuzu.query_result.QueryResult at 0x177d99730>

### Query the database

Let's inspect the most important entities in descending order of betweenness centrality!

In [31]:
res = conn.execute(
    """
    MATCH (a:Entity)
    RETURN a.id, a.descrip, a.betweenness_centrality
    ORDER BY a.betweenness_centrality DESC
    LIMIT 10
    """
)
res.get_as_pl()  # type: ignore

a.id,a.descrip,a.betweenness_centrality
str,str,f32
"""sz_100036""","""Victor Nyland Poulsen""",0.002753
"""sz_100225""","""Daniel Symmons""",0.002251
"""sz_100003""","""Kenneth Kurt Hansen""",0.001314
"""sz_100092""","""Daniel Lee Symons""",0.001273
"""sz_100023""","""Rudolf Esser""",0.001176
"""sz_100080""","""Art By Lunei Holding ApS""",0.001108
"""sz_100151""","""Daniel Nicholas Simons""",0.001077
"""sz_100006""","""Helena Antoinette Marie Verbee…",0.00107
"""sz_100060""","""Anders Kjærgaard Frandsen""",0.001064
"""sz_100072""","""Ida-Marie Langelund-Larsen Sil…",0.001015


The entity with the highest betweenness centrality is Victor Nyland Poulsen. It's worth inspecting the subgraph of related Open Ownership and OpenSanctions nodes around this entity to see if there are any interesting relationships.

For this, we'll use the circular layout in yFiles to make sense of the large number of relationships.

In [32]:
g.show_cypher(
    """
    MATCH (a)-[b]->(c:Entity)-[d *1..4]->(e)
    WHERE c.id = "sz_100036"
    RETURN * LIMIT 200;
    """,
    layout="circular"
)

GraphWidget(layout=Layout(height='700px', width='100%'))