# Spotting Potential Fraud Patterns with Neo4j

Financial crime is rarely an isolated event committed by a single rogue actor - it is almost always a connected phenomenon involving intricate networks of people, companies, and locations. Traditional fraud detection systems, which typically analyze data in tabular silos, often struggle to identify these complex schemes because they lack the ability to traverse relationships. By shifting the perspective to a graph-based approach, organizations can uncover the structural patterns that underpin systematic fraud, money laundering, and evasion. This notebook moves beyond simple entity resolution to explore three specific, high-risk topologies that are frequently indicative of illicit activity.

The first pattern we analyze is the phenomenon of **"Registration Factories."** In shell company laundering, quantity has a quality all its own. Fraudsters often require a large volume of disposable corporate entities to layer illicit funds or facilitate "long firm" fraud. To achieve this, they frequently register hundreds or even thousands of companies at a single physical address, often a residential property or a modest virtual office. While formation agents legitimately host multiple businesses, extreme outliers - where a single postcode hosts a density of companies akin to a skyscraper - are a primary red flag for "company mills" designed to mass-produce corporate vehicles for criminal misuse.

We then turn our attention to **"Circular Ownership"** loops. This is a sophisticated obfuscation technique where ownership chains are engineered to loop back on themselves - for example, Company A owns Company B, which owns Company C, which in turn owns Company A. These "Russian Doll" structures serve a dual purpose: they artificially inflate the capital on a company's balance sheet without any actual injection of funds, and more critically, they decapitate the ownership structure. By creating a closed loop, fraudsters can effectively hide the Ultimate Beneficial Owner (UBO), making it nearly impossible for standard due diligence processes to identify who is actually in control, thereby evading sanctions lists and KYC checks.

Finally, we investigate the **"Offshore Nexus,"** visualizing the flow of control from domestic assets to high-secrecy jurisdictions such as Jersey, Guernsey, and the Isle of Man. While offshore ownership can be legitimate, it is also a classic vector for tax evasion, capital flight, and the concealment of assets. By mapping the "flight paths" of corporate control, we can identify specific districts in the UK that are disproportionately owned by offshore interests. This geospatial perspective helps compliance teams prioritize their investigations, focusing on clusters where the ownership trail goes cold in jurisdictions known for their opacity.

In [16]:
import dotenv
import os

dotenv.load_dotenv()

NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USER = os.getenv("NEO4J_USER")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
NEO4J_DATABASE = os.getenv("NEO4J_DATABASE")

In [17]:
from neo4j_analysis import Neo4jAnalysis

# Initialize the analysis helper
analysis = Neo4jAnalysis(NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD, NEO4J_DATABASE)

## Registration Factories: Identifying High-Density Company Registrations

To identify potential fraud patterns, we can analyze the density of company registrations in specific locations. A high concentration of company registrations in a particular area may indicate the presence of a "registration factory," where multiple companies are registered at the same address, sometimes for fraudulent purposes.

We can analyze the company registration data to find locations with a high number of registered companies. This can be done using a Cypher query to count the number of companies registered at each address and then visualizing the results on a map.
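
Before turning to Cypher, the counting logic can be sketched in plain Python. This is only an illustration on made-up records: the `registrations` list, the `dense_addresses` helper, and the threshold value are assumptions for the example and are not part of the notebook's data or the `Neo4jAnalysis` helper.

```python
from collections import Counter

# Hypothetical toy data: (company_number, address_line_1, postcode)
registrations = [
    ("00000001", "1 EXAMPLE STREET", "AB1 2CD"),
    ("00000002", "1 EXAMPLE STREET", "AB1 2CD"),
    ("00000003", "1 EXAMPLE STREET", "AB1 2CD"),
    ("00000004", "9 SAMPLE ROAD", "XY9 8ZW"),
]


def dense_addresses(rows, threshold):
    """Return (district, address, postcode, count) for addresses above threshold."""
    # Count companies per (address, postcode) pair
    counts = Counter((address, postcode) for _, address, postcode in rows)
    flagged = [
        # The district (outward code) is the part of the postcode before the space
        (postcode.split(" ")[0], address, postcode, count)
        for (address, postcode), count in counts.items()
        if count > threshold
    ]
    # Densest addresses first
    return sorted(flagged, key=lambda row: row[3], reverse=True)


print(dense_addresses(registrations, threshold=2))
```

The graph query performs the same grouping inside the database, which avoids pulling every registration into memory.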

In [18]:
factory_district_query = """
MATCH (c:Company)-[:REGISTERED_AT]->(a:Address)
WHERE a.latitude IS NOT NULL AND a.longitude IS NOT NULL

// Filter for Active companies only
MATCH (c)-[:HAS_STATUS]->(s:CompanyStatus)
WHERE s.name = 'Active'

// Extract the District (e.g., 'EC2M' from 'EC2M 7PP')
WITH split(a.postcode, ' ')[0] AS district, a.line_1 AS address_line_1, a.postcode AS postcode, count(c) AS company_count

// Filter for significant volume to focus the map on high-density areas
WHERE company_count > 1000

RETURN district, address_line_1, postcode, company_count
ORDER BY company_count DESC
LIMIT 100
"""

df = analysis.run_query_df(factory_district_query)

In [19]:
# Show the top 10 locations for factory registrations, ordered by company count
df.sort_values(by="company_count", ascending=False).head(10)[
    ["address_line_1", "postcode", "company_count"]
]

| | address_line_1 | postcode | company_count |
|---|---|---|---|
| 0 | 71-75 SHELTON STREET | WC2H 9JQ | 68776 |
| 1 | 3RD FLOOR, 86-90 | EC2A 4NE | 13639 |
| 2 | 167-169 GREAT PORTLAND STREET | W1W 5PF | 8627 |
| 3 | 85 GREAT PORTLAND STREET | W1W 7LT | 6637 |
| 4 | 50 LOTHIAN ROAD | EH3 9WJ | 5778 |
| 5 | 2ND FLOOR COLLEGE HOUSE | HA4 7AE | 4433 |
| 6 | 320 FIRECREST COURT | WA1 1RG | 4122 |
| 7 | 3RD FLOOR | EC2A 4NE | 4087 |
| 8 | 101 NEW CAVENDISH STREET | W1W 6XH | 2642 |
| 9 | 82A JAMES CARTER ROAD | IP28 7DE | 2350 |


We can now map every postcode district that contains at least one address with more than 1,000 registered companies.

In [20]:
import geopandas as gpd
import pandas as pd
import requests
import io
import re
import pydeck as pdk
import matplotlib
import matplotlib.colors as mcolors
import json


# Fetch GeoJSON Boundaries
def get_postcode_area(district):
    # Extracts the leading letters (e.g., "SW" from "SW1A", "B" from "B1")
    match = re.match(r"([A-Z]+)", district, re.I)
    return match.group(1) if match else None


df["area"] = df["district"].apply(get_postcode_area)
unique_areas = df["area"].dropna().unique()

# Repository for UK Postcode Polygons
base_url = "https://raw.githubusercontent.com/missinglink/uk-postcode-polygons/refs/heads/master/geojson"
gdf_list = []

for area in unique_areas:
    url = f"{base_url}/{area.upper()}.geojson"
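    try:
        # A timeout keeps a stalled download from hanging the notebook
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            area_gdf = gpd.read_file(io.BytesIO(response.content))
            gdf_list.append(area_gdf)
        else:
            print(f"Skipping {area}: HTTP {response.status_code}")
    except Exception as e:
        print(f"Error fetching {area}: {e}")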

if not gdf_list:
    raise ValueError("No GeoJSON data could be retrieved.")

# Combine into one GeoDataFrame
full_gdf = pd.concat(gdf_list, ignore_index=True)
if full_gdf.crs and full_gdf.crs.to_string() != "EPSG:4326":
    full_gdf = full_gdf.to_crs(epsg=4326)

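# Merge with our company counts
# The query returns one row per address, so sum the counts per district first;
# otherwise the merge would duplicate a polygon for every address in that district.
# The GeoJSON 'name' property matches the district (e.g., "AB10")
district_counts = df.groupby("district", as_index=False)["company_count"].sum()
merged_gdf = full_gdf.merge(
    district_counts, left_on="name", right_on="district", how="left"
)
merged_gdf["company_count"] = merged_gdf["company_count"].fillna(0)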

# We use a Red scale to indicate 'Risk/Intensity'
cmap = matplotlib.colormaps["Reds"]
norm = mcolors.LogNorm(vmin=100, vmax=merged_gdf["company_count"].max())


def get_fill_color(count):
    if count < 100:
        return [50, 50, 50, 50]  # Grey/Transparent for low density
    rgba = cmap(norm(count))
    return [int(rgba[0] * 255), int(rgba[1] * 255), int(rgba[2] * 255), 200]


merged_gdf["fill_color"] = merged_gdf["company_count"].apply(get_fill_color)

In [21]:
geo_data_dict = json.loads(merged_gdf.to_json())

view_state = pdk.ViewState(
    latitude=54.5,
    longitude=-3.0,
    zoom=6.5,
    pitch=45,
    bearing=0,
)

layer = pdk.Layer(
    "GeoJsonLayer",
    data=geo_data_dict,
    opacity=0.8,
    stroked=True,
    filled=True,
    extruded=True,  # Extrude based on count
    wireframe=False,
    get_fill_color="properties.fill_color",
    get_line_color=[255, 255, 255, 80],
    # Scale elevation: Taller bars = More Companies = Higher Risk
    get_elevation="properties.company_count * 2",
    get_line_width=20,
    pickable=True,
    auto_highlight=True,
)

r = pdk.Deck(
    layers=[layer],
    initial_view_state=view_state,
    map_style=pdk.map_styles.CARTO_DARK,
    tooltip={
        "html": "<b>District:</b> {name}<br/><b>Active Companies:</b> {company_count}"
    },
)

# Save and Render
html_path = "renderings/registration_factory_density_choropleth.html"
r.to_html(html_path, notebook_display=False)

# Optional: snapshot the rendered HTML to PNG using the Neo4jAnalysis helper
await analysis.capture_graph_to_png(
    html_content=None,
    output_path="renderings/registration_factory_density_choropleth.png",
    scale=1,
    width=1500,
    height=1920,
    html_file=html_path,
)

![Registration Factory Density Choropleth](renderings/registration_factory_density_choropleth.png)

## Circular ownership loops

Circular ownership loops occur when a company is indirectly owned by itself through a chain of ownership. These loops can be indicative of complex corporate structures used to obscure the true ownership of a company, which may be a red flag for fraudulent activities or money laundering.

Because of the complexity of these loops, they can be difficult to detect using traditional methods. However, graph databases like Neo4j are perfectly suited for identifying such patterns due to their ability to efficiently traverse relationships.

> Circular ownership loops start with a company, go through a series of ownership relationships, and eventually loop back to the original company. For example, Company A owns Company B, Company B owns Company C, and Company C owns Company A. This doesn't necessarily indicate fraud on its own, but it can be a sign of an attempt to hide the true ownership of a company or to create a complex structure that is difficult to analyze, and it may warrant further investigation.
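
To make the A → B → C → A idea concrete, here is a minimal sketch of loop detection over a list of ownership edges using a depth-limited search. The `find_ownership_loops` function and the toy edge list are hypothetical and only illustrate the traversal; the notebook itself delegates this work to a Cypher path pattern.

```python
def find_ownership_loops(edges, max_depth=12):
    """Return ownership loops as tuples of entities, each loop reported once."""
    # Build an adjacency list: owner -> entities it controls
    graph = {}
    for owner, owned in edges:
        graph.setdefault(owner, []).append(owned)

    loops = set()

    def walk(start, node, path):
        for nxt in graph.get(node, []):
            if nxt == start:
                # Loop closed. Rotate so the smallest name comes first, which
                # makes A->B->C and B->C->A collapse into a single entry.
                pivot = path.index(min(path))
                loops.add(tuple(path[pivot:] + path[:pivot]))
            elif nxt not in path and len(path) < max_depth:
                walk(start, nxt, path + [nxt])

    for start in graph:
        walk(start, start, [start])
    return sorted(loops)


# Hypothetical toy edges: (controller, controlled)
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(find_ownership_loops(edges))
```

The `max_depth` parameter plays the same role as the upper bound on a variable-length path: it caps how long a loop the search is willing to follow.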

In [22]:
circular_ownership_query = """
// MATCH: Find a path that starts and ends at the same Company
// We increase the max depth to 12 to allow for larger loops (e.g., 6 controllers)
MATCH path = (c:Company)-[:CONTROLS|SAME_AS*1..12]->(c)

// FILTER: "Five or more controlling entities"
// We count how many 'CONTROLS' relationships exist in the path.
// Each 'CONTROLS' relationship represents one entity exerting power.
WHERE size([r IN relationships(path) WHERE type(r) = 'CONTROLS']) >= 5

RETURN path
"""

response = analysis.run_query_viz(circular_ownership_query)

In [None]:
from neo4j_viz.neo4j import from_neo4j, ColorSpace

# Run the query (using the query string above)
results = analysis.run_query_viz(circular_ownership_query)

colors = {
    "Country": "#1f77b4",  # Blue for Countries
    "Address": "#ff7f0e",  # Orange for Addresses
    "Person": "#2ca02c",  # Green for Persons
    "PreviousName": "#d62728",  # Red for Previous Names
    "SICCode": "#9467bd",  # Purple for SIC Codes
    "Company": "#8c564b",  # Brown for Companies
    "CompanyCategory": "#e377c2",  # Pink for Company Categories
    "CompanyStatus": "#7f7f7f",  # Gray for Company Statuses
    "SupervisoryAuthority": "#bcbd22",  # Olive for Supervisory Authorities
    "AuthorisedCorporateServiceProvider": "#17becf",  # Cyan for Authorised Corporate Service Providers,
    "Organization": "#aec7e8",  # Light Blue for Organizations,
}

VG = from_neo4j(results)

VG.color_nodes(
    field="caption",  # Using the internal labels property
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)
VG.resize_relationships(
    property="thickness",
)
VG.color_relationships(
    property="thickness",
    color_space=ColorSpace.DISCRETE,
    colors={
        1: "blue",  # Blue for low control (<=25% voting and share rights)
        2: "orange",  # Orange for medium control (26-50% voting and share rights)
        3: "red",  # Red for high control (51-75% voting and share rights)
        4: "purple",  # Purple for anything else (>75% voting and share rights)
    },
)

label_to_property = {"Organization": "uid", "Person": "id", "Company": "uid"}

analysis.set_caption_by_label(VG, label_to_property)

generated_html = VG.render(layout="forcedirected", initial_zoom=0.7)

await analysis.capture_graph_to_png(
    generated_html, "renderings/circular_ownership_loops.png"
)

![Circular Ownership Loops](renderings/circular_ownership_loops.png)

## Offshore Company Concentration: Identifying High-Density Offshore Registrations

Offshore company registrations can be a red flag for potential fraud, as they are often used to hide assets, avoid taxes, or engage in illicit activities. By analyzing the concentration of offshore company registrations in specific jurisdictions, we can identify areas that may warrant further investigation.

Let us produce a geographical visualization linking UK postcode districts to the offshore jurisdictions (Jersey, Guernsey, and the Isle of Man) from which their active companies are controlled. This can help us identify districts where offshore control is unusually concentrated.

> The plot below encodes concentration as the width of each arc towards the offshore jurisdictions. Can you spot the thick arc landing near Portsmouth?

In [24]:
aggregated_offshore_query = """
MATCH (controller)-[:BASED_IN|RESIDES_IN]->(country:Country)
WHERE country.name IN ['Jersey', 'Guernsey', 'Isle of Man']

MATCH (controller)-[:CONTROLS]->(c:Company)-[:HAS_STATUS]->(s:CompanyStatus)
WHERE s.name = 'Active'

MATCH (c)-[:REGISTERED_AT]->(ca:Address)
WHERE ca.postcode IS NOT NULL

// Extract the Outcode/District
WITH country.name AS Jurisdiction, split(ca.postcode, ' ')[0] AS District, count(c) AS Company_Count
WHERE Company_Count >= 10
RETURN Jurisdiction, District, Company_Count
ORDER BY Company_Count DESC
LIMIT 10000
"""

df = analysis.run_query_df(aggregated_offshore_query)

In [25]:
import pgeocode

# Geocode UK Districts (Source)
nomi = pgeocode.Nominatim("gb")

# Get unique districts from the query result to minimize API calls/lookups
unique_districts = df["District"].unique()
geo_results = nomi.query_postal_code(unique_districts)

# Create a lookup dictionary: District -> [Lat, Lon]
district_map = geo_results.set_index("postal_code")[
    ["latitude", "longitude"]
].T.to_dict("list")

# Map the coordinates back to the main DataFrame
df["src_lat"] = df["District"].map(lambda x: district_map.get(x, [None, None])[0])
df["src_lon"] = df["District"].map(lambda x: district_map.get(x, [None, None])[1])

OFFSHORE_CENTROIDS = {
    "Jersey": [-2.1312, 49.2144],
    "Guernsey": [-2.5853, 49.4482],
    "Isle of Man": [-4.5481, 54.2361],
}

df["tgt_lon"] = df["Jurisdiction"].apply(
    lambda x: OFFSHORE_CENTROIDS.get(x, [None, None])[0]
)
df["tgt_lat"] = df["Jurisdiction"].apply(
    lambda x: OFFSHORE_CENTROIDS.get(x, [None, None])[1]
)

# Drop invalid rows (where geocoding failed)
df_clean = df.dropna(subset=["src_lat", "src_lon", "tgt_lat", "tgt_lon"]).copy()

geo_data_dict = json.loads(df_clean.to_json(orient="records"))

view_state = pdk.ViewState(
    latitude=49.1,  # Just south of Jersey (approx. St Helier is 49.18)
    longitude=-2.1,  # Roughly aligned with the gap between Jersey and France
    zoom=8,  # Closer zoom to emphasize the islands
    pitch=55,  # Slightly lower pitch to see the "landing" of arcs in London
    bearing=15,  # Bearing NNE towards London (approx 0.1W)
)

layer = pdk.Layer(
    "ArcLayer",
    data=geo_data_dict,
    get_source_position=["src_lon", "src_lat"],
    get_target_position=["tgt_lon", "tgt_lat"],
    get_source_color=[0, 255, 128, 140],
    get_target_color=[255, 0, 0, 140],
    # DYNAMIC WIDTH: Scale thickness linearly with Company_Count
    # Dividing by 100 keeps very large districts from covering the map
    get_width="1 + (Company_Count / 100)",
    get_tilt=15,
    pickable=True,
    auto_highlight=True,
)

r = pdk.Deck(
    layers=[layer],
    initial_view_state=view_state,
    map_style=pdk.map_styles.CARTO_DARK,
    tooltip={
        "html": "<b>District:</b> {District}<br/>"
        "<b>Offshore Haven:</b> {Jurisdiction}<br/>"
        "<b>Controlled Companies:</b> {Company_Count}"
    },
)

html_path = "renderings/aggregated_offshore_arcs.html"
r.to_html(html_path, notebook_display=False)

await analysis.capture_graph_to_png(
    html_content=None,
    output_path="renderings/aggregated_offshore_arcs.png",
    scale=1,
    width=1500,
    height=1920,
    html_file=html_path,
)

![High-Density Offshore Registrations](renderings/aggregated_offshore_arcs.png)
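
The arc width above grows linearly with `Company_Count`. If a few very large districts dominate the picture, a logarithmic scale is one possible alternative. The sketch below is an assumption about how that could look; the `arc_width` function and its `base` and `scale` parameters are illustrative and not part of the notebook.

```python
import math


def arc_width(company_count, base=1.0, scale=2.0):
    """Log-scaled arc width: grows quickly for small counts, slowly for large ones."""
    # max(..., 1) guards against log10(0) for empty districts
    return base + scale * math.log10(max(company_count, 1))


for count in (10, 100, 1000):
    print(count, arc_width(count))
```

If adopted, the result could be stored as a DataFrame column (for example a hypothetical `arc_width` column on `df_clean`) and referenced by name in the layer's `get_width` argument.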

## What is in a name?

There are many use cases where certain names can be a red flag for potential fraud. For example, if a company has a name that is very similar to a well-known brand, it could be an attempt to deceive customers or investors. Frequent name changing can also be a red flag, as it may indicate an attempt to avoid detection or to create confusion. Additionally, names that include certain keywords (e.g., "offshore," "investment," "holding") may warrant further scrutiny, especially if they are associated with high-risk jurisdictions or industries.
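
As a rough illustration of the lookalike-name check, the sketch below compares a company name against a small watchlist using the standard library's `difflib.SequenceMatcher`. The `KNOWN_BRANDS` list, the suffix handling, and the 0.85 threshold are all assumptions chosen for the example; a production screen would need a curated brand list and a tuned similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical watchlist of protected brand names (illustrative only)
KNOWN_BRANDS = ["ACME BANK", "GLOBEX"]


def normalise(name):
    """Upper-case the name and drop one common UK company suffix."""
    cleaned = name.upper().strip()
    for suffix in (" LIMITED", " LTD", " PLC", " LLP"):
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)]
            break
    return cleaned.strip()


def lookalike_brands(company_name, brands=KNOWN_BRANDS, threshold=0.85):
    """Return (brand, similarity) pairs that are close to, but not equal to, the name."""
    candidate = normalise(company_name)
    hits = []
    for brand in brands:
        ratio = SequenceMatcher(None, candidate, brand).ratio()
        # An exact match is treated as the brand itself rather than an imitation
        if candidate != brand and ratio >= threshold:
            hits.append((brand, round(ratio, 2)))
    return hits


print(lookalike_brands("Acme Bnak Ltd"))
```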

### Reputation laundering

In the following query, we screen the entire history of a company, not just its current legal name. The query traverses the `HAS_PREVIOUS_NAME` and `PREVIOUS_NAME_OF` chain to find any past identity that matches a high-risk keyword (here "CRYPTO" and "FX"; the list could be extended with terms such as "Capital" or a specific sanctioned entity name). This is crucial because a company might change its name to "Generic Holdings Ltd" specifically to hide its past association with a collapsed crypto exchange or a sanctioned entity.
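
The screening rule can be sketched in plain Python on made-up records: flag a company when a previous name contains a keyword but the current name does not. The `companies` list, the `first_keyword` and `laundered_names` helpers, and the record layout are hypothetical and serve only to illustrate the logic.

```python
HIGH_RISK_KEYWORDS = ("CRYPTO", "FX")

# Hypothetical toy records: current name plus previous names, newest first
companies = [
    {"number": "X0000001", "name": "GENERIC HOLDINGS LTD",
     "previous_names": ["ALPHA CRYPTO EXCHANGE LTD", "ALPHA TRADING LTD"]},
    {"number": "X0000002", "name": "BETA FX LTD",
     "previous_names": ["BETA FX SERVICES LTD"]},
    {"number": "X0000003", "name": "GAMMA BAKERY LTD", "previous_names": []},
]


def first_keyword(name, keywords=HIGH_RISK_KEYWORDS):
    """Return the first keyword found in the name, or None."""
    return next((word for word in keywords if word in name.upper()), None)


def laundered_names(records):
    """Flag companies whose past names carry a keyword the current name hides."""
    flagged = []
    for record in records:
        if first_keyword(record["name"]) is not None:
            continue  # keyword is still visible in the current name
        for old_name in record["previous_names"]:
            keyword = first_keyword(old_name)
            if keyword is not None:
                flagged.append((record["number"], old_name, keyword))
    return flagged


print(laundered_names(companies))
```

Plain substring matching is crude (a short token like "FX" can appear inside unrelated words), so matches should be treated as leads for review rather than findings.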

In [28]:
reputation_laundering_query = """
// MATCH: Traverse the full history of company names
MATCH path = (c:Company)-[:HAS_PREVIOUS_NAME|PREVIOUS_NAME_OF*]->(prev:PreviousName)

// FILTER: Check if ANY previous name contains high-risk keywords
// (parenthesised because AND binds tighter than OR in Cypher)
WHERE (prev.name CONTAINS 'CRYPTO'
   OR prev.name CONTAINS 'FX')

// REFINE: Ensure the *current* name does NOT contain these keywords
AND NOT c.name CONTAINS 'CRYPTO'
AND NOT c.name CONTAINS 'FX'

RETURN
    c.number AS Company_ID,
    c.incorporation_date AS Incorporation_Date,
    // Extract the date the name was changed (stored on the relationship)
    last(relationships(path)).changed_on AS Name_Changed_Date,
    // We check the previous name against the list and return the first match found.
    head([word IN ['CRYPTO', 'FX'] WHERE prev.name CONTAINS word]) AS Matched_Keyword
LIMIT 20;
"""

df = analysis.run_query_df(reputation_laundering_query)

df.head(10)

| | Company_ID | Incorporation_Date | Name_Changed_Date | Matched_Keyword |
|---|---|---|---|---|
| 0 | 9221291 | 2014-09-16 | 2025-02-28 | CRYPTO |
| 1 | 16157727 | 2024-12-31 | 2026-01-30 | FX |
| 2 | 13684038 | 2021-10-17 | 2023-05-01 | CRYPTO |
| 3 | 13079328 | 2020-12-14 | 2025-03-05 | CRYPTO |
| 4 | 13766183 | 2021-11-25 | 2025-03-05 | CRYPTO |
| 5 | 12218786 | 2019-09-20 | 2021-07-28 | FX |
| 6 | 6483658 | 2008-01-25 | 2010-04-15 | FX |
| 7 | 15207218 | 2023-10-12 | 2023-11-07 | FX |
| 8 | 16098093 | 2024-11-25 | 2025-02-17 | FX |
| 9 | 11383083 | 2018-05-25 | 2019-01-30 | CRYPTO |


### High-velocity name changes

There are many genuine reasons for a company to change its name, such as rebranding, mergers, or changes in business focus. However, a high frequency of name changes within a short period can be a red flag for potential fraud. This pattern may indicate an attempt to evade detection, create confusion among customers or investors, or distance the company from negative publicity. By analyzing the history of name changes for a company, we can identify those that exhibit this high-velocity pattern and may warrant further investigation.

Because we organised the name history as a linked list, we can easily traverse it to count the number of name changes and identify companies that have changed their names multiple times within a short period. This can be done using a Cypher query that counts the number of `HAS_PREVIOUS_NAME` relationships for each company and filters for those with a high count within a specific timeframe.
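
The per-name duration logic can be sketched on a list of change dates. The helpers below are hypothetical illustrations: `name_durations` assumes the dates are ordered newest first, as they would be when walking the linked list from the company outwards, and `thickness` mirrors the edge-thickness formula used in the query that follows.

```python
from datetime import date


def name_durations(change_dates):
    """Days each previous name was in use, given change dates ordered newest first."""
    durations = []
    # Each name was valid from the older change date until the next newer one
    for newer, older in zip(change_dates, change_dates[1:]):
        durations.append((newer - older).days)
    return durations


def thickness(days_lasted):
    """Shorter-lived names get thicker edges; the result is never below 1."""
    return int(10.0 / (days_lasted + 5)) + 1


# Hypothetical change dates for one company, newest first
changes = [date(2024, 3, 1), date(2024, 2, 20), date(2023, 2, 20)]
durations = name_durations(changes)
rapid = [days for days in durations if days <= 30]  # 30 days is an arbitrary cut-off
print(durations, rapid, [thickness(days) for days in durations])
```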

In the following graph visualization, we can see a company that has changed its name multiple times within a short period, which may be indicative of an attempt to evade detection or create confusion. Brown nodes represent the company, and red nodes represent its previous names. The thickness and color of the edges indicate how quickly a name was replaced: thicker, redder edges mark names that were held only briefly, while thin blue edges mark long-lived names. This visual representation can help us quickly identify companies that exhibit this high-velocity name change pattern.

In [49]:
high_velocity_name_changes_query = """
// MATCH: Find the LONGEST path only
// We anchor the start at a Company and the end at a PreviousName that has no further history.
MATCH path = (c:Company)-[:HAS_PREVIOUS_NAME|PREVIOUS_NAME_OF*]->(last_n:PreviousName)
WHERE NOT (last_n)-[:PREVIOUS_NAME_OF]->() 

// FILTER: Limit to chains with interesting history
AND length(path) >= 5 

// UNWIND: Now we can safely unwind because we only have one path per company
WITH path, relationships(path) AS rels
UNWIND range(0, size(rels)-1) AS i
WITH rels[i] AS r_curr, 
     rels[i+1] AS r_next, 
     startNode(rels[i]) AS source, 
     endNode(rels[i]) AS target

// CALCULATE: Duration
// The logic: The name 'target' was valid FROM r_next.changed_on UNTIL r_curr.changed_on
// duration.inDays returns the total day count; duration.between(...).days would only
// return the leftover days after whole months are split off.
// If r_next is null (it's the oldest name), we assign a default large duration (e.g. 2000 days)
WITH source, target, r_curr,
     CASE
       WHEN r_next IS NOT NULL THEN duration.inDays(r_next.changed_on, r_curr.changed_on).days
       ELSE 2000
     END AS days_lasted

// THICKNESS: Inverse of duration (Shorter time = Thicker line)
// +5 in the denominator avoids division by zero; +1 keeps the thickness at least 1
WITH source, target, r_curr, days_lasted,
     toInteger(10.0 / (days_lasted + 5)) + 1 AS calculated_thickness

// RETURN: Distinct virtual relationships
RETURN 
    source, 
    target, 
    apoc.create.vRelationship(source, 'HAS_PREVIOUS_NAME', {
        thickness: calculated_thickness,
        days_held: days_lasted,
        changed_on: r_curr.changed_on
    }, target) AS rel
LIMIT 50
"""

results = analysis.run_query_viz(high_velocity_name_changes_query)

In [51]:
VG = from_neo4j(results)

VG.color_nodes(
    field="caption",  # Using the internal labels property
    color_space=ColorSpace.DISCRETE,
    colors=colors,
)
VG.resize_relationships(property="thickness")
# (Red = Fast, Blue = Slow)
VG.color_relationships(
    property="thickness", color_space=ColorSpace.CONTINUOUS, colors=["blue", "red"]
)

label_to_property = {"PreviousName": "", "Company": ""}

analysis.set_caption_by_label(VG, label_to_property)

generated_html = VG.render(layout="forcedirected", initial_zoom=0.8)

await analysis.capture_graph_to_png(
    generated_html, "renderings/high_velocity_name_changes_graph.png"
)

![High-Velocity Name Changes](renderings/high_velocity_name_changes_graph.png)