# Entity Resolution with Neo4j GDS

## Prerequisites

Before running this notebook:

1. **Completed Notebook 1** - Data must be imported into Neo4j
2. **Neo4j running** with the `neo4j` database
3. **GDS plugin installed** (version 2.3+)
4. **Environment configured** - `.env` file with credentials

**Expected Runtime**: 30-45 minutes for full dataset

---

## The Problem

The same person often appears multiple times in our data:
- `kenneth.lay@enron.com` and `klay@enron.com` and `ken.lay@enron.com`
- `jeff.skilling@enron.com` and `jeffrey.skilling@enron.com`
- Display names: "Ken Lay" vs "Kenneth L. Lay" vs "Lay, Kenneth"

**Goal**: Identify and link ~5,000 duplicate User nodes with high confidence

## The Solution

Multi-stage entity resolution pipeline:

1. **Community Detection** (Louvain) - Partition graph into manageable communities
2. **Node Similarity** - Find users with overlapping mailbox connections  
3. **String Matching** (Jaro-Winkler) - Validate name similarity
4. **Connected Components** (WCC) - Group all aliases together

---

## 1. Setup

In [None]:
import os
import pandas as pd
from graphdatascience import GraphDataScience
from dotenv import load_dotenv
import time

load_dotenv()

NEO4J_URI = os.getenv("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.getenv("NEO4J_USERNAME", "neo4j")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
DATABASE = "neo4j"

if not NEO4J_PASSWORD:
    raise ValueError("NEO4J_PASSWORD not found in .env file!")

try:
    gds = GraphDataScience(NEO4J_URI, auth=(NEO4J_USER, NEO4J_PASSWORD), database=DATABASE)
    gds_version = gds.version()
    print(f"Connected to Neo4j GDS {gds_version}")
    
    # Verify data exists
    result = gds.run_cypher("MATCH (e:Email) RETURN count(e) as count")
    email_count = result['count'].iloc[0]
    
    if email_count == 0:
        raise ValueError("No emails found! Please run Notebook 1 first to import data.")
    
    print(f"✓ Found {email_count:,} emails in database")
    
except Exception as e:
    print(f"Connection or data validation failed: {e}")
    print("\nTroubleshooting:")
    print("  1. Have you run Notebook 1 to import data?")
    print("  2. Is Neo4j running with the 'neo4j' database?")
    print("  3. Is the GDS plugin installed and activated?")
    raise

## 2. Low-hanging Fruit

Currently, we have two nodes that identify individuals:

1. `(:User)` nodes identify the names extracted from headers
2. `(:Mailbox)` nodes identify emails extracted along with those names

Both are connected individually to emails via `:SENT`, `:RECEIVED`, `:CC_ON`, and `:BCC_ON` relationships.

We cannot use names as primary identifiers. Here's an example as to why:

In [None]:
similar_people = gds.run_cypher("""
    MATCH (u:User)
    WHERE toLower(u.nameRaw) CONTAINS(toLower("Ken")) AND toLower(u.nameRaw) CONTAINS(toLower("Lay"))
    RETURN u.nameRaw AS name, u.primaryEmail AS email
    ORDER BY name
""")

print("Ken Lay lookalikes:")
print("="*60)
display(similar_people)

These are all the emails that could refer to 'Kenneth Lay'.

However, check out row 26 of the output (the row number may differ slightly on your run).

Ken Slay '80 is not the same person as Kenneth Lay -- yet a straight name comparison may well put them together. Jaro-Winkler would probably pass, but the point stands -- we need to do some filtering.
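To make the name-comparison risk concrete, here is a minimal pure-Python sketch of Jaro-Winkler similarity. It follows the textbook definition (match window, transpositions, prefix bonus of up to four characters); the function names are illustrative, and APOC's `apoc.text.jaroWinklerDistance` may differ in small details, so treat the numbers as indicative only.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching if they are within this window
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the shared prefix (max 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


score = jaro_winkler("ken lay", "ken slay")
print(f"similarity={score:.3f}, distance={1 - score:.3f}")
```

On this lowercased pair the sketch gives a similarity of 0.975, which is why string similarity on its own is not enough to separate near-identical names that belong to different people.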

### Using stable identifiers

There are many ways of doing entity resolution, and the simplest and most predictable is to rely on stable, unique identifiers shared by many entities.

We can then recursively tighten the graph, so that newly combined entities inherit each other's connections.

In our dataset, the most stable identifiers we have are emails. 

When we imported, we ensured that identical email addresses could only enter the database once. So, if:

`klay@` was imported with the display name 'Ken Lay' and then again with 'Kenneth Lay', both User nodes end up connected to the same Mailbox node via `USED`.

Let's take a look at some of these entity pairs.

Note: We'll filter out any mailboxes that have 50 or more shared users -- the probability of someone having 50 ways to present their name is low.
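The idea of grouping names by a shared, stable identifier can be sketched without a database. The snippet below is only an illustration: the sample pairs are made up, and `names_by_mailbox` is a hypothetical helper, not part of the notebook's pipeline.

```python
from collections import defaultdict

# Hypothetical (display name, mailbox address) pairs, as they might appear in headers
observations = [
    ("Ken Lay", "klay@enron.com"),
    ("Kenneth Lay", "klay@enron.com"),
    ("Lay, Kenneth", "kenneth.lay@enron.com"),
    ("Jeff Skilling", "jeff.skilling@enron.com"),
]

MAX_NAMES_PER_MAILBOX = 50  # same cutoff idea as the Cypher query


def names_by_mailbox(pairs, max_names=MAX_NAMES_PER_MAILBOX):
    """Group display names by mailbox and keep plausible alias sets."""
    grouped = defaultdict(set)
    for name, address in pairs:
        grouped[address.lower()].add(name)
    # Keep mailboxes shared by at least two names, but fewer than max_names
    return {
        address: sorted(names)
        for address, names in grouped.items()
        if 2 <= len(names) < max_names
    }


candidates = names_by_mailbox(observations)
print(candidates)
```

Each surviving mailbox is a candidate alias set: names that share an address are likely, though not guaranteed, to be the same person.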

In [None]:
# Set pandas display options to show full content
pd.set_option('display.max_colwidth', None)  # Show full column width
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.width', None)         # Don't wrap to fit console width
pd.set_option('display.max_seq_items', None) # Show all items in lists

# Which users are connected to the same mailbox address, but have different nodes?
same_mailbox = gds.run_cypher("""
    MATCH (u:User)-[:USED]->(m:Mailbox)<-[:USED]-(u2:User) 
    WHERE u <> u2 
    AND m.address CONTAINS "@enron.com"
    WITH DISTINCT m.address AS address, collect(u.nameRaw) as names1, collect(u2.nameRaw) AS names2
    WHERE size(names1) < 50 AND size(names2) < 50
    RETURN address, names1, names2
    ORDER BY address ASC
""")

print("Which users are connected to the same mailbox address, but have different nodes?")
print("="*60)
display(same_mailbox)

You will notice these are, for the most part, correct. However, we also need to filter out generic emails like 'list', 'no.user', etc.

A super interesting example here is that 'Greg Whalley' and 'Lawrence Whalley' both share the same email address. This must be an error, right? 

Actually Lawrence Whalley also goes by Lawrence "Greg" Whalley -- hence why his name shows up in both formats. So these two should merge. On the other hand, we have examples like Randy Young and Becky Young -- the "Randy" seems to come from messy data, and there's not a whole lot we can do about that at this stage.

You could manually attack each of these email addresses with a large dict or list -- or we can take a look at the kinds of relational attributes they display.

For instance, we have already filtered the initial list to exclude `Mailbox` nodes that are connected to 50 or more names. However, we can go one step further. Let's get the degree count for every `Mailbox` node (the number of `User` nodes pointing at it) and write it back as a property.

To do so, we:

1. Project a graph of User->Mailbox via the `USED` relationship.
2. Run degree centrality
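As a rough mental model of those two steps, degree centrality with a reversed orientation simply counts how many `USED` relationships point at each mailbox. The sketch below reproduces that count in plain Python on made-up edges; the cutoff value passed in is an assumption you would tune against your own results.

```python
from collections import Counter

# Hypothetical USED relationships: (user display name, mailbox address)
used_edges = [
    ("Ken Lay", "klay@enron.com"),
    ("Kenneth Lay", "klay@enron.com"),
    ("Greg Whalley", "greg.whalley@enron.com"),
    ("Lawrence Whalley", "greg.whalley@enron.com"),
    ("Some List Member", "no.user@enron.com"),
]
# Simulate a generic address that appears under many different names
used_edges += [(f"Sender {i}", "no.user@enron.com") for i in range(30)]


def mailbox_in_degree(edges):
    """Count incoming USED relationships per mailbox."""
    return Counter(mailbox for _user, mailbox in edges)


def plausible_mailboxes(edges, max_degree):
    """Keep mailboxes whose in-degree is at or below the cutoff."""
    return {m: d for m, d in mailbox_in_degree(edges).items() if d <= max_degree}


degrees = mailbox_in_degree(used_edges)
kept = plausible_mailboxes(used_edges, max_degree=19)  # cutoff is illustrative
print(degrees.most_common(3))
print(sorted(kept))
```

Generic addresses tend to float to the top of the degree ranking, which makes them easy to spot and exclude.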

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('user_email_bipartite')
    print("Dropped existing projection")
except Exception:
    pass

# Create bipartite projection: User -> Mailbox
# This connects each user to the mailboxes they used
projection = gds.run_cypher("""
    MATCH (source)
    WHERE "User" IN labels(source)
    OPTIONAL MATCH (source)-[r:USED]->(target:Mailbox)
    WITH gds.graph.project(
        'user_email_bipartite',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created bipartite projection:")
display(projection)

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

print("Running Degree Centrality...\n")

G = gds.graph.get('user_email_bipartite')

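# Run Degree Centrality using the GDS client.
# REVERSE counts incoming USED relationships, so each Mailbox gets the number of Users pointing at it.
result = gds.degree.write(
    G,
    writeProperty='email_degree',
    orientation='REVERSE'
)

print("Degree Centrality completed successfully!")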
print(f"Properties written: {result['nodePropertiesWritten']:,}")

In [None]:
top_degree = gds.run_cypher("""
MATCH (m:Mailbox)
WHERE m.email_degree IS NOT NULL
OPTIONAL MATCH (u:User)-[:USED]->(m)
WITH m, collect(DISTINCT u.nameNormalized) as owners
RETURN 
  m.address as email,
  owners[0..6] as owner_names,
  size(owners) as unique_owners,
  m.email_degree
ORDER BY m.email_degree DESC
LIMIT 50
""")
display(top_degree)

In [None]:
## Drop the graph
gds.graph.drop('user_email_bipartite')

From these results, we can see that everything above 19 is absolute trash in terms of entity resolution.

Below 19, we start to get some decent matches, but we also have some spam.

Let's run our original @enron.com focused query again -- this time keeping only mailboxes with a degree centrality of 19 or less.

In [None]:
# Set pandas display options to show full content
pd.set_option('display.max_colwidth', None)  # Show full column width
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.width', None)         # Don't wrap to fit console width
pd.set_option('display.max_seq_items', None) # Show all items in lists

# Which users are connected to the same mailbox address, but have different nodes?
same_mailbox = gds.run_cypher("""
    MATCH (u:User)-[:USED]->(m:Mailbox)<-[:USED]-(u2:User) 
    WHERE u <> u2 
    AND m.address CONTAINS "@enron.com"
    AND m.email_degree <= 19
    WITH DISTINCT m.address AS address, collect(u.nameRaw) as names1, collect(u2.nameRaw) AS names2
    WHERE size(names1) < 50 AND size(names2) < 50
    RETURN address, names1, names2
    ORDER BY address ASC
""")

print("Which users are connected to the same mailbox address, but have different nodes?")
print("="*60)
display(same_mailbox)

This is now a much more realistic set. There are likely still a few spurious links in here -- but it's enough to get started with.

There are a few things we can do now, including:

1. **Weakly Connected Components:** identifies disconnected components in the graph. You can use this to narrow down the comparison areas.
2. **Louvain/Leiden:** Like WCC, these partition the graph, but they require more compute and can identify communities even when those communities are connected to the rest of the graph
3. **Node Similarity:** To check the overlap of nodes and their neighbours
4. **FastRP + KNN:** To get node positions and compare (can also use properties as features)

There are some other methods too, but let's start with these.

## 3. WCC

We'll run Weakly Connected Components to see if we have any large, separate components that we can investigate separately.
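If you have not met WCC before, the idea is simple: ignore relationship direction and group together every node that can reach another. The sketch below uses a small union-find structure to show this on a toy graph. It is an illustration of the concept, not the GDS implementation, and the node names are made up.

```python
def weakly_connected_components(nodes, edges):
    """Group nodes into components, ignoring edge direction (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)
    # Largest component first
    return sorted(components.values(), key=len, reverse=True)


nodes = ["u1", "u2", "u3", "u4", "m1", "m2"]
edges = [("u1", "m1"), ("u2", "m1"), ("u3", "m2")]
components = weakly_connected_components(nodes, edges)
print([sorted(c) for c in components])
```

Here `u1` and `u2` land in the same component because they share `m1`, while `u4` has no relationships and sits alone.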

First, let's just project the entire graph. 

Note: I am using Cypher projection here -- you can also use Native projection if you prefer.

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('wcc_graph')
    print("Dropped existing projection")
except Exception:
    pass

projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WITH gds.graph.project(
        'wcc_graph',
        source,
        target
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created wcc projection:")
display(projection)

Next we run WCC on the graph.

Before we go ahead and write results, it's worth using **stats** mode to check whether there are any orphaned components.

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

print("Running WCC...\n")

G = gds.graph.get('wcc_graph')

# Run WCC in stats mode using the GDS client (nothing is written back)
result = gds.wcc.stats(
    G
)

print("WCC completed successfully!")
pd.DataFrame(result)

In [None]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

print("Running WCC...\n")

G = gds.graph.get('wcc_graph')

# Run WCC in write mode, storing the component ID on each node.
# minComponentSize=10 skips components with fewer than 10 nodes.
result = gds.wcc.write(
    G,
    writeProperty='wcc_id',
    minComponentSize=10
)

print("WCC completed successfully!")
print(f"Properties written: {result['nodePropertiesWritten']:,}")

In [None]:
# Set pandas display options to show full content
pd.set_option('display.max_colwidth', None)  # Show full column width
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.width', None)         # Don't wrap to fit console width
pd.set_option('display.max_seq_items', None) # Show all items in lists

# How many User nodes fall into each weakly connected component?
nodes_per_wcc = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.wcc_id IS NOT NULL
    WITH DISTINCT u.wcc_id AS component, collect(u) AS total_per_wcc
    WITH component, size(total_per_wcc) AS count
    RETURN component, count
    ORDER BY count DESC
""")

print("Let's see how many nodes are in each component...")
print("="*60)
display(nodes_per_wcc)

So, we've got many singleton users, trapped alone in their components, a few slightly larger ones, and one gigantic component, helpfully labeled 0.

Running Jaro-Winkler on that would be heavy, so let's break it up some more.

## 4. Louvain

Let's use Louvain, just on that biggest component and see if we can break it up some more.

If you're not sure how Louvain works, check out the [Louvain docs](https://neo4j.com/docs/graph-data-science/current/algorithms/louvain/).

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('giant_component_graph')
    print("Dropped existing projection")
except Exception:
    pass

# Project only the giant component (wcc_id = 0)
projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WHERE source.wcc_id = 0 AND target.wcc_id = 0
    WITH gds.graph.project(
        'giant_component_graph',
        source,
        target,
        {},
        {undirectedRelationshipTypes: ['*']}
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created giant component projection:")
display(projection)

In [None]:
# Run Louvain community detection
print("Running Louvain community detection...\n")

G = gds.graph.get('giant_component_graph')

result = gds.louvain.write(
    G,
    writeProperty='louvain_community',
    maxLevels=10,
    includeIntermediateCommunities=False
)

print(f"Louvain completed successfully!")
print(f"Properties written: {result['nodePropertiesWritten']:,}")
print(f"Communities found: {result['communityCount']:,}")
print(f"Modularity: {result['modularity']:.4f}")
print(f"Levels: {result['ranLevels']}")

# Display the full Louvain result summary
pd.DataFrame([result])

On my run of this, Louvain has returned 262 communities with an overall modularity of 0.72 -- that's quite high.

With such a high modularity, we can assume that most of these communities are relatively insular, rarely speaking to each other outside of their social groups.
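For intuition on what that number measures, here is a minimal sketch of modularity for an unweighted, undirected graph: for each community, the fraction of edges that fall inside it minus the fraction you would expect if edges were placed at random with the same degrees. The toy graph and the `modularity` helper are illustrative; GDS computes this internally and also handles weights.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the community
    degree_sum = defaultdict(int)  # sum of node degrees per community
    for a, b in edges:
        degree_sum[community[a]] += 1
        degree_sum[community[b]] += 1
        if community[a] == community[b]:
            internal[community[a]] += 1
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Two triangles joined by a single bridge edge (c-d)
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {n: 0 for n in two_groups}

print(f"two communities: {modularity(edges, two_groups):.3f}")
print(f"one community:   {modularity(edges, one_group):.3f}")
```

Splitting the toy graph at the bridge scores clearly higher than lumping everything together, which is the same signal Louvain optimises for at scale.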

Let's get another distribution.

In [None]:
# Check the distribution of community sizes
pd.set_option('display.max_rows', 50)

community_sizes = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.louvain_community IS NOT NULL
    WITH u.louvain_community AS community, count(*) AS size
    RETURN community, size
    ORDER BY size DESC
""")

print(f"Found {len(community_sizes)} Louvain communities")
print("\nTop 20 largest communities:")
display(community_sizes.head(20))

print(f"\nCommunity size statistics:")
print(f"  Min: {community_sizes['size'].min()}")
print(f"  Max: {community_sizes['size'].max()}")
print(f"  Mean: {community_sizes['size'].mean():.2f}")
print(f"  Median: {community_sizes['size'].median():.2f}")

In [None]:
## Drop the graph
gds.graph.drop('giant_component_graph')


That's a pretty big difference -- we've now got a relatively even spread of communities across the board.

This is still quite a lot of people to run Jaro-Winkler on -- a naive comparison is quadratic, because every node has to be checked against every other node in the community.
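To see why the quadratic cost matters, and why the query that follows only compares names sharing the same first letter, here is a small sketch that counts candidate pairs with and without that kind of blocking. The names are made up and the helpers are illustrative.

```python
from collections import Counter


def all_pairs(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocked_pairs(names) -> int:
    """Pairs left to compare when only names with the same first letter are paired."""
    blocks = Counter(name[0].lower() for name in names if name)
    return sum(all_pairs(size) for size in blocks.values())


names = [
    "ken lay", "kenneth lay", "klay",
    "jeff skilling", "jeffrey skilling",
    "greg whalley",
]

print(f"unblocked: {all_pairs(len(names))} comparisons")
print(f"blocked:   {blocked_pairs(names)} comparisons")
print(f"5,000 names unblocked: {all_pairs(5000):,} comparisons")
```

Blocking trades a little recall (two aliases with different first letters will never be compared) for a large cut in the number of comparisons.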

Still, for the hell of it, let's test on one community -- the largest -- just to see how we're doing.

In [None]:
# Get the largest community for testing
largest_community = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.louvain_community IS NOT NULL
    WITH u.louvain_community AS community, count(*) AS size
    RETURN community, size
    ORDER BY size DESC
    LIMIT 1
""")
community_id = largest_community['community'].iloc[0]
community_size = largest_community['size'].iloc[0]

print(f"Processing community {community_id} (size: {community_size})...\n")

import time
start_time = time.time()

matches = gds.run_cypher("""
    MATCH (u1:User)
    WHERE u1.louvain_community = $community_id
      AND u1.nameNormalized IS NOT NULL
      AND u1.nameNormalized <> ''
    
    CALL (u1) {
        WITH u1
        MATCH (u2:User)
        WHERE u2.louvain_community = $community_id
          AND u2.nameNormalized IS NOT NULL
          AND u2.nameNormalized <> ''
          AND u1 < u2
          AND substring(toLower(u1.nameNormalized), 0, 1) = substring(toLower(u2.nameNormalized), 0, 1)
        
        WITH u1, u2,
             apoc.text.jaroWinklerDistance(
                 toLower(u1.nameNormalized), 
                 toLower(u2.nameNormalized)
             ) AS distance
        
        WHERE distance <= 0.08
        
        RETURN u1.nameRaw AS name1,
               u2.nameRaw AS name2,
               u1.nameNormalized AS norm1,
               u2.nameNormalized AS norm2,
               round(distance * 1000) / 1000.0 AS jw_distance,
               round((1 - distance) * 1000) / 1000.0 AS similarity_score
    } IN TRANSACTIONS OF 100 ROWS
    
    RETURN $community_id AS community,
           name1, name2, norm1, norm2, jw_distance, similarity_score
    ORDER BY jw_distance DESC
""", params={'community_id': community_id})

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(matches)} potential matches")

if len(matches) > 0:
    print(f"\nMatch statistics:")
    print(f"  Min distance: {matches['jw_distance'].min():.3f}")
    print(f"  Max distance: {matches['jw_distance'].max():.3f}")
    print(f"  Mean distance: {matches['jw_distance'].mean():.3f}")
    
    display(matches.head(50))
else:
    print("No matches found")

We've got some pretty good matches here, but still have some poor samples pushing through.

Many of the issues stem from the failure to strip out the '(E-mail)' and '\(E-mail\)' tags in the name headers.

You can fix things like this on import, or wait until they appear in your resolution work. I prefer the latter for two reasons:

1. It reminds you that the data is dirty -- overconfidence can lead to spurious resolutions
2. Sometimes, the things you strip on import turn out to be important later on. For instance, something as innocuous as '(E-mail)' might help identify the email client a message was sent from or to.

Let's make a new property, `nameNormStrip`, where we remove all of the mess.
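If you prefer to prototype the clean-up outside the database first, a regular expression makes the rule easier to test. The sketch below is an assumption-laden illustration: the tag list is taken from the suffixes mentioned in this notebook, matching is case-insensitive, and `strip_name_tags` is a hypothetical helper rather than part of the import pipeline.

```python
import re

# Assumed tag list: (E-mail), (E-mail 2), (Personal), with optional
# backslash-escaped parentheses, matched case-insensitively
_TAG_PATTERN = re.compile(
    r"\\?\(\s*(?:e-mail(?:\s*\d+)?|personal)\s*\\?\)",
    re.IGNORECASE,
)


def strip_name_tags(name: str) -> str:
    """Remove known header tags and collapse leftover whitespace."""
    stripped = _TAG_PATTERN.sub("", name)
    return " ".join(stripped.split())


samples = [
    "ken lay (E-mail)",
    "ken lay \\(E-mail\\)",
    "Greg Whalley (E-Mail 2)",
    "Jane Doe (Personal)",
    "Ken Lay",
]
for raw in samples:
    print(f"{raw!r} -> {strip_name_tags(raw)!r}")
```

Keeping the raw and normalised values alongside the stripped one means nothing is lost if a tag later turns out to be informative.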

In [None]:
# Create nameNormStrip property by removing email suffixes
print("Creating nameNormStrip property...\n")

import time
start_time = time.time()

result = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.nameNormalized IS NOT NULL
    
    CALL (u) {
        WITH u
        WITH u,
             replace(
                 replace(
                     replace(
                         replace(u.nameNormalized, '\\(E-Mail\\)', ''),
                         '(E-Mail)', ''
                     ),
                     '(E-Mail 2)', ''
                 ),
                 '(Personal)', ''
             ) AS stripped
        SET u.nameNormStrip = trim(stripped)
    } IN TRANSACTIONS OF 1000 ROWS
    
    RETURN count(*) AS nodes_updated
""")

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
display(result)

And let's run it again using the newly stripped property.

In [None]:
# Get the largest community for testing
largest_community = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.louvain_community IS NOT NULL
    WITH u.louvain_community AS community, count(*) AS size
    RETURN community, size
    ORDER BY size DESC
    LIMIT 1
""")
community_id = largest_community['community'].iloc[0]
community_size = largest_community['size'].iloc[0]

print(f"Testing with largest community: {community_id} (size: {community_size})")

import time
start_time = time.time()

matches = gds.run_cypher("""
    MATCH (u1:User)
    WHERE u1.louvain_community = $community_id
      AND u1.nameNormStrip IS NOT NULL
      AND u1.nameNormStrip <> ''
    
    CALL (u1) {
        WITH u1
        MATCH (u2:User)
        WHERE u2.louvain_community = $community_id
          AND u2.nameNormStrip IS NOT NULL
          AND u2.nameNormStrip <> ''
          AND u1 < u2
          AND substring(toLower(u1.nameNormStrip), 0, 1) = substring(toLower(u2.nameNormStrip), 0, 1)
        
        WITH u1, u2,
             apoc.text.jaroWinklerDistance(
                 toLower(u1.nameNormStrip), 
                 toLower(u2.nameNormStrip)
             ) AS distance
        
        WHERE distance <= 0.08
        
        RETURN u1.nameRaw AS name1,
               u2.nameRaw AS name2,
               u1.nameNormStrip AS normStrip1,
               u2.nameNormStrip AS normStrip2,
               round(distance * 1000) / 1000.0 AS jw_distance,
               round((1 - distance) * 1000) / 1000.0 AS similarity_score
    } IN TRANSACTIONS OF 100 ROWS
    
    RETURN $community_id AS community,
           name1, name2, normStrip1, normStrip2, jw_distance, similarity_score
    ORDER BY jw_distance DESC
""", params={'community_id': community_id})

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(matches)} potential matches")

if len(matches) > 0:
    print(f"\nMatch statistics:")
    print(f"  Min distance: {matches['jw_distance'].min():.3f}")
    print(f"  Max distance: {matches['jw_distance'].max():.3f}")
    print(f"  Mean distance: {matches['jw_distance'].mean():.3f}")
    
    display(matches.head(50))
else:
    print("No matches found")

At this point, you may find yourself thinking, 'if only we added a few more rules, we could get more names to match'. That would be a mistake, leading to hours of adding 'just one more rule'.

With entity resolution, you should aim for an approach that generalises reasonably well. The idiosyncrasies in one region of the graph may differ entirely from those in another. So, yes, add more rules, but don't go overboard.

Better yet, let's filter our graph further.

## 5. Node Similarity

In this particular graph, we don't have many unique features that can identify an individual, other than the fact that they used the same mailbox.

The same is often true of graphs built from unstructured text.

However, node similarity is still useful here, as you will see.

First, let's project a graph of one of our communities.

We're only going to project `(User)-[:USED]->(Mailbox)` -- this gives us a bipartite graph where no users connect directly to each other, and no mailboxes do either.

In this projection, Node Similarity will look for Users whose mailboxes overlap across the community. Let's run it first, before we look at why it's powerful.

In [None]:
# Get the largest community for testing
largest_community = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.louvain_community IS NOT NULL
    WITH u.louvain_community AS community, count(*) AS size
    RETURN community, size
    ORDER BY size DESC
    LIMIT 1
""")
community_id = largest_community['community'].iloc[0]
community_size = largest_community['size'].iloc[0]

print(f"Testing with largest community: {community_id} (size: {community_size})")

# Clean up any existing projection
try:
    gds.graph.drop('user_mailbox_similarity')
    print("Dropped existing projection")
except Exception:
    pass

# Create bipartite projection: User -> Mailbox
print("Creating User-Mailbox bipartite projection...\n")

projection = gds.run_cypher("""
    MATCH (source:User)-[r:USED]->(target:Mailbox)
    WHERE source.louvain_community = $community_id
    WITH gds.graph.project(
        'user_mailbox_similarity',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""", params={'community_id': community_id})

print("Created bipartite projection:")
display(projection)

Now that we've projected it, we can run node similarity -- same as with any algorithm.

In [None]:
# Stream Node Similarity results
print("Running Node Similarity (streaming)...\n")

import time
start_time = time.time()

similarity_results = gds.run_cypher("""
    CALL gds.nodeSimilarity.stream('user_mailbox_similarity', {
        similarityCutoff: 0.5,
        topK: 5,
        similarityMetric: 'overlap'
    })
    YIELD node1, node2, similarity
    WITH gds.util.asNode(node1) AS u1, gds.util.asNode(node2) AS u2, similarity
    WHERE u1.nameNormStrip IS NOT NULL AND u2.nameNormStrip IS NOT NULL
    RETURN u1.nameNormStrip AS name1,
           u2.nameNormStrip AS name2,
           u1.nameNormalized AS norm1,
           u2.nameNormalized AS norm2,
           round(similarity * 1000) / 1000.0 AS similarity_score
    ORDER BY similarity DESC
""")

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(similarity_results)} similarity pairs")

if len(similarity_results) > 0:
    print(f"\nSimilarity statistics:")
    print(f"  Min: {similarity_results['similarity_score'].min():.3f}")
    print(f"  Max: {similarity_results['similarity_score'].max():.3f}")
    print(f"  Mean: {similarity_results['similarity_score'].mean():.3f}")
    print(f"  Median: {similarity_results['similarity_score'].median():.3f}")
    
    pd.set_option('display.max_rows', None)  # Show all rows
    display(similarity_results)

Let's try this now on a projection of the full graph, keeping our degree centrality filter from earlier.

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('user_mailbox_similarity_full')
    print("Dropped existing projection")
except Exception:
    pass

# Create bipartite projection: User -> Mailbox (filtered by degree centrality)
print("Creating User-Mailbox bipartite projection for entire dataset...\n")

projection = gds.run_cypher("""
    MATCH (source:User)-[r:USED]->(target:Mailbox)
    WHERE target.email_degree IS NOT NULL
      AND target.email_degree <= 19
    WITH gds.graph.project(
        'user_mailbox_similarity_full',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created bipartite projection:")
display(projection)

In [None]:
# Stream Node Similarity results for entire dataset
print("Running Node Similarity (streaming) on full dataset...\n")

import time
start_time = time.time()

similarity_results = gds.run_cypher("""
    CALL gds.nodeSimilarity.stream('user_mailbox_similarity_full', {
        similarityCutoff: 0.5,
        topK: 5,
        similarityMetric: 'overlap'
    })
    YIELD node1, node2, similarity
    WITH gds.util.asNode(node1) AS u1, gds.util.asNode(node2) AS u2, similarity
    WHERE u1.nameNormStrip IS NOT NULL AND u2.nameNormStrip IS NOT NULL
    RETURN u1.nameNormStrip AS name1,
           u2.nameNormStrip AS name2,
           u1.nameNormalized AS norm1,
           u2.nameNormalized AS norm2,
           round(similarity * 1000) / 1000.0 AS similarity_score
    ORDER BY similarity DESC
""")

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(similarity_results)} similarity pairs")

if len(similarity_results) > 0:
    print(f"\nSimilarity statistics:")
    print(f"  Min: {similarity_results['similarity_score'].min():.3f}")
    print(f"  Max: {similarity_results['similarity_score'].max():.3f}")
    print(f"  Mean: {similarity_results['similarity_score'].mean():.3f}")
    print(f"  Median: {similarity_results['similarity_score'].median():.3f}")
    
    pd.set_option('display.max_rows', None)
    display(similarity_results)

We could accept these results and just merge the entities.

There would be problems for sure, but it would be useable.

Node Similarity here has compared each User node against the other User nodes in terms of shared neighbours (in this projection, Mailbox nodes). It has then scored each pair by overlap. You could also use Jaccard or Cosine.

Each of the three has its benefits:

- **Jaccard:** Good as a baseline
- **Overlap:** Good for when you have missing data
- **Cosine:** Best for when you have complete data across the set
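The difference between the three metrics is easiest to see on plain sets of neighbours. The sketch below uses the standard set-based definitions; the cosine form shown is the one that applies when every relationship has weight 1, which is an assumption about how the projection is configured. The mailbox sets are made up.

```python
import math


def jaccard(a: set, b: set) -> float:
    """Shared neighbours divided by all distinct neighbours."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a: set, b: set) -> float:
    """Shared neighbours divided by the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(a: set, b: set) -> float:
    """Cosine similarity for unweighted neighbour sets."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


# Hypothetical mailbox sets for two User nodes
ken = {"klay@enron.com", "kenneth.lay@enron.com", "ken.lay@enron.com"}
kenneth = {"klay@enron.com"}

print(f"jaccard: {jaccard(ken, kenneth):.3f}")
print(f"overlap: {overlap(ken, kenneth):.3f}")
print(f"cosine:  {cosine(ken, kenneth):.3f}")
```

Overlap ignores the mailboxes that only one of the two users has, which is why it is forgiving when one alias was only ever seen with a single address.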

For now, we're not going to merge these -- as there is another technique that can be of some use to you, and which will likely work better on this particular dataset. However, if you find yourself working with things like financial data, census data, etc, where users can have multiple unique identifiers and descriptors, Node Similarity can often turn out to be the simplest, most effective method of disambiguation.

In [None]:
gds.graph.drop('user_mailbox_similarity_full')
print("Dropped existing projection")

## 6. FastRP + KNN

FastRP is an algorithm that turns the topology of your nodes into a vector representation. There are a lot of ways to finesse this, and we won't go into those in detail here.

You can get a deeper walkthrough of this algorithm in [this notebook exploring GDS with Python](https://github.com/henrardo/workshop-gds-repo).

For now, let's just run a basic workflow, and see how we can use this to generate embeddings to encode our nodes' positions.

Then, we'll run KNN to find the most similar nodes.
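Conceptually, the KNN step takes each node's embedding and keeps the handful of other nodes whose vectors point in the most similar direction. The sketch below does this by brute force with raw cosine similarity on tiny, made-up vectors. Note that this is a swapped-in exhaustive search for illustration: GDS uses an approximate algorithm and may rescale the similarity score, so the numbers will not match one to one.

```python
import math


def cosine_similarity(u, v) -> float:
    """Raw cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k_neighbours(embeddings, k=2, cutoff=0.0):
    """For each node, return up to k most similar other nodes above the cutoff."""
    result = {}
    for name, vec in embeddings.items():
        scored = [
            (other, cosine_similarity(vec, other_vec))
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored = [(other, score) for other, score in scored if score >= cutoff]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[name] = scored[:k]
    return result


# Hypothetical 3-dimensional embeddings (real FastRP vectors are much longer)
embeddings = {
    "ken lay": [1.0, 0.0, 0.2],
    "kenneth lay": [0.9, 0.1, 0.2],
    "jeff skilling": [0.0, 1.0, 0.1],
}

neighbours = top_k_neighbours(embeddings, k=1)
for name, matches in neighbours.items():
    print(name, "->", matches)
```

A high cutoff keeps only pairs that sit in almost exactly the same position in the graph, which is the behaviour we want for alias detection.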

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('full_graph_fastrp')
    print("Dropped existing projection")
except Exception:
    pass

# Create projection with ALL nodes and relationships
print("Creating full graph projection for FastRP...\n")

projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WITH gds.graph.project(
        'full_graph_fastrp',
        source,
        target
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created full graph projection:")
display(projection)

In [None]:
# Clean up any existing projection
try:
    gds.graph.drop('full_graph_fastrp')
    print("Dropped existing projection")
except Exception:
    pass

# Create projection with ALL nodes and relationships
print("Creating full graph projection for FastRP...\n")

projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WITH gds.graph.project(
        'full_graph_fastrp',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created full graph projection:")
display(projection)

In [None]:
# Run FastRP to generate embeddings on the full graph
print("Running FastRP to generate node embeddings on full graph...\n")

G = gds.graph.get('full_graph_fastrp')

fastrp_result = gds.fastRP.write(
    G,
    writeProperty='fastrp_embedding_full',
    embeddingDimension=128,
    iterationWeights=[0.0, 0.5, 0.5, 0.5, 0.5],
    randomSeed=42
)

print(f"FastRP completed successfully!")
print(f"Properties written: {fastrp_result['nodePropertiesWritten']:,}")

pd.DataFrame([fastrp_result])

In [None]:
# Create a new projection with User nodes and their embeddings for KNN
try:
    gds.graph.drop('user_knn_full_graph')
    print("Dropped existing KNN projection")
except Exception:
    pass

print("\nCreating projection for KNN with embeddings from full graph...\n")

knn_projection = gds.run_cypher("""
    MATCH (source:User)
    WHERE source.fastrp_embedding_full IS NOT NULL
    WITH gds.graph.project(
        'user_knn_full_graph',
        source,
        null,
        {
            sourceNodeProperties: source {
                .fastrp_embedding_full
            },
            targetNodeProperties: null
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created KNN projection:")
display(knn_projection)

In [None]:
# Run KNN to find similar users based on full graph embeddings
print("Running KNN to find similar users based on full graph context...\n")

import time
start_time = time.time()

knn_results = gds.run_cypher("""
    CALL gds.knn.stream('user_knn_full_graph', {
        nodeProperties: ['fastrp_embedding_full'],
        topK: 5,
        sampleRate: 1.0,
        deltaThreshold: 0.0,
        concurrency: 4,
        similarityCutoff: 0.999
    })
    YIELD node1, node2, similarity
    WITH gds.util.asNode(node1) AS u1, gds.util.asNode(node2) AS u2, similarity
    WHERE u1.nameNormStrip IS NOT NULL 
      AND u2.nameNormStrip IS NOT NULL
    RETURN u1.nameNormStrip AS name1,
           u2.nameNormStrip AS name2,
           u1.nameNormalized AS norm1,
           u2.nameNormalized AS norm2,
           similarity AS similarity_score
    ORDER BY similarity DESC
""")

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(knn_results)} similar pairs")

if len(knn_results) > 0:
    print(f"\nKNN Similarity statistics:")
    print(f"  Min: {knn_results['similarity_score'].min():.3f}")
    print(f"  Max: {knn_results['similarity_score'].max():.3f}")
    print(f"  Mean: {knn_results['similarity_score'].mean():.3f}")
    print(f"  Median: {knn_results['similarity_score'].median():.3f}")
    
    pd.set_option('display.max_rows', 100)
    display(knn_results.head(100))
else:
    print("No similar pairs found")

In [None]:
# Clean up
gds.graph.drop('full_graph_fastrp')
gds.graph.drop('user_knn_full_graph')
print("Dropped projections")

First, we'll convert our date strings into floats. Node features passed to FastRP must always be encoded as floats or lists of floats.
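
To see what the Cypher in the next cell computes, here is a rough client-side equivalent in plain Python. It is only a sketch: the helper name `iso_to_epoch_millis` is made up for illustration, it assumes the `date` property holds ISO 8601 strings, and it treats timestamps without an offset as UTC (an assumption, not something the dataset guarantees).

```python
from datetime import datetime, timezone


def iso_to_epoch_millis(date_string: str) -> float:
    """Convert an ISO 8601 string to epoch milliseconds as a float."""
    parsed = datetime.fromisoformat(date_string)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return float(round(parsed.timestamp() * 1000))


# Hypothetical sample value, not taken from the dataset
print(iso_to_epoch_millis("2001-05-14T16:39:00+00:00"))
```

The result is a single large float per node, which is why we scale it before handing it to FastRP.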

In [None]:
# Step 1: Convert date strings to float timestamps and store as a new property
print("Step 1: Converting date strings to float timestamps...\n")

import time
start_time = time.time()

date_conversion = gds.run_cypher("""
    MATCH (n)
    WHERE n.date IS NOT NULL
    
    CALL (n) {
        WITH n
        SET n.dateFloat = toFloat(datetime(n.date).epochMillis)
    } IN TRANSACTIONS OF 1000 ROWS
    
    RETURN count(*) AS nodes_updated
""")

elapsed_time = time.time() - start_time
print(f"Completed in {elapsed_time:.2f} seconds")
display(date_conversion)

We can improve the accuracy of FastRP by including node features. This allows us to infer the attributes of nodes using one or more algorithms and then have those encoded as part of the vectors output by FastRP.

In this case, let's compute degree centrality for every node in the graph. This should naturally separate generic, high-volume email nodes from specific, lower-volume user nodes.
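
As a minimal sketch of what undirected degree centrality measures, the snippet below counts how many relationships touch each node in a toy edge list. The function name `undirected_degree` and the sample data are hypothetical; GDS does this at scale over the projection.

```python
from collections import Counter


def undirected_degree(edges):
    """Count the relationships touching each node, ignoring direction."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return dict(degree)


# Toy data (hypothetical): one broadcast mailbox and two personal ones
edges = [
    ("alice", "all-staff"), ("bob", "all-staff"), ("carol", "all-staff"),
    ("alice", "alice-inbox"), ("bob", "bob-inbox"),
]
degrees = undirected_degree(edges)
print(degrees)
```

The broadcast mailbox ends up with a much higher degree than the personal ones, which is the separation we're after.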

In [None]:
# Step 2: Create standard graph projection and run degree centrality
try:
    gds.graph.drop('full_graph_for_degree')
    print("Dropped existing projection")
except Exception:
    pass

print("\nStep 2: Creating full graph projection for degree centrality...\n")

degree_projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WITH gds.graph.project(
        'full_graph_for_degree',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        },
        {
            undirectedRelationshipTypes: ['*']
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created projection for degree centrality:")
display(degree_projection)

In [None]:
# Run Degree Centrality (undirected) and write to graph
print("Running Degree Centrality (undirected)...\n")

G = gds.graph.get('full_graph_for_degree')

degree_result = gds.degree.write(
    G,
    writeProperty='degree_centrality_undirected',
    orientation='UNDIRECTED'
)

print(f"Degree Centrality completed successfully!")
print(f"Properties written: {degree_result['nodePropertiesWritten']:,}")

pd.DataFrame([degree_result])

In [None]:
# Clean up degree projection
gds.graph.drop('full_graph_for_degree')
print("Dropped degree projection")

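We could add these node properties to our graph as is, but we'll get better results if we scale them first: `dateFloat` is in epoch milliseconds, so it is many orders of magnitude larger than a degree count and would otherwise dominate. So, let's now project a new graph to create the scaled properties.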
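
Here is a small sketch of MinMax scaling, which maps each value to `(x - min) / (max - min)` so that every feature lands in the range 0 to 1. The function name `min_max_scale` and the sample values are illustrative only, and returning 0.0 for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Scale a list of numbers into [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Sketch-level choice: a constant column carries no signal.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw features: degree counts and epoch-millisecond dates
degrees = [1.0, 3.0, 250.0]
date_floats = [9.8e11, 9.9e11, 1.0e12]

# One [degree, date] feature pair per node, both now on the same scale
scaled = list(zip(min_max_scale(degrees), min_max_scale(date_floats)))
print(scaled)
```

After scaling, neither feature can swamp the other just because of its units.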

In [None]:
# Step 3: Create projection with properties for scaling
try:
    gds.graph.drop('graph_for_scaling')
    print("Dropped existing projection")
except Exception:
    pass

print("\nStep 3: Creating projection for scaling properties...\n")

scaling_projection = gds.run_cypher("""
    MATCH (n)
    RETURN gds.graph.project(
        'graph_for_scaling',
        n,
        null,
        {
            sourceNodeProperties: n { 
                dateFloat: coalesce(n.dateFloat, 0.0),
                degree_centrality_undirected: coalesce(n.degree_centrality_undirected, 0.0)
            },
            targetNodeProperties: {}
        }
    ) AS g
""")

print("Created projection for scaling:")
display(scaling_projection)

In [None]:
# Step 4: Scale properties using GDS Scale Properties algorithm
print("\nStep 4: Scaling properties using MinMax scaler...\n")

G_scale = gds.graph.get('graph_for_scaling')

scale_result = gds.scaleProperties.mutate(
    G_scale,
    nodeProperties=['degree_centrality_undirected', 'dateFloat'],
    scaler='MinMax',
    mutateProperty='scaledFeatures'
)

print(f"Scale Properties completed successfully!")
print(f"Properties mutated: {scale_result['nodePropertiesWritten']:,}")

pd.DataFrame([scale_result])

In [None]:
# Step 5: Write the scaled properties back to Neo4j
print("\nStep 5: Writing scaled properties back to Neo4j...\n")

write_scaled = gds.run_cypher("""
    CALL gds.graph.nodeProperty.stream('graph_for_scaling', 'scaledFeatures')
    YIELD nodeId, propertyValue
    WITH gds.util.asNode(nodeId) AS node, propertyValue
    SET node.scaledFeatures = propertyValue
    RETURN count(*) AS nodes_updated
""")

print("Wrote scaled properties:")
display(write_scaled)

# Clean up scaling projection
gds.graph.drop('graph_for_scaling')
print("Dropped scaling projection")

In [None]:
# Step 6: Create final projection with scaled properties for FastRP
try:
    gds.graph.drop('full_graph_fastrp_scaled')
    print("Dropped existing projection")
except Exception:
    pass

print("\nStep 6: Creating final projection with scaled properties...\n")

fastrp_projection = gds.run_cypher("""
    MATCH (source)-[r]->(target)
    WITH gds.graph.project(
        'full_graph_fastrp_scaled',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            sourceNodeProperties: source { 
                .scaledFeatures
            },
            targetNodeProperties: target { 
                .scaledFeatures
            },
            relationshipType: type(r)
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created projection for FastRP:")
display(fastrp_projection)

In [None]:
# Step 7: Run FastRP with scaled features
print("\nStep 7: Running FastRP with scaled degree and date features...\n")

G = gds.graph.get('full_graph_fastrp_scaled')

fastrp_result = gds.fastRP.write(
    G,
    writeProperty='fastrp_embedding_scaled',
    embeddingDimension=128,
    featureProperties=['scaledFeatures'],
    iterationWeights=[0.0, 1.0, 1.0],
    randomSeed=42
)

print(f"FastRP completed successfully!")
print(f"Properties written: {fastrp_result['nodePropertiesWritten']:,}")

pd.DataFrame([fastrp_result])

In [None]:
gds.graph.drop('full_graph_fastrp_scaled')
print("Dropped existing projection")

In [None]:
# Step 8: Create KNN projection with User nodes
try:
    gds.graph.drop('user_knn_scaled')
    print("Dropped existing KNN projection")
except Exception:
    pass

print("\nStep 8: Creating projection for KNN with scaled embeddings...\n")

knn_projection = gds.run_cypher("""
    MATCH (source:User)
    WHERE source.fastrp_embedding_scaled IS NOT NULL
    WITH gds.graph.project(
        'user_knn_scaled',
        source,
        null,
        {
            sourceNodeProperties: source {
                .fastrp_embedding_scaled
            },
            targetNodeProperties: {}
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print("Created KNN projection:")
display(knn_projection)

In [None]:
# Run KNN to find similar users based on the feature-enriched (scaled) embeddings
print("Running KNN to find similar users based on scaled FastRP embeddings...\n")

import time
start_time = time.time()

knn_results = gds.run_cypher("""
    CALL gds.knn.stream('user_knn_scaled', {
        nodeProperties: ['fastrp_embedding_scaled'],
        topK: 5,
        sampleRate: 1.0,
        deltaThreshold: 0.0,
        concurrency: 4,
        similarityCutoff: 0.999
    })
    YIELD node1, node2, similarity
    WITH gds.util.asNode(node1) AS u1, gds.util.asNode(node2) AS u2, similarity
    WHERE u1.nameNormStrip IS NOT NULL 
      AND u2.nameNormStrip IS NOT NULL
    RETURN u1.nameNormStrip AS name1,
           u2.nameNormStrip AS name2,
           u1.nameNormalized AS norm1,
           u2.nameNormalized AS norm2,
           similarity AS similarity_score
    ORDER BY similarity DESC
""")

elapsed_time = time.time() - start_time

print(f"Completed in {elapsed_time:.2f} seconds")
print(f"Found {len(knn_results)} similar pairs")

if len(knn_results) > 0:
    print(f"\nKNN Similarity statistics:")
    print(f"  Min: {knn_results['similarity_score'].min():.3f}")
    print(f"  Max: {knn_results['similarity_score'].max():.3f}")
    print(f"  Mean: {knn_results['similarity_score'].mean():.3f}")
    print(f"  Median: {knn_results['similarity_score'].median():.3f}")
    
    pd.set_option('display.max_rows', 100)
    display(knn_results.head(100))
else:
    print("No similar pairs found")

In [None]:
gds.graph.drop('user_knn_scaled')
print("Dropped existing projection")

## 7. Resolving entities

So, here we find ourselves with a few different methods for getting to high-quality resolutions.

1. Community Detection (WCC, Leiden, Louvain) can help us to identify certain sectors within the graph.
2. Node Similarity can help us to identify structurally similar nodes within the entire graph, or even down to just those sectors (good for when the graph is large)
3. FastRP encodes the topology around every node as a vector, and KNN over those vectors gives us similarity another way (more approximate than Node Similarity).
4. Basic text-matching rules, which you may already be used to, come into play at the very end of this process.
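
To make the FastRP + KNN idea concrete, here is a brute-force sketch that ranks neighbors by cosine similarity between embedding vectors. It assumes cosine is the metric in play (as far as I know, that is what GDS KNN uses by default for float-list properties), and the function names and three-dimensional toy vectors are hypothetical. GDS KNN uses an approximate search rather than comparing every pair.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(embeddings, k=5, cutoff=0.0):
    """For each node, return up to k most similar other nodes above cutoff."""
    results = {}
    for name, vector in embeddings.items():
        scored = [
            (other, cosine_similarity(vector, other_vector))
            for other, other_vector in embeddings.items()
            if other != name
        ]
        scored = [pair for pair in scored if pair[1] >= cutoff]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        results[name] = scored[:k]
    return results


# Toy embeddings (hypothetical values, far shorter than real FastRP output)
embeddings = {
    "ken.lay": [0.9, 0.1, 0.0],
    "kenneth.lay": [0.88, 0.12, 0.01],
    "jeff.skilling": [0.1, 0.9, 0.2],
}
neighbors = top_k_neighbors(embeddings, k=1)
print(neighbors["ken.lay"])
```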

The theory here is that you can define some 'business rules' as a series of these filtering steps: 'if two people are this similar, and also have closely matching names, they are likely to be the same person'. If you have enough supporting data, you could theoretically skip the name-matching step entirely.
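
Here is a minimal sketch of such a rule in plain Python. It uses the textbook definition of Jaro-Winkler similarity, which may differ in small details from APOC's implementation, and the function names are hypothetical. The default thresholds of 0.5 and 0.15 mirror the ones used later in this notebook.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i, ch in enumerate(s1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if ch != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


def is_likely_same(node_similarity, name1, name2,
                   similarity_cutoff=0.5, jw_threshold=0.15):
    """Business rule: structurally similar AND closely matching names."""
    jw_distance = 1.0 - jaro_winkler_similarity(name1.lower(), name2.lower())
    return node_similarity >= similarity_cutoff and jw_distance <= jw_threshold


print(is_likely_same(0.8, "Martha", "Marhta"))
print(is_likely_same(0.9, "Ken Lay", "Jeff Skilling"))
```

Both signals have to agree: a strong structural match with very different names is rejected, and so is a near-identical name with weak structural overlap.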

So, now, let's use everything we've done so far to identify a query that will allow us to resolve entities -- even with these limited data.

In [None]:
# First, let's get the list of Louvain communities with their sizes
# We'll process each community separately to keep memory manageable
# Not wildly important at this scale, but good practice for larger graphs

print("Getting Louvain communities to process...\n")

communities = gds.run_cypher("""
    MATCH (u:User)
    WHERE u.louvain_community IS NOT NULL
      AND u.nameNormStrip IS NOT NULL
      AND u.nameNormStrip <> ''
    WITH u.louvain_community AS community, count(*) AS size
    WHERE size >= 2  // Need at least 2 users to find matches
    RETURN community, size
    ORDER BY size DESC
""")

print(f"Found {len(communities)} communities with 2+ users")
print(f"Total users across communities: {communities['size'].sum():,}")
print(f"\nLargest 10 communities:")
display(communities.head(10))

This is essentially the resolution function we're defining. This cell just defines it; the next cell runs it in whatever way we specify.

You could, for example, run this one by one per community or for every community, one after the other in a single run.
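
The Node Similarity step in the function uses the `OVERLAP` metric. As a rough sketch, the overlap coefficient is the size of the intersection of two neighbor sets divided by the size of the smaller set. The function name `overlap_similarity` and the mailbox names below are hypothetical.

```python
def overlap_similarity(neighbors_a, neighbors_b):
    """Overlap coefficient: |A & B| / min(|A|, |B|)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Toy data (hypothetical): mailboxes used by two User nodes
ken = {"ken.lay@enron.com", "klay@enron.com", "kenneth.lay@enron.com"}
kenneth = {"klay@enron.com", "kenneth.lay@enron.com"}

print(overlap_similarity(ken, kenneth))
```

One consequence of this metric is that a user whose mailboxes are a strict subset of another user's scores 1.0, which suits aliases that only ever appear on a few of a person's addresses.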

In [None]:
import time

def resolve_entities_in_community(gds, community_id, similarity_cutoff=0.5, jw_threshold=0.15):
    """
    Run entity resolution for a single Louvain community.
    
    Steps:
    1. Project User->Mailbox bipartite graph for this community
    2. Run Node Similarity to find structurally similar users
    3. Filter by Jaro-Winkler name similarity
    
    Parameters:
    - community_id: The Louvain community ID to process
    - similarity_cutoff: Minimum node similarity score (default 0.5)
    - jw_threshold: Maximum Jaro-Winkler distance (default 0.15, i.e. 85% similarity)
    
    Returns:
    - DataFrame of matched entity pairs
    """
    
    graph_name = f'community_{community_id}_bipartite'
    
    # Step 1: Clean up any existing projection
    try:
        gds.graph.drop(graph_name)
    except Exception:
        pass
    
    # Step 2: Project User→Mailbox bipartite graph for this community
    projection = gds.run_cypher("""
        MATCH (source:User)-[r:USED]->(target:Mailbox)
        WHERE source.louvain_community = $community_id
          AND target.email_degree IS NOT NULL
          AND target.email_degree <= 19
        WITH gds.graph.project(
            $graph_name,
            source,
            target,
            {
                sourceNodeLabels: labels(source),
                targetNodeLabels: labels(target),
                relationshipType: type(r)
            }
        ) AS g
        RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
    """, params={'community_id': community_id, 'graph_name': graph_name})
    
    # Step 3: Run Node Similarity
    # This finds users who share similar mailbox connections
    similarity_results = gds.run_cypher("""
        CALL gds.nodeSimilarity.stream($graph_name, {
            similarityCutoff: $similarity_cutoff,
            topK: 10,
            similarityMetric: 'OVERLAP'
        })
        YIELD node1, node2, similarity
        WITH gds.util.asNode(node1) AS u1, gds.util.asNode(node2) AS u2, similarity
        WHERE u1:User AND u2:User
          AND u1.nameNormStrip IS NOT NULL 
          AND u2.nameNormStrip IS NOT NULL
          AND u1 < u2
        
        // Step 4: Apply Jaro-Winkler filtering
        WITH u1, u2, similarity,
             apoc.text.jaroWinklerDistance(
                 toLower(u1.nameNormStrip), 
                 toLower(u2.nameNormStrip)
             ) AS jw_distance
        
        WHERE jw_distance <= $jw_threshold
        
        RETURN 
            $community_id AS community,
            u1.nameRaw AS user1_raw,
            u2.nameRaw AS user2_raw,
            u1.nameNormStrip AS user1_normalized,
            u2.nameNormStrip AS user2_normalized,
            round(similarity * 1000) / 1000.0 AS node_similarity,
            round((1.0 - jw_distance) * 1000) / 1000.0 AS name_similarity,
            round(jw_distance * 1000) / 1000.0 AS jw_distance
        ORDER BY node_similarity ASC, name_similarity ASC
    """, params={
        'graph_name': graph_name, 
        'similarity_cutoff': similarity_cutoff,
        'jw_threshold': jw_threshold,
        'community_id': community_id
    })
    
    # Step 5: Clean up projection
    try:
        gds.graph.drop(graph_name)
    except Exception:
        pass
    
    return similarity_results

This cell is essentially just grabbing the communities we've already identified, and then running our resolution pass on each one in turn until it's complete.

In [None]:
print("="*80)
print("ENTITY RESOLUTION: Louvain → Node Similarity → Jaro-Winkler")
print("="*80 + "\n")

# Configuration
SIMILARITY_CUTOFF = 0.5    # Minimum node similarity (shared mailbox overlap)
JW_THRESHOLD = 0.15        # Maximum Jaro-Winkler distance (85% name similarity)

all_matches = []
communities_processed = 0
communities_with_matches = 0

start_time = time.time()

for idx, row in communities.iterrows():
    community_id = row['community']
    community_size = row['size']
    
    # Progress indicator for large runs
    if communities_processed % 50 == 0:
        print(f"Processing community {communities_processed + 1}/{len(communities)}...")
    
    # Run resolution for this community
    matches = resolve_entities_in_community(
        gds, 
        community_id,
        similarity_cutoff=SIMILARITY_CUTOFF,
        jw_threshold=JW_THRESHOLD
    )
    
    if len(matches) > 0:
        all_matches.append(matches)
        communities_with_matches += 1
    
    communities_processed += 1

elapsed_time = time.time() - start_time

# Combine all results
if all_matches:
    entity_matches = pd.concat(all_matches, ignore_index=True)
else:
    entity_matches = pd.DataFrame()

print(f"\n{'='*80}")
print("RESULTS SUMMARY")
print(f"{'='*80}")
print(f"  Total time: {elapsed_time:.2f} seconds")
print(f"  Communities processed: {communities_processed}")
print(f"  Communities with matches: {communities_with_matches}")
print(f"  Total entity matches found: {len(entity_matches)}")


Now that we've run it, let's take a look at the results. 

You'll notice here that I'm displaying the results in ascending order of confidence. This is so we can check that even our worst-performing match is still a valid match.

On this run, across the set of 58k User nodes, I found 4,935 high-confidence matches just by running a few algorithms.

In [None]:
# Set pandas display options to show full content
pd.set_option('display.max_colwidth', None)  # Show full column width
pd.set_option('display.max_rows', None)      # Show all rows
pd.set_option('display.max_columns', None)   # Show all columns
pd.set_option('display.width', None)         # Don't wrap to fit console width
pd.set_option('display.max_seq_items', None) # Show all items in lists

if len(entity_matches) > 0:
    # Add confidence score based on both signals
    # Higher node_similarity + higher name_similarity = higher confidence
    entity_matches['confidence_score'] = (
        0.5 * entity_matches['node_similarity'] + 
        0.5 * entity_matches['name_similarity']
    ).round(3)
    
    # Assign confidence levels
    entity_matches['match_confidence'] = pd.cut(
        entity_matches['confidence_score'],
        bins=[0, 0.70, 0.85, 1.0],
        labels=['LOW', 'MEDIUM', 'HIGH']
    )
    
    # Sort by confidence
    entity_matches = entity_matches.sort_values('confidence_score', ascending=True)
    
    print(f"\nMatch Statistics:")
    print(f"  HIGH confidence:   {(entity_matches['match_confidence'] == 'HIGH').sum()}")
    print(f"  MEDIUM confidence: {(entity_matches['match_confidence'] == 'MEDIUM').sum()}")
    print(f"  LOW confidence:    {(entity_matches['match_confidence'] == 'LOW').sum()}")
    print(f"\n  Node Similarity - Min: {entity_matches['node_similarity'].min():.3f}, "
          f"Max: {entity_matches['node_similarity'].max():.3f}, "
          f"Mean: {entity_matches['node_similarity'].mean():.3f}")
    print(f"  Name Similarity - Min: {entity_matches['name_similarity'].min():.3f}, "
          f"Max: {entity_matches['name_similarity'].max():.3f}, "
          f"Mean: {entity_matches['name_similarity'].mean():.3f}")
    
    print(f"\n{'='*80}")
    print("LIKELY ENTITY MATCHES")
    print(f"{'='*80}\n")
    
    display(entity_matches[[
        'community',
        'user1_raw', 
        'user2_raw',
        'node_similarity',
        'name_similarity',
        'confidence_score',
        'match_confidence'
    ]])
    
else:
    print("\nNo matches found with current thresholds.")
    print("Consider adjusting SIMILARITY_CUTOFF or JW_THRESHOLD.")

## 8. Committing to the resolution

Now, I'm happy with the results achieved in this run. You may want to experiment with higher or lower thresholds -- that's fine. 

My advice to you is to run this once, and commit to high-quality matches. Then, with your newly connected nodes, run it again -- the graph should tighten up over multiple runs and produce better and better matches until you hit diminishing returns.

If you're ready to commit, you've got a few options:

1. Create a new node to represent all of your SAME_AS nodes. You can then connect those nodes to it. (recommended)
    - Pros: You get full traceability -- if something gets funky later, you can always undo
    - Cons: Adds another layer of complexity to subsequent runs

2. Merge all known SAME_AS to a single node, and have it inherit rels and properties. If you could be 100% sure that all of your resolutions are correct, you could do this.
    - Pros: Easy to understand, simplifies subsequent runs, truly recursive
    - Cons: No way to fix a bad merge, no traceability, fundamentally destructive

I would advise adding a central node to represent your SAME_AS entities and connecting them to that. In later projections, it's easy enough to just include that entity as a source instead of its sub entities.

For each of our matches, we'll add a new relationship, 'SIMILAR_ENT'. Then we'll run WCC over those relationships to find connected groups of users. Each group represents one resolved entity.
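
To illustrate why a connected-components pass is useful here, the sketch below groups pairwise matches transitively with a simple union-find. It is not how GDS implements WCC internally; the function name `connected_components` and the sample pairs are hypothetical.

```python
def connected_components(pairs):
    """Group nodes so that any two linked by a chain of pairs share a set."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            # Path halving keeps the trees shallow.
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


# Toy matches (hypothetical): A~B and B~C imply A, B, C are one entity
pairs = [
    ("ken lay", "kenneth lay"),
    ("kenneth lay", "lay kenneth"),
    ("jeff skilling", "jeffrey skilling"),
]
components = connected_components(pairs)
print(components)
```

Even though "ken lay" and "lay kenneth" were never matched directly, they land in the same group because both matched "kenneth lay".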

In [None]:
# 1. Create SIMILAR_ENT relationships with all metrics
# 2. Project graph with only SIMILAR_ENT relationships
# 3. Run WCC to find connected components
# 4. Create ResolvedEntity nodes per component

print("="*80)
print("CREATING RESOLVED ENTITY NODES (WCC-BASED)")
print("="*80 + "\n")

import time
start_time = time.time()

# Step 1: Create SIMILAR_ENT relationships between matched users with all metrics
print("Step 1: Creating SIMILAR_ENT relationships between matched users...\n")

similar_ent_result = gds.run_cypher("""
    UNWIND $matches AS match
    MATCH (u1:User {nameNormStrip: match.user1_normalized})
    MATCH (u2:User {nameNormStrip: match.user2_normalized})
    WHERE u1.louvain_community = match.community
      AND u2.louvain_community = match.community
      AND u1 <> u2
    
    CALL (u1, u2, match) {
        MERGE (u1)-[r:SIMILAR_ENT]-(u2)
        ON CREATE SET 
            r.confidence_score = match.confidence_score,
            r.node_similarity = match.node_similarity,
            r.name_similarity = match.name_similarity,
            r.jw_distance = match.jw_distance,
            r.match_confidence = match.match_confidence,
            r.created_at = datetime()
    } IN TRANSACTIONS OF 100 ROWS
    
    RETURN count(*) AS relationships_created
""", params={'matches': entity_matches.to_dict('records')})

print(f"  Created {similar_ent_result['relationships_created'].iloc[0]} SIMILAR_ENT relationships")

# Step 2: Project graph with User nodes and SIMILAR_ENT relationships
print("\nStep 2: Projecting graph for WCC analysis...\n")

# Clean up any existing projection
try:
    gds.graph.drop('similar_entities_graph')
except Exception:
    pass

projection = gds.run_cypher("""
    MATCH (source:User)-[r:SIMILAR_ENT]-(target:User)
    WITH gds.graph.project(
        'similar_entities_graph',
        source,
        target,
        {
            sourceNodeLabels: labels(source),
            targetNodeLabels: labels(target),
            relationshipType: type(r)
        },
        {
            undirectedRelationshipTypes: ['*']
        }
    ) AS g
    RETURN g.graphName AS name, g.nodeCount AS nodes, g.relationshipCount AS relationships
""")

print(f"  Projected graph: {projection['nodes'].iloc[0]} nodes, {projection['relationships'].iloc[0]} relationships")

# Step 3: Run WCC on the projected graph
print("\nStep 3: Running WCC to find connected components...\n")

G = gds.graph.get('similar_entities_graph')

wcc_result = gds.wcc.write(
    G,
    writeProperty='entity_resolution_wcc',
    minComponentSize=1
)

print(f"  WCC completed successfully!")
print(f"  Components found: {wcc_result['componentCount']:,}")
print(f"  Properties written: {wcc_result['nodePropertiesWritten']:,}")

# Clean up the projection
gds.graph.drop('similar_entities_graph')

# Step 4: Create ResolvedEntity nodes per WCC component
print("\nStep 4: Creating ResolvedEntity nodes per component...\n")

resolved_result = gds.run_cypher("""
    // Get all users with entity_resolution_wcc property
    MATCH (u:User)
    WHERE u.entity_resolution_wcc IS NOT NULL
    
    // Group by component ID
    WITH u.entity_resolution_wcc AS component_id, collect(u) AS members
    WHERE size(members) >= 2
    
    // Create ResolvedEntity for each component
    CREATE (re:ResolvedEntity {
        wcc_id: component_id,
        member_count: size(members),
        resolution_method: 'louvain_nodesim_jarowinkler_wcc',
        created_at: datetime()
    })
    
    // Determine canonical name (most common normalized name, or longest)
    WITH re, members
    UNWIND members AS member
    WITH re, members, member.nameNormStrip AS name, count(*) AS freq
    ORDER BY freq DESC, size(name) DESC
    WITH re, members, collect(name)[0] AS canonical_name
    SET re.canonical_name = canonical_name
    
    // Collect all email addresses owned by members
    WITH re, members
    UNWIND members AS member
    OPTIONAL MATCH (member)-[:USED]->(m:Mailbox)
    WITH re, members, collect(DISTINCT m.address) AS emails
    SET re.email_addresses = emails
    
    // Connect all users to the ResolvedEntity
    WITH re, members
    UNWIND members AS member
    MERGE (member)-[r:RESOLVES_TO]->(re)
    ON CREATE SET r.created_at = datetime()
    
    RETURN count(DISTINCT re) AS entities_created
""")

print(f"  Created {resolved_result['entities_created'].iloc[0]} ResolvedEntity nodes")

# Step 5: Summary of created entities
print("\nStep 5: Reviewing created ResolvedEntity nodes...\n")

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

resolved_entities = gds.run_cypher("""
    MATCH (re:ResolvedEntity)<-[:RESOLVES_TO]-(u:User)
    WITH re, collect(u.nameRaw) AS member_names, collect(DISTINCT u.nameNormStrip) AS normalized_names
    RETURN 
        re.wcc_id AS wcc_id,
        re.canonical_name AS canonical_name,
        re.member_count AS member_count,
        re.email_addresses[0..3] AS sample_emails,
        member_names AS all_name_variants,
        normalized_names AS unique_normalized_names
    ORDER BY member_count DESC
""")

elapsed_time = time.time() - start_time

print(f"{'='*80}")
print("RESOLUTION COMPLETE")
print(f"{'='*80}")
print(f"  Total time: {elapsed_time:.2f} seconds")
print(f"  ResolvedEntity nodes created: {len(resolved_entities)}")
print(f"  Total users resolved: {resolved_entities['member_count'].sum()}")
print(f"\n")

display(resolved_entities)


Instead of projecting User nodes directly, you can now project ResolvedEntity:

```cypher
MATCH (re:ResolvedEntity)<-[:RESOLVES_TO]-(u:User)-[:SENT]->(e:Email)
WITH re, e
// ... continue with your analysis using re instead of u
```

Or to get all emails for a resolved entity:

```cypher
MATCH (re:ResolvedEntity {canonical_name: 'kenneth lay'})<-[:RESOLVES_TO]-(u:User)
MATCH (u)-[:SENT|RECEIVED|CC_ON|BCC_ON]-(e:Email)
RETURN DISTINCT e
```

## Summary and Next Steps

### What We Accomplished

In this notebook, we successfully resolved duplicate entities using a multi-stage pipeline:

1. **Community Detection** - Used Louvain to partition the graph into ~260 communities
2. **Degree Centrality** - Filtered out generic/spam mailboxes (degree > 19)
3. **Node Similarity** - Found users with overlapping mailbox connections
4. **String Matching** - Validated matches with Jaro-Winkler (85%+ similarity)
5. **Connected Components** - Grouped all aliases into ResolvedEntity nodes

**Results**: ~5,000 high-confidence entity matches identified and linked

### Using Resolved Entities

Now that entities are resolved, you can query using `ResolvedEntity` nodes:

```cypher
// Find all emails for "Kenneth Lay" (all aliases combined)
MATCH (re:ResolvedEntity {canonical_name: 'Kenneth Lay'})<-[:RESOLVES_TO]-(u:User)
MATCH (u)-[:SENT|RECEIVED|CC_ON|BCC_ON]-(e:Email)
RETURN DISTINCT e

// Count emails per resolved entity
MATCH (re:ResolvedEntity)<-[:RESOLVES_TO]-(u:User)
MATCH (u)-[:SENT]->(e:Email)
RETURN re.canonical_name, count(DISTINCT e) as emails_sent
ORDER BY emails_sent DESC
LIMIT 20

// Find communication between resolved entities
MATCH (re1:ResolvedEntity)<-[:RESOLVES_TO]-(u1:User)
MATCH (re2:ResolvedEntity)<-[:RESOLVES_TO]-(u2:User)
MATCH (u1)-[:SENT]->(e:Email)<-[:RECEIVED]-(u2)
WHERE re1 <> re2
RETURN re1.canonical_name, re2.canonical_name, count(e) as emails
ORDER BY emails DESC
LIMIT 10
```