author: indu varshini (eastridge analytics)
id: entity-resolution-using-graph-algorithms-with-neo4j
categories: getting-started,partner-integrations
environments: web
status: Published
feedback link: https://github.com/Snowflake-Labs/sfguides/issues
tags: Getting Started, Data Science, Data Engineering, Twitter

# Resolving Duplicate Entities Using Graph Algorithms with Neo4j
---
# Overview 

Duration: 2 mins

## What Is Neo4j Graph Analytics For Snowflake?

Neo4j helps organizations find hidden relationships and patterns across billions of data connections deeply, easily, and quickly. Neo4j Graph Analytics for Snowflake brings the power of graph directly to Snowflake, allowing users to run 65+ ready-to-use algorithms on their data, all without leaving Snowflake!

## Why Entity Resolution Matters

Inconsistent or duplicate customer records are a common challenge in modern data platforms. Whether caused by typos, multiple sign-ups, or disconnected systems, these issues can impact reporting, personalization, and fraud detection. Graph algorithms help by comparing relationships and properties to identify overlapping entities.

## Prerequisites
The Native App [Neo4j Graph Analytics](https://app.snowflake.com/marketplace/listing/GZTDZH40CN/neo4j-neo4j-graph-analytics) for Snowflake

## What You Will Need:
- A [Snowflake account](https://signup.snowflake.com/) with appropriate access to databases and schemas.
- Neo4j Graph Analytics application installed from the Snowflake Marketplace. Access the Marketplace via the menu bar on the left-hand side of your screen.


## What You Will Build:
- A graph-based model that normalizes profile attributes as separate nodes
- A method to use graph-based similarity and community detection for entity resolution  

## What You Will Learn:

- How to reshape tabular data into a normalized graph model
- How to use FastRP and KNN to identify structurally similar profiles
- How to cluster profiles with Louvain for deduplication
- How to visualize the graphs in the notebook itself
- How to read and write directly from and to your Snowflake tables






# Loading the Data
---
Duration: 5 mins

## Dataset Overview:
This dataset contains multiple customer identities with potential duplicates across name, contact, and address fields. The goal is to resolve identities into unique profiles by connecting shared attributes as graph nodes.

For the purposes of the demo, the database will be named `ER_DEMO`. Using the CSV, `er_mock_data.csv`, found [here](https://github.com/neo4j-product-examples/snowflake-graph-analytics/tree/main/entity-resolution), we are going to create a new table called `CUSTOMER_ENTITIES_DEMO_DATA` via the Snowsight data upload method. 

Follow the Snowflake [documentation](https://docs.snowflake.com/en/user-guide/data-load-web-ui) on loading data using the web interface to create the table.

In the pop-up:
1. Upload the CSV `er_mock_data.csv` using the browse option. 
2. Under `Select or create a database and schema`, please create a database with name `ER_DEMO`.
3. Under `Select or create a table`, please click on the '+' symbol and create a new table named `CUSTOMER_ENTITIES_DEMO_DATA`.

A new table named `CUSTOMER_ENTITIES_DEMO_DATA` is now created under `er_demo.public` from the provided CSV.

# Setting Up
---
Duration: 5 mins

## Permissions
One of the most useful aspects of Snowflake is the ability to have roles with specific permissions, so that many people can work in the same database without worrying about security. The Neo4j app requires the creation of a few different roles. Before we start creating roles and granting privileges, we need to make sure that we are using `accountadmin`. Let's do that now:

In [None]:
USE ROLE ACCOUNTADMIN;

Next we can set up the necessary roles, permissions, and resource access to enable Graph Analytics to operate on the demo data within the `er_demo.public` schema (this schema is where the data will be stored by default). 

We will create a consumer role (gds_role) for users and administrators, grant the gds_role and GDS application access to read from and write to tables and views, and ensure the future tables are accessible. We will also provide the application with access to the compute pool and warehouse resources required to run the graph algorithms at scale.


In [None]:
-- Create an account role to manage the GDS application
CREATE ROLE IF NOT EXISTS gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_user TO ROLE gds_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_admin TO ROLE gds_role;

--Grant permissions for the application to use the database
GRANT USAGE ON DATABASE er_demo TO APPLICATION neo4j_graph_analytics;
GRANT USAGE ON SCHEMA er_demo.public TO APPLICATION neo4j_graph_analytics;

--Create a database role to manage table and view access
CREATE DATABASE ROLE IF NOT EXISTS gds_db_role;

GRANT ALL PRIVILEGES ON FUTURE TABLES IN SCHEMA er_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA er_demo.public TO DATABASE ROLE gds_db_role;

GRANT ALL PRIVILEGES ON FUTURE VIEWS IN SCHEMA er_demo.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL VIEWS IN SCHEMA er_demo.public TO DATABASE ROLE gds_db_role;

GRANT CREATE TABLE ON SCHEMA er_demo.public TO DATABASE ROLE gds_db_role;


--Grant the DB role to the application and admin user
GRANT DATABASE ROLE gds_db_role TO APPLICATION neo4j_graph_analytics;
GRANT DATABASE ROLE gds_db_role TO ROLE gds_role;

GRANT USAGE ON DATABASE ER_DEMO TO ROLE GDS_ROLE;
GRANT USAGE ON SCHEMA ER_DEMO.PUBLIC TO ROLE GDS_ROLE;

GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA ER_DEMO.PUBLIC TO ROLE GDS_ROLE;
GRANT CREATE TABLE ON SCHEMA ER_DEMO.PUBLIC TO ROLE GDS_ROLE;
GRANT SELECT, INSERT, UPDATE, DELETE ON FUTURE TABLES IN SCHEMA ER_DEMO.PUBLIC TO ROLE GDS_ROLE;

Now we will switch to the role we just created:

In [None]:
use warehouse NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE;
use role gds_role;
use database er_demo;
use schema public;


# Cleaning Our Data
---
Duration: 5 mins

We need our data to be in a particular format in order to work with Graph Analytics. In general, it should be like so:

**Node Tables:** The first column, `nodeId`, must uniquely identify each node in the graph.

**Relationship Tables:** Each row must name two nodes, a source (`sourceNodeId`) and a target (`targetNodeId`), between which the relationship exists.

To get ready for Graph Analytics, reshape your tables as follows:

---

**NODES**
- **Identity** – `identity_id` - Unique identifier for each row in the dataset
- **Name** – concatenation of `first_name`, `last_name`, or nickname
- **Email** – `email_address`
- **Phone** – `phone_number`
- **Birth Year** – `birth_year`
- **Address** – combined from `street_number`, `street_name`, `zip_code`
- **Profile** – each unique customer profile (resolved), represented by `profile_id`

**RELATIONSHIPS**
- Identity — `HAS_NAME` → Name
- Identity — `HAS_EMAIL` → Email
- Identity — `HAS_PHONE` → Phone
- Identity — `HAS_BIRTH_YEAR` → BirthYear
- Identity — `HAS_ADDRESS` → Address
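
To make the mapping above concrete, here is a minimal Python sketch of how one customer row turns into identity-to-attribute pairs. It mirrors the lowercasing and concatenation used in this guide's SQL; the helper names `concat_lower` and `row_to_edges` and the sample rows are hypothetical. The sketch treats a key as missing when any of its parts is missing (in Snowflake, `||` with a `NULL` operand yields `NULL`), and dropping such keys is a choice made here, not something the guide's SQL does.

```python
def concat_lower(*parts):
    """Join parts with spaces and lowercase them; None if any part is missing."""
    if any(part is None for part in parts):
        return None
    return " ".join(str(part) for part in parts).lower()


def row_to_edges(row):
    """Return (identity_id, attribute_key) string pairs for one customer row."""
    keys = [
        concat_lower(row["first_name"], row["last_name"]),   # HAS_NAME
        concat_lower(row["email_address"]),                  # HAS_EMAIL
        row["phone_number"],                                 # HAS_PHONE
        row["birth_year"],                                   # HAS_BIRTH_YEAR
        concat_lower(row["street_number"], row["street_name"], row["zip_code"]),  # HAS_ADDRESS
    ]
    # Assumption of this sketch: keys with missing values are dropped.
    return [(str(row["identity_id"]), str(key)) for key in keys if key is not None]


# Hypothetical rows: same person entered twice with different casing and address.
row_a = {"identity_id": 1, "first_name": "Ann", "last_name": "Lee",
         "email_address": "Ann.Lee@Example.com", "phone_number": "555-0101",
         "birth_year": 1985, "street_number": 12, "street_name": "Oak St",
         "zip_code": "90210"}
row_b = {"identity_id": 2, "first_name": "ANN", "last_name": "LEE",
         "email_address": "ann.lee@example.com", "phone_number": None,
         "birth_year": 1985, "street_number": 7, "street_name": "Elm Rd",
         "zip_code": "90211"}

edges_a = row_to_edges(row_a)
edges_b = row_to_edges(row_b)
# Attribute nodes that both identities point to; these shared nodes are what
# later link likely duplicates in the graph.
shared_keys = {key for _, key in edges_a} & {key for _, key in edges_b}
```

The two identities end up connected through their shared name, email, and birth-year nodes, even though the raw strings differed in case.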


In [None]:
CREATE OR REPLACE TABLE node_identities AS
SELECT DISTINCT identity_id AS identity_id FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_names AS
SELECT DISTINCT LOWER(first_name || ' ' || last_name) AS full_name FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_emails AS
SELECT DISTINCT LOWER(email_address) AS email_id FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_phones AS
SELECT DISTINCT phone_number AS phone_number FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_birth_years AS
SELECT DISTINCT birth_year AS birth_year FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_addresses AS
SELECT DISTINCT LOWER(street_number || ' ' || street_name || ' ' || zip_code) AS full_address FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE node_profiles AS
SELECT DISTINCT profile_id AS profile_id FROM customer_entities_demo_data;


Now we will merge the node tables into a single `all_nodes_tbl`. Note that `node_profiles` is left out of this union.

In [None]:
CREATE OR REPLACE TABLE all_nodes_tbl AS
SELECT identity_id::STRING AS nodeId FROM node_identities
UNION
SELECT full_name::STRING AS nodeId FROM node_names
UNION
SELECT email_id::STRING AS nodeId FROM node_emails
UNION
SELECT phone_number::STRING AS nodeId FROM node_phones
UNION
SELECT birth_year::STRING AS nodeId FROM node_birth_years
UNION
SELECT full_address::STRING AS nodeId FROM node_addresses;
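
Because every node type shares the single `nodeId` column, two values of different types that happen to have the same string form would collapse into one node after the `UNION`. Whether this occurs depends on the data. The sketch below illustrates the effect and one possible guard, a type prefix on each ID; the `make_node_id` helper and the sample values are hypothetical and not part of this guide's SQL.

```python
def make_node_id(kind, value):
    """Build a node ID that is unique across node types by prefixing the type."""
    return f"{kind}:{str(value).lower()}"


# Hypothetical values: an identity_id and a birth_year that are both 1985.
raw_values = [("identity", 1985), ("birth_year", 1985), ("email", "A@x.com")]

# Without a prefix, the two 1985 values become the same node ID.
plain_ids = {str(value).lower() for _, value in raw_values}

# With a prefix, each value keeps its own node.
prefixed_ids = {make_node_id(kind, value) for kind, value in raw_values}
```

If you adopt a prefix like this, the same transformation has to be applied to the source and target columns of the relationship tables so the IDs still match.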


In [None]:
select * from all_nodes_tbl;

Below we will create the relationship tables.

In [None]:

CREATE OR REPLACE TABLE rel_identity_name AS
SELECT 
    identity_id, 
    LOWER(first_name || ' ' || last_name) AS full_name
FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE rel_identity_email AS
SELECT 
    identity_id, 
    LOWER(email_address) AS email_id
FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE rel_identity_phone AS
SELECT 
    identity_id, 
    phone_number
FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE rel_identity_birth_year AS
SELECT 
    identity_id,
    birth_year
FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE rel_identity_address AS
SELECT 
    identity_id,
    LOWER(street_number || ' ' || street_name || ' ' || zip_code) AS full_address
FROM customer_entities_demo_data;

CREATE OR REPLACE TABLE rel_identity_profile AS
SELECT 
    identity_id,
    profile_id
FROM customer_entities_demo_data;


We will merge the relationships into one relationship table for easier analysis. The `rel_identity_profile` table is not part of this union.

In [None]:
CREATE OR REPLACE TABLE all_relationships_tbl AS
SELECT identity_id::STRING AS sourceNodeId, full_name::STRING AS targetNodeId FROM rel_identity_name
UNION
SELECT identity_id::STRING, email_id::STRING FROM rel_identity_email
UNION
SELECT identity_id::STRING, phone_number::STRING FROM rel_identity_phone
UNION
SELECT identity_id::STRING, birth_year::STRING FROM rel_identity_birth_year
UNION
SELECT identity_id::STRING, full_address::STRING FROM rel_identity_address;


Let's look at the data in the relationships table.

You can also preview these tables in the `ER_DEMO` database, under the `public` schema, on the Snowsight Databases page. Please refer to this [documentation](https://docs.snowflake.com/en/user-guide/ui-snowsight-data-databases-table) for more information.

In [None]:
select * from all_relationships_tbl;

# Running Graph Algorithms
---
Duration: 15 mins

To get started, we will find the similarity between the nodes using the FastRP and KNN algorithms.



## 1. Similarity between Entity Nodes


You can find more information about these algorithms in our [documentation](https://neo4j.com/docs/snowflake-graph-analytics/current/algorithms/).


## Fast Random Projection (FastRP)


We compute embeddings as follows:


In [None]:
CALL Neo4j_Graph_Analytics.graph.fast_rp('CPU_X64_XS', {
  'project': {
    'defaultTablePrefix': 'er_demo.public',
    'nodeTables': ['all_nodes_tbl'],
    'relationshipTables': {
      'all_relationships_tbl': {
        'sourceTable': 'all_nodes_tbl',
        'targetTable': 'all_nodes_tbl',
        'orientation': 'UNDIRECTED'

      }
    }
  },
  'compute': {
    'mutateProperty': 'embedding',
    'embeddingDimension': 128
  },
  'write': [{
    'nodeLabel': 'all_nodes_tbl',
    'outputTable': 'er_demo.public.all_nodes_fast_rp',
    'nodeProperty': 'embedding'
  }]
});

In [None]:
SELECT
  nodeid,
  embedding
FROM er_demo.public.all_nodes_fast_rp;

Now that we have generated node embeddings, we can use them in the KNN similarity detection algorithm.

## K-Nearest Neighbors (KNN)

With embeddings in place, KNN helps us find structurally similar identities — even if they’re not directly connected. It compares the cosine similarity of embeddings to rank the top matches for each node.

This is especially useful in entity resolution, where entities may appear unrelated on the surface but exhibit parallel structural behavior. 

In the context of cosine similarity in the KNN algorithm, a score of:

- 1.0 means the vectors point in exactly the same direction (perfect similarity).

- 0.0 means orthogonal (no similarity).

- –1.0 means completely opposite.
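
The three reference scores above can be reproduced with a few lines of plain Python. This is only a sketch of the standard cosine similarity formula; the function name and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors score 0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # approximately 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0.0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # approximately -1.0
```

Note that the score depends only on direction, not on vector length: `[1.0, 2.0]` and `[2.0, 4.0]` count as a perfect match.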



In [None]:
CALL Neo4j_Graph_Analytics.graph.knn('CPU_X64_XS', {
  'project': {
    'defaultTablePrefix': 'er_demo.public',
    'nodeTables': [ 'all_nodes_fast_rp' ],
    'relationshipTables': {}
  },
  'compute': {
    'nodeProperties': ['EMBEDDING'],
    'topK': 3,
    'mutateProperty': 'score',
    'mutateRelationshipType': 'SIMILAR_TO'
  },
  'write': [{
    'outputTable': 'er_demo.public.all_nodes_knn_similarity',
    'sourceLabel': 'all_nodes_fast_rp',
    'targetLabel': 'all_nodes_fast_rp',
    'relationshipType': 'SIMILAR_TO',
    'relationshipProperty': 'score'
  }]
});
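
As a rough picture of what `topK` means in the call above, the sketch below ranks every other node by cosine similarity and keeps the best `k` per node. It compares all pairs exhaustively, which is fine for illustration but is not necessarily how the library computes its result; the function names and the two-dimensional sample embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_neighbors(embeddings, k):
    """For each node, return its k most similar other nodes with their scores."""
    result = {}
    for node, vector in embeddings.items():
        scored = [
            (other, cosine(vector, other_vector))
            for other, other_vector in embeddings.items()
            if other != node
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        result[node] = scored[:k]
    return result


# Hypothetical embeddings: id1 and id2 point in almost the same direction.
sample_embeddings = {
    "id1": [1.0, 0.1],
    "id2": [0.9, 0.2],
    "id3": [-1.0, 0.5],
    "id4": [0.0, 1.0],
}
nearest = top_k_neighbors(sample_embeddings, k=1)
```

Each `(node, neighbour, score)` triple corresponds to one `SIMILAR_TO` row in the output table, with the score stored in the `score` property.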

Let's look at a count of rows for each similarity score from our KNN algorithm.

In [None]:
SELECT
    TO_DECIMAL(score, 10, 5) AS score_rounded,
    COUNT(*) AS row_count
FROM er_demo.public.all_nodes_knn_similarity
GROUP BY TO_DECIMAL(score, 10, 5)
ORDER BY score_rounded desc;


As the results above show, roughly 2,900 nodes are flagged as highly similar to one another.

## 2. Louvain Community Detection
After establishing similarity connections between identities, we apply Louvain to detect communities of likely duplicate records. Each cluster groups together identities with similar structural patterns, thus forming a candidate resolved entity.

In [None]:
CALL Neo4j_Graph_Analytics.graph.louvain('CPU_X64_XS', {
  'project': {
    'defaultTablePrefix': 'er_demo.public',
    'nodeTables': ['all_nodes_tbl'],
    'relationshipTables': {
      'all_nodes_knn_similarity': {
        'sourceTable': 'all_nodes_tbl',
        'targetTable': 'all_nodes_tbl'
      }
    }
  },
  'compute': {
    'mutateProperty': 'community'
  },
  'write': [{
    'outputTable': 'er_demo.public.knn_louvain_communities',
    'nodeLabel': 'all_nodes_tbl',
    'nodeProperty': 'community'
  }]
});
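
Louvain searches for a partition that maximizes modularity, a score that rewards communities with more internal edges than their node degrees alone would predict. The sketch below computes that score for a given partition of an undirected, unweighted graph; it does not implement the Louvain search itself, and the function name and the toy graph are hypothetical.

```python
def modularity(edges, community_of):
    """Modularity of a partition: sum over communities of
    (internal edges / m) - (total degree / 2m) ** 2."""
    m = len(edges)
    internal_edges = {}
    total_degree = {}
    for a, b in edges:
        # Each edge adds one to the degree of both endpoints' communities.
        for node in (a, b):
            community = community_of[node]
            total_degree[community] = total_degree.get(community, 0) + 1
        if community_of[a] == community_of[b]:
            community = community_of[a]
            internal_edges[community] = internal_edges.get(community, 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Toy graph: two triangles joined by a single bridge edge (c, d).
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

score_two_groups = modularity(toy_edges, two_groups)
score_one_group = modularity(toy_edges, one_group)
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one community, which is the kind of split Louvain is designed to find.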


In [None]:
SELECT
    community as community_id,
    COUNT(*) AS community_size
FROM er_demo.public.knn_louvain_communities
GROUP BY community
ORDER BY community_size desc;


In [None]:
select * from knn_louvain_communities
where community = 93;

## Interpretation

The Louvain community detection identified around 114 communities, the largest with 38 members and the smallest with 4.
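
One way to sanity-check the communities is to compare them with the `profile_id` column, assuming it can be treated as a ground-truth label for the resolved profile (the guide describes it as the resolved profile but does not use it in the graph). The sketch below computes, for each community, the share of identities that belong to its most common profile. The function name and the sample assignments are hypothetical, and only identity nodes are considered.

```python
from collections import Counter, defaultdict


def community_purity(community_of, profile_of):
    """Share of each community's identities that belong to its dominant profile."""
    profiles_by_community = defaultdict(list)
    for identity, community in community_of.items():
        profiles_by_community[community].append(profile_of[identity])

    purity = {}
    for community, profiles in profiles_by_community.items():
        dominant_count = Counter(profiles).most_common(1)[0][1]
        purity[community] = dominant_count / len(profiles)
    return purity


# Hypothetical assignments: community 93 mixes two profiles, community 7 is clean.
sample_communities = {"id1": 93, "id2": 93, "id3": 93, "id4": 7}
sample_profiles = {"id1": "p1", "id2": "p1", "id3": "p2", "id4": "p3"}

purity_by_community = community_purity(sample_communities, sample_profiles)
```

A purity of 1.0 means every identity in the community shares one profile; lower values point to communities that merge more than one real customer and may need a stricter similarity cutoff.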


# Conclusion and Resources
---
Duration: 2 mins

In this quickstart, you learned how to bring the power of graph insights into Snowflake using Neo4j Graph Analytics.

## What You Learned
By working with a Customer Entities Demo dataset, you were able to:

1. Set up the Neo4j Graph Analytics application within Snowflake.
2. Prepare and project your data into a graph model (identities and their attributes as nodes, `HAS_*` links as relationships).
3. Run FastRP node embeddings and K-Nearest Neighbors to capture the structure of nodes in the graph and identify highly similar customer entities.
4. Run Louvain Community Detection to identify potential clusters of highly similar identities.

## Resources
- [Neo4j Graph Analytics Documentation](https://neo4j.com/docs/snowflake-graph-analytics/)
- [Installing Neo4j Graph Analytics on SPCS](https://neo4j.com/docs/snowflake-graph-analytics/installation/)