Basket Analysis

Neo4j GDS on Snowflake v0.3.13

Last Updated: 16 May 2025

## Setting Up
Before we run our algorithms, we need to set up the proper permissions. Creating roles and granting privileges requires the `accountadmin` role, so let's switch to it first:

In [None]:
-- you must be accountadmin to create role and grant permissions
USE ROLE accountadmin;

Create a database which we will use to prepare data for Neo4j Graph Analytics for Snowflake.

In [None]:
-- Create a database which we will use to prepare data for Graph Analytics.
CREATE DATABASE IF NOT EXISTS product_recommendation;
CREATE SCHEMA IF NOT EXISTS product_recommendation.public;
USE SCHEMA product_recommendation.public;

Next, let's set up the necessary roles, permissions, and resource access to enable Graph Analytics to operate on data within the `product_recommendation.public` schema. The following cell creates consumer roles for users (`gds_user_role`) and administrators (`gds_admin_role`), grants the Neo4j Graph Analytics for Snowflake application access to read from and write to tables and views, and ensures that future tables are accessible.

It also grants the application usage on the warehouse it needs to run graph algorithms at scale.

In [None]:
USE SCHEMA product_recommendation.public;

-- Create a consumer role for users and admins of the GDS application
CREATE ROLE IF NOT EXISTS gds_user_role;
CREATE ROLE IF NOT EXISTS gds_admin_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_user TO ROLE gds_user_role;
GRANT APPLICATION ROLE neo4j_graph_analytics.app_admin TO ROLE gds_admin_role;

CREATE DATABASE ROLE IF NOT EXISTS gds_db_role;
GRANT DATABASE ROLE gds_db_role TO ROLE gds_user_role;
GRANT DATABASE ROLE gds_db_role TO APPLICATION neo4j_graph_analytics;

-- Grant access to consumer data
GRANT USAGE ON DATABASE product_recommendation TO ROLE gds_user_role;
GRANT USAGE ON SCHEMA product_recommendation.public TO ROLE gds_user_role;

-- Required to read tabular data into a graph
GRANT SELECT ON ALL TABLES IN DATABASE product_recommendation TO DATABASE ROLE gds_db_role;

-- Ensure the consumer role has access to created tables/views
GRANT ALL PRIVILEGES ON FUTURE TABLES IN SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;
GRANT CREATE TABLE ON SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;
GRANT CREATE VIEW ON SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON FUTURE VIEWS IN SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;
GRANT ALL PRIVILEGES ON ALL VIEWS IN SCHEMA product_recommendation.public TO DATABASE ROLE gds_db_role;

-- Compute and warehouse access
GRANT USAGE ON WAREHOUSE NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE TO APPLICATION neo4j_graph_analytics;
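
To confirm that the grants above took effect, the standard Snowflake `SHOW GRANTS` commands can be used. This is an optional check that is not part of the original setup; it assumes the role, database role, and application names from the cell above.

```sql
-- Optional sanity check (not part of the original setup).
-- Lists the privileges held by the consumer role, the database role,
-- and the application itself.
SHOW GRANTS TO ROLE gds_user_role;
SHOW GRANTS TO DATABASE ROLE product_recommendation.gds_db_role;
SHOW GRANTS TO APPLICATION neo4j_graph_analytics;
```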

In [None]:
USE ROLE gds_user_role;
USE SCHEMA product_recommendation.public;

This example uses data from the Snowflake sample database, `SNOWFLAKE_SAMPLE_DATA`. See https://docs.snowflake.com/en/user-guide/sample-data-using for details.


In [None]:
-- The application reads data from tables that represent nodes and relationships.
-- Nodes are usually represented by entity tables, like persons or products.
-- Relationships are foreign keys between entity tables (1:1, 1:n) or via mapping tables (n:m).
-- In addition, the application expects certain naming conventions on column names.
-- If the data is not yet in the right format, we can use views to get there.

-- For our analysis, we will use two different types of nodes: parts and orders.
-- We want to find similar parts by looking at the orders in which they appeared.
-- The relationships will be the line items linking a part to an order.
-- The result will be a new table containing pairs of parts including their similarity score.

-- We start by creating two views to represent our node tables.
-- The application requires a node table to contain a 'nodeId' column.
-- Since we do not need any node properties, this will be the only column we project.
-- Note that the `nodeId` column is used to uniquely identify a node in the table.
-- The uniqueness is usually achieved by using the primary key in that table, here 'p_partkey'.
CREATE OR REPLACE VIEW parts AS
SELECT p_partkey AS nodeId FROM snowflake_sample_data.tpch_sf1.part;

-- We do the same for the orders by projecting the `o_orderkey` to 'nodeId'.
CREATE OR REPLACE VIEW orders AS
SELECT o_orderkey AS nodeId FROM snowflake_sample_data.tpch_sf1.orders;

-- The line items represent the relationship between parts and orders.
-- The application requires a `sourceNodeId` and a `targetNodeId` column to identify
-- the source and target node of each relationship.
-- Here, a part is the source of a relationship and an order is the target.
CREATE OR REPLACE VIEW part_in_order AS
SELECT
    l_partkey AS sourceNodeId,
    l_orderkey AS targetNodeId
FROM snowflake_sample_data.tpch_sf1.lineitem;
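
Before projecting, it can help to check that the views satisfy the requirements described in the comments above: unique `nodeId` values and relationship endpoints that exist in the node views. The queries below are a minimal sketch of such a check and are not required by the application.

```sql
-- Node ids must be unique: row_count and distinct_ids should match per view.
SELECT 'parts' AS view_name, COUNT(*) AS row_count, COUNT(DISTINCT nodeId) AS distinct_ids
FROM product_recommendation.public.parts
UNION ALL
SELECT 'orders', COUNT(*), COUNT(DISTINCT nodeId)
FROM product_recommendation.public.orders;

-- Every relationship endpoint should exist in the matching node view,
-- so this count is expected to be zero.
SELECT COUNT(*) AS dangling_relationships
FROM product_recommendation.public.part_in_order r
    LEFT JOIN product_recommendation.public.parts p ON r.sourceNodeId = p.nodeId
    LEFT JOIN product_recommendation.public.orders o ON r.targetNodeId = o.nodeId
WHERE p.nodeId IS NULL OR o.nodeId IS NULL;
```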

Next, we want to consider the warehouse that the Neo4j Graph Analytics for Snowflake application will use to execute queries.
This example uses a MEDIUM size warehouse, so we configure the application's warehouse accordingly.

In [None]:
ALTER WAREHOUSE NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE SET WAREHOUSE_SIZE='MEDIUM';
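
To check that the resize was applied, the warehouse can be inspected with the standard `SHOW WAREHOUSES` command; the `size` column of the result reports the current setting. This is an optional check, assuming the current role is allowed to see the warehouse.

```sql
-- Optional: inspect the application's warehouse and its current size.
SHOW WAREHOUSES LIKE 'NEO4J_GRAPH_ANALYTICS_APP_WAREHOUSE';
```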

In [None]:
SELECT TO_CHAR(SOURCENODEID), TO_CHAR(TARGETNODEID) FROM product_recommendation.public.PART_IN_ORDER LIMIT 10;

With the views in place, we can project our node and relationship views into a Neo4j Graph Analytics for Snowflake in-memory graph. The projection is configured in the `project` section of the algorithm call.

* The mandatory parameters are the node tables and the relationship tables.
* A node table mapping points from a table/view to a node label that is used in the Neo4j Graph Analytics for Snowflake graph.
* The name of the node label is based on the table/view name used in the projection, and case is preserved. For example, the rows of 'product_recommendation.public.parts' will be nodes labeled 'parts'.
* Relationship tables need a bit more configuration: we need to specify source and target tables.
* The relationships are represented as typed relationships in the Neo4j Graph Analytics for Snowflake graph, where, similarly to nodes, the table/view name is taken as the relationship type. For example, 'product_recommendation.public.part_in_order' below gives rise to the relationship type 'part_in_order'.
* We also specify the optional read concurrency to optimize building the graph projection.
* The concurrency can be set to the number of cores available on the compute pool node.

The graph we project is a so-called bipartite graph, as it contains two types of nodes and all relationships point from one type to the other.
The node similarity algorithm looks at all pairs of nodes of the first type and calculates the similarity for each pair based on common relationships.
In our case, the algorithm will calculate the similarity between two parts based on the orders in which they appear.
The algorithm produces new relationships between parts, with the similarity score stored as a relationship property.
For further information on the node similarity algorithm, please refer to the [Neo4j Graph Analytics for Snowflake documentation](https://neo4j.com/docs/snowflake-graph-analytics/current/).
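
To make the idea of similarity "based on common relationships" concrete, the query below computes a Jaccard score by hand for one pair of parts: the number of shared orders divided by the number of orders containing either part. This is only an illustrative sketch. It assumes the Jaccard metric (check the linked documentation for the metric the application actually uses), and the part keys 1 and 2 are arbitrary placeholders.

```sql
-- Illustrative only: manual Jaccard similarity for two hypothetical parts.
WITH orders_a AS (
    SELECT DISTINCT targetNodeId
    FROM product_recommendation.public.part_in_order
    WHERE sourceNodeId = 1   -- placeholder part key
),
orders_b AS (
    SELECT DISTINCT targetNodeId
    FROM product_recommendation.public.part_in_order
    WHERE sourceNodeId = 2   -- placeholder part key
),
counts AS (
    -- The full outer join yields one row per order in the union of both sets;
    -- rows matched on both sides are the shared orders.
    SELECT
        COUNT_IF(a.targetNodeId IS NOT NULL AND b.targetNodeId IS NOT NULL) AS shared_orders,
        COUNT(*) AS union_orders
    FROM orders_a a
        FULL OUTER JOIN orders_b b ON a.targetNodeId = b.targetNodeId
)
SELECT
    shared_orders,
    union_orders,
    shared_orders / NULLIF(union_orders, 0) AS jaccard  -- NULL if neither part was ordered
FROM counts;
```

Two parts that never share an order score 0, and two parts that always appear together score 1.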

Once the algorithm has finished, we can write the results back to Snowflake tables for further analysis.
We want to write back the similarity relationships between parts. 
The specified table will contain the original source and target node ids and the similarity score.

In [None]:
CALL neo4j_graph_analytics.graph.node_similarity('CPU_X64_L', {
  'project': {
    'defaultTablePrefix': 'product_recommendation.public',
    -- Node tables refer to the views created earlier: parts and orders.
    'nodeTables': ['parts', 'orders'],
    'relationshipTables': {
      'part_in_order': {
        'sourceTable': 'parts',
        'targetTable': 'orders'
      }
    }
  },
  'compute': {
    'topK': 2,
    'concurrency': 28
  },
  'write': [
    {
      -- Similarity relationships connect parts to parts.
      'sourceLabel':          'parts',
      'targetLabel':          'parts',
      'relationshipType':     'SIMILAR_TO',
      'relationshipProperty': 'similarity',
      'outputTable':          'product_recommendation.public.part_similar_to_part'
    }
  ]
});

After writing the table, we need to ensure that our current role is allowed to read it.
Alternatively, we can also grant access to all future tables created by the application.

In [None]:
GRANT SELECT ON product_recommendation.public.PART_SIMILAR_TO_PART TO ROLE gds_user_role;
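
The alternative mentioned above, granting access to all future tables, could look like the following sketch. It assumes the statement is run with a role that is allowed to manage grants (here `accountadmin`); adjust the role names to your setup.

```sql
-- Sketch of the alternative: grant read access on all tables created
-- in the schema from now on, instead of granting per table.
-- Assumption: accountadmin (or another role that may manage grants) is used.
USE ROLE accountadmin;
GRANT SELECT ON FUTURE TABLES IN SCHEMA product_recommendation.public TO ROLE gds_user_role;
USE ROLE gds_user_role;
```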

Since the results are now stored in Snowflake, we can query them and join them with our original data.
For example, we can find the names of the most similar parts based on the similarity score.
Simply speaking, this could be used as a recommendation system for parts.

In [None]:
-- The part names live in the TPC-H sample table, so we join against it
-- (the parts view only exposes nodeId).
SELECT DISTINCT
    p_source.p_name AS source_part,
    p_target.p_name AS target_part,
    sim.similarity
FROM snowflake_sample_data.tpch_sf1.part p_source
    JOIN product_recommendation.public.PART_SIMILAR_TO_PART sim
        ON p_source.p_partkey = sim.sourcenodeid
    JOIN snowflake_sample_data.tpch_sf1.part p_target
        ON p_target.p_partkey = sim.targetnodeid
ORDER BY sim.similarity DESC LIMIT 10;
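
As a quick plausibility check on the written table, one can verify that no part has more similar parts than the `topK` value of 2 used in the call. This is an optional sketch that relies only on the `sourcenodeid` column used in the query above.

```sql
-- Optional check: with topK = 2, each source part should have at most
-- two similarity rows, so this query is expected to return no rows.
SELECT sim.sourcenodeid, COUNT(*) AS similar_parts
FROM product_recommendation.public.PART_SIMILAR_TO_PART sim
GROUP BY sim.sourcenodeid
HAVING COUNT(*) > 2;
```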

In [None]:
USE ROLE ACCOUNTADMIN;
GRANT OWNERSHIP ON TABLE product_recommendation.public.part_similar_to_part TO ROLE gds_user_role REVOKE CURRENT GRANTS;
USE ROLE gds_user_role;

The Neo4j Graph Analytics for Snowflake service is a long-running service and should be stopped when not in use.
Once we have completed our analysis, we can stop the session, which suspends the container service.
We can restart the session at any time to continue our analysis.