# Running Graph algorithms with the GDS Python Client 

The Graph Data Science plugin for Neo4j makes it easy to spin up a graph, run some algorithms, get your insights, and tear it back down again. Follow this notebook to run GDS on your own graphs. 

If you don't have your own graph -- or you're in the mood for some spicy new nodes -- check out these datasets:
1. [The movie recommendations dataset from this talk](https://github.com/neo4j-graph-examples/recommendations/tree/main/data)
2. [Northwind retail dataset](https://github.com/neo4j-graph-examples/northwind)
3. [FinCEN money laundering dataset](https://github.com/jexp/fincen)
4. [StackOverflow dataset](https://github.com/neo4j-graph-examples/stackoverflow)

If you don't know how to set up a graph in Neo4j:
- [Download Neo4j for Desktop](https://neo4j.com/download/neo4j-desktop/?edition=desktop&flavour=osx&release=2.0.3&offline=false)
- [Get a free Aura DB instance](https://neo4j.com/product/auradb/)
- Head to the Graph Academy website to learn [how to get started with Neo4j](https://graphacademy.neo4j.com/)

## How to use this notebook
There are three GDS-related notebooks in this repo.

1. 1_gds-client-leiden.ipynb (this notebook) contains the GDS Python Client walkthrough. You'll: 
    - Connect to a local instance
    - Project a graph
    - Run the Leiden algorithm to generate movie recommendations.
2. 2_gds-client-fastrp.ipynb contains a similar walkthrough, demonstrating how to use FastRP to generate node embeddings.
3. 3_gds-client-algorithm-bank.ipynb contains a bank of algorithms you can run, along with some descriptions of what they do and how to use them.

If you're currently attending the talk, feel free to modify and run this notebook on your own instance.

To test this notebook on the movies dataset used as an example, you can [recreate it from the same .dump file](https://github.com/neo4j-graph-examples/recommendations/blob/main/data/recommendations-50.dump).

To load the dump, just click the three dots on an instance and choose 'Load database from file'.

![Load from dump](../docs/load_from_dump.png)

## Generate recommendations with GDS and Leiden

This section will walk you through:
- [Setting up the GDS Python Client and connecting to an instance](#set-up-the-gds-python-client-and-connect-to-an-instance)
- [Projecting a graph](#project-your-nodes-and-relationships)
- [Running the Leiden algorithm on a projected graph](#running-the-leiden-algorithm)

### Set up the GDS Python Client and connect to an instance

1. Install GDS and Pandas. 

***Note:*** _You do not need pandas to run the GDS client. We're including it here to demonstrate how you can query with GDS and use the outputs in a familiar format._

In [None]:
%pip install neo4j graphdatascience pandas

2. Connect to your instance. 

    This assumes you already have an instance running.

    Replace with your own instance's details and run.

In [None]:
from graphdatascience import GraphDataScience
from graphdatascience import ServerVersion

# Use Neo4j URI and credentials according to your setup
# NEO4J_URI could look similar to "bolt://my-server.neo4j.io:7687"

NEO4J_URI = "your_uri"                # If on a local instance, something like "neo4j://127.0.0.1:7687"
NEO4J_USER = "your_username"          # Probably "neo4j"
NEO4J_PASSWORD = "your_password" 

gds = GraphDataScience(
  NEO4J_URI,
  auth=(
    NEO4J_USER,
    NEO4J_PASSWORD
  ),
  database = "database_name"          # If using the movies dump, it is likely "recommendations-embeddings-50"
)

# Check the installed GDS version on the server
print(rf"All systems are go. GDS Version: {gds.server_version()}")
assert gds.server_version() >= ServerVersion(1, 8, 0)

All systems are go. GDS Version: 2.21.0


#### Troubleshooting
Make sure:
1. You have the URI from the correct instance
2. Your password is correct
3. Your username is correct
4. Your database name is correct -- this will be different from your project name

#### Helpers
If this is your first time through this notebook, skip to the next section.

The following two cells will:
- Drop the graph from memory
- Remove any properties we created

Only use them if you want to start again.

In [9]:
# Run this only if you want to drop the graph and start again.
gds.graph.drop('rec-simple')

In [10]:
# Clean up the old properties if they exist
query ="""
MATCH (n)
WHERE n.communityId IS NOT NULL
REMOVE n.communityId
"""

removed = gds.run_cypher(query)
print(removed)

Empty DataFrame
Columns: []
Index: []


### Project your nodes and relationships

Projecting graphs from the GDS Client is 1000x simpler than in the Browser. If you don't know what 'Project' means yet, don't worry. Let's just do it first, and then analyse what happened.

1. Assign a variable to the nodes you want to analyse.
2. Assign a variable to the relationships you want to analyse. Make sure you use the correct directionality for these relationships. 
    Leiden, the algorithm we're about to run, _cannot_ run on DIRECTED relationships. You can [check the requirements for each algorithm in the docs.](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/)
3. Project the graph

It's generally good practice to check how much memory a graph will actually use first, and then project a graph -- so let's do that.

In [11]:
# We define how we want to project our database into GDS
node_projection = ['User', 'Movie', 'Genre']
relationship_projection = {'RATED': {'orientation': 'UNDIRECTED'}, 'IN_GENRE': {'orientation': 'UNDIRECTED'}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")

Required memory for native loading: [2814 KiB ... 3006 KiB]


Memory requirements are low, so let's go ahead and project the graph.

The basic projection pattern is always the same.

In [12]:
# For this small graph memory requirement is low. Let us go through with the projection
G, result = gds.graph.project('rec-simple', node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

The projection took 80 ms
Graph 'rec-simple' node count: 9816
Graph 'rec-simple' node labels: ['User', 'Movie', 'Genre']


### What you just did
You just 'projected' a graph. 'Project' means you:
1. Pulled a bunch of nodes and/or relationships and properties out of the main graph
2. Reconstructed a new graph in memory out of those

Now, when you call 'gds.' and reference 'rec-simple', the operations you run will run on the 'rec-simple' projection -- not the main graph.

### Why this is awesome
1. You can spin up as many projections as you like, as often as you like and drop them when you're done.
2. Your actions on the projection will have no effect on the main graph (unless you 'write' them).
3. You can run more complex algorithms on subsets of nodes and relationships, rather than trying to analyse a subset within the entire database.

In short, you are _almost_ literally 'projecting' a _new_ sub-graph image into memory.

![Graph projection vis](../docs/projection.png)

## Running the Leiden algorithm

**What is it?:** Leiden is just one of of many 'community detection' algorithms you can run on your projected graph.

**What is 'community detection'?:** It is exactly as literal as it sounds: community detection finds natural communities extant within network structures.

**But what does that mean?:** 

Let's say: 

- Your graph contains two nodeLabels: 'Customer', 'Product'.
- You have 300 unique 'Customer' nodes, and two unique 'Product' nodes 'whole_milk' and 'skimmed_milk'.
- 100 <span style="color:yellow">(:Customers)</span> have <span style="color:yellow">-[:PURCHASED]-> (:Product{type:'whole_milk'})</span>
- 100 <span style="color:yellow">(:Customers)</span> have <span style="color:yellow">-[:PURCHASED]-> (:Product{type:'skimmed_milk'})</span>
- 100 have <span style="color:yellow">-[:Purchased]-></span> both.

Check out the video below to see that graph. More importantly, notice what happens when we pull the 'whole_milk' and 'skimmed_milk' entities away from each other.

<div style="padding:44.79% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1121959606?title=0&amp;byline=0&amp;portrait=0&amp;badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="milk_vid"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>

If the player doesn't work, feel free to watch here:
https://vimeo.com/1121959606

Initially, these groups may appear to be one gigantic glob. But, when we pull the central Product nodes away, we can see three distinct clusters emerge. The central cluster appears because it is being pulled with equal force in both directions.

Running the Leiden algorithm on this graph would likely create 3 communityId labels: One for customers who bought milk, another for those who bought skimmed_milk and another for those who bought both.

With a larger, more interconnected graph, the basic concept is the same -- but the distinct clusters are much more difficult to disentangle. This is where Leiden comes in.

Run the cell below to see what happens.

In [13]:
# You can run Leiden stats simply by writing the pattern 'gds.<algorithm>.stats('projectedGraphName')
gds.leiden.stats(
    G
)

ranLevels                                                                3
didConverge                                                           True
nodeCount                                                             9816
communityCount                                                           7
communityDistribution    {'min': 18, 'p5': 18, 'max': 3285, 'p999': 328...
modularity                                                        0.302786
modularities             [0.28962122804133794, 0.2997101911536924, 0.30...
preProcessingMillis                                                      0
computeMillis                                                          155
postProcessingMillis                                                     7
configuration            {'theta': 0.01, 'jobId': '5faf718d-24d8-4075-8...
Name: 0, dtype: object

You just ran the Leiden algorithm and got some stats about how it performed on your projection. 

For reference, here is the schema for the movie graph used for this talk:

![Schema](../docs/schema.png)


If we run leiden.stats on the Movie Graph used as an example throughout this talk, we get the following results:

- **ranLevels**                                                                4

- **didConverge**                                                           True

- **nodeCount**                                                             9816

- **communityCount**                                                           6

- **communityDistribution**    {'min': 18, 'p5': 18, 'max': 3457, 'p999': 345...}

- **modularity**                                                        0.292067

- **modularities**             [0.2812212567323293, 0.29131578214456594, 0.29...]

- **preProcessingMillis**                                                      0

- **computeMillis**                                                          155

- **postProcessingMillis**                                                     4

- **configuration**            {'theta': 0.01, 'jobId': '27b3becb-8335-480c-b...}

- **Name:** 0, **dtype:** object

Now that you have run this, you can essentially run any algorithm you like with GDS.

### **Remember the four options: Stats, Stream, Mutate, Write.**

There are four main options for implementing an algorithm:
- **Stats:** Tells you what the algorithm is likely to do when you run it for real.
- **Stream:** Gives you the actual results of a run, without writing _anything_ to either the main graph, or the projected graph.
- **Mutate:** Writes the results to your projected graph -- but it does not write to the main graph. Anything written by mutate, will disappear once you drop the graph.
- **Write:** Writes the results directly to the main graph. So, if you create new relationships or properties, they will persist, even after you drop the projected graph. 

It is usually a good idea to run each of these options in this order before committing to a 'write'. 

Otherwise you will end up with 20 different properties like 'leiden_community_final', 'leiden_community_final_finished', 'leiden_community_final_finished_final_7.11'.

When going through this section, it will be helpful to [have the docs side-by-side with you](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/):

### ***Stats***
Running 'stats' with any algorithm will provide you with...**drumroll**...statistics. They help you to understand how the algorithm has operated on your graph. 

You can use the statistics to inform your configuration settings.

For each algorithm, the kinds of statistics you get will differ.

In the next cell, you will re-run the basic stats algorithm we ran in the last cell. But afterwards, you'll understand why we would want to run it.

In [14]:
# Run this and then read the cell below to see why these matter.
gds.leiden.stats(
    G
)

ranLevels                                                                3
didConverge                                                           True
nodeCount                                                             9816
communityCount                                                           5
communityDistribution    {'min': 922, 'p5': 922, 'max': 3689, 'p999': 3...
modularity                                                        0.301945
modularities             [0.2865375297112447, 0.29896634117304205, 0.30...
preProcessingMillis                                                      0
computeMillis                                                           79
postProcessingMillis                                                     0
configuration            {'theta': 0.01, 'jobId': '9a2c5d84-563e-4db5-a...
Name: 0, dtype: object

### Basic Stats: Reference

The key metrics to bear in mind for Leiden are:  
- **ranLevels**: Tells you how many times Leiden refined the clusters before stopping.  
    - **Why consider this?:** ranLevels tells you how hard Leiden had to work before it converged. Neither a high or low number = 'good' or 'bad'. It is just a description. However, it can reveal issues in your graph structure, or point towards Leiden configurations to change.  

    - **What to do:** Consider this number in tandem with 'didConverge' and some other settings referenced in the next cell. On its own, ranLevels is just information.  
  
  
- **didConverge**: Tells you if the algorithm reach a stable state. A state is considered 'stable' when all nodes are in the 'local optimal partition'. Don't worry about what that means for now. Just think of it as 'nodes are in best place for nodes'.  
    - **Why consider this?:** If True, all good. If False, the algorithm did not converge within the allowable maxSteps, and you may want to change your settings.  

    - **What to do:** First increase maxSteps and run again to see if it converges. You can also consider changing other settings explored in the next section.  

- **nodeCount:** Shows how many nodes were considered for clustering.  
    - **Why consider this?:** Knowing this number can help you to identify errors in your configurations or projections. This number should match the number of nodes you intended to process.   

    - **What to do:** If your nodeCount does not match your expectations, first check your algorithm settings to ensure you have not filtered nodes out. If you have not filtered any nodes out, check your projection settings. Ensure you have included all nodes.  

- **communityCount**: Tells you how many distinct communities Leiden found in your graph.  
    - **Why consider this?:** Dissecting a network into communities is partly subjective. If you have a graph of 10 Million nodes, and you want to produce recommendations, 7 'communities' may not be enough to define relevant suggestions.  

    - **What to do:** Check the community distribution first (below). In our case, the largest community contains 3912 members. I think this is fine for recommendations. You may disagree -- it's really up to you. In the next section, you'll see how to configure Leiden to identify smaller, more granular communities, or larger, coarser communities.  

- **communityDistribution**: Shows you the distribution of cluster sizes. Our smallest cluster contains ~150 members. Our largest contains ~3,900. This is fine. Also, bear in mind, this will change slightly every time you run it. That is also fine.  
    - **Why consider this?:** The ratios of cluster sizes determine how meaningful your interpretations will be. If your smallest cluster has 5 nodes, and your largest cluster has 1 million, comparisons between them may prove meaningless. However, it depends on your dataset. Perhaps, those five people are the only five people in your dataset who compete in extreme ironing competitions.   

    - **What to do:** Bear in mind, in some contexts, you may sometimes get clusters of just 1 member ('singletons'). If you do see them, it can mean a few things:   
        - Those nodes may need more relationships added to them to connect them with the graph. 
        - They may be genuine outliers. 
        - Your modularity settings are too high.  

- **modularity:** 'Modularity' is a measure of the density difference between a cluster's nodes, and the entire cluster's connections to other clusters. If every Leiden community cluster in your graph only had connections between the nodes inside their own communities, you would have a modularity of 1. If every node in your graph was equally as connected to every other node, you would have a modularity of, or close to, 0.  
    - **Why consider this?:** Modularity is the most important metric to consider here. Ideally, you want higher modularity. There is, however, a realistic limit. For example, you might expect a network containing data scientists, zoo keepers and theatre troupes to have a relatively high modularity. If the network contained only members of European philharmonic orchestras, the modularity could be expected to be relatively lower. The latter group will by definition be more homogenous than the former.   

    - **What to do:** Consider your modularity score in tandem with the communityDistribution, communityCount and your expectations. Play around with the configuration to get the highest modularity you can, without creating unmanageably large clusters, or uselessly small ones.  


The next cell provides you with a MASTER Stats query. It contains _most_ configuration settings you might want to use.   

You are unlikely to need all of these stats at any one time. However, this should serve as a meaning and syntax reference, should you need one.  

In [15]:
# For your reference, here is a version of the algorithm with most configuration settings included. 
# You do not need to write these out every single time -- only the ones that you care about.
# Run this and then read the cell below to see why these settings matter.

gds.leiden.stats(
    G,
    gamma = 40,                                     # Increasing the gamma will push Leiden to break the communities into smaller clusters. Lowering it will allow Leiden to settle for larger clusters.
    theta = 0.01,                                    # Increasing theta will increase how randomly Leiden will smash up communities into smaller clusters during the refinement phase. Lowering it will reduce that randomness.
    randomSeed = 42,
    maxLevels = 5,                                  # maxLevels limits the number of times Leiden is allowed to refine the community structure again.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'myTestGraph_0001',
    includeIntermediateCommunities = False,
    logProgress = True
)

ranLevels                                                                3
didConverge                                                           True
nodeCount                                                             9816
communityCount                                                        1141
communityDistribution    {'min': 1, 'p5': 1, 'max': 184, 'p999': 120, '...
modularity                                                        -0.02298
modularities             [-0.023966193817701313, -0.023018756908473537,...
preProcessingMillis                                                      0
computeMillis                                                           73
postProcessingMillis                                                     0
configuration            {'randomSeed': 42, 'theta': 0.01, 'jobId': 'my...
Name: 0, dtype: object

And that's it -- you now have everything you need to compose, configure, run and interpret Leiden.

If you want to understand what the rest of these configuration settings do, [check out the Leiden docs](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/).

For now, before you move on, play around with the gamma, theta and maxLevels settings to see how the modularity and community distribution change.

### ***Stream***

Streaming will return the results of the algorithm as Cypher result rows. 

You can use this to get a more granular look at what's happening under the hood.

Run the query below, and see what happens.

In [16]:
gds.leiden.stream(
    G,
    gamma = 40,                                     # Increasing the gamma will push Leiden to break the communities into smaller clusters. Lowering it will allow Leiden to settle for larger clusters.
    theta = 0.01,                                    # Increasing theta will increase how randomly Leiden will smash up communities into smaller clusters during the refinement phase. Lowering it will reduce that randomness.
    randomSeed = 42,
    maxLevels = 5,                                  # maxLevels limits the number of times Leiden is allowed to refine the community structure again.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'myTestGraph_0001',
    includeIntermediateCommunities = False,
    logProgress = True
)

Unnamed: 0,nodeId,intermediateCommunityIds,communityId
0,0,,185
1,1,,1110
2,2,,1238
3,3,,948
4,4,,1107
...,...,...,...
9811,9811,,38
9812,9812,,323
9813,9813,,825
9814,9814,,815


Here, we can see which node Ids were placed into which communities. 

If you wanted to, you could grab a specific nodeId, or a collection of them, for inspection.

### ***Mutate***

Mutate will write a property to the in-memory (projected) graph. You will then be able to reference that property directly.

In [17]:
gds.leiden.mutate(
    G,
    mutateProperty = 'communityId',
    randomSeed = 42,
    gamma = 40,                                     # Increasing the gamma will push Leiden to break the communities into smaller clusters. Lowering it will allow Leiden to settle for larger clusters.
    theta = 0.01,                                    # Increasing theta will increase how randomly Leiden will smash up communities into smaller clusters during the refinement phase. Lowering it will reduce that randomness.
    maxLevels = 5,                                  # maxLevels limits the number of times Leiden is allowed to refine the community structure again.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'myTestGraph_0001',
    includeIntermediateCommunities = False,
    logProgress = True
)

ranLevels                                                                3
didConverge                                                           True
nodeCount                                                             9816
communityCount                                                        1138
preProcessingMillis                                                      0
computeMillis                                                           76
postProcessingMillis                                                     0
mutateMillis                                                             0
nodePropertiesWritten                                                 9816
communityDistribution    {'min': 1, 'p5': 1, 'max': 184, 'p999': 120, '...
modularities             [-0.023970956748866366, -0.023032926939405576,...
modularity                                                       -0.023019
configuration            {'randomSeed': 42, 'mutateProperty': 'communit...
Name: 0, dtype: object

### Note
If you run mutate, you will not be able to set the _same_ property again in the same projection. You _can_ however set a _new_ property. They will all disappear when you drop the graph.

If you really wanted to use only one property name, you could [drop the graph and then spin it up again from scratch](#helpers).

### ***Write***

The next cell will grab the referenced nodeProperties in your projected graph, and write them back to the nodes in your 'real' graph.

In [18]:
# Write properties
gds.graph.nodeProperties.write(G, ["communityId"])

writeMillis                                                         73
graphName                                                   rec-simple
nodeProperties                                           [communityId]
propertiesWritten                                                 9816
configuration        {'jobId': '2bcfe323-db17-45d8-ac1b-d15f41c9959...
Name: 0, dtype: object

The output just tells you:
- **writeMillis:** How long it took to write
- **graphName:** Which in-memory graph we mutated these properties from
- **nodeProperties:** Which node properties were written
- **propertiesWritten:** How many node properties were written

### ***Use your new properties***

So, that's it. We have now:
- Projected a graph into memory
- Run the Leiden algorithm to discover communities
- Assigned our real nodes to their corresponding communities

We can now use these properties in the same ways as any other property.

In [19]:
# Let's ensure the property keys were actually written
gds.run_cypher(
    """
    MATCH (n) WHERE 'n.communityId' IN keys(n)
    RETURN n.name, n.communityId LIMIT 10
    """
)

Unnamed: 0,n.name,n.communityId


In [20]:
# Now let's get a sample of who got assigned to which community
query = """
    CALL gds.graph.nodeProperties.stream('rec-simple', 'communityId')
    YIELD nodeId, propertyValue
    WITH gds.util.asNode(nodeId).name AS node, propertyValue AS communityId
    WITH communityId, collect(node) AS members
    WITH communityId, members, size(members) AS communitySize
    RETURN communityId, communitySize, members
    ORDER BY communitySize DESC
"""

communities = gds.run_cypher(query)
communities

Unnamed: 0,communityId,communitySize,members
0,545,4,"[Daniel Martin, John Thomas, Carolyn Velasquez..."
1,553,3,"[Brittany Fox, Cynthia Sparks, James Francis]"
2,526,2,"[Shelley Booth, Jason Pierce]"
3,637,2,"[Nathan Hall, Joseph Berry]"
4,1198,2,"[Luke Myers, Keith Howell]"
...,...,...,...
1133,1152,0,[]
1134,1153,0,[]
1135,1154,0,[]
1136,1155,0,[]


In [21]:
# Let's double-check that every node got assigned to a community
query = """
    MATCH (n)
    WHERE n.communityId IS NOT NULL
    RETURN count(n)
"""

communities = gds.run_cypher(query)
communities

Unnamed: 0,count(n)
0,9816


## Using the properties in Python

Now we can query the properties normally from the graph database and use them in any way we wish.

First, let's make a dataframe of the nodes and their corresponding communityIds.

In [22]:
# Make a dataframe of each node and its communityId
import pandas as pd

node_props = gds.run_cypher("""
MATCH (n)
WHERE n.communityId IS NOT NULL
RETURN n AS node, n.communityId AS communityId
""")

node_props_df = pd.DataFrame(node_props)
node_props_df.head()

Unnamed: 0,node,communityId
0,"(languages, year, imdbId, runtime, imdbRating,...",1160
1,"(name, communityId)",143
2,"(name, communityId)",108
3,"(name, communityId)",70
4,"(name, communityId)",139


We included both Movie _and_ Person nodes in our Leiden pass. Let's separate them back into their types, so we can reference them independently.

In [23]:
# Separate into users and movies
grouped = (
    node_props_df
    .groupby("communityId")
    .agg({
        "node": lambda nodes: {
            "users": [n for n in nodes if "User" in n.labels],
            "movies": [n for n in nodes if "Movie" in n.labels]
        }
    })
    .reset_index()
)

# Keep only non-empty groups
grouped = grouped[grouped["node"].apply(lambda d: len(d["users"]) > 0 and len(d["movies"]) > 0)]
grouped.head(10)

Unnamed: 0,communityId,node
0,0,"{'users': [('name', 'communityId', 'userId')],..."
1,1,"{'users': [('name', 'communityId', 'userId')],..."
2,2,"{'users': [('name', 'communityId', 'userId')],..."
4,4,"{'users': [('name', 'communityId', 'userId')],..."
5,5,"{'users': [('name', 'communityId', 'userId')],..."
6,6,"{'users': [('name', 'communityId', 'userId')],..."
7,7,"{'users': [('name', 'communityId', 'userId')],..."
9,9,"{'users': [('name', 'communityId', 'userId')],..."
10,10,"{'users': [('name', 'communityId', 'userId')],..."
11,11,"{'users': [('name', 'communityId', 'userId')],..."


Now let's match those users and movies who ended up in the same communities. Hypothetically, this should group them by movies they might watch -- if not necessarily enjoy.

In [24]:
ratings = gds.run_cypher("""
MATCH (u:User)-[ra:RATED]->(m:Movie)
WHERE u.communityId IS NOT NULL AND m.communityId IS NOT NULL
AND u.communityId = m.communityId
RETURN id(u) AS user_id, id(m) AS movie_id, ra.rating AS rating, u.communityId AS communityId, m.title AS title
""")

ratings_df = pd.DataFrame(ratings)
ratings.head()


Unnamed: 0,user_id,movie_id,rating,communityId,title
0,9145,1684,4.0,690,Tron
1,9145,852,3.0,690,Dumbo
2,9145,1036,2.0,690,"Deer Hunter, The"
3,9146,463,3.0,599,Much Ado About Nothing
4,9146,355,3.0,599,Reality Bites


Now, if you wanted to, you could work with these entries directly from a pandas dataframe.

However, let's make a simple recommendation query. The following Cypher will generate a new dataframe providing us with each movie's communityId and it's average rating from within that community.

In [25]:
movie_stats = gds.run_cypher("""
MATCH (u:User)-[ra:RATED]->(m:Movie)                              //Find users who rated movies and the movies they rated.
WHERE u.communityId IS NOT NULL AND m.communityId IS NOT NULL     //But only those who have a communityId.
AND u.communityId = m.communityId                                 //And, of those, only match users and movies whose communityIds are the same.
                             
WITH m, avg(ra.rating) AS avg_rating, count(ra) AS num_ratings    //Get the average rating per movie (from within the community) and the total number of ratings.

RETURN id(m) AS movie_id, 
                m.title AS title, 
                m.communityId AS communityId,
                avg_rating, 
                num_ratings

ORDER BY avg_rating DESC
""")

movie_stats_df = pd.DataFrame(movie_stats)
movie_stats.head()

Unnamed: 0,movie_id,title,communityId,avg_rating,num_ratings
0,2462,Meatballs,684,5.0,1
1,1670,Return from Witch Mountain,684,5.0,1
2,1665,One Magic Christmas,684,5.0,1
3,1301,My Own Private Idaho,801,5.0,1
4,1874,Runaway Train,801,5.0,1


Imagine our platform serves both blockbuster franchise movies and quirky, niche arthouse movies. 

We want to adjust the ratings so that a movie with 1000 reviews does not aggressively push our users into a bottomless bucket of blockbusters.

To do that, let's make the movie with the fewest ratings in each community that community's baseline, and proportionally adjust all other scores within the community to that.

In [26]:
adjusted_list = []

for cid, group in movie_stats_df.groupby("communityId"):
    min_count = group["num_ratings"].min()
    group = group.copy()
    group["proportional_score"] = group["avg_rating"] * (min_count / group["num_ratings"])
    adjusted_list.append(group)

adjusted_movies_df = pd.concat(adjusted_list, ignore_index=True)
adjusted_movies_df.head()


Unnamed: 0,movie_id,title,communityId,avg_rating,num_ratings,proportional_score
0,5938,Battlestar Galactica,0,4.5,1,4.5
1,5864,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,0,4.0,1,4.0
2,6406,Inside Man,0,4.0,1,4.0
3,6883,Battlestar Galactica: Razor,0,4.0,1,4.0
4,2237,Bowfinger,0,3.0,1,3.0


Now, for each community let's get the top 5 movies that could be recommended by their new proportional scores.

In [27]:
top_movies = (
    adjusted_movies_df
    .sort_values(["communityId", "proportional_score"], ascending=[True, False])
    .groupby("communityId")
    .head(5)
)

top_movies.head()


Unnamed: 0,movie_id,title,communityId,avg_rating,num_ratings,proportional_score
0,5938,Battlestar Galactica,0,4.5,1,4.5
1,5864,Patlabor: The Movie (Kidô keisatsu patorebâ: T...,0,4.0,1,4.0
2,6406,Inside Man,0,4.0,1,4.0
3,6883,Battlestar Galactica: Razor,0,4.0,1,4.0
4,2237,Bowfinger,0,3.0,1,3.0


This could work. However, we're at risk of recommending movies that users have already watched over and over again. Let's filter our recommendations by what users have already watched. 

In [28]:
user_watched_movies = ratings_df.groupby("user_id")["movie_id"].apply(set).to_dict()

And, finally, let's recommend the top 3 movies to each user of each community, including only those movies they have not yet rated.

In [29]:
user_recs = []

for _, row in grouped.iterrows():
    cid = row["communityId"]
    users = row["node"]["users"]
    top = top_movies[top_movies["communityId"] == cid]

    for u in users:
        user_id = u.element_id  # or id(u) depending on driver
        watched = user_watched_movies.get(user_id, set())

        movies_with_flags = [
            {
                "title": m["title"],
                "avg_rating": m["avg_rating"],
                "num_ratings": m["num_ratings"],
                "proportional_score": m["proportional_score"],
                "watched": m["movie_id"] in watched
            }
            for _, m in top.iterrows()
        ]

        recs = [m for m in movies_with_flags if not m["watched"]][:3]

        if recs:
            user_recs.append({
                "communityId": cid,
                "user_name": u["name"],
                "movie_rec": recs,
                "top_5": movies_with_flags[:5]
            })

user_recs_df = pd.DataFrame(user_recs)
user_recs_df.sort_values(by=['user_name'])
user_recs_df.head(20)


Unnamed: 0,communityId,user_name,movie_rec,top_5
0,0,Cynthia Garcia,"[{'title': 'Battlestar Galactica', 'avg_rating...","[{'title': 'Battlestar Galactica', 'avg_rating..."
1,1,Francis Clarke,"[{'title': 'Great Escape, The', 'avg_rating': ...","[{'title': 'Great Escape, The', 'avg_rating': ..."
2,2,Elizabeth Sanchez,"[{'title': 'Black Knight', 'avg_rating': 4.5, ...","[{'title': 'Black Knight', 'avg_rating': 4.5, ..."
3,4,Douglas Parsons,"[{'title': 'It Could Happen to You', 'avg_rati...","[{'title': 'It Could Happen to You', 'avg_rati..."
4,5,Julie Nunez,"[{'title': 'Blow Dry (a.k.a. Never Better)', '...","[{'title': 'Blow Dry (a.k.a. Never Better)', '..."
5,6,Mrs. Deanna Garcia PhD,"[{'title': 'Devil and Daniel Johnston, The', '...","[{'title': 'Devil and Daniel Johnston, The', '..."
6,7,Michael Mendoza,"[{'title': 'Squid and the Whale, The', 'avg_ra...","[{'title': 'Squid and the Whale, The', 'avg_ra..."
7,9,Valerie Jackson,"[{'title': 'Sophie's Choice', 'avg_rating': 4....","[{'title': 'Sophie's Choice', 'avg_rating': 4...."
8,10,Denise Brown,"[{'title': 'Alice in Wonderland', 'avg_rating'...","[{'title': 'Alice in Wonderland', 'avg_rating'..."
9,11,Shelby Joseph,[{'title': 'Sunset Blvd. (a.k.a. Sunset Boulev...,[{'title': 'Sunset Blvd. (a.k.a. Sunset Boulev...


## Projecting a graph from a dataframe

It is also possible to project a graph into memory from a pandas dataframe. Our original dataframe 'ratings_df' contained User and Movie nodes. Let's turn it back into an undirected graph in memory.

First we'll assign our nodes.

In [30]:
user_nodes = pd.DataFrame().assign(
    nodeId=ratings_df["user_id"].unique(),
    labels="User",
    communityId=ratings_df.groupby("user_id")["communityId"].first()
)

movie_nodes = pd.DataFrame().assign(
    nodeId=ratings_df["movie_id"].unique(),
    labels="Movie",
    communityId=ratings_df.groupby("user_id")["communityId"].first()
)

nodes = pd.concat([user_nodes, movie_nodes], ignore_index=True)

Next, we'll create the relationships.

In [31]:
relationships = pd.DataFrame().assign(
    sourceNodeId=ratings_df["user_id"],
    targetNodeId=ratings_df["movie_id"],
    relationshipType="RATED", 
    rating=ratings_df["rating"],
)

Finally, we'll project the graph into memory, using the 'construct' method.

In [32]:
G = gds.graph.construct("pandas-to-graph", nodes, relationships, undirected_relationship_types=['*'])

You can do this with any pandas dataframe. 

Remember, this new projection actually has no direct relationship to our original graph. It has been constructed entirely from the entries in our dataframe.

However, now that it is in memory, we can manipulate it just like any other projection.

In [33]:
gds.leiden.stream(
    G,
    gamma = 40,
    theta = 0.01,
    randomSeed = 42,
    maxLevels = 5,
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'pandas_graph',
    includeIntermediateCommunities = False,
    logProgress = True
)

Unnamed: 0,nodeId,intermediateCommunityIds,communityId
0,9145,,211
1,9146,,212
2,9147,,213
3,9148,,214
4,9149,,215
...,...,...,...
8054,4531,,483
8055,4570,,483
8056,4390,,483
8057,2646,,483


In [34]:
# Now let's drop that graph.
G.drop()

graphName                                                  pandas-to-graph
database                                     recommendations-embeddings-50
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             8059
relationshipCount                                                    14958
configuration            {'readConcurrency': 4, 'undirectedRelationship...
density                                                            0.00023
creationTime                           2025-09-27T19:04:11.228921000+01:00
modificationTime                       2025-09-27T19:04:11.228921000+01:00
schema                   {'graphProperties': {}, 'nodes': {'User': {'co...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'User': {'co...
Name: 0, dtype: object

There's plenty more to discover in the [GDS Client Docs](https://neo4j.com/docs/graph-data-science-client/current/).

For a hands-on walkthrough of creating node embeddings with FastRP, go to 2_gds-client-fastrp.ipynb.

If you'd like to see a bunch of algorithms you can use in one place, check out 3_gds-client-algorithm-bank.ipynb