### **GDS Algo Bank**

Below, you will find the full syntax for a bunch of algorithms that are useful for:
- Recommendations
- Service checking

In most cases, you will get the gds.{algorithm}.stream version of each. However, you can easily switch these to gds.{algorithm}.stats, gds.{algorithm}.mutate, or gds.{algorithm}.write as needed.

Alongside each query, you will find: 
- An explanation of what it does and how it works.
- A glossary, explaining what each configuration setting does.
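
To make the switch between stream, stats, mutate, and write concrete, here is a minimal sketch using Louvain. The helper `run_louvain` and the property name `louvainCommunity` are made up for illustration; `gds` and `G` are assumed to be the client and the projected graph created in the cells further down.

```python
def run_louvain(gds, G, mode="stream", property_name="louvainCommunity"):
    """Run Louvain on the projection G in the requested execution mode."""
    if mode == "stream":
        # One row per node comes back as a DataFrame; nothing is stored
        return gds.louvain.stream(G)
    if mode == "stats":
        # Summary statistics only, such as communityCount
        return gds.louvain.stats(G)
    if mode == "mutate":
        # Stores the result as a node property on the in-memory projection
        return gds.louvain.mutate(G, mutateProperty=property_name)
    if mode == "write":
        # Stores the result as a node property in the Neo4j database itself
        return gds.louvain.write(G, writeProperty=property_name)
    raise ValueError(f"Unknown execution mode: {mode}")
```

Once the projection exists, `run_louvain(gds, G, mode="stats")` would return the summary without touching the graph, which makes it a cheap way to check a configuration before you mutate or write.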

First, let's connect these to a graph. The cells below are set up to use the 'recommendations-embeddings-50' dump used throughout the previous two notebooks. With slight modifications, they should work for your database too.

In [None]:
from graphdatascience import GraphDataScience
from graphdatascience import ServerVersion

# Use Neo4j URI and credentials according to your setup
# NEO4J_URI could look similar to "bolt://my-server.neo4j.io:7687"

NEO4J_URI = "your_uri"                # If on a local instance, something like "neo4j://127.0.0.1:7687"
NEO4J_USER = "your_username"          # Probably "neo4j"
NEO4J_PASSWORD = "your_password" 

gds = GraphDataScience(
  NEO4J_URI,
  auth=(
    NEO4J_USER,
    NEO4J_PASSWORD
  ),
  database = "database_name"          # If using the movies dump, it is likely "recommendations-embeddings-50"
)

# Check the installed GDS version on the server
print(f"All systems are go. GDS Version: {gds.server_version()}")
assert gds.server_version() >= ServerVersion(1, 8, 0)

All systems are go. GDS Version: 2.21.0


In [3]:
# Clean up an existing 'algo-bank' graph
gds.graph.drop('algo-bank')

graphName                                                        algo-bank
database                                     recommendations-embeddings-50
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             9816
relationshipCount                                                   240688
configuration            {'relationshipProjection': {'RATED': {'aggrega...
density                                                           0.002498
creationTime                           2025-09-27T15:47:37.949447000+01:00
modificationTime                       2025-09-27T17:50:00.139976000+01:00
schema                   {'graphProperties': {}, 'nodes': {'User': {'lp...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'User': {'lp...
Name: 0, dtype: object

In [4]:
# We define how we want to project our database into GDS
node_projection = ['User', 'Movie', 'Genre']
relationship_projection = {'RATED': {'orientation': 'UNDIRECTED'}, 'IN_GENRE': {'orientation': 'UNDIRECTED'}}

# The memory requirement for this small graph is low, so we can go ahead with the projection
G, result = gds.graph.project('algo-bank', node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

The projection took 24 ms
Graph 'algo-bank' node count: 9816
Graph 'algo-bank' node labels: ['User', 'Movie', 'Genre']


## Community Detection Algorithms

**What do they do?:**
Find natural communities within network data.

**Why would I want to use one?:**
Allows you to find clusters of nodes that share similar behavioural patterns.

**Fine, but why though?:**
You could use community detection to detect:
- Users who demonstrate similar behaviours to known fraudulent users you have already flagged.
- Users who demonstrate similar engagement behaviours with products to generate:
    - Recommendations
    - Targeted advertising
- Relationship and behavioural groupings for disambiguation

Any case in which it would be useful to group entities into behaviourally thematic clusters would benefit from community detection.

### Louvain
This is probably the most famous one. Like Leiden, it finds node communities in the network by moving nodes into different communities and checking for improved modularity. In the second step, those clusters are treated as discrete nodes and re-clustered. Again, Leiden does the same thing. 

The difference is that Leiden breaks up large clusters and refines, providing more granular clusters. Louvain can sometimes end up providing one or two gigantic clusters, with a few smaller clusters hanging off them.
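
To give a feel for the modularity score that both algorithms try to improve, here is a small plain-Python sketch. It uses the textbook definition for an undirected, unweighted graph and a made-up six-node graph; it does not call GDS and is not meant to reproduce the GDS implementation.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal_edges = defaultdict(int)  # relationships with both ends in one community
    total_degree = defaultdict(int)    # sum of node degrees per community
    for u, v in edges:
        total_degree[community_of[u]] += 1
        total_degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal_edges[community_of[u]] += 1
    return sum(
        internal_edges[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Two triangles joined by a single bridge relationship between nodes 2 and 3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

singletons = {node: node for node in range(6)}
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_blob = {node: "all" for node in range(6)}

for name, partition in [
    ("singletons", singletons),
    ("two triangles", two_triangles),
    ("one blob", one_blob),
]:
    print(f"{name:>13}: Q = {modularity(edges, partition):.3f}")
```

The two-triangle split scores highest (about 0.357), the single blob scores 0, and the singletons score below zero. Moving nodes between communities and keeping only the moves that raise this score is the basic idea behind the first step described above.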

#### **Why would I use it?**
If you have an extremely large dataset, which is likely to contain massive communities, Louvain could be a good first step. For instance, you could identify the largest component of users who provide the most revenue to the company. You could then further break that community down using Leiden -- or another Louvain pass with a tighter configuration.

Louvain can also operate on DIRECTED relationship types -- Leiden cannot. If directionality is important for your use-case, you would opt for Louvain.

For more info, [check out Louvain in the GDS docs](https://neo4j.com/docs/graph-data-science/current/algorithms/louvain/).

In [5]:
gds.louvain.stream(
    G,
    maxLevels = 5,                                  # maxLevels limits the number of levels in which Louvain clusters the graph and then condenses it.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'louvain_0001',
    includeIntermediateCommunities = False,
    minCommunitySize=10,                             # Louvain can sometimes create a lot of singleton communities. This won't prevent that from happening, but in stream mode it leaves nodes in communities smaller than this out of the results.
    logProgress = True
)

Unnamed: 0,nodeId,intermediateCommunityIds,communityId
0,0,,9309
1,1,,51
2,2,,51
3,3,,51
4,4,,9691
...,...,...,...
9811,9811,,157
9812,9812,,9549
9813,9813,,23
9814,9814,,157


### Leiden
Does the same thing as Louvain, but includes an extra step to randomly break up clusters during refinement. This has been shown to prevent giant components from dominating the resulting communities.

_**Note:** You can run Leiden only on UNDIRECTED relationships._

#### **Why would I use it?**
There is an element of personal preference and intent here. That said, Leiden is kind of 'Louvain but better' in most respects. For a business use-case, Leiden is:
- More reliable
- Capable of identifying more granular communities
- More consistent

You _may_ want to use Louvain to identify large components in your graph where directionality is important, and then use Leiden to identify the more granular components within the larger ones. That's not a recommendation; you could likely achieve most Louvain outcomes with Leiden.

For more info, [check out Leiden in the GDS docs](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/).
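
The effect of a resolution parameter such as gamma can be sketched in plain Python. This uses the textbook form of modularity with a resolution term, on a made-up graph of two triangles joined by a bridge. GDS applies its own internal scaling to gamma, so treat the numbers here as an illustration of the direction of the effect only, not as values you can compare with the gamma settings used in this notebook.

```python
def modularity_with_resolution(edges, community_of, gamma=1.0):
    """Textbook modularity with a resolution parameter (undirected, unweighted)."""
    m = len(edges)
    score = 0.0
    for community in set(community_of.values()):
        internal = sum(
            1 for u, v in edges if community_of[u] == community_of[v] == community
        )
        degree = sum(
            (community_of[u] == community) + (community_of[v] == community)
            for u, v in edges
        )
        # A higher gamma makes the size penalty heavier, favouring smaller communities
        score += internal / m - gamma * (degree / (2 * m)) ** 2
    return score


edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
partitions = {
    "one community": {node: 0 for node in range(6)},
    "two triangles": {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1},
    "singletons": {node: node for node in range(6)},
}


def best_partition(gamma):
    """Name of the candidate partition with the highest score at this gamma."""
    return max(
        partitions,
        key=lambda name: modularity_with_resolution(edges, partitions[name], gamma),
    )


for gamma in (0.1, 1.0, 10.0):
    print(f"gamma = {gamma:>4}: best partition is '{best_partition(gamma)}'")
```

On this toy graph a low gamma favours one big community, the default of 1.0 favours the two triangles, and a high gamma favours singletons. At high gamma every candidate scores below zero, so a negative modularity is not surprising when gamma is pushed up to force very small communities.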

In [6]:
# Leiden
gds.leiden.stream(
    G,
    gamma = 40,                                     # Increasing the gamma will push Leiden to break the communities into smaller clusters. Lowering it will allow Leiden to settle for larger clusters.
    theta = 0.01,                                    # Increasing theta will increase how randomly Leiden will smash up communities into smaller clusters during the refinement phase. Lowering it will reduce that randomness.
    randomSeed = 42,
    maxLevels = 5,                                  # maxLevels limits the number of times Leiden is allowed to refine the community structure again.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'leiden_00001',
    includeIntermediateCommunities = False,
    logProgress = True
)

Unnamed: 0,nodeId,intermediateCommunityIds,communityId
0,0,,277
1,1,,883
2,2,,849
3,3,,555
4,4,,879
...,...,...,...
9811,9811,,195
9812,9812,,87
9813,9813,,1150
9814,9814,,61


### Label Propagation
Label propagation does the following:
- Provides a unique id _('label')_ to every node in the graph
- Iteratively 'propagates' each node's label to its neighbours
- At each iteration, every node counts the labels held by its neighbours and adopts the most common one. Labels that lose out gradually disappear, so densely connected groups of nodes converge on a shared label.
- Iterations continue until each node has the majority label of its neighbours or when we reach a maximum number of user-defined iterations.
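
The steps above can be sketched in a few lines of plain Python. This is a simplified, deterministic version written for illustration: the update order, the tie-breaking rule, and the toy graph (two four-node cliques joined by one bridge) are all assumptions of the sketch, and GDS will not necessarily behave the same way.

```python
from collections import Counter


def label_propagation(adjacency, seeds=None, max_iterations=10):
    """Simplified label propagation; seeds maps node -> initial label."""
    # Seeded nodes start with their seed label; all others start with their own id
    labels = {node: (seeds or {}).get(node, node) for node in adjacency}
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adjacency):  # fixed order keeps the sketch deterministic
            if not adjacency[node]:
                continue  # isolated nodes keep their label
            counts = Counter(labels[neighbour] for neighbour in adjacency[node])
            best = max(counts.values())
            candidates = [label for label, count in counts.items() if count == best]
            # Tie-break (an assumption of this sketch): keep the current label
            # if it is tied for first place, otherwise take the smallest label
            new_label = labels[node] if labels[node] in candidates else min(candidates)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break  # every node already holds a majority label of its neighbours
    return labels


edges = [
    (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),  # first clique
    (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),  # second clique
    (3, 4),                                          # bridge
]
adjacency = {node: [] for node in range(8)}
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

unseeded = label_propagation(adjacency)
seeded = label_propagation(adjacency, seeds={0: 10, 1: 10, 6: 20, 7: 20})
print("unseeded:", unseeded)
print("seeded:  ", seeded)
```

With this tie-breaking rule the unseeded run floods a single label across the bridge and ends with one community, while the seeded run settles on one label per clique. Outcomes like this depend heavily on starting labels and tie-breaking, which is why seeding with known communities can be worthwhile.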

#### **Why would I use it?**
- Label Propagation is faster and less memory intensive than Louvain or Leiden.
- If your network updates frequently -- social media, for example -- Label Propagation is cheap enough to re-run regularly and keep communities up to date.
- If you have already identified some extant communities, LP allows you to pre-label them and have Label Propagation start from there.

For more info, [check out Label Propagation in the GDS docs](https://neo4j.com/docs/graph-data-science/current/algorithms/label-propagation/). 

In [7]:
# Label Propagation
gds.labelPropagation.stats(
    G,
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = "lp_0001"
)

ranIterations                                                            4
didConverge                                                           True
communityCount                                                           1
communityDistribution    {'min': 9816, 'p5': 9816, 'max': 9816, 'p999':...
preProcessingMillis                                                      0
computeMillis                                                          147
postProcessingMillis                                                     7
configuration            {'nodeWeightProperty': None, 'jobId': 'lp_0001...
Name: 0, dtype: object

In [8]:
# Label Propagation
gds.labelPropagation.stream(
    G,
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = "lp_0001"
)

Unnamed: 0,nodeId,communityId
0,0,1
1,1,1
2,2,1
3,3,1
4,4,1
...,...,...
9811,9811,1
9812,9812,1
9813,9813,1
9814,9814,1


As an example, let's run leiden.mutate with a high gamma to get really small communities. Then, we can run LP, using those communities as initial seeds to get the ball rolling.

In [9]:
gds.leiden.mutate(
    G,
    mutateProperty = 'lp-com',
    randomSeed = 42,
    gamma = 70,                                     # Increasing the gamma will push Leiden to break the communities into smaller clusters. Lowering it will allow Leiden to settle for larger clusters.
    theta = 0.01,                                    # Increasing theta will increase how randomly Leiden will smash up communities into smaller clusters during the refinement phase. Lowering it will reduce that randomness.
    maxLevels = 5,                                  # maxLevels limits the number of times Leiden is allowed to refine the community structure again.
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = 'myTestGraph_0001',
    includeIntermediateCommunities = False,
    logProgress = True
)

ranLevels                                                                3
didConverge                                                           True
nodeCount                                                             9816
communityCount                                                        2401
preProcessingMillis                                                      0
computeMillis                                                           79
postProcessingMillis                                                     0
mutateMillis                                                             0
nodePropertiesWritten                                                 9816
communityDistribution    {'min': 1, 'p5': 1, 'max': 171, 'p999': 91, 'p...
modularities             [-0.08945154880501244, -0.08882954562362515, -...
modularity                                                       -0.088814
configuration            {'randomSeed': 42, 'mutateProperty': 'lp-com',...
Name: 0, dtype: object

In [10]:
# Label Propagation
gds.labelPropagation.stream(
    G,
    nodeLabels = ['*'],
    relationshipTypes = ['*'],
    concurrency = 4,
    jobId = "lp_0001",
    seedProperty = "lp-com"
)

Unnamed: 0,nodeId,communityId
0,0,1966
1,1,1966
2,2,1966
3,3,1966
4,4,1966
...,...,...
9811,9811,1966
9812,9812,1966
9813,9813,1966
9814,9814,1966


The process for running other algorithms is essentially the same, and there are many to choose from. Here are some more common examples and their uses:

[**Weakly Connected Components:**](https://neo4j.com/docs/graph-data-science/current/algorithms/wcc/) Any two nodes that are linked by a path, ignoring relationship direction, end up in the same component. It's useful for finding completely disconnected sections of the graph.

[**Strongly Connected Components:**](https://neo4j.com/docs/graph-data-science/current/algorithms/strongly-connected-components/) Finds the largest groups of nodes in a directed graph in which every node can reach every other node by following relationship direction. Can be used to compute the connectivity of different network configurations when measuring routing performance in multihop wireless networks.

[Check out Community Detection in the docs](https://neo4j.com/docs/graph-data-science/current/algorithms/community/) to discover all of the algorithms at your disposal. 
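
To see how the two component algorithms differ, here is a plain-Python sketch on a hypothetical six-node directed graph. It uses a simple reachability check rather than the algorithms GDS implements, so it only suits tiny examples.

```python
def reachable(start, neighbours):
    """Return every node reachable from start by following neighbours."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def components(nodes, edges, strong):
    """Weakly (strong=False) or strongly (strong=True) connected components."""
    outgoing = {node: set() for node in nodes}
    incoming = {node: set() for node in nodes}
    for source, target in edges:
        outgoing[source].add(target)
        incoming[target].add(source)
    if not strong:
        # Weak connectivity ignores direction, so treat every relationship as two-way
        outgoing = incoming = {node: outgoing[node] | incoming[node] for node in nodes}
    found, assigned = [], set()
    for node in nodes:
        if node in assigned:
            continue
        # Nodes that this node can reach AND that can reach it back
        component = reachable(node, outgoing) & reachable(node, incoming)
        found.append(sorted(component))
        assigned |= component
    return found


nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "f")]

weak = components(nodes, edges, strong=False)
strong = components(nodes, edges, strong=True)
print("weakly connected:  ", weak)
print("strongly connected:", strong)
```

Ignoring direction gives two components, `a-b-c-d` and `e-f`. Respecting direction, only the cycle `a-b-c` holds together; `d`, `e`, and `f` each end up alone because nothing leads back from them.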

## Centrality Algorithms

### PageRank
PageRank is -- or was -- the original algorithm underlying Google search. It measures the importance of each node in a graph relative to the number of incoming relationships it has, and the importance of the nodes providing those relationships.

#### **Why would I use it?**
While its uses for web search are clear, you could apply PageRank to a bunch of use cases. For instance, you could apply it to our movie graph to find the most influential users by ratings. Let's say 'Marvin' has rated thousands of movies in the graph and most are poorly connected to other movies. 'Tom Bombadil' has only rated six movies. Those six movies that 'Tom Bombadil' has rated happen to be connected to other highly-rated movies throughout the network. In PageRank, we might expect Tom Bombadil to have more weight than Marvin.

In short, PageRank cares more about the quality of connections and the strength of their interconnectivity than the total number of connections overall. You can use it to identify the most important entity in any network. Other use cases include:
- **Fraud detection:** Flag accounts connected to influential known fraudsters.
- **Academia:** Identify the most influential researchers or papers in a citation network.
- **Social media influence:** Identify the most influential 'influencers' in a social media network.

Its uses are endless. 

For more info, [check out PageRank in the GDS docs](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/).
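
The 'quality over quantity' idea can be sketched with a small power-iteration version of PageRank in plain Python. This is the classic formulation with scores that sum to 1, run on a made-up directed graph with hypothetical node names. The GDS scores shown in this notebook are clearly not normalised that way, so expect different absolute values from `gds.pageRank`.

```python
def pagerank(edges, damping=0.85, max_iterations=100, tolerance=1e-9):
    """Classic PageRank by power iteration; returns scores that sum to 1."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    count = len(nodes)
    scores = {node: 1.0 / count for node in nodes}
    for _ in range(max_iterations):
        # Score held by dead-end nodes is shared out evenly so none of it is lost
        dangling = sum(scores[node] for node in nodes if not out_links[node])
        updated = {
            node: (1 - damping) / count + damping * dangling / count for node in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                # Each node splits its score evenly across its outgoing relationships
                updated[target] += damping * scores[source] / len(out_links[source])
        delta = max(abs(updated[node] - scores[node]) for node in nodes)
        scores = updated
        if delta < tolerance:
            break
    return scores


edges = [
    ("marvin", "m1"), ("marvin", "m2"), ("marvin", "m3"),  # many links out, none in
    ("x1", "hub"), ("x2", "hub"), ("x3", "hub"),           # hub is widely referenced
    ("hub", "tom"),                                        # tom has one strong link in
]
scores = pagerank(edges)
for node, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{node:>7}: {score:.4f}")
```

Here `tom` has a single incoming relationship, but it comes from the well-referenced `hub`, so `tom` ends up with a higher score than `marvin`, whose many outgoing relationships earn him nothing.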

In [11]:
# Page Rank
gds.pageRank.mutate(
    G,
    mutateProperty = 'pagerank',
    maxIterations = 20,
    dampingFactor = 0.85 # Probability of following a relationship rather than jumping to a random node. This stops the algorithm from getting stuck in loops and dead ends. Value in the range [0, 1).
)

ranIterations                                                            20
didConverge                                                           False
centralityDistribution    {'min': 0.19735431671142578, 'max': 254.852539...
preProcessingMillis                                                       0
computeMillis                                                            66
postProcessingMillis                                                     11
mutateMillis                                                              0
nodePropertiesWritten                                                  9816
configuration             {'mutateProperty': 'pagerank', 'jobId': '782ed...
Name: 0, dtype: object

In [12]:
# Now let's get a sample of some influential nodes
query = """
    CALL gds.graph.nodeProperties.stream('algo-bank', 'pagerank')
    YIELD nodeId, propertyValue
    WITH gds.util.asNode(nodeId).name AS node, labels(gds.util.asNode(nodeId)) AS label, propertyValue AS pageRank
    WHERE NOT 'Genre' IN label AND node IS NOT NULL
    RETURN pageRank, node, label
    ORDER BY pageRank DESC
"""

ranked = gds.run_cypher(query)
ranked

Unnamed: 0,pageRank,node,label
0,118.955632,Darlene Garcia,[User]
1,76.385576,Karen Avila,[User]
2,74.231969,Robert Brooks,[User]
3,62.810462,Angela Garcia,[User]
4,61.202035,Angela Robertson,[User]
...,...,...,...
666,0.615954,Leah Dixon,[User]
667,0.614889,Hailey Logan,[User]
668,0.601495,Nathan Hall,[User]
669,0.596755,Christina Hardy,[User]


In [13]:
# Betweenness Centrality
gds.betweenness.mutate(
    G,
    mutateProperty = 'betweenness'
)

nodePropertiesWritten                                                  9816
preProcessingMillis                                                       0
computeMillis                                                          6504
postProcessingMillis                                                     30
mutateMillis                                                              0
centralityDistribution    {'min': 0.0, 'max': 14005631.999999998, 'p90':...
configuration             {'mutateProperty': 'betweenness', 'jobId': 'c3...
Name: 0, dtype: object

In [14]:
# Now let's get a sample of the nodes that contribute the most to information flow
query = """
    CALL gds.graph.nodeProperties.stream('algo-bank', 'betweenness')
    YIELD nodeId, propertyValue
    WITH gds.util.asNode(nodeId).name AS node, labels(gds.util.asNode(nodeId)) AS label, propertyValue AS betweenness
    WHERE NOT 'Genre' IN label AND node IS NOT NULL
    RETURN betweenness, node, label
    ORDER BY betweenness DESC
"""

betweenness = gds.run_cypher(query)
betweenness

Unnamed: 0,betweenness,node,label
0,4.218808e+06,Darlene Garcia,[User]
1,2.527380e+06,Robert Brooks,[User]
2,2.139709e+06,Karen Avila,[User]
3,1.744148e+06,Angela Garcia,[User]
4,1.717542e+06,Angela Robertson,[User]
...,...,...,...
666,4.158098e+01,Brandon Drake,[User]
667,3.710053e+01,Stephen Rogers,[User]
668,3.509471e+01,Cassidy Arnold,[User]
669,3.154393e+01,Christina Hardy,[User]


## Good to know
This bank only identifies those algorithms which are likely to be most useful for generating recommendations and checking service health. However, you can get creative with the other algorithms available in GDS.

- [**Node embeddings:**](https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/) In the previous notebook, '2_gds-client-fastrp.ipynb', we generated node embeddings in our graph using FastRP. These embeddings could be used to improve our recommendations. There are plenty more options available too.
- [**Similarity:**](https://neo4j.com/docs/graph-data-science/current/algorithms/similarity/) We also used node similarity in the previous notebook to find the top 10 most similar neighbours, based on FastRP embeddings.
- [**Path finding:**](https://neo4j.com/docs/graph-data-science/current/algorithms/pathfinding/) Path finding algorithms help you to find the shortest paths between nodes in a network. You could use [Yen's Shortest Path](https://neo4j.com/docs/graph-data-science/current/algorithms/yens/), for example, to find the movies which connect the most influential users across two communities, generating cross-community recommendations.  
