## Before you get started

If you have come here from the first notebook, '1_gds-client-leiden.ipynb', you should already be connected to an instance.

If you are following along with [the movies dataset](https://github.com/neo4j-graph-examples/recommendations/blob/main/data/recommendations-50.dump), adjust the connection details to your own instance and the rest should run correctly.

If you are using a different dataset, make sure you adjust the labels and properties as needed for your graph. 

## **FastRP**
In this notebook, you'll learn:
- What embeddings and FastRP are. [Feel free to skip](#implementing-fastrp).
- How to create FastRP embeddings from the GDS Python client.

### What are embeddings?
- Embeddings are coordinates assigned to objects in an abstract space (a 'vector space'). They are usually -- but not always -- used to measure how similar objects are to one another. For our purposes, 'embedding' and 'vector' mean the same thing.
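
To make 'coordinates' and 'similarity' concrete, here is a minimal sketch using made-up three-dimensional vectors. The names and numbers are purely illustrative and do not come from the dataset; cosine similarity is one common way to compare embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up embeddings, for illustration only
action_fan = [0.9, 0.1, 0.0]
another_action_fan = [0.8, 0.2, 0.1]
romcom_fan = [0.0, 0.2, 0.9]

sim_close = cosine_similarity(action_fan, another_action_fan)
sim_far = cosine_similarity(action_fan, romcom_fan)
print(f"similar users:    {sim_close:.3f}")
print(f"dissimilar users: {sim_far:.3f}")
```

Vectors that point in nearly the same direction score close to 1, and vectors that share little score close to 0.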

### Why would I do that?

**Sentence Embeddings:** You want to know which sentences in a corpus of texts are the most similar. If you embed the semantic meaning of the sentences, you can extract clusters of similar sentences.

**Node Embeddings:** You have a bunch of nodes (Users) in a graph database which exhibit certain behaviours. If you embed those nodes for similarity, you can cluster users by behavioural characteristics.

### Which one are we doing here? 
- Node embeddings using FastRP in Neo4j GDS.

### What is it?
FastRP:
- Gives each of your nodes a coordinate position based on the structure of its neighbourhood, so that nodes with similar neighbourhoods end up close to one another. It starts from random vectors and repeatedly blends each node's vector with those of its neighbours.
- In GDS you have the added option of using properties as weights when calculating FastRP embeddings.
- If you like, you can [read the FastRP paper here](https://doi.org/10.48550/arXiv.1908.11512).
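
If it helps to see the idea in code, the following is a deliberately simplified sketch of the 'random vectors, then blend with neighbours' loop. It is not the GDS implementation: `toy_fastrp` and `normalise` are hypothetical helpers, relationship weights and degree normalisation are left out, and the toy graph is made up.

```python
import math
import random


def normalise(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def toy_fastrp(neighbours, dimension, iteration_weights, seed=42):
    """Very rough FastRP-style embedding over an adjacency dict."""
    rng = random.Random(seed)
    # Step 1: a sparse random vector per node, entries drawn from {-1, 0, +1}
    current = {
        node: [rng.choice([-1.0, 0.0, 0.0, 0.0, 0.0, 1.0]) for _ in range(dimension)]
        for node in neighbours
    }
    embedding = {node: [0.0] * dimension for node in neighbours}
    for weight in iteration_weights:
        # Step 2: each node's new vector is the average of its neighbours' vectors
        averaged = {}
        for node, nbrs in neighbours.items():
            vec = [0.0] * dimension
            for nbr in nbrs:
                for i, value in enumerate(current[nbr]):
                    vec[i] += value / len(nbrs)
            averaged[node] = normalise(vec)
        current = averaged
        # Step 3: add this iteration's vectors to the result, scaled by its weight
        for node in neighbours:
            for i in range(dimension):
                embedding[node][i] += weight * current[node][i]
    return embedding


# Made-up undirected graph: u1 and u2 rated the same two movies, u3 rated another
graph = {
    "u1": ["m1", "m2"],
    "u2": ["m1", "m2"],
    "u3": ["m3"],
    "m1": ["u1", "u2"],
    "m2": ["u1", "u2"],
    "m3": ["u3"],
}
vectors = toy_fastrp(graph, dimension=8, iteration_weights=[0.8, 1, 1, 1])
print(vectors["u1"] == vectors["u2"])
```

Because `u1` and `u2` have exactly the same neighbours, this sketch gives them identical vectors, which is the intuition behind 'similar neighbourhoods end up close together'.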

Once every node has an embedding, you can compare them and, for example, only choose recommendations from the 10, 20 or 30 most similar movies (topK 10, 20 or 30).

The video below demonstrates what we're actually doing when we ask for topK 10 most similar nodes. 

<div style="padding:44.79% 0 0 0;position:relative;"><iframe src="https://player.vimeo.com/video/1122499506?badge=0&amp;autopause=0&amp;player_id=0&amp;app_id=58479" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" style="position:absolute;top:0;left:0;width:100%;height:100%;" title="milk_vid"></iframe></div><script src="https://player.vimeo.com/api/player.js"></script>

If the player doesn't work, feel free to watch here:
https://vimeo.com/1122454601

FastRP is not literally _moving_ nodes around. It is analysing them to see how similar they _already_ are.

**Why use it:** 
- FastRP can be a great addition to a recommendations application. For example, with our graph you could grab the top-10 (topK 10) nearest Movie nodes to each user node, filter out those already seen by the user, and then provide the remainder as recommendations.
- You may find that using Community Detection and FastRP together, you find user-specific recommendations that you would not otherwise have found. 
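
The 'top-K, then filter out what the user has seen' step can be sketched in plain Python. This is only an illustration: `recommend` is a hypothetical helper and the movie labels are placeholders, not titles from the dataset.

```python
def recommend(nearest_movies, already_seen, top_k=10):
    """Keep the top_k nearest movies, then drop any the user has already seen."""
    shortlist = nearest_movies[:top_k]
    return [title for title in shortlist if title not in already_seen]


# Placeholder data: movies ordered from most to least similar to the user
nearest_movies = ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E"]
already_seen = {"Movie B", "Movie D"}

recs = recommend(nearest_movies, already_seen, top_k=4)
print(recs)
```

Filtering after the top-K cut means a user can receive fewer than K recommendations; if you always need K, filter first and cut afterwards.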

**Consider this scenario:**
A user tends to watch high-octane action movies along with the odd horror. If we built a recommendation engine based solely on their favourite movies, we would miss other behavioural quirks. For example, every other month -- without fail -- this user also watches three romantic-comedies in sequence. 
A genre profile built from their favourites ignores the romantic-comedies. Their Leiden community with FastRP backup, however, might move them into a group of users who demonstrate similar watching behaviours. Their user group can now receive recommendations for action, horror and romantic-comedies. Those particular recommendations won't be generic either. They will be based on what other people with similar viewing behaviour actually watch.

You can likely think of several reasons for this behaviour. The beauty is, you don't need to.

**But...** In practice, dense nodes in a network can have a strong gravitational effect. So, if you based a recommendation engine on FastRP embeddings alone, you would likely end up with one super node dominating all recommendations. As with Leiden, this is not necessarily a bad thing -- it just depends on your intent.

**Example:** Let's say you are a massive movie platform, and you want X-number of users to see Y-movie for Z-business-reason. Your intent then is not 'recommend movies people will like'. It is 'recommend movies we need users to watch to users who may be most partial'. If that is the case, you could run FastRP on all nodes, and then find the users who are most similar to that movie. If you need 1000 users to watch it to hit a KPI, just increase topK to find the users who are likely to be more partial to it. 

**Contrast:** Let's say you are a boutique arthouse platform with a large, diverse offering of lesser-known titles. You want to ensure that users can easily find titles they are most likely to enjoy. In this case, you might:
- Run Community Detection (Leiden or Louvain) to identify distinct clusters of users.
- Run FastRP
- Filter allowable recommendations by ensuring that the users' communityIds match the movies' communityIds. In this way, you can go super granular, and get the best movie within a cluster of people, rather than for the entire set.
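
As a rough sketch of that last filtering step, assuming every user and movie already carries a community id (the `community_of` mapping, the ids and the labels below are all made up):

```python
def same_community(user, candidates, community_of):
    """Keep only candidate movies whose community id matches the user's."""
    target = community_of[user]
    return [movie for movie in candidates if community_of.get(movie) == target]


# Made-up community ids, as a community detection run might assign them
community_of = {"user_1": 7, "Movie A": 7, "Movie B": 3, "Movie C": 7}
candidates = ["Movie A", "Movie B", "Movie C"]  # e.g. nearest movies by embedding

filtered = same_community("user_1", candidates, community_of)
print(filtered)
```

In a real graph you would more likely express the same condition in Cypher, but the logic is the same: similarity proposes candidates, and community membership narrows them.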

Bear in mind, these are just some ideas. You could take an entirely different approach. It really is up to you.

### Set up the GDS Python Client and connect to an instance

1. Install the GDS Python client (`graphdatascience`) and pandas. 

***Note:*** _You do not need pandas to run the GDS client. We're including it here to demonstrate how you can query with GDS and use the outputs in a familiar format._

In [None]:
%pip install graphdatascience pandas

2. Connect to your instance. 

    This assumes you already have an instance running.

    Replace my details with your details and hit run.

In [None]:
from graphdatascience import GraphDataScience
from graphdatascience import ServerVersion

# Use Neo4j URI and credentials according to your setup
# NEO4J_URI could look similar to "bolt://my-server.neo4j.io:7687"

NEO4J_URI = "your_uri"                # If on a local instance, something like "neo4j://127.0.0.1:7687"
NEO4J_USER = "your_username"          # Probably "neo4j"
NEO4J_PASSWORD = "your_password" 

gds = GraphDataScience(
  NEO4J_URI,
  auth=(
    NEO4J_USER,
    NEO4J_PASSWORD
  ),
  database = "database_name"          # If using the movies dump, it is likely "recommendations-embeddings-50"
)

# Check the installed GDS version on the server
print(f"All systems are go. GDS Version: {gds.server_version()}")
assert gds.server_version() >= ServerVersion(1, 8, 0)

All systems are go. GDS Version: 2.21.0


#### Troubleshooting
Make sure:
1. You have the URI from the correct instance
2. Your password is correct
3. Your username is correct
4. Your database name is correct -- this will be different from your project name
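
For the first item, a quick offline sanity check of the URI can catch typos before you try to connect. This is a small sketch using only the standard library; `check_uri` is a hypothetical helper, and it only inspects the scheme and host, so it cannot tell you whether the instance is actually reachable.

```python
from urllib.parse import urlparse

# URI schemes accepted by the Neo4j drivers
VALID_SCHEMES = {"neo4j", "neo4j+s", "neo4j+ssc", "bolt", "bolt+s", "bolt+ssc"}


def check_uri(uri):
    """Return a list of problems found in a connection URI (empty if none)."""
    parsed = urlparse(uri)
    problems = []
    if parsed.scheme not in VALID_SCHEMES:
        problems.append(f"unexpected scheme '{parsed.scheme}'")
    if not parsed.hostname:
        problems.append("missing host")
    return problems


print(check_uri("neo4j://127.0.0.1:7687"))  # local instance style URI
print(check_uri("your_uri"))                # the unedited placeholder
```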

### Helpers
The following three cells will:
- Drop the graph from memory
- Remove any properties we created
- Remove any relationships we created

Only use them if you want to start again.

In [27]:
# To drop the graph from memory:
gds.graph.drop('fastrp')

graphName                                                           fastrp
database                                     recommendations-embeddings-50
databaseLocation                                                     local
memoryUsage                                                               
sizeInBytes                                                             -1
nodeCount                                                             9816
relationshipCount                                                   200008
configuration            {'relationshipProjection': {'RATED': {'aggrega...
density                                                           0.002076
creationTime                           2025-09-27T19:18:51.977746000+01:00
modificationTime                       2025-09-27T19:19:00.101949000+01:00
schema                   {'graphProperties': {}, 'nodes': {'User': {'em...
schemaWithOrientation    {'graphProperties': {}, 'nodes': {'User': {'em...
Name: 0, dtype: object
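
If you are not sure whether the projection exists yet, you can guard the drop. The sketch below wraps `gds.graph.exists` and `gds.graph.drop` in a hypothetical helper, `drop_if_exists`; it takes the client as an argument, so it assumes you pass in the `gds` object created earlier.

```python
def drop_if_exists(gds, graph_name):
    """Drop a projection only when it is present; return True if one was dropped."""
    # gds.graph.exists returns a result with an 'exists' field for the given name
    if gds.graph.exists(graph_name)["exists"]:
        gds.graph.drop(graph_name)
        return True
    return False


# Usage, once connected:
# drop_if_exists(gds, "fastrp")
```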

In [28]:
# To remove embeddings written to the main graph
gds.run_cypher(
"""
    MATCH (n)
    WHERE n.embedding IS NOT NULL
    REMOVE n.embedding
"""
)

In [29]:
# To remove relationships written to the main graph
gds.run_cypher(
"""
    MATCH (n)-[r:SIMILAR]->(k)
    DELETE r
"""
)

### Project your nodes and relationships

Projecting graphs from the GDS Client is much simpler than in the Browser. If you don't know what 'Project' means yet, don't worry. Let's just do it first, and then analyse what happened.

1. Assign a variable to the nodes you want to analyse.
2. Assign a variable to the relationships you want to analyse. Make sure you use the correct directionality for these relationships. 
    FastRP, the algorithm we're about to run, can run on directed relationships, but we project them as UNDIRECTED here. Other algorithms are stricter: Leiden, for example, _cannot_ run on DIRECTED relationships. You can [check the requirements for each algorithm in the docs.](https://neo4j.com/docs/graph-data-science/current/algorithms/leiden/)
3. Project

It's generally good practice to check how much memory a graph will actually use first, and then project a graph -- so let's do that.

### Set up for FastRP 

With FastRP in GDS, you can also use a numeric relationship property as a weight, so that stronger relationships have more influence on the embedding. 

The projection we created in the gds-client-leiden notebook did not include any properties.

So, we're going to spin up a new projection with the rating property included.

In [30]:
# We define how we want to project our database into GDS
node_projection = ['User', 'Movie', 'Genre']
relationship_projection = {'RATED': {'orientation': 'UNDIRECTED', 'properties': 'rating'}}

# Before actually going through with the projection, let's check how much memory is required
result = gds.graph.project.estimate(node_projection, relationship_projection)

print(f"Required memory for native loading: {result['requiredMemory']}")

Required memory for native loading: [5573 KiB ... 5765 KiB]
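
If you want to act on the estimate rather than just print it, you could compare it with a memory budget. The sketch below is an assumption-laden illustration: it assumes the estimate result exposes a numeric `bytesMax` field alongside `requiredMemory`, `fits_in_budget` is a hypothetical helper, and the 1 GiB budget is arbitrary. A plain dict stands in for the real result so the example runs without a connection.

```python
def fits_in_budget(estimate, budget_bytes):
    """Return True when the upper memory estimate fits within budget_bytes."""
    # Assumption: the estimate result carries a numeric 'bytesMax' field
    return int(estimate["bytesMax"]) <= budget_bytes


# Stand-in for the result of gds.graph.project.estimate(...);
# the numbers mirror the output above (5765 KiB upper bound)
example_estimate = {
    "requiredMemory": "[5573 KiB ... 5765 KiB]",
    "bytesMax": 5765 * 1024,
}

budget = 1 * 1024 ** 3  # hypothetical 1 GiB budget
ok = fits_in_budget(example_estimate, budget)
print("Safe to project" if ok else "Estimate exceeds budget")
```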


In [31]:
# FastRP can run on directed relationships. However, it performs better on UNDIRECTED, so we'll stick with that.
G, result = gds.graph.project('fastrp', node_projection, relationship_projection)

print(f"The projection took {result['projectMillis']} ms")

# We can use convenience methods on `G` to check if the projection looks correct
print(f"Graph '{G.name()}' node count: {G.node_count()}")
print(f"Graph '{G.name()}' node labels: {G.node_labels()}")

The projection took 31 ms
Graph 'fastrp' node count: 9816
Graph 'fastrp' node labels: ['User', 'Movie', 'Genre']


## What you just did
You just 'projected' a graph. 'Project' is an industry jargon term that means you:
1. Pulled a bunch of nodes and/or relationships and properties out of the main graph
2. Reconstructed a new graph in memory out of those

Now, when you call 'gds.' and reference 'fastrp' (or the `G` object), the operations you run will run on the 'fastrp' projection -- not the main graph.

## Why this is awesome
1. You can spin up as many projections as you like, as often as you like and drop them when you're done.
2. Your actions on the projection will have no effect on the main graph (unless you 'write' them).
3. You can run more complex algorithms on subsets of nodes and relationships, rather than trying to analyse a subset within the entire database.

In short, you are _almost_ literally 'projecting' a _new_ sub-graph image into memory.

![Graph projection vis](../docs/projection.png)
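
The same idea can be mimicked with plain dictionaries. This toy sketch is not how GDS stores graphs; `project` is a hypothetical function and the data is made up. It only illustrates that a projection is a filtered copy, so changing it leaves the source untouched.

```python
# A made-up 'database': node id -> label, plus typed relationships
database = {
    "nodes": {"u1": "User", "u2": "User", "m1": "Movie", "p1": "Person"},
    "relationships": [
        ("u1", "RATED", "m1"),
        ("u2", "RATED", "m1"),
        ("p1", "ACTED_IN", "m1"),
    ],
}


def project(db, labels, rel_types):
    """Copy only the requested labels and relationship types into a new structure."""
    nodes = {n: {"label": label} for n, label in db["nodes"].items() if label in labels}
    rels = [
        (start, rel_type, end)
        for start, rel_type, end in db["relationships"]
        if rel_type in rel_types and start in nodes and end in nodes
    ]
    return {"nodes": nodes, "relationships": rels}


projection = project(database, {"User", "Movie"}, {"RATED"})

# 'Mutating' the projection adds a property there, and only there
projection["nodes"]["u1"]["embedding"] = [0.1, 0.2]

print(sorted(projection["nodes"]))
print(database["nodes"]["u1"])
```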

### Implementing FastRP
In this section you will:

1. Create FastRP embeddings on the dataset. Click here to [go back to the FastRP explanation](#fastrp).
2. Get stats before you commit them to the projection
3. Mutate the projection with FastRP embeddings
4. Write them back to the nodes in the database
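
The cells that follow use the `mutate` mode. For the `stats` and `write` steps in the list, one possible shape is sketched here. `gds.fastRP.stats` and `gds.fastRP.write` are the client's other execution modes for the same algorithm; the wrapper `fastrp_stats_then_write` and the shared `FASTRP_CONFIG` dict are hypothetical conveniences, and the function expects the `gds` client and `G` graph object from earlier cells.

```python
# Shared settings, matching the values used elsewhere in this notebook
FASTRP_CONFIG = {
    "embeddingDimension": 64,
    "relationshipWeightProperty": "rating",
    "iterationWeights": [0.8, 1, 1, 1],
    "randomSeed": 42,
}


def fastrp_stats_then_write(gds, G, write_property="embedding"):
    """Run FastRP in stats mode (no side effects), then write embeddings to the database."""
    stats = gds.fastRP.stats(G, **FASTRP_CONFIG)
    written = gds.fastRP.write(G, writeProperty=write_property, **FASTRP_CONFIG)
    return stats, written


# Usage, once connected and projected:
# stats, written = fastrp_stats_then_write(gds, G)
```

Note that `write` recomputes the embeddings rather than reusing a mutated property, so keep the same `randomSeed` if you want the written values to match.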

In [32]:
# We can also estimate memory of running algorithms like FastRP, so let's do that first
result = gds.fastRP.mutate.estimate(
    G,
    mutateProperty="embedding",
    randomSeed=42,
    embeddingDimension=64,
    relationshipWeightProperty="rating",
    iterationWeights=[0.8, 1, 1, 1],
)

print(f"Required memory for running FastRP: {result['requiredMemory']}")

Required memory for running FastRP: 8282 KiB


In [33]:
# Now let's run FastRP and mutate our projected graph 'fastrp' with the results
result = gds.fastRP.mutate(
    G,
    mutateProperty = 'embedding',               # The node property added to the projection (not yet to the database)
    embeddingDimension = 64,                    # The length of each embedding vector -- higher can capture more detail but also means more data
    relationshipWeightProperty = 'rating',      # The weight used to provide additional context. All relationships you use with FastRP must have this or it will not work.
    iterationWeights = [0.8, 1, 1, 1],          # One weight per iteration: the first (0.8) scales the influence of the closest nodes, the later ones (1, 1, 1) that of nodes further away.
    randomSeed = 42                             # The seed for the random initial vectors, for reproducibility
)

# Let's make sure we got an embedding for each node
print(f"Number of embedding vectors produced: {result['nodePropertiesWritten']}")

Number of embedding vectors produced: 9816


In [34]:
# Now we're going to run kNN on the embeddings and write relationships back to the database, identifying the most similar nodes.
result = gds.knn.write(
    G,
    nodeLabels = ['*'],                         # Specify the node labels to be included. '*' means all.
    relationshipTypes = ['*'],                  # Specify the relationship types to be included. '*' means all.
    topK = 2,                                   # The number of 'closest to node' relationships you want to add per node.
    concurrency = 1,                            # The number of concurrent threads used to run the algorithm
    nodeProperties = ['embedding'],             # The node property used to compute similarity.
    similarityCutoff = 0.5,                     # How similar two nodes must be to receive a relationship.
    writeRelationshipType = 'SIMILAR',          # The relationship type to write.
    writeProperty = 'score'                     # The property on that relationship that stores the similarity score.
)
print(f"Relationships produced: {result['relationshipsWritten']}")
print(f"Nodes compared: {result['nodesCompared']}")
print(f"Mean similarity: {result['similarityDistribution']['mean']}")

Relationships produced: 19474
Nodes compared: 9816
Mean similarity: 0.9586150390822453
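
With topK = 2 and 9816 nodes, the upper bound is 2 × 9816 = 19632 relationships, so the 19474 reported above is slightly under it; a similarity cutoff is one reason a node can end up with fewer than topK neighbours. The brute-force sketch below shows how topK and a cutoff interact on made-up two-dimensional vectors. It is an illustration only: GDS kNN is an approximate algorithm and does not compare every pair like this, and `knn_pairs` is a hypothetical helper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn_pairs(embeddings, top_k, similarity_cutoff):
    """For each node, keep up to top_k most similar nodes that reach the cutoff."""
    pairs = []
    for source, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), target)
            for target, other_vec in embeddings.items()
            if target != source
        ]
        scored = [(score, target) for score, target in scored if score >= similarity_cutoff]
        scored.sort(reverse=True)  # most similar first
        pairs.extend((source, target, score) for score, target in scored[:top_k])
    return pairs


# Made-up embeddings: a, b and c point the same way, d is the odd one out
embeddings = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.8, 0.3],
    "d": [0.0, 1.0],
}

pairs = knn_pairs(embeddings, top_k=2, similarity_cutoff=0.5)
print(f"{len(pairs)} relationships, upper bound {2 * len(embeddings)}")
```

Here `d` falls below the cutoff against every other node, so it receives no outgoing relationships and the total stays under the upper bound.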


Now, we can use our new relationships to find similar pairs of...

Movies:

In [None]:
gds.run_cypher(
    """
        MATCH (n1:Movie)-[r:SIMILAR]->(n2:Movie)
        RETURN n1.title AS m1, n2.title AS m2, r.score AS similarity
        ORDER BY similarity DESCENDING, n1, n2
    """
)

Unnamed: 0,m1,m2,similarity
0,Lamerica,Dangerous Ground,1.000000
1,Last Summer in the Hamptons,"Day the Sun Turned Cold, The (Tianguo niezi)",1.000000
2,Shopping,Starship Troopers 3: Marauder,1.000000
3,Shopping,Dear White People,1.000000
4,Catwalk,Meatballs Part II,1.000000
...,...,...,...
17705,And the Ship Sails On (E la nave va),American Splendor,0.737737
17706,Mei and the Kittenbus,Equilibrium,0.728011
17707,Why Do Fools Fall In Love?,Necessary Roughness,0.726736
17708,I'll Be Home For Christmas,Rounders,0.719668


Now let's check which users are most similar.

In [36]:
gds.run_cypher(
    """
        MATCH (n1:User)-[r:SIMILAR]->(n2:User)
        RETURN n1.name AS n1, n2.name AS n2, r.score AS similarity
        ORDER BY similarity DESCENDING, n1, n2
    """
)

Unnamed: 0,n1,n2,similarity
0,Aaron Nelson,Katie Collins,0.993487
1,Katie Collins,Aaron Nelson,0.993487
2,Aaron Nelson,Christopher Sanders,0.991559
3,Christopher Sanders,Aaron Nelson,0.991559
4,Heather Morrison,Aaron Nelson,0.990221
...,...,...,...
783,Brian Townsend,John Herrera,0.861167
784,Terry Holder,Donna Ayala,0.852205
785,Andre Stark,Ashley Lloyd,0.851870
786,Terry Holder,Tracey Irwin,0.851741


We can likely assume that our 0.99 similarity pairs will have few differences in watching habits. 

But, what about our targets in the 0.8 - 0.9 range?

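Let's compare Nicholas Burke and Haley Cummings. Note that the query below follows `SIMILAR` relationships (not `RATED`), so it returns the movies most similar to Nicholas Burke that are not among the movies most similar to Haley Cummings. The `watched` column therefore holds Haley's most similar movies rather than her viewing history.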

In [37]:
gds.run_cypher(
    """
        MATCH (u1:User {name: "Haley Cummings"})-[:SIMILAR]->(m:Movie)
        WITH collect(m) as movies, u1
        MATCH (u2:User {name: "Nicholas Burke"})-[:SIMILAR]->(m2:Movie)
        WHERE NOT m2 IN movies
        WITH u1, movies, collect(m2.title) AS recs
        RETURN elementId(u1) AS user_id, u1.name AS name, movies AS watched, recs
    """
)

Unnamed: 0,user_id,name,watched,recs
0,4:8488d011-1dc6-48b6-8afb-0a66841ceb5d:9370,Haley Cummings,"[(languages, year, imdbId, runtime, imdbRating...","[Renaissance, Requiem for a Dream]"


You can use this in tandem with Community Detection algorithms to get granular recommendations for similar groups of people. 

For a more complete list of algorithms you can use, check out the third notebook: 3_gds-client-algorithm-bank.ipynb