# Running the Random Projection Embedding

In this notebook we're going to generate graph embeddings using the Random Projection algorithm. We'll then explore those embeddings using Python data science tools.

Let's start by importing some libraries:

In [1]:
from neo4j import GraphDatabase
from sklearn.manifold import TSNE

import numpy as np
import altair as alt
import pandas as pd
import os

Once we've done that we can initialise the Neo4j driver. 

In [2]:
bolt_url = os.getenv("NEO4J_BOLT_URL", "bolt://localhost")
user = os.getenv("NEO4J_USER", "neo4j")
password = os.getenv("NEO4J_PASSWORD", "neo")
# Connect using the URL read from the environment rather than a hard-coded host
driver = GraphDatabase.driver(bolt_url, auth=(user, password))

We should have already imported the dataset. We can run the following query to check that the data has been imported:

In [3]:
result = {"label": [], "count": []}
with driver.session(database="neo4j") as session:
    for row in session.run("CALL db.labels()"):
        label = row["label"]
        query = f"MATCH (:`{label}`) RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["label"].append(label)
        result["count"].append(count)
nodes_df = pd.DataFrame(data=result)
# sort_values returns a new data frame, so keep the sorted result
nodes_df = nodes_df.sort_values("count")

result = {"relType": [], "count": []}
with driver.session(database="neo4j") as session:
    for row in session.run("CALL db.relationshipTypes()"):
        relationship_type = row["relationshipType"]
        query = f"MATCH ()-[:`{relationship_type}`]->() RETURN count(*) as count"
        count = session.run(query).single()["count"]
        result["relType"].append(relationship_type)
        result["count"].append(count)
rels_df = pd.DataFrame(data=result)
# sort_values returns a new data frame, so keep the sorted result
rels_df = rels_df.sort_values("count")

display(nodes_df)
display(rels_df)

| | label | count |
|---|---|---|
| 0 | Place | 894 |


| | relType | count |
|---|---|---|
| 0 | EROAD | 1250 |


We should have 894 `Place` nodes and 1,250 `EROAD` relationships.

Now let's generate some embeddings. We're going to run the streaming version of the Random Projection algorithm, which returns the embeddings without storing them. We need to define the following config:

* `nodeProjection` - the node labels to use for our projected graph
* `relationshipProjection` - the relationship types to use for our projected graph
* `embeddingSize` - the size of the vector/list of numbers to create for each node
* `maxIterations` - the number of iterations to run

Let's give it a try:

In [4]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    CALL gds.alpha.randomProjection.stream({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       maxIterations: 1
    })
    YIELD nodeId, embedding
    RETURN gds.util.asNode(nodeId).name AS place, embedding
    LIMIT 10
    """)
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

| | embedding | place |
|---|---|---|
| 0 | [0.0, 0.3651483654975891, 0.0, 0.1825741827487... | Larne |
| 1 | [0.0, 0.09128709137439728, 0.09128709137439728... | Belfast |
| 2 | [-0.13693062961101532, 0.13693062961101532, 0.... | Dublin |
| 3 | [0.13693062961101532, 0.27386125922203064, -0.... | Wexford |
| 4 | [-0.27386125922203064, 0.13693062961101532, 0.... | Rosslare |
| 5 | [0.0, 0.18257418274879456, -0.1825741827487945... | La Coruña |
| 6 | [0.5477225184440613, 0.5477225184440613, 0.0, ... | Pontevedra |
| 7 | [0.27386125922203064, 0.27386125922203064, 0.0... | Valença do Minho |
| 8 | [0.3651483654975891, 0.18257418274879456, 0.0,... | Porto |
| 9 | [0.27386125922203064, 0.13693062961101532, 0.1... | Aveiro |
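
To build some intuition for where vectors like these come from, here is a minimal NumPy sketch of the general random projection idea: start every node with a sparse random vector, then repeatedly replace each node's vector with the average of its neighbours' vectors. This is a simplified illustration under stated assumptions, not the GDS implementation. The function name `random_projection_sketch`, the sparsity probabilities, and the toy path graph are all made up for this example.

```python
import numpy as np

def random_projection_sketch(adjacency, embedding_size, max_iterations, seed=0):
    """Toy random-projection embedding (illustrative only, not the GDS code)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    # Sparse random start: each entry is +sqrt(3), 0 or -sqrt(3) with
    # probabilities 1/6, 2/3, 1/6 (an assumed sparsity choice).
    values = np.array([np.sqrt(3.0), 0.0, -np.sqrt(3.0)])
    embeddings = rng.choice(values, size=(n, embedding_size), p=[1/6, 2/3, 1/6])
    # Row-normalise the adjacency matrix so each step averages over neighbours.
    # Isolated nodes (degree 0) end up with an all-zero vector.
    degrees = adjacency.sum(axis=1, keepdims=True)
    averaging = adjacency / np.maximum(degrees, 1)
    for _ in range(max_iterations):
        embeddings = averaging @ embeddings
    return embeddings

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3 (undirected).
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

toy_embeddings = random_projection_sketch(adjacency, embedding_size=10, max_iterations=1)
print(toy_embeddings.shape)
```

In this sketch, more iterations let information from more distant neighbours flow into each node's vector, which is the role the `maxIterations` setting plays in the toy version.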


So far everything looks good. Let's now store the embeddings in Neo4j by using the write version of the algorithm:

In [6]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    CALL gds.alpha.randomProjection.write({
       nodeProjection: "Place",
       relationshipProjection: {
         eroad: {
           type: "EROAD",
           orientation: "UNDIRECTED"
        }
       },
       embeddingSize: 10,
       maxIterations: 1,
       writeProperty: $embeddingProperty
    })
    """, {"embeddingProperty": "embeddingRandomProjection"})
    
    embeddings_df = pd.DataFrame([dict(record) for record in result])
embeddings_df    

| | computeMillis | configuration | createMillis | nodeCount | nodePropertiesWritten | writeMillis |
|---|---|---|---|---|---|---|
| 0 | 55 | {'maxIterations': 1, 'writeConcurrency': 4, 'n... | 12 | 894 | 894 | 462 |


The embeddings will be stored in the `embeddingRandomProjection` property on each node. Let's now build a data frame that contains each place and its embedding so that we can explore them further.

In [9]:
with driver.session(database="neo4j") as session:
    result = session.run("""
    MATCH (p:Place)
    RETURN p.name AS place, p.embeddingRandomProjection AS embedding, p.countryCode AS country
    """)
    X = pd.DataFrame([dict(record) for record in result])
X.head()

| | country | embedding | place |
|---|---|---|---|
| 0 | GB | [-0.18257418274879456, -0.3651483654975891, 0.... | Larne |
| 1 | GB | [0.09128709137439728, -0.18257418274879456, -0... | Belfast |
| 2 | IRL | [-0.13693062961101532, -0.27386125922203064, 0... | Dublin |
| 3 | IRL | [0.0, -0.27386125922203064, 0.2738612592220306... | Wexford |
| 4 | IRL | [0.13693062961101532, -0.13693062961101532, -0... | Rosslare |
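
One simple way to explore embeddings like these is to compare pairs of vectors with cosine similarity. The sketch below is one possible helper, not part of the original notebook. The `cosine_similarity` function and the three toy vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    # An all-zero vector has no direction; treat its similarity as 0.
    return float(a @ b / norm_product) if norm_product > 0 else 0.0

# Made-up 3-dimensional vectors standing in for real embeddings.
toy = {
    "A": [1.0, 0.0, 0.0],
    "B": [2.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0],
}
print(cosine_similarity(toy["A"], toy["B"]))  # same direction
print(cosine_similarity(toy["A"], toy["C"]))  # orthogonal
```

With the data frame built above, something like `cosine_similarity(X.embedding[0], X.embedding[1])` would compare the first two places.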


## Visualizing Random Projection embeddings

We can visualize our embeddings with the help of the [t-SNE algorithm](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). 

t-SNE is a dimensionality reduction technique that projects high-dimensional data down to 2 or 3 dimensions so that it can be visualized more easily. We're going to use it to create a scatterplot of our embeddings.

The following code snippet applies t-SNE to the embeddings and then creates a data frame containing each place, its country, as well as x and y coordinates.

In [11]:
X_embedded = TSNE(n_components=2, random_state=6).fit_transform(list(X.embedding))

places = X.place
df = pd.DataFrame(data = {
    "place": places,
    "country": X.country,
    "x": [value[0] for value in X_embedded],
    "y": [value[1] for value in X_embedded]
})
df.head()

| | place | country | x | y |
|---|---|---|---|---|
| 0 | Larne | GB | 1.398636 | -13.243345 |
| 1 | Belfast | GB | 3.544013 | 6.745259 |
| 2 | Dublin | IRL | 1.750356 | -12.613811 |
| 3 | Wexford | IRL | 15.871904 | -6.830082 |
| 4 | Rosslare | IRL | -20.641287 | -15.414942 |
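
t-SNE layouts depend on the random seed and on settings such as perplexity, so it can help to cross-check them against a deterministic linear projection. The sketch below swaps in a different technique, PCA computed with NumPy's SVD, purely as a sanity check. The `pca_2d` helper and the random toy vectors are assumptions made for this example.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    data = np.asarray(vectors, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in data: 20 vectors with 10 dimensions each.
rng = np.random.default_rng(6)
toy_vectors = rng.normal(size=(20, 10))
coords = pca_2d(toy_vectors)
print(coords.shape)
```

Calling `pca_2d(list(X.embedding))` on the notebook's data frame would give an alternative set of x and y coordinates to plot alongside the t-SNE ones.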


We can then use the Altair visualization library to create a scatterplot of these coordinates:

In [12]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    tooltip=['place', 'country']
).properties(width=700, height=400, title="Random Projection Embeddings")
chart.save('randomProjection.json')
chart

There don't seem to be any clusters of points in our visualization. It's also hard to tell what each point represents without hovering over it. We can color each point by its `country` property with the following code:

In [13]:
chart = alt.Chart(df).mark_circle(size=60).encode(
    x='x',
    y='y',
    color='country',
    tooltip=['place', 'country']
).properties(width=700, height=400, title="Random Projection Embeddings")

chart.save('randomProjection-color.json')

chart
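
Judging clusters by eye is subjective, so a rough numeric check can complement the chart. The sketch below compares the mean distance between points from the same country with the mean distance between points from different countries. The `mean_distances_by_group` helper and the toy points are made up for illustration, and distances in a t-SNE layout are only loosely meaningful, so treat the result as a hint rather than a measurement.

```python
import numpy as np

def mean_distances_by_group(points, groups):
    """Return (mean within-group distance, mean between-group distance).

    Assumes there is at least one same-group pair and one mixed pair.
    """
    points = np.asarray(points, dtype=float)
    groups = np.asarray(groups)
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    same_group = groups[:, None] == groups[None, :]
    # Upper triangle without the diagonal, so each pair is counted once.
    upper = np.triu(np.ones((len(points), len(points)), dtype=bool), k=1)
    within = distances[same_group & upper].mean()
    between = distances[~same_group & upper].mean()
    return within, between

# Toy layout: two tight pairs that sit far apart from each other.
toy_points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
toy_countries = ["GB", "GB", "IRL", "IRL"]
within, between = mean_distances_by_group(toy_points, toy_countries)
print(within, between)
```

On the notebook's data this could be called as `mean_distances_by_group(df[["x", "y"]].to_numpy(), df["country"].to_numpy())`. A within-country mean that is much smaller than the between-country mean would suggest that places group by country.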