<p>
  <a href="https://colab.research.google.com/github/neo4j-partners/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%204%20-%20Graph%20Data%20Science/embedding.ipynb" target="_blank">
    <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
  </a>
</p>

First, install the packages this notebook needs.

In [1]:
!pip install --quiet --upgrade neo4j
!pip install --quiet google-cloud-storage



Enter the credentials for your Neo4j instance below.

DB_NAME is neo4j by default.

In [2]:
DB_URL = "neo4j+s://638c0d30.databases.neo4j.io"
DB_USER = "neo4j"
DB_PASS = "LCrbaVYV4tPrlFiO6mkg6IlhLqNTLVHEUSm0RMqACwQ"
DB_NAME = "neo4j"
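
Hardcoding a password in a shared notebook makes it easy to leak. As a hedged alternative, the sketch below reads the same four settings from environment variables. The function name `load_credentials` and the `NEO4J_*` variable names are assumptions made for illustration; the lab itself does not define them.

```python
import os


def load_credentials(env=os.environ):
    """Read Neo4j connection settings from a mapping of environment variables.

    The NEO4J_* names are hypothetical; rename them to match your setup.
    """
    return {
        "DB_URL": env["NEO4J_URL"],                     # required, KeyError if missing
        "DB_USER": env.get("NEO4J_USER", "neo4j"),      # falls back to the default user
        "DB_PASS": env["NEO4J_PASSWORD"],               # required, KeyError if missing
        "DB_NAME": env.get("NEO4J_DATABASE", "neo4j"),  # falls back to the default database
    }
```

With the variables set, `creds = load_credentials()` returns a dict whose values can be assigned to DB_URL, DB_USER, DB_PASS and DB_NAME.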

In [3]:
import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASS))
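
The cells that follow all repeat the same session and transaction boilerplate. A minimal sketch of a helper that wraps it is shown here. The name `run_cypher` is hypothetical, and it uses the same `session.read_transaction` and `tx.run(...).data()` calls as the rest of this notebook.

```python
def run_cypher(driver, query, database="neo4j", **params):
    """Run a query in a read transaction and return its rows as a list of dicts.

    `driver` is expected to be a neo4j driver such as the one created above;
    keyword arguments are passed to the query as Cypher parameters.
    """
    with driver.session(database=database) as session:
        # .data() is called inside the transaction function, so the rows are
        # materialised before the transaction is closed.
        return session.read_transaction(
            lambda tx: tx.run(query, **params).data()
        )
```

For example, `pd.DataFrame(run_cypher(driver, "CALL gds.graph.list() YIELD graphName RETURN graphName", DB_NAME))` would list the graphs currently projected in memory.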

First, we'll create an in-memory graph representation of the data using Neo4j Graph Data Science (GDS).

In [6]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
CALL gds.graph.project(
  'mygraph',
  {
    Company: {},
    Manager: {}
  },
  {
    Owns: {
      properties: {
        shares: {property:'shares', defaultValue:0},
        value: {property:'value', defaultValue: 0}
      }
    }
   }
)
YIELD
  graphName AS graph,
  relationshipProjection AS readProjection,
  nodeCount AS nodes,
  relationshipCount AS rels
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Unnamed: 0,nodeProjection,relationshipProjection,graphName,nodeCount,relationshipCount,projectMillis,createMillis
0,"{'__ALL__': {'label': '*', 'properties': {}}}","{'__ALL__': {'orientation': 'NATURAL', 'aggreg...",mygraph,10669,173681,20,20


Note: if you get an error saying the graph already exists, you have probably run this cell before. You can drop the existing graph with this command:

In [6]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
      CALL gds.graph.drop('mygraph')
      """
    ).data()
  )

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 6))



ClientError: ignored
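
If a ClientError like the one above comes from the graph not existing yet, the drop can be made tolerant of that case. `gds.graph.drop` accepts a second argument, `failIfMissing`, and returns an empty result instead of raising when it is set to false. The sketch below assumes that is the cause of the error; the names `DROP_IF_EXISTS` and `drop_graph_if_exists` are hypothetical.

```python
# failIfMissing = false: no error if the graph is absent, just an empty result.
DROP_IF_EXISTS = "CALL gds.graph.drop($name, false) YIELD graphName RETURN graphName"


def drop_graph_if_exists(session, name):
    """Drop a projected graph by name; return True if a graph was dropped."""
    rows = session.read_transaction(
        lambda tx: tx.run(DROP_IF_EXISTS, name=name).data()
    )
    return len(rows) > 0
```

Inside a `with driver.session(database=DB_NAME) as session:` block, `drop_graph_if_exists(session, "mygraph")` can then be run before the projection cell without worrying about whether the graph is there.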

Now we can generate an embedding from that graph. The embedding is a new feature we can use in our predictions. We're using FastRP (Fast Random Projection), which GDS offers as a faster, more fully featured alternative to Node2Vec. You can learn more about it [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).
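
To give a feel for what the FastRP parameters in the next cell control, here is a toy, pure-Python sketch of the general idea: start from sparse random vectors, repeatedly average the neighbours' vectors, and sum the L2-normalised intermediate results using the iteration weights. It is a simplified illustration under stated assumptions, not the GDS implementation: it ignores relationship weights and degree normalisation, and the function name `fastrp_sketch` and the default sparsity of 3 are choices made for this example.

```python
import math
import random


def fastrp_sketch(adjacency, dim, iteration_weights, self_influence=0.0,
                  seed=1, sparsity=3):
    """Toy FastRP-style embedding.

    adjacency maps every node to a list of neighbour nodes; each neighbour
    must also appear as a key. Returns {node: [float] * dim}.
    """
    rng = random.Random(seed)
    scale = math.sqrt(sparsity)

    def random_vector():
        # Very sparse random projection: +sqrt(s) or -sqrt(s), each with
        # probability 1/(2s); otherwise 0.
        vec = []
        for _ in range(dim):
            r = rng.random()
            if r < 1 / (2 * sparsity):
                vec.append(scale)
            elif r < 1 / sparsity:
                vec.append(-scale)
            else:
                vec.append(0.0)
        return vec

    def normalize(vec):
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm > 0 else list(vec)

    nodes = sorted(adjacency)
    current = {n: random_vector() for n in nodes}
    # The node's own random vector contributes with weight self_influence.
    result = {n: [self_influence * v for v in normalize(current[n])] for n in nodes}

    for weight in iteration_weights:
        nxt = {}
        for n in nodes:
            neighbours = adjacency[n]
            if neighbours:
                avg = [sum(current[m][i] for m in neighbours) / len(neighbours)
                       for i in range(dim)]
            else:
                avg = [0.0] * dim
            nxt[n] = normalize(avg)
        # Add this iteration's intermediate embedding, scaled by its weight.
        for n in nodes:
            result[n] = [r + weight * v for r, v in zip(result[n], nxt[n])]
        current = nxt
    return result
```

In this sketch `dim`, `iteration_weights`, `self_influence` and `seed` play the roles of `embeddingDimension`, `iterationWeights`, `nodeSelfInfluence` and `randomSeed` in the GDS call below. For example, `fastrp_sketch({"a": ["b"], "b": ["a"]}, dim=5, iteration_weights=[0.0, 1.0], self_influence=0.15)` returns one 5-element vector per node.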

In [7]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.fastRP.mutate('mygraph',{
        iterationWeights: [0.0, 1.00, 1.00, 0.80, 0.60],
        nodeSelfInfluence: 0.15,
        embeddingDimension: 5,
        randomSeed: 1, 
        mutateProperty:'embedding'
        })
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,preProcessingMillis,computeMillis,configuration
0,10669,0,10669,0,13,"{'nodeSelfInfluence': 0.15, 'relationshipWeigh..."


In [8]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.graph.streamNodeProperties
        ('mygraph', ['embedding'])
        YIELD nodeId, nodeProperty, propertyValue
        RETURN nodeId, nodeProperty, propertyValue
      """
    ).data()
  )
df = pd.DataFrame(result)
df.head()

Unnamed: 0,nodeId,nodeProperty,propertyValue
0,66701,embedding,"[0.1060660257935524, -0.1060660257935524, 0.0,..."
1,66702,embedding,"[0.08660254627466202, 0.0, 0.0, -0.08660254627..."
2,66703,embedding,"[0.1060660257935524, 0.0, 0.0, 0.0, 0.10606602..."
3,66704,embedding,"[0.1060660257935524, 0.0, 0.1060660257935524, ..."
4,66705,embedding,"[0.0, 0.0, 0.1060660257935524, 0.0, -0.1060660..."


Now let's grab the relationship properties.

In [9]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.graph.streamRelationshipProperties
        ('mygraph', ['shares', 'value'])
        YIELD sourceNodeId, targetNodeId, relationshipType, relationshipProperty, propertyValue
        RETURN sourceNodeId, targetNodeId, relationshipType, relationshipProperty, propertyValue
      """
    ).data()
  )
df2 = pd.DataFrame(result)
df2.head()

ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 6))



ClientError: ignored

Now we need to take that dataframe and shape it into something that better represents our classification problem.

In [46]:
x = df.pivot(index="nodeId", columns="nodeProperty", values="propertyValue")
x = x.reset_index()
x.columns.name = None
x.head()

Unnamed: 0,nodeId,embedding
0,56032,"[0.1060660257935524, 0.0, 0.0, -0.106066025793..."
1,56033,"[0.0, -0.1060660257935524, 0.0, 0.0, -0.106066..."
2,56034,"[0.0, 0.0, 0.0, -0.1060660257935524, -0.106066..."
3,56035,"[0.0, 0.1060660257935524, 0.0, 0.0, 0.10606602..."
4,56036,"[0.0, 0.08660254627466202, 0.08660254627466202..."


Note that each value in the embedding column is an array. To make this dataset easier to consume, we should flatten it into individual features: embedding_0, embedding_1, ... embedding_n.

In [None]:
FEATURES_FILENAME = "features.csv"

embeddings = pd.DataFrame(x["embedding"].values.tolist()).add_prefix("embedding_")
merged = x.drop(columns=["embedding"]).merge(embeddings, left_index=True, right_index=True)
merged

# To save the features to disk, uncomment the next line:
#merged.to_csv(FEATURES_FILENAME, index=False)

Unnamed: 0,nodeId,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4
0,56032,0.106066,0.000000,0.000000,-0.106066,0.000000
1,56033,0.000000,-0.106066,0.000000,0.000000,-0.106066
2,56034,0.000000,0.000000,0.000000,-0.106066,-0.106066
3,56035,0.000000,0.106066,0.000000,0.000000,0.106066
4,56036,0.000000,0.086603,0.086603,-0.086603,0.000000
...,...,...,...,...,...,...
10664,66696,0.000000,0.000000,-0.106066,0.106066,0.000000
10665,66697,0.000000,0.000000,0.000000,0.000000,0.150000
10666,66698,-0.086603,0.000000,-0.086603,0.086603,0.000000
10667,66699,0.000000,-0.106066,-0.106066,0.000000,0.000000
