<p>
  <a href="https://colab.research.google.com/github/neo4j-partners/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%205%20-%20Graph%20Data%20Science/embedding.ipynb" target="_blank">
    <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
  </a>
</p>

First, install the packages this notebook needs.

In [8]:
!pip install --quiet --upgrade neo4j
!pip install --quiet google-cloud-storage

You'll need to enter the credentials from your Neo4j instance below.

Unless you created a database under a different name, DB_NAME stays at the default, neo4j.

In [9]:
DB_URL = "neo4j://35.237.130.165:7687"
DB_USER = "neo4j"
DB_PASS = "foo123"
DB_NAME = "neo4j"
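
If you would rather not paste credentials into the notebook, one option is to read them from environment variables. This is only a sketch: the variable names (`NEO4J_URL`, `NEO4J_USER`, `NEO4J_PASSWORD`, `NEO4J_DATABASE`) and the `load_neo4j_settings` helper are hypothetical and not part of the lab.

```python
import os


def load_neo4j_settings(env=os.environ):
    """Read connection settings from a mapping, falling back to defaults.

    The variable names are assumptions made for this sketch.
    """
    return {
        "url": env.get("NEO4J_URL", "neo4j://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
        "database": env.get("NEO4J_DATABASE", "neo4j"),
    }


settings = load_neo4j_settings()
DB_URL = settings["url"]
DB_USER = settings["user"]
DB_PASS = settings["password"]
DB_NAME = settings["database"]
```

Passing a plain dict instead of `os.environ` makes the helper easy to try out without touching your real environment.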

In [10]:
import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASS))

First, we're going to create an in-memory graph representation of the data in Neo4j Graph Data Science (GDS).

In [21]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
CALL gds.graph.project(
    'mygraph',
    ['Company', 'Manager', 'Holding'],
    {
        OWNS: {orientation: 'UNDIRECTED'},
        PARTOF: {orientation: 'UNDIRECTED'}
    }
)
YIELD
    graphName AS graph,
    relationshipProjection AS readProjection,
    nodeCount AS nodes,
    relationshipCount AS rels
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Unnamed: 0,graph,readProjection,nodes,rels
0,mygraph,"{'PARTOF': {'orientation': 'UNDIRECTED', 'aggr...",458170,1787688
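
The same session and transaction boilerplate shows up in most cells of this notebook. A small helper can wrap it. This is a sketch, not part of the lab: `run_query` is a hypothetical name, and it uses `read_transaction` only because the cells here do (newer driver releases call the equivalent method `execute_read`).

```python
def run_query(driver, cypher, parameters=None, database="neo4j"):
    """Run a query in a read transaction and return the rows as a list of dicts."""
    with driver.session(database=database) as session:
        return session.read_transaction(
            # .data() is called inside the transaction so the rows are
            # fetched before the transaction closes.
            lambda tx: tx.run(cypher, parameters or {}).data()
        )
```

With this in place, a cell such as the graph listing could shrink to `run_query(driver, "CALL gds.graph.list()", database=DB_NAME)`.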


Note: if you get an error saying the graph already exists, you have probably run this cell before. You can drop the existing projection with this command:

In [20]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
      CALL gds.graph.drop('mygraph')
      """
    ).data()
  )
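
To make the notebook safe to re-run, you could check the graph catalog before dropping. The sketch below assumes a hypothetical `run_query(cypher, parameters)` callable that returns rows as a list of dicts; `drop_graph_if_exists` is likewise a name made up for this example. It relies on the `gds.graph.exists` procedure, which yields an `exists` flag.

```python
def drop_graph_if_exists(run_query, graph_name):
    """Drop a GDS projection only when the catalog reports that it exists.

    Returns True if a graph was dropped and False otherwise.
    """
    rows = run_query(
        "CALL gds.graph.exists($name) YIELD exists RETURN exists",
        {"name": graph_name},
    )
    if rows and rows[0]["exists"]:
        run_query(
            "CALL gds.graph.drop($name) YIELD graphName RETURN graphName",
            {"name": graph_name},
        )
        return True
    return False
```

Calling `drop_graph_if_exists(..., 'mygraph')` right before the projection cell would avoid the "already exists" error without failing on a fresh database.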

In [17]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
      CALL gds.graph.list()
      """
    ).data()
  )
  print(result)

[{'degreeDistribution': {'p99': 18, 'min': 1, 'max': 6864, 'mean': 3.901800641683218, 'p90': 2, 'p50': 2, 'p999': 420, 'p95': 2, 'p75': 2}, 'graphName': 'mygraph', 'database': 'neo4j', 'memoryUsage': '25 MiB', 'sizeInBytes': 26959496, 'nodeCount': 458170, 'relationshipCount': 1787688, 'configuration': {'relationshipProjection': {'PARTOF': {'orientation': 'UNDIRECTED', 'aggregation': 'DEFAULT', 'type': 'PARTOF', 'properties': {}}, 'OWNS': {'orientation': 'UNDIRECTED', 'aggregation': 'DEFAULT', 'type': 'OWNS', 'properties': {}}}, 'nodeProjection': {'Company': {'label': 'Company', 'properties': {}}, 'Holding': {'label': 'Holding', 'properties': {}}, 'Manager': {'label': 'Manager', 'properties': {}}}, 'relationshipProperties': [], 'creationTime': neo4j.time.DateTime(2022, 3, 25, 14, 41, 33, 724229000, tzinfo=<UTC>), 'validateRelationships': False, 'readConcurrency': 4, 'sudo': False, 'nodeProperties': [], 'username': None}, 'density': 8.516072981112249e-06, 'creationTime': neo4j.time.DateT

Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, which is a more full-featured and higher-performance alternative to Node2Vec. You can learn more about it [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).
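
To build some intuition for what an embedding like this captures, here is a toy pure-Python sketch of the general idea as we understand it: start from sparse random vectors, then repeatedly average each node's neighbours and add up the normalised results. It is a simplification for illustration only and not the GDS implementation; `toy_fastrp`, its parameters, and the sparsity constant are all assumptions made for this example.

```python
import math
import random


def toy_fastrp(adjacency, dim=4, iteration_weights=(0.0, 1.0, 1.0), seed=1):
    """Simplified random-projection embedding over an adjacency dict."""
    rng = random.Random(seed)
    s = 3.0  # assumed sparsity: about one in three entries is nonzero

    def random_vector():
        vec = []
        for _ in range(dim):
            r = rng.random()
            if r < 1 / (2 * s):
                vec.append(math.sqrt(s))
            elif r < 1 / s:
                vec.append(-math.sqrt(s))
            else:
                vec.append(0.0)
        return vec

    current = {node: random_vector() for node in adjacency}
    result = {node: [0.0] * dim for node in adjacency}
    for weight in iteration_weights:
        following = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                # Average the neighbours' vectors from the previous round
                avg = [
                    sum(current[n][i] for n in neighbours) / len(neighbours)
                    for i in range(dim)
                ]
            else:
                avg = [0.0] * dim
            norm = math.sqrt(sum(v * v for v in avg))
            following[node] = [v / norm for v in avg] if norm > 0 else avg
        for node in adjacency:
            # Accumulate this round's vectors, scaled by the round's weight
            result[node] = [
                r + weight * v for r, v in zip(result[node], following[node])
            ]
        current = following
    return result


toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
toy_embeddings = toy_fastrp(toy_graph)
```

In this sketch a node with no neighbours, such as `d`, ends up with an all-zero vector, because there is nothing to average.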

In [22]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.fastRP.mutate('mygraph',{
        embeddingDimension: 16,
        randomSeed: 1, 
        mutateProperty:'embedding'
        })
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,preProcessingMillis,computeMillis,configuration
0,458170,0,458170,0,254,"{'nodeSelfInfluence': 0, 'relationshipWeightPr..."


That creates an embedding for every node in the projection, regardless of its label. However, we only want the embedding on the Holding nodes.

We're going to take the embedding from our projection and write it to the Holding nodes in the underlying database.

In [27]:
with driver.session(database=DB_NAME) as session:
  # Fetch the rows inside the session so we print the yielded values, not the Result object
  result = session.run(
    """
      CALL gds.graph.writeNodeProperties('mygraph', ['embedding'], ['Holding'])
      YIELD writeMillis
    """
  ).data()
  print(result)


In [31]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        MATCH (n:Holding) RETURN n LIMIT 100
      """
    ).data()
  )

In [33]:
df = pd.DataFrame([dict(record.get('n')) for record in result])
df

Unnamed: 0,shares,cusip,reportCalendarOrQuarter,filingManager,embedding,value,target
0,270,88579Y101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.06891358643770218, 0.1725142002105713, -0....",52024000,False
1,195,00508Y102,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.0012208839179947972, 0.005601316690444946,...",32175000,False
2,4939,00724F101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.11818736791610718, 0.26252642273902893, -0....",2347852000,False
3,1557,02079K305,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.14201897382736206, 0.2642807066440582, 0.14...",3211344000,False
4,837,02079K107,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.30085813999176025, 0.2475065439939499, 0.44...",1731443000,False
...,...,...,...,...,...,...,...
95,1131,78410G104,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.4144970774650574, 0.3067362606525421, -0.21...",313909000,False
96,18914,808524714,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.10019616037607193, 0.13300782442092896, -0....",964228000,False
97,21346,808524854,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.3086536228656769, 0.6544697284698486, 0.043...",1204128000,False
98,7153,808524805,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.05431220680475235, 0.038866832852363586, -0...",269111000,False


Now let's grab the relationships.

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.graph.streamRelationshipProperties
        ('mygraph', ['shares', 'value'])
        YIELD sourceNodeId, targetNodeId, relationshipType, propertyValue
        RETURN sourceNodeId, targetNodeId, relationshipType, propertyValue
      """
    ).data()
  )
df2 = pd.DataFrame(result)
df2.head()

Unnamed: 0,sourceNodeId,targetNodeId,relationshipType,propertyValue
0,0,3906,Owns,270.0
1,0,3907,Owns,195.0
2,0,3908,Owns,4939.0
3,0,3909,Owns,1557.0
4,0,3910,Owns,837.0


Now we need to take that dataframe and shape it into something that better represents our classification problem.

In [None]:
# Assumes df is in long format with the columns nodeId, nodeProperty and propertyValue
x = df.pivot(index="nodeId", columns="nodeProperty", values="propertyValue")
x = x.reset_index()
x.columns.name = None
x.head()

Unnamed: 0,nodeId,embedding
0,0,"[0.0, -0.1060660257935524, 0.0, 0.0, -0.106066..."
1,1,"[0.0, 0.0, 0.0, 0.0, 0.0]"
2,2,"[0.0, 0.0, 0.0, 0.0, 0.0]"
3,3,"[0.0, 0.0, 0.0, -0.15000000596046448, 0.0]"
4,4,"[0.0, 0.0, 0.15000000596046448, 0.0, 0.0]"
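
To see what the pivot does in isolation, here is a small self-contained sketch on made-up data. The rows and the extra `degree` property are hypothetical; only the column names `nodeId`, `nodeProperty` and `propertyValue` are taken from the cell above.

```python
import pandas as pd

# Long format: one row per (node, property) pair
long_df = pd.DataFrame(
    [
        {"nodeId": 0, "nodeProperty": "embedding", "propertyValue": [0.0, -0.1, 0.0]},
        {"nodeId": 0, "nodeProperty": "degree", "propertyValue": 2},
        {"nodeId": 1, "nodeProperty": "embedding", "propertyValue": [0.0, 0.0, 0.0]},
        {"nodeId": 1, "nodeProperty": "degree", "propertyValue": 0},
    ]
)

# Wide format: one row per node, one column per property
wide_df = long_df.pivot(index="nodeId", columns="nodeProperty", values="propertyValue")
wide_df = wide_df.reset_index()
wide_df.columns.name = None
print(wide_df)
```

Each distinct value of `nodeProperty` becomes its own column, so a frame with a single streamed property ends up with just `nodeId` and that property.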


Note that each value in the embedding column is an array. To make this dataset more consumable, we should flatten it into individual features: embedding_0, embedding_1, ... embedding_n.

In [None]:
FEATURES_FILENAME = "features.csv"

# Expand each embedding list into one column per dimension: embedding_0, embedding_1, ...
embeddings = pd.DataFrame(x["embedding"].values.tolist()).add_prefix("embedding_")
merged = x.drop(columns=["embedding"]).merge(embeddings, left_index=True, right_index=True)

# Write the features to disk so the file exists when we upload it to Cloud Storage
merged.to_csv(FEATURES_FILENAME, index=False)
merged

Unnamed: 0,nodeId,embedding_0,embedding_1,embedding_2,embedding_3,embedding_4
0,0,0.000000,-0.106066,0.000000,0.000000,-0.106066
1,1,0.000000,0.000000,0.000000,0.000000,0.000000
2,2,0.000000,0.000000,0.000000,0.000000,0.000000
3,3,0.000000,0.000000,0.000000,-0.150000,0.000000
4,4,0.000000,0.000000,0.150000,0.000000,0.000000
...,...,...,...,...,...,...
11243,11243,0.000000,0.000000,-0.106066,-0.106066,0.000000
11244,11244,-0.075000,0.075000,0.000000,0.075000,-0.075000
11245,11245,0.000000,0.000000,0.000000,0.000000,0.000000
11246,11246,0.000000,0.000000,0.000000,0.000000,0.000000


Now let's write the file to Google Cloud Storage so we can use it in our model.

In [None]:
!pip install --quiet google-cloud-storage

## Define Google Cloud variables
You'll need to set a few variables for your GCP environment. PROJECT_ID and STORAGE_BUCKET are the most critical; the others will probably work with the defaults given.

In [None]:
# Edit these variables!
PROJECT_ID = "YOUR-PROJECT-ID"
STORAGE_BUCKET = "YOUR-BUCKET-NAME"

# You can leave these defaults
REGION = "us-central1"
STORAGE_PATH = "form13"

In [None]:
import os

os.environ["GCLOUD_PROJECT"] = PROJECT_ID

In [None]:
try:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Not running in Colab; rely on the environment's default credentials instead
    pass

In [None]:
from google.cloud import storage
client = storage.Client()

In [None]:
bucket = client.bucket(STORAGE_BUCKET)
# create_bucket raises a Conflict error if the bucket already exists; skip this call in that case
client.create_bucket(bucket)

In [None]:
# Upload the features file to that bucket
for filename in [FEATURES_FILENAME]:
    upload_path = os.path.join(STORAGE_PATH, filename)
    blob = bucket.blob(upload_path)
    blob.upload_from_filename(filename)
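
Later steps typically refer to the uploaded file by its `gs://` URI. The sketch below shows how that URI follows from the bucket name, the storage path and the filename. The `gcs_uri` helper is hypothetical, and it joins with `posixpath` so the separator is always `/` regardless of the operating system.

```python
import posixpath


def gcs_uri(bucket_name, *parts):
    """Build a gs:// URI from a bucket name and object path components."""
    return "gs://" + posixpath.join(bucket_name, *parts)


# Placeholder values matching the defaults used in this notebook
print(gcs_uri("YOUR-BUCKET-NAME", "form13", "features.csv"))
```

In the notebook this would be called as `gcs_uri(STORAGE_BUCKET, STORAGE_PATH, FEATURES_FILENAME)`.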