<p>
  <a href="https://colab.research.google.com/github/neo4j-partners/hands-on-lab-neo4j-and-vertex-ai/blob/main/Lab%205%20-%20Graph%20Data%20Science/embedding.ipynb" target="_blank">
    <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
  </a>
</p>

# Install Prerequisites
First off, you'll need to install a few packages.

In [2]:
%pip install --quiet --upgrade graphdatascience
%pip install --quiet google-cloud-storage

[?25l[K     |███▋                            | 10 kB 17.7 MB/s eta 0:00:01[K     |███████▏                        | 20 kB 22.4 MB/s eta 0:00:01[K     |██████████▉                     | 30 kB 28.3 MB/s eta 0:00:01[K     |██████████████▍                 | 40 kB 27.2 MB/s eta 0:00:01[K     |██████████████████              | 51 kB 5.2 MB/s eta 0:00:01[K     |█████████████████████▋          | 61 kB 6.0 MB/s eta 0:00:01[K     |█████████████████████████▎      | 71 kB 3.4 MB/s eta 0:00:01[K     |████████████████████████████▉   | 81 kB 3.8 MB/s eta 0:00:01[K     |████████████████████████████████| 90 kB 2.8 MB/s 
[?25h  Building wheel for neo4j (setup.py) ... [?25l[?25hdone


# Working with Neo4j
You'll need to enter the credentials from your Neo4j instance below. You can get these from the Neo4j Browser by running the command `:server connect`.

DB_USER and DB_NAME both default to neo4j.

In [3]:
# Edit these variables!
DB_URL = "neo4j+s://d1901d9c.databases.neo4j.io:7687"
DB_PASS = "e4QzO_Ipfii1F6wB7MNMmUm4UxF2S1KAKqS-qWe9DK0"

# You can leave this default
DB_USER = 'neo4j'

In [4]:
from graphdatascience import GraphDataScience
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS), aura_ds=True)
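Hard-coding a password in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, the credentials could instead be read from environment variables. The variable names `NEO4J_URL`, `NEO4J_USER` and `NEO4J_PASSWORD` and the helper `load_neo4j_credentials` are assumptions made for illustration; the lab itself doesn't define them.

```python
import os


def load_neo4j_credentials(env=None):
    """Return (url, user, password) read from environment-style variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    url = env["NEO4J_URL"]                 # required, e.g. a neo4j+s:// address
    password = env["NEO4J_PASSWORD"]       # required
    user = env.get("NEO4J_USER", "neo4j")  # falls back to the default user
    return url, user, password
```

With those variables set, `DB_URL, DB_USER, DB_PASS = load_neo4j_credentials()` would stand in for the hard-coded values.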

First, we're going to create an in-memory graph representation of the data in Neo4j Graph Data Science (GDS).

In [17]:
result = gds.run_cypher(
  """
    CALL gds.graph.project(
      'mygraph',
      ['Company', 'Manager', 'Holding'],
      {
          OWNS: {orientation: 'UNDIRECTED'},
          PARTOF: {orientation: 'UNDIRECTED'}
      }
    )
    YIELD
      graphName AS graph,
      relationshipProjection AS readProjection,
      nodeCount AS nodes,
      relationshipCount AS rels
  """
)
display(result)

Unnamed: 0,graph,readProjection,nodes,rels
0,mygraph,"{'PARTOF': {'orientation': 'UNDIRECTED', 'aggr...",458170,1787688


Note: if you get an error saying the graph already exists, you probably ran this cell before. You can drop the existing projection with this command:

In [16]:
result = gds.run_cypher(
  """
    CALL gds.graph.drop('mygraph')
  """
)
display(result)

Unnamed: 0,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema
0,mygraph,neo4j,,-1,458170,1787688,{'relationshipProjection': {'PARTOF': {'orient...,9e-06,2022-06-04T01:18:04.672391000+00:00,2022-06-04T01:19:29.366443000+00:00,"{'relationships': {'PARTOF': {}, 'OWNS': {}}, ..."
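If you'd rather avoid that error altogether, `gds.graph.drop` takes a second argument, `failIfMissing`. Passing `false` turns the call into a no-op when the graph isn't there, so it's safe to run before every projection. Here's a small sketch; `drop_graph_if_present` is a hypothetical helper name, and `gds` is assumed to be the client object created earlier.

```python
def drop_graph_if_present(gds, graph_name):
    """Drop a projected graph, doing nothing if it does not exist."""
    return gds.run_cypher(
        """
        CALL gds.graph.drop($graph_name, false)
        YIELD graphName
        """,
        params={"graph_name": graph_name},
    )
```

Calling `drop_graph_if_present(gds, 'mygraph')` right before the projection cell lets you rerun the notebook from the top without cleaning up by hand.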


Now, let's list the details of the graph to make sure the projection was created as we want.

In [32]:
result = gds.run_cypher(
  """
    CALL gds.graph.list()
  """
)
display(result)

Unnamed: 0,degreeDistribution,graphName,database,memoryUsage,sizeInBytes,nodeCount,relationshipCount,configuration,density,creationTime,modificationTime,schema
0,"{'p99': 18, 'min': 1, 'max': 6864, 'mean': 3.9...",mygraph,neo4j,65 MiB,69097107,458170,1787688,{'relationshipProjection': {'PARTOF': {'orient...,9e-06,2022-06-04T01:21:15.265551000+00:00,2022-06-04T01:21:22.627002000+00:00,"{'relationships': {'PARTOF': {}, 'OWNS': {}}, ..."


Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, a more full-featured and higher-performance alternative to Node2Vec. You can learn more about it in the [FastRP documentation](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

In [19]:
result = gds.run_cypher(
  """
  CALL gds.fastRP.mutate('mygraph',{
    embeddingDimension: 16,
    randomSeed: 1,
    mutateProperty:'embedding'
  })
  """
)
display(result)

Unnamed: 0,nodePropertiesWritten,mutateMillis,nodeCount,preProcessingMillis,computeMillis,configuration
0,458170,0,458170,0,221,"{'nodeSelfInfluence': 0, 'relationshipWeightPr..."
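To build some intuition for what that call computes, here is a toy sketch of the idea behind FastRP: every node starts with a very sparse random vector, each iteration replaces a node's vector with the normalized average of its neighbours' vectors, and the final embedding is a weighted sum over iterations. This is not the GDS implementation. It leaves out degree-based normalization and other options, and the function name, default weights and sparsity value are assumptions chosen for illustration.

```python
import math
import random


def fastrp_sketch(adjacency, dim=4, weights=(0.0, 1.0, 1.0), sparsity=3, seed=1):
    """Toy FastRP-style embedding for a graph given as {node: [neighbours]}."""
    rng = random.Random(seed)
    scale = math.sqrt(sparsity)

    def random_vector():
        # Very sparse random projection: +/- sqrt(s) with probability 1/(2s) each, else 0.
        vec = []
        for _ in range(dim):
            r = rng.random()
            if r < 1 / (2 * sparsity):
                vec.append(scale)
            elif r < 1 / sparsity:
                vec.append(-scale)
            else:
                vec.append(0.0)
        return vec

    def normalize(vec):
        # L2-normalize, leaving all-zero vectors untouched.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm > 0 else vec

    current = {node: random_vector() for node in adjacency}
    result = {node: [0.0] * dim for node in adjacency}

    for weight in weights:
        averaged = {}
        for node, neighbours in adjacency.items():
            if neighbours:
                # Average the neighbours' vectors from the previous iteration.
                mean = [
                    sum(current[m][k] for m in neighbours) / len(neighbours)
                    for k in range(dim)
                ]
                averaged[node] = normalize(mean)
            else:
                averaged[node] = [0.0] * dim
        current = averaged
        # Accumulate this iteration's vectors with its weight.
        for node in adjacency:
            result[node] = [r + weight * c for r, c in zip(result[node], current[node])]
    return result


toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
toy_embeddings = fastrp_sketch(toy_graph, dim=4, seed=1)
```

As with `randomSeed` in the GDS call, fixing `seed` makes the toy result reproducible from run to run.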


That creates an embedding for every node in the projection, whatever its label. However, we only want the embedding on the `Holding` nodes.

We're going to take the embedding from our projection and write it to the `Holding` nodes in the underlying database.

In [20]:
result = gds.run_cypher(
  """
    CALL gds.graph.writeNodeProperties('mygraph', ['embedding'], ['Holding'])
    YIELD writeMillis
  """
)
display(result)

Unnamed: 0,writeMillis
0,2377


In [5]:
result = gds.run_cypher(
  """
    MATCH (n:Holding) RETURN n
  """
)
display(result)

Unnamed: 0,n
0,"(shares, cusip, reportCalendarOrQuarter, filin..."
1,"(shares, cusip, reportCalendarOrQuarter, filin..."
2,"(shares, cusip, reportCalendarOrQuarter, filin..."
3,"(shares, cusip, reportCalendarOrQuarter, filin..."
4,"(shares, cusip, reportCalendarOrQuarter, filin..."
...,...
446917,"(shares, cusip, reportCalendarOrQuarter, filin..."
446918,"(shares, cusip, reportCalendarOrQuarter, filin..."
446919,"(shares, cusip, reportCalendarOrQuarter, filin..."
446920,"(shares, cusip, reportCalendarOrQuarter, filin..."


Note that the query above can take 2-3 minutes to run, as it's grabbing nearly half a million nodes along with all their properties, including our new embedding.
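Before waiting on the full result, it can be worth pulling just a handful of nodes to confirm the embedding property was written. A minimal sketch, where `holding_query` is a hypothetical helper that builds the same Cypher with an optional `LIMIT`:

```python
def holding_query(limit=None):
    """Build the Holding query, optionally capped at `limit` nodes."""
    query = "MATCH (n:Holding) RETURN n"
    if limit is not None:
        # int() guards against anything other than a whole number ending up in the query.
        query += f" LIMIT {int(limit)}"
    return query
```

For example, `display(gds.run_cypher(holding_query(limit=5)))` shows five nodes without touching `result`, so the cells that follow still work on the full query output.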

In [24]:
import pandas as pd
df = pd.DataFrame([dict(record.items()) for record in result['n']])
df

Unnamed: 0,shares,cusip,reportCalendarOrQuarter,filingManager,embedding,value,nameOfIssuer,target
0,270,88579Y101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.06891358643770218, 0.1725142002105713, -0....",52024000,3M Co,False
1,195,00508Y102,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[-0.0012208839179947972, 0.005601316690444946,...",32175000,Acuity Brands Inc,False
2,4939,00724F101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.11818736791610718, 0.26252642273902893, -0....",2347852000,Adobe Systems Inc,False
3,1557,02079K305,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.14201897382736206, 0.2642807066440582, 0.14...",3211344000,Alphabet Inc A,False
4,837,02079K107,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,"[0.30085813999176025, 0.2475065439939499, 0.44...",1731443000,Alphabet Inc C,False
...,...,...,...,...,...,...,...,...
446917,56874,911312106,09-30-2021,LEE DANNER & BASS INC,"[-0.10755999386310577, 0.09883543848991394, 0....",10357000,United Parcel Svc. Cl B,False
446918,231000,92552v100,09-30-2021,LEE DANNER & BASS INC,"[-0.19772684574127197, 0.19481556117534637, 0....",12721000,ViaSat Inc,True
446919,55104,92826C839,09-30-2021,LEE DANNER & BASS INC,"[0.19159139692783356, 0.5284039974212646, 0.15...",12274000,Visa Inc,True
446920,79459,931142103,09-30-2021,LEE DANNER & BASS INC,"[0.16196903586387634, 0.5445767045021057, 0.60...",11075000,Wal-Mart Stores Inc.,False


Note that each value in the embedding column is an array. To make this dataset more consumable, we should flatten that out into individual feature columns: embedding_0, embedding_1, ... embedding_n.


In [25]:
embeddings = pd.DataFrame(df['embedding'].values.tolist()).add_prefix("embedding_")
merged = df.drop(columns=['embedding']).merge(embeddings, left_index=True, right_index=True)
merged

Unnamed: 0,shares,cusip,reportCalendarOrQuarter,filingManager,value,nameOfIssuer,target,embedding_0,embedding_1,embedding_2,...,embedding_6,embedding_7,embedding_8,embedding_9,embedding_10,embedding_11,embedding_12,embedding_13,embedding_14,embedding_15
0,270,88579Y101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,52024000,3M Co,False,-0.068914,0.172514,-0.270718,...,0.142907,0.707541,0.255056,0.168236,0.014960,-0.183164,0.214030,-0.149799,0.436635,-0.393186
1,195,00508Y102,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,32175000,Acuity Brands Inc,False,-0.001221,0.005601,-0.008561,...,-0.249999,-0.027272,0.210765,0.232817,0.275897,-0.276700,0.338566,-0.479243,0.789944,-0.336510
2,4939,00724F101,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,2347852000,Adobe Systems Inc,False,0.118187,0.262526,-0.081295,...,-0.591563,0.398565,0.239687,-0.259427,-0.352596,0.025593,0.626399,0.247615,0.445566,0.094596
3,1557,02079K305,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,3211344000,Alphabet Inc A,False,0.142019,0.264281,0.142079,...,-0.150988,1.035650,0.370033,-0.064532,-0.239375,-0.003771,0.124420,0.027277,0.077211,-0.477533
4,837,02079K107,03-31-2021,LEDERER & ASSOCIATES INVESTMENT COUNSEL/CA,1731443000,Alphabet Inc C,False,0.300858,0.247507,0.449530,...,-0.415947,0.365203,-0.428251,-0.000808,0.304490,0.002407,-0.100384,-0.317436,0.368293,-0.037012
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
446917,56874,911312106,09-30-2021,LEE DANNER & BASS INC,10357000,United Parcel Svc. Cl B,False,-0.107560,0.098835,0.136811,...,-0.246284,0.304195,0.127205,-0.927462,-0.280658,0.211018,-0.149063,-0.060732,-0.073806,-0.089820
446918,231000,92552v100,09-30-2021,LEE DANNER & BASS INC,12721000,ViaSat Inc,True,-0.197727,0.194816,0.100359,...,-0.246779,0.370052,0.573178,0.061018,-0.038240,-0.423521,-0.037658,-0.372200,-0.453317,-0.374625
446919,55104,92826C839,09-30-2021,LEE DANNER & BASS INC,12274000,Visa Inc,True,0.191591,0.528404,0.158589,...,-0.766427,-0.110386,0.576788,-0.022191,-0.768111,-0.192285,-0.119118,0.234878,0.124492,-0.344153
446920,79459,931142103,09-30-2021,LEE DANNER & BASS INC,11075000,Wal-Mart Stores Inc.,False,0.161969,0.544577,0.602678,...,-0.070697,-0.004854,0.217884,-0.519946,-0.045873,-0.047081,-0.103562,0.011971,-0.375597,-0.362491
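The flattening step is easier to follow on a tiny example. The sketch below uses made-up data (the `cusip` values and three-element embeddings are invented) and passes the original index through explicitly, so the new columns stay aligned with their rows even if the DataFrame doesn't have a default 0..n-1 index.

```python
import pandas as pd

# Made-up data: two rows, each with a three-element embedding.
toy = pd.DataFrame({
    "cusip": ["AAA", "BBB"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# One column per array position, named embedding_0, embedding_1, ...
wide = pd.DataFrame(toy["embedding"].tolist(), index=toy.index).add_prefix("embedding_")

# Swap the array column for the flattened columns.
flat = toy.drop(columns=["embedding"]).join(wide)
print(flat.columns.tolist())
```

The resulting columns are `cusip`, `embedding_0`, `embedding_1` and `embedding_2`, mirroring what the cell above does with the 16-dimensional embedding.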


Now that we have the data formatted properly, let's label each row as training, validation or test data based on its reporting quarter, and write the result to disk.

In [26]:
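df = merged

# Label each row by reporting quarter: Q1 trains, Q2 validates, Q3 tests
df['split'] = df['reportCalendarOrQuarter'].replace(
    ['03-31-2021', '06-30-2021', '09-30-2021'],
    ['TRAIN', 'VALIDATE', 'TEST']
)

# The quarter is now encoded in 'split', so drop the original column
df = df.drop(columns=['reportCalendarOrQuarter'])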

df.to_csv('embedding.csv', index=False)
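One thing to be aware of: `replace` leaves any value it doesn't recognize unchanged, so a quarter outside the three listed would end up in the `split` column as a raw date. As a hedged alternative, the labelling rule can be written as a plain function that fails loudly instead. `SPLIT_BY_QUARTER` and `split_for_quarter` are hypothetical names, not part of the lab.

```python
SPLIT_BY_QUARTER = {
    "03-31-2021": "TRAIN",
    "06-30-2021": "VALIDATE",
    "09-30-2021": "TEST",
}


def split_for_quarter(quarter):
    """Map a reporting quarter to its split label, rejecting unknown quarters."""
    try:
        return SPLIT_BY_QUARTER[quarter]
    except KeyError:
        raise ValueError(f"No split defined for quarter {quarter!r}") from None
```

It would be applied in place of the `replace` call, before the quarter column is dropped: `df['split'] = df['reportCalendarOrQuarter'].map(split_for_quarter)`.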

# Authenticate your Google Cloud Account
Now let's write the file to Google Cloud Storage so we can use it in our model.  To do so, we must first authenticate.

In [27]:
# Edit these variables!
PROJECT_ID = 'neo4jbusinessdev' #'YOUR-PROJECT-ID'
STORAGE_BUCKET = 'form13foo123jfgroijgreoij' #'NAME-OF-BUCKET-TO-CREATE'

# You can leave this default
REGION = 'us-central1'

In [28]:
import os
os.environ['GCLOUD_PROJECT'] = PROJECT_ID

In [29]:
try:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
except ImportError:
    # Not running in Colab; rely on credentials already configured in the environment
    pass

# Upload to Google Cloud Storage
Now we can upload our dataset to our bucket.

In [30]:
from google.cloud import storage
client = storage.Client()

In [31]:
bucket = client.bucket(STORAGE_BUCKET)
bucket.location=REGION
client.create_bucket(bucket)

  


<Bucket: form13foo123jfgroijgreoij>
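Running that cell a second time fails, because the bucket already exists. A minimal sketch of a rerunnable version looks the bucket up first and only creates it when it's missing. `get_or_create_bucket` is a hypothetical helper; it assumes `client` is the `storage.Client` created above and that the bucket, if present, belongs to a project you can access.

```python
def get_or_create_bucket(client, bucket_name, location):
    """Return the named bucket, creating it in `location` if it does not exist."""
    bucket = client.lookup_bucket(bucket_name)  # returns None when the bucket is missing
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)
    return bucket
```

Usage would be `bucket = get_or_create_bucket(client, STORAGE_BUCKET, REGION)`.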

In [32]:
filename = 'embedding.csv'
# GCS object names always use forward slashes, regardless of the local OS
upload_path = f'form13/{filename}'
blob = bucket.blob(upload_path)
blob.upload_from_filename(filename)
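Tools that read from Cloud Storage generally refer to an object by its `gs://` URI, so it's handy to have that on hand once the upload finishes. A tiny sketch, with `gcs_uri` as a hypothetical helper name:

```python
def gcs_uri(bucket_name, object_name):
    """Build the gs:// URI for an object in a bucket."""
    return f"gs://{bucket_name}/{object_name.lstrip('/')}"
```

For the file uploaded above, that would be `gcs_uri(STORAGE_BUCKET, upload_path)`.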