# Create Embedding
In this notebook, we'll connect to a Neo4j instance, load data, and compute a graph embedding.  We'll then pull the results into pandas and write them out as CSV files.

## Using the Neo4j API
Let's connect to our Neo4j deployment.  First off, install the Neo4j Graph Data Science package.

In [None]:
%pip install graphdatascience

Now you're going to need the connection string and credentials for the Neo4j deployment you created earlier.

In [None]:
from graphdatascience import GraphDataScience
# WARNING: Update these values with the Neo4j server endpoint and password you specified in the previous lab!
DB_URL = 'neo4j://server-endpoint:7687'
DB_PASS = 'password'

# You can leave this default
DB_USER = 'neo4j'
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))
print("Look Mom, I was able to connect to Neo4j from SageMaker.")


## Graph Data Science
Now we're going to use Neo4j Graph Data Science to create an in-memory graph representation of the data.  We'll enhance that representation with features we engineer using a graph embedding.

In [None]:
result = gds.run_cypher(
  """
    CALL gds.graph.project(
      'mygraph',
      ['Company', 'Manager', 'Holding'],
      {
          OWNS: {orientation: 'UNDIRECTED'},
          PARTOF: {orientation: 'UNDIRECTED'}
      }
    )
    YIELD
      graphName AS graph,
      relationshipProjection AS readProjection,
      nodeCount AS nodes,
      relationshipCount AS rels
  """
)
display(result)

If you get an error saying the graph already exists, that's probably because you ran this code before. You can destroy it using this command:

In [None]:
# WARNING!!!: Execute this cell only if you get an error saying the graph already exists!
# If you execute this, remember to re-run the cell above to recreate 'mygraph'!
result = gds.run_cypher(
  """
    CALL gds.graph.drop('mygraph')
  """
)
display(result)
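If you'd rather not hit the error at all, here is a minimal sketch of a drop that is safe to run whether or not the graph exists. It relies on the second argument of `gds.graph.drop` (`failIfMissing`); the helper name `drop_graph_if_exists` is made up for this example, and `gds` is assumed to be the client created earlier.

```python
def drop_graph_if_exists(gds, graph_name):
    """Drop a projected graph, doing nothing if it does not exist.

    `gds` is assumed to be a connected GraphDataScience client, or any object
    with a compatible run_cypher(query, params=...) method.
    """
    # Passing false for failIfMissing turns the call into a no-op
    # when no graph with this name is in the catalog.
    return gds.run_cypher(
        "CALL gds.graph.drop($name, false)",
        params={"name": graph_name},
    )
```

Calling `drop_graph_if_exists(gds, 'mygraph')` before the projection cell should let you re-run the notebook from the top without editing anything.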

Now, let's list the details of the graph to make sure the projection was created as we want.

In [None]:
result = gds.run_cypher(
  """
    CALL gds.graph.list()
  """
)
display(result)

Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, a more full-featured and higher-performance alternative to Node2Vec. You can learn more about that [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

There are a bunch of parameters we could adjust here.  One of the most obvious is embeddingDimension, the length of the vector computed for each node.  The documentation covers many more.

In [None]:
result = gds.run_cypher(
  """
  CALL gds.fastRP.mutate('mygraph',{
    embeddingDimension: 16,
    randomSeed: 1,
    mutateProperty:'embedding'
  })
  """
)
display(result)
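To build some intuition for what FastRP does, here is a toy NumPy sketch of the idea: start from sparse random vectors, repeatedly average them over each node's neighbors, and add up the normalized intermediate results. This is a simplified illustration, not the GDS implementation. The function name `toy_fastrp`, the tiny path graph, and the iteration weights are all assumptions made for the example.

```python
import numpy as np

def toy_fastrp(adjacency, dim=4, iteration_weights=(0.0, 1.0, 1.0), seed=1):
    """Simplified random-projection embedding in the spirit of FastRP."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    # Sparse random start vectors: each entry is -sqrt(3), 0 or +sqrt(3).
    current = rng.choice([-1.0, 0.0, 1.0], size=(n, dim), p=[1/6, 2/3, 1/6]) * np.sqrt(3)
    # Row-normalize the adjacency matrix so each step averages over neighbors.
    degree = adjacency.sum(axis=1, keepdims=True)
    averaging = adjacency / np.maximum(degree, 1)
    embedding = np.zeros((n, dim))
    for weight in iteration_weights:
        # Propagate one more hop through the graph.
        current = averaging @ current
        # L2-normalize this hop's vectors before adding them to the result.
        norms = np.linalg.norm(current, axis=1, keepdims=True)
        embedding += weight * current / np.maximum(norms, 1e-12)
    return embedding

# A 4-node undirected path graph: 0 - 1 - 2 - 3
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

embedding = toy_fastrp(adjacency, dim=4)
print(embedding.shape)  # (4, 4): one 4-dimensional vector per node
```

As with `randomSeed` in the cell above, fixing the seed makes the result repeatable from run to run.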

That creates an embedding for every node in the projection, whatever its label.  However, we only want the embedding on the Holding nodes.

We're going to take the embedding from our projection and write it to the Holding nodes in the underlying database.

In [None]:
result = gds.run_cypher(
  """
    CALL gds.graph.writeNodeProperties('mygraph', ['embedding'], ['Holding'])
    YIELD writeMillis
  """
)
display(result)

In [None]:
result = gds.run_cypher(
  """
    MATCH (n:Holding) RETURN n
  """
)
display(result)

Note that this query can take 2-3 minutes to run, as it's grabbing nearly half a million nodes along with all their properties, including our new embedding.

## Pandas
Now we're going to reformat the query output into a pandas DataFrame, with one row per Holding node and one column per property.

In [None]:
import pandas as pd
df = pd.DataFrame([dict(record.items()) for record in result['n']])
df

Note that each value in the embedding column is an array. To make this dataset more consumable, we should flatten that out into individual features: embedding_0, embedding_1, ... embedding_15 (one per embedding dimension).


In [None]:
embeddings = pd.DataFrame(df['embedding'].values.tolist()).add_prefix("embedding_")
merged = df.drop(columns=['embedding']).merge(embeddings, left_index=True, right_index=True)
merged
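To see what the flattening does, here is the same idea on a two-row toy frame. The `name` column and its values are made up for illustration. This sketch also passes `index=` explicitly so the new columns stay aligned even when the frame's index is not 0..n-1.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["A", "B"],  # hypothetical property column
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Expand each list into its own columns, keeping the original row index.
flat = pd.DataFrame(toy["embedding"].tolist(), index=toy.index).add_prefix("embedding_")
toy_merged = toy.drop(columns=["embedding"]).join(flat)

print(list(toy_merged.columns))
# ['name', 'embedding_0', 'embedding_1', 'embedding_2']
```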

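Now that we have the data formatted properly, let's split it by reporting quarter into training, testing and validation sets, dropping the target column from the test and validation sets.  We'll write each set to disk as a CSV file.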

In [None]:
df = merged

train = df.loc[df['reportCalendarOrQuarter'] == '03-31-2021']
train.to_csv('train.csv', index=False)

test = df.loc[df['reportCalendarOrQuarter'] == '06-30-2021']
test = test.drop(['target'], axis=1)
test.to_csv('test.csv', index=False)

validate = df.loc[df['reportCalendarOrQuarter'] == '09-30-2021']
validate = validate.drop(['target'], axis=1)
validate.to_csv('validate.csv', index=False)
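The three blocks above follow the same pattern, so they could be folded into one helper. The sketch below shows one way to do that on a toy frame, followed by a quick sanity check that every row lands in exactly one split. The helper name `split_by_quarter` and the toy `target` values are assumptions for the example; the column names are the ones used above.

```python
import pandas as pd

def split_by_quarter(df, quarter, keep_target=True):
    """Return the rows for one reporting quarter, optionally without 'target'."""
    subset = df.loc[df["reportCalendarOrQuarter"] == quarter]
    return subset if keep_target else subset.drop(columns=["target"])

toy = pd.DataFrame({
    "reportCalendarOrQuarter": ["03-31-2021", "06-30-2021", "09-30-2021", "03-31-2021"],
    "target": [True, False, True, False],  # made-up labels
})

toy_train = split_by_quarter(toy, "03-31-2021")
toy_test = split_by_quarter(toy, "06-30-2021", keep_target=False)
toy_validate = split_by_quarter(toy, "09-30-2021", keep_target=False)

# Every row should land in exactly one split.
assert len(toy_train) + len(toy_test) + len(toy_validate) == len(toy)
```

If the assertion fails on the real data, some rows have a reporting quarter other than the three used here and are being silently left out of all three files.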