# Form13
This notebook described how to use Neo4j and SageMaker together.  It focuses on a dataset from the SEC containing Form13 information.

Click this button to open the notebook in SageMaker Studio Lab. [![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/neo4j-partners/neo4j-sagemaker/blob/main/form13.ipynb)

## Install Prerequisites
First off, you'll also need to install a few packages.

In [2]:
%pip install graphdatascience

Note: you may need to restart the kernel to use updated packages.


## Deploy Neo4j
You're going to need a Neo4j deployment to run this lab.  The easiest way to get that is via the [AWS Marketplace](https://aws.amazon.com/marketplace/seller-profile?id=23ec694a-d2af-4641-b4d3-b7201ab2f5f9).  Select "Neo4j Enterprise Edition" and deploy that.  Suggested parameters are:

* Graph Database Version - 4.4.6
* Install Graph Data Sceince - True
* Graph Data Science License Key - None
* Install Bloom - True
* Bloom License Key - None
* Password - Enter something here
* Node Count - 1
* Instance Type - t3.medium
* Key Name - Pick a key you have
* SSH CIDR - 0.0.0.0/0

The Marketplace listing deploys an ASG.  When deployment is complete, you can get the IP address of your Neo4j node from that ASG.  You can view deployed ASGs [here](https://us-east-1.console.aws.amazon.com/ec2autoscaling).

## Modify Neo4j Config
Neo4j's database protocol is called Bolt.  By default it runs on port 7687.  SageMaker Studio Lab blocks outbound access on all ports except for 80, 443 and 53.  To work around this, we're going to modify our Neo4j config.

Open the AWS Console.  Locate your instance and click "Connect" to open an SSH terminal window.  Run these commands:

In [None]:
sed -i s/#dbms.connector.bolt.listen_address=:7687/dbms.connector.bolt.listen_address=:53/g /etc/neo4j/neo4j.conf
sed -i s/#dbms.connector.bolt.advertised_address=:7687/dbms.connector.bolt.advertised_address=:53/g /etc/neo4j/neo4j.conf
sudo service neo4j restart

## Using the Neo4j API


In [1]:
%pip install neo4j

Note: you may need to restart the kernel to use updated packages.


In [3]:
# Edit these variables!
DB_URL = 'neo4j://ec2-54-152-83-136.compute-1.amazonaws.com:80'
DB_PASS = 'foo123'

# You can leave these defaults
DB_USER = "neo4j"
DB_NAME = "neo4j"

In [4]:
import pandas as pd
from neo4j import GraphDatabase

driver = GraphDatabase.driver(DB_URL, auth=(DB_USER, DB_PASS))

In [5]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CREATE CONSTRAINT IF NOT EXISTS ON (p:Company) ASSERT (p.cusip) IS NODE KEY;
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Unable to retrieve routing information
Transaction failed and will be retried in 1.1374500022472662s (Unable to retrieve routing information)
Unable to retrieve routing information


ServiceUnavailable: Unable to retrieve routing information

## Working with Neo4j
You'll need to enter the credentials from your Neo4j instance below.

In [11]:
# Edit these variables!
DB_URL = 'neo4j://ec2-54-152-83-136.compute-1.amazonaws.com:7687'
DB_PASS = 'foo123'

# You can leave this defaults
DB_USER = 'neo4j'

In [12]:
from graphdatascience import GraphDataScience
gds = GraphDataScience(DB_URL, auth=(DB_USER, DB_PASS))

In [13]:
result = gds.run_cypher(
    """
        CREATE CONSTRAINT IF NOT EXISTS ON (p:Company) ASSERT (p.cusip) IS NODE KEY;
        CREATE CONSTRAINT IF NOT EXISTS ON (p:Manager) ASSERT (p.filingManager) IS NODE KEY;
        CREATE CONSTRAINT IF NOT EXISTS ON (p:Holding) ASSERT (p.filingManager, p.cusip, p.reportCalendarOrQuarter) IS NODE KEY;
    """
)
display(result)

Unable to retrieve routing information


ServiceUnavailable: Unable to retrieve routing information

In [None]:
result = gds.run_cypher(
    """
        LOAD CSV WITH HEADERS FROM 'https://storage.googleapis.com/neo4j-datasets/form13/2021.csv' AS row
        MERGE (c:Company {cusip:row.cusip})
        ON CREATE SET
            c.nameOfIssuer=row.nameOfIssuer
    """
)
display(result)

First we're going to create an in memory graph represtation of the data in Neo4j Graph Data Science (GDS).

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
CALL gds.graph.create(
    'mygraph',
    ['Company', 'Manager', 'Holding'],
    {
        OWNS: {orientation: 'UNDIRECTED'},
        PARTOF: {orientation: 'UNDIRECTED'}
    }
)
YIELD
    graphName AS graph,
    relationshipProjection AS readProjection,
    nodeCount AS nodes,
    relationshipCount AS rels
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

Note, if you get an error saying the graph already exists, that's probably because you ran this code before. You can destroy it using this command:

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
      CALL gds.graph.drop('mygraph')
      """
    ).data()
  )

Now, let's list the details of the graph to make sure the projection was created as we want.

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
      CALL gds.graph.list()
      """
    ).data()
  )
  print(result)

Now we can generate an embedding from that graph. This is a new feature we can use in our predictions. We're using FastRP, which is a more full featured and higher performance of Node2Vec. You can learn more about that [here](https://neo4j.com/docs/graph-data-science/current/algorithms/fastrp/).

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        CALL gds.fastRP.mutate('mygraph',{
        embeddingDimension: 16,
        randomSeed: 1,
        mutateProperty:'embedding'
        })
      """
    ).data()
  )
df = pd.DataFrame(result)
display(df)

That creates an embedding for each node type.  However, we only want the embedding on the nodes of type holding.

We're going to take the embedding from our projection and write it to the holding nodes in the underlying database.

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.run(
    """
      CALL gds.graph.writeNodeProperties('mygraph', ['embedding'], ['Holding'])
      YIELD writeMillis
    """
  )
  print(result)

In [None]:
with driver.session(database=DB_NAME) as session:
  result = session.read_transaction(
    lambda tx: tx.run(
      """
        MATCH (n:Holding) RETURN n
      """
    ).data()
  )

Note that this query will take 2-3 minutes to run as it's grabbing nearly half a million nodes along with all their properties and our new embedding.

In [None]:
df = pd.DataFrame([dict(record.get('n')) for record in result])
df

Note that the embedding row is an array. To make this dataset more consumable, we should flatten that out into multiple individual features: embedding_0, embedding_1, ... embedding_n.


In [None]:
embeddings = pd.DataFrame(df['embedding'].values.tolist()).add_prefix("embedding_")
merged = df.drop(columns=['embedding']).merge(embeddings, left_index=True, right_index=True)
merged

Now that we have the data formatted properly, let's split it into a training and a testing set and write those to disk.

In [None]:
df = merged

df['split']=df['reportCalendarOrQuarter']
df['split']=df['split'].replace(['03-31-2021', '06-30-2021', '09-30-2021'], ['TRAIN', 'VALIDATE', 'TEST'])

df = df.drop(columns=['reportCalendarOrQuarter'])

df.to_csv('embedding.csv', index=False)

# Upload to Amazom S3
Now let's create a bucket and upload our data set to it.  Then we'll be able to access the data in SageMaker in the next lab.

In [None]:
filename='embedding.csv'

#to do