# Load 10K Filings

In this notebook, we add 10K filings with embeddings to our graph.

## Setup
First, check to ensure you're using the `neo4j_genai` kernel with the following command. This kernel has the necessary runtime and dependencies for this notebook. If you see a different kernel, try changing the kernel to `neo4j_genai` in the upper right corner of the screen.

In [None]:
import sys
import os
os.path.basename(sys.executable.replace("/bin/python",""))

Next, import the packages we need.

In [None]:
import json
import numpy as np
import os
import re
from string import Template

# load_dotenv is called below to read credentials.env
from dotenv import load_dotenv

# Neo4j
from graphdatascience import GraphDataScience

## Loading 10-K Documents with Embeddings

In [None]:
import pandas as pd
# The filings CSV is read into a DataFrame further down, once the database connection is set up

In [None]:
load_dotenv('credentials.env', override=True)
# The username is neo4j by default
NEO4J_USERNAME = 'neo4j'
# Set NEO4J_URI and NEO4J_PASSWORD in credentials.env to match your instance
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
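
If `credentials.env` is missing or incomplete, `os.getenv` returns `None` and the connection fails later with a less obvious error. A minimal sketch of an early check is shown here; `missing_settings` is a hypothetical helper name, not part of any library.

```python
import os

def missing_settings(names, env=os.environ):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]

# Hypothetical early check: report which settings still need a value
missing = missing_settings(["NEO4J_URI", "NEO4J_PASSWORD"])
if missing:
    print(f"Missing settings: {', '.join(missing)}")
```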

In [None]:
gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

In [None]:
gds.run_cypher('CREATE INDEX company_name IF NOT EXISTS FOR (n:Company) ON (n.companyName)')
gds.run_cypher('CREATE CONSTRAINT unique_document_id IF NOT EXISTS FOR (n:Document) REQUIRE (n.documentId) IS NODE KEY')

In [None]:
edf = pd.read_csv('form10k-docs.csv')
edf
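
Depending on how the CSV was written, the `textEmbedding` column may be read back as strings rather than lists of numbers. The sketch below assumes the embeddings were serialized as JSON-style arrays, which this notebook does not confirm, and uses made-up records in the shape produced by `to_dict(orient='records')`. `parse_embedding` is a hypothetical helper name.

```python
import json

# Made-up records for illustration; real values come from the CSV
toy_records = [
    {"contextId": "doc-1", "textEmbedding": "[0.1, 0.2, 0.3]"},
    {"contextId": "doc-2", "textEmbedding": [0.4, 0.5, 0.6]},
]

def parse_embedding(value):
    """Return a list of floats, decoding a JSON-style string if necessary."""
    if isinstance(value, str):
        value = json.loads(value)
    return [float(x) for x in value]

# Build new records so the originals are left untouched
parsed_records = [
    {**record, "textEmbedding": parse_embedding(record["textEmbedding"])}
    for record in toy_records
]
```

If the column already holds lists of floats, a step like this is unnecessary.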

In [None]:
%%time

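# Helper used by the loop below; defined here because it is not defined earlier in this notebook
def chunks(xs, n=100):
    # Split a list into consecutive batches of at most n items
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

embedding_entries = edf.to_dict(orient='records')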
total = len(embedding_entries)
count = 0
for d in chunks(embedding_entries, 100):
    gds.run_cypher('''
    UNWIND $records AS record
    MATCH(c:Company {companyName:record.companyName})
    MERGE(b:Document {documentId:record.contextId})
    ON CREATE SET b.documentType='FORM_10K_ITEM1', b.seqId = record.seqId, b.textEmbedding = record.textEmbedding, b.text = record.text
    MERGE(c)-[:HAS]->(b)
    RETURN count(b) as cnt
    ''', params = {'records':d})
    count += len(d)
    print(f'loaded {count} of {total}')

## Check Data

In [None]:
# Check node count
gds.run_cypher('MATCH(doc:Document) RETURN count(doc)')

Note that we only pulled 10K docs for a minority of companies. This should be fine for this setting, but in a more rigorous setting you may want to pull more. There are likely two factors contributing to this: 1) We used company names to search EDGAR, which resulted in many misses and duplicates that were discarded. In a more rigorous setting, we would investigate other endpoints and use more parsing to extract EDGAR CIK (Central Index Key) values to exactly match companies when pulling forms. 2) Not all companies in the dataset are obligated to file 10Ks.

In [None]:
# Check count and percentage of companies with 10K docs.  Note it is the minority
gds.run_cypher('''
MATCH(b:Company)
WITH b, count{(b)-[:HAS]->(d:Document)} AS docCount
WITH count(b) AS total, sum(toInteger(docCount > 0)) AS numWithDocs
RETURN total, numWithDocs, round(100*toFloat(numWithDocs)/toFloat(total), 2) AS PercWithDocs
''')

Note that there are duplicate names. For the purposes of these demos, we treat companies with the same name as the same entity, which is a simple form of entity resolution. In a more rigorous setting, we would need to disambiguate with the CUSIP or other EDGAR keys.

In [None]:
# Show duplicates by comparing the total number of companies with the number of distinct company names
gds.run_cypher('''
MATCH(b:Company)
RETURN count(b) AS totalCompanies, count(DISTINCT b.companyName) AS uniqueCompanyNames
''')
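
To see which names are duplicated rather than only how many, the same comparison can be sketched in plain Python. The company names below are made up for illustration. In practice they would come from a query over `companyName`.

```python
from collections import Counter

# Made-up company names for illustration only
company_names = ["ACME CORP", "GLOBEX INC", "ACME CORP", "INITECH LLC"]

# Count how often each name occurs and keep the ones that repeat
name_counts = Counter(company_names)
duplicate_names = {name: n for name, n in name_counts.items() if n > 1}

total_companies = len(company_names)
unique_company_names = len(name_counts)
print(total_companies, unique_company_names, duplicate_names)
```

The gap between the two counts equals the number of nodes that share a name with another node.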