# Load 10K Filings

In this notebook, we add 10K filings with embeddings to our graph.

## Setup
First, confirm that you're using the `neo4j_genai` kernel by running the following command. This kernel has the runtime and dependencies this notebook needs. If you see a different kernel, switch to `neo4j_genai` in the upper right corner of the screen.

In [None]:
import sys
import os
os.path.basename(sys.executable.replace("/bin/python",""))

Now import the needed packages.

In [None]:
import json
import numpy as np
import os
import re
from string import Template
import pandas as pd

# Neo4j
from graphdatascience import GraphDataScience

# Google Cloud
from google.cloud import storage

## Get 10K Documents with Embeddings
You can skip this step if you ran the part 0 notebook to generate the embeddings. Otherwise, the cell below downloads precomputed documents and embeddings for Item 1 of the 10K filings.

In [None]:
# Skip this if you ran part 0

storage_client = storage.Client()
(storage_client
 .bucket('neo4j-datasets')
 .blob('form10k/form10k-doc-embeddings.csv')
 .download_to_filename('form10k-doc-embeddings.csv'))

## Loading 10K Documents with Embeddings into Neo4j

In [None]:
emb_df = pd.read_csv('form10k-doc-embeddings.csv')

In [None]:
# read_csv loads textEmbedding as a string; json.loads converts each value to a list
emb_df['textEmbedding'] = emb_df['textEmbedding'].apply(json.loads)
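
As a minimal sketch of what that conversion does to a single cell, the snippet below parses one string into a list. The sample value is hypothetical and much shorter than a real embedding; it only assumes that the CSV stores each vector as a JSON-style array.

```python
import json

# Hypothetical example of one textEmbedding cell as read from the CSV (a string)
raw_cell = "[0.12, -0.03, 0.5]"

# json.loads turns the JSON array text into a Python list of floats
vector = json.loads(raw_cell)

print(type(raw_cell).__name__, '->', type(vector).__name__, 'of length', len(vector))
```

Neo4j stores a list of floats as a list property, which is why the column should hold lists rather than strings before loading.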

In [None]:
emb_df

Provide your Neo4j credentials. We need the DB connection URL, the username (probably `neo4j`), and your password.

In [None]:
# Database credentials
NEO4J_URI = "<neo4j+s://xxxxx.databases.neo4j.io>"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "<password>"

In [None]:
gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

Remember to create indexes. We will merge 10K documents by `companyName`. In a production setting, we would want a better identifier here (as we did with the CUSIP for Company). However, the company name should suffice for our purposes, since we are just getting acquainted with semantic search.

In [None]:
gds.run_cypher('CREATE INDEX company_name IF NOT EXISTS FOR (n:Company) ON (n.companyName)')
gds.run_cypher('CREATE CONSTRAINT unique_document_id IF NOT EXISTS FOR (n:Document) REQUIRE (n.documentId) IS NODE KEY')

Due to the size of the documents, we transform the dataframe into a list of dicts that we can split into chunks and insert via a parameterized query.

In [None]:
emb_entries = emb_df.to_dict(orient='records')

In [None]:
def chunks(xs, n=5):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]
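
To illustrate how `chunks` splits the records, here is a small self-contained sketch. The function is repeated so the snippet runs on its own, and the seven sample records are hypothetical stand-ins for the real entries.

```python
def chunks(xs, n=5):
    # Guard against n <= 0, which would make range() fail or loop incorrectly
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

# Hypothetical records standing in for emb_entries
records = [{'seqId': i} for i in range(7)]

batches = chunks(records, 3)
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # batch_sizes is [3, 3, 1]
```

The last batch holds whatever is left over, so no record is dropped when the total is not a multiple of the batch size.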

In [None]:
%%time

total = len(emb_entries)
count = 0
for d in chunks(emb_entries, 100):
    gds.run_cypher('''
    UNWIND $records AS record
    MATCH(c:Company {companyName:record.companyName})
    MERGE(b:Document {documentId:record.contextId})
    SET b.documentType='FORM_10K_ITEM1', b.seqId = record.seqId, b.textEmbedding = record.textEmbedding, b.text = record.text
    MERGE(c)-[:HAS]->(b)
    RETURN count(b) as cnt
    ''', params = {'records':d})
    count += len(d)
    print(f'loaded {count} of {total}')

## Check Data

In [None]:
# Check node count
gds.run_cypher('MATCH(doc:Document) RETURN count(doc)')

Note that we only got 10K docs for a minority of companies. That is fine for this exercise, but in a more rigorous setting you may want to pull more. A few factors likely contribute to this:

1. We searched EDGAR by company name, which produced many misses and duplicates that were discarded. In a more rigorous setting, we would investigate other endpoints and do more parsing to extract EDGAR CIK keys, so that forms are pulled for exactly matching companies.

2. Company names are not consistent across Form 13 filings, so even if we successfully pull a 10K under one version of a company name, we may not be able to merge it into the graph, because the graph may store a different version of that name.

3. Not all companies in the dataset are obligated to file 10Ks. 
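
One way to reduce the name mismatches described in the second point is to normalize names before matching. The following is only a rough sketch of that idea and is not what this notebook does: `normalize_name` and the suffix list are hypothetical, and real filings would likely need more rules.

```python
import re

# Hypothetical list of legal suffixes to ignore when comparing names
LEGAL_SUFFIXES = {'INC', 'CORP', 'CORPORATION', 'CO', 'LTD', 'LLC', 'PLC'}

def normalize_name(name):
    # Uppercase, then replace anything that is not a letter, digit, or space
    cleaned = re.sub(r'[^A-Z0-9 ]', ' ', name.upper())
    tokens = cleaned.split()
    # Drop trailing legal suffixes such as INC or CORP
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return ' '.join(tokens)

variants = ['Apple Inc.', 'APPLE INC', 'apple, inc']
normalized = {normalize_name(v) for v in variants}
print(normalized)
```

Normalization like this trades precision for recall: it merges more spelling variants, but it can also collapse genuinely different companies, which is why exact keys such as CIK remain the more rigorous option.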

In [None]:
# Check count and percentage of companies with 10K docs.  Note it is the minority
gds.run_cypher('''
MATCH(b:Company)
WITH b, count{(b)-[:HAS]->(d:Document)} AS docCount
WITH count(b) AS total, sum(toInteger(docCount > 0)) AS numWithDocs
RETURN total, numWithDocs, round(100*toFloat(numWithDocs)/toFloat(total), 2) As PercWithDocs
''')
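
For readers less familiar with Cypher aggregation, the same calculation can be sketched in plain Python. The per-company document counts below are hypothetical and are not results from the graph.

```python
# Hypothetical number of 10K documents attached to each company
doc_counts = {'Company A': 3, 'Company B': 0, 'Company C': 0, 'Company D': 1}

total = len(doc_counts)
# Count companies with at least one document, like sum(toInteger(docCount > 0))
num_with_docs = sum(1 for n in doc_counts.values() if n > 0)
perc_with_docs = round(100 * num_with_docs / total, 2)

print(total, num_with_docs, perc_with_docs)
```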

Note that there are duplicate names. For our purposes here, we treat this as a simple form of entity resolution: companies with the same name are assumed to belong to the same overarching entity for semantic search. In a more rigorous setting, we would need to disambiguate with the CUSIP or other EDGAR keys.

In [None]:
# Compare the total number of Company nodes with the number of distinct company names
gds.run_cypher('''
MATCH(b:Company)
RETURN count(b) AS totalCompanies, count(DISTINCT b.companyName) AS uniqueCompanyNames
''')
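
If you want to see which names are duplicated rather than only the two totals, a `Counter` over the names is one option. This is a minimal sketch with hypothetical names; in practice the list would come from a query that returns `companyName` for every Company node.

```python
from collections import Counter

# Hypothetical company names, one entry per Company node
names = ['ACME CORP', 'ACME CORP', 'GLOBEX INC', 'INITECH LLC']

name_counts = Counter(names)
# Keep only names that appear on more than one node
duplicates = {name: n for name, n in name_counts.items() if n > 1}

total_companies = len(names)
unique_company_names = len(name_counts)
print(total_companies, unique_company_names, duplicates)
```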