# Connect Form 10k Chunk Nodes

Each `Chunk` node holds one chunk of text from a Form 10k document.

In Cypher, each chunk looks like this:
```cypher
(:Chunk {
  chunkId: string,
  chunkSeqId: int,
  cik: int,
  cusip6: string,
  f10kItem: string,
  source: string,
  text: string,
  textEmbedding: [floats]
})
```

We'll now create a complete source context around the chunks:

1. connect `(:Chunk)-[:NEXT]->(:Chunk)` for each chunk with the same `source` and `f10kItem`
   - the `source` is unique to each 10k document
   - the `f10kItem` is the name of the form section (e.g. `item1` for "Item 1. Business")
2. create a `(:Form)` for each `source`
3. connect each `(:Chunk)-[:PART_OF]->(:Form)`
4. connect each `(:Form)-[:SECTION]->(first:Chunk)` for the first chunk of each section
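
The four steps above can be sketched in plain Python before touching the database. This is only an illustration on in-memory dictionaries: the function name `plan_relationships` and the sample chunk values are hypothetical, and the real work in this notebook is done with Cypher.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def plan_relationships(chunks: List[dict]) -> Dict[str, list]:
    """Derive the forms and the NEXT, PART_OF and SECTION pairs for chunk dicts."""
    # Group chunks by (source, f10kItem): one group per form section.
    sections: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for chunk in chunks:
        sections[(chunk["source"], chunk["f10kItem"])].append(chunk)

    plan: Dict[str, list] = {
        # Step 2: one form per distinct source.
        "FORMS": sorted({chunk["source"] for chunk in chunks}),
        "NEXT": [],
        "PART_OF": [],
        "SECTION": [],
    }
    for (source, item), members in sections.items():
        ordered = sorted(members, key=lambda c: c["chunkSeqId"])
        ids = [c["chunkId"] for c in ordered]
        # Step 1: link consecutive chunks of the same section.
        plan["NEXT"].extend(zip(ids, ids[1:]))
        # Step 3: every chunk belongs to the form identified by its source.
        plan["PART_OF"].extend((chunk_id, source) for chunk_id in ids)
        # Step 4: the form points at the first chunk of each section.
        plan["SECTION"].append((source, item, ids[0]))
    return plan


# Hypothetical sample data, deliberately out of order.
example_chunks = [
    {"chunkId": "a-item1-1", "source": "form-a", "f10kItem": "item1", "chunkSeqId": 1},
    {"chunkId": "a-item1-0", "source": "form-a", "f10kItem": "item1", "chunkSeqId": 0},
    {"chunkId": "a-item1a-0", "source": "form-a", "f10kItem": "item1a", "chunkSeqId": 0},
]
plan = plan_relationships(example_chunks)
print(plan["NEXT"])
print(plan["SECTION"])
```

Sorting by `chunkSeqId` inside each group is what makes the `NEXT` pairs follow reading order, which is the same role `ORDER BY` plays in the Cypher queries.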

## Imports

In [32]:
from dotenv import load_dotenv
import os

# Common data processing
import json
from pandas import DataFrame
import pandas as pd
from typing import List, Tuple, Union
from numpy.typing import ArrayLike
from progress.bar import Bar

# Langchain
from langchain.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_community.chat_models import ChatOpenAI

## Set up Neo4j

In [33]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'


In [34]:
# Create a knowledge graph using Langchain's Neo4j integration.
# This will be used for direct querying of the knowledge graph. 
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

# OpenAI for creating embeddings
embeddings_model = OpenAIEmbeddings()

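# Create a vector store for the Neo4j knowledge graph.
# This will be used for vector similarity queries.
# Option 1: attach to a vector index that already exists.
vector_store = Neo4jVector.from_existing_index(
    embeddings_model,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
)
# Option 2: build the store from the existing `Chunk` nodes.
# This assignment replaces the one above, so the retriever and
# chain below use this store.
vector_store = Neo4jVector.from_existing_graph(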
    embedding=embeddings_model,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)
retriever = vector_store.as_retriever()
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever
)


## Cypher Queries to transform the Knowledge Graph

In [35]:
# Find all distinct sources for the text chunks in the knowledge graph...

# MATCH a single node pattern 
# RETURN just the `source` property from each node
cypher = """
  MATCH (all:Chunk) 
  RETURN DISTINCT all.source as source LIMIT 10
"""

kg.query(cypher)

[{'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/6955/000000695523000034/0000006955-23-000034-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/22444/000002244423000126/0000022444-23-000126-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/804328/000080432823000055/0000804328-23-000055-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1005731/000149315223037384/0001493152-23-037384-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1041803/000104180323000054/0001041803-23-000054-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1616533/000162828023034807/0001628280-23-034807-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1084869/000143774923025967/0001437749-23-025967-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1477246/000095017023050149/0000950170-23-

In [36]:
# Gather all the chunks from a single source section...

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# RETURN those chunks, ordered by their `chunkSeqId` property
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section { .source, .f10kItem, .chunkSeqId} AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  RETURN section_chunks
"""

kg.query(cypher)

[{'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 0}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 1}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 2}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 3}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-23-000089-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 4}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1528396/000152839623000089/0001528396-

In [37]:
# Recreate the section text from the chunks...

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# RETURN text from those chunks, concatenated into a single string using `reduce()`
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section { .source, .f10kItem, .chunkSeqId, .text} AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  RETURN reduce(section = "", chunk in collect(section_chunks) | section + chunk.text) as section
"""

kg.query(cypher)

[{'section': ">Item 1.\nBusiness\nOverview and Purpose\nGuidewire delivers a leading platform that property and casualty (“P&C”) insurers trust to engage, innovate, and grow efficiently. Guidewire’s platform combines core operations, digital engagement, analytics, machine learning, and artificial intelligence (“AI”) applications delivered as a cloud service or self-managed software. We began our principal operations in 2001.\nOur core operational services and products are InsuranceSuite Cloud, InsuranceNow, and InsuranceSuite for self-managed installations. These services and products are transactional systems of record that support the entire insurance lifecycle, including insurance product definition, distribution, underwriting, policyholder services, and claims management. Our digital engagement applications enable digital sales, omni-channel service, and enhanced claims experiences for policyholders, agents, vendor partners, and field personnel. Our analytics offerings enable insur
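
For readers less familiar with Cypher's `reduce()`, the concatenation above behaves like Python's `functools.reduce` over the ordered chunks. The sketch below uses hypothetical sample chunks and a hypothetical function name, `rebuild_section`; note that if the splitter produced overlapping chunks, the overlapping text would be repeated in the result.

```python
from functools import reduce
from typing import List


def rebuild_section(chunks: List[dict]) -> str:
    """Concatenate chunk texts in `chunkSeqId` order, like the Cypher reduce()."""
    ordered = sorted(chunks, key=lambda chunk: chunk["chunkSeqId"])
    # Start from an empty string and append each chunk's text in turn.
    return reduce(lambda section, chunk: section + chunk["text"], ordered, "")


# Hypothetical two-chunk section, listed out of order on purpose.
sample_section = [
    {"chunkSeqId": 1, "text": "Business\n"},
    {"chunkSeqId": 0, "text": ">Item 1.\n"},
]
rebuilt = rebuild_section(sample_section)
print(repr(rebuilt))
```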

In [38]:
# Connect section chunks into a linked list...

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# WITH those chunks collected together
# CALL apoc.nodes.link() to create a linked list
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  WITH collect(section_chunks) as section
  CALL apoc.nodes.link(section, "NEXT", {avoidDuplicates: true}) 
"""

kg.query(cypher)


[]
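
The cell above links only the `item1` section of the first source. A minimal sketch of how the same linking could be repeated for every `(source, f10kItem)` pair follows. The function name `link_all_sections` and the two query constants are hypothetical, and `kg` is assumed to be any object with a `query(cypher, params=...)` method, such as the `Neo4jGraph` created earlier.

```python
# Hypothetical helper: list every (source, f10kItem) pair present in the graph.
DISTINCT_SECTIONS_QUERY = """
  MATCH (c:Chunk)
  RETURN DISTINCT c.source AS source, c.f10kItem AS f10kItem
"""

# Same linking logic as the cell above, with the source and section
# passed as query parameters instead of being hard-coded.
LINK_SECTION_QUERY = """
  MATCH (c:Chunk)
  WHERE c.source = $source
    AND c.f10kItem = $f10kItem
  WITH c
    ORDER BY c.chunkSeqId ASC
  WITH collect(c) AS section
  CALL apoc.nodes.link(section, "NEXT", {avoidDuplicates: true})
  RETURN size(section) AS linkedChunks
"""


def link_all_sections(kg) -> int:
    """Link the chunks of every section; return the number of chunks visited."""
    total = 0
    for pair in kg.query(DISTINCT_SECTIONS_QUERY):
        result = kg.query(
            LINK_SECTION_QUERY,
            params={"source": pair["source"], "f10kItem": pair["f10kItem"]},
        )
        total += result[0]["linkedChunks"] if result else 0
    return total
```

Calling `link_all_sections(kg)` issues one linking query per section. Because `avoidDuplicates` is set, running it again should not create duplicate `NEXT` relationships.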

In [39]:
# Create a parent `Form` node for the chunks of a section...

# MATCH the same single node pattern
# keep only the first distinct source (WITH ... LIMIT 1)
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property
# and are part of the same section named by the `f10kItem` property
# WITH those chunks collected together
# MERGE a `Form` node keyed by `source`,
# copying `formId`, `cik` and `cusip6` from the first chunk
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  WITH source, collect(section_chunks) as section
  MERGE (parent:Form {source: source})
    ON CREATE SET
      parent.formId = section[0].formId,
      parent.cik = section[0].cik,
      parent.cusip6 = section[0].cusip6
"""

kg.query(cypher)

[]
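
The cell above creates a `Form` only for the first source (`LIMIT 1`), while the plan calls for one `Form` per `source`. A sketch of a query that could cover every source in one pass is shown here. The names `CREATE_FORMS_QUERY` and `create_all_forms` are hypothetical, `kg` is assumed to expose `query(cypher)` like the `Neo4jGraph` above, and only `cik` and `cusip6` are copied because those are the form-level properties listed in the chunk schema.

```python
# Hypothetical query: one Form per distinct chunk source.
CREATE_FORMS_QUERY = """
  MATCH (c:Chunk)
  WITH c.source AS source, head(collect(c)) AS sample
  MERGE (f:Form {source: source})
    ON CREATE SET f.cik = sample.cik, f.cusip6 = sample.cusip6
  RETURN count(f) AS formCount
"""


def create_all_forms(kg) -> int:
    """Merge a Form node for each distinct source; return how many were matched or created."""
    result = kg.query(CREATE_FORMS_QUERY)
    return result[0]["formCount"] if result else 0
```

Since the query uses `MERGE`, calling `create_all_forms(kg)` more than once should leave existing `Form` nodes in place rather than duplicating them.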

In [42]:
# Connect all section chunks to their parent `Form` node...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `source` property
# (`source` is the key the `Form` nodes were merged on; this works much like a JOIN in SQL)
# connect the pairs with a (:Chunk)-[:PART_OF]->(:Form) relationship
cypher = """
  MATCH (section:Chunk), (parent:Form)
  WHERE section.source = parent.source
  MERGE (section)-[:PART_OF]->(parent)
"""

kg.query(cypher)

[]

In [43]:
# Connect all parent `Form` nodes to the "head" of each section linked list...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `source` property
# and the chunk is the first of its section (`chunkSeqId` = 0)
# connect the pairs with a (:Form)-[:SECTION]->(:Chunk) relationship
cypher = """
  MATCH (section:Chunk), (parent:Form)
  WHERE section.source = parent.source
    AND section.chunkSeqId = 0
  MERGE (parent)-[:SECTION {f10kItem: section.f10kItem}]->(section)
"""

kg.query(cypher)

[]

## Neo4j Utility Functions

In [133]:
def neo4j_vector_search(kg: Neo4jGraph, embeddings_model: OpenAIEmbeddings,
                        index_name: str, query: str, top_k: int = 10) -> List:
  """Search for similar nodes using the Neo4j vector index"""
  # Embed the question text
  embedded_query = embeddings_model.embed_query(query)
  # Plain string (not an f-string): all values are passed as query parameters
  vector_search = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $embedding) YIELD node, score
    RETURN node.text AS result
  """
  similar = kg.query(vector_search, params={'embedding': embedded_query, 'index_name': index_name, 'top_k': top_k})
  return similar

## Prepare Neo4j indexes

In [40]:
# Create a uniqueness constraint on the formId property of Form nodes
kg.query('CREATE CONSTRAINT unique_form IF NOT EXISTS FOR (n:Form) REQUIRE n.formId IS UNIQUE')


[]

## Load Form 10k documents

1. iterate through all the files in the directory
2. batch load sets of the files
3. for each file, load the content and split the text into chunks
4. for each chunk, create a graph Node that includes metadata and the chunk text
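
The loading cell below calls a `batches` helper that this notebook does not define. A minimal sketch of what such a helper might look like is given here; the signature is inferred from the two call sites (`batches(all_file_names, 20)` and `batches(records, n=100)`) and is an assumption, not the original implementation.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batches(items: Sequence[T], n: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `n` elements."""
    if n < 1:
        raise ValueError("batch size n must be at least 1")
    for start in range(0, len(items), n):
        yield list(items[start:start + n])


# Five items in batches of two: the last batch is shorter.
example_batches = list(batches([0, 1, 2, 3, 4], 2))
print(example_batches)
```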

In [114]:
%%time

all_file_names = ['../source-data-pull/form10k/data/form10k-clean/' + x for x in os.listdir('../source-data-pull/form10k/data/form10k-clean/')]
counter = 0
for file_names in batches(all_file_names, 20):
    counter += len(file_names)
    print(f'=== Processing {counter-len(file_names)}:{counter} of {len(all_file_names)} ===')
    # get and split text data
    print('Loading and splitting Text Files...')
    doc_df = get_and_split_txt_data(file_names)
    # perform text embedding
    print('Performing Text Embedding...')
    add_text_embeddings(doc_df)
    #load nodes
    print('Loading Nodes...')
    load_nodes(kg, doc_df.drop(columns='textEmbedding'), 'chunkId', 'Chunk')
    print(f'Done Processing {counter-len(file_names)}:{counter}')

    # Merge text embeddings using set vector property
    records = doc_df[['chunkId', 'textEmbedding']].to_dict('records')
    print(f'======  loading Document text embeddings ======')
    total = len(records)
    print(f'staging {total:,} records')
    cumulative_count = 0
    for recs in batches(records, n=100):
        res = kg.query('''
        UNWIND $recs AS rec
        MATCH(n:Chunk {chunkId: rec.chunkId})
        CALL db.create.setNodeVectorProperty(n, "textEmbedding", rec.textEmbedding)
        RETURN count(n) AS propertySetCount
        ''', params={'recs': recs})
        cumulative_count += res[0].get('propertySetCount')
        print(f'Set {cumulative_count:,} of {total:,} text embeddings')

=== Processing 0:20 of 79 ===
Loading and splitting Text Files...
Performing Text Embedding...
Loading Nodes...
staging 2,485 records

Using This Cypher Query:
```
UNWIND $recs AS rec
MERGE(n:Chunk {chunkId: rec.chunkId})
SET n.cik = rec.cik, n.cusip6 = rec.cusip6, n.source = rec.source, n.f10kItem = rec.f10kItem, n.chunkSeqId = rec.chunkSeqId, n.text = rec.text
RETURN count(n) AS nodeLoadedCount
```

Done Processing 0:20
staging 2,485 records
Set 100 of 2,485 text embeddings
Set 200 of 2,485 text embeddings
Set 300 of 2,485 text embeddings
Set 400 of 2,485 text embeddings
Set 500 of 2,485 text embeddings
Set 600 of 2,485 text embeddings
Set 700 of 2,485 text embeddings
Set 800 of 2,485 text embeddings
Set 900 of 2,485 text embeddings
Set 1,000 of 2,485 text embeddings
Set 1,100 of 2,485 text embeddings
Set 1,200 of 2,485 text embeddings
Set 1,300 of 2,485 text embeddings
Set 1,400 of 2,485 text embeddings
Set 1,500 of 2,485 text embeddings
Set 1,600 of 2,485 text embeddings
Set 1,700 
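
The Cypher printed in the output above sets one property per DataFrame column. The `load_nodes` helper itself is not shown in this notebook, so the following is only a sketch of how a query of that shape might be assembled; `build_merge_query` is a hypothetical name. Labels and property names cannot be passed as Cypher parameters, so they are interpolated into the string and should come from trusted input only.

```python
from typing import Sequence


def build_merge_query(label: str, key: str, properties: Sequence[str]) -> str:
    """Build an UNWIND/MERGE query that upserts nodes from a `$recs` parameter."""
    # The key property is matched in the MERGE pattern, so it is not SET again.
    assignments = ", ".join(
        f"n.{prop} = rec.{prop}" for prop in properties if prop != key
    )
    if not assignments:
        raise ValueError("at least one non-key property is required")
    return (
        "UNWIND $recs AS rec\n"
        f"MERGE(n:{label} {{{key}: rec.{key}}})\n"
        f"SET {assignments}\n"
        "RETURN count(n) AS nodeLoadedCount"
    )


example_query = build_merge_query("Chunk", "chunkId", ["chunkId", "cik", "text"])
print(example_query)
```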

## Example queries

In [124]:
question = 'Who makes hydraulic and mechanical tools?'

In [139]:
# Vector search using our utility function
neo4j_vector_search(kg, embeddings_model, VECTOR_INDEX_NAME, question, top_k=3)

[{'result': "The Company's critical raw material is steel. Out of Brazil, the Company sources three basic types of steel which are carbon steel, high speed steel and carbide cylinders. The Company has a number of long-term suppliers in Europe, Asia and Brazil and its sourcing mix is distributed according to the pricing including exchange rates. The U.S. sources steel, and small amounts of aluminum and brass through distributors. None of these suppliers accounts for more than 5% of the Company's purchases\nFor over 140 years, the Company has been a recognized leader in providing measurement and cutting solutions to industry. Measurement tools consist of precision instruments such as micrometers, vernier calipers, height distributors, depth gages, electronic gages, dial indicators, steel rules, combination squares, custom, non-contact gaging such as vision, optical and laser measurement systems. The Company believes advanced, non-contact systems with easy-to use software will be attracti

In [140]:
# Vector search using the langchain vector store
docs_with_score = vector_store.similarity_search_with_score(question, k=3)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.8979572653770447

text: The Company's critical raw material is steel. Out of Brazil, the Company sources three basic types of steel which are carbon steel, high speed steel and carbide cylinders. The Company has a number of long-term suppliers in Europe, Asia and Brazil and its sourcing mix is distributed according to the pricing including exchange rates. The U.S. sources steel, and small amounts of aluminum and brass through distributors. None of these suppliers accounts for more than 5% of the Company's purchases
For over 140 years, the Company has been a recognized leader in providing measurement and cutting solutions to industry. Measurement tools consist of precision instruments such as micrometers, vernier calipers, height distributors, depth gages, electronic gages, dial indicators, steel rules, combination squares, custom, non-contact gaging such as vision, optical and laser measurement s
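
To help read the `Score` value, the sketch below assumes the vector index was created with cosine similarity, which this notebook does not state. Under that assumption, Neo4j reports a score normalized to the range 0 to 1, computed as `(1 + cosine) / 2`, so a score near 0.90 would correspond to a raw cosine similarity of roughly 0.80. The function name `normalized_cosine_score` is hypothetical.

```python
import math
from typing import Sequence


def normalized_cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Map the cosine similarity of two vectors from [-1, 1] onto [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    cosine = dot / (norm_a * norm_b)
    return (1 + cosine) / 2


same = normalized_cosine_score([1.0, 0.0], [1.0, 0.0])
orthogonal = normalized_cosine_score([1.0, 0.0], [0.0, 1.0])
opposite = normalized_cosine_score([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```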

In [141]:
# Vector search using the langchain retriever over the Neo4j vector store
retriever.get_relevant_documents(question)[0]

Document(page_content="\ntext: The Company's critical raw material is steel. Out of Brazil, the Company sources three basic types of steel which are carbon steel, high speed steel and carbide cylinders. The Company has a number of long-term suppliers in Europe, Asia and Brazil and its sourcing mix is distributed according to the pricing including exchange rates. The U.S. sources steel, and small amounts of aluminum and brass through distributors. None of these suppliers accounts for more than 5% of the Company's purchases\nFor over 140 years, the Company has been a recognized leader in providing measurement and cutting solutions to industry. Measurement tools consist of precision instruments such as micrometers, vernier calipers, height distributors, depth gages, electronic gages, dial indicators, steel rules, combination squares, custom, non-contact gaging such as vision, optical and laser measurement systems. The Company believes advanced, non-contact systems with easy-to use softwar

In [142]:
chain(
    {"question": question},
    return_only_outputs=True,
)

{'answer': 'Enerpac Tool Group Corp. is the company that makes hydraulic and mechanical tools.\n',
 'sources': 'https://www.sec.gov/Archives/edgar/data/6955/000000695523000034/0000006955-23-000034-index.htm'}