# Connect Form 10k Chunk Nodes

The 10k Chunk Nodes are individual chunks of text from the 10k documents. 

In Cypher, each chunk looks like this:
```cypher
(:Chunk {
  chunkId: string,
  chunkSeqId: int,
  cik: int,
  cusip6: string,
  f10kItem: string,
  formId: string,
  source: string,
  text: string,
  textEmbedding: [floats]
})
```

You will now create a complete source context around the chunks:

1. connect `(:Chunk)-[:NEXT]->(:Chunk)` for each chunk with the same `source` and `f10kItem`
   - the `source` is unique to each 10k document
   - the `f10kItem` identifies the form section (e.g. `item1`, whose text begins "ITEM 1. BUSINESS")
2. create a `(:Form)` for each `source`
3. connect each `(:Chunk)-[:PART_OF]->(:Form)`
4. connect each `(:Form)-[:SECTION]->(first:Chunk)` for the first chunk of each section
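
The four steps above can be sketched in plain Python before running any Cypher. This is only an illustration under assumptions: the `chunks` list, its sample values, and the `build_context` helper are hypothetical, and the real chunks live in Neo4j rather than in dictionaries.

```python
from collections import defaultdict

# Hypothetical in-memory chunks, carrying only the properties needed for linking.
chunks = [
    {"chunkId": "a-item1-chunk0001", "source": "a", "f10kItem": "item1", "chunkSeqId": 1},
    {"chunkId": "a-item1-chunk0000", "source": "a", "f10kItem": "item1", "chunkSeqId": 0},
    {"chunkId": "a-item1a-chunk0000", "source": "a", "f10kItem": "item1a", "chunkSeqId": 0},
    {"chunkId": "b-item1-chunk0000", "source": "b", "f10kItem": "item1", "chunkSeqId": 0},
]


def build_context(chunks):
    """Return (forms, next_edges, part_of_edges, section_edges) for the chunks."""
    # Group chunks by (source, f10kItem): one group per section of each form.
    sections = defaultdict(list)
    for chunk in chunks:
        sections[(chunk["source"], chunk["f10kItem"])].append(chunk)

    forms = sorted({chunk["source"] for chunk in chunks})       # step 2: one Form per source
    part_of_edges = [(c["chunkId"], c["source"]) for c in chunks]  # step 3: Chunk -> Form

    next_edges, section_edges = [], []
    for (source, item), members in sorted(sections.items()):
        ordered = sorted(members, key=lambda c: c["chunkSeqId"])
        ids = [c["chunkId"] for c in ordered]
        next_edges += list(zip(ids, ids[1:]))                    # step 1: consecutive NEXT pairs
        section_edges.append((source, item, ids[0]))             # step 4: Form -> first chunk
    return forms, next_edges, part_of_edges, section_edges


forms, next_edges, part_of_edges, section_edges = build_context(chunks)
print(next_edges)
print(section_edges)
```

Grouping by both `source` and `f10kItem` is what keeps a `NEXT` chain from running across section or document boundaries.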

## Imports

In [31]:
from dotenv import load_dotenv
import os

# Common data processing
import json
from pandas import DataFrame
import pandas as pd
from typing import List, Tuple, Union
from numpy.typing import ArrayLike
from progress.bar import Bar

# Langchain
from langchain.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_community.chat_models import ChatOpenAI

## Set up Neo4j

In [32]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'


## Neo4j Utility Functions

In [43]:
def neo4j_vector_search(kg: Neo4jGraph, embeddings_model: OpenAIEmbeddings,
                        index_name: str, query: str, top_k: int = 10) -> List:
  """Search for similar nodes using the Neo4j vector index"""
  embedded_query = embeddings_model.embed_query(query)
  vector_search = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $embedding) YIELD node, score
    RETURN node.text AS result
  """
  similar = kg.query(vector_search, params={'embedding': embedded_query, 'index_name': index_name, 'top_k': top_k})
  return similar
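
Conceptually, a vector index query ranks stored embeddings by similarity to the query embedding and returns the best `top_k`. The sketch below shows that ranking in plain Python. It assumes cosine similarity (the actual similarity function depends on how the index was created) and uses a brute-force scan in place of the index. `cosine_similarity`, `top_k_similar`, and the toy vectors are illustrative names and data, not part of Neo4j or LangChain.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query: List[float], stored: Dict[str, List[float]],
                  top_k: int = 10) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the top_k best."""
    scored = [(text, cosine_similarity(query, vector)) for text, vector in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy two-dimensional "embeddings" keyed by chunk text (made-up data).
stored = {
    "chunk about GPUs": [1.0, 0.0],
    "chunk about risk": [0.0, 1.0],
    "chunk about chips": [0.8, 0.6],
}
results = top_k_similar([1.0, 0.1], stored, top_k=2)
print([text for text, _ in results])
```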

## Prepare Neo4j indexes

In [61]:
# Create a knowledge graph using Langchain's Neo4j integration.
# This will be used for direct querying of the knowledge graph. 
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

In [62]:
# Create a uniqueness constraint on the formId property of Form nodes
kg.query('CREATE CONSTRAINT unique_form IF NOT EXISTS FOR (n:Form) REQUIRE n.formId IS UNIQUE')


[]

## Cypher Queries to transform the Knowledge Graph

In [35]:
# Find all distinct sources for the text chunks in the knowledge graph...

# MATCH a single node pattern 
# RETURN just the `source` property from each node
cypher = """
  MATCH (all:Chunk) 
  RETURN DISTINCT all.source as source LIMIT 10
"""

kg.query(cypher)

[{'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1696558/000121390023052460/0001213900-23-052460-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1596993/000159699323000033/0001596993-23-000033-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/108385/000010838523000022/0000108385-23-000022-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/14693/000001469323000074/0000014693-23-000074-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/704562/000168316823004329/0001683168-23-004329-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/709283/000070928323000013/0000709283-23-000013-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/69733/000143774923016924/0001437749-23-016924-index.htm'},
 {'source': 'https://www.sec.gov/Archives/edgar/data/1000045/000095017023030037/0000950170-23-0

In [64]:
# Gather all the chunks from a single source section...

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# RETURN those chunks, ordered by their `chunkSeqId` property
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section { .f10kItem, .chunkSeqId, .source} AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  RETURN section_chunks
"""

kg.query(cypher)

[{'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 0}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 1}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 2}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 3}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-23-022974-index.htm',
   'f10kItem': 'item1',
   'chunkSeqId': 4}},
 {'section_chunks': {'source': 'https://www.sec.gov/Archives/edgar/data/1630113/000149315223022974/0001493152-

In [68]:
# Recreate the section text from the chunks...

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# RETURN text from those chunks, concatenated into a single string using `reduce()`
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section.chunkSeqId AS chunk_sequence, same_section.text AS sequenced_text
      ORDER BY chunk_sequence ASC
  }
  RETURN reduce(section = "", chunk_text in collect(sequenced_text) | section + chunk_text) as section
"""

kg.query(cypher)

[{'section': '>ITEM 1. BUSINESS\n\n\n\xa0\n\n\nSummary\n\n\n\xa0\n\n\nBiotricity Inc. (the\n“Company”, “Biotricity”, “we”, “us”, “our”) is a leading-edge medical\ntechnology company focused on biometric data monitoring and diagnostic solutions. We deliver innovative, remote monitoring solutions\nto the medical, healthcare, and consumer markets, with a focus on diagnostic and post-diagnostic solutions for lifestyle and chronic\nillnesses. We approach the diagnostic side of remote patient monitoring by applying innovation within existing business models where\nreimbursement is established. We believe this approach reduces the risk associated with traditional medical device development and\naccelerates the path to revenue. In post-diagnostic markets, we intend to apply medical grade biometrics to enable consumers to\nself-manage, thereby driving patient compliance and reducing healthcare costs. Our initial focus was on the diagnostic mobile\ncardiac outpatient monitoring (COM) market. Sin
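
Cypher's `reduce()` folds a list into a single value, much like `functools.reduce` in Python. Here is a minimal sketch of the same concatenation, using made-up rows (the `rows` data is hypothetical and stands in for the subquery results):

```python
from functools import reduce

# Hypothetical subquery rows, deliberately out of order.
rows = [
    {"chunk_sequence": 2, "sequenced_text": "gamma"},
    {"chunk_sequence": 0, "sequenced_text": "alpha "},
    {"chunk_sequence": 1, "sequenced_text": "beta "},
]

# Order by sequence number first, as the Cypher subquery does with ORDER BY.
ordered = sorted(rows, key=lambda row: row["chunk_sequence"])

# Fold the texts into one string, starting from the empty string.
section = reduce(lambda acc, row: acc + row["sequenced_text"], ordered, "")
print(section)
```

In Python, `"".join(...)` would be the more idiomatic way to build the string; `reduce` is used here only to mirror the Cypher expression.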

In [8]:
# Connect section chunks into a linked list..

# MATCH the same single node pattern 
# RETURN only the first source
# WITH that source, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# WITH those chunks collected together
# CALL apoc.nodes.link() to create a linked list
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source LIMIT 1
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  WITH collect(section_chunks) as section
  CALL apoc.nodes.link(section, "NEXT", {avoidDuplicates: true}) 
"""

kg.query(cypher)


[]

In [39]:
# Connect all section chunks into a linked list..

# MATCH the same single node pattern 
# WITH all DISTINCT sources, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# WITH those chunks collected together
# CALL apoc.nodes.link() to create a linked list
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  WITH source, collect(section_chunks) as section  // group by source so a list never spans two forms
  CALL apoc.nodes.link(section, "NEXT", {avoidDuplicates: true}) 
"""

kg.query(cypher)


[]

In [40]:
# Create a parent `Form` node for each source...

# MATCH the same single node pattern 
# WITH all DISTINCT sources, CALL a subquery
# within the subquery, MATCH a new single node pattern
# WHERE the chunks have the same `source` property 
# and are part of the same section named by the `f10kItem` property
# WITH the source and those chunks collected together
# MERGE a `Form` node for the source, copying `formId`, `cik`
# and `cusip6` from the first chunk when the node is created
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all.source as source
  CALL {
      WITH source
      MATCH (same_section:Chunk)
      WHERE same_section.source = source
        AND same_section.f10kItem = "item1"
      RETURN same_section AS section_chunks
      ORDER BY same_section.chunkSeqId ASC
  }
  WITH source, collect(section_chunks) as section
  MERGE (parent:Form {source: source })
    ON CREATE SET
      parent.formId = section[0].formId,
      parent.cik = section[0].cik,
      parent.cusip6 = section[0].cusip6
"""

kg.query(cypher)

[]

In [41]:
# Connect all section chunks to their parent `Form` node...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# (this is exactly like a JOIN in SQL)
# connect the pairs with a (:Chunk)-[:PART_OF]->(:Form) relationship
cypher = """
  MATCH (section:Chunk), (parent:Form)
  WHERE section.formId = parent.formId
  MERGE (section)-[:PART_OF]->(parent)
"""

kg.query(cypher)

[]
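
The JOIN analogy can be made concrete with a small Python sketch. The dictionaries below are hypothetical stand-ins for `Chunk` and `Form` nodes. Matching two disconnected node patterns yields every chunk/form pair, and the `WHERE` clause keeps only the pairs whose `formId` values are equal.

```python
from itertools import product

# Hypothetical stand-ins for Chunk and Form nodes.
chunks = [
    {"chunkId": "c1", "formId": "f1"},
    {"chunkId": "c2", "formId": "f1"},
    {"chunkId": "c3", "formId": "f2"},
]
forms = [{"formId": "f1"}, {"formId": "f2"}]

# Every (chunk, form) pair, filtered on equal formId: the equivalent of
# MATCH (section:Chunk), (parent:Form) WHERE section.formId = parent.formId
part_of = [
    (chunk["chunkId"], form["formId"])
    for chunk, form in product(chunks, forms)
    if chunk["formId"] == form["formId"]
]
print(part_of)
```

Pairing every chunk with every form can be expensive on a large graph, so this pattern is best suited to one-off transformations like the one here.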

In [42]:
# Connect all parent `Form` nodes to the "head" of each section linked list...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# and the chunk is the first in its section (`chunkSeqId` is 0)
# connect the pairs with a (:Form)-[:SECTION]->(:Chunk) relationship,
# recording the section name in the relationship's `f10kItem` property
cypher = """
  MATCH (section:Chunk), (parent:Form)
  WHERE section.formId = parent.formId
    AND section.chunkSeqId = 0
  MERGE (parent)-[:SECTION {f10kItem:section.f10kItem}]->(section)
"""

kg.query(cypher)

[]

## Example Cypher queries

In [69]:
# Retrieve a "window" of chunks...

# MATCH a 3 node pattern, anchored by a specified `Chunk` 
# and extending 1 node before and after that node.
# RETURN the text from each chunk
# by first extracting an array of `chunk.text` properties
# then concatenating into a single string using 
# the APOC library function `apoc.text.join()`
cypher = """
    MATCH chunkWindow=(:Chunk)-[:NEXT]->(:Chunk {chunkId: $chunkIdParam})-[:NEXT]->()
    RETURN apoc.text.join([ chunk in nodes(chunkWindow) | chunk.text ], " ") as windowText
    """

kg.query(cypher,
         params={'chunkIdParam': '0001493152-23-022974-item1-chunk0001'})

[{'windowText': '>ITEM 1. BUSINESS\n\n\n\xa0\n\n\nSummary\n\n\n\xa0\n\n\nBiotricity Inc. (the\n“Company”, “Biotricity”, “we”, “us”, “our”) is a leading-edge medical\ntechnology company focused on biometric data monitoring and diagnostic solutions. We deliver innovative, remote monitoring solutions\nto the medical, healthcare, and consumer markets, with a focus on diagnostic and post-diagnostic solutions for lifestyle and chronic\nillnesses. We approach the diagnostic side of remote patient monitoring by applying innovation within existing business models where\nreimbursement is established. We believe this approach reduces the risk associated with traditional medical device development and\naccelerates the path to revenue. In post-diagnostic markets, we intend to apply medical grade biometrics to enable consumers to\nself-manage, thereby driving patient compliance and reducing healthcare costs. Our initial focus was on the diagnostic mobile\ncardiac outpatient monitoring (COM) market. 

Note that this query returns no rows when the anchor chunk has no preceding or
following chunk. For example, try setting `chunkIdParam` to `0001493152-23-022974-item1-chunk0000`,
which is the first chunk in the section.

To handle that scenario, we'll use a variable-length relationship pattern.

In [70]:
# Retrieve a "window" of chunks, even when there is no before or after...

# MATCH a 3 node pattern, anchored by a specified `Chunk` 
# and extending 0 _or_ 1 node before and after that node,
# using a variable length relationship pattern. This may
# find multiple paths, so 
# WITH the `chunkWindow` paths ordered by descending length, LIMIT to the first path
# to get only the `longestChunkWindow`
# RETURN the text from each chunk in the longestChunkWindow
# by first extracting an array of `chunk.text` properties
# then concatenating into a single string using 
# the APOC library function `apoc.text.join()`
cypher = """
  MATCH chunkWindow=(:Chunk)-[:NEXT*0..1]->(:Chunk {chunkId: $chunkIdParam})-[:NEXT*0..1]->()
  WITH chunkWindow as longestChunkWindow ORDER BY length(chunkWindow) DESC LIMIT 1
  RETURN apoc.text.join([ chunk in nodes(longestChunkWindow) | chunk.text ], " ") as windowText
    """

kg.query(cypher,
         params={'chunkIdParam': '0001493152-23-022974-item1-chunk0000'})

[{'windowText': '>ITEM 1. BUSINESS\n\n\n\xa0\n\n\nSummary\n\n\n\xa0\n\n\nBiotricity Inc. (the\n“Company”, “Biotricity”, “we”, “us”, “our”) is a leading-edge medical\ntechnology company focused on biometric data monitoring and diagnostic solutions. We deliver innovative, remote monitoring solutions\nto the medical, healthcare, and consumer markets, with a focus on diagnostic and post-diagnostic solutions for lifestyle and chronic\nillnesses. We approach the diagnostic side of remote patient monitoring by applying innovation within existing business models where\nreimbursement is established. We believe this approach reduces the risk associated with traditional medical device development and\naccelerates the path to revenue. In post-diagnostic markets, we intend to apply medical grade biometrics to enable consumers to\nself-manage, thereby driving patient compliance and reducing healthcare costs. Our initial focus was on the diagnostic mobile\ncardiac outpatient monitoring (COM) market. 
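
The effect of the `*0..1` bounds can be sketched in plain Python. Assuming a section is simply an ordered list of chunk texts (the `chunk_window` helper and the sample list are hypothetical), the window takes at most one neighbor on each side and shrinks at either end of the list, which corresponds to keeping the longest matching path.

```python
from typing import List


def chunk_window(texts: List[str], index: int, separator: str = " ") -> str:
    """Join the chunk at `index` with up to one neighbor on each side."""
    start = max(index - 1, 0)               # zero or one step back
    end = min(index + 1, len(texts) - 1)    # zero or one step forward
    return separator.join(texts[start:end + 1])


section = ["first", "second", "third", "fourth"]
print(chunk_window(section, 0))  # first chunk: no predecessor
print(chunk_window(section, 1))  # middle chunk: full three-chunk window
print(chunk_window(section, 3))  # last chunk: no successor
```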

## Prepare LangChain for using the Knowledge Graph

In [80]:


# OpenAI for creating embeddings
embeddings_model = OpenAIEmbeddings()

# Create a langchain vector store from the existing Neo4j knowledge graph.
vector_store = Neo4jVector.from_existing_graph(
    embedding=embeddings_model,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)
# Create a retriever from the vector store
retriever = vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever
)

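# Cypher appended to the vector index lookup, which supplies `node` and `score`.
# Note: the `LIMIT 1` below applies to the whole result set, so this retriever
# returns a single document (the longest window found), not one per match.
retrieval_query = """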
MATCH chunkWindow=(:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->()
WITH node, score, chunkWindow as longestChunkWindow ORDER BY length(chunkWindow) DESC LIMIT 1
RETURN apoc.text.join([ chunk in nodes(longestChunkWindow) | chunk.text ], " ") as text,
    score,
    node {.*, textEmbedding: Null, text: Null} AS metadata
"""

vector_store_with_cypher = Neo4jVector.from_existing_index(
    embeddings_model,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query,
)

# Create a retriever from the vector store
retriever_with_cypher = vector_store_with_cypher.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain_with_cypher = RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), chain_type="stuff", retriever=retriever_with_cypher
)

## Example questions

In [96]:
question = 'Who makes GPUs that is seeing an increase in demand?'

In [97]:
# Vector search using the langchain retriever over the Neo4j vector store
retriever.get_relevant_documents(question)[0]

Document(page_content='\ntext: The growth in demand for associative processing computing solutions is being driven by the increasing market adoption and usage of graphics processing unit (“GPU”) and CPU farms for AI processing of large data collections, including parallel computing in scientific research.  However, the large-scale usage of GPU and CPU farms for AI processing of data is demonstrating the limits of GPU and CPU processing speeds and resulting in ever higher energy consumption. The amounts of data being processed, which is coming from increasing numbers of users and continuously increasing amounts of collected data, has resulted in efforts to split and store the processed data among multiple databases, through a process called sharding.  Sharding substantially increases processing costs and worsens the power consumption factors associated with processing so much data.  As the environmental impacts of data processing are becoming increasingly important, and complex workload

In [98]:
# Vector search using the retriever that expands each hit into a window of neighboring chunks
retriever_with_cypher.get_relevant_documents(question)[0]

Document(page_content='Table of Contents\nIndustry and Market Strategy\nAssociative Processing Unit Computing Market Overview\nThe markets for associating processing computing solutions are significant and growing rapidly.  The total addressable market (“TAM”) for APU search applications, which is the market where GSI is focusing its commercialization efforts, has been determined by GSI to be approximately $232 billion in 2023, and growing at a compound annual growth rate (“CAGR”) of 13% to $380 billion by 2027. GSI has similarly determined that the Serviceable Available Market (“SAM”) for APU search applications is approximately $7.1 billion in 2023, and anticipated to grow at a CAGR of 16% to $12.8 billion by 2027. The search market segments included in GSI’s TAM and SAM analyses include vector search HPC. Some market applications in these segments are computer vision, synthetic aperture radar, drug discovery, and cybersecurity; and service markets such as NoSQL, Elasticsearch, and O

In [99]:
chain(
    {"question": question},
    return_only_outputs=True,
)

{'answer': 'NVIDIA Corporation is a company that makes GPUs and is seeing an increase in demand.\n',
 'sources': 'https://www.sec.gov/Archives/edgar/data/1126741/000155837023011516/0001558370-23-011516-index.htm'}

In [100]:
chain_with_cypher(
    {"question": question},
    return_only_outputs=True,
)

{'answer': 'NVIDIA Corporation is a company that makes GPUs and is seeing an increase in demand.\n',
 'sources': 'https://www.sec.gov/Archives/edgar/data/1126741/000155837023011516/0001558370-23-011516-index.htm'}