# Expand Context

The Form 10k Chunk nodes are individual chunks of text that were
originally part of a complete document.

You can reconstruct the original context by connecting the nodes
with relationships. A connected graph is also easier to navigate and understand,
which helps with debugging and testing while you're building an application.

It also lets you provide a better user experience. Your users will be able
to interact directly with the data and even provide feedback that will
improve subsequent answers.

You will create a connected context by making the following
changes to the knowledge graph:

1. Extract: create a `(:Form)` node for each original source Form.
2. Enhance: add a summarized text property to each `(:Form)` node.
3. Expand: connect each `(:Chunk)` to the `(:Form)` node that it is part of.

The graph will look like this...

```cypher
(:Form 
  formId: string //  a unique identifier for the form
  names: string[] // the names of the company that filed the form
  cik: int // the Central Index Key for the company that filed the form
  cusip6: string // the CUSIP identifier for the company
  source: string // a link back to the original 10k document
  summary: string // text summary generated with the LLM 
  summaryEmbedding: float[] // vector embedding of summary
)
```

```cypher
// Chunks are part of a Form
(:Chunk)-[:PART_OF]->(:Form)
// Forms link to the first Chunk in a section based on the `f10kItem` value from the Form 10k
(:Form)-[:SECTION {f10kItem: string}]->(:Chunk)
```

## Setup

In [None]:
%run 'shared.ipynb'

# Extract Form Nodes

There will be one Form node for every Form 10k.

Like creating the Chunks, we'll loop through all Form 10k files,
extracting the sections we want to use as the Form nodes.

In [None]:
import json
import os

# A helper function for creating a Form record from the file information
def extract_form10k_form_from_file(file):
    with open(file) as form_file: # open the json file and close it when done
        file_as_object = json.load(form_file)
    form_id = os.path.splitext(os.path.basename(file))[0] # extract form id from file name
    full_text = f"""About {file_as_object['names']}...
      {file_as_object['item1'] if 'item1' in file_as_object else ''}
      {file_as_object['item1a'] if 'item1a' in file_as_object else ''}
      {file_as_object['item7'] if 'item7' in file_as_object else ''}
      {file_as_object['item7a'] if 'item7a' in file_as_object else ''}
      """

    form_with_metadata = {
      'formId': f'{form_id}', # pulled from the filename
      # metadata from file...
      'names': file_as_object['names'],
      'cik': file_as_object['cik'],
      'cusip6': file_as_object['cusip6'],
      'source': file_as_object['source'],
      'fullText': full_text
    }
    
    return form_with_metadata
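
The helper can be tried out without the real data set. The following is a minimal sketch, not the notebook's exact code: the sample record, its field values, and the file name are all made up for illustration, and `form_id_from_path` and `join_items` are hypothetical helper names. It shows how the form ID falls out of the file name and how missing items are simply skipped.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample record; real Form 10k files carry far more text.
sample = {
    "names": ["Example Corp"],
    "cik": 1234567,
    "cusip6": "ABC123",
    "source": "https://example.com/form10k",
    "item1": "Business overview.",
    # item1a, item7 and item7a are deliberately missing
}

def form_id_from_path(path):
    """Return the file name without its directory or extension."""
    return Path(path).stem

def join_items(record, keys=("item1", "item1a", "item7", "item7a")):
    """Concatenate the items that are present, skipping missing ones."""
    return "\n".join(record[key] for key in keys if key in record)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "0000000000-00-000000.json"  # hypothetical file name
    sample_path.write_text(json.dumps(sample))
    loaded = json.loads(sample_path.read_text())
    form_id = form_id_from_path(sample_path)
    full_text = join_items(loaded)

print(form_id)
print(full_text)
```

Using `pathlib` here avoids depending on `/` as the path separator, which matters if the notebook is run on Windows.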

In [None]:
# Create a uniqueness constraint on the formId property of Form nodes
gdb.execute_query('CREATE CONSTRAINT unique_form IF NOT EXISTS FOR (n:Form) REQUIRE n.formId IS UNIQUE')


In [None]:
# Create parent `Form` nodes for each form..

merge_form_node_query = """
MERGE (f:Form {formId: $formInfo.formId })
  ON CREATE SET
    f.names = $formInfo.names,
    f.source = $formInfo.source,
    f.cik = $formInfo.cik,
    f.cusip6 = $formInfo.cusip6
RETURN f.formId
"""

# Helper function to create a node for a single form.
# This will use the `merge_form_node_query` to create a `:Form` node for the form.
def create_form_node(form_with_metadata):
  print(f"Creating `:Form` node for form ID {form_with_metadata['formId']}")
  gdb.execute_query(merge_form_node_query, 
          formInfo=form_with_metadata
          )

### Create *all* parent form nodes

You can now create a Form node for each Form 10k file.

For each file,
`MERGE` a `(:Form)` node keyed on `formId`,
with names, source, cik, and cusip6 properties.


In [None]:
%%time

import glob

IMPORT_DATA_DIRECTORY = f"{DATA_DIR}/form10k/"

all_file_names = glob.glob(IMPORT_DATA_DIRECTORY + "*.json")
counter = 0

all_forms = []

for file_name in all_file_names:
    counter += 1
    print(f'=== Processing {counter} of {len(all_file_names)} ===')
    # get form data from the files
    print(f'Reading Form10k file {file_name}...')
    form_with_metadata = extract_form10k_form_from_file(file_name)
    all_forms.append(form_with_metadata)
    # create node
    print('Creating Form Node...')
    create_form_node(form_with_metadata)
    print(f'Done Processing {file_name}')

# Check the number of nodes in the graph
gdb.execute_query("MATCH (n:Form) RETURN count(n) as formCount").records

# Enhance - create summary property, with embedding

During the file processing above, the text from all interesting
items was added to the `fullText` key of each dictionary in `all_forms`.

We'll use an LLM to summarize the text and create an embedding of the summary.

Both the text summary and the embedding will be added to the Form nodes.

In [None]:
# Create an embedding to find out the dimensions of the vector
text_embedding = embeddings_api.embed_query("embed this text using an LLM")
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")
# Create a vector index called "form_10k_forms" on the `summaryEmbedding` property of nodes labeled `Form`.
gdb.execute_query("""
         CREATE VECTOR INDEX `form_10k_forms` IF NOT EXISTS
          FOR (f:Form) ON (f.summaryEmbedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam=vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES')
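
The index above is configured with the `cosine` similarity function. As a rough illustration of what that measures, here is a minimal pure-Python sketch of raw cosine similarity between two vectors. The function name and the sample vectors are made up for this example, and the score reported by the database index may be scaled differently from the raw cosine shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, whatever their length.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Vectors at right angles score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same_direction, 6), orthogonal)
```

The dimension check mirrors why the notebook measures `vector_dimensions` first: every vector stored in the index must have the same length.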

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 60000,
    chunk_overlap  = 0,
    length_function = len,
    is_separator_regex = False,
)

for form_info in all_forms:
  split_text = text_splitter.split_text(form_info['fullText'])
  summary = ''
  for partial_text in split_text:
    partial_summary = chat_api.invoke(
      f"""Write a single, very brief sentence summary of {form_info['names'][0]}'s business
       based on the following information...\n {partial_text}
      """)
    summary += partial_summary.content + '\n\n'
  print(f"Summarized {form_info['names'][0]}'s form-10k in {len(summary)} characters. Here's a preview...")
  print(f"\t{summary[:120]}")
  form_info['summary'] = summary
  summary_embedding = embeddings_api.embed_query(summary)
  form_info['summaryEmbedding'] = summary_embedding
  print(f"\tUpdating form with ID {form_info['formId']} with summary and embedding...")
  gdb.execute_query("""
    MATCH (f:Form {formId: $formInfoParam.formId})
      SET f.summary = $formInfoParam.summary 
    WITH f
      CALL db.create.setNodeVectorProperty(f, "summaryEmbedding", $formInfoParam.summaryEmbedding)    
    """, 
      formInfoParam=form_info
  )
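
The loop above splits each form's text into pieces, summarizes every piece, and joins the partial summaries. The sketch below shows that same shape without calling an LLM. It assumes a naive fixed-size splitter in place of `RecursiveCharacterTextSplitter` and a stand-in `fake_summarize` function that just keeps the first sentence; both are hypothetical and exist only for illustration.

```python
def split_fixed(text, chunk_size):
    """Naive splitter: cut the text every `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def fake_summarize(text):
    """Stand-in for the LLM call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."

def summarize_long_text(text, chunk_size, summarize=fake_summarize):
    """Summarize each piece separately, then join the partial summaries."""
    partial_summaries = [summarize(piece) for piece in split_fixed(text, chunk_size)]
    return "\n\n".join(partial_summaries)

summary = summarize_long_text("aaaa.bbbb.cccc.dddd.", chunk_size=10)
print(summary)
```

Splitting first keeps each request within the model's input limit, at the cost of a summary that is a concatenation of partial summaries rather than one sentence.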


# Expand - connect Chunks to Form nodes

### Connect chunks to parent Form nodes

Next, connect the chunks to the Form they're part of.

In [None]:
# Connect all chunks to their parent `Form` node...

# MATCH a paired node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# connect the pairs into a (:Chunk)-[:PART_OF]->(:Form) relationship
cypher = """
  MATCH (c:Chunk), (f:Form)
    WHERE c.formId = f.formId
  MERGE (c)-[newRelationship:PART_OF]->(f)
  RETURN count(newRelationship)
"""

gdb.execute_query(cypher).records
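
If the pairing logic in that query is unfamiliar, it can be sketched in plain Python as a lookup on `formId`. The records below are made up, and `chunkId` is a hypothetical key used only to tell the chunks apart in this example.

```python
# Made-up records standing in for (:Chunk) and (:Form) nodes.
chunks = [
    {"chunkId": "a-0", "formId": "form-a"},
    {"chunkId": "a-1", "formId": "form-a"},
    {"chunkId": "b-0", "formId": "form-b"},
    {"chunkId": "x-0", "formId": "form-missing"},  # has no matching form
]
forms = [{"formId": "form-a"}, {"formId": "form-b"}]

# Index the forms by formId so each chunk needs a single lookup.
forms_by_id = {form["formId"]: form for form in forms}

# One (chunk, form) pair per chunk whose formId matches a form.
part_of = [
    (chunk["chunkId"], chunk["formId"])
    for chunk in chunks
    if chunk["formId"] in forms_by_id
]

print(part_of)
```

A chunk whose `formId` has no matching form gets no relationship, which is also what the `MATCH ... WHERE` pattern does.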

### Connect `Form` to head of `Chunk` list

You can add one more relationship to the graph, connecting
the `Form` to the first `Chunk` of each section.

This is similar to the previous query,
but also checks that the chunk sequence id is 0.

The `SECTION` relationship that connects the Form to the
first chunk will also get an `f10kItem` property.

This is a kindness for humans looking at the knowledge graph,
enabling them to easily navigate from a Form to the beginning
of a particular section.

In [None]:
# Connect all parent `Form` nodes to the "head" of each section linked list...

# MATCH a paired node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# (this is much like a JOIN in SQL)
# AND the `Chunk` is the first in its section (`chunkSeqId` is 0)
# connect the pairs with a (:Form)-[:SECTION]->(:Chunk) relationship
cypher = """
  MATCH (headOfSection:Chunk), (f:Form)
  WHERE headOfSection.formId = f.formId
    AND headOfSection.chunkSeqId = 0
  WITH headOfSection, f
    MERGE (f)-[newRelationship:SECTION {f10kItem: headOfSection.item}]->(headOfSection)
  RETURN count(newRelationship)
"""

gdb.execute_query(cypher).records
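
To see what the head-of-section filter selects, here is a small sketch over made-up chunk records. It uses the `formId`, `item`, and `chunkSeqId` names from the query; the `text` values are invented for illustration. It builds a lookup from each form to the first chunk of each of its sections.

```python
from collections import defaultdict

# Made-up chunk records; only chunks with chunkSeqId 0 start a section.
chunks = [
    {"formId": "form-a", "item": "item1", "chunkSeqId": 0, "text": "A1-0"},
    {"formId": "form-a", "item": "item1", "chunkSeqId": 1, "text": "A1-1"},
    {"formId": "form-a", "item": "item7", "chunkSeqId": 0, "text": "A7-0"},
    {"formId": "form-b", "item": "item1", "chunkSeqId": 0, "text": "B1-0"},
]

# formId -> {item -> text of the first chunk in that section}
section_heads = defaultdict(dict)
for chunk in chunks:
    if chunk["chunkSeqId"] == 0:
        section_heads[chunk["formId"]][chunk["item"]] = chunk["text"]

print(dict(section_heads))
```

Each entry in the inner dictionary corresponds to one `SECTION` relationship: a direct hop from the form to where a given item begins.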

# Examples

### Vector search with graph pattern

You can now create a question answering chain.

By default, Neo4jVector uses a basic Cypher query
to perform vector similarity search.

That query can be extended with any Cypher
you want.

This Cypher query extension will receive two variables, `node` and `score`,
and it should return three fields: `text`, `score`, and `metadata`.

  - The `text` should be plain text to be passed to the LLM.
  - The `score` column should be the similarity score of the text.
  - The `metadata` can be any additional information you want to pass, like the source of the text.


In this example, we'll use the previous/next chunks to expand the context of the text passed to the LLM.

Create two QA chains, one with and one without the chunk window.


In [None]:

# Create a langchain vector store from the existing Neo4j knowledge graph.
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)
# Create a retriever from the vector store
windowless_retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
windowless_chain = prettifyChain(RetrievalQAWithSourcesChain.from_chain_type(
    chat_api, 
    chain_type="stuff", 
    retriever=windowless_retriever
))

In [None]:
retrieval_query_window = """
MATCH window=
    (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window as longestWindow 
  ORDER BY length(window) DESC LIMIT 1
WITH nodes(longestWindow) as chunkList, node, score
  UNWIND chunkList as chunkRows
WITH collect(chunkRows.text) as textList, node, score
RETURN apoc.text.join(textList, " \n ") as text,
    score,
    node {.source} AS metadata
"""

vector_store_window = Neo4jVector.from_existing_index(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database="neo4j",
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_window
)

# Create a retriever from the vector store
retriever_window = vector_store_window.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain_window = prettifyChain(RetrievalQAWithSourcesChain.from_chain_type(
    chat_api, 
    chain_type="stuff", 
    retriever=retriever_window
))
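
The window query above walks up to one `NEXT` relationship on each side of the matched chunk and keeps the longest path it finds. As a rough analogy only, the sketch below does the same over a plain Python list of chunk texts. The function name `expand_window` and the sample texts are hypothetical.

```python
def expand_window(chunk_texts, hit_index, hops=1):
    """Join the matched chunk with up to `hops` neighbours on each side."""
    start = max(0, hit_index - hops)                    # no chunk before the first
    end = min(len(chunk_texts), hit_index + hops + 1)   # no chunk after the last
    return " \n ".join(chunk_texts[start:end])

section = ["chunk 0", "chunk 1", "chunk 2", "chunk 3"]

middle = expand_window(section, 2)  # previous, matched, and next chunk
first = expand_window(section, 0)   # no previous chunk exists
print(middle)
print(first)
```

A match at either end of a section gets a shorter window, just as the `*0..1` bounds allow a path with no neighbour on that side.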

## Example questions - compare with and without context window

In [None]:
question = "Tell me about Netapp's business."

In [None]:
windowless_chain(question)

In [None]:
chain_window(question)