# Expanded Context

### Expand - Connect Form 10k Chunk Nodes

The Form 10k Chunk Nodes are individual chunks of text that used
to be part of an entire document.

You can reconstruct the original context by connecting the nodes
with relationships. 

This is the second step in creating a knowledge graph. 

First, create nodes. Then, connect the nodes with relationships.

This creates a connected context which can improve answers in a RAG application.

It also makes the data easy to navigate and understand.

That is super helpful for debugging and testing while you're building an application.

And, you can provide a better user experience. Your users will be able 
to directly interact with the data and even provide feedback that will
improve subsequent answers. 

You will create a connected context by making the following
changes to the knowledge graph:

1. First, connect each chunk into a linked list.
2. Second, create `(:Form)` nodes for each original source Form.
3. Third, connect each chunk to the parent `(:Form)` node that it is a part of.
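
Before touching the database, the three steps above can be pictured with a small plain-Python model. This is only a sketch: the chunk records and their values are made up, and tuples stand in for relationships.

```python
from itertools import groupby

# Hypothetical chunk records; only the property names mirror the graph.
chunks = [
    {"chunkId": "form-a-item1-chunk0000", "formId": "form-a", "f10kItem": "item1", "chunkSeqId": 0},
    {"chunkId": "form-a-item1-chunk0001", "formId": "form-a", "f10kItem": "item1", "chunkSeqId": 1},
    {"chunkId": "form-a-item7-chunk0000", "formId": "form-a", "f10kItem": "item7", "chunkSeqId": 0},
]

def section_key(chunk):
    """Chunks belong to the same list when they share a form and a section."""
    return (chunk["formId"], chunk["f10kItem"])

# Step 1: link consecutive chunks of the same section with NEXT.
ordered = sorted(chunks, key=lambda c: (section_key(c), c["chunkSeqId"]))
next_rels = []
for _, group in groupby(ordered, key=section_key):
    section = list(group)
    next_rels += [(a["chunkId"], "NEXT", b["chunkId"]) for a, b in zip(section, section[1:])]

# Step 2: one Form per distinct formId.
forms = sorted({c["formId"] for c in chunks})

# Step 3: every chunk is PART_OF its parent Form.
part_of_rels = [(c["chunkId"], "PART_OF", c["formId"]) for c in chunks]
```

The rest of this notebook does the same work in Cypher, where the relationships are stored in the graph instead of in Python lists.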


## Imports

### Script - import libraries

You need to import some libraries, as usual, so let's do that first.

In [2]:
from dotenv import load_dotenv
import os

# Common data processing
import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings


## Global constants

### Script - set up global constants

You will also define some global constants for the database connection,
and the data model.

In [3]:
# Load from environment
load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
NEO4J_DATABASE = os.getenv('NEO4J_DATABASE') or 'neo4j'

# Global constants
VECTOR_INDEX_NAME = 'form_10k_chunks'
VECTOR_NODE_LABEL = 'Chunk'
VECTOR_SOURCE_PROPERTY = 'text'
VECTOR_EMBEDDING_PROPERTY = 'textEmbedding'


In [4]:
print(NEO4J_URI, NEO4J_USERNAME)

neo4j://localhost:7687 neo4j
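
If a variable is missing from `.env`, `os.getenv` quietly returns `None`, and the problem only shows up later as a connection error. A small guard can fail earlier with a clearer message. This is a sketch, and `require_env` is a hypothetical helper name, not part of `dotenv` or Langchain.

```python
import os

def require_env(*names):
    """Return the values of the named environment variables, or raise if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]
```

For example, `require_env('NEO4J_URI', 'NEO4J_USERNAME', 'NEO4J_PASSWORD')` could be called right after `load_dotenv`.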


In [5]:
# Create a knowledge graph using Langchain's Neo4j integration.
# This will be used for direct querying of the knowledge graph. 
kg = Neo4jGraph(
    url=NEO4J_URI, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE
)

### `Form` Nodes

You already have Chunk nodes.

You want to create a new node to represent the 10k form itself.

Each node will have a `Form` label, and the following properties:
  - formId - a unique identifier for the form
  - source - a link back to the original 10k document
  - cik - the Central Index Key for the company that filed the form
  - cusip6 - the CUSIP identifier for the company

As a node, it will look like this:
```cypher
(:Form 
  formId: string
  source: string
  cik: string
  cusip6: string
)
```

## Cypher Queries to transform the Knowledge Graph

### First, create Form nodes

You can now use Cypher to transform the graph.

You will start by creating `Form` nodes. 
There will be one Form node for every Form 10k.

The first step is to look through all the Chunks and 
find the `formId` that they came from.

In [13]:
# Find distinct information needed for each Form node...

# MATCH every `Chunk` node with a single node pattern 
# RETURN DISTINCT rows of `names`, `source`, `formId`, `cik`, and `cusip6` properties


cypher = """
  MATCH (anyChunk:Chunk) 
  RETURN DISTINCT anyChunk { .names, .source, .formId, .cik, .cusip6 } as formInfo
"""
form_info_list = kg.query(cypher)

form_info_list

all_form_info = [form_info['formInfo'] for form_info in form_info_list]

all_form_info[0]

{'cik': '1650372',
 'source': 'https://www.sec.gov/Archives/edgar/data/1650372/000165037223000040/0001650372-23-000040-index.htm',
 'formId': '0001650372-23-000040',
 'names': ['ATLASSIAN CORP PLC', 'ATLASSIAN CORPORATION PLC'],
 'cusip6': 'G06242'}
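
Many chunks carry the same form properties, so looking across every chunk produces repeated rows. The deduplication that Cypher's `DISTINCT` performs can be pictured in plain Python. The rows below are invented for illustration, and `distinct_rows` is a hypothetical helper.

```python
def distinct_rows(rows, keys):
    """Return rows with unique values for `keys`, preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        # Lists are unhashable, so convert them to tuples for the lookup key.
        fingerprint = tuple(
            tuple(row[k]) if isinstance(row[k], list) else row[k] for k in keys
        )
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append({k: row[k] for k in keys})
    return unique

# Made-up chunk rows: two chunks from one form, one chunk from another.
chunk_rows = [
    {"formId": "form-a", "cik": "111", "names": ["A CORP"], "text": "first chunk"},
    {"formId": "form-a", "cik": "111", "names": ["A CORP"], "text": "second chunk"},
    {"formId": "form-b", "cik": "222", "names": ["B INC"], "text": "another chunk"},
]
form_rows = distinct_rows(chunk_rows, keys=["formId", "cik", "names"])
```

Three chunk rows collapse into two form rows, one per distinct form.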

### Create a constraint for Forms

Before creating Form nodes,
create a uniqueness constraint on nodes with a `Form` label
requiring that the `formId` property is unique.

In [7]:
# Create a uniqueness constraint on the formId property of Form nodes 
kg.query('CREATE CONSTRAINT unique_form IF NOT EXISTS FOR (n:Form) REQUIRE n.formId IS UNIQUE')


[]

### Create *all* parent form nodes (skip for single form)

You can now create a Form node for each of the distinct formIds
using distinct rows.

For each row, 
`MERGE` a new `(:Form)` node
with names, source, cik, and cusip6 properties.


In [11]:
# Create parent `Form` nodes for each form..

# MATCH the same single node pattern 
# WITH DISTINCT rows of `source`, `formId`, `cik`, and `cusip6` properties
# MERGE a new `Form` node with the `formId` property
# and SET the `source`, `cik`, and `cusip6` properties
cypher = """
  MATCH (all:Chunk) 
  WITH DISTINCT all {.names, .source, .formId, .cik, .cusip6} as formInfo
    MERGE (f:Form {formId: formInfo.formId })
      ON CREATE SET
        f.names = formInfo.names,
        f.source = formInfo.source,
        f.cik = formInfo.cik,
        f.cusip6 = formInfo.cusip6
    RETURN count(f) as formCount
"""
kg.query(cypher)

[{'formCount': 10}]

### Script - connect chunks into linked list

You can now create relationships between all
nodes in that list of chunks,
effectively creating a linked list from the
first chunk to the last.

The extra step here is calling the `apoc.nodes.link` 
procedure to create a linked list.


### Script - connect *all* chunks (multiple forms)

You can now connect all the chunks into linked lists
by looping through all the distinct formIds and sections,
calling the same cypher query as above 
to connect the related section chunks.

In [14]:
# Connect *all* section chunks into a linked list..

# MATCH a single node pattern for `Chunk` nodes
# WHERE the chunks have the `formId` given by $formIdParam
# and are part of the section named by $f10kItemParam
# ORDER the chunks by `chunkSeqId`
# WITH those chunks collected together into a list
# CALL apoc.nodes.link() to create a linked list
# (the Python loop below runs the query once per form and section)
cypher = """
  MATCH (from_same_form_and_section:Chunk)
  WHERE from_same_form_and_section.formId = $formIdParam
    AND from_same_form_and_section.f10kItem = $f10kItemParam
  WITH from_same_form_and_section
    ORDER BY from_same_form_and_section.chunkSeqId ASC
  WITH collect(from_same_form_and_section) as section_chunk_list
    CALL apoc.nodes.link(section_chunk_list, "NEXT", {avoidDuplicates: true})
  RETURN size(section_chunk_list)
"""

distinct_form_ids = [form_info['formId'] for form_info in all_form_info]
for form_id in distinct_form_ids:
    for form10kItemName in ['item1', 'item1a', 'item7', 'item7a']:
        kg.query(cypher, params={'formIdParam': form_id, 'f10kItemParam': form10kItemName})
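
The nested loop runs one query per (form, section) pair. Generating the pairs explicitly makes it easy to check how many queries will run before sending any of them. This is a sketch with a hypothetical helper name, `section_params`, and placeholder form ids.

```python
from itertools import product

SECTION_ITEMS = ["item1", "item1a", "item7", "item7a"]

def section_params(form_ids, items=SECTION_ITEMS):
    """Build one parameter dict per (form, section) combination."""
    return [
        {"formIdParam": form_id, "f10kItemParam": item}
        for form_id, item in product(form_ids, items)
    ]

# Placeholder ids; in the notebook these would come from distinct_form_ids.
params_list = section_params(["form-a", "form-b"])
```

With the 10 forms reported above and four section names, this gives 40 parameter sets. Each one could then be passed along as `kg.query(cypher, params=params)`.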


### Script - connect chunks to parent Form nodes

Next, you can connect the chunks to the Form they're part of.

Match a Chunk and a Form
where they have the same `formId`,
then `MERGE` a new `PART_OF` relationship between them.

In [15]:
# Connect all chunks to their parent `Form` node...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# connect the pairs into a (:Chunk)-[:PART_OF]->(:Form) relationship
cypher = """
  MATCH (c:Chunk), (f:Form)
    WHERE c.formId = f.formId
  MERGE (c)-[newRelationship:PART_OF]->(f)
  RETURN count(newRelationship)
"""

kg.query(cypher)

[{'count(newRelationship)': 544}]
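
A quick way to sanity-check this step is to count chunks that did not get a parent. The helper below is a sketch: `count_orphan_chunks` is a hypothetical name, and it assumes `kg` behaves like the `Neo4jGraph` created earlier, whose `query` method returns a list of dictionaries.

```python
def count_orphan_chunks(kg):
    """Count Chunk nodes that have no PART_OF relationship to a Form."""
    cypher = """
      MATCH (c:Chunk)
      WHERE NOT (c)-[:PART_OF]->(:Form)
      RETURN count(c) AS orphanCount
    """
    return kg.query(cypher)[0]["orphanCount"]
```

After the `MERGE` above, `count_orphan_chunks(kg)` should be 0 if every chunk's `formId` matched a `Form` node.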

### Script - connect `Form` to head of `Chunk` list

You can add one more relationship to the graph, connecting
the `Form` to the first `Chunk` of each section.

This is similar to the previous query,
but also checks that the chunk sequence id is 0.

The `SECTION` relationship that connects the Form to the
first chunk will also get an `f10kItem` property.

This is a kindness for humans looking at the knowledge graph,
enabling them to easily navigate from a Form to the beginning
of a particular section.

In [26]:
# Connect all parent `Form` nodes to the "head" of each section linked list...

# MATCH a double node pattern, for the `Chunk` and `Form` nodes
# WHERE the `Chunk` and `Form` nodes have the same `formId` property
# (this is exactly like a JOIN in SQL)
# and the chunk is the first in its section (`chunkSeqId` is 0)
# connect the pairs with a (:Form)-[:SECTION]->(:Chunk) relationship
cypher = """
  MATCH (headOfSection:Chunk), (f:Form)
  WHERE headOfSection.formId = f.formId
    AND headOfSection.chunkSeqId = 0
  WITH headOfSection, f
    MERGE (f)-[newRelationship:SECTION {f10kItem:headOfSection.f10kItem}]->(headOfSection)
  RETURN count(newRelationship)
"""

kg.query(cypher)

[{'count(newRelationship)': 40}]

## Example cypher queries

### Script - pattern match from form to head of section 

For example, you can get the first chunk for a section
using a pattern match from a form to the chunk
connected by a `SECTION` relationship with a matching `f10kItem` property.

As you'd expect, the text of this "first chunk" looks familiar.

In [21]:
all_form_info[0]['formId']

'0001650372-23-000040'

In [27]:
cypher = """
  MATCH (f:Form)-[r:SECTION]->(first:Chunk)
    WHERE f.formId = $formIdParam
        AND r.f10kItem = $f10kItemParam
  RETURN first.chunkId as chunkId, first.text as text
"""

first_chunk_info = kg.query(cypher, params={
    'formIdParam': all_form_info[0]['formId'], 
    'f10kItemParam': 'item1'
})

first_chunk_info


[{'chunkId': '0001650372-23-000040-item1-chunk0000',
  'text': '>ITEM\xa01. BUSINESS\nCompany Overview\n\xa0\nOur mission is to unleash the potential of every team.\nOur products help teams organize, discuss and complete shared work — delivering superior outcomes for their organizations.\nOur primary products include Jira Software and Jira Work Management for planning and project management, Confluence for content creation and sharing, Trello for capturing and adding structure to fluid, fast-forming work for teams, Jira Service Management for team service, management and support applications, Jira Align for enterprise agile planning, and Bitbucket for code sharing and management. Together, our products form an integrated system for organizing, discussing and completing shared work, becoming deeply entrenched in how teams collaborate and how organizations run. The Atlassian platform is the common technology foundation for our products that drives connection between teams, information, a
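
From the head of a section, the `NEXT` relationships lead to the rest of the text. The helper below is a sketch of a single hop. `next_chunk_id` is a hypothetical name, and it assumes `kg` is the `Neo4jGraph` from earlier and that chunks have been linked with `NEXT`.

```python
def next_chunk_id(kg, chunk_id):
    """Return the chunkId that follows `chunk_id` in its section, or None at the end of the list."""
    cypher = """
      MATCH (c:Chunk {chunkId: $chunkIdParam})-[:NEXT]->(next:Chunk)
      RETURN next.chunkId AS chunkId
    """
    rows = kg.query(cypher, params={"chunkIdParam": chunk_id})
    return rows[0]["chunkId"] if rows else None
```

For example, `next_chunk_id(kg, first_chunk_info[0]['chunkId'])` would look up the chunk after the one returned above.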

# Prepare langchain for using the Knowledge Graph

### Script - chain with cypher

You can now create a question and answer chain.

The default Neo4jVector uses a basic cypher query
to perform vector similarity search.

That query can be extended to do whatever you
want in Cypher.

...

This Cypher query extension will receive two variables: `node` and `score`.

The query should return three columns: `text`, `score`, and `metadata`.

The `text` should be plain text to be passed to the LLM.

The `score` column should be the similarity score of the text.

The `metadata` can be any additional information you want to pass,
like the source of the text.

...

The Cypher itself can do whatever you want to create those outputs.

This example has a literal string that is prepended to any text.
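
A minimal sketch of such an extension is shown here. The literal text is invented for illustration, and the variable name `retrieval_query_extra_text` is an assumption. The query only relies on the `node` and `score` variables and the three return columns described above.

```python
# The vector search supplies `node` and `score`; this query shapes the output.
retrieval_query_extra_text = """
WITH node, score, "This text comes from a Form 10k filing. " AS extraText
RETURN extraText + node.text AS text,
    score,
    node {.source} AS metadata
"""
```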


Create an extended Neo4j vector store by providing
a parameter called `retrieval_query`
where we pass in the Cypher query extension.

...

The retriever and chain construction is exactly the same as before.


...

This "extra text" will be prepended to any chunks that are
found by the vector search. This expands the context of the
information passed to the LLM. This is new information that
may not have been present in the chunk text.



### Script - langchain with and without the window

You now know how to customize the results of a vector
search by extending it with Cypher.

You could use this capability to expand the context
around a chunk with the chunk window query.

Let's try this out and compare results.

...

First, create a chain that uses the default cypher
query included with Neo4jVector. That query performs
a vector search using the specified configuration.

Call it "windowless_chain".

In [44]:
def prettify(chain):
    def prettychain(question: str):
        response = chain({"question": question}, return_only_outputs=True)
        print(textwrap.fill(response['answer'], 80))
    return prettychain

In [45]:

# Create a langchain vector store from the existing Neo4j knowledge graph.
neo4j_vector_store = Neo4jVector.from_existing_graph(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    index_name=VECTOR_INDEX_NAME,
    node_label=VECTOR_NODE_LABEL,
    text_node_properties=[VECTOR_SOURCE_PROPERTY],
    embedding_node_property=VECTOR_EMBEDDING_PROPERTY,
)
# Create a retriever from the vector store
windowless_retriever = neo4j_vector_store.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
windowless_chain = prettify(RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), 
    chain_type="stuff", 
    retriever=windowless_retriever
))

### Script - chain with chunk window

Now create another chain that uses the
chunk window query.

Your goal here is to expand the context with adjacent
chunks which may be relevant to providing a complete
answer.

To do that, use the chunk window query, then pull
out text from each chunk that is in the window.

Finally, all that text will be concatenated together
to provide a complete context for the LLM.

...

Note that the top `MATCH` clause uses the
special `node` variable in the middle of the pattern.

There is an extra `WITH` that collects the text from
each chunk into a list.

The `RETURN` clause uses the `apoc.text.join` function
to concatenate that list of text.
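
The effect of the window can be pictured on an ordinary Python list: take the matched chunk, add up to one neighbor on each side, and join the text. This sketch only models the idea. `chunk_window` is a hypothetical helper, and the sample texts are placeholders.

```python
def chunk_window(texts, hit_index, radius=1):
    """Join the matched chunk with up to `radius` neighbors on each side."""
    start = max(0, hit_index - radius)
    end = min(len(texts), hit_index + radius + 1)
    return " \n ".join(texts[start:end])

# Placeholder section text, one string per chunk in sequence order.
section_texts = ["chunk A", "chunk B", "chunk C", "chunk D"]
middle = chunk_window(section_texts, hit_index=1)
edge = chunk_window(section_texts, hit_index=0)
```

A hit in the middle of a section yields three chunks of text, while a hit at the head of a section yields two, because there is no previous chunk to include.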

In [46]:
retrieval_query_window = """
MATCH window=
    (:Chunk)-[:NEXT*0..1]->(node)-[:NEXT*0..1]->(:Chunk)
WITH node, score, window as longestWindow 
  ORDER BY length(window) DESC LIMIT 1
WITH nodes(longestWindow) as chunkList, node, score
  UNWIND chunkList as chunkRows
WITH collect(chunkRows.text) as textList, node, score
RETURN apoc.text.join(textList, " \n ") as text,
    score,
    node {.source} AS metadata
"""

vector_store_window = Neo4jVector.from_existing_index(
    embedding=OpenAIEmbeddings(),
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database=NEO4J_DATABASE,
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_window
)

# Create a retriever from the vector store
retriever_window = vector_store_window.as_retriever()

# Create a chatbot Question & Answer chain from the retriever
chain_window = prettify(RetrievalQAWithSourcesChain.from_chain_type(
    ChatOpenAI(temperature=0), 
    chain_type="stuff", 
    retriever=retriever_window
))

## Example questions - compare with and without context window

In [38]:
question = "Tell me about Fedex's business."

In [47]:
windowless_chain(question)

FedEx Corporation provides customers and businesses worldwide with a broad
portfolio of transportation, e-commerce, and business services through its
operating companies, including FedEx Express, FedEx Ground, and FedEx Freight.
FedEx operates collaboratively and innovates digitally as one company. FedEx has
a global network that connects more than 99% of the world's gross domestic
product. FedEx has introduced innovative solutions and initiatives to improve
long-term profitability and service quality.


In [48]:
chain_window(question)

FedEx Corporation provides customers and businesses worldwide with a broad
portfolio of transportation, e-commerce, and business services through its
operating companies, FedEx Express and FedEx Ground.
