# Expand Context

The Chunk nodes are individual pieces of text that were once part of
a complete document.

You can reconstruct the original context by connecting the nodes
with relationships. This also makes the data easier to navigate and understand.

That is especially helpful for debugging and testing while you build an application.

It also lets you provide a better user experience. Your users will be able
to interact directly with the data and even provide feedback that
improves subsequent answers.

You will create a connected context by making the following
changes to the knowledge graph:

1. Extract: create a `(:Document)` node for each original source document and a `(:Section)` node for each heading.
2. Enhance: add a summarized text property to each `(:Section)` node.
3. Expand: connect each `(:Chunk)` to the `(:Section)` node that it is part of.

The graph will look like this:

```cypher
(:Document
  documentId: string
  documentUri: string
)

(:Section 
  sectionId: string // a unique identifier for the section
  level: integer
  text: string
  summary: string // text summary generated with the LLM 
  summaryEmbedding: float[] // vector embedding of summary
  documentUri: string
  header1: string
  header2: string
  header3: string
)
```

```cypher
// Document contains sections
(:Document)-[:CONTAINS]->(:Section)

// Section contains Chunks
(:Section)-[:CONTAINS]->(:Chunk)
```

## Setup

In [15]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
#*****************************************************************
# Neo4j
#*****************************************************************
NEO4J_URI=bolt://neo4j-1:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=abc123abc123
NEO4J_DATABASE=neo4j

# either ollama or openai
EMBEDDING_API=ollama
EMBEDDING_MODEL=mxbai-embed-large
# either ollama or openai
CHAT_API=ollama
CHAT_MODEL=llama3

#OLLAMA_URL=http://192.168.1.102:11434
#OLLAMA_URL=http://172.20.10.2:11434
OLLAMA_URL=http://host.docker.internal:11434
OPEN_API_KEY=

DATA_DIR=

Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


# Extract Section Nodes

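There will be one Section node for each distinct heading. The first query creates a Section for the deepest heading of each Chunk; the queries after it create the parent Sections one level at a time.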

In [25]:
# Create parent sections from Chunk nodes

cypher_create_from_chunks = """
            MATCH (c:Chunk)
            WITH c, apoc.util.sha1([coalesce(c.header4, c.header3, c.header2, c.header1)]) AS sectionId
            MERGE (s:Section {sectionId: sectionId } )
            SET s.header1 = c.header1
            , s.header2 = c.header2
            , s.header3 = c.header3
            , s.documentUri = c.documentUri
            , s.level = CASE WHEN c.header4 IS NOT NULL THEN 4
                             WHEN c.header3 IS NOT NULL THEN 3
                             WHEN c.header2 IS NOT NULL THEN 2
                             WHEN c.header1 IS NOT NULL THEN 1
                        END 
            , s.text = coalesce(c.header4, c.header3, c.header2, c.header1)   
            MERGE (s)-[newRel:CONTAINS]->(c)
            """


In [26]:
gdb.execute_query(cypher_create_from_chunks)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a26aed0>, keys=[])
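
To make the query above easier to follow, here is a minimal pure-Python sketch of how it derives the Section properties for a single chunk: it takes the deepest non-null header, uses that header's depth as the level, and hashes the header text into an id. The helper name `section_for_chunk` and the sample chunk are hypothetical, and `hashlib.sha1` only stands in for `apoc.util.sha1`, so the exact hash values are not guaranteed to match what APOC produces. Chunks without any header are not handled here.

```python
import hashlib

# Header properties, deepest first, mirroring coalesce(c.header4, ..., c.header1)
HEADER_KEYS = ("header4", "header3", "header2", "header1")


def section_for_chunk(chunk: dict) -> dict:
    """Derive sectionId, level and text for the Section that contains a chunk."""
    for key in HEADER_KEYS:
        text = chunk.get(key)
        if text is not None:
            level = int(key[-1])  # "header3" -> 3
            break
    else:
        # Assumption: every chunk carries at least one header
        raise ValueError("chunk has no header")
    return {
        # Stand-in for apoc.util.sha1; the id depends on the header text only
        "sectionId": hashlib.sha1(text.encode("utf-8")).hexdigest(),
        "level": level,
        "text": text,
    }


# Hypothetical chunk, for illustration only
example = {"header1": "Official Rules", "header2": "Section 8", "header3": None}
print(section_for_chunk(example)["level"])
```

Because the id is computed from the header text alone, two chunks under the same heading resolve to the same Section, which is what lets `MERGE` create it only once. The same reasoning suggests that identical heading text in different places would also share one Section.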

In [27]:
# Create all other sections based on the parent nodes created before

cypher_create_other_sections = """
            MATCH (e:Section)
            WHERE e.level = $level_to_create+1
            WITH e, CASE WHEN $level_to_create = 1 THEN e.header1
                    WHEN $level_to_create = 2 THEN e.header2
                    WHEN $level_to_create = 3 THEN e.header3
                    END as text
            WITH e, text, apoc.util.sha1([ text ]) AS sectionId    
            MERGE (s:Section {sectionId: sectionId} )
            set s.header1 = e.header1
                , s.header2 = e.header2
                , s.documentUri = e.documentUri
                , s.level = $level_to_create
                , s.text = text    
            MERGE (s)-[newRel:CONTAINS]->(e)
            """


In [28]:
gdb.execute_query(cypher_create_other_sections, level_to_create=3)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff66538190>, keys=[])

In [29]:
gdb.execute_query(cypher_create_other_sections, level_to_create=2)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a26b610>, keys=[])

In [30]:
gdb.execute_query(cypher_create_other_sections, level_to_create=1)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a2dcc10>, keys=[])
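
The three calls above have to run from the deepest level upwards, because each run matches the Sections at `level_to_create + 1`, including those created by the previous run. A small sketch of the same steps as a loop follows; `create_parent_sections` and its `deepest_level` parameter are hypothetical names, and the function only assumes that the driver object has an `execute_query` method that accepts query parameters as keyword arguments, as `gdb` does in the cells above.

```python
def create_parent_sections(driver, query, deepest_level=4):
    """Run the parent-creation query for levels deepest_level - 1 down to 1."""
    levels = list(range(deepest_level - 1, 0, -1))  # e.g. [3, 2, 1]
    for level in levels:
        # Each run creates the parents of the Sections at level + 1
        driver.execute_query(query, level_to_create=level)
    return levels
```

In this notebook it could be called as `create_parent_sections(gdb, cypher_create_other_sections)`.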

In [31]:
# Create a uniqueness constraint on the sectionId property of Section nodes 
gdb.execute_query('CREATE CONSTRAINT unique_section IF NOT EXISTS FOR (n:Section) REQUIRE n.sectionId IS UNIQUE')


EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a23ab90>, keys=[])

### We can remove the :NEXT relationships between the Chunks (linked list)

In [32]:
# removes the NEXT relationships as they are no longer needed
cypher_delete_next_rel = """
MATCH (c:Chunk)-[r:NEXT]->(c1:Chunk) DELETE r
"""
gdb.execute_query(cypher_delete_next_rel)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a43a6d0>, keys=[])

# Extract Document nodes

You can now create a Document node for each distinct `documentUri` value of the level 1 Section nodes.


In [33]:
# Create one Document node per distinct documentUri and link it to its level 1 Sections

cypher_create_documents = """
            MATCH (s:Section)
            WHERE s.level = 1
            MERGE (d:Document {documentId: apoc.util.sha1([s.documentUri]) } )
            SET d.documentUri = s.documentUri
            MERGE (d)-[newRel:CONTAINS]->(s)
            """

gdb.execute_query(cypher_create_documents)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff3a256790>, keys=[])
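
Since every `execute_query` call above returns an empty result, it can help to check what was actually created. The following is a minimal sketch of such a check, assuming a driver object like `gdb` whose `execute_query` returns a result with a `records` attribute; the helper name `count_nodes_by_label` is hypothetical, and the query counts each node under its first label only.

```python
COUNT_BY_LABEL = """
MATCH (n)
RETURN labels(n)[0] AS label, count(*) AS count
ORDER BY label
"""


def count_nodes_by_label(driver):
    """Return a dict that maps each node's first label to the number of nodes."""
    records = driver.execute_query(COUNT_BY_LABEL).records
    return {record["label"]: record["count"] for record in records}
```

Calling `count_nodes_by_label(gdb)` should then list the `Chunk`, `Section`, and `Document` labels with their node counts.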

# Parent retriever

In [None]:
search_results = neo4j_vector_search(
    'What is the size of the rink', 'chunks_vector'
)
for result in search_results:
    print(f'{result["score"]} : {result["node"]["chunkId"]} : {result["node"]["text"]}')
    print("------------")


In [None]:
retrieval_query_parent = """

    WITH node AS chunk, score AS score ORDER BY score 
    OPTIONAL MATCH (chunk:Chunk)<-[:CONTAINS]-(parent)
    OPTIONAL MATCH (parent)-[:CONTAINS]->(s:Chunk)
    WITH chunk, s, score ORDER BY s.chunkSeqId ASC
    WITH collect(s.text) as textList, chunk.text as text, score AS score
    RETURN apoc.text.join(textList, " \n ")  as text,
    score,  {} AS metadata 
    ORDER BY score desc

"""

vector_store_parent = Neo4jVector.from_existing_index(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database="neo4j",
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_parent
)

# Create a retriever from the vector store
retriever_parent = vector_store_parent.as_retriever(search_kwargs={'k': 1})

# Create a chatbot Question & Answer chain from the retriever
chain_parent = prettifyChain(RetrievalQA.from_chain_type(
    chat_api, 
    chain_type="stuff", 
    retriever=retriever_parent,
    chain_type_kwargs={"verbose": True}
))
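
The retrieval query above swaps each matching chunk for the full text of its parent Section: it walks from the chunk to its parent, collects all chunks under that parent, orders them by `chunkSeqId`, and joins their text. Here is a minimal in-memory sketch of that logic, assuming plain dictionaries in place of the graph; the function name `parent_context` and both data structures are hypothetical.

```python
def parent_context(hit_id, chunks, section_of):
    """Return the joined text of every chunk that shares a parent with the hit.

    chunks:     chunkId -> {"chunkSeqId": int, "text": str}
    section_of: chunkId -> sectionId of the parent Section
    """
    parent = section_of.get(hit_id)
    if parent is None:
        # Like the OPTIONAL MATCH: no parent means nothing to collect
        return ""
    siblings = [
        chunk for chunk_id, chunk in chunks.items()
        if section_of.get(chunk_id) == parent
    ]
    # Restore the original reading order before joining
    siblings.sort(key=lambda chunk: chunk["chunkSeqId"])
    return " \n ".join(chunk["text"] for chunk in siblings)
```

The trade-off is that the vector search still matches on small, focused chunks, while the language model receives the larger surrounding section as context.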


In [None]:
docs = retriever_parent.invoke("what happens if a player is injured")
docs

In [None]:
chain_parent("what happens if a player is injured")

In [None]:

retrieval_query_parent = """
    WITH node AS chunk, score AS score ORDER BY score 
    OPTIONAL MATCH (chunk:Chunk)<-[:CONTAINS]-(parent)
    OPTIONAL MATCH (parent)-[:CONTAINS]->(s:Chunk)
    WITH chunk, s, score ORDER BY s.chunkSeqId ASC
    WITH collect(s.text) as textList, chunk.text as text, score AS score
    RETURN apoc.text.join(textList, " \n ")  as text,
    score,  {} AS metadata 
    ORDER BY score desc
"""


def neo4j_vector_search_2(question, retrieval_query):
  """Search the Neo4j vector index, then post-process the hits with retrieval_query"""
  # The retrieval query is appended so it can use the yielded node and score
  vector_search_query = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $question_embedding) 
        YIELD node, score
  """ + retrieval_query

  print("Using vector index: " + str(VECTOR_INDEX_NAME))

  question_embedding = embeddings_api.embed_query(question)
  return gdb.execute_query(vector_search_query,
                      question_embedding=question_embedding,
                      index_name=VECTOR_INDEX_NAME,
                      top_k=10
                    ).records

search_results = neo4j_vector_search_2(
    'injured player', retrieval_query_parent
)
search_results[0]
    

### To view one `Section` node and its chunks

```cypher
MATCH (c:Chunk)<-[:CONTAINS]-(s:Section)
WHERE c.header4 = '8.1. INJURED PLAYER'
RETURN c, s
```

### Remove all `Section` and `Document` nodes

In [None]:
cypher_remove_document_nodes = """
MATCH (d:Document) DETACH DELETE d
"""

cypher_remove_section_nodes = """
MATCH (d:Section) DETACH DELETE d
"""

gdb.execute_query(cypher_remove_document_nodes)
gdb.execute_query(cypher_remove_section_nodes)