# Expand Context

The Chunk Nodes are individual chunks of text that used
to be part of an entire document.

You can reconstruct the original context by connecting the nodes
with relationships. It also makes the data easy to navigate and understand.

That is super helpful for debugging and testing when you're building an application.

You can also provide a better user experience. Your users will be able
to directly interact with the data and even provide feedback that will
improve subsequent answers.

You will create a connected context by making the following
changes to the knowledge graph:

1. Extract: create `(:Section)` nodes from the headers on the chunks, and a `(:Document)` node for each original source document.
2. Enhance: add a summarized text property to each `(:Section)` node.
3. Expand: connect each `(:Chunk)` to the `(:Section)` node that it is part of.

The graph will look like this...

```cypher
(:Document
  documentId: string // hash of documentUri
  documentUri: string
  summary: text
  summaryEmbedding: float[] // vector embedding of summary
)

(:Section 
  sectionId: string // a unique identifier for the section
  summary: string // text summary generated with the LLM 
  summaryEmbedding: float[] // vector embedding of summary
)
```

```cypher
// Document contains sections
(:Document)-[:CONTAINS]->(:Section)

// Section contains Chunks
(:Section)-[:CONTAINS]->(:Chunk)
```

## Setup

In [248]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Using data from /home/jovyan/data/single
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


# Extract Section Nodes

There will be one Section node for each distinct header found on the Chunk nodes.

In [144]:
# Create parent sections from Chunk nodes

cypher_create_from_chunks = """
            MATCH (c:Chunk)
            WITH c, apoc.util.sha1([coalesce(c.header4, c.header3, c.header2, c.header1)]) AS sectionId
            MERGE (s:Section {sectionId: sectionId } )
            SET s.header1 = c.header1
            , s.header2 = c.header2
            , s.header3 = c.header3
            , s.documentUri = c.documentUri
            , s.level = CASE WHEN c.header4 IS NOT NULL THEN 4
                             WHEN c.header3 IS NOT NULL THEN 3
                             WHEN c.header2 IS NOT NULL THEN 2
                             WHEN c.header1 IS NOT NULL THEN 1
                        END 
            , s.text = coalesce(c.header4, c.header3, c.header2, c.header1)   
            MERGE (s)-[newRel:CONTAINS]->(c)
            """


In [145]:
gdb.execute_query(cypher_create_from_chunks)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33d63a50>, keys=[])
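
The ID and level logic in the Cypher above can be hard to read inline. As a rough illustration only, here is the same idea in plain Python: pick the deepest header that is set on a chunk, then hash it. The helper names (`deepest_header`, `section_id`) and the sample header values are made up for this sketch, and `hashlib` is standing in for `apoc.util.sha1`, so the IDs it produces are not guaranteed to match the ones stored in the graph.

```python
import hashlib


def deepest_header(chunk):
    """Return (level, text) for the most specific header set on a chunk.

    Mirrors coalesce(c.header4, c.header3, c.header2, c.header1).
    """
    for level in (4, 3, 2, 1):
        text = chunk.get(f"header{level}")
        if text is not None:
            return level, text
    return None, None


def section_id(chunk):
    """Hash the deepest header text (hashlib stands in for apoc.util.sha1)."""
    _, text = deepest_header(chunk)
    if text is None:
        return None
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Two made-up chunks under the same header end up in the same section
chunk_a = {"header1": "RULES", "header2": "INJURIES", "text": "first chunk"}
chunk_b = {"header1": "RULES", "header2": "INJURIES", "text": "second chunk"}
print(deepest_header(chunk_a))
print(section_id(chunk_a) == section_id(chunk_b))
```

Because only the header text is hashed, two sections that share a title would be merged into one Section node, even if they come from different places. That is worth keeping in mind if your documents reuse headings.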

In [148]:
# Create all other sections based on the parent nodes created before

cypher_create_other_sections = """
            MATCH (e:Section)
            WHERE e.level = $level_to_create+1
            WITH e, CASE WHEN $level_to_create = 1 THEN e.header1
                    WHEN $level_to_create = 2 THEN e.header2
                    WHEN $level_to_create = 3 THEN e.header3
                    END as text
            WITH e, text, apoc.util.sha1([ text ]) AS sectionId    
            MERGE (s:Section {sectionId: sectionId} )
            set s.header1 = e.header1
                , s.header2 = e.header2
                , s.documentUri = e.documentUri
                , s.level = $level_to_create
                , s.text = text    
            MERGE (s)-[newRel:CONTAINS]->(e)
            """


In [149]:
gdb.execute_query(cypher_create_other_sections, level_to_create=3)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33dd6f90>, keys=[])

In [150]:
gdb.execute_query(cypher_create_other_sections, level_to_create=2)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33fe4390>, keys=[])

In [151]:
gdb.execute_query(cypher_create_other_sections, level_to_create=1)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33d62ed0>, keys=[])
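
The order of the three calls above matters: level 3 parents have to exist before level 2 parents can be attached to them, and so on. The sketch below simulates that bottom-up pass with plain dictionaries so you can see the effect without a database. It is a simplification under stated assumptions: `create_parent_sections` and `sha1_hex` are hypothetical helpers, `hashlib` stands in for `apoc.util.sha1`, and every header value except `8.1. INJURED PLAYER` is made up.

```python
import hashlib


def sha1_hex(text):
    # hashlib stands in for apoc.util.sha1 in this sketch
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def create_parent_sections(sections, contains, level_to_create):
    """Create level N parents for every section currently at level N + 1."""
    for child_id, child in list(sections.items()):
        if child["level"] != level_to_create + 1:
            continue
        text = child.get(f"header{level_to_create}")
        if text is None:
            continue
        parent_id = sha1_hex(text)
        # setdefault plays the role of MERGE: reuse the parent if it exists
        parent = sections.setdefault(parent_id, {})
        parent["level"] = level_to_create
        parent["text"] = text
        # Copy the headers the parent needs so the next pass can find its parent
        for n in range(1, level_to_create + 1):
            parent[f"header{n}"] = child.get(f"header{n}")
        contains.add((parent_id, child_id))


# One level 4 section, shaped like the ones created from the chunks
leaf = {"level": 4, "text": "8.1. INJURED PLAYER",
        "header1": "RULE BOOK", "header2": "SECTION 8", "header3": "RULE 8"}
sections = {sha1_hex(leaf["text"]): leaf}
contains = set()

# Deepest level first, so each pass finds the parents made by the previous one
for level in (3, 2, 1):
    create_parent_sections(sections, contains, level)

print(sorted(s["level"] for s in sections.values()))
print(len(contains))
```

Running the passes in the opposite order would leave gaps, because a pass only looks at sections that already exist one level below the level it is creating.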

In [152]:
# Create a uniqueness constraint on the sectionId property of Section nodes
gdb.execute_query('CREATE CONSTRAINT unique_section IF NOT EXISTS FOR (n:Section) REQUIRE n.sectionId IS UNIQUE')


EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33dd9610>, keys=[])

In [107]:
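# Remove the NEXT relationships between chunks.
# Cypher has to be sent through the driver; a bare Cypher statement is not valid Python.
gdb.execute_query('MATCH (c:Chunk)-[r:NEXT]->(c1:Chunk) DELETE r')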

## Extract Document nodes

You can now create a Document node for each distinct `documentUri` value on the level 1 Section nodes.


In [159]:
# Create a Document node for each distinct documentUri and connect it to its level 1 sections

cypher_create_documents = """
            MATCH (s:Section)
            WHERE s.level = 1
            MERGE (d:Document {documentId: apoc.util.sha1([s.documentUri]) } )
            SET d.documentUri = s.documentUri
            MERGE (d)-[newRel:CONTAINS]->(s)
            """

gdb.execute_query(cypher_create_documents)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff33bce510>, keys=[])

# Enhance - create summary property, with embedding

Each Section node is connected to the Chunk nodes it contains, so the text
of a section can be collected by following the `CONTAINS` relationships.

We'll use an LLM to summarize that text and then create an embedding of the summary.

Both the text summary and the embedding will be added to the Section nodes.

In [108]:
# Create an embedding to find out the dimensions of the vector
text_embedding = embeddings_api.embed_query("embed this text using an LLM")
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")
# Create a vector index called "sections_vector" on the `summaryEmbedding` property of nodes labeled `Section`.
gdb.execute_query("""
         CREATE VECTOR INDEX `sections_vector` IF NOT EXISTS
          FOR (s:Section) ON (s.summaryEmbedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam=vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

Text embeddings will have 1024 dimensions


[<Record id=12 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 17, 54, 57, 672000000, tzinfo=<UTC>) readCount=2>,
 <Record id=6 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 17, 46, 42, 478000000, tzinfo=<UTC>) readCount=1>]
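
The index is configured with the `cosine` similarity function. As a reminder of what that measures, here is a small plain-Python sketch. The function names are just for illustration. The second helper reflects the assumption that Neo4j reports cosine scores rescaled into the 0 to 1 range; check the Neo4j documentation for the exact formula before relying on it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (-1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Map cosine from [-1, 1] to [0, 1] (assumed to match Neo4j's scoring)."""
    return (1 + cosine_similarity(a, b)) / 2


# Same direction, unrelated direction, opposite direction
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(normalized_score([1.0, 0.0], [-1.0, 0.0]))
```

Cosine similarity ignores vector length and only compares direction, which is why it is a common choice for text embeddings.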

In [None]:
%%time

query = """
MATCH (s:Section)-[:CONTAINS]->(c:Chunk)
WITH s, collect(c.text) as textList
RETURN s.sectionId AS sectionId, s.level AS level, apoc.text.join(textList, " \n ") AS text
"""

sections_with_text = gdb.execute_query(query).records
for section in sections_with_text[0:]:
    section_info = {}
    text = section["text"]
    result = chat_api.invoke(
      f"""Write a single, very brief sentence summary
       based on the following information...\n {text}. Do not respond with "Here is a brief sentence summary:" 
      """)
    summary = result.content
    print(f"Summarized {len(summary)} characters. Here's a preview...")
    print(f"\t{summary[:120]}")
    section_info['summary'] = summary
    section_info['sectionId'] = section["sectionId"]
    
    summary_embedding = embeddings_api.embed_query(summary)
    section_info['summaryEmbedding'] = summary_embedding
    print(f"\tUpdating section with ID {section_info['sectionId']} with summary and embedding...")
    gdb.execute_query("""
      MATCH (s:Section {sectionId: $sectionInfoParam.sectionId})
        SET s.summary = $sectionInfoParam.summary 
      WITH s
        CALL db.create.setNodeVectorProperty(s, "summaryEmbedding", $sectionInfoParam.summaryEmbedding)    
    """, 
    sectionInfoParam=section_info
    )
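
The loop above mixes three concerns: summarizing, embedding, and writing to the graph. One way to make it easier to test is to separate building the `section_info` dictionary from the database call. The sketch below assumes that split. `build_section_info`, `fake_summarize` and `fake_embed` are hypothetical stand-ins that are not part of the notebook, so the example runs without an LLM or Neo4j.

```python
def build_section_info(section_id, text, summarize, embed):
    """Build the dictionary that is passed to Cypher as $sectionInfoParam."""
    summary = summarize(text)
    return {
        "sectionId": section_id,
        "summary": summary,
        "summaryEmbedding": embed(summary),
    }


# Stand-in for chat_api.invoke(...).content: keep only the first sentence
def fake_summarize(text):
    return text.split(".")[0].strip() + "."


# Stand-in for embeddings_api.embed_query(...): a tiny two-number "embedding"
def fake_embed(text):
    return [float(len(text)), float(text.count(" "))]


info = build_section_info(
    "abc123",  # made-up section ID
    "Play continues after an injury. More detail follows.",
    fake_summarize,
    fake_embed,
)
print(info["summary"])
print(sorted(info.keys()))
```

In the notebook you would pass functions that wrap `chat_api` and `embeddings_api` instead of the fakes, and hand the resulting dictionary to `gdb.execute_query` as `sectionInfoParam`.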


# Parent retriever

In [250]:
search_results = neo4j_vector_search(
    'injured Players', 'chunks_vector'
)
for result in search_results:
    print (str(result["score"]) + " : " + result["node"]["chunkId"] + " : " + result["node"]["text"] )
    print ("------------")


Using vector index: chunks_vector
0.8543055653572083 : 9c8d74e33176f1e321d886cb3f7121a0da0e0307 : When a Player is injured so that they cannot continue play or go to their Players’ Bench, the play shall not be stopped until the inju- red Player’s Team has secured control of the puck. If the Player’s Team is in “control of the puck” at the time of injury, play shall be stopped immediately unless their Team is in a scoring position.
------------
0.8513350486755371 : 831977f9548f9ecdcfd78c6289ebf3ae312ce9de : When a Player is injured or compelled to leave the ice during a Game, they may retire from the Game and be replaced by a substitute, but play must continue without the Teams leaving the ice.  
During the play, if an injured Player wishes to retire from the ice and be replaced by a substitute, they must do so at the Players’ Bench and not through any other exit leading from the Rink. This is not a legal Player change and therefore when a violation occurs, a Bench-minor Penalty shall b

In [267]:
retrieval_query_parent = """

    WITH node AS chunk, score AS score ORDER BY score 
    OPTIONAL MATCH (chunk:Chunk)<-[:CONTAINS]-(parent)
    OPTIONAL MATCH (parent)-[:CONTAINS]->(s:Chunk)
    WITH chunk, s, score ORDER BY s.chunkSeqId ASC
    WITH collect(s.text) as textList, chunk.text as text, score AS score
    RETURN apoc.text.join(textList, " \n ")  as text,
    score,  {} AS metadata 
    ORDER BY score desc

"""

vector_store_parent = Neo4jVector.from_existing_index(
    embedding=embeddings_api,
    url=NEO4J_URI,
    username=NEO4J_USERNAME,
    password=NEO4J_PASSWORD,
    database="neo4j",
    index_name=VECTOR_INDEX_NAME,
    text_node_property=VECTOR_SOURCE_PROPERTY,
    retrieval_query=retrieval_query_parent
)

# Create a retriever from the vector store
retriever_parent = vector_store_parent.as_retriever(search_kwargs={'k': 10})

# Create a chatbot Question & Answer chain from the retriever
chain_parent = prettifyChain(RetrievalQA.from_chain_type(
    chat_api, 
    chain_type="stuff", 
    retriever=retriever_parent,
    chain_type_kwargs={"verbose": True}
))
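
If the retrieval query is hard to follow, it may help to see the same steps on plain Python data. This is only a sketch of the idea: for every chunk the vector search matched, find its parent, gather all chunks of that parent in `chunkSeqId` order, and return the joined text instead of the single chunk. The function `expand_to_parent` and all of the sample data are made up for illustration.

```python
def expand_to_parent(matches, parent_of, chunks_in):
    """For each matched chunk, return the joined text of all chunks in its parent.

    matches:   list of (chunk_id, score) pairs from the vector search
    parent_of: dict mapping chunk_id -> section_id
    chunks_in: dict mapping section_id -> list of chunk dicts
    """
    results = []
    for chunk_id, score in matches:
        # A chunk without a parent yields empty text, like OPTIONAL MATCH
        siblings = chunks_in.get(parent_of.get(chunk_id), [])
        ordered = sorted(siblings, key=lambda c: c["chunkSeqId"])
        results.append({
            "text": " \n ".join(c["text"] for c in ordered),
            "score": score,
            "metadata": {},
        })
    # Best match first, like ORDER BY score DESC
    return sorted(results, key=lambda r: r["score"], reverse=True)


parent_of = {"c1": "s1", "c2": "s1", "c3": "s2"}
chunks_in = {
    "s1": [{"chunkSeqId": 1, "text": "second part"},
           {"chunkSeqId": 0, "text": "first part"}],
    "s2": [{"chunkSeqId": 0, "text": "other section"}],
}
results = expand_to_parent([("c3", 0.70), ("c2", 0.85)], parent_of, chunks_in)
print(results[0]["text"])
```

Note that when several matched chunks share a parent, the same parent text is returned once per match. With `k` set to 10 that can mean a lot of repeated context in the prompt.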


In [268]:
docs = retriever_parent.invoke("what happens if a player is injured")
docs

[Document(page_content='When a Player is injured or compelled to leave the ice during a Game, they may retire from the Game and be replaced by a substitute, but play must continue without the Teams leaving the ice.  \nDuring the play, if an injured Player wishes to retire from the ice and be replaced by a substitute, they must do so at the Players’ Bench and not through any other exit leading from the Rink. This is not a legal Player change and therefore when a violation occurs, a Bench-minor Penalty shall be imposed. \n If a penalized Player has been injured, they may proceed to the Dressing Room without taking a seat in the Penalty Box. The penalized Team shall immediately put a substitute Player in the Penalty Box, who shall serve the penalty until the injured Player is able to return to the game. They would replace their Teammate in the Penalty Box at the next stoppage of play.  \nFor violation of this rule, a Bench Minor Penalty shall be imposed. \n Should the injured penalized Pl

In [269]:
chain_parent("what happens if a player is injured")



> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
When a Player is injured or compelled to leave the ice during a Game, they may retire from the Game and be replaced by a substitute, but play must continue without the Teams leaving the ice.  
During the play, if an injured Player wishes to retire from the ice and be replaced by a substitute, they must do so at the Players’ Bench and not through any other exit leading from the Rink. This is not a legal Player change and therefore when a violation occurs, a Bench-minor Penalty shall be imposed. 
 If a penalized Player has been injured, they may proceed to the Dressing Room without taking a seat in the Penalty Box. The penalized Team shall immediately put a substitute Playe

In [266]:


retrieval_query_parent = """
    WITH node AS chunk, score AS score ORDER BY score 
    OPTIONAL MATCH (chunk:Chunk)<-[:CONTAINS]-(parent)
    OPTIONAL MATCH (parent)-[:CONTAINS]->(s:Chunk)
    WITH chunk, s, score ORDER BY s.chunkSeqId ASC
    WITH collect(s.text) as textList, chunk.text as text, score AS score
    RETURN apoc.text.join(textList, " \n ")  as text,
    score,  {} AS metadata 
    ORDER BY score desc
"""


def neo4j_vector_search_2(question, retrieval_query):
  """Run a vector index search, then post-process the matches with retrieval_query"""
  # The retrieval query is appended after the vector search, so it can
  # use the `node` and `score` variables yielded by the procedure
  vector_search_query = """
    CALL db.index.vector.queryNodes($index_name, $top_k, $question_embedding) 
        YIELD node, score
  """ + retrieval_query

  print ("Using vector index: " + str(VECTOR_INDEX_NAME))
    
  question_embedding = embeddings_api.embed_query(question)
  return gdb.execute_query(vector_search_query,
                      question_embedding=question_embedding, 
                      index_name=VECTOR_INDEX_NAME, 
                      top_k=10
                    ).records

search_results = neo4j_vector_search_2(
    'injured player', retrieval_query_parent
)
search_results[0]
    

Using vector index: chunks_vector


<Record text='When a Player is injured or compelled to leave the ice during a Game, they may retire from the Game and be replaced by a substitute, but play must continue without the Teams leaving the ice.  \nDuring the play, if an injured Player wishes to retire from the ice and be replaced by a substitute, they must do so at the Players’ Bench and not through any other exit leading from the Rink. This is not a legal Player change and therefore when a violation occurs, a Bench-minor Penalty shall be imposed. \n If a penalized Player has been injured, they may proceed to the Dressing Room without taking a seat in the Penalty Box. The penalized Team shall immediately put a substitute Player in the Penalty Box, who shall serve the penalty until the injured Player is able to return to the game. They would replace their Teammate in the Penalty Box at the next stoppage of play.  \nFor violation of this rule, a Bench Minor Penalty shall be imposed. \n Should the injured penalized Player who h

In [None]:
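# Look up the chunks under one header together with their parent Section
gdb.execute_query("""
    MATCH (c:Chunk)<-[:CONTAINS]-(s:Section)
    WHERE c.header4 = '8.1. INJURED PLAYER'
    RETURN c, s
""").records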
    