# Expand Context with Hypothetical Questions

The Chunk nodes are individual pieces of text that used
to be part of an entire document.

You can reconstruct the original context by connecting the nodes
with relationships. Doing so also makes the data easier to navigate and understand.

That is helpful for debugging and testing while you're building an application.

You can also provide a better user experience. Your users will be able
to interact directly with the data and even provide feedback that will
improve subsequent answers.

You will create a connected context by making the following
changes to the knowledge graph:

1. Extract: collect the text of each `(:Section)` from its `(:Chunk)` nodes.
2. Enhance: generate hypothetical questions for each section and create a `(:Question)` node, with an embedding, for each one.
3. Expand: connect each `(:Question)` to the `(:Section)` it was generated from.

The new part of the graph will look like this:

```cypher
(:Question
  questionId: string
  question: string
  embedding: float[] // vector embedding of question
)

```

```cypher
// Section has Question
(:Section)-[:HAS_QUESTION]->(:Question)
```

## Setup

In [99]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Using data from /home/jovyan/data/single
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


# Enhance - create hypothetical question nodes, with embedding

Each `Section` node is connected to the `Chunk` nodes that hold its text.
Joining the chunk text gives the full text of a section.

We'll use an LLM to generate hypothetical questions from that text, then create an embedding for each question.

Both the question text and its embedding will be stored on `Question` nodes. The first cell tries the prompt on a single piece of text.

In [166]:
text = "The Goalkeeper Restricted Area is a trapezoidal zone marked behind each goal on the ice surface, defined by two red lines that measure 6.80m along the goal line and 8.60m along the boards"
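# Ask the LLM for up to 3 questions; "\\n" keeps a literal \n in the prompt text.
result = chat_api.invoke(
      f"""Generate a maximum of 3 short hypothetical questions based on the information from the text\n {text}. Only respond with the questions delimited by a newline (\\n) and do not return "Here are some hypothetical questions based on the information:"
      """)
# The model may separate questions with one or more newlines, so drop empty lines.
print([line.strip() for line in result.content.splitlines() if line.strip()])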

['What is the total area of the Goalkeeper Restricted Area?', 'How many meters longer is one side of the trapezoid compared to the other?', "If a player's skate extends 0.1m beyond the red line, will it be considered out of bounds?"]
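LLM replies do not always follow the requested format: a model may add a preamble, number the questions, or mix single and double newlines. The following is a minimal sketch of a more defensive parser. The helper name `parse_questions` and the rule "keep only lines ending in a question mark" are assumptions for illustration, not part of the original notebook.

```python
import re


def parse_questions(raw: str, limit: int = 3) -> list[str]:
    """Split an LLM reply into clean question strings (hypothetical helper)."""
    questions = []
    for line in raw.splitlines():
        # Remove leading list markers such as "1.", "2)", "-" or "*",
        # but only when they are followed by whitespace.
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s+", "", line).strip()
        # Keep only lines that look like questions; this drops blank
        # lines and preambles such as "Here are some questions:".
        if cleaned.endswith("?"):
            questions.append(cleaned)
    # Enforce the maximum even if the model returned more.
    return questions[:limit]
```

In the cells of this notebook it could stand in for the split on `result.content`, for example `parse_questions(result.content)`.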


In [124]:
# Create an embedding to find out the dimensions of the vector
text_embedding = embeddings_api.embed_query("embed this text using an LLM")
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")
# Create a vector index called "questions_vector" on the `embedding` property of nodes labeled `Question`.
gdb.execute_query("""
         CREATE VECTOR INDEX `questions_vector` IF NOT EXISTS
          FOR (s:Question) ON (s.embedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam=vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

Text embeddings will have 1024 dimensions


[<Record id=12 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 18, 0, 56, 401000000, tzinfo=<UTC>) readCount=5>,
 <Record id=2 name='questions_vector' state='POPULATING' populationPercent=0.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Question'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=None>,
 <Record id=6 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 2, 17, 46, 42, 478000000, tzinfo=<UTC>) readCount=1>]
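Once `Question` nodes exist (they are created in the cells that follow) and the index state is `ONLINE`, the index can be used to expand context: embed the user's question, find the nearest hypothetical questions, and follow the relationship back to their sections. The sketch below assumes the `HAS_QUESTION` relationship type used by the insertion query in this notebook. The function name `find_sections_by_question` and its parameters are hypothetical; `driver` is expected to behave like `gdb`, and `embed` like `embeddings_api.embed_query`.

```python
def find_sections_by_question(driver, embed, user_question, top_k=3):
    """Return sections whose hypothetical questions best match user_question.

    driver: object with a Neo4j-style execute_query(query, **params) method
    embed:  callable that turns a string into a list of floats
    """
    query_vector = embed(user_question)
    result = driver.execute_query(
        """
        CALL db.index.vector.queryNodes('questions_vector', $topK, $queryVector)
        YIELD node AS q, score
        MATCH (s:Section)-[:HAS_QUESTION]->(q)
        RETURN s.sectionId AS sectionId, q.question AS question, score
        ORDER BY score DESC
        """,
        topK=top_k,
        queryVector=query_vector,
    )
    # Convert each record into a plain dict for easier downstream use.
    return [dict(record) for record in result.records]
```

A call in this notebook might look like `find_sections_by_question(gdb, embeddings_api.embed_query, "How wide is the restricted area?")`. Several matching questions can point at the same section, so callers may want to deduplicate by `sectionId`.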

In [None]:
%%time

import uuid

query = """
MATCH (s:Section)-[:CONTAINS]->(c:Chunk)
WITH s, collect(c.text) as textList
RETURN s.sectionId AS sectionId, s.level AS level, apoc.text.join(textList, " \n ") AS text
"""

sections_with_text = gdb.execute_query(query).records
for section in sections_with_text:
    text = section["text"]
    # "\\n" keeps a literal \n in the prompt text.
    result = chat_api.invoke(f"""
                Generate a maximum of 3 short hypothetical questions based on the information from the text\n {text}. Only respond with the questions delimited by a newline (\\n) and do not return "Here are some hypothetical questions based on the information:"
                """
    )
    # The model may separate questions with one or more newlines, so drop empty lines.
    questions = [line.strip() for line in result.content.splitlines() if line.strip()]
    for question in questions:
        # Build a fresh parameter dict for every question.
        question_info = {
            'question': question,
            'sectionId': section["sectionId"],
            'questionId': str(uuid.uuid4()),
            'embedding': embeddings_api.embed_query(question),
        }
        print(f"Inserting Question node with question and embedding link it to Section with ID {question_info['sectionId']} ...")

        # Create the Question node, store its embedding, and link it to its Section.
        gdb.execute_query("""
          MERGE (q:Question {questionId: $questionInfoParam.questionId} )
            SET q.question = $questionInfoParam.question 
          WITH q
            CALL db.create.setNodeVectorProperty(q, "embedding", $questionInfoParam.embedding)
          MATCH (s:Section {sectionId: $questionInfoParam.sectionId} )
          MERGE (s)-[r:HAS_QUESTION]->(q)
        """, 
        questionInfoParam=question_info
        )


Inserting Question node with question and embedding link it to Section with ID ca4f9dcf204e2037bfe5884867bead98bd9cbaf8 ...
Inserting Question node with question and embedding link it to Section with ID ca4f9dcf204e2037bfe5884867bead98bd9cbaf8 ...
Inserting Question node with question and embedding link it to Section with ID ca4f9dcf204e2037bfe5884867bead98bd9cbaf8 ...
Inserting Question node with question and embedding link it to Section with ID 70669d65799f0b85f53fa4d87f745b1ec129af58 ...
Inserting Question node with question and embedding link it to Section with ID 70669d65799f0b85f53fa4d87f745b1ec129af58 ...
Inserting Question node with question and embedding link it to Section with ID 70669d65799f0b85f53fa4d87f745b1ec129af58 ...
Inserting Question node with question and embedding link it to Section with ID 3256ff621b0e47207f42b2db33a4589ec1ddd458 ...
Inserting Question node with question and embedding link it to Section with ID 3256ff621b0e47207f42b2db33a4589ec1ddd458 ...


In [136]:
import uuid

uuid4 = uuid.uuid4()
print(str(uuid4))

c751aae4-dfff-4f5f-ab8a-ddea14154037
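Because `uuid.uuid4()` is random, re-running the insertion cell gives every question a new `questionId`, so `MERGE` cannot match nodes from an earlier run and duplicates may accumulate. One possible alternative, shown here only as a sketch, is a name-based `uuid.uuid5` ID derived from the section ID and the question text. The namespace URL and the helper name `stable_question_id` are made up for illustration.

```python
import uuid

# Hypothetical namespace for this project; any fixed UUID works
# as long as it never changes between runs.
QUESTION_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.com/hypothetical-questions"
)


def stable_question_id(section_id: str, question: str) -> str:
    """Derive a repeatable ID from the section ID and the question text."""
    # Normalize whitespace and case so trivial variations map to the same ID.
    normalized = " ".join(question.split()).lower()
    return str(uuid.uuid5(QUESTION_NAMESPACE, f"{section_id}:{normalized}"))
```

With IDs like these, `MERGE` would match an existing node when the LLM returns the same question for the same section again. Since LLM output usually varies between runs, this only removes exact repeats.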
