# Expand Context with Hypothetical Queries


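This notebook adds `Question` nodes to the graph and links each one to the `Section` it was generated from. The new part of the graph will look like this: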

```cypher
(:Question
  questionId: string
  question: string
  embedding: float[] // vector embedding of question
)

```

```cypher
// Section has Question
(:Section)-[:HAS_QUESTION]->(:Question)
```

## Setup

In [11]:
%run 'shared.ipynb'

The dotenv extension is already loaded. To reload it, use:
  %reload_ext dotenv
Connecting to Neo4j at bolt://neo4j-1:7687 as neo4j
Using data from /home/jovyan/data/single
Embedding with ollama using mxbai-embed-large
Chatting with ollama using llama3


# Enhance - create hypothetical question nodes with embeddings

For each `Section` node that contains `Chunk` nodes (which hold the text), create at most **3** short hypothetical questions a user might ask about the text in that section.

For each question, an embedding is created and stored, together with the question text, in a new `Question` node.

### Let's see a good prompt for the LLM to create questions

In [22]:
text = "The Goalkeeper Restricted Area is a trapezoidal zone marked behind each goal on the ice surface, defined by two red lines that measure 6.80m along the goal line and 8.60m along the boards"
result = chat_api.invoke(
      f"""Generate a maximum of 3 short hypothetical questions based on the information from the text\n {text}. Only respond with the questions delimited by a newline (\n) and do not return "Here are 3 hypothetical questions based on the information:" 
      """)
# llama3 separated the questions with blank lines in this run, hence the split on '\n\n'
print(result.content.split('\n\n'))

['What is the total area of the Goalkeeper Restricted Area?', 'How many meters longer is one side of the trapezoid compared to the other?', "If a player's skate extends beyond the red line, but does not enter the Goalkeeper Restricted Area, is it considered an infraction?"]
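
The response format is not guaranteed: the prompt asks for newline-delimited questions, yet the model may add blank lines, list numbering, or an introductory sentence. A minimal sketch of a more tolerant parser follows. `parse_questions` and its `max_questions` parameter are hypothetical names, not part of the notebook or of any library.

```python
import re


def parse_questions(response_text: str, max_questions: int = 3) -> list[str]:
    """Split an LLM response into clean question strings (hypothetical helper)."""
    questions = []
    for line in response_text.splitlines():
        # Drop a leading list marker such as "1.", "2)", ")", "-" or "*",
        # but only when it is followed by whitespace.
        cleaned = re.sub(r"^\s*(?:\d*[.)]|[-*•])\s+", "", line).strip()
        # Keep only lines that look like questions; this also skips
        # blank lines and introductory sentences.
        if cleaned.endswith("?"):
            questions.append(cleaned)
    # Enforce the limit even if the model returned more questions.
    return questions[:max_questions]


sample = "Here are 3 questions:\n\n1) What is X?\n2. How wide is Y?\n\n- Why Z?\n* Extra one?"
print(parse_questions(sample))
# → ['What is X?', 'How wide is Y?', 'Why Z?']
```

Splitting on single lines and filtering works whether the model uses one newline or two, so it could stand in for `result.content.split('\n\n')` if the output format varies between runs.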


### Create the vector index for the embedding on the `Question` node

In [3]:
# Create an embedding to find out the dimensions of the vector
text_embedding = embeddings_api.embed_query("embed this text using an LLM")
vector_dimensions = len(text_embedding) 

print(f"Text embeddings will have {vector_dimensions} dimensions")
# Create a vector index called "questions_vector" on the `embedding` property of nodes labeled `Question`.
gdb.execute_query("""
         CREATE VECTOR INDEX `questions_vector` IF NOT EXISTS
          FOR (s:Question) ON (s.embedding) 
          OPTIONS { indexConfig: {
            `vector.dimensions`: $vectorDimensionsParam,
            `vector.similarity_function`: 'cosine'    
         }}
""",
  vectorDimensionsParam=vector_dimensions
)

# Check the vector indexes in the graph
gdb.execute_query('SHOW VECTOR INDEXES').records

Text embeddings will have 1024 dimensions


[<Record id=11 name='chunks_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Chunk'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=neo4j.time.DateTime(2024, 6, 5, 9, 30, 46, 871000000, tzinfo=<UTC>) readCount=19>,
 <Record id=3 name='questions_vector' state='POPULATING' populationPercent=0.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Question'] properties=['embedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=None>,
 <Record id=2 name='sections_vector' state='ONLINE' populationPercent=100.0 type='VECTOR' entityType='NODE' labelsOrTypes=['Section'] properties=['summaryEmbedding'] indexProvider='vector-2.0' owningConstraint=None lastRead=None readCount=0>]

### Generate questions for all `Section` nodes that have a direct `CONTAINS` relationship to a `Chunk`

In [26]:
%%time

import uuid

query = """
MATCH (s:Section)-[:CONTAINS]->(c:Chunk)
WITH s, collect(c.text) as textList
RETURN s.sectionId AS sectionId, s.level AS level, apoc.text.join(textList, " \n ") AS text
"""

sections_with_text = gdb.execute_query(query).records
for section in sections_with_text[0:]:
    question_info = {}
    text = section["text"]
    result = chat_api.invoke(f"""
                Generate a maximum of 3 short hypothetical questions based on the information from the text\n {text}. Only respond with the questions delimited by a newline (\n) and do not return "Here are 3 hypothetical questions based on the information:"
                """ 
    )    
    questions = result.content.split('\n\n')
    for question in questions:
        if '?' in question:
            question_info['question'] = question
            question_info['sectionId'] = section["sectionId"]
            question_info['questionId'] = str(uuid.uuid4())
        
            question_embedding = embeddings_api.embed_query(question)
            question_info['embedding'] = question_embedding
            
            print(f"Inserting Question node with question and embedding link it to Section with \n - ID={question_info['sectionId']}\n - text={question_info['question']}")
            print(f"----------------------------------------------")
            gdb.execute_query("""
              MERGE (q:Question {questionId: $questionInfoParam.questionId} )
                SET q.question = $questionInfoParam.question 
              WITH q
                CALL db.create.setNodeVectorProperty(q, "embedding", $questionInfoParam.embedding)
              MATCH (s:Section {sectionId: $questionInfoParam.sectionId} )
              MERGE (s)-[r:HAS_QUESTION]->(q)
            """, 
            questionInfoParam=question_info
            )


Inserting Question node with question and embedding link it to Section with 
 - ID=ca4f9dcf204e2037bfe5884867bead98bd9cbaf8
 - text=) What would happen if different countries had their own unique rules for ice hockey, and how would this affect the game's global unity?
----------------------------------------------
Inserting Question node with question and embedding link it to Section with 
 - ID=ca4f9dcf204e2037bfe5884867bead98bd9cbaf8
 - text=) How does the IIHF Championship program ensure that all players, regardless of age or gender, have a fair and leveled standard of play?
----------------------------------------------
Inserting Question node with question and embedding link it to Section with 
 - ID=ca4f9dcf204e2037bfe5884867bead98bd9cbaf8
 - text=) Can you think of any situations where the speed and excitement of ice hockey might clash with the importance of fair play and respect, and how would you resolve these conflicts?
----------------------------------------------
Inserting
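
Once the `Question` nodes exist, they can serve as an additional entry point for retrieval: embed the user's question, find the nearest hypothetical questions, and follow `HAS_QUESTION` back to the sections they came from. The sketch below is one possible way to do that, not code from this notebook. It assumes the `gdb` driver and `embeddings_api` objects loaded from `shared.ipynb`, the `questions_vector` index created earlier, and Neo4j's `db.index.vector.queryNodes` procedure. `find_sections_by_question` and `top_k` are hypothetical names.

```python
def find_sections_by_question(gdb, embeddings_api, user_question, top_k=3):
    """Return (sectionId, question, score) tuples for the closest hypothetical questions.

    Hypothetical helper: `gdb` is assumed to expose `execute_query` like the
    Neo4j Python driver, and `embeddings_api` to expose `embed_query`.
    """
    # Embed the user's question with the same model used for the Question nodes.
    query_embedding = embeddings_api.embed_query(user_question)

    records = gdb.execute_query(
        """
        CALL db.index.vector.queryNodes('questions_vector', $topK, $queryEmbedding)
        YIELD node AS q, score
        MATCH (s:Section)-[:HAS_QUESTION]->(q)
        RETURN s.sectionId AS sectionId, q.question AS question, score
        ORDER BY score DESC
        """,
        topK=top_k,
        queryEmbedding=query_embedding,
    ).records

    return [(r["sectionId"], r["question"], r["score"]) for r in records]
```

A call such as `find_sections_by_question(gdb, embeddings_api, "How wide is the Goalkeeper Restricted Area?")` would then return the matching section IDs, whose `Chunk` text can be added to the context passed to the chat model.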

### Remove all generated `Question` nodes

In [25]:
cypher_remove_questions = """
MATCH (q:Question) DETACH DELETE q
"""

gdb.execute_query(cypher_remove_questions)

EagerResult(records=[], summary=<neo4j._work.summary.ResultSummary object at 0xffff2d075450>, keys=[])