# Document Augmentation through Question Generation for Enhanced Retrieval

This notebook demonstrates a text augmentation technique that uses question generation to improve document retrieval from a vector database. For each text chunk, an LLM generates questions that the chunk can answer, and these questions are stored alongside the chunk before embedding. This increases the likelihood of retrieving relevant documents that can serve as context for generative question answering.

In [19]:
!pip install -q llama_index


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [21]:
from dotenv import load_dotenv

load_dotenv()

True

## Loading Documents and Text Chunking

In [64]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./data1").load_data()

In [22]:
from llama_index.llms.groq import Groq
llm = Groq(model="llama3-70b-8192")



In [23]:
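from llama_index.embeddings.ollama import OllamaEmbedding

# Embeddings are computed with an Ollama-served embedding model.
embed_model = OllamaEmbedding(model_name="nomic-embed-text:latest")

# Optional: use an Ollama model as the LLM instead of Groq.
# from llama_index.llms.ollama import Ollama
# llm = Ollama(model="mistral")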

In [82]:
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)
len(nodes)

61

## Question Generation and Augmentation

This step is what distinguishes the approach from naive RAG. Instead of embedding the chunks directly, we first use an LLM to generate relevant questions, with each chunk serving as context. We then append these questions to the chunk and embed the augmented text using `VectorStoreIndex`.

Because the generated questions resemble the queries users are likely to ask, the augmented chunks tend to match those queries more closely, which improves retrieval quality in vector-based document search systems.
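The effect can be illustrated with a toy example. The sketch below is not the notebook's pipeline: it swaps the embedding model for a simple bag-of-words cosine similarity, and the chunk, questions, and helper names (`bag_of_words`, `cosine_similarity`) are made up for illustration. It only shows the direction of the effect, namely that appending question-like text can raise the similarity between a query and a chunk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy data written for this illustration only (not taken from the index).
chunk = "Lion Air Flight 610 crashed on October 29, 2018, killing 189 people."
generated_questions = [
    "How many people died in the Lion Air Flight 610 crash?",
    "When did Lion Air Flight 610 crash?",
]
query = "How many people died in the Lion Air crash?"

# Same augmentation format as the notebook: chunk, blank line, questions.
augmented_chunk = chunk + "\n\nRelated Questions:\n" + "\n".join(generated_questions)

plain_score = cosine_similarity(bag_of_words(query), bag_of_words(chunk))
augmented_score = cosine_similarity(bag_of_words(query), bag_of_words(augmented_chunk))

print(f"query vs. chunk alone:     {plain_score:.3f}")
print(f"query vs. augmented chunk: {augmented_score:.3f}")
```

A real embedding model captures meaning rather than word overlap, so the actual gain depends on the model and the data; the toy scores should not be read as a benchmark.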

In [68]:
def generate_questions_for_node(node, num_questions=3):
    """Ask the LLM for questions that the node's text can answer."""
    prompt = f"""Generate {num_questions} relevant questions that could be answered by the following text:

\"\"\"{node.text}\"\"\"

Questions:"""
    response = llm.complete(prompt)
    # Split the completion into lines, dropping blank lines and stray whitespace.
    questions = [line.strip() for line in response.text.splitlines() if line.strip()]
    return questions
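LLM completions often include a preamble ("Here are 3 questions:") or list numbering, and splitting on newlines alone keeps that noise in the embedded text. The following is a minimal sketch of a stricter parser, assuming questions arrive one per line and end with a question mark; `parse_questions` is a hypothetical helper, not part of llama_index, and the sample completion is invented.

```python
import re


def parse_questions(raw_text, num_questions=3):
    """Extract up to `num_questions` question lines from an LLM completion."""
    questions = []
    for line in raw_text.splitlines():
        # Drop leading list markers such as "1.", "2)", "-", "*", or a bullet.
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        # Keep only lines that look like questions; this skips blank lines
        # and preambles such as "Here are 3 relevant questions:".
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions[:num_questions]


# Invented completion text, used only to exercise the parser.
sample_completion = """Here are 3 relevant questions:

1. What did the Boeing bulletin warn operators about?
2. When did Boeing issue the bulletin?
3) Which aircraft models were covered?
"""

parsed = parse_questions(sample_completion)
for question in parsed:
    print(question)
```

Inside `generate_questions_for_node`, this could replace the plain line split by passing `response.text` and `num_questions` to the helper.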

In [69]:
augmented_nodes = []
for k, node in enumerate(nodes, start=1):
    questions = generate_questions_for_node(node)
    question_text = "\n".join(questions)
    # Append the generated questions so they are embedded together with the chunk.
    # Note that this modifies the nodes in `nodes` in place.
    node.text = node.text + "\n\n" + "Related Questions:\n" + question_text
    augmented_nodes.append(node)
    print("Q generated for Chunk", k)

Q generated for Chunk 1
Q generated for Chunk 2
Q generated for Chunk 3
...
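The augmentation loop issues one LLM request per chunk (61 chunks here), so a single transient error such as a rate limit aborts the whole run. A minimal sketch of a retry wrapper with exponential backoff follows; `call_with_retries` and `flaky_generate` are hypothetical names introduced for this illustration, and the simulated failure stands in for whatever exception the LLM client might raise.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, **kwargs):
    """Call `fn`, retrying with exponential backoff if it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for the question generator: fails twice, then succeeds.
calls = {"count": 0}


def flaky_generate(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return ["What is discussed in: " + text + "?"]


result = call_with_retries(flaky_generate, "chunk 1", attempts=3, base_delay=0.01)
print(result, "after", calls["count"], "calls")
```

In the loop, the call would become `call_with_retries(generate_questions_for_node, node)`. Catching every `Exception` is a simplification; narrowing it to the client's rate-limit or timeout errors would be safer.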

### Example Augmented Node

In [70]:
print(augmented_nodes[1].text)

Air safety, enforced in large 
part by the certification process of the Federal Aviation Administration (FAA), was enjoying an 
exemplary record: Globally, the five-year worldwide average stood at one fatal airliner crash for every 
2.5 million to 3 million flights. In the United States, airline safety had reached record levels, with only 
one passenger fatality in more than 10 years.2   
 
Boeing’s initial response to the crash focused on Lion Air’s airline maintenance procedures and 
suggested the pilots were at fault.3 Nonetheless, about eight days later, on November 6, 2018, Boeing 
issued a bulletin to all 737 MAX 8 and 737 MAX 9 operators indicating that “erroneous angle-of-attack 
data” could result in “uncommanded nose-down movement of the aircraft and that this action can repeat 
until the related system is deactivated.”4 The Boeing bulletin provided additional instructions to pilots 
who might encounter such a dangerous situation. On November 7, 2018, the FAA followed by is

## Embedding

In [81]:
from llama_index.core import VectorStoreIndex

# Embed the augmented nodes (chunk text plus generated questions) and build the index.
index = VectorStoreIndex(augmented_nodes, embed_model=embed_model)
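Because the questions are appended to `node.text`, they are part of whatever text is later handed to the LLM as context, not only of what gets embedded. If the generated questions should not appear in the answer context, one option is to strip them from the retrieved text first. The sketch below assumes the exact `"Related Questions:"` marker used during augmentation; `strip_generated_questions` is a hypothetical helper and the sample text is invented.

```python
MARKER = "Related Questions:"


def strip_generated_questions(text, marker=MARKER):
    """Return the text before the last '\\n\\n<marker>' block, if present."""
    head, separator, _ = text.rpartition("\n\n" + marker)
    # rpartition returns an empty separator when the marker is absent,
    # in which case the text is returned unchanged.
    return head if separator else text


# Invented example in the same format the augmentation step produces.
augmented = (
    "Boeing issued a bulletin to 737 MAX operators."
    "\n\nRelated Questions:\nWhat did Boeing issue?\nWho received the bulletin?"
)

print(strip_generated_questions(augmented))
```

This relies on the marker not occurring in the source documents themselves; a less ambiguous delimiter would make the split more robust.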

## Retrieval and Inference

In [73]:
query_engine = index.as_query_engine(llm=llm)
system_prompt = "You are a helpful assistant that only responds with accurate and cited information from the context. Responses should be fairly detailed."

# The system prompt is prepended to each query, so it is part of the string sent to the query engine.
query = "How many people died in Lion Air crash, when was this and what were the main reasons for the crash?"
response = query_engine.query(f"{system_prompt}\n\n Query : {query}")
print(response)

query2 = "What are the controversies surrounding boeing aircraft's crashes?"
response = query_engine.query(f"{system_prompt}\n\n Query : {query2}")
print(response)

According to the provided text, 189 people died in the Lion Air Flight 610 crash, which occurred on October 29, 2018. The main reason for the crash was not explicitly stated in the text, but it was mentioned that the plane commenced its rapid descent at 5,000 feet, just 11 minutes after taking off from Jakarta's Soekarno-Hatta International Airport, and the pilots lost control of the aircraft.
The controversies surrounding Boeing's aircraft crashes involve high-stakes public disputes over the root causes of the crashes. Some, like U.S. Congressman Sam Graves, declared that pilots trained in the United States would have been able to handle the emergencies on both jets. Others, however, blamed MCAS and a flawed design process at Boeing. Still, others faulted the FAA. A subsequent report by the U.S. Department of Transportation's Inspector General identified "limitations in FAA's guidance and processes that impacted certification …." These included "communication gaps," "management and ov