# Generation-aware Retrieval

Generation-aware retrieval consists of these optional components:
* Reranking
* Filtering
* Disambiguation followups
* Query decomposition
* Personalization

In [1]:
#%pip install --quiet llama-index llama-index-llms-gemini llama-index-embeddings-huggingface pydantic-ai python-dotenv nest_asyncio

In [1]:
MODEL_ID = "gemini-2.0-flash"
EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
assert os.environ["GEMINI_API_KEY"][:2] == "AI",\
       "Please specify the GEMINI_API_KEY access token in keys.env file"
assert os.environ["HF_TOKEN"][:2] == "hf",\
       "Please specify the HF_TOKEN access token in keys.env file"

In [2]:
import sys
sys.path.append('../basic_rag')
import gutenberg_text_loader as gtl
import logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

In [29]:
# Allow Pydantic AI's run_sync() to work inside Jupyter's already-running event loop
import nest_asyncio
nest_asyncio.apply()

The examples here use two geology texts from Project Gutenberg:
* An 1878 book, The Student's Elements of Geology: https://www.gutenberg.org/cache/epub/3772/pg3772.txt
* A 1905 book, The Elements of Geology: https://www.gutenberg.org/cache/epub/4204/pg4204.txt

## Plain Semantic Indexing as a baseline for comparison

In [10]:
#!rm -rf .cache vector_index   # uncomment to start afresh

In [12]:
!ls ./.cache vector_index

./.cache:
pg3772_3736454afe.txt  pg4204_81e8e90db3.txt

vector_index:
default__vector_store.json  graph_store.json	      index_store.json
docstore.json		    image__vector_store.json




In [13]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core import Document
import os
import pathlib

INDEX_DIR="vector_index"
Settings.embed_model = HuggingFaceEmbedding(
    model_name=EMBED_MODEL_ID
)

# these are the defaults in LlamaIndex
Settings.chunk_size = 1024; Settings.chunk_overlap = 20; TOP_K=2
#Settings.chunk_size = 100; Settings.chunk_overlap = 10; TOP_K=4

if os.path.isdir(INDEX_DIR):
    print("Loading in already created index")
    storage_context = StorageContext.from_defaults(persist_dir=INDEX_DIR)
    index = load_index_from_storage(storage_context)
else:
    # downloads into .cache the first time
    gs = gtl.GutenbergSource()
    gs.load_from_url("https://www.gutenberg.org/cache/epub/3772/pg3772.txt")
    gs.load_from_url("https://www.gutenberg.org/cache/epub/4204/pg4204.txt")
    # reads all files in .cache
    documents = SimpleDirectoryReader(input_dir="./.cache", required_exts=[".txt"], exclude_hidden=False).load_data()
    # creates a vector db
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=INDEX_DIR)

2025-03-27 18:25:35,393 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
2025-03-27 18:25:36,622 - INFO - 2 prompts are loaded, with the keys: ['query', 'text']


Loading in already created index


2025-03-27 18:25:38,643 - INFO - Loading all indices.


In [18]:
from llama_index.llms.gemini import Gemini
from llama_index.core.query_engine import RetrieverQueryEngine

llm = Gemini(model=f"models/{MODEL_ID}", api_key=os.environ["GEMINI_API_KEY"])

def semantic_rag(question, top_k=TOP_K, verbose=True):
    query_engine = RetrieverQueryEngine.from_args(
        retriever=index.as_retriever(similarity_top_k=top_k), llm=llm,
    )
    response = query_engine.query(question)
    response = {
        "answer": str(response),
        "source_nodes": response.source_nodes
    }
    if verbose:
        print(response['answer'])
        for node in response['source_nodes']:
            print(node)
    return response
 
semantic_rag("Describe the geology of the Grand Canyon");




The Grand Canyon is north of the high plateaus of northern Arizona and southern Utah. The canyon is cut into stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. From the broad platform rises a series of gigantic stairs, often more than one thousand feet high and a score or more miles in breadth. The retreating escarpments and the walls of the ravines are carved into architectural forms by weathering and deflation.

Node ID: 7b635fb9-7b61-4508-ad6a-370f5cd42822
Text: W. M. DAVIS    HARVARD UNIVERSITY, CAMBRIDGE, MASS.    JULY,
1905            CONTENTS    INTRODUCTION.--THE SCOPE AND AIM OF
GEOLOGY    PART I    EXTERNAL GEOLOGICAL AGENCIES         I. THE WORK
OF THE WEATHER      II. THE WORK OF GROUND WATER     III. RIVERS AND
VALLEYS      IV. RIVER DEPOSITS       V. THE WORK OF GLACIERS      VI.
THE WORK OF ...
Score:  0.771

Node ID: 2e6e56ad-1080-4534-9177-6e25d1db23ff
Text: As they are little protected by talus, which  commonly is
removed 

In [17]:
# Expected to fail: the park did not exist when either book was written (1878 and 1905)
semantic_rag("Describe the geology of Petrified National Forest");


I'm sorry, but the provided text does not contain information about the geology of Petrified National Forest.

Node ID: 7bfb39c5-9900-465d-8f7b-b9f0d49e6617
Text: In the Nova Scotia  field, out of seventy-six distinct coal
seams, twenty are  underlain by old forest grounds.    The presence of
fire clay beneath a seam points in the same  direction. Such
underclays withstand intense heat and are used in  making fire brick,
because their alkalies have been removed by the  long-continued growth
of vegetation....
Score:  0.707

Node ID: a3556434-f9be-45c2-871f-0cf47f859389
Text: — Purity of the Coal explained. — Conversion of Coal into
Anthracite. — Origin of Clay-ironstone. — Marine and brackish-water
Strata in Coal. — Fossil Insects. — Batrachian Reptiles. —
Labyrinthodont Foot-prints in Coal-measures. — Nova Scotia  Coal-
measures with successive Growths of erect fossil Trees. —  Similarity
of American and European...
Score:  0.703



Saved answer:
<pre>
I'm sorry, but the provided text does not contain information about the geology of Petrified National Forest.
</pre>

## Limitations of plain-vanilla Semantic Indexing



In [40]:
semantic_rag("Describe the geology of the Grand Canyon");


The Grand Canyon is north of the high plateaus of northern Arizona and southern Utah. The canyon is cut into stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. From the broad platform rises a series of gigantic stairs, often more than one thousand feet high and a score or more miles in breadth. The retreating escarpments and the walls of the ravines are carved into architectural forms by weathering and deflation.

Node ID: 7b635fb9-7b61-4508-ad6a-370f5cd42822
Text: W. M. DAVIS    HARVARD UNIVERSITY, CAMBRIDGE, MASS.    JULY,
1905            CONTENTS    INTRODUCTION.--THE SCOPE AND AIM OF
GEOLOGY    PART I    EXTERNAL GEOLOGICAL AGENCIES         I. THE WORK
OF THE WEATHER      II. THE WORK OF GROUND WATER     III. RIVERS AND
VALLEYS      IV. RIVER DEPOSITS       V. THE WORK OF GLACIERS      VI.
THE WORK OF ...
Score:  0.771

Node ID: 2e6e56ad-1080-4534-9177-6e25d1db23ff
Text: As they are little protected by talus, which  commonly is
removed 

Saved answer:
<pre>
The Grand Canyon is north of the high plateaus of northern Arizona and southern Utah. The canyon is cut into stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. From the broad platform rises a series of gigantic stairs, often more than one thousand feet high and a score or more miles in breadth. The retreating escarpments and the walls of the ravines are carved into architectural forms by weathering and deflation.
</pre>
The first sentence of the answer is irrelevant: the question did not ask where the Grand Canyon is located.

## Reranking

You can use reranking to identify the chunks that actually answer the question and to strip extraneous text from them (contextual compression).
Another benefit is that, because each chunk is examined individually, you can detect obsolete material and replace it with content from a newer source.
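
A minimal, LLM-free sketch of the two ideas above, reranking and contextual compression. Keyword overlap stands in here for the relevance judgment that an LLM or cross-encoder would normally make, so treat it as an illustration under that assumption rather than a usable scorer. `ScoredChunk`, `compress_and_score`, and `rerank` are hypothetical names that do not appear in the notebook.

```python
import re
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    relevant_text: str      # text left after contextual compression
    relevance_score: float  # fraction of query terms covered, in [0, 1]


def _terms(text):
    """Lowercased alphabetic words of the text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def compress_and_score(query, text):
    """Keep only sentences sharing a term with the query, then score coverage."""
    query_terms = _terms(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if _terms(s) & query_terms]
    covered = _terms(" ".join(kept)) & query_terms
    score = len(covered) / len(query_terms) if query_terms else 0.0
    return ScoredChunk(" ".join(kept), score)


def rerank(query, texts, top_k):
    """Compress every candidate, sort by score (best first), keep top_k."""
    chunks = [compress_and_score(query, t) for t in texts]
    chunks.sort(key=lambda c: c.relevance_score, reverse=True)
    return chunks[:top_k]


texts = [
    "CONTENTS. PART I. THE WORK OF THE WEATHER.",
    "The canyon is cut into stratified rocks. Talus is removed by the wind.",
]
best = rerank("canyon rocks", texts, top_k=1)[0]
print(best.relevance_score, best.relevant_text)
```

The table-of-contents chunk scores zero and drops out, while the second chunk is trimmed to the one sentence that mentions the query terms.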

In [20]:
response = semantic_rag("Describe the geology of the Grand Canyon", top_k=4)


The Grand Canyon is north of the high plateaus of northern Arizona and southern Utah. The plateaus are made of stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. From the broad platform where the canyon was cut, a series of gigantic stairs rise, often more than one thousand feet high and a score or more miles in breadth. The retreating escarpments, cliffs of mesas and buttes, and the walls of the ravines are carved into architectural forms by weathering and deflation. The wind helps dissect the great plateaus in arid regions, smoothing them away to waterless plains of naked rock, residual gravel, or drifting residual sand.

Node ID: 7b635fb9-7b61-4508-ad6a-370f5cd42822
Text: W. M. DAVIS    HARVARD UNIVERSITY, CAMBRIDGE, MASS.    JULY,
1905            CONTENTS    INTRODUCTION.--THE SCOPE AND AIM OF
GEOLOGY    PART I    EXTERNAL GEOLOGICAL AGENCIES         I. THE WORK
OF THE WEATHER      II. THE WORK OF GROUND WATER     III. RIVERS AND
VALLEY

In [24]:
response['source_nodes'][0].metadata

{'file_path': '/home/jupyter/generative-ai-design-patterns/examples/generation_aware_retrieval/.cache/pg4204_81e8e90db3.txt',
 'file_name': 'pg4204_81e8e90db3.txt',
 'file_type': 'text/plain',
 'file_size': 685928,
 'creation_date': '2025-03-27',
 'last_modified_date': '2025-03-27'}

In [25]:
response['source_nodes'][0].text

"W. M. DAVIS\r\n\r\nHARVARD UNIVERSITY, CAMBRIDGE, MASS.\r\n\r\nJULY, 1905\r\n\r\n\r\n\r\n\r\n\r\nCONTENTS\r\n\r\nINTRODUCTION.--THE SCOPE AND AIM OF GEOLOGY\r\n\r\nPART I\r\n\r\nEXTERNAL GEOLOGICAL AGENCIES\r\n\r\n     I. THE WORK OF THE WEATHER\r\n    II. THE WORK OF GROUND WATER\r\n   III. RIVERS AND VALLEYS\r\n    IV. RIVER DEPOSITS\r\n     V. THE WORK OF GLACIERS\r\n    VI. THE WORK OF THE WIND\r\n   VII. THE SEA AND ITS SHORES\r\n  VIII. OFFSHORE AND DEEP-SEA DEPOSITS\r\n\r\nPART  II\r\n\r\nINTERNAL GEOLOGICAL AGENCIES\r\n\r\n    IX. MOVEMENTS OF THE EARTH'S CRUST\r\n     X. EARTHQUAKES\r\n    XI. VOLCANOES\r\n   XII. UNDERGROUND STRUCTURES OF IGNEOUS ORIGIN\r\n  XIII. METAMORPHISM AND MINERAL VEINS\r\n\r\nPART III\r\n\r\nHISTORICAL GEOLOGY\r\n\r\n   XIV. THE GEOLOGICAL RECORD\r\n    XV. THE PRE-CAMBRIAN SYSTEMS\r\n   XVI. THE CAMBRIAN\r\n  XVII. THE ORDOVICIAN AND SILURIAN\r\n XVIII. THE DEVONIAN\r\n   XIX. THE CARBONIFEROUS\r\n    XX. THE MESOZOIC\r\n   XXI. THE TERTIARY\r\n  X

In [30]:
from dataclasses import dataclass
import pydantic_ai
from pydantic_ai.models.gemini import GeminiModel
from pydantic_ai import Agent

model = GeminiModel(MODEL_ID, api_key=os.getenv('GEMINI_API_KEY'))

@dataclass
class Chunk:
    full_text: str
    publication_year: int
    relevant_text: str
    relevance_score: float

def process_node(query, node):
    system_prompt = """
    You will be given a query and some text.
    1. Assign a publication year if it's clear from the text, else say it's the current year
    2. Remove information from the text that is not relevant for answering the question.
    3. Assign a relevance score between 0 and 1 where 1 means that the text answers the question 
    """
    agent = Agent(model, result_type=Chunk, system_prompt=system_prompt)
    chunk = agent.run_sync(f"**Query**: {query}\n **Full Text**: {node.text}").data
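    # Override the model's guess: the publication year is known from the source file.
    if node.metadata['file_name'].startswith('pg4204'):
        chunk.publication_year = 1905
    else:
        chunk.publication_year = 1878
    return chunk

# Compress and score each retrieved node against the original question.
query = "Describe the geology of the Grand Canyon"
chunks = [process_node(query, node) for node in response['source_nodes']]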

2025-03-27 19:31:26,749 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 19:31:32,616 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 19:31:39,601 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 19:31:46,861 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
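
The `if`/`else` on the file name in `process_node` can also be written as a lookup table, which extends more easily to additional books. This is a minimal sketch, assuming the cached file names keep the `pg<id>_<hash>.txt` pattern shown in the `ls` output. `PUBLICATION_YEARS` and `publication_year_for` are hypothetical names, not part of the notebook.

```python
# Gutenberg book id -> publication year, taken from the two texts used here.
PUBLICATION_YEARS = {
    "pg3772": 1878,  # The Student's Elements of Geology
    "pg4204": 1905,  # The Elements of Geology
}


def publication_year_for(file_name, default=None):
    """Return the known publication year for a cached file, else `default`."""
    book_id = file_name.split("_", 1)[0]
    return PUBLICATION_YEARS.get(book_id, default)


print(publication_year_for("pg4204_81e8e90db3.txt"))
```

Returning `default` for unknown files leaves room to fall back to the year the LLM inferred from the text.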


In [32]:
chunks[0].relevant_text

"Geology deals with the rocks of the earth's crust. It learns from their composition and structure how the rocks were made and how they have been modified. It ascertains how they have been brought to their present places and wrought to their various topographic forms, such as hills and valleys, plains and mountains. It studies the vestiges which the rocks preserve of ancient organisms which once inhabited our planet. Geology is the history of the earth and its inhabitants, as read in the rocks of the earth's crust.\n\nThe ledges of the valley of our illustration are of sandstone. Looking closely at the rock we see that it is composed of myriads of grains of sand cemented together.\n\nA short search may find in the rock relics of animals, such as the imprints of shells, which lived when it was deposited; and as these are of kinds whose nearest living relatives now have their home in the sea, we infer that it was on the flat sea floor that the sandstone was laid. Its present position hun

In [37]:
latest_year = max([chunk.publication_year for chunk in chunks])

In [42]:
def rerank_rag(query, top_k=TOP_K):
    # retrieve matching nodes
    retriever=index.as_retriever(similarity_top_k=top_k*4)  # 4 times as many
    nodes = retriever.retrieve(query)
    
    # shorten text to what's relevant and get a relevance score
    chunks = [process_node(query, node) for node in nodes]
    chunks = sorted(chunks, key=lambda x: x.relevance_score, reverse=True)
    
    # keep only chunks from the latest publication year
    latest_year = max([chunk.publication_year for chunk in chunks])
    chunks = [chunk for chunk in chunks if chunk.publication_year == latest_year]
    
    # limit to best k
    chunks = chunks[:top_k]
    
    system_prompt = """
    Use the information provided in the context to answer the question.
    Limit your answer to what's known based on the provided information.
    """
    agent = Agent(model, result_type=str, system_prompt=system_prompt)
    answer = agent.run_sync(f"**Query**: {query}\n **Context**: {[chunk.relevant_text for chunk in chunks]}\n **Answer**:").data
    
    response = {
        "answer": answer,
        "source_nodes": chunks # [chunk.relevant_text for chunk in chunks]
    }
    return response

rerank_rag("Describe the geology of the Grand Canyon", top_k=2)


2025-03-27 20:07:04,266 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:10,270 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:16,117 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:21,871 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:27,358 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:32,749 - INFO - HTTP Request: POST https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent "HTTP/1.1 200 OK"
2025-03-27 20:07:38,085 - INFO - HTTP Request:

{'answer': 'The high plateaus north of the Grand Canyon are made of stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. The canyon has been cut into a broad platform with gigantic stairs. The walls of the ravines are carved into architectural forms by weathering and deflation.\n',
 'source_nodes': [Chunk(full_text="W. M. DAVIS\n\nHARVARD UNIVERSITY, CAMBRIDGE, MASS.\n\nJULY, 1905\n\n\n\nCONTENTS\n\nINTRODUCTION.--THE SCOPE AND AIM OF GEOLOGY\n\nPART I\n\nEXTERNAL GEOLOGICAL AGENCIES\n\n     I. THE WORK OF THE WEATHER\n    II. THE WORK OF GROUND WATER\n   III. RIVERS AND VALLEYS\n    IV. RIVER DEPOSITS\n     V. THE WORK OF GLACIERS\n    VI. THE WORK OF THE WIND\n   VII. THE SEA AND ITS SHORES\n  VIII. OFFSHORE AND DEEP-SEA DEPOSITS\n\nPART  II\n\nINTERNAL GEOLOGICAL AGENCIES\n\n    IX. MOVEMENTS OF THE EARTH'S CRUST\n     X. EARTHQUAKES\n    XI. VOLCANOES\n   XII. UNDERGROUND STRUCTURES OF IGNEOUS ORIGIN\n  XIII. METAMORPHISM AND MINERAL VEIN

Saved answer:
    <pre>
    The high plateaus north of the Grand Canyon are composed of stratified rocks that are more than ten thousand feet thick with a gentle inclination northward. The canyon has been cut into a broad platform from which a series of gigantic stairs rise, often more than one thousand feet high and a score or more of miles in breadth. The ledges of the valley in the illustration are of sandstone and are composed of myriads of grains of sand cemented together. The surface of some of these layers is ripple-marked.
    </pre>
    
Every sentence in this answer is relevant to the question. The chunks' relevant text is also more concise than the full text of the retrieved nodes.
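
The selection step inside `rerank_rag` (rank by relevance, keep only the latest publication year, truncate to the best k) can be exercised without any LLM calls. The sketch below assumes a `Chunk` dataclass with the same fields as the one defined earlier; `select_chunks` is a hypothetical helper introduced only for illustration. Note that `sorted()` returns a new list, so its result has to be assigned for the ranking to take effect.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    full_text: str
    publication_year: int
    relevant_text: str
    relevance_score: float


def select_chunks(chunks, top_k):
    """Rank by relevance, keep the latest publication year, return the best top_k."""
    if not chunks:
        return []
    # sorted() does not modify its input; keep the returned list.
    ranked = sorted(chunks, key=lambda c: c.relevance_score, reverse=True)
    latest_year = max(c.publication_year for c in ranked)
    # Filtering preserves the ranked order, so slicing yields the best top_k.
    return [c for c in ranked if c.publication_year == latest_year][:top_k]


candidates = [
    Chunk("a", 1878, "a", 0.9),
    Chunk("b", 1905, "b", 0.2),
    Chunk("c", 1905, "c", 0.8),
    Chunk("d", 1905, "d", 0.5),
]
selected = select_chunks(candidates, top_k=2)
print([c.relevant_text for c in selected])
```

The 1878 chunk is dropped despite having the highest score, which is the intended trade-off: recency takes priority over relevance when the two books disagree.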