# Attribution + Hallucination Control + Long Context Handling

1. Attribution -> Where did this answer come from 
2. Hallucination Control -> Is the answer supported by the retrieved text 
3. Long Context handling -> Reasoning over more text without breaking | Losing context windows

* Return exact span and source of the retrieved texts, evidence based QA 
* Change the prompting to ensure whatever it answers its only from the retrieved texts
* Structured output for better response quality 

| Extractive QA vs Generative QA
| Evidence grounded generation 
| Chain of verification prompts 

* Prompt level verification
* Self verification pass (pass through llm again)
* Claim decomposition - break answer into atomic claims 
* Retrieval based validation 

| Faithfulness vs Relevance | Self-consistency | Verifier LLM's | RAG Hallucination Taxonomy 

* Map Reduce -> Split context - Anwer per chunk - Aggregate 
* Hierarchial Rag -> Chunk - Summarize - retrieve summaries - expand
* Sliding window/Streaming RAG -> LLM reasons incrementally instead of all at once 
* Check long context models and their accuracy vs context length for particular tasks

| Local Inference Runtimes | Context Window vs VRAM Tradeoffs | Chunk Routing instead of brute force context

**Metrics to track**

* Hallucination Rate 
* Citation Accuracy 
* Coverage 
* Human Evaluation 

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_classic.schema import Document
from langchain_community.retrievers import BM25Retriever
from langchain_classic.retrievers import EnsembleRetriever
from langchain_classic.retrievers import ContextualCompressionRetriever 
from langchain_classic.retrievers.document_compressors import CrossEncoderReranker 
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
from sentence_transformers import CrossEncoder
import json, os
from langchain_huggingface import HuggingFacePipeline 
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
from langchain_classic.chains import RetrievalQA
from langchain_classic.prompts import PromptTemplate

In [2]:
#Load pdf
pdf_path = "book1.pdf"
pdf = PyPDFLoader(pdf_path)
docs = pdf.load() 
print("Number of loaded documents are ", len(docs))
print("Example of one document \n ", docs[35].page_content[:250])

#Create chunks 
chunk_size = 600 
chunk_overlap = 150 

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                          chunk_overlap=chunk_overlap,
                                          separators=["\n\n", "\n", " ", ""]
                                          )
chunked_docs = splitter.split_documents(docs)
print("Chunks created ", len(chunked_docs))

Number of loaded documents are  228
Example of one document 
  34 Harry Potter  
 
been trying to do. He shouted at Harry for about half an hour and 
then told him to go and make a cup of tea. Harry shuffled miser-
ably off into the kitchen, and by the time he got back, the post  
had arrived, right into Uncle V
Chunks created  1057


In [3]:
for index, doc in enumerate(chunked_docs):
    doc.metadata["chunk_id"] = f"p{doc.metadata['page']}_c{index}"

In [4]:
chunked_docs[10].metadata

{'producer': 'Acrobat Distiller 7.0.5 (Windows)',
 'creator': 'PScript5.dll Version 5.2',
 'creationdate': '2012-05-01T23:57:27+10:00',
 'subject': "When a letter arrives for unhappy but ordinary Harry Potter, a decade-old secret is revealed to him. His parents were wizards, killed by a Dark Lord's curse when Harry was just a baby, and which he somehow survived. Escaping from his unbearable Muggle guardians to Hogwarts, a wizarding school brimming with ghosts and enchantments, Harry stumbles into a sinister adventure when he finds a threeheaded dog guarding a room on the third floor. Then he hears of a missing stone with astonishing powers which could be valuable, dangerous, or both.",
 'author': 'J.K. Rowling',
 'keywords': '',
 'moddate': '2012-07-01T14:20:39+10:00',
 'title': 'Harry Potter and the Philosopher’s Stone',
 'source': 'book1.pdf',
 'total_pages': 228,
 'page': 8,
 'page_label': '9',
 'chunk_id': 'p8_c10'}

In [5]:
persona = '''You are a literary analyst assistant. Your work is to analyze context of books given to you and answer questions accurately'''

instructions = '''
Rules:
1. Use only the context given below
2. Every factual claim by you must be supported by a direct quote
3. Attach the chunk_id for each quote
4. If the answer to the question is not fully supported, reply back by saying "Answer not given in the current context"

'''

answer_format = '''
Context:
{context}

Question:
{question}

Return your answer in this JSON format:

{{
  "answer": "...",
  "claims": [
    {{
      "claim": "...",
      "evidence": {{
        "quote": "...",
        "chunk_id": "..."
      }}
    }}
  ],
  "confidence": <return confidence of your answer here upto the first decimal point>
}}
'''

prompt_template = f'''
{persona}

{instructions}

{answer_format}
''' 

In [6]:
answer_prompt = PromptTemplate(
    input_variables = ["context", "question"],
    template=prompt_template
)

In [7]:
#Load embeddings model
embeddings_model = HuggingFaceEmbeddings(model_name = "KaLM-Embedding/KaLM-embedding-multilingual-mini-instruct-v2.5",
                                         model_kwargs = {"device" : "cuda"},
                                         encode_kwargs = {"normalize_embeddings" : True})

#Create vector store from embeddings model 
vectorstore = FAISS.from_documents(chunked_docs, embedding=embeddings_model)
vectorstore.save_local("faiss_book1_index")

# Creating retriever for chain operations 
faiss_retriever = vectorstore.as_retriever(search_type="similarity",
                                           search_kwargs = {"k":3})

#Load BM25 Retriever
bm25_retriever = BM25Retriever.from_documents(chunked_docs)
bm25_retriever.k =5 

hybrid_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever],
                                     weights = [0.4, 0.6] #FAISS Gets more weights -> check why
)

#Create the reranker
reranker_model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-v2-m3",
                                         model_kwargs = {"device" : "cuda"})
reranker =  CrossEncoderReranker(model = reranker_model,
                                top_n=3)

#Wrap hybrid retriever 
compression_retriever = ContextualCompressionRetriever(base_retriever=hybrid_retriever,
                                                       base_compressor=reranker)

In [8]:
# Loading the llm model to implement on top of this RAG
model_name = "Qwen/Qwen3-0.6B" 

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir="models")
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             device_map="cuda",
                                             cache_dir="models",
                                             dtype=torch.bfloat16)

pipe = pipeline("text-generation",
                model=model,
                tokenizer = tokenizer,
                max_new_tokens = 256,
                temperature = 0.6)

llm = HuggingFacePipeline(pipeline=pipe)

Device set to use cuda


In [9]:
def generate_answer(question, retriever, llm):
    docs = compression_retriever.invoke(question) #Return the final context docs which should be mix of BM25 + FAISS Embeddings + Reranking
    context = "\n\n".join(f"Chunk ID - [{d.metadata['chunk_id']}] \n{d.page_content}" for d in docs)
    prompt = answer_format.format(context=context, question=question)
    response = llm.invoke(prompt)
    return response, docs, context

In [10]:
verifier_prompt_template = '''
You are a strict fact checker.

Check whether ALL claims in the answer are fully supported by the context.

Rules:
- A claim is supported ONLY if the quote logically entails it.
- If any claim is unsupported, mark the answer as unsupported.

Return JSON:

{{
  "supported": true/false,
  "unsupported_claims": [],
  "confidence": <return confidence of your answer here upto the first decimal point>
}}

Question:
{question}

Answer:
{answer}

Context:
{context}
'''

verifier_prompt = PromptTemplate(input_variables=["question", "answer", "context"],
                                 template=verifier_prompt_template)

In [11]:
def verify_answer(question, answer_json, docs, llm):
    context = "\n\n".join(f"{d.metadata['chunk_id']} \n{d.page_content}" for d in docs) 
    prompt = verifier_prompt.format(question=question, answer=answer_json, context=context) 
    verification = llm.invoke(prompt)
    return verification

In [12]:
question = "Which house was Harry sorted into ?"

answer, docs, context = generate_answer(question, compression_retriever, llm)
print("LLM Answer is: \n------------ \n", answer)
print("\n--------------\nRetrieved docs are : \n")
for i in docs:
    print(i.page_content[0:100])

LLM Answer is: 
------------ 
 
Context:
Chunk ID - [p222_c1041] 
McGonagall’s giant chess set!’ 
At last there was silence again. 
‘Second – to Miss Hermione Granger … for the use of cool logic 
in the face of fire, I award Gryffindor house fifty points.’ 
Hermione buried her face in her arms; Harry strongly suspected 
she had burst into tears. Gryffindors up and down the table were 
beside themselves – they were a hundred points up. 
‘Third – to Mr Harry Potter …’ said Dumbledore. The room 
went deadly quiet. ‘… for pure nerve and outstanding courage, I 
award Gryffindor house sixty points.’ 
The din was deafening. Those who could add up while yelling

Chunk ID - [p80_c367] 
‘Am I?’ said Harry, feeling dazed. 
‘Goodness, didn’t you know, I’d have found out everything I 
could if it was me,’ said Hermio ne. ‘Do either of you know what 
house you’ll be in? I’ve been asking around and I hope I’m in 
Gryffindor, it sounds by far the best, I hear Dumbledore himself

Chunk ID - [p92_c417] 

In [13]:
verification = verify_answer(question, answer, docs, llm)
print(verification)


You are a strict fact checker.

Check whether ALL claims in the answer are fully supported by the context.

Rules:
- A claim is supported ONLY if the quote logically entails it.
- If any claim is unsupported, mark the answer as unsupported.

Return JSON:

{
  "supported": true/false,
  "unsupported_claims": [],
  "confidence": <return confidence of your answer here upto the first decimal point>
}

Question:
Which house was Harry sorted into ?

Answer:

Context:
Chunk ID - [p222_c1041] 
McGonagall’s giant chess set!’ 
At last there was silence again. 
‘Second – to Miss Hermione Granger … for the use of cool logic 
in the face of fire, I award Gryffindor house fifty points.’ 
Hermione buried her face in her arms; Harry strongly suspected 
she had burst into tears. Gryffindors up and down the table were 
beside themselves – they were a hundred points up. 
‘Third – to Mr Harry Potter …’ said Dumbledore. The room 
went deadly quiet. ‘… for pure nerve and outstanding courage, I 
award Gryff

In [14]:
# For long context handling we will map step (per chunk)

map_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template = '''
Answer the question only using the context given below 
If not relevant say "No Relevant Information". 

Context: 
{context}

Question: 
{question}
'''
)

In [15]:
def map_steps(question, docs, llm):
    partial_answers = [] 
    for d in docs:
        prompt = map_prompt.format(context=d.page_content, question=question)
        response = llm.invoke(prompt)
        partial_answers.append((response, d.metadata))
    return partial_answers 

In [17]:
reduce_prompt = PromptTemplate(input_variables=["partials", "question"],
                               template='''
Combine the partial answers below into a single coherent answer. 
Cite sources and resolve contradictions 

Partial Answers: 
{partials}

Question: 
{question}
''')

In [18]:
def reduce_step(question, partials, llm):
    temp = "\n\n".join(f"[{metadata['chunk_id']}] : {text}" for text, metadata in partials)
    prompt = reduce_prompt.format(partials=temp, question=question) 
    response = llm.invoke(prompt)
    return response 

In [20]:
question = "Did Harry win his Quiditch match ?"
docs = compression_retriever.invoke(question)

partial_answers = map_steps(question, docs, llm)
final_answers = reduce_step(question, partial_answers, llm)

print(final_answers[0:100]) #Truncating the final response due to model hallucinations and giving repeated texts back | Need inst. based models but the issue is with my personal vram limitations


Combine the partial answers below into a single coherent answer. 
Cite sources and resolve contradi
