<a href="https://colab.research.google.com/github/jm7n7/week-5-adv-rag/blob/main/ADV_RAG_Hands_On.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Track A, B, C: Reranking & Context Optimization | Multimodal RAG | Evaluation & Guardrails**

## Track A: Reranking & Context Optimization
*   Implement Reciprocal Rank Fusion (BM25 + dense)
*   Add a reranker (e.g., cross-encoder) and/or MMR/passage compression
*   Compare Baseline vs. +Rerank vs. +Compression vs. +Both on your project queries
## Track B: Multimodal RAG
*   Add image/chart support via captions/embeddings (e.g., Gemini-Vision, BLIP2)
*   Build a joint index for text + image
*   Demonstrate text-only, image-only, and hybrid queries with grounded citations
## Track C: Evaluation & Guardrails
*   Create an eval set (10-20 project queries) with gold answers + source IDs
*   Compute correctness, faithfulness, context precision/recall, latency, token cost
*   Add guardrails: citation enforcement, PII redaction, refusal template
*   Report before vs. after guardrails

## 1. Install & Setup
*   Install
    - numpy
    - pandas
    - matplotlib
    - sentence-transformers
    - faiss
    - langchain
    - openai
*   Log environment to env_rag_adv.json
## 2. Load Your Project Materials
*   Use the same documents from week 4 (optionally add new documents)
    - PDFs (research papers, survey articles, datasets)
    - Text/Markdown notes
*   Include 2-3 images / charts for Track B
## 3. Retrieval Upgrades (Track A)
*   Implement RRF (BM25 + dense)
    - Add reranker + compression
*   Log
    - Recall@k
    - latency
    - avg context length
    - token cost
## 4. Multimodal Retrieval (Track B)
*   Caption / encode images with CLIP/BLIP2/Gemini-Vision
*   Show at least _one image-only query_ retrieving a relevant chart with citations
## 5. Evaluation & Guardrails (Track C)
*   Build eval_queries.jsonl
*   Compute
    - correctness/faithfulness
    - latency before guardrails
    - latency after guardrails
*   Include at least _one adversarial/unsafe/PII query_ to test guardrails
## 6. Ablation Study
*   Fill ablation_results.csv:
    - Baseline
    - +Rerank
    - +Compression
    - +Multimodal
    - +Guardrails
*   Plot recall versus latency using matplotlib
## 7. Reproducibility Log
*   Save configs in rag_adv_run_config.json
    - embedding models
    - reranker
    - chunking
    - multimodal pipeline
    - guardrails
    - retriever (k)

## Step 1

In [None]:
# Install
%pip install langchain chromadb sentence-transformers transformers langchain-community pypdf rank_bm25



In [39]:
# import packages
import os
import sys
import json
import time
import torch
import platform
import chromadb
import numpy as np
import pandas as pd
import transformers
import sentence_transformers
import matplotlib.pyplot as plt
#
from rank_bm25 import BM25Okapi
from google.colab import userdata
from langchain.schema import Document
from langchain.llms import HuggingFaceHub
from langchain.chains import RetrievalQA
from langchain.retrievers import EnsembleRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
#
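try:
    # torch is already imported above; record its version and the active device
    torch_v = torch.__version__
    cuda_ok = torch.cuda.is_available()
    device_name = torch.cuda.get_device_name(0) if cuda_ok else "CPU"
except Exception:
    # Fall back to placeholders if the CUDA query fails
    torch_v, cuda_ok, device_name = "N/A", False, "CPU"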

In [12]:
# Log versions
env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "torch": torch_v,
    "cuda": cuda_ok,
    "device": device_name,
    "transformers": transformers.__version__,
    "sentence_transformers": sentence_transformers.__version__,
    "chromadb": chromadb.__version__,
    "numpy": np.__version__,
    "pandas": pd.__version__,
}

# Save results in env_rag_adv.json
output_dir = '/content/drive/MyDrive/Capstone/Week 5_Advanced_RAG'
file_path = os.path.join(output_dir, "env_rag_adv.json")

# Ensure the output directory exists
os.makedirs(output_dir, exist_ok=True)

# Check if the file exists and load existing data
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update existing data with new environment info
existing_data.update(env_info)

with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"Environment information saved to {file_path}")

Environment information saved to /content/drive/MyDrive/Capstone/Week 5_Advanced_RAG/env_rag_adv.json


## Step 2

In [13]:
# Define the directory where the files are located
file_dir = '/content/drive/MyDrive/Capstone/Week 5_Advanced_RAG'

# List of PDF files to load for text-based RAG
pdf_files = ["maia-2.pdf", "Amortized_chess.pdf", "chessgpt.pdf"]

# List of PNG files to be used for multimodal RAG (will be processed separately)
png_files = ["daily_puzzle.png", "puzzle_1.png", "puzzle_2.png","puzzle_3.png"]

# Load the PDF documents using PyPDFLoader
pdf_documents = []
for pdf_file in pdf_files:
    file_path = os.path.join(file_dir, pdf_file)
    loader = PyPDFLoader(file_path)
    pdf_documents.extend(loader.load())

png_documents = []
for png_file in png_files:
    file_path = os.path.join(file_dir, png_file)
    # Create a simple Document object with file path as content and source
    png_documents.append(Document(page_content=f"Image file: {png_file}", metadata={"source": file_path, "file_type": "png"}))

# Combine all documents (PDFs and placeholder PNGs)
all_documents = pdf_documents + png_documents

print(f"Loaded {len(pdf_documents)} PDF documents.")
print(f"Listed and created placeholder documents for {len(png_documents)} PNG files for future multimodal processing.")
print(f"Total documents (PDFs + PNG placeholders): {len(all_documents)}")

Loaded 99 PDF documents.
Listed and created placeholder documents for 4 PNG files for future multimodal processing.
Total documents (PDFs + PNG placeholders): 103


## Step 3

### Replicate Week 4 work

In [14]:
# Define chunking parameters
chunk_size = 500
chunk_overlap = 100

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Split the documents into chunks
chunks = text_splitter.split_documents(all_documents) # Use 'all_documents' from Step 2

# Preview chunk count and first chunk
print(f"Created {len(chunks)} chunks.")
if chunks:
    print("\nFirst chunk:")
    print(chunks[0].page_content)

Created 813 chunks.

First chunk:
Maia-2: A Unified Model for Human-AI Alignment in
Chess
Zhenwei Tang
University of Toronto
josephtang@cs.toronto.edu
Difan Jiao
University of Toronto
difanjiao@cs.toronto.edu
Reid McIlroy-Young
Harvard University
reidmcy@seas.harvard.edu
Jon Kleinberg
Cornell University
kleinberg@cornell.edu
Siddhartha Sen
Microsoft Research
sidsen@microsoft.com
Ashton Anderson
University of Toronto
ashton@cs.toronto.edu
Abstract
There are an increasing number of domains in which artificial intelligence (AI)


In [15]:
# Initialize the embedding model
embedding_model_name = "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddings(model_name=embedding_model_name)

# Create the Chroma vector database
# We'll store the database in the same output directory
db_dir = os.path.join(output_dir, "chroma_db")
vectorstore = Chroma.from_documents(chunks, embedding_function, persist_directory=db_dir)

# Create a retriever from the vector store
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Verify retrieval with a sample query
sample_query = "What is the main idea of the Maia-2 paper?"
docs = retriever.invoke(sample_query)

print(f"\nSample Query: {sample_query}")
print(f"\nRetrieved {len(docs)} documents:")
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)

# Save embedding model and retriever k value to rag_run_config.json
file_path = os.path.join(output_dir, "rag_run_config.json")

# Check if the file exists and load existing data
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update existing data with new information
existing_data.update({
    "embedding_model": embedding_model_name,
    "retriever_k": 4
})

# Save the updated data to the file
with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"\nConfiguration updated in {file_path}")



Sample Query: What is the main idea of the Maia-2 paper?

Retrieved 4 documents:

Document 1:
interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there
may be no 6-board history). In addition to policy and value heads like in previous work, we also add
an additional auxiliary information head that helps the model learn a deeper understanding of human

Document 2:
important dimension is prediction coherence as skill varies. A central drawback of Maia-1 is that it
8

Document 3:
Maia-2subset. Maia-2 differs from Maia-1 in two main ways: it has a different architecture and it
has access to more training data. To control for the difference in training data and isolate the effects
of our architecture, we create Maia-2 subset which has access to the exact sa

### Week 5 upgrades

In [37]:
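# Collect page-level texts for BM25 (PyPDFLoader yields one Document per page).
# Note: the dense retriever below indexes the 500-character chunks instead,
# so the two retrievers return passages of different granularity.
document_texts = [doc.page_content for doc in pdf_documents]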

# Initialize BM25Retriever
bm25_retriever = BM25Retriever.from_texts(document_texts, metadatas=[doc.metadata for doc in pdf_documents])
bm25_retriever.k = 4 # Set a default k value for BM25

print("BM25 Retriever initialized.")

# Use the existing vectorstore for the dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 4}) # Set k for dense

print("Dense Retriever initialized.")

# Initialize the EnsembleRetriever with BM25 and dense retrievers
# weights can be adjusted based on desired contribution of each retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, dense_retriever], weights=[0.5, 0.5])

print("Ensemble Retriever (RRF) initialized.")

# Test the RRF retriever with a sample query
sample_query = "What is the main idea of the Maia-2 paper?"
docs_rrf = ensemble_retriever.invoke(sample_query)

print(f"\nSample Query with RRF: {sample_query}")
print(f"\nRetrieved {len(docs_rrf)} documents using RRF:")
for i, doc in enumerate(docs_rrf):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)
    print(f"Source: {doc.metadata.get('source')}") # Include source information if available

BM25 Retriever initialized.
Dense Retriever initialized.
Ensemble Retriever (RRF) initialized.

Sample Query with RRF: What is the main idea of the Maia-2 paper?

Retrieved 8 documents using RRF:

Document 1:
Figure 4: Maia-2’s chess concept recognition as a function of skill level, as measured by linear
activation probes right before (blue) and after (orange) skill-aware attention. (a) Stockfish overall
board evaluation for middle-game positions. (b) Stockfish evaluation of middle-game bonuses and
penalties to pieces for white. (c) Does the active player own two bishops? (d) Can the active player
capture the opponent’s queen?
subtle nuances. We now turn our focus to a critical question: does Maia-2 vary in its ability to capture
human chess concepts when given different skill levels? Following the chess concepts probing
strategy for AlphaZero [22], we show how Maia-2’s grasp of various concepts varies with skill. The
left two plots in Figure 4 show concepts for which Maia-2 clearly di
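
The fusion step used above can be illustrated in plain Python. This is a minimal sketch of weighted Reciprocal Rank Fusion, not LangChain's actual implementation: the smoothing constant `c = 60` is a commonly used value assumed here, and the document IDs are hypothetical placeholders. Each document scores `weight / (c + rank)` in every list where it appears, and the scores are summed.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several ranked lists of document IDs into a single ranking."""
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    if len(weights) != len(ranked_lists):
        raise ValueError("weights must have one entry per ranked list")

    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        # Ranks start at 1, so the top hit contributes weight / (c + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)

    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical IDs: BM25 ranks pages, the dense retriever ranks chunks.
bm25_ranking = ["page_3", "page_7", "page_1"]
dense_ranking = ["chunk_12", "page_3", "chunk_40"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

An item found by both retrievers rises to the top, while items found by only one keep their relative order. When the two lists share no items, the fused list is simply their union, which would be consistent with 8 results coming back from two retrievers that each return 4.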

In [35]:
# Initialize the reranker model and tokenizer
reranker_model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"
reranker_tokenizer = AutoTokenizer.from_pretrained(reranker_model_name)
reranker_model = AutoModelForSequenceClassification.from_pretrained(reranker_model_name)

# Define a function to rerank documents
def rerank_documents(query, documents, top_n=5):
    # Return empty list if no documents are provided
    if not documents:
        return []

    # Create pairs of query and document content for the cross-encoder
    pairs = [[query, doc.page_content] for doc in documents]

    # Use the reranker model to get scores for each pair
    with torch.no_grad():
        inputs = reranker_tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
        scores = reranker_model(**inputs).logits.squeeze(-1)

    # Sort documents by reranker score, highest first (convert the index tensor to plain ints)
    sorted_docs = [documents[i] for i in scores.argsort(descending=True).tolist()]

    # Print a message indicating reranking is done
    print(f"\nReranked documents using {reranker_model_name}.")

    # Return the top_n reranked documents
    return sorted_docs[:top_n]

# Test the reranking function with RRF results
reranked_rrf_docs = rerank_documents(sample_query, docs_rrf, top_n=5)

# Print the top 5 reranked RRF documents
print("\nTop 5 Reranked RRF documents:")
for i, doc in enumerate(reranked_rrf_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)
    print(f"Source: {doc.metadata.get('source')}")


Reranked documents using cross-encoder/ms-marco-MiniLM-L-6-v2.

Top 5 Reranked RRF documents:

Document 1:
Maia-2subset. Maia-2 differs from Maia-1 in two main ways: it has a different architecture and it
has access to more training data. To control for the difference in training data and isolate the effects
of our architecture, we create Maia-2 subset which has access to the exact same training data that
Maia-1 was developed with. Comparing the two, we see that Maia-2subset matches or outperforms
all baselines and alternate models. Recall that Maia-2 and Maia-2subset don’t have the recent history
Source: /content/drive/MyDrive/Capstone/Week 5_Advanced_RAG/maia-2.pdf

Document 2:
Maia-2 can mimic weaker players to whom the puzzle is hard to solve, while stronger Maia-2 with
skill level configured above or equal to 1500 can successfully solve the puzzle. However, Maia 1100
surprisingly solved the puzzle, while the stronger Maia-1 models, e.g., Maia 1700 failed to make
the optimal move.

In [36]:
# Initialize the EmbeddingsRedundantFilter for compression
redundant_filter = EmbeddingsRedundantFilter(embeddings=embedding_function)

# Create a DocumentCompressorPipeline with the redundant filter
compression_pipeline = DocumentCompressorPipeline(transformers=[redundant_filter])

print("\nCompression pipeline initialized with EmbeddingsRedundantFilter.")

# Define a function to retrieve, rerank, and compress documents
def retrieve_and_compress(query, rrf_retriever, reranker_function, compressor_pipeline, top_n_rerank=5):
    # Perform RRF retrieval
    print(f"\nPerforming RRF retrieval for query: {query}")
    initial_docs = rrf_retriever.invoke(query)

    # Rerank the retrieved documents
    print(f"Reranking {len(initial_docs)} retrieved documents.")
    reranked_docs = reranker_function(query, initial_docs, top_n=top_n_rerank)

    # Apply compression to the reranked documents
    print(f"Applying compression to {len(reranked_docs)} reranked documents.")
    compressed_docs = compressor_pipeline.compress_documents(reranked_docs, query=query)

    return compressed_docs

# Test the combined retrieval and compression process
final_retrieved_compressed_docs = retrieve_and_compress(
    sample_query, # Use the predefined sample_query
    ensemble_retriever, # Use the RRF retriever
    rerank_documents, # Use the reranking function
    compression_pipeline, # Use the compression pipeline
    top_n_rerank=5 # Specify the number of documents to keep after reranking
)

# Print the final retrieved and compressed documents
print("\nFinal Retrieved and Compressed Documents (RRF -> Reranking -> Compression):")
for i, doc in enumerate(final_retrieved_compressed_docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)
    print(f"Source: {doc.metadata.get('source')}")


Compression pipeline initialized with EmbeddingsRedundantFilter.

Performing RRF retrieval for query: What is the main idea of the Maia-2 paper?
Reranking 8 retrieved documents.

Reranked documents using cross-encoder/ms-marco-MiniLM-L-6-v2.
Applying compression to 5 reranked documents.

Final Retrieved and Compressed Documents (RRF -> Reranking -> Compression):

Document 1:
Maia-2subset. Maia-2 differs from Maia-1 in two main ways: it has a different architecture and it
has access to more training data. To control for the difference in training data and isolate the effects
of our architecture, we create Maia-2 subset which has access to the exact same training data that
Maia-1 was developed with. Comparing the two, we see that Maia-2subset matches or outperforms
all baselines and alternate models. Recall that Maia-2 and Maia-2subset don’t have the recent history
Source: /content/drive/MyDrive/Capstone/Week 5_Advanced_RAG/maia-2.pdf

Document 2:
Maia-2 can mimic weaker players to whom
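
To make the compression step more concrete, here is a simplified sketch of embedding-based redundancy filtering. It is a greedy illustration of the idea, not the code inside `EmbeddingsRedundantFilter`; the `0.95` threshold and the toy 2-D vectors are assumptions chosen for the example. A document is dropped when its cosine similarity to an already-kept document exceeds the threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant(embeddings, threshold=0.95):
    """Return the indices to keep, in their original order.

    An item is skipped when it is more similar than `threshold`
    to any item that has already been kept.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors: the second is nearly parallel to the first, the third is orthogonal.
toy_embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(drop_redundant(toy_embeddings))
```

Because the input is processed in order, running this after reranking keeps the higher-ranked member of any near-duplicate pair. If no pair exceeds the threshold, every document passes through unchanged.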

In [38]:
# Define the output directory
output_dir = '/content/drive/MyDrive/Capstone/Week 5_Advanced_RAG'
log_file_path = os.path.join(output_dir, "retrieval_metrics_log.json")

# Function to measure latency (seconds) and average context length (characters per document)
def measure_retrieval_metrics(retriever_function, query):
    start_time = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    retrieved_docs = retriever_function(query) # Execute the retrieval process
    latency = time.perf_counter() - start_time

    # Calculate average context length in characters
    total_length = sum(len(doc.page_content) for doc in retrieved_docs)
    avg_context_length = total_length / len(retrieved_docs) if retrieved_docs else 0

    return latency, avg_context_length, retrieved_docs

# Measure metrics for the RRF + Rerank + Compression pipeline
# We will use the retrieve_and_compress function defined previously
# Make sure ensemble_retriever, rerank_documents, and compression_pipeline are defined and accessible
try:
    latency, avg_context_length, retrieved_docs = measure_retrieval_metrics(
        lambda q: retrieve_and_compress(q, ensemble_retriever, rerank_documents, compression_pipeline, top_n_rerank=5),
        sample_query # Use the sample query for testing
    )

    # Prepare the metrics to log
    metrics = {
        "retriever": "RRF + Rerank + Compression",
        "query": sample_query,
        "latency_seconds": latency,
        "average_context_length": avg_context_length,
        "num_retrieved_docs": len(retrieved_docs)
    }

    # Check if the log file exists and load existing data
    existing_logs = []
    if os.path.exists(log_file_path):
        try:
            with open(log_file_path, 'r') as f:
                existing_logs = json.load(f)
        except json.JSONDecodeError:
            existing_logs = [] # Handle empty or invalid JSON

    # Append new metrics
    existing_logs.append(metrics)

    # Save the updated logs to the file
    with open(log_file_path, 'w') as f:
        json.dump(existing_logs, f, indent=4)

    print(f"\nLogged retrieval metrics to {log_file_path}")
    print(json.dumps(metrics, indent=4))

except NameError as e:
    print(f"\nError: Required variables or functions are not defined. Please ensure ensemble_retriever, rerank_documents, and compression_pipeline are executed before this cell.")
    print(e)
except Exception as e:
    print(f"\nAn unexpected error occurred during metric measurement: {e}")


# Note on Recall@k and Token Cost:
# Recall@k requires an evaluation dataset with ground truth relevant documents for each query.
# Token cost is relevant if an LLM is used in the pipeline (e.g., for compression/summarization).
# To implement these, you would need to:
# 1. Create or load an evaluation dataset.
# 2. For Recall@k, compare the retrieved documents against the ground truth relevant documents for each query in the dataset.
# 3. For Token cost (with LLM), track the token usage of the LLM during the compression step.


Performing RRF retrieval for query: What is the main idea of the Maia-2 paper?
Reranking 8 retrieved documents.

Reranked documents using cross-encoder/ms-marco-MiniLM-L-6-v2.
Applying compression to 5 reranked documents.

Logged retrieval metrics to /content/drive/MyDrive/Capstone/Week 5_Advanced_RAG/retrieval_metrics_log.json
{
    "retriever": "RRF + Rerank + Compression",
    "query": "What is the main idea of the Maia-2 paper?",
    "latency_seconds": 1.760443925857544,
    "average_context_length": 1028.0,
    "num_retrieved_docs": 5
}
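
The note on Recall@k in the cell above could be turned into code roughly as follows. This is a sketch under assumptions: it presumes an eval set that maps each query to a set of relevant source IDs, and the IDs in the example are hypothetical placeholders rather than results from this project.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # no gold labels for this query, so nothing can be recalled
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average Recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Hypothetical example: one of the two gold sources shows up in the top 3.
retrieved = ["doc_a", "doc_b", "doc_c", "doc_d"]
gold = ["doc_b", "doc_x"]
print(recall_at_k(retrieved, gold, k=3))
```

For this pipeline, the retrieved IDs would likely be derived from each document's `source` metadata (and page, where available), matched against the source IDs stored with each gold answer.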


## Step 4

## Step 5
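
As a starting point for the guardrails listed in the outline (citation enforcement, PII redaction, refusal template), the following is a minimal sketch. The regular expressions, the `[source: ...]` citation format, and the refusal wording are all assumptions made for illustration; real PII detection needs broader patterns than these two.

```python
import re

# Simplified patterns (assumptions): emails and US-style phone numbers only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

# Hypothetical citation convention: answers must contain "[source: <id>]".
CITATION_RE = re.compile(r"\[source:\s*[^\]]+\]")

REFUSAL_TEMPLATE = "I can't answer that from the provided documents."


def redact_pii(text):
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return PHONE_RE.sub("[REDACTED_PHONE]", text)


def apply_guardrails(answer):
    """Refuse answers that lack a citation; otherwise return the redacted answer."""
    if not CITATION_RE.search(answer):
        return REFUSAL_TEMPLATE
    return redact_pii(answer)


# The first Maia-2 chunk contains author emails, which makes a natural test case.
cited = "The first author can be reached at josephtang@cs.toronto.edu [source: maia-2.pdf]"
print(apply_guardrails(cited))
print(apply_guardrails("Maia-2 is a chess model."))
```

Timing the answer step with and without `apply_guardrails` would give the before/after latency comparison the outline asks for.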

## Step 6
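
One way to fill `ablation_results.csv` incrementally is sketched below. The column names are an assumption, and a temporary directory stands in for the Drive output folder. The single example row reuses the latency and context length already logged for the full pipeline; recall is left blank because it has not been measured in this notebook.

```python
import csv
import os
import tempfile

# Column names are an assumption; adapt them to the metrics you actually log.
FIELDS = ["variant", "recall_at_k", "latency_seconds", "avg_context_length"]


def append_ablation_row(csv_path, row):
    """Append one result row, writing the header first if the file is new or empty."""
    needs_header = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="") as f:
        # Missing columns are written as empty strings via restval.
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


# Usage: a temporary directory stands in for the notebook's output_dir.
demo_path = os.path.join(tempfile.mkdtemp(), "ablation_results.csv")
append_ablation_row(demo_path, {
    "variant": "RRF + Rerank + Compression",
    "latency_seconds": 1.760443925857544,  # from the metrics log above
    "avg_context_length": 1028.0,          # from the metrics log above
})

with open(demo_path, newline="") as f:
    print(f.read())
```

Calling `append_ablation_row` once per variant (Baseline, +Rerank, +Compression, +Multimodal, +Guardrails) yields a file that `pd.read_csv` can load for the recall versus latency plot.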

## Step 7
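
The run configuration could be saved with the same load-update-write pattern used for the earlier JSON logs. This is a sketch: the key names are an assumption, the values are the settings that appear in the cells above, and a temporary directory stands in for the Drive output folder. The multimodal and guardrail entries are `None` because those steps are not implemented in this notebook yet.

```python
import json
import os
import tempfile


def save_run_config(path, config):
    """Merge config into the JSON file at path and return the merged result."""
    existing = {}
    if os.path.exists(path):
        try:
            with open(path, "r") as f:
                existing = json.load(f)
        except json.JSONDecodeError:
            existing = {}  # treat an empty or invalid file as no prior config
    existing.update(config)
    with open(path, "w") as f:
        json.dump(existing, f, indent=4)
    return existing


# Settings taken from the cells above; key names are illustrative.
run_config = {
    "embedding_model": "all-MiniLM-L6-v2",
    "reranker": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "rerank_top_n": 5,
    "chunking": {"chunk_size": 500, "chunk_overlap": 100},
    "retriever": {"type": "BM25 + dense (EnsembleRetriever)", "k": 4, "weights": [0.5, 0.5]},
    "compression": "EmbeddingsRedundantFilter",
    "multimodal_pipeline": None,  # not implemented yet
    "guardrails": None,           # not implemented yet
}

# Usage: a temporary directory stands in for the notebook's output_dir.
config_path = os.path.join(tempfile.mkdtemp(), "rag_adv_run_config.json")
saved = save_run_config(config_path, run_config)
print(json.dumps(saved, indent=4))
```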