<a href="https://colab.research.google.com/github/jm7n7/week-4-RAG/blob/main/RAG_Hands_On.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Track A: LangChain RAG**

## 1. Install & Setup
*   Install dependencies: langchain, chromadb, sentence-transformers, transformers
*   Log Python, Torch, Transformers, SentenceTransformers, and Chroma versions
*   Save results in env_rag.json
## 2. Load Your Project Documents
*   Upload at least three files directly related to your capstone project, such as:
    - PDFs (research papers, survey articles, datasets)
    - Text/Markdown notes
*   Use PyPDFLoader and/or TextLoader to ingest them into LangChain.
## 3. Chunk the Documents
*   Start with chunk_size=500 and chunk_overlap=100
*   Preview chunk count and first chunk
*   Save chunk parameters in rag_run_config.json
## 4. Build Embeddings & Chroma Vector DB
*   Start with all-MiniLM-L6-v2
*   Build a Chroma retriever (k=4) and verify retrieval with a sample query from your project
## 5. Connect an LLM
*   Use a Hugging Face model (TinyLlama, distilgpt2, etc.) or Gemini API (gemini-2.5-flash / pro)
*   Document which model you used, why, and how it serves your project domain
## 6. Build RetrievalQA
*   Connect retriever + LLM with LangChain
*   Ask at least three domain-specific questions grounded in your project documents
## 7. Mini-Experiments
*   Embedding Swap: Compare MiniLM vs e5-small-v2 using your project data
*   Chunk Sensitivity: Compare 500/100 vs 300/50 chunking settings
## 8. Fine-Tuning (Optional, Track C)
*   Create a small Q/A dataset from your project materials
*   Run a fine-tuning workflow (Gemini API or simulated LoRA/PEFT on an open-source model)
*   Compare base vs tuned performance on your project’s domain questions
## 9. Reproducibility Log
*   Save configs in rag_run_config.json, including:
    - Embedding models tested
    - Chunk settings
    - LLMs used (base and tuned, if Track C)
    - Retriever k value

## Step 1

In [2]:
# Install
%pip install langchain chromadb sentence-transformers transformers langchain-community pypdf

Collecting chromadb
  Downloading chromadb-1.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.2 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.29-py3-none-any.whl.metadata (2.9 kB)
Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting pybase64>=1.4.1 (from chromadb)
  Downloading pybase64-1.4.2-cp312-cp312-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl.metadata (8.7 kB)
Collecting posthog<6.0.0,>=2.4.0 (from chromadb)
  Downloading posthog-5.4.0-py3-none-any.whl.metadata (5.7 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.22.1-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.9 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.37.0-py3-none-any.whl.metadata (2.4 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.

In [3]:
# import packages
import sys
import platform
import transformers
import sentence_transformers
import chromadb
import json
import os
try:
    import torch
    torch_v = torch.__version__
    cuda_ok = torch.cuda.is_available()
    device_name = torch.cuda.get_device_name(0) if cuda_ok else "CPU"
except ImportError:
    torch_v, cuda_ok, device_name = "N/A", False, "CPU"
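
The version logging in this step imports every package up front, so one missing package aborts the whole cell. A minimal sketch of a more forgiving alternative, using the standard library's `importlib.metadata`. The helper name `safe_version`, the `"N/A"` fallback, and the package list are assumptions for illustration, not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def safe_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or "N/A" if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "N/A"


# Distribution names as published on PyPI (illustrative selection).
packages = ["torch", "transformers", "sentence-transformers", "chromadb"]
versions = {name: safe_version(name) for name in packages}
print(versions)
```

This reads package metadata without importing the packages, so it also avoids the import-time cost of loading large libraries.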

In [4]:
# Log versions
env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "torch": torch_v,
    "cuda": cuda_ok,
    "device": device_name,
    "transformers": transformers.__version__,
    "sentence_transformers": sentence_transformers.__version__,
    "chromadb": chromadb.__version__
}

# Save results in env_rag.json
output_dir = '/content/drive/MyDrive/Capstone/Week 4_RAG'
file_path = os.path.join(output_dir, "env_rag.json")

# Ensure the directory exists
os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Check if the file exists and load existing data
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update existing data with new environment info
existing_data.update(env_info)

with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"Environment information saved to {file_path}")

Environment information saved to /content/drive/MyDrive/Capstone/Week 4_RAG/env_rag.json
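
The load, update, and save pattern used above recurs in several later cells. A minimal sketch of how it could be factored into one helper. The name `update_json_config` is hypothetical, and the demo writes to a temporary directory rather than Google Drive:

```python
import json
import os
import tempfile


def update_json_config(path: str, updates: dict) -> dict:
    """Merge `updates` into the JSON object stored at `path` and write it back."""
    data = {}
    if os.path.exists(path):
        try:
            with open(path, "r") as f:
                data = json.load(f)
        except json.JSONDecodeError:
            data = {}  # Treat an empty or invalid file as an empty config
    data.update(updates)
    # Create the parent directory if needed ("." when path has no directory part)
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
    return data


# Usage with a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "rag_run_config.json")
    update_json_config(demo_path, {"chunk_size": 500, "chunk_overlap": 100})
    merged = update_json_config(demo_path, {"retriever_k": 4})
    print(merged)
```

Each later cell could then reduce to a single call such as `update_json_config(file_path, {...})`.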


## Step 2

In [5]:
from langchain_community.document_loaders import PyPDFLoader

# Define the directory where the PDF files are located
pdf_dir = '/content/drive/MyDrive/Capstone/Week 4_RAG'

# List of PDF files to load
pdf_files = ["maia-2.pdf", "Amortized_chess.pdf", "chessgpt.pdf"]

# Load the documents
documents = []
for pdf_file in pdf_files:
    file_path = os.path.join(pdf_dir, pdf_file)
    loader = PyPDFLoader(file_path)
    documents.extend(loader.load())

print(f"Loaded {len(documents)} documents.")

Loaded 99 documents.
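
With its default settings, `PyPDFLoader.load()` yields one `Document` per PDF page, so the count of 99 above most likely refers to pages across the three PDFs, not files. A small sketch for checking how the pages split across files. `PageStub` is a hypothetical stand-in for a LangChain `Document`, and the demo data is made up:

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in for a LangChain Document (only `metadata` is used)."""
    page_content: str = ""
    metadata: dict = field(default_factory=dict)


def pages_per_source(docs) -> Counter:
    """Count how many page-level documents came from each source file."""
    return Counter(doc.metadata.get("source", "unknown") for doc in docs)


# Made-up example: two pages from one file and one page from another.
demo_docs = [
    PageStub(metadata={"source": "a.pdf", "page": 0}),
    PageStub(metadata={"source": "a.pdf", "page": 1}),
    PageStub(metadata={"source": "b.pdf", "page": 0}),
]
print(pages_per_source(demo_docs))
```

In the notebook, `pages_per_source(documents)` should work the same way, assuming each loaded document carries a `source` entry in its metadata.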


## Step 3

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define chunking parameters
chunk_size = 500
chunk_overlap = 100

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Split the documents into chunks
chunks = text_splitter.split_documents(documents)

# Preview chunk count and first chunk
print(f"Created {len(chunks)} chunks.")
if chunks:
    print("\nFirst chunk:")
    print(chunks[0].page_content)

Created 809 chunks.

First chunk:
Maia-2: A Unified Model for Human-AI Alignment in
Chess
Zhenwei Tang
University of Toronto
josephtang@cs.toronto.edu
Difan Jiao
University of Toronto
difanjiao@cs.toronto.edu
Reid McIlroy-Young
Harvard University
reidmcy@seas.harvard.edu
Jon Kleinberg
Cornell University
kleinberg@cornell.edu
Siddhartha Sen
Microsoft Research
sidsen@microsoft.com
Ashton Anderson
University of Toronto
ashton@cs.toronto.edu
Abstract
There are an increasing number of domains in which artificial intelligence (AI)
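
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window character chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also prefers to split on separators such as paragraph breaks and spaces. In this sketch each window simply starts `chunk_size - chunk_overlap` characters after the previous one. The function name is hypothetical:

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reaches the end of the text
        start += step
    return chunks


# Compare the two settings from this notebook on a synthetic 1200-character text.
text = "x" * 1200
counts = {}
for size, overlap in [(500, 100), (300, 50)]:
    counts[(size, overlap)] = len(sliding_window_chunks(text, size, overlap))
print(counts)
```

Smaller windows with a smaller step produce more chunks for the same text, which is the same direction of change seen when moving from 500/100 to 300/50.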


In [7]:
# Define the file path
file_path = os.path.join(output_dir, "rag_run_config.json")

# Ensure the directory exists
os.makedirs(os.path.dirname(file_path), exist_ok=True)

# Check if the file exists and load existing data
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update existing data with chunk parameters
existing_data.update({
    "chunk_size": chunk_size,
    "chunk_overlap": chunk_overlap
})

# Save the updated data to the file
with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"Chunk parameters saved to {file_path}")

Chunk parameters saved to /content/drive/MyDrive/Capstone/Week 4_RAG/rag_run_config.json


## Step 4

In [8]:
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# Initialize the embedding model
embedding_model_name = "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddings(model_name=embedding_model_name)

# Create the Chroma vector database
# We'll store the database in the same output directory
db_dir = os.path.join(output_dir, "chroma_db")
vectorstore = Chroma.from_documents(chunks, embedding_function, persist_directory=db_dir)

# Create a retriever from the vector store
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Verify retrieval with a sample query
sample_query = "What is the main idea of the Maia-2 paper?"
docs = retriever.invoke(sample_query)

print(f"\nSample Query: {sample_query}")
print(f"\nRetrieved {len(docs)} documents:")
for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(doc.page_content)

# Save embedding model and retriever k value to rag_run_config.json
file_path = os.path.join(output_dir, "rag_run_config.json")

# Check if the file exists and load existing data
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update existing data with new information
existing_data.update({
    "embedding_model": embedding_model_name,
    "retriever_k": 4
})

# Save the updated data to the file
with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"\nConfiguration updated in {file_path}")

Sample Query: What is the main idea of the Maia-2 paper?

Retrieved 4 documents:

Document 1:
interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there
may be no 6-board history). In addition to policy and value heads like in previous work, we also add
an additional auxiliary information head that helps the model learn a deeper understanding of human

Document 2:
interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there
may be no 6-board history). In addition to policy and value heads like in previous work, we also add
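
Documents 1 and 2 above start with the same text. One plausible cause (an assumption, not verified in this notebook) is that `Chroma.from_documents` was run more than once against the same `persist_directory`, which would add the same chunks again on each run. A minimal sketch of removing repeated chunks from a retrieval result. The helper name is hypothetical, and `SimpleNamespace` objects stand in for LangChain documents:

```python
from types import SimpleNamespace


def dedupe_by_content(docs):
    """Keep the first occurrence of each distinct page_content, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.page_content.strip()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Made-up retrieval result containing one repeated chunk.
demo = [
    SimpleNamespace(page_content="chunk A"),
    SimpleNamespace(page_content="chunk A"),
    SimpleNamespace(page_content="chunk B"),
]
unique_docs = dedupe_by_content(demo)
print([d.page_content for d in unique_docs])
```

If repeated indexing is indeed the cause, deleting the persisted directory before rebuilding (for example with `shutil.rmtree(db_dir, ignore_errors=True)`) would address it at the source rather than after retrieval.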


## Step 5

In [9]:
from langchain_community.llms import HuggingFaceHub
from google.colab import userdata

# Define the model to use (e.g., TinyLlama or distilgpt2)
# Choose a model that fits within your computational resources
model_id = "distilgpt2"
task = "text-generation"  # distilgpt2 is a causal (text-generation) model

# Get the Hugging Face API token from Colab secrets
# Make sure you have added your token to Colab secrets with the name 'HF_TOKEN'
huggingface_api_token = userdata.get("HF_TOKEN")

# Initialize the Hugging Face LLM
llm = HuggingFaceHub(
    repo_id=model_id,
    task=task, # Use the updated task
    huggingfacehub_api_token=huggingface_api_token,
)

print(f"Connected to Hugging Face model: {model_id}")

# Note: You might need to install the 'huggingface_hub' library if not already installed
# %pip install huggingface_hub

Connected to Hugging Face model: distilgpt2



## Step 6

In [10]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import pipeline

# Define the model to use (using the same model as before)
model_id = "distilgpt2"
task = "text-generation"

# Create a Hugging Face pipeline
pipe = pipeline(task, model=model_id)

# Initialize the LangChain LLM with the pipeline
llm_pipeline = HuggingFacePipeline(pipeline=pipe)

print(f"Initialized LLM using HuggingFacePipeline with model: {model_id}")

# 'llm_pipeline' is the LLM object passed to the RetrievalQA chain in the next cell

Device set to use cuda:0


Initialized LLM using HuggingFacePipeline with model: distilgpt2



In [11]:
from langchain.chains import RetrievalQA

# Create a RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm_pipeline, # Use the llm_pipeline object
    chain_type="stuff", # Other options include "map_reduce", "refine", "map_rerank"
    retriever=retriever,
    return_source_documents=True # Set to True to see the source documents
)

# Ask three domain-specific questions
query_1 = "What is the main idea of the Maia-2 paper?" # Same as the sample query in Step 4
query_2 = "What is the conclusion of the Maia-2 paper?" # Related to the first question
query_3 = "What models were used in the chessGPT paper?" # New question about a different paper

In [12]:
# Run the query_1
result = qa_chain.invoke(query_1)

print(f"Query: {query_1}")
print(f"\nAnswer: {result['result']}")

# Optionally print source documents
if 'source_documents' in result:
    print("\nSource Documents:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content[:200]}...") # Print first 200 characters
        print(f"Source: {doc.metadata.get('source')}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the main idea of the Maia-2 paper?

Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there
may be no 6-board history). In addition to policy and value heads like in previous work, we also add
an additional auxiliary information head that helps the model learn a deeper understanding of human

interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contex

In [13]:
# Run the query_2
result = qa_chain.invoke(query_2)

print(f"Query: {query_2}")
print(f"\nAnswer: {result['result']}")

# Optionally print source documents
if 'source_documents' in result:
    print("\nSource Documents:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content[:200]}...") # Print first 200 characters
        print(f"Source: {doc.metadata.get('source')}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What is the conclusion of the Maia-2 paper?

Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

important dimension is prediction coherence as skill varies. A central drawback of Maia-1 is that it
8

important dimension is prediction coherence as skill varies. A central drawback of Maia-1 is that it
8

important dimension is prediction coherence as skill varies. A central drawback of Maia-1 is that it
8

Justification: We train Maia-2 with a huge amount (9.1B) of chess positions. Therefore, it is
hard to evaluate Maia-2 multiple times with different train/test splits.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confi-
dence intervals, or statistical significance tests, at least for the experiments that support
the main claims of the paper.

Questi

In [14]:
# Run the query_3
result = qa_chain.invoke(query_3)

print(f"Query: {query_3}")
print(f"\nAnswer: {result['result']}")

# Optionally print source documents
if 'source_documents' in result:
    print("\nSource Documents:")
    for i, doc in enumerate(result['source_documents']):
        print(f"\nDocument {i+1}:")
        print(f"Content: {doc.page_content[:200]}...") # Print first 200 characters
        print(f"Source: {doc.metadata.get('source')}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Query: What models were used in the chessGPT paper?

Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

co/datasets/Waterhorse/chess_data.
F Implementation and Evaluation Details
We open source all our models: ChessCLIP ( https://huggingface.co/Waterhorse/ChessCLIP),
ChessGPT-Base ( https://huggingface.co/Waterhorse/chessgpt-base-v1) and ChessGPT-Chat
(https://huggingface.co/Waterhorse/chessgpt-chat-v1). Refer to these URLs for model licenses
and model cards.
F.1 Implmenetation details
F.1.1 ChessCLIP

co/datasets/Waterhorse/chess_data.
F Implementation and Evaluation Details
We open source all our models: ChessCLIP ( https://huggingface.co/Waterhorse/ChessCLIP),
ChessGPT-Base ( https://huggingface.co/Waterhorse/chessgpt-base-v1) and ChessGPT-Chat
(https://huggingface.co/Waterhorse/chessgpt-chat-v1). Refer to these URLs for model licenses
and model cards.
F.1 Implmen
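
Each answer above begins with the prompt template and the retrieved context, which suggests the generation pipeline returns the prompt together with the completion. A minimal post-processing sketch, assuming the chain uses the default "stuff" prompt, whose last line is `Helpful Answer:`. The helper name `extract_answer` is hypothetical:

```python
def extract_answer(generated: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the first occurrence of marker, or the input if absent."""
    _, found, tail = generated.partition(marker)
    return tail.strip() if found else generated.strip()


# Made-up generation that echoes the prompt before the completion.
demo_output = (
    "Use the following pieces of context to answer the question at the end.\n\n"
    "Some retrieved context.\n\n"
    "Question: What is X?\n"
    "Helpful Answer: X is a placeholder."
)
print(extract_answer(demo_output))
```

An alternative worth trying is passing `return_full_text=False` when creating the `transformers` text-generation pipeline, which should stop the prompt from being echoed in the first place. That change was not tested in this notebook.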

## Step 7

In [19]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

# Initialize a new embedding function
embedding_model_name_e5 = "intfloat/e5-small-v2"
embedding_function_e5 = SentenceTransformerEmbeddings(model_name=embedding_model_name_e5)

# Create a new Chroma vector database
# Replace "/" and "-" so the model name maps to a single flat directory name
db_dir_e5 = os.path.join(output_dir, f"chroma_db_{embedding_model_name_e5.replace('/', '_').replace('-', '_')}")
vectorstore_e5 = Chroma.from_documents(chunks, embedding_function_e5, persist_directory=db_dir_e5)

# Create a new retriever
retriever_e5 = vectorstore_e5.as_retriever(search_kwargs={"k": 4})

# Create a new RetrievalQA chain instance
qa_chain_e5 = RetrievalQA.from_chain_type(
    llm=llm_pipeline,  # Use the same LLM pipeline
    chain_type="stuff",
    retriever=retriever_e5,
    return_source_documents=True
)
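
The E5 model card recommends prefixing inputs with `query: ` for search queries and `passage: ` for texts to be indexed, while the cell above embeds raw text for both. A minimal sketch of adding those prefixes. The helper names are hypothetical, and whether this changes retrieval quality on these documents is untested:

```python
def as_e5_query(text: str) -> str:
    """Prefix a search query the way the E5 model card describes."""
    return text if text.startswith("query: ") else f"query: {text}"


def as_e5_passage(text: str) -> str:
    """Prefix a chunk to be indexed the way the E5 model card describes."""
    return text if text.startswith("passage: ") else f"passage: {text}"


print(as_e5_query("What is the main idea of the Maia-2 paper?"))
print(as_e5_passage("Maia-2 only requires the current board position as input."))
```

To apply this in the notebook, the passage prefix would be added to each chunk's text before building the vector store, and the query prefix to the question before retrieval only, so that the LLM prompt still receives the unprefixed question.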

In [20]:
# Define the same three domain-specific queries
query_1 = "What is the main idea of the Maia-2 paper?"
query_2 = "What is the conclusion of the Maia-2 paper?"
query_3 = "What models were used in the chessGPT paper?"

queries = [query_1, query_2, query_3]
results_e5 = {}

# Invoke the new RetrievalQA chain and print results
print(f"\n--- Results with {embedding_model_name_e5} embeddings ---")
for i, query in enumerate(queries):
    print(f"\nQuery: {query}")
    result_e5 = qa_chain_e5.invoke(query)
    print(f"\nAnswer: {result_e5['result']}")

    results_e5[f"query_{i+1}"] = {
        "query": query,
        "answer": result_e5['result'],
        "source_documents": [{"content": doc.page_content, "source": doc.metadata.get('source')} for doc in result_e5['source_documents']]
    }

    if 'source_documents' in result_e5:
        print("\nSource Documents:")
        for j, doc in enumerate(result_e5['source_documents']):
            print(f"\nDocument {j+1}:")
            print(f"Content: {doc.page_content[:200]}...")
            print(f"Source: {doc.metadata.get('source')}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



--- Results with intfloat/e5-small-v2 embeddings ---

Query: What is the main idea of the Maia-2 paper?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there
may be no 6-board history). In addition to policy and value heads like in previous work, we also add
an additional auxiliary information head that helps the model learn a deeper understanding of human

builds directly on original Maia, we call it Maia-2. Maia-2 consists of a standard residual network
tower that processes chess positions into features, and our novel contribution of askill-aware attention
module with channel-wise patching. This innovation takes the position representation outputted
by the residual network tower 

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Justification: We train Maia-2 with a huge amount (9.1B) of chess positions. Therefore, it is
hard to evaluate Maia-2 multiple times with different train/test splits.
Guidelines:
• The answer NA means that the paper does not include experiments.
• The authors should answer "Yes" if the results are accompanied by error bars, confi-
dence intervals, or statistical significance tests, at least for the experiments that support
the main claims of the paper.

that is necessary to appreciate the results and make sense of them.
• The full details can be provided either with the code, in appendix, or as supplemental
material.
7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate
information about the statistical significance of the experiments?
A

In [21]:
# Load the existing data from rag_run_config.json
file_path = os.path.join(output_dir, "rag_run_config.json")
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update the loaded data
existing_data[f"embedding_experiment_{embedding_model_name_e5.replace('-', '_')}"] = {
    "embedding_model": embedding_model_name_e5,
    "retriever_k": 4,
    "results": results_e5
}

# Save the updated data to the file
with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"\nConfiguration updated with {embedding_model_name_e5} results in {file_path}")


Configuration updated with intfloat/e5-small-v2 results in /content/drive/MyDrive/Capstone/Week 4_RAG/rag_run_config.json
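
Because both retrievers index the same chunks, one simple way to quantify the embedding swap is the Jaccard overlap between the chunk texts each retriever returns for a query. A minimal sketch with made-up chunk labels. The function name is hypothetical and the numbers are not results from this notebook:

```python
def jaccard_overlap(chunks_a: list[str], chunks_b: list[str]) -> float:
    """Return |A intersect B| / |A union B| for two lists of retrieved chunk texts."""
    set_a, set_b = set(chunks_a), set(chunks_b)
    if not set_a and not set_b:
        return 1.0  # Two empty results are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Illustrative top-4 results from two retrievers for the same query.
minilm_hits = ["chunk 1", "chunk 2", "chunk 3", "chunk 4"]
e5_hits = ["chunk 1", "chunk 5", "chunk 3", "chunk 6"]
overlap = jaccard_overlap(minilm_hits, e5_hits)
print(f"Jaccard overlap: {overlap:.2f}")
```

A value near 1 means the two embedding models retrieve almost the same chunks. A value near 0 means they disagree, and the chunks themselves need inspecting to judge which set is more relevant.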


In [22]:
# Define new chunking parameters
new_chunk_size = 300
new_chunk_overlap = 50

print(f"Performing Chunk Sensitivity Experiment with chunk_size={new_chunk_size} and chunk_overlap={new_chunk_overlap}")

# Initialize the text splitter with new parameters
text_splitter_new = RecursiveCharacterTextSplitter(
    chunk_size=new_chunk_size,
    chunk_overlap=new_chunk_overlap
)

# Split the original documents into chunks using new parameters
chunks_new = text_splitter_new.split_documents(documents) # Using the 'documents' variable from Step 2

print(f"Created {len(chunks_new)} new chunks.")

# Initialize the same embedding model used in Step 4
embedding_model_name_original = "all-MiniLM-L6-v2"
embedding_function_original = SentenceTransformerEmbeddings(model_name=embedding_model_name_original)

# Create a new Chroma vector database with new chunks and the original embedding model
db_dir_new_chunks = os.path.join(output_dir, f"chroma_db_chunk_{new_chunk_size}_{new_chunk_overlap}")
vectorstore_new_chunks = Chroma.from_documents(chunks_new, embedding_function_original, persist_directory=db_dir_new_chunks)

# Create a new retriever from this vector store
retriever_new_chunks = vectorstore_new_chunks.as_retriever(search_kwargs={"k": 4}) # Using the same k as before

# Create a new RetrievalQA chain instance
qa_chain_new_chunks = RetrievalQA.from_chain_type(
    llm=llm_pipeline,  # Use the same LLM pipeline from Step 6
    chain_type="stuff",
    retriever=retriever_new_chunks,
    return_source_documents=True
)


Performing Chunk Sensitivity Experiment with chunk_size=300 and chunk_overlap=50
Created 1252 new chunks.


In [23]:
# Define the same three domain-specific queries from Step 6
query_1 = "What is the main idea of the Maia-2 paper?"
query_2 = "What is the conclusion of the Maia-2 paper?"
query_3 = "What models were used in the chessGPT paper?"

queries = [query_1, query_2, query_3]
results_new_chunks = {}

# Invoke the new RetrievalQA chain and print results
print(f"\n--- Results with chunk_size={new_chunk_size}, chunk_overlap={new_chunk_overlap} ---")
for i, query in enumerate(queries):
    print(f"\nQuery: {query}")
    result_new_chunks = qa_chain_new_chunks.invoke(query)
    print(f"\nAnswer: {result_new_chunks['result']}")

    results_new_chunks[f"query_{i+1}"] = {
        "query": query,
        "answer": result_new_chunks['result'],
        "source_documents": [{"content": doc.page_content, "source": doc.metadata.get('source')} for doc in result_new_chunks['source_documents']]
    }

    if 'source_documents' in result_new_chunks:
        print("\nSource Documents:")
        for j, doc in enumerate(result_new_chunks['source_documents']):
            print(f"\nDocument {j+1}:")
            print(f"Content: {doc.page_content[:200]}...")
            print(f"Source: {doc.metadata.get('source')}")



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



--- Results with chunk_size=300, chunk_overlap=50 ---

Query: What is the main idea of the Maia-2 paper?


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

interact with chess positions to produce the moves humans make. Unlike previous models, Maia-2
only requires the current board position as input (as opposed to six), which dramatically reduces
training time and increases flexibility (e.g. for applying the model in non-game contexts where there

winning, drawing, and losing, respectively. The training objectives of these heads are balanced to
contribute equally to Maia-2 model optimization. Hyperparameter settings used for Maia-2 training
can be found in Appendix Table 5.
4 Results

Limitation. Our work has limitations. First, we are excited by the applications that Maia-2 will
enable, such as more relatable AI partners and AI-powered learning aids, the development of which is
out of scope for the current work. Maia-2 does not yet incorporate search, although previou

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.



Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

the less dense upper left quadrant indicates fewer instances where Maia 1900 outperforms Maia-
2. Remarkably, while this consistently occurs across all move qualities, the distinction is more
pronounced for Blunders and Errors compared to Optimal moves. Additionally, the bottom row of

surprisingly solved the puzzle, while the stronger Maia-1 models, e.g., Maia 1700 failed to make
the optimal move. Therefore, in the considered case, as opposed to Maia-1, Maia-2 yields smooth
predictions provided that its treatment of this position is monotonic and transitional.
14

winning, drawing, and losing, respectively. The training objectives of these heads are balanced to
contribute equally to Maia-2 model optimization. Hyperparameter settings used for Maia-2 training
can be found in Appendix Table 5.
4 Results

Figure 3.(A) 

In [24]:
# Load the existing data from rag_run_config.json
file_path = os.path.join(output_dir, "rag_run_config.json")
existing_data = {}
if os.path.exists(file_path):
    try:
        with open(file_path, 'r') as f:
            existing_data = json.load(f)
    except json.JSONDecodeError:
        existing_data = {} # Handle empty or invalid JSON

# Update the loaded data with new chunking experiment results
existing_data[f"chunk_experiment_{new_chunk_size}_{new_chunk_overlap}"] = {
    "chunk_size": new_chunk_size,
    "chunk_overlap": new_chunk_overlap,
    "embedding_model": embedding_model_name_original,
    "retriever_k": 4,
    "results": results_new_chunks
}

# Save the updated data to the file
with open(file_path, 'w') as f:
    json.dump(existing_data, f, indent=4)

print(f"\nConfiguration updated with chunk_size={new_chunk_size}, chunk_overlap={new_chunk_overlap} results in {file_path}")


Configuration updated with chunk_size=300, chunk_overlap=50 results in /content/drive/MyDrive/Capstone/Week 4_RAG/rag_run_config.json


## Step 8
*Skipped*

## Step 9

In [25]:
file_path = os.path.join(output_dir, "rag_run_config.json")

# Load the data from the file
try:
    with open(file_path, 'r') as f:
        config_data = json.load(f)
except (FileNotFoundError, json.JSONDecodeError):
    print(f"Error loading data from {file_path}")
    config_data = {} # Initialize empty if file not found or invalid

print("--- Experiment Summary ---")

# Summarize Embedding Swap Experiment
embedding_experiment_key = None
for key in config_data:
    if key.startswith("embedding_experiment_"):
        embedding_experiment_key = key
        break

if embedding_experiment_key:
    embedding_experiment_data = config_data[embedding_experiment_key]
    original_embedding_model = config_data.get("embedding_model", "N/A") # Get original embedding model
    print(f"\nEmbedding Swap Experiment:")
    print(f"  Original Embedding Model: {original_embedding_model}")
    print(f"  Compared Against: {embedding_experiment_data.get('embedding_model', 'N/A')}")
    print("  Review the 'results' section in rag_run_config.json for detailed output.")
else:
    print("\nEmbedding Swap Experiment data not found in rag_run_config.json")


# Summarize Chunk Sensitivity Experiment
chunk_experiment_key = None
for key in config_data:
    if key.startswith("chunk_experiment_"):
        chunk_experiment_key = key
        break

if chunk_experiment_key:
    chunk_experiment_data = config_data[chunk_experiment_key]
    original_chunk_size = config_data.get("chunk_size", "N/A")
    original_chunk_overlap = config_data.get("chunk_overlap", "N/A")
    print(f"\nChunk Sensitivity Experiment:")
    print(f"  Original Chunk Settings: chunk_size={original_chunk_size}, chunk_overlap={original_chunk_overlap}")
    print(f"  Compared Against: chunk_size={chunk_experiment_data.get('chunk_size', 'N/A')}, chunk_overlap={chunk_experiment_data.get('chunk_overlap', 'N/A')}")
    print("  Review the 'results' section in rag_run_config.json for detailed output.")
else:
    print("\nChunk Sensitivity Experiment data not found in rag_run_config.json")

print("\n--- End of Summary ---")

--- Experiment Summary ---

Embedding Swap Experiment:
  Original Embedding Model: all-MiniLM-L6-v2
  Compared Against: intfloat/e5-small-v2
  Review the 'results' section in rag_run_config.json for detailed output.

Chunk Sensitivity Experiment:
  Original Chunk Settings: chunk_size=500, chunk_overlap=100
  Compared Against: chunk_size=300, chunk_overlap=50
  Review the 'results' section in rag_run_config.json for detailed output.

--- End of Summary ---
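
The summary above only points back to the JSON file. A minimal sketch of condensing a saved `results` entry into per-query counts, assuming the structure written by the experiment cells (`query`, `answer`, and `source_documents` with `content` and `source`). The function name and demo data are hypothetical:

```python
import os


def summarize_results(results: dict) -> dict:
    """Summarize each query's retrieved chunks: total, distinct texts, source files."""
    summary = {}
    for key, entry in results.items():
        docs = entry.get("source_documents", [])
        contents = [d.get("content", "") for d in docs]
        sources = sorted({os.path.basename(d.get("source") or "unknown") for d in docs})
        summary[key] = {
            "retrieved": len(docs),
            "unique_chunks": len(set(contents)),
            "sources": sources,
        }
    return summary


# Made-up entry shaped like the saved experiment results.
demo_results = {
    "query_1": {
        "query": "example question",
        "answer": "example answer",
        "source_documents": [
            {"content": "chunk A", "source": "/data/a.pdf"},
            {"content": "chunk A", "source": "/data/a.pdf"},
            {"content": "chunk B", "source": "/data/b.pdf"},
        ],
    }
}
summary = summarize_results(demo_results)
print(summary)
```

A gap between `retrieved` and `unique_chunks` flags repeated chunks in a result, and `sources` shows at a glance whether a question pulled context from the expected paper.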
