# ICC T20 RAG Q&A System

This notebook implements a **Retrieval-Augmented Generation (RAG)** pipeline for answering questions about ICC T20 cricket rules. We use LangChain, OpenAI embeddings, ChromaDB as our vector store, and GPT-4o-mini as the language model.

**Pipeline Overview:**
1. **Load** — Read ICC T20 rule PDFs from Google Drive
2. **Chunk** — Split documents into manageable pieces
3. **Embed + Store** — Create vector embeddings and store in ChromaDB
4. **Test Retrieval** — Verify relevant chunks are retrieved
5. **Build RAG Chain** — Wire up the full question-answering pipeline
6. **Evaluate** — Run evaluation questions and measure accuracy
7. **Stretch Goal A** — Compare chunk sizes

## Setup: Install Dependencies

We install all required packages: LangChain ecosystem, OpenAI integration, ChromaDB for vector storage, and PyPDF for reading PDF documents.

In [1]:
!pip install -q "langchain<1.0" langchain-openai langchain-chroma langchain-community langchain-text-splitters chromadb pypdf

ERROR: pip's dependency re

## Mount Google Drive

Mount Google Drive to access our cricket rules PDF documents and evaluation data.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Set Data Path

Define the path to our cricket rules data folder in Google Drive. This path will be used throughout the notebook.

In [3]:
import os

DATA_PATH = "/content/drive/MyDrive/cricket_rules_data"

# Verify the path exists and list contents
if os.path.exists(DATA_PATH):
    print("Data folder found! Contents:")
    for f in sorted(os.listdir(DATA_PATH)):
        print(f"  {f}")
else:
    print(f"ERROR: Path not found: {DATA_PATH}")

Data folder found! Contents:
  01_match_structure_and_playing_conditions.pdf
  02_bowling_rules_no_balls_and_free_hits.pdf
  03_drs_and_umpiring.pdf
  04_special_match_situations.pdf
  eval_questions.json


---
## Step 1: Load

We use LangChain's `PyPDFLoader` to load all 4 ICC T20 rule PDFs from our Google Drive folder. Each PDF is loaded page-by-page, preserving metadata about the source file and page number.

In [4]:
from langchain_community.document_loaders import PyPDFLoader
import glob

# Get all PDF files from the data folder
pdf_files = sorted(glob.glob(os.path.join(DATA_PATH, "*.pdf")))
print(f"Found {len(pdf_files)} PDF files:\n")

# Load all PDFs
all_documents = []
for pdf_path in pdf_files:
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    filename = os.path.basename(pdf_path)
    print(f"  {filename}: {len(docs)} pages")
    all_documents.extend(docs)

print(f"\n{'='*50}")
print(f"Total documents (pages) loaded: {len(all_documents)}")

Found 4 PDF files:

  01_match_structure_and_playing_conditions.pdf: 3 pages
  02_bowling_rules_no_balls_and_free_hits.pdf: 3 pages
  03_drs_and_umpiring.pdf: 4 pages
  04_special_match_situations.pdf: 5 pages

Total documents (pages) loaded: 15


In [5]:
# Display a sample of the first document's content (first 500 characters)
print("Sample content from the first document:")
print("="*50)
print(f"Source: {all_documents[0].metadata['source']}")
print(f"Page: {all_documents[0].metadata.get('page', 'N/A')}")
print("-"*50)
print(all_documents[0].page_content[:500])
print("...")

Sample content from the first document:
Source: /content/drive/MyDrive/cricket_rules_data/01_match_structure_and_playing_conditions.pdf
Page: 0
--------------------------------------------------
ICC T20 INTERNATIONAL CRICKET: MATCH STRUCTURE
 AND PLAYING CONDITIONS
1. MATCH FORMAT OVERVIEW
A Twenty20 International (T20I) match consists of two innings. Each team bats for a maximum of 20 overs
per innings. Each over consists of six legal deliveries bowled by a single bowler. The team scoring the most
runs at the end of both innings wins the match. If both teams score the same number of runs, the match is
declared a tie and may proceed to a Super Over to determine a winner.
A minimum of 5 
...


---
## Step 2: Chunk

We split the loaded documents into smaller chunks using LangChain's `RecursiveCharacterTextSplitter`. This splitter tries to split on natural boundaries (paragraphs, sentences, words) to keep semantically related content together.

We experiment with two configurations to understand how chunk size affects our pipeline.
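
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking. It is a deliberate simplification and not what `RecursiveCharacterTextSplitter` does: the real splitter prefers to cut at separators, so its chunks vary in length. The helper name `naive_chunks` and the synthetic sample text are hypothetical.

```python
def naive_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 1200-character text (not taken from the rule PDFs)
sample = "".join(chr(ord("a") + i % 26) for i in range(1200))

small = naive_chunks(sample, chunk_size=500, chunk_overlap=50)
large = naive_chunks(sample, chunk_size=1000, chunk_overlap=100)

print(len(small), [len(c) for c in small])
print(len(large), [len(c) for c in large])
```

Halving the chunk size roughly doubles the number of windows, and the last `chunk_overlap` characters of one window reappear at the start of the next.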

### Configuration 1: chunk_size=500, chunk_overlap=50

Smaller chunks — more granular but may split rules across chunks.

In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configuration 1: Small chunks
splitter_500 = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks_500 = splitter_500.split_documents(all_documents)

chunk_lengths_500 = [len(c.page_content) for c in chunks_500]

print("Configuration 1: chunk_size=500, chunk_overlap=50")
print(f"  Total chunks created: {len(chunks_500)}")
print(f"  Smallest chunk: {min(chunk_lengths_500)} characters")
print(f"  Largest chunk: {max(chunk_lengths_500)} characters")
print(f"  Average chunk: {sum(chunk_lengths_500) // len(chunk_lengths_500)} characters")

Configuration 1: chunk_size=500, chunk_overlap=50
  Total chunks created: 80
  Smallest chunk: 38 characters
  Largest chunk: 499 characters
  Average chunk: 408 characters


### Configuration 2: chunk_size=1000, chunk_overlap=100

Larger chunks preserve more context per chunk, which suits rule-based documents with longer paragraphs.

In [7]:
# Configuration 2: Larger chunks
splitter_1000 = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks_1000 = splitter_1000.split_documents(all_documents)

chunk_lengths_1000 = [len(c.page_content) for c in chunks_1000]

print("Configuration 2: chunk_size=1000, chunk_overlap=100")
print(f"  Total chunks created: {len(chunks_1000)}")
print(f"  Smallest chunk: {min(chunk_lengths_1000)} characters")
print(f"  Largest chunk: {max(chunk_lengths_1000)} characters")
print(f"  Average chunk: {sum(chunk_lengths_1000) // len(chunk_lengths_1000)} characters")

Configuration 2: chunk_size=1000, chunk_overlap=100
  Total chunks created: 40
  Smallest chunk: 35 characters
  Largest chunk: 999 characters
  Average chunk: 818 characters


### Observations: Chunk Size Comparison

**Key differences between the two configurations:**

- The 500-character configuration produces **twice as many chunks** as the 1000-character one (80 vs. 40), which means more vector store entries to embed and search.
- Smaller chunks risk **splitting a complete rule across two chunks**, losing context. For example, a rule about powerplay fielding restrictions might get cut in half.
- Larger chunks (1000) preserve more context per chunk, which is **better suited for rule-based documents** where a single rule or regulation often spans several sentences.
- The overlap (50 vs. 100 characters) mitigates boundary issues by repeating text on both sides of a split; larger chunks have fewer boundaries, so they depend on overlap less.

**Decision:** We proceed with `chunk_size=1000, chunk_overlap=100` for the rest of this notebook.

In [8]:
# Use the 1000-character chunks going forward
chunks = chunks_1000
print(f"Using {len(chunks)} chunks (chunk_size=1000) for the rest of the pipeline.")

Using 40 chunks (chunk_size=1000) for the rest of the pipeline.


---
## Step 3: Embed + Store

We create vector embeddings for each chunk using OpenAI's `text-embedding-3-small` model and store them in an in-memory ChromaDB vector store. Each chunk retains metadata about its source file and page number for traceability.

In [9]:
# Get OpenAI API key from Colab secrets
from google.colab import userdata
api_key = userdata.get('OPENAI_API_KEY')
print("OpenAI API key loaded successfully.")

OpenAI API key loaded successfully.


In [10]:
# Enrich chunk metadata with clean source filename and page number
for chunk in chunks:
    # Extract just the filename from the full path
    chunk.metadata['source'] = os.path.basename(chunk.metadata.get('source', 'unknown'))
    chunk.metadata['page'] = chunk.metadata.get('page', 0)

# Verify metadata
print("Sample chunk metadata:")
for i in range(min(3, len(chunks))):
    print(f"  Chunk {i}: source={chunks[i].metadata['source']}, page={chunks[i].metadata['page']}")

Sample chunk metadata:
  Chunk 0: source=01_match_structure_and_playing_conditions.pdf, page=0
  Chunk 1: source=01_match_structure_and_playing_conditions.pdf, page=0
  Chunk 2: source=01_match_structure_and_playing_conditions.pdf, page=0


In [11]:
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# Initialize OpenAI embeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=api_key
)

# Create ChromaDB vector store from chunks (in-memory)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    collection_name="icc_t20_rules"
)

print(f"Successfully embedded and stored {vectorstore._collection.count()} chunks in ChromaDB.")

Successfully embedded and stored 40 chunks in ChromaDB.


---
## Step 4: Test Retrieval (Before Wiring Up the LLM)

Before building the full RAG chain, we test the retrieval component in isolation. We run 3 test queries using `similarity_search` with `k=3` and inspect whether the retrieved chunks are actually relevant to the question.
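
Conceptually, `similarity_search` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that ranking step in plain Python, using made-up 3-dimensional vectors and cosine similarity. Both are assumptions for illustration: the distance metric Chroma actually applies depends on how the collection is configured, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=3):
    """Return the labels of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Toy "vector store": labels and vectors are invented for this example
toy_store = {
    "powerplay": [0.9, 0.1, 0.0],
    "free_hit": [0.1, 0.9, 0.0],
    "super_over": [0.0, 0.1, 0.9],
    "toss": [0.5, 0.5, 0.1],
}
query = [1.0, 0.2, 0.0]

print(top_k(query, toy_store, k=2))
```

Note that the second-ranked item is only loosely related to the query; a top-`k` search always returns `k` results, whether or not they are relevant, which is why inspecting the retrieved chunks matters.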

In [12]:
# Define test queries
test_queries = [
    "What are the fielding restrictions during powerplay overs?",
    "How does the free hit rule work after a no-ball?",
    "What happens when a Super Over is also tied?"
]

# Run similarity search for each query
for i, query in enumerate(test_queries, 1):
    print(f"{'='*70}")
    print(f"Query {i}: {query}")
    print(f"{'='*70}")

    results = vectorstore.similarity_search(query, k=3)

    for j, doc in enumerate(results, 1):
        print(f"\n  --- Result {j} ---")
        print(f"  Source: {doc.metadata['source']}")
        print(f"  Page: {doc.metadata['page']}")
        print(f"  Content preview: {doc.page_content[:200]}...")
    print()

Query 1: What are the fielding restrictions during powerplay overs?

  --- Result 1 ---
  Source: 01_match_structure_and_playing_conditions.pdf
  Page: 0
  Content preview: Middle Overs (Overs 7-20): From the seventh over onwards, a maximum of five fielders are permitted
outside the 30-yard circle. This allows the bowling side to set more defensive and spread-out fields ...

  --- Result 2 ---
  Source: 01_match_structure_and_playing_conditions.pdf
  Page: 0
  Content preview: wicketkeeper. Teams must submit their final XI to the match referee before the toss.
3. THE TOSS
Before the match begins, the two captains participate in a coin toss. The captain who wins the toss cho...

  --- Result 3 ---
  Source: 01_match_structure_and_playing_conditions.pdf
  Page: 1
  Content preview:  A maximum of 2 fielders are allowed behind square on the leg side.
 Fielders (other than the wicketkeeper) may not move significantly from their position until the bowler
releases the ball.
6. OVER...

Quer

### Retrieval Relevance Annotations

**Query 1: "What are the fielding restrictions during powerplay overs?"**
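- **Relevant?** Partly. All three results come from `01_match_structure_and_playing_conditions.pdf`, the expected document. However, the 200-character previews open with the Middle Overs (overs 7-20) rule, the toss, and general fielder-placement limits; none of them shows the overs 1-6 powerplay text, so the full chunk content should be inspected before treating this retrieval as a success.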

**Query 2: "How does the free hit rule work after a no-ball?"**
- **Relevant?** Yes — We expect chunks from `02_bowling_rules_no_balls_and_free_hits.pdf` detailing free hit delivery rules, what constitutes dismissal on a free hit, and field restrictions.

**Query 3: "What happens when a Super Over is also tied?"**
- **Relevant?** Yes — We expect chunks from `04_special_match_situations.pdf` explaining the Super Over tiebreaker procedure, including what happens if the Super Over itself is tied (e.g., count of boundaries, shorter boundary, coin toss).

---
## Step 5: Build the RAG Chain

We now wire up the full RAG pipeline: retriever → prompt template → LLM. We use `ChatOpenAI` with `gpt-4o-mini` and a custom prompt template that instructs the model to answer only from the provided context and cite its sources.

In [13]:
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    openai_api_key=api_key
)

# Custom prompt template
prompt_template = """You are an expert on ICC T20 cricket rules. Use ONLY the following context to answer the question. Do not use any outside knowledge.

If the context does not contain enough information to answer the question, say: "I don't have enough information to answer this."

Always cite which document(s) the information comes from by referencing the source filename.

Context:
{context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Create the retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Build the RetrievalQA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

print("RAG chain built successfully!")
print(f"  LLM: gpt-4o-mini")
print(f"  Retriever: ChromaDB similarity search (k=3)")
print(f"  Chain type: stuff (all retrieved docs concatenated into prompt)")

RAG chain built successfully!
  LLM: gpt-4o-mini
  Retriever: ChromaDB similarity search (k=3)
  Chain type: stuff (all retrieved docs concatenated into prompt)
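
One thing worth checking is whether the model can actually see the filenames it is asked to cite. As far as we can tell, the `stuff` chain's default document prompt inserts only each chunk's `page_content`, so metadata such as `source` would not appear in `{context}` unless a custom `document_prompt` is supplied through `chain_type_kwargs`; treat that as an assumption to verify against the installed LangChain version. The sketch below shows the idea in plain Python. `Doc` is a stand-in class (not LangChain's `Document`) and `format_context` is a hypothetical helper.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved chunk: text plus metadata (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs):
    """Prefix every chunk with its source filename and page so the LLM can cite them."""
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "N/A")
        blocks.append(f"[Source: {source}, page {page}]\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    Doc("Each team bats for a maximum of 20 overs per innings.",
        {"source": "01_match_structure_and_playing_conditions.pdf", "page": 0}),
    Doc("A maximum of 2 fielders are allowed behind square on the leg side.",
        {"source": "01_match_structure_and_playing_conditions.pdf", "page": 1}),
]

context = format_context(docs)
print(context)
```

In the notebook's chain, the equivalent would be a `PromptTemplate` over `page_content`, `source`, and `page` passed as `document_prompt`; that variant has not been run here.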


### Test the RAG Chain with the Same 3 Queries

We now run the same test queries through the full RAG chain (retrieval + generation) and examine the generated answers.

In [14]:
# Run the 3 test queries through the full RAG chain
for i, query in enumerate(test_queries, 1):
    print(f"{'='*70}")
    print(f"Query {i}: {query}")
    print(f"{'='*70}")

    result = qa_chain.invoke({"query": query})

    print(f"\nGenerated Answer:")
    print(f"{result['result']}")

    print(f"\nSource Documents Used:")
    for j, doc in enumerate(result['source_documents'], 1):
        print(f"  [{j}] {doc.metadata['source']} (page {doc.metadata['page']})")
    print()

Query 1: What are the fielding restrictions during powerplay overs?

Generated Answer:
During the powerplay overs (Overs 1-6), only two fielders are allowed outside the 30-yard circle. The remaining nine players, including the bowler and wicketkeeper, must stay within the inner circle. This phase encourages aggressive batting at the start of the innings by limiting the fielding team's boundary protection. (Source: context document)

Source Documents Used:
  [1] 01_match_structure_and_playing_conditions.pdf (page 0)
  [2] 01_match_structure_and_playing_conditions.pdf (page 0)
  [3] 01_match_structure_and_playing_conditions.pdf (page 1)

Query 2: How does the free hit rule work after a no-ball?

Generated Answer:
A free hit is awarded on the delivery immediately following any no-ball. This means that after a no-ball is called, the next delivery will be a free hit for the batter. During a free hit delivery, the batter cannot be dismissed by any method that would credit the dismissal to th

---
## Step 6: Evaluate

We load the 5 evaluation questions from `eval_questions.json`, run them through the RAG pipeline, and assess three dimensions:
1. **Retrieval accuracy** — Did the retriever fetch chunks from the correct source document?
2. **Faithfulness** — Is the answer grounded in the retrieved context (not hallucinated)?
3. **Correctness** — Does the generated answer match the expected answer?
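
Retrieval accuracy, as measured in this notebook, is a document-level hit check: a question counts as correct if any of the top-`k` chunks comes from the expected file. The sketch below contrasts that with a stricter variant that reports what fraction of the retrieved chunks come from the expected file. Both function names are hypothetical, the retrieved list is invented, and the comparison uses exact filename equality rather than the substring test used in the evaluation cell.

```python
def hit_at_k(retrieved_sources, expected_source):
    """True if at least one retrieved chunk comes from the expected document."""
    return expected_source in retrieved_sources


def source_precision(retrieved_sources, expected_source):
    """Fraction of retrieved chunks that come from the expected document."""
    if not retrieved_sources:
        return 0.0
    matches = sum(1 for src in retrieved_sources if src == expected_source)
    return matches / len(retrieved_sources)


# Invented retrieval result for illustration (not an output of this notebook)
retrieved = [
    "03_drs_and_umpiring.pdf",
    "04_special_match_situations.pdf",
    "03_drs_and_umpiring.pdf",
]
expected = "03_drs_and_umpiring.pdf"

print(hit_at_k(retrieved, expected))
print(round(source_precision(retrieved, expected), 2))
```

Neither metric confirms that the right passage was retrieved, only the right file; checking that would need section-level labels on the chunks.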

In [15]:
import json

# Load evaluation questions
eval_path = os.path.join(DATA_PATH, "eval_questions.json")
with open(eval_path, 'r') as f:
    eval_questions = json.load(f)

print(f"Loaded {len(eval_questions)} evaluation questions:\n")
for q in eval_questions:
    print(f"  [{q['id']}] {q['question']}")
    print(f"       Source: {q['source_document']} | Section: {q['source_section']}")
    print()

Loaded 5 evaluation questions:

  [1] How many fielders are allowed outside the 30-yard circle during the powerplay overs in a T20 International?
       Source: 01_match_structure_and_playing_conditions.pdf | Section: 4. INNINGS STRUCTURE AND POWERPLAY

  [2] What happens if a free hit delivery itself is bowled as a no-ball or wide?
       Source: 02_bowling_rules_no_balls_and_free_hits.pdf | Section: 3.5 Cascading Free Hits

  [3] How many unsuccessful DRS reviews does each team get per innings in T20 International cricket, and what happens if the result is umpire's call?
       Source: 03_drs_and_umpiring.pdf | Section: 2. NUMBER OF REVIEWS PER TEAM

  [4] What are the rules if a Super Over also ends in a tie in T20 International cricket?
       Source: 04_special_match_situations.pdf | Section: 1.3 If the Super Over Is Also Tied

  [5] What is the in-match penalty for slow over rates in T20 International cricket?
       Source: 01_match_structure_and_playing_conditions.pdf | Section

In [16]:
# Run all eval questions through the RAG pipeline
eval_results = []

for q in eval_questions:
    result = qa_chain.invoke({"query": q['question']})

    # Check if retrieved chunks come from the correct source document
    source_docs = [doc.metadata['source'] for doc in result['source_documents']]
    retrieval_correct = any(q['source_document'] in src for src in source_docs)

    eval_results.append({
        'id': q['id'],
        'question': q['question'],
        'expected_answer': q['expected_answer'],
        'rag_answer': result['result'],
        'source_docs': source_docs,
        'expected_source': q['source_document'],
        'retrieval_correct': retrieval_correct
    })

    print(f"Processed question {q['id']}: {q['question'][:50]}...")

print(f"\nAll {len(eval_results)} questions processed!")

Processed question 1: How many fielders are allowed outside the 30-yard ...
Processed question 2: What happens if a free hit delivery itself is bowl...
Processed question 3: How many unsuccessful DRS reviews does each team g...
Processed question 4: What are the rules if a Super Over also ends in a ...
Processed question 5: What is the in-match penalty for slow over rates i...

All 5 questions processed!


### Automated Evaluation with LLM

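We use the LLM to assess faithfulness (is the answer grounded in context?) and correctness (does it match the expected answer?). This is a lightweight approach with two caveats: the judge is the same `gpt-4o-mini` model that generated the answers, and the prompt gives it only the retrieved source filenames, not the chunk text, so its faithfulness verdict is a rough proxy.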

In [17]:
# Use LLM to evaluate faithfulness and correctness
eval_prompt = """You are an evaluation judge. Given the following:

Question: {question}
Expected Answer: {expected}
Generated Answer: {generated}
Retrieved Sources: {sources}

Evaluate two things:
1. FAITHFULNESS: Is the generated answer grounded in the retrieved context (not hallucinated)? Answer YES or NO.
2. CORRECTNESS: Does the generated answer convey the same key information as the expected answer? Answer YES or NO.

Respond in exactly this format:
FAITHFULNESS: YES/NO
CORRECTNESS: YES/NO"""

for r in eval_results:
    eval_query = eval_prompt.format(
        question=r['question'],
        expected=r['expected_answer'],
        generated=r['rag_answer'],
        sources=', '.join(r['source_docs'])
    )

    eval_response = llm.invoke(eval_query)
    response_text = eval_response.content.upper()

    r['faithful'] = 'FAITHFULNESS: YES' in response_text
    r['correct'] = 'CORRECTNESS: YES' in response_text

print("Evaluation complete!")

Evaluation complete!
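
The substring checks above work when the judge follows the format exactly, but they would also count a reply that merely echoes the template line `FAITHFULNESS: YES/NO` as a YES. A slightly stricter parser, sketched below with a hypothetical `parse_verdict` helper, accepts only a bare YES or NO at the end of a line and returns `None` when a verdict is missing, so malformed replies are not silently scored.

```python
import re

# One verdict per line: a label, a colon, then exactly YES or NO and nothing else
_VERDICT = re.compile(
    r"^\s*(FAITHFULNESS|CORRECTNESS)\s*:\s*(YES|NO)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_verdict(response_text):
    """Extract judge verdicts; a value stays None if its line is missing or malformed."""
    verdict = {"faithful": None, "correct": None}
    keys = {"FAITHFULNESS": "faithful", "CORRECTNESS": "correct"}
    for label, answer in _VERDICT.findall(response_text):
        verdict[keys[label.upper()]] = answer.upper() == "YES"
    return verdict


well_formed = parse_verdict("FAITHFULNESS: YES\nCORRECTNESS: NO")
malformed = parse_verdict("FAITHFULNESS: YES/NO\nCORRECTNESS: maybe")

print(well_formed)
print(malformed)
```

A `None` verdict can then be reported separately or retried, instead of being folded into the NO count.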


In [18]:
# Display results in a clean table
print(f"{'='*120}")
print(f"{'EVALUATION RESULTS':^120}")
print(f"{'='*120}")

for r in eval_results:
    retrieval_icon = '\u2705' if r['retrieval_correct'] else '\u274C'
    faithful_icon = '\u2705' if r['faithful'] else '\u274C'
    correct_icon = '\u2705' if r['correct'] else '\u274C'

    print(f"\n{'─'*120}")
    print(f"Question {r['id']}: {r['question']}")
    print(f"{'─'*120}")
    print(f"Expected Answer  : {r['expected_answer'][:150]}")
    print(f"RAG Answer       : {r['rag_answer'][:150]}")
    print(f"Retrieved Sources: {', '.join(r['source_docs'])}")
    print(f"Expected Source  : {r['expected_source']}")
    print(f"")
    print(f"  {retrieval_icon} Retrieval (correct source?)  |  {faithful_icon} Faithfulness (grounded?)  |  {correct_icon} Correctness (matches expected?)")

print(f"\n{'='*120}")

                                                   EVALUATION RESULTS                                                   

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Question 1: How many fielders are allowed outside the 30-yard circle during the powerplay overs in a T20 International?
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Expected Answer  : During the powerplay (overs 1-6), only 2 fielders are allowed outside the 30-yard circle.
RAG Answer       : I don't have enough information to answer this.
Retrieved Sources: 01_match_structure_and_playing_conditions.pdf, 01_match_structure_and_playing_conditions.pdf, 01_match_structure_and_playing_conditions.pdf
Expected Source  : 01_match_structure_and_playing_conditions.pdf

  ✅ Retrieval (correct source?)  |  ❌ Faithfulness (grounded?)  |  ❌ Correctness (matches expected?)

─────────────

In [19]:
# Calculate and print final accuracy scores
retrieval_score = sum(1 for r in eval_results if r['retrieval_correct'])
faithful_score = sum(1 for r in eval_results if r['faithful'])
correct_score = sum(1 for r in eval_results if r['correct'])
total = len(eval_results)

print(f"\n{'='*50}")
print(f"{'FINAL ACCURACY SCORES':^50}")
print(f"{'='*50}")
print(f"  Retrieval Accuracy : {retrieval_score}/{total}")
print(f"  Faithfulness       : {faithful_score}/{total}")
print(f"  Correctness        : {correct_score}/{total}")
print(f"{'='*50}")


              FINAL ACCURACY SCORES               
  Retrieval Accuracy : 5/5
  Faithfulness       : 4/5
  Correctness        : 4/5


### Observations on Evaluation

**What worked well:**
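- In all 5 questions, at least one of the top-3 retrieved chunks came from the expected source document (5/5). This is a document-level check, so it does not confirm that the right passage was retrieved: Question 1 matched the correct PDF, yet the model still answered "I don't have enough information to answer this."
- The custom prompt keeps the LLM grounded: instructed to use only the provided context, it declined to answer Question 1 rather than guess.
- GPT-4o-mini produces concise, well-structured answers. Citation is less reliable: the Query 1 answer in Step 5 ends with "(Source: context document)" rather than a filename.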

**Potential issues:**
- Some questions may retrieve partially relevant chunks if the rule spans multiple pages or overlaps with similar content in other documents.
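- The faithfulness check uses the same LLM as judge and shows it only source filenames, not the retrieved text, so it cannot directly verify grounding. It also marked the Question 1 refusal as unfaithful, which is debatable, since a refusal asserts nothing beyond the context.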
- With only 5 eval questions, the sample size is small — a larger eval set would give more confidence.
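
To put a number on how little 5 questions can tell us, the sketch below computes a 95% Wilson score interval for a proportion. The Wilson interval is our choice for illustration (the notebook itself computes no interval), and it treats the questions as independent trials, which is an assumption.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(4, 5)
print(f"4/5 correct -> 95% interval: {low:.2f} to {high:.2f}")

low_all, high_all = wilson_interval(5, 5)
print(f"5/5 correct -> 95% interval: {low_all:.2f} to {high_all:.2f}")
```

For 4/5 the interval spans roughly 38% to 96%, and it overlaps heavily with the interval for 5/5, so a one-question difference between two configurations is not strong evidence that one is better.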

---
## Stretch Goal A: Chunk Size Comparison

We re-run the full evaluation pipeline with 3 different chunk sizes (500, 1000, 1500) to understand how chunk size impacts retrieval accuracy, faithfulness, and correctness.

In [20]:
# Run the full eval pipeline for 3 chunk sizes
chunk_configs = [
    {"chunk_size": 500, "chunk_overlap": 50},
    {"chunk_size": 1000, "chunk_overlap": 100},
    {"chunk_size": 1500, "chunk_overlap": 150},
]

comparison_results = []

for config in chunk_configs:
    cs = config['chunk_size']
    co = config['chunk_overlap']
    print(f"\nProcessing chunk_size={cs}, chunk_overlap={co}...")

    # 1. Split documents
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cs,
        chunk_overlap=co,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    temp_chunks = splitter.split_documents(all_documents)

    # Enrich metadata
    for chunk in temp_chunks:
        chunk.metadata['source'] = os.path.basename(chunk.metadata.get('source', 'unknown'))
        chunk.metadata['page'] = chunk.metadata.get('page', 0)

    print(f"  Created {len(temp_chunks)} chunks")

    # 2. Embed and store
    temp_vectorstore = Chroma.from_documents(
        documents=temp_chunks,
        embedding=embeddings,
        collection_name=f"icc_t20_rules_{cs}"
    )

    # 3. Build chain
    temp_retriever = temp_vectorstore.as_retriever(search_kwargs={"k": 3})
    temp_qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=temp_retriever,
        return_source_documents=True,
        chain_type_kwargs={"prompt": PROMPT}
    )

    # 4. Run eval
    temp_eval = []
    for q in eval_questions:
        result = temp_qa_chain.invoke({"query": q['question']})
        source_docs = [doc.metadata['source'] for doc in result['source_documents']]
        retrieval_correct = any(q['source_document'] in src for src in source_docs)

        # LLM evaluation
        eval_query = eval_prompt.format(
            question=q['question'],
            expected=q['expected_answer'],
            generated=result['result'],
            sources=', '.join(source_docs)
        )
        eval_response = llm.invoke(eval_query)
        response_text = eval_response.content.upper()

        temp_eval.append({
            'retrieval_correct': retrieval_correct,
            'faithful': 'FAITHFULNESS: YES' in response_text,
            'correct': 'CORRECTNESS: YES' in response_text
        })

    # Aggregate scores
    comparison_results.append({
        'chunk_size': cs,
        'chunk_overlap': co,
        'num_chunks': len(temp_chunks),
        'retrieval': sum(1 for r in temp_eval if r['retrieval_correct']),
        'faithfulness': sum(1 for r in temp_eval if r['faithful']),
        'correctness': sum(1 for r in temp_eval if r['correct'])
    })

    print(f"  Done! Retrieval: {comparison_results[-1]['retrieval']}/5, "
          f"Faithfulness: {comparison_results[-1]['faithfulness']}/5, "
          f"Correctness: {comparison_results[-1]['correctness']}/5")

print("\nChunk comparison complete!")


Processing chunk_size=500, chunk_overlap=50...
  Created 80 chunks
  Done! Retrieval: 5/5, Faithfulness: 5/5, Correctness: 4/5

Processing chunk_size=1000, chunk_overlap=100...
  Created 40 chunks
  Done! Retrieval: 5/5, Faithfulness: 4/5, Correctness: 4/5

Processing chunk_size=1500, chunk_overlap=150...
  Created 27 chunks
  Done! Retrieval: 5/5, Faithfulness: 5/5, Correctness: 5/5

Chunk comparison complete!


In [21]:
# Display comparison table
print(f"\n{'='*85}")
print(f"{'CHUNK SIZE COMPARISON TABLE':^85}")
print(f"{'='*85}")
print(f"{'Chunk Size':<12} {'Overlap':<10} {'Chunks':<10} {'Retrieval':<14} {'Faithfulness':<16} {'Correctness':<14}")
print(f"{'─'*85}")

for r in comparison_results:
    print(f"{r['chunk_size']:<12} {r['chunk_overlap']:<10} {r['num_chunks']:<10} "
          f"{r['retrieval']}/5{'':<10} {r['faithfulness']}/5{'':<12} {r['correctness']}/5")

print(f"{'='*85}")

# Find best configuration
best = max(comparison_results, key=lambda x: (x['correctness'], x['faithfulness'], x['retrieval']))
print(f"\nBest configuration: chunk_size={best['chunk_size']}, chunk_overlap={best['chunk_overlap']}")
print(f"  Scores: Retrieval={best['retrieval']}/5, Faithfulness={best['faithfulness']}/5, Correctness={best['correctness']}/5")


                             CHUNK SIZE COMPARISON TABLE                             
Chunk Size   Overlap    Chunks     Retrieval      Faithfulness     Correctness   
─────────────────────────────────────────────────────────────────────────────────────
500          50         80         5/5           5/5             4/5
1000         100        40         5/5           4/5             4/5
1500         150        27         5/5           5/5             5/5

Best configuration: chunk_size=1500, chunk_overlap=150
  Scores: Retrieval=5/5, Faithfulness=5/5, Correctness=5/5


### Observations on Chunk Size Comparison

**Key findings:**

- **chunk_size=500:** Produces the most chunks (80), each with less context. It scored 5/5 on retrieval and faithfulness and 4/5 on correctness. The risk with small chunks is that the retriever fetches only part of a rule's explanation, leading to an incomplete answer.

- **chunk_size=1000:** The configuration used in the main pipeline (40 chunks). It scored 5/5 on retrieval but 4/5 on both faithfulness and correctness, the lowest of the three in this run.

- **chunk_size=1500:** Fewer, larger chunks (27) that may include multiple rules in a single chunk. Larger chunks can introduce noise by passing more irrelevant text to the model, but in this run it was the only configuration to score 5/5 on all three metrics.

**How to read these results:**
The scores do not support the earlier decision to use `chunk_size=1000`: on this eval set, 1500 performed best and 1000 performed worst. The configurations differ by at most one question out of five, though, and the verdicts come from an LLM judge, so the ranking could change on a rerun or with a larger eval set. More evaluation questions are needed before settling on a chunk size.

---
## Summary & Key Takeaways

**What we built:** A complete RAG-powered Q&A system for ICC T20 cricket rules using LangChain, OpenAI embeddings, ChromaDB, and GPT-4o-mini.

**What worked well:**
- The RAG pipeline retrieves chunks from the expected source document for all 5 eval questions and answers 4 of 5 correctly with the main configuration.
- ChromaDB with OpenAI's `text-embedding-3-small` provides effective semantic search across the cricket rules corpus.
- The custom prompt template restricts the model to the provided context, which led it to decline rather than guess on the one question it could not answer.
- In the chunk size comparison, `chunk_size=1500` scored highest (5/5 on all three metrics), ahead of the 1000-character chunks used in the main pipeline.

**What could be improved:**
- **Evaluation scale:** 5 questions is a small eval set. A production system would need 50-100+ diverse questions.
- **Retrieval tuning:** We could experiment with different `k` values, hybrid search (keyword + semantic), or re-ranking retrieved results.
- **Embedding models:** Testing with other embedding models (e.g., `text-embedding-3-large`) might improve retrieval precision.
- **Metadata filtering:** We could add pre-filtering by document before similarity search to ensure the right document is prioritized.
- **Evaluation rigor:** Using a separate evaluator model or human evaluation would be more reliable than self-evaluation.
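
As one illustration of the hybrid-search idea, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF). This is a technique we name as one option, not something the notebook implements; the chunk IDs and both rankings are made up, and `k=60` is a commonly used smoothing constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists: each item scores 1 / (k + rank) per list it appears in."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented rankings from two hypothetical retrievers
keyword_ranking = ["chunk_07", "chunk_02", "chunk_11"]
semantic_ranking = ["chunk_02", "chunk_05", "chunk_07"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, which rewards agreement between the two retrievers without needing their raw scores to be on the same scale.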

**Overall:** The pipeline demonstrates that RAG is a practical and effective approach for building a Q&A system over domain-specific documents like cricket rules.