guydev42/rag-document-qa

Overview

A self-contained retrieval-augmented generation system that answers questions about municipal policy documents using TF-IDF and BM25 retrieval with re-ranking.

When organizations accumulate policy documents, bylaws, and operational guides, finding the right passage to answer a specific question becomes slow and error-prone. This project builds a RAG pipeline that splits municipal policy documents into overlapping text chunks, indexes them, retrieves the most relevant passages for a given question, and presents them with relevance scores and highlighted matching terms. No external API calls are needed: all retrieval and scoring run locally using scikit-learn and rank_bm25.

Problem   →  Finding answers in a growing corpus of municipal policy documents
Solution  →  TF-IDF and BM25 retrieval with term-overlap re-ranking
Impact    →  TF-IDF reaches MRR 0.82 and Precision@3 0.89 across 30 evaluation questions on 15 documents

Key results

Metric        TF-IDF   BM25
MRR           0.82     0.80
Precision@1   0.77     0.73
Precision@3   0.89     0.87
Recall@5      0.93     0.90

Architecture

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Document        │───▶│  Text chunking   │───▶│  TF-IDF / BM25   │
│  loading         │    │  with overlap    │    │  indexing        │
└──────────────────┘    └──────────────────┘    └────────┬─────────┘
                                                         │
                          ┌──────────────────────────────┘
                          ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │  Cosine similarity   │───▶│  Term-overlap        │
              │  retrieval           │    │  re-ranking          │
              └──────────────────────┘    └──────────┬───────────┘
                                                     │
                          ┌──────────────────────────┘
                          ▼
              ┌──────────────────────┐    ┌──────────────────────┐
              │  Passage ranking     │───▶│  Answer              │
              │  with scores         │    │  presentation        │
              └──────────────────────┘    └──────────────────────┘

Project structure

project_18_rag_document_qa/
├── data/
│   ├── documents.json                 # 15 municipal policy documents
│   ├── eval_qa.json                   # 30 evaluation Q&A pairs
│   └── generate_data.py               # Synthetic data generator
├── src/
│   ├── __init__.py
│   ├── data_loader.py                 # Document loading and chunking
│   └── model.py                       # Retrieval models and evaluation
├── notebooks/
│   ├── 01_eda.ipynb                   # Document statistics and vocabulary
│   ├── 02_feature_engineering.ipynb   # Text preprocessing and indexing
│   ├── 03_modeling.ipynb              # TF-IDF vs BM25 comparison
│   └── 04_evaluation.ipynb            # Full evaluation and error analysis
├── figures/
├── app.py                             # Streamlit dashboard
├── requirements.txt
└── README.md

Quickstart

# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_18_rag_document_qa

# Install dependencies
pip install -r requirements.txt

# Generate document data
python data/generate_data.py

# Launch dashboard
streamlit run app.py

Dataset

Property               Details
Source                 Synthetic municipal policy documents
Documents              15 (land use, transit, water, housing, parks, etc.)
Evaluation questions   30 with ground truth document IDs
Chunk size             500 characters with 50-character overlap
Domain                 Calgary municipal policy and public services

Tech stack


Methodology

Document chunking
  • Fixed-size character chunks (default 500 characters) with configurable overlap (default 50)
  • Sentence boundary detection to avoid splitting mid-sentence
  • Each chunk retains metadata linking it back to the source document
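
The chunking steps above can be sketched roughly as follows. This is a minimal illustration, not the code in src/data_loader.py: the function name `chunk_text`, the metadata keys, and the rule of only accepting a sentence boundary in the second half of the window are all assumptions made for the example.

```python
import re

def chunk_text(text, doc_id, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks that end on sentence boundaries.

    Hypothetical helper: names and boundary heuristic are illustrative assumptions.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            # Prefer the last sentence end in the second half of the window,
            # so a chunk never shrinks to a tiny fragment.
            boundaries = [m.end() for m in re.finditer(r"[.!?]\s", window)
                          if m.end() > chunk_size // 2]
            if boundaries:
                end = start + boundaries[-1]
        piece = text[start:end].strip()
        if piece:
            # Metadata links the chunk back to its source document.
            chunks.append({"doc_id": doc_id, "start": start, "text": piece})
        if end >= len(text):
            break
        # Step back by `overlap` characters, but always make progress.
        start = max(end - overlap, start + 1)
    return chunks
```

Because the overlap is measured in characters, a chunk may begin a few characters into a sentence even though it always tries to end on a sentence boundary.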
TF-IDF retrieval
  • Scikit-learn TfidfVectorizer with sublinear TF scaling and English stop words
  • Unigram and bigram features up to 5,000 terms
  • Cosine similarity between query vector and all chunk vectors
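
A minimal sketch of this retrieval step using the scikit-learn settings listed above. The three sample chunks and the helper name `retrieve_tfidf` are made up for illustration; the actual code in src/model.py may be organized differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative chunks only; the real index is built from data/documents.json.
chunks = [
    "Transit fares increase by five percent each January.",
    "Water restrictions apply during summer drought conditions.",
    "Park permits are required for events over fifty people.",
]

vectorizer = TfidfVectorizer(
    sublinear_tf=True,      # use 1 + log(tf) instead of raw term frequency
    stop_words="english",   # drop common English stop words
    ngram_range=(1, 2),     # unigram and bigram features
    max_features=5000,      # cap the vocabulary at 5,000 terms
)
chunk_matrix = vectorizer.fit_transform(chunks)

def retrieve_tfidf(query, top_k=3):
    """Return (chunk_index, cosine_score) pairs, best match first."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, chunk_matrix).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in ranked]
```

Calling `retrieve_tfidf("transit fares increase")` ranks the first chunk highest, since it is the only one sharing terms with the query.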
BM25 retrieval
  • Okapi BM25 with k1=1.5 and b=0.75 parameters
  • Token-level matching with term frequency saturation
  • Length normalization relative to average document length
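
To show what saturation and length normalization mean in practice, here is a small pure-Python sketch of Okapi BM25 scoring with k1=1.5 and b=0.75. It is not the rank_bm25 implementation the project uses: the function name `bm25_scores` is hypothetical, and the IDF variant with the "+ 1" inside the logarithm is an assumption chosen to keep scores non-negative.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalization relative to the average document length.
        length_norm = 1 - b + b * len(doc) / avgdl
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Assumed IDF variant (non-negative); libraries differ here.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Saturation: repeated occurrences add less and less to the score.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

A larger k1 lets repeated terms keep contributing for longer, and b=0 would switch length normalization off entirely.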
Re-ranking
  • Term overlap scoring as a lightweight cross-encoder alternative
  • Combines unigram overlap, bigram overlap bonus, and passage length penalty
  • Weighted combination: 60% retrieval score + 40% re-ranking score
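
One way the re-ranking described above could look is sketched below. Only the 60/40 weighting comes from this README; the function names, the bigram bonus of 0.5, the 100-token target length, and the penalty factor are assumptions for illustration. The sketch also assumes retrieval scores are already in the range 0 to 1 (true for cosine similarity; BM25 scores would need scaling first).

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, passage, bigram_bonus=0.5, target_len=100, length_penalty=0.1):
    """Term-overlap score in [0, 1]; all default values are illustrative."""
    q, p = tokenize(query), tokenize(passage)
    if not q or not p:
        return 0.0
    # Fraction of distinct query terms found in the passage.
    unigram = len(set(q) & set(p)) / len(set(q))
    # Fraction of query bigrams that also occur in the passage.
    q_bi, p_bi = set(zip(q, q[1:])), set(zip(p, p[1:]))
    bigram = len(q_bi & p_bi) / len(q_bi) if q_bi else 0.0
    # Penalize passages longer than the assumed target length.
    excess = max(0, len(p) - target_len) / target_len
    raw = (unigram + bigram_bonus * bigram) / (1 + bigram_bonus) - length_penalty * excess
    return max(0.0, raw)

def rerank(query, candidates, retrieval_weight=0.6):
    """candidates: list of (passage, retrieval_score). Returns re-sorted list."""
    combined = [
        (passage, retrieval_weight * score + (1 - retrieval_weight) * overlap_score(query, passage))
        for passage, score in candidates
    ]
    return sorted(combined, key=lambda item: item[1], reverse=True)
```

In this sketch a passage that matches the query wording exactly can overtake one with a slightly higher retrieval score, which is the intended effect of the 40% re-ranking share.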
Evaluation
  • 30 hand-crafted questions with ground truth relevant document IDs
  • Metrics: Precision@k, Recall@k, MRR (mean reciprocal rank)
  • Parameter sensitivity analysis across chunk sizes and k values
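
For reference, the three metrics can be computed as in the sketch below, using their standard textbook definitions. The function names are hypothetical, and the evaluation code in src/model.py may differ in detail, for example in how chunk-level results are mapped back to document IDs.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per question."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
```

For example, if the relevant document is ranked second for one question and first for another, the reciprocal ranks are 0.5 and 1.0, giving an MRR of 0.75.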

Acknowledgements

Built as part of the Calgary Data Portfolio.

