A self-contained retrieval-augmented generation system that answers questions about municipal policy documents using TF-IDF and BM25 retrieval with re-ranking.
When organizations accumulate policy documents, bylaws, and operational guides, finding the right passage to answer a specific question becomes slow and error-prone. This project builds a RAG pipeline that indexes municipal policy documents into overlapping text chunks, retrieves the most relevant passages for a given question, and presents them with relevance scores and highlighted matching terms. No external API calls are needed: all retrieval and scoring run locally using scikit-learn and rank_bm25.
Problem → Finding answers in a growing corpus of municipal policy documents
Solution → TF-IDF and BM25 retrieval with term-overlap re-ranking
Impact → MRR 0.82, Precision@3 0.89 across 30 evaluation questions on 15 documents
| Metric | TF-IDF | BM25 |
|---|---|---|
| MRR | 0.82 | 0.80 |
| Precision@1 | 0.77 | 0.73 |
| Precision@3 | 0.89 | 0.87 |
| Recall@5 | 0.93 | 0.90 |
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Document │───▶│ Text chunking │───▶│ TF-IDF / BM25 │
│ loading │ │ with overlap │ │ indexing │
└──────────────────┘ └──────────────────┘ └────────┬─────────┘
│
┌──────────────────────────────┘
▼
┌──────────────────────┐ ┌──────────────────────┐
│ Cosine similarity │───▶│ Term-overlap │
│ retrieval │ │ re-ranking │
└──────────────────────┘ └──────────┬───────────┘
│
┌──────────────────────────┘
▼
┌──────────────────────┐ ┌──────────────────────┐
│ Passage ranking │───▶│ Answer │
│ with scores │ │ presentation │
└──────────────────────┘ └──────────────────────┘
Project structure
project_18_rag_document_qa/
├── data/
│ ├── documents.json # 15 municipal policy documents
│ ├── eval_qa.json # 30 evaluation Q&A pairs
│ └── generate_data.py # Synthetic data generator
├── src/
│ ├── __init__.py
│ ├── data_loader.py # Document loading and chunking
│ └── model.py # Retrieval models and evaluation
├── notebooks/
│ ├── 01_eda.ipynb # Document statistics and vocabulary
│ ├── 02_feature_engineering.ipynb # Text preprocessing and indexing
│ ├── 03_modeling.ipynb # TF-IDF vs BM25 comparison
│ └── 04_evaluation.ipynb # Full evaluation and error analysis
├── figures/
├── app.py # Streamlit dashboard
├── requirements.txt
└── README.md
# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_18_rag_document_qa
# Install dependencies
pip install -r requirements.txt
# Generate document data
python data/generate_data.py
# Launch dashboard
streamlit run app.py

Dataset
| Property | Details |
|---|---|
| Source | Synthetic municipal policy documents |
| Documents | 15 (land use, transit, water, housing, parks, etc.) |
| Evaluation questions | 30 with ground truth document IDs |
| Chunk size | 500 characters with 50-character overlap |
| Domain | Calgary municipal policy and public services |
Document chunking
- Fixed-size character chunks (default 500 characters) with configurable overlap (default 50)
- Sentence boundary detection to avoid splitting mid-sentence
- Each chunk retains metadata linking it back to the source document
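The chunking strategy above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function and field names (`chunk_document`, `doc_id`, `start`) are placeholders, and the sentence-boundary heuristic here simply backtracks to the last `". "` inside the window.

```python
def chunk_document(text, doc_id, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap,
    preferring to end each chunk at a sentence boundary."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence boundary inside the window
            boundary = text.rfind(". ", start, end)
            if boundary > start:
                end = boundary + 1  # keep the period
        # Each chunk carries metadata linking it back to its source
        chunks.append({"doc_id": doc_id, "start": start, "text": text[start:end]})
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

Stepping back by `overlap` characters means consecutive chunks share a margin of text, so a sentence that straddles a window edge still appears intact in at least one chunk.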
TF-IDF retrieval
- Scikit-learn TfidfVectorizer with sublinear TF scaling and English stop words
- Unigram and bigram features up to 5,000 terms
- Cosine similarity between query vector and all chunk vectors
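The TF-IDF configuration described above maps directly onto scikit-learn. A small sketch with a toy three-chunk corpus (the chunk texts and query are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Residential zoning permits secondary suites in most districts.",
    "Transit fares are reduced for seniors and low-income riders.",
    "Water restrictions apply during declared drought stages.",
]

# Mirror the settings listed above: sublinear TF scaling, English
# stop words, unigrams + bigrams, vocabulary capped at 5,000 terms.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    stop_words="english",
    ngram_range=(1, 2),
    max_features=5000,
)
chunk_matrix = vectorizer.fit_transform(chunks)

query = "Are secondary suites allowed in residential zones?"
query_vec = vectorizer.transform([query])

# Cosine similarity between the query vector and every chunk vector
scores = cosine_similarity(query_vec, chunk_matrix).ravel()
ranked = scores.argsort()[::-1]  # chunk indices, best match first
```

Because the query is transformed with the same fitted vectorizer, query and chunks live in one vector space and cosine similarity gives a comparable relevance score per chunk.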
BM25 retrieval
- Okapi BM25 with k1=1.5 and b=0.75 parameters
- Token-level matching with term frequency saturation
- Length normalization relative to average document length
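To make the saturation and length-normalization behavior concrete, here is a compact re-implementation of Okapi BM25 scoring with the same k1=1.5, b=0.75 defaults. The project itself uses the rank_bm25 package; this sketch just spells out the formula that package applies.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against a query."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency of each term
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for q in query_tokens:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # Saturating term frequency, normalized by document
            # length relative to the corpus average
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

The `tf / (tf + k1·…)` shape is what caps the benefit of repeating a term, while the `b·dl/avgdl` factor discounts matches in longer-than-average chunks.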
Re-ranking
- Term overlap scoring as a lightweight cross-encoder alternative
- Combines unigram overlap, bigram overlap bonus, and passage length penalty
- Weighted combination: 60% retrieval score + 40% re-ranking score
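One way the re-ranking described above could look in code. The 60/40 weighting follows the description; the exact overlap, bonus, and penalty formulas here are illustrative assumptions, not the project's precise scoring function.

```python
def rerank_score(query, passage, retrieval_score):
    """Blend a retrieval score with a lightweight term-overlap score."""
    q_tokens = query.lower().split()
    p_tokens = passage.lower().split()

    # Unigram overlap: fraction of distinct query terms in the passage
    overlap = len(set(q_tokens) & set(p_tokens)) / max(len(set(q_tokens)), 1)

    # Bigram bonus: shared adjacent word pairs reward phrase matches
    q_bi = set(zip(q_tokens, q_tokens[1:]))
    p_bi = set(zip(p_tokens, p_tokens[1:]))
    bigram_bonus = 0.5 * len(q_bi & p_bi) / max(len(q_bi), 1)

    # Mild penalty for very long passages (capped)
    length_penalty = min(len(p_tokens) / 200, 0.2)

    rerank = overlap + bigram_bonus - length_penalty
    # Weighted combination: 60% retrieval score + 40% re-ranking score
    return 0.6 * retrieval_score + 0.4 * rerank
```

Unlike a neural cross-encoder, this runs in microseconds per candidate, which is why it can be applied to every retrieved passage without a GPU.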
Evaluation
- 30 hand-crafted questions with ground truth relevant document IDs
- Metrics: Precision@k, Recall@k, MRR (mean reciprocal rank)
- Parameter sensitivity analysis across chunk sizes and k values
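The two headline metrics are standard and easy to state precisely. A minimal sketch, assuming each evaluation question carries a ranked list of retrieved document IDs and a set of ground-truth relevant IDs:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 if no relevant document is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k
```

For example, an MRR of 0.82 roughly means the first relevant passage typically lands at rank 1 with occasional drops to rank 2 or 3.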
Built as part of the Calgary Data Portfolio.