A self-contained retrieval-augmented generation system that answers questions about municipal policy documents using TF-IDF and BM25 retrieval with re-ranking.
When organizations accumulate policy documents, bylaws, and operational guides, finding the right passage to answer a specific question becomes slow and error-prone. This project builds a RAG pipeline that indexes municipal policy documents into overlapping text chunks, retrieves the most relevant passages for a given question, and presents them with relevance scores and highlighted matching terms. No external API calls are needed: all retrieval and scoring run locally using scikit-learn and rank_bm25.
Problem → Finding answers in a growing corpus of municipal policy documents
Solution → TF-IDF and BM25 retrieval with term-overlap re-ranking
Impact → MRR 0.82, Precision@3 0.89 across 30 evaluation questions on 15 documents
| Metric | TF-IDF | BM25 |
|---|---|---|
| MRR | 0.82 | 0.80 |
| Precision@1 | 0.77 | 0.73 |
| Precision@3 | 0.89 | 0.87 |
| Recall@5 | 0.93 | 0.90 |
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ Document │───▶│ Text chunking │───▶│ TF-IDF / BM25 │
│ loading │ │ with overlap │ │ indexing │
└──────────────────┘ └──────────────────┘ └────────┬─────────┘
│
┌──────────────────────────────┘
▼
┌──────────────────────┐ ┌──────────────────────┐
│ Cosine similarity │───▶│ Term-overlap │
│ retrieval │ │ re-ranking │
└──────────────────────┘ └──────────┬───────────┘
│
┌──────────────────────────┘
▼
┌──────────────────────┐ ┌──────────────────────┐
│ Passage ranking │───▶│ Answer │
│ with scores │ │ presentation │
└──────────────────────┘ └──────────────────────┘
Project structure
project_18_rag_document_qa/
├── data/
│ ├── documents.json # 15 municipal policy documents
│ ├── eval_qa.json # 30 evaluation Q&A pairs
│ └── generate_data.py # Synthetic data generator
├── src/
│ ├── __init__.py
│ ├── data_loader.py # Document loading and chunking
│ └── model.py # Retrieval models and evaluation
├── notebooks/
│ ├── 01_eda.ipynb # Document statistics and vocabulary
│ ├── 02_feature_engineering.ipynb # Text preprocessing and indexing
│ ├── 03_modeling.ipynb # TF-IDF vs BM25 comparison
│ └── 04_evaluation.ipynb # Full evaluation and error analysis
├── figures/
├── app.py # Streamlit dashboard
├── requirements.txt
└── README.md
# Clone and navigate
git clone https://github.com/guydev42/calgary-data-portfolio.git
cd calgary-data-portfolio/project_18_rag_document_qa
# Install dependencies
pip install -r requirements.txt
# Generate document data
python data/generate_data.py
# Launch dashboard
streamlit run app.py

Dataset
| Property | Details |
|---|---|
| Source | Synthetic municipal policy documents |
| Documents | 15 (land use, transit, water, housing, parks, etc.) |
| Evaluation questions | 30 with ground truth document IDs |
| Chunk size | 500 characters with 50-character overlap |
| Domain | Calgary municipal policy and public services |
Document chunking
- Fixed-size character chunks (default 500 characters) with configurable overlap (default 50)
- Sentence boundary detection to avoid splitting mid-sentence
- Each chunk retains metadata linking it back to the source document
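The chunking strategy above can be sketched in a few lines. This is a minimal illustration, not the project's actual implementation: the function and field names (`chunk_document`, `doc_id`, `start`) are placeholders, and the sentence-boundary heuristic here simply backtracks to the last `". "` inside the window.

```python
def chunk_document(text, doc_id, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap,
    preferring to end each chunk at a sentence boundary."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last sentence boundary inside the window
            boundary = text.rfind(". ", start, end)
            if boundary > start:
                end = boundary + 1  # keep the period
        # Each chunk carries metadata linking it back to its source
        chunks.append({"doc_id": doc_id, "start": start, "text": text[start:end]})
        if end >= len(text):
            break
        start = max(end - overlap, start + 1)  # step back to create overlap
    return chunks
```

Stepping back by `overlap` characters means consecutive chunks share a margin of text, so a sentence that straddles a window edge still appears intact in at least one chunk.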
TF-IDF retrieval
- Scikit-learn TfidfVectorizer with sublinear TF scaling and English stop words
- Unigram and bigram features up to 5,000 terms
- Cosine similarity between query vector and all chunk vectors
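The TF-IDF configuration described above maps directly onto scikit-learn. A small sketch with a toy three-chunk corpus (the chunk texts and query are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "Residential zoning permits secondary suites in most districts.",
    "Transit fares are reduced for seniors and low-income riders.",
    "Water restrictions apply during declared drought stages.",
]

# Mirror the settings listed above: sublinear TF scaling, English
# stop words, unigrams + bigrams, vocabulary capped at 5,000 terms.
vectorizer = TfidfVectorizer(
    sublinear_tf=True,
    stop_words="english",
    ngram_range=(1, 2),
    max_features=5000,
)
chunk_matrix = vectorizer.fit_transform(chunks)

query = "Are secondary suites allowed in residential zones?"
query_vec = vectorizer.transform([query])

# Cosine similarity between the query vector and every chunk vector
scores = cosine_similarity(query_vec, chunk_matrix).ravel()
ranked = scores.argsort()[::-1]  # chunk indices, best match first
```

Because the query is transformed with the same fitted vectorizer, query and chunks live in one vector space and cosine similarity gives a comparable relevance score per chunk.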
BM25 retrieval
- Okapi BM25 with k1=1.5 and b=0.75 parameters
- Token-level matching with term frequency saturation
- Length normalization relative to average document length
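To make the saturation and length-normalization behavior concrete, here is a compact re-implementation of Okapi BM25 scoring with the same k1=1.5, b=0.75 defaults. The project itself uses the rank_bm25 package; this sketch just spells out the formula that package applies.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document against a query."""
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    # Document frequency of each term
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        dl = len(doc)
        s = 0.0
        for q in query_tokens:
            if tf[q] == 0:
                continue
            idf = math.log((N - df[q] + 0.5) / (df[q] + 0.5) + 1)
            # Saturating term frequency, normalized by document
            # length relative to the corpus average
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * dl / avgdl))
        scores.append(s)
    return scores
```

The `tf / (tf + k1·…)` shape is what caps the benefit of repeating a term, while the `b·dl/avgdl` factor discounts matches in longer-than-average chunks.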
Re-ranking
- Term overlap scoring as a lightweight cross-encoder alternative
- Combines unigram overlap, bigram overlap bonus, and passage length penalty
- Weighted combination: 60% retrieval score + 40% re-ranking score
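One way the re-ranking described above could look in code. The 60/40 weighting follows the description; the exact overlap, bonus, and penalty formulas here are illustrative assumptions, not the project's precise scoring function.

```python
def rerank_score(query, passage, retrieval_score):
    """Blend a retrieval score with a lightweight term-overlap score."""
    q_tokens = query.lower().split()
    p_tokens = passage.lower().split()

    # Unigram overlap: fraction of distinct query terms in the passage
    overlap = len(set(q_tokens) & set(p_tokens)) / max(len(set(q_tokens)), 1)

    # Bigram bonus: shared adjacent word pairs reward phrase matches
    q_bi = set(zip(q_tokens, q_tokens[1:]))
    p_bi = set(zip(p_tokens, p_tokens[1:]))
    bigram_bonus = 0.5 * len(q_bi & p_bi) / max(len(q_bi), 1)

    # Mild penalty for very long passages (capped)
    length_penalty = min(len(p_tokens) / 200, 0.2)

    rerank = overlap + bigram_bonus - length_penalty
    # Weighted combination: 60% retrieval score + 40% re-ranking score
    return 0.6 * retrieval_score + 0.4 * rerank
```

Unlike a neural cross-encoder, this runs in microseconds per candidate, which is why it can be applied to every retrieved passage without a GPU.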
Evaluation
- 30 hand-crafted questions with ground truth relevant document IDs
- Metrics: Precision@k, Recall@k, MRR (mean reciprocal rank)
- Parameter sensitivity analysis across chunk sizes and k values
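The two headline metrics are standard and easy to state precisely. A minimal sketch, assuming each evaluation question carries a ranked list of retrieved document IDs and a set of ground-truth relevant IDs:

```python
def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 if no relevant document is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k
```

For example, an MRR of 0.82 roughly means the first relevant passage typically lands at rank 1 with occasional drops to rank 2 or 3.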
Built as part of the Calgary Data Portfolio.