A framework that teaches QA models when to answer, when to ask back, and when to offer alternatives — entirely without ambiguity labels.
Somyajit Chakraborty¹ · Sayak Naskar² · Soham Paul² · Angshuman Jana² · Nilotpal Chakraborty² · Avijit Gayen²·³
¹ University College Cork | ² IIIT Guwahati | ³ Techno India University
Standard QA systems commit to one answer even when a question is genuinely ambiguous. CenterDistill fixes this by learning the geometry of question meaning — no ambiguity labels required. It clusters question embeddings into semantic centers, distils those soft distributions into an XLM-RoBERTa student, and at inference time routes each query to one of three behaviours: Answer, Clarify, or Return Alternatives.
On 1 000 English–Spanish MLQA examples: 90.1% behaviour accuracy, 92.6% center assignment accuracy, 77.3 QA-F1 — outperforming all baselines while adding zero annotation cost.
- The Problem
- Key Contributions
- How It Works
- Pipeline Overview
- Results
- Quick Start
- Installation
- Usage
- Dataset
- Hyperparameters
- Reproducibility
- Project Structure
- Industry Applications
- Limitations & Future Work
- Paper Alignment Notes
- Citation
- Contributing
Every time a QA system answers a question like "What are the side effects?" or "When does school start?", it is silently choosing one interpretation and discarding all others. According to Pew Research (2023) and Statista (2024), LLM-based assistant usage surged from 28 % to 68 % between 2022 and 2024, while traditional forums (which surface multiple answers naturally) declined from 72 % to 28 %. As users rely more heavily on single-system responses, undetected ambiguity causes real downstream harm:
| Domain | Ambiguous Query | Risk of Single Confident Answer |
|---|---|---|
| 🏥 Medical | "What are the side effects?" | Wrong drug assumed; patient harm |
| 🛒 E-Commerce | "When will my order arrive?" | Policy vs. shipment vs. estimate confusion |
| 🎓 Education | "When does school start?" | Daily time vs. semester start conflated |
| ⚖️ Legal | "What is the statute of limitations?" | Jurisdiction silently assumed |
Min et al. (2020) showed that standard QA models fail to detect ambiguity in 64 % of naturally ambiguous questions, returning a high-confidence single answer. Existing fixes (AmbigQA, ASQA) require expensive interpretation annotations and handle ambiguity at evaluation time only — not during training.
- Label-free semantic center induction — Clusters LaBSE question embeddings via spectral clustering to discover K semantic regions with zero manual annotation.
- Soft distillation as ambiguity signal — Computes cosine-similarity teacher distributions over cluster centroids; feeds them to the student as KL-divergence supervision, richer than hard one-hot labels.
- Three-way inference-time behaviour policy — The predicted center distribution determines whether to Answer directly, Clarify (request elaboration via mT5), or Return Alternatives (surface multiple valid answers).
- Automatic hyperparameter derivation — All six critical thresholds (K, τ, λ, τ_conf, τ_ent, τ_multi) are computed programmatically from training statistics before any gradient step — no grid search on held-out labels.
- Cross-lingual transfer — A single model trained on English data transfers the behaviour capability to Spanish and German test sets without additional tuning.
- New evaluation protocol — Behaviour Accuracy + Worst-Cluster F1 (RoMQA) as primary metrics alongside standard EM/F1, enabling deployment-oriented assessment.
Question + Context
     │
XLM-RoBERTa-large (shared encoder: deepset/xlm-roberta-large-squad2)
     │
┌────┴──────────────────┐
│                       │
span_head               center_head
Linear(hidden → 2)      Linear(hidden → K=5)
start / end logits      centre distribution P_S(c | q, d)
│                       │
Extractive answer       Behaviour policy
        ┌───────────────┼───────────────────┐
  max(P_S)>0.44    H(P_S)>1.51       two peaks>0.24
        │               │                   │
     ANSWER          CLARIFY          ALTERNATIVES
        │               │                   │
    Span text   mT5 clarification    All valid spans
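The routing step at the bottom of the diagram can be sketched as a small pure-Python function. The thresholds are the values shown above. The order in which the three rules are checked and the fallback to CLARIFY when no rule fires are assumptions made for illustration, and `route_behaviour` is a hypothetical name rather than a function from the repository.

```python
import math

# Thresholds as shown in the diagram above.
TAU_CONF = 0.44   # minimum top-1 mass for ANSWER
TAU_ENT = 1.51    # minimum entropy (nats) for CLARIFY
TAU_MULTI = 0.24  # minimum mass of the second peak for ALTERNATIVES


def entropy(dist):
    """Shannon entropy in nats, skipping zero-mass entries."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def route_behaviour(dist):
    """Map a center distribution P_S to one of the three behaviours.

    Assumption: rules are checked in the order ANSWER, CLARIFY, ALTERNATIVES,
    and CLARIFY is the conservative fallback when none of them fires.
    """
    ranked = sorted(dist, reverse=True)
    if ranked[0] > TAU_CONF:
        return "ANSWER"
    if entropy(dist) > TAU_ENT:
        return "CLARIFY"
    if ranked[1] > TAU_MULTI:  # the top peak is then above the threshold too
        return "ALTERNATIVES"
    return "CLARIFY"


print(route_behaviour([0.70, 0.10, 0.10, 0.05, 0.05]))  # one dominant center
print(route_behaviour([0.20, 0.20, 0.20, 0.20, 0.20]))  # near-uniform mass
print(route_behaviour([0.40, 0.30, 0.10, 0.10, 0.10]))  # two competing centers
```

With K = 5 the maximum possible entropy is ln 5 ≈ 1.61 nats, so under this reading the CLARIFY rule only fires on distributions that are close to uniform.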
The student jointly minimises extractive span loss and KL divergence from the teacher's soft center distribution:
ℒ = λ · KL( P_T(c|q) ‖ P_S(c|q,d) ) + (1−λ) · ℒ_span(start*, end*)
    └─────────────────────────────┘   └──────────────────────────┘
      ambiguity-awareness signal         standard extractive QA
              (λ = 0.70)                      (1−λ = 0.30)
The teacher distribution P_T is computed offline using LaBSE embeddings and spectral clustering — no annotation, no fine-tuning of the teacher. All six hyperparameters are derived automatically from training-set statistics.
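To make the objective concrete, here is a minimal pure-Python sketch of the joint loss for a single example. It assumes the span loss is the mean of the start and end cross-entropies (a common convention for extractive QA heads, not confirmed by the source), and the function names are illustrative only. A real training loop would compute the same quantities on batched tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_teacher, p_student, eps=1e-12):
    """KL(P_T || P_S) in nats; zero-mass teacher entries contribute nothing."""
    return sum(
        pt * math.log(pt / max(ps, eps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )


def span_loss(start_logits, end_logits, start_idx, end_idx):
    """Assumed form: mean cross-entropy of the gold start and end positions."""
    start_ce = -math.log(softmax(start_logits)[start_idx])
    end_ce = -math.log(softmax(end_logits)[end_idx])
    return 0.5 * (start_ce + end_ce)


def joint_loss(p_teacher, center_logits, start_logits, end_logits,
               start_idx, end_idx, lam=0.70):
    """lam * KL(P_T || P_S) + (1 - lam) * span loss."""
    p_student = softmax(center_logits)
    kl = kl_divergence(p_teacher, p_student)
    return lam * kl + (1.0 - lam) * span_loss(
        start_logits, end_logits, start_idx, end_idx
    )


# Student matches the teacher exactly, so only the span term remains.
teacher = [0.5, 0.2, 0.1, 0.1, 0.1]
center_logits = [math.log(p) for p in teacher]
loss = joint_loss(teacher, center_logits, [0.0] * 4, [0.0] * 4, 1, 2)
print(round(loss, 4))
```

In this toy case the KL term vanishes and the uniform span logits over four tokens give a cross-entropy of ln 4, so the loss reduces to 0.30 · ln 4.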
Stage 1 ─ Embed : LaBSE(q) → ê (768-dim, L2-normalised)
Stage 2 ─ Cluster : SpectralClustering(cosine affinity, K=5, seed=42) → {C_k}
Stage 3 ─ Teacher : P_T(c_k|q) = softmax(τ · µ̃_k⊤ ê_q) [τ=10.0]
Figure: End-to-end CenterDistill pipeline — from raw MLQA inputs through center induction, distillation training, to the three-way inference-time behaviour policy.
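Stage 3 above can be sketched in a few lines once cluster assignments are available. The example below uses toy 2-D vectors in place of 768-D LaBSE embeddings and hand-written labels in place of the spectral clustering output. It assumes µ̃_k is the L2-normalised mean of the normalised cluster members and that every cluster is non-empty; the function names are hypothetical and are not the repository's `compute_teacher_distributions()`.

```python
import math


def l2_normalise(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cluster_centroids(embeddings, labels, k):
    """Mean of the normalised members of each cluster, re-normalised to unit length."""
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(k)]
    for emb, label in zip(embeddings, labels):
        for i, x in enumerate(l2_normalise(emb)):
            sums[label][i] += x
    return [l2_normalise(s) for s in sums]


def teacher_distribution(embedding, centroids, tau=10.0):
    """P_T(c_k | q) = softmax over k of tau * cos(centroid_k, embedding)."""
    e = l2_normalise(embedding)
    scores = [tau * sum(c * x for c, x in zip(centroid, e)) for centroid in centroids]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


# Toy data: two obvious clusters along the two axes.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]  # stand-in for the clustering output
centroids = cluster_centroids(embeddings, labels, k=2)

clear = teacher_distribution([1.0, 0.0], centroids)       # sits inside cluster 0
borderline = teacher_distribution([1.0, 1.0], centroids)  # equidistant from both
print(clear, borderline)
```

A question that sits inside one cluster gets a sharply peaked distribution, while one that lies between centroids gets a flat one. That flatness is the signal the student is distilled on.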
The five stages map directly to the codebase modules:
| Stage | Module | Key Function |
|---|---|---|
| Data loading | centerdistill/data.py | load_en_en / es / de, make_tokenise_fn |
| Center induction | centerdistill/cluster.py | encode_questions(), induce_centers() |
| Teacher distributions | centerdistill/cluster.py | compute_teacher_distributions() |
| Training | centerdistill/model.py | CenterDistillModel, CenterDistillTrainer |
| Inference & evaluation | centerdistill/evaluate.py | evaluate_behaviour(), bootstrap_ci() |
| Method | Beh. Acc. ↑ | WC-F1 ↑ | QA-F1 ↑ | Params |
|---|---|---|---|---|
| MLQA Baseline (Lewis et al., 2019) | — | — | 74.0 | ≈ 340M |
| AmbigQA (Min et al., 2020) | — | — | 71.3 | ≈ 340M |
| Majority-Class (always Clarify) | 75.9 % | — | — | ≈ 560M |
| Standard XLM-R | 8.6 % | 0.3 | 77.3 | ≈ 560M |
| Confidence-based threshold | 81.4 % | 7.8 | — | ≈ 560M |
| Multi-task distillation | 71.4 % | 6.5 | — | ≈ 560M |
| CenterDistill (Ours) | 90.1 % | 8.8 | 77.3 | ≈ 560M |
95 % bootstrap CI for CenterDistill behaviour accuracy: [88.2 %, 91.8 %] (10 K resamples).
The non-overlapping CI vs. the confidence-based baseline ([78.2 %, 84.4 %]) confirms the improvement is statistically significant at α = 0.05.
WC-F1 = Worst-Cluster F1 (per-cluster behaviour accuracy × 10) — a lower bound on robustness across semantic groups.
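The confidence interval quoted above comes from bootstrap resampling. The following is a minimal sketch of a percentile bootstrap over per-example correctness flags, not the repository's `bootstrap_ci()`, whose signature is not shown here. It defaults to 2,000 resamples to keep the pure-Python example fast, whereas the reported interval uses 10 K.

```python
import random


def bootstrap_accuracy_ci(correct_flags, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of a list of 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct_flags)
    # Resample the examples with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(correct_flags, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 901 correct routing decisions out of 1,000, matching the reported 90.1 %.
flags = [1] * 901 + [0] * 99
low, high = bootstrap_accuracy_ci(flags)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Exact bounds depend on the random seed and the number of resamples, so small differences from the published interval are expected.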
| Pair | N | Baseline F1 | CD F1 | Beh. Acc. | WC-F1 |
|---|---|---|---|---|---|
| en → es | 1 000 | 77.3 | 75.2 | 90.1 % | 8.8 |
| en → de | 500 | 75.6 | 74.6 | 91.0 % | 8.7 |
The −2.05 % F1 on en–es reflects a deliberate trade-off: EM/F1 apply only to direct-answer cases; 75.9 % of examples are correctly routed to Clarify/Alternatives and excluded from the span metric. On en–de this gap narrows to −1.03 %.
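The effect of restricting the span metric to direct-answer cases can be illustrated with a simplified token-overlap F1. This sketch assumes lowercase whitespace tokenisation, which is simpler than the official MLQA normalisation, and the example records and field names are invented for illustration.

```python
from collections import Counter


def token_f1(prediction, gold):
    """SQuAD-style token-overlap F1 with lowercase whitespace tokenisation."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_only_f1(examples):
    """Average F1 over examples routed to ANSWER; other behaviours are skipped."""
    scores = [
        token_f1(ex["prediction"], ex["gold"])
        for ex in examples
        if ex["behaviour"] == "ANSWER"
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical records: only the first two contribute to the span metric.
examples = [
    {"behaviour": "ANSWER", "prediction": "$4.2B", "gold": "$4.2B"},
    {"behaviour": "ANSWER", "prediction": "the cat sat", "gold": "cat sat down"},
    {"behaviour": "CLARIFY", "prediction": "", "gold": "nausea"},
]
print(answer_only_f1(examples))
```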
| | Pred: Answer | Pred: Clarify | Pred: Alternatives |
|---|---|---|---|
| Gold: Answer | 73 | 12 | 1 |
| Gold: Clarify | 17 | 708 | 34 |
| Gold: Alternatives | 0 | 35 | 120 |
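The headline numbers can be recomputed directly from this confusion matrix. The short script below is an illustrative check, with rows as gold behaviours and columns as predictions. It recovers the behaviour accuracy, the majority-class baseline, and per-class recall.

```python
# Rows are gold behaviours, columns are predicted behaviours (values from the table above).
LABELS = ["Answer", "Clarify", "Alternatives"]
CONFUSION = [
    [73, 12, 1],
    [17, 708, 34],
    [0, 35, 120],
]

total = sum(sum(row) for row in CONFUSION)
correct = sum(CONFUSION[i][i] for i in range(len(LABELS)))
accuracy = correct / total

# Always predicting the most frequent gold class gives the majority-class baseline.
majority = max(sum(row) for row in CONFUSION) / total

# Recall per gold class: diagonal entry divided by the row total.
recall = {label: CONFUSION[i][i] / sum(CONFUSION[i]) for i, label in enumerate(LABELS)}

print(f"behaviour accuracy: {accuracy:.3f}")
print(f"majority baseline:  {majority:.3f}")
for label, value in recall.items():
    print(f"recall[{label}] = {value:.3f}")
```

The diagonal sums to 901 of 1,000 examples, which matches the reported 90.1 % behaviour accuracy, and the Clarify row total of 759 matches the 75.9 % majority-class baseline.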
| Semantic Center | Size | Purity | Silhouette | Model Acc. |
|---|---|---|---|---|
| Center 1 | 120 | 90.53 % | 0.03 | 98.39 % |
| Center 2 | 116 | 87.76 % | 0.03 | 97.44 % |
| Center 3 | 65 | 89.22 % | 0.03 | 98.36 % |
| Center 4 | 70 | 88.99 % | 0.03 | 95.65 % |
| Center 5 | 129 | 89.33 % | 0.03 | 98.45 % |
| Overall | 500 | 89.17 % | 0.04 | 97.66 % (micro) |
Although silhouette scores are modest (0.03–0.04 — expected for natural language questions), prediction accuracy remains consistently high across all centers, confirming that spectral clustering on LaBSE embeddings yields semantically coherent regions despite their proximity in embedding space.
| Question | Gold Behaviour | Standard XLM-R | Multi-task | CenterDistill |
|---|---|---|---|---|
| "How long does the treatment last?" (chemo: 3–6 months vs. single session: 4–6 hours) | Alternatives | 3–6 months (overconfident) | Unanswerable | ✅ Alternatives: "3–6 months" / "4–6 hours" |
| "What did the company report?" (single salient referent: $4.2B revenue) | Answer | $4.2B (correct) | $4.2B (correct) | ✅ $4.2B (no spurious clarification) |
| "What are the side effects?" (two medications in context) | Clarify | Nausea (anchors to first) | Unanswerable | ✅ Clarify: "Which medication are you referring to?" |
Click the badge at the top. Run all 28 cells top-to-bottom. The full pipeline — MLQA download, hyperparameter derivation, training, evaluation, and all publication figures — completes in approximately 45 minutes on a free T4 GPU.
# 1. Clone and install
git clone https://github.com/hacky1997/centerdistill.git
cd centerdistill
pip install -e . # or: pip install -r requirements.txt
# 2. Download MLQA
curl -L -o MLQA_V1.zip https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip
unzip -q MLQA_V1.zip && rm MLQA_V1.zip
# 3. Full pipeline: embed → cluster → train → evaluate → save all figures
python scripts/run_pipeline.py \
    --mlqa_root MLQA_V1 \
    --output_dir outputs/centerdistill_seed42 \
    --seed 42

Expected runtime: ~3–4 hours on a single A100 / ~6–8 hours on a V100.
| Dependency | Version |
|---|---|
| Python | ≥ 3.9 |
| PyTorch | 2.10.0+cu128 |
| HuggingFace Transformers | 4.37.2 |
| Accelerate | 0.26.1 |
| sentence-transformers | (for LaBSE) |
| scikit-learn | (spectral clustering) |
# Recommended: create a dedicated environment
conda create -n centerdistill python=3.10 -y
conda activate centerdistill
# Install pinned dependencies
pip install -r requirements.txt
# Or install as an editable package
pip install -e .

from centerdistill import CenterDistillModel
from centerdistill.config import BASE_CFG, derive_hyperparameters
print("CenterDistill installed successfully ✓")

# Table 3 + Table 4 + confusion matrix + all figures
python scripts/evaluate_only.py \
--output_dir outputs/centerdistill_seed42 \
--mlqa_root MLQA_V1
# Table 3: baseline comparison
python scripts/baselines.py \
--output_dir outputs/centerdistill_seed42 \
--mlqa_root MLQA_V1
# Table 4: K ablation sweep (K ∈ {3,4,5,6,7})
python scripts/ablation.py \
--output_dir outputs/centerdistill_seed42 \
--mlqa_root MLQA_V1
# LaTeX source for Tables 2–5 + confusion matrix
python scripts/generate_latex.py \
--output_dir outputs/centerdistill_seed42
# Data leakage verification (should report 0 overlaps)
python scripts/check_leakage.py --mlqa_root MLQA_V1

from centerdistill import CenterDistillModel
from centerdistill.config import BASE_CFG
model = CenterDistillModel.from_pretrained("outputs/centerdistill_seed42")
question = "What are the side effects?"
context = "Medication A causes nausea. Medication B causes headaches and dizziness."
result = model.predict(question, context)
# result = {
# "behaviour": "CLARIFY",
# "clarification": "Which medication are you referring to?",
# "center_distribution": [0.12, 0.09, 0.38, 0.31, 0.10],
# "entropy": 1.57
# }
print(result)

from centerdistill.config import BASE_CFG, derive_hyperparameters
cfg = derive_hyperparameters(BASE_CFG, cluster_pool_questions)
# Returns: K=5, tau=10.0, lambda=0.7, tau_conf=0.44, tau_ent=1.51, tau_multi=0.24
print(cfg)

CenterDistill uses MLQA (Lewis et al., 2019), a publicly available extractive QA benchmark in SQuAD format covering 7 languages.
| Split | N | Role |
|---|---|---|
| MLQA en-en (train) | 918 | Model training |
| MLQA en-en (val) | 230 | Hyperparameter selection (λ) |
| MLQA en-en (test) | 11 590 | Baseline QA eval |
| MLQA en-es (val) | 500 | — |
| MLQA en-es (test) | 5 253 → first 1 000 used | Primary behaviour eval |
| MLQA en-de (val) | 512 | — |
| MLQA en-de (test) | 4 517 → first 500 used | Secondary behaviour eval |
| Cluster pool | 500 | Center induction (en-en train subset) |
Download:
curl -L -o MLQA_V1.zip https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip
unzip -q MLQA_V1.zip && rm MLQA_V1.zip

ℹ️ Data Leakage: check_leakage.py confirms zero train/test ID overlap, zero context overlap across pool and test sets, and a single documented question-text overlap that changes no reported metric.
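The kind of overlap check described here can be sketched with plain set intersections over SQuAD-format dictionaries. This is an illustration of the idea rather than the contents of `check_leakage.py`; the helper names and the tiny in-memory splits are made up for the example.

```python
def collect(squad_dict):
    """Return the sets of question IDs, question texts, and contexts in a SQuAD-format dict."""
    ids, questions, contexts = set(), set(), set()
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            contexts.add(paragraph["context"].strip())
            for qa in paragraph["qas"]:
                ids.add(qa["id"])
                questions.add(qa["question"].strip().lower())
    return ids, questions, contexts


def overlap_report(split_a, split_b):
    """Count shared IDs, question texts, and contexts between two splits."""
    a_ids, a_qs, a_ctx = collect(split_a)
    b_ids, b_qs, b_ctx = collect(split_b)
    return {
        "id_overlap": len(a_ids & b_ids),
        "question_overlap": len(a_qs & b_qs),
        "context_overlap": len(a_ctx & b_ctx),
    }


# Tiny in-memory stand-ins for two MLQA split files.
train = {"data": [{"paragraphs": [{
    "context": "Paris is in France.",
    "qas": [{"id": "a1", "question": "Where is Paris?"}],
}]}]}
test = {"data": [{"paragraphs": [{
    "context": "Berlin is in Germany.",
    "qas": [{"id": "b1", "question": "Where is Berlin?"}],
}]}]}
print(overlap_report(train, test))
```

In practice the two dictionaries would be loaded with `json.load` from the corresponding MLQA files.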
All six parameters are derived automatically in config.derive_hyperparameters() before any model training. No manual tuning or validation-label grid search is performed.
| Parameter | Value | Derivation |
|---|---|---|
| K (semantic centres) | 5 | Silhouette sweep K ∈ {2…8} with cosine affinity; K = 5 maximises score among semantically meaningful partitions (K ≥ 4) |
| τ (temperature) | 10.0 | Maximises discrimination score: std(max P_T) × above-chance mass × (1 − concentration penalty) |
| λ (KL weight) | 0.7 | Information-theoretic ideal ≈ 0.91; capped at 0.7 to protect extractive QA quality (span F1 stays within 1.5 % of baseline) |
| τ_conf | 0.44 | 75th percentile of max(P_T) on training pool; guarantees ≥ 20 % routed to ANSWER |
| τ_ent | 1.51 | Median entropy of training examples not routed to ANSWER |
| τ_multi | 0.24 | 60th percentile of second-highest P_T mass among non-answer training examples |
Note on K = 6/7 ablation numbers: At K ≥ 6, mean teacher entropy drops below 0.8 nats, making soft labels near-deterministic. The reported 98.5 % and 99.1 % behaviour accuracy reflect label sharpness rather than better semantic structure — silhouette scores confirm the underlying geometry does not improve beyond K = 5.
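The three routing thresholds in the table are simple order statistics of the teacher distributions. The sketch below follows the derivations listed there under a few assumptions: percentiles use linear interpolation, entropy is measured in nats, and "not routed to ANSWER" means max(P_T) ≤ τ_conf. `derive_thresholds` is a hypothetical helper, not the repository's `derive_hyperparameters()`.

```python
import math
import statistics


def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def entropy(dist):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)


def derive_thresholds(teacher_dists):
    """Derive tau_conf, tau_ent, and tau_multi from teacher distributions P_T."""
    # 75th percentile of the top-1 teacher mass.
    tau_conf = percentile([max(d) for d in teacher_dists], 75)
    non_answer = [d for d in teacher_dists if max(d) <= tau_conf]
    # Median entropy of the examples that are not routed to ANSWER.
    tau_ent = statistics.median(entropy(d) for d in non_answer)
    # 60th percentile of the second-highest mass among those same examples.
    tau_multi = percentile([sorted(d, reverse=True)[1] for d in non_answer], 60)
    return {"tau_conf": tau_conf, "tau_ent": tau_ent, "tau_multi": tau_multi}


# Toy teacher distributions over K = 3 centers.
toy = [
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.40, 0.40, 0.20],
    [0.34, 0.33, 0.33],
]
thresholds = derive_thresholds(toy)
print(thresholds)
```

Because every value is a quantile of training-pool statistics, re-deriving thresholds for a new domain needs only unlabeled in-domain questions.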
All experiments use seed = 42 locked across Python random, NumPy, PyTorch, and HuggingFace set_seed.
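A seed-locking helper along these lines is a common pattern. The version below is a sketch that seeds only the libraries installed in the current environment; `seed_everything` is an illustrative name, not a function from the repository.

```python
import random


def seed_everything(seed=42):
    """Seed every RNG that is available in the current environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed
    try:
        from transformers import set_seed
        set_seed(seed)
    except ImportError:
        pass  # Transformers not installed


seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)
```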
| Reproducibility Check | Status |
|---|---|
| Train / test ID overlap | ✅ 0 |
| Cluster pool / test ID overlap | ✅ 0 |
| Context text overlap (pool vs. any test set) | ✅ 0 |
| Question text overlap (train vs. test) | ✅ 0 |
| Spectral clustering stability (n_init = 5) | ✅ Identical assignments across all initialisations |
| Single Colab T4 run reproduces all Table 2–5 numbers exactly | ✅ |
# Verify all of the above on your machine
python scripts/check_leakage.py --mlqa_root MLQA_V1

backbone: deepset/xlm-roberta-large-squad2
epochs: 4
batch_size: 8 (grad_accum=4 → effective=32)
optimizer: AdamW (lr=3e-5, weight_decay=0.01)
scheduler: cosine with 10% warmup
precision: FP16 mixed
gradient_ckpt: true
seed: 42

centerdistill/
│
├── centerdistill/ # Importable Python package
│ ├── __init__.py
│ ├── config.py # BASE_CFG + derive_hyperparameters()
│ ├── data.py # MLQA loaders, tokenisation, QA evaluation
│ ├── cluster.py # LaBSE encoding, spectral clustering, teacher distributions
│ ├── model.py # CenterDistillModel + CenterDistillTrainer
│ ├── evaluate.py # evaluate_behaviour(), bootstrap_ci(), error_analysis()
│ └── visualize.py # All publication figures (300–900 dpi)
│
├── scripts/
│ ├── run_pipeline.py # End-to-end: download → embed → train → eval → save
│ ├── evaluate_only.py # Reproduce all tables from a saved model
│ ├── baselines.py # Table 3 — all baseline comparisons
│ ├── ablation.py # Table 4 — K ablation sweep
│ ├── generate_latex.py # LaTeX source for Tables 2–5 + confusion matrix
│ └── check_leakage.py # Data leakage verification
│
├── notebooks/
│ └── CenterDistill_Colab.ipynb # Original Colab notebook (all 28 cells)
│
├── results/ # Saved JSON artefacts (populated at runtime)
├── figures/ # Generated figures (populated at runtime)
├── requirements.txt # Pinned dependencies
├── setup.py # Pip-installable package
└── LICENSE # Apache 2.0
CenterDistill's three-way behaviour policy addresses a gap in virtually every NLP deployment that surfaces a single answer to users:
| Domain | CenterDistill Behaviour | Value |
|---|---|---|
| Virtual assistants / chatbots | Routes ambiguous queries to Clarify before hallucinating | Reduces hallucination-driven churn |
| Enterprise search & knowledge bases | Surfaces Alternatives when a query spans multiple departments | Prevents silent mis-routing |
| Medical / legal QA | High-stakes Clarify routing when interpretation is uncertain | Reduces liability from overconfident answers |
| Customer support automation | Distinguishes product-specific vs. policy questions | Improves first-contact resolution |
| Multilingual helpdesks | Cross-lingual transfer without per-language retraining | Reduces localisation cost |
- Latency overhead: The center head is a single linear layer on the CLS token — negligible inference overhead over the base XLM-RoBERTa model.
- Threshold calibration: All thresholds are derived from training statistics. For new domains, re-run derive_hyperparameters() on a representative in-domain pool (~500 questions is sufficient).
- No architectural changes required: CenterDistill is a drop-in wrapper around any HuggingFace extractive QA model via patch_to_hf_qa_model().
- Clarification generation: mT5-based generation is optional; the behaviour policy operates independently if generation is not deployed.
Current Limitations:
- Surface-level answer variation can create spurious centers, contributing to false clarification errors (~13 % of misclassifications).
- Evaluation is limited to two high-resource language pairs (en–es, en–de); extension to low-resource or morphologically rich languages remains open.
- Fixed thresholds (τ_conf, τ_ent, τ_multi) are the primary source of errors — 97 % of misclassifications occur within margin 0.02 of a threshold boundary.
- Behaviour labels are derived from teacher-induced distributions, not independent human annotations; external validation is a prerequisite for high-stakes deployment.
- The model uses a larger backbone (~560M params) than published MLQA baselines (~340M); efficiency–behaviour trade-offs are not yet characterised.
Planned / Future Work:
- Dynamic threshold calibration (e.g., temperature scaling, Platt calibration)
- Extension to low-resource languages (Hindi, Swahili, Arabic)
- Human evaluation of behaviour decisions as an external benchmark
- Systematic multi-objective ablation of λ
- Lightweight distillation to sub-100M student models for on-device deployment
- Integration with retrieval-augmented systems (RAG + CenterDistill)
If you use CenterDistill in your research, please cite:
@inproceedings{chakraborty2026centerdistill,
title = {CenterDistill: Weakly-Supervised Distillation for
Ambiguity-Aware Cross-Lingual Question Answering},
author = {Chakraborty, Somyajit and Naskar, Sayak and Paul, Soham
and Jana, Angshuman and Chakraborty, Nilotpal and Gayen, Avijit},
booktitle = {Proceedings of EAAAI 2026},
year = {2026}
}

Contributions are welcome! Please follow these steps:
- Fork the repository and create your branch from main.
- Install development dependencies: pip install -e ".[dev]"
- Test your changes: pytest tests/ -v
- Ensure check_leakage.py still passes with zero overlaps.
- Open a Pull Request with a clear description.
For major changes or new language-pair experiments, please open an issue first to discuss scope.
Found a bug? Open an issue with the label bug and include your environment details (python --version, pip list | grep torch).
| Author | Affiliation | Email |
|---|---|---|
| Somyajit Chakraborty | University College Cork | 123100668@umail.ucc.ie |
| Sayak Naskar | IIIT Guwahati | sayak.naskar25m@iiitg.ac.in |
| Soham Paul | IIIT Guwahati | soham.paul25m@iiitg.ac.in |
| Avijit Gayen | IIIT Guwahati / Techno India University | avijit.gayen@iiitg.ac.in |
For questions about the codebase, open a GitHub Issue. For questions about the paper, email the corresponding authors.
Distributed under the Apache 2.0 License. See LICENSE for details.
⭐ If you find this work useful, please consider starring the repository — it helps other researchers discover it.
