🎯 CenterDistill

Weakly-Supervised Distillation for Ambiguity-Aware Cross-Lingual Question Answering

A framework that teaches QA models when to answer, when to ask back, and when to offer alternatives — entirely without ambiguity labels.

Somyajit Chakraborty¹ · Sayak Naskar² · Soham Paul² · Angshuman Jana² · Nilotpal Chakraborty² · Avijit Gayen²·³

¹ University College Cork | ² IIIT Guwahati | ³ Techno India University

📌 TL;DR

Standard QA systems commit to one answer even when a question is genuinely ambiguous. CenterDistill fixes this by learning the geometry of question meaning — no ambiguity labels required. It clusters question embeddings into semantic centers, distils those soft distributions into an XLM-RoBERTa student, and at inference time routes each query to one of three behaviours: Answer, Clarify, or Return Alternatives.

On 1 000 English–Spanish MLQA examples: 90.1% behaviour accuracy, 92.6% center assignment accuracy, 77.3 QA-F1 — outperforming all baselines while adding zero annotation cost.

📖 Table of Contents

The Problem
Key Contributions
How It Works
Pipeline Overview
Results
Quick Start
Installation
Usage
Dataset
Hyperparameters
Reproducibility
Project Structure
Industry Applications
Limitations & Future Work
Paper Alignment Notes
Citation
Contributing

🔥 The Problem

Every time a QA system answers a question like "What are the side effects?" or "When does school start?", it is silently choosing one interpretation and discarding all others. According to Pew Research (2023) and Statista (2024), LLM-based assistant usage surged from 28 % → 68 % between 2022 and 2024 while traditional forums (which surface multiple answers naturally) declined from 72 % to 28 %. As users rely more heavily on single-system responses, undetected ambiguity causes real downstream harm:

Domain	Ambiguous Query	Risk of Single Confident Answer
🏥 Medical	"What are the side effects?"	Wrong drug assumed; patient harm
🛒 E-Commerce	"When will my order arrive?"	Policy vs. shipment vs. estimate confusion
🎓 Education	"When does school start?"	Daily time vs. semester start conflated
⚖️ Legal	"What is the statute of limitations?"	Jurisdiction silently assumed

Min et al. (2020) showed that standard QA models fail to detect ambiguity in 64 % of naturally ambiguous questions, returning a high-confidence single answer. Existing fixes (AmbigQA, ASQA) require expensive interpretation annotations and handle ambiguity at evaluation time only — not during training.

✨ Key Contributions

Label-free semantic center induction — Clusters LaBSE question embeddings via spectral clustering to discover K semantic regions with zero manual annotation.
Soft distillation as ambiguity signal — Computes cosine-similarity teacher distributions over cluster centroids; feeds them to the student as KL-divergence supervision, richer than hard one-hot labels.
Three-way inference-time behaviour policy — The predicted center distribution determines whether to Answer directly, Clarify (request elaboration via mT5), or Return Alternatives (surface multiple valid answers).
Automatic hyperparameter derivation — All six critical thresholds (K, τ, λ, τ_conf, τ_ent, τ_multi) are computed programmatically from training statistics before any gradient step — no grid search on held-out labels.
Cross-lingual transfer — A single model trained on English data transfers the behaviour capability to Spanish and German test sets without additional tuning.
New evaluation protocol — Behaviour Accuracy + Worst-Cluster F1 (RoMQA) as primary metrics alongside standard EM/F1, enabling deployment-oriented assessment.

⚙️ How It Works

Conceptual Overview

Question + Context
        │
   XLM-RoBERTa-large         (shared encoder — deepset/xlm-roberta-large-squad2)
        │
   ┌────┴──────────────────────────┐
   │                               │
span_head                     center_head
Linear(hidden → 2)            Linear(hidden → K=5)
start / end logits            centre distribution  P_S(c | q, d)
   │                               │
Extractive answer             Behaviour policy
                         ┌─────────┼─────────────────┐
                    max(PS)>0.44  H(PS)>1.51    two peaks>0.24
                         │            │                │
                      ANSWER      CLARIFY        ALTERNATIVES
                         │            │                │
                    Span text   mT5 clarification  All valid spans

Training Objective

The student jointly minimises extractive span loss and KL divergence from the teacher's soft center distribution:

ℒ = λ · KL( P_T(c|q) ‖ P_S(c|q,d) )  +  (1−λ) · ℒ_span(start*, end*)
    └──────────────────────────────┘    └─────────────────────────────┘
         ambiguity-awareness signal          standard extractive QA
              (λ = 0.70)                         (1−λ = 0.30)

The teacher distribution P_T is computed offline using LaBSE embeddings and spectral clustering — no annotation, no fine-tuning of the teacher. All six hyperparameters are derived automatically from training-set statistics.

Three-Stage Center Induction

Stage 1 ─ Embed   : LaBSE(q) → ê  (768-dim, L2-normalised)
Stage 2 ─ Cluster : SpectralClustering(cosine affinity, K=5, seed=42) → {C_k}
Stage 3 ─ Teacher : P_T(c_k|q) = softmax(τ · µ̃_k⊤ ê_q)   [τ=10.0]

🔄 Pipeline Overview

Figure: End-to-end CenterDistill pipeline — from raw MLQA inputs through center induction, distillation training, to the three-way inference-time behaviour policy.

The five stages map directly to the codebase modules:

Stage	Module	Key Function
Data loading	`centerdistill/data.py`	`load_en_en / es / de`, `make_tokenise_fn`
Center induction	`centerdistill/cluster.py`	`encode_questions()`, `induce_centers()`
Teacher distributions	`centerdistill/cluster.py`	`compute_teacher_distributions()`
Training	`centerdistill/model.py`	`CenterDistillModel`, `CenterDistillTrainer`
Inference & evaluation	`centerdistill/evaluate.py`	`evaluate_behaviour()`, `bootstrap_ci()`

📊 Results

Behaviour Selection — English→Spanish (N = 1 000, seed = 42)

Method	Beh. Acc. ↑	WC-F1 ↑	QA-F1 ↑	Params
MLQA Baseline (Lewis et al., 2019)	—	—	74.0	≈ 340M
AmbigQA (Min et al., 2020)	—	—	71.3	≈ 340M
Majority-Class (always Clarify)	75.9 %	—	—	≈ 560M
Standard XLM-R	8.6 %	0.3	77.3	≈ 560M
Confidence-based threshold	81.4 %	7.8	—	≈ 560M
Multi-task distillation	71.4 %	6.5	—	≈ 560M
CenterDistill (Ours)	90.1 %	8.8	77.3	≈ 560M

95 % bootstrap CI for CenterDistill behaviour accuracy: [88.2 %, 91.8 %] (10 K resamples).

The non-overlapping CI vs. the confidence-based baseline ([78.2 %, 84.4 %]) confirms the improvement is statistically significant at α = 0.05.

WC-F1 = Worst-Cluster F1 (per-cluster behaviour accuracy × 10) — a lower bound on robustness across semantic groups.

Cross-Lingual Transfer

Pair	N	Baseline F1	CD F1	Beh. Acc.	WC-F1
en → es	1 000	77.3	75.2	90.1 %	8.8
en → de	500	75.6	74.6	91.0 %	8.7

The −2.05 % F1 on en–es reflects a deliberate trade-off: EM/F1 apply only to direct-answer cases; 75.9 % of examples are correctly routed to Clarify/Alternatives and excluded from the span metric. On en–de this gap narrows to −1.03 %.

Confusion Matrix — en→es (gold rows × predicted columns)

	Pred: Answer	Pred: Clarify	Pred: Alternatives
Gold: Answer	73	12	1
Gold: Clarify	17	708	34
Gold: Alternatives	0	35	120

Cluster Quality — 500-question en-en pool

Semantic Center	Size	Purity	Silhouette	Model Acc.
Center 1	120	90.53 %	0.03	98.39 %
Center 2	116	87.76 %	0.03	97.44 %
Center 3	65	89.22 %	0.03	98.36 %
Center 4	70	88.99 %	0.03	95.65 %
Center 5	129	89.33 %	0.03	98.45 %
Overall	500	89.17 %	0.04	97.66 % (micro)

Although silhouette scores are modest (0.03–0.04 — expected for natural language questions), prediction accuracy remains consistently high across all centers, confirming that spectral clustering on LaBSE embeddings yields semantically coherent regions despite their proximity in embedding space.

Qualitative Case Studies

Question	Gold Behaviour	Standard XLM-R	Multi-task	CenterDistill
"How long does the treatment last?" (chemo: 3–6 months vs. single session: 4–6 hours)	Alternatives	3–6 months (overconfident)	Unanswerable	✅ Alternatives: "3–6 months" / "4–6 hours"
"What did the company report?" (single salient referent: $4.2B revenue)	Answer	$4.2B (correct)	$4.2B (correct)	✅ $4.2B (no spurious clarification)
"What are the side effects?" (two medications in context)	Clarify	Nausea (anchors to first)	Unanswerable	✅ Clarify: "Which medication are you referring to?"

🚀 Quick Start

Option A — Google Colab (recommended, zero setup)

Click the badge at the top. Run all 28 cells top-to-bottom. The full pipeline — MLQA download, hyperparameter derivation, training, evaluation, and all publication figures — completes in approximately 45 minutes on a free T4 GPU.

Option B — Local / Slurm

# 1. Clone and install
git clone https://github.com/hacky1997/centerdistill.git
cd centerdistill
pip install -e .                   # or: pip install -r requirements.txt

# 2. Download MLQA
curl -L -o MLQA_V1.zip https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip
unzip -q MLQA_V1.zip && rm MLQA_V1.zip

# 3. Full pipeline: embed → cluster → train → evaluate → save all figures
python scripts/run_pipeline.py \
    --mlqa_root  MLQA_V1 \
    --output_dir outputs/centerdistill_seed42 \
    --seed       42

Expected runtime: ~3–4 hours on a single A100 / ~6–8 hours on a V100.

🛠️ Installation

Requirements

Dependency	Version
Python	≥ 3.9
PyTorch	2.10.0+cu128
HuggingFace Transformers	4.37.2
Accelerate	0.26.1
sentence-transformers	(for LaBSE)
scikit-learn	(spectral clustering)

# Recommended: create a dedicated environment
conda create -n centerdistill python=3.10 -y
conda activate centerdistill

# Install pinned dependencies
pip install -r requirements.txt

# Or install as an editable package
pip install -e .

Verify Installation

from centerdistill import CenterDistillModel
from centerdistill.config import BASE_CFG, derive_hyperparameters
print("CenterDistill installed successfully ✓")

💻 Usage

Reproduce Individual Tables and Figures from a Saved Model

# Table 3 + Table 4 + confusion matrix + all figures
python scripts/evaluate_only.py \
    --output_dir outputs/centerdistill_seed42 \
    --mlqa_root  MLQA_V1

# Table 3: baseline comparison
python scripts/baselines.py \
    --output_dir outputs/centerdistill_seed42 \
    --mlqa_root  MLQA_V1

# Table 4: K ablation sweep (K ∈ {3,4,5,6,7})
python scripts/ablation.py \
    --output_dir outputs/centerdistill_seed42 \
    --mlqa_root  MLQA_V1

# LaTeX source for Tables 2–5 + confusion matrix
python scripts/generate_latex.py \
    --output_dir outputs/centerdistill_seed42

# Data leakage verification (should report 0 overlaps)
python scripts/check_leakage.py --mlqa_root MLQA_V1

Run Inference on a Custom Example

from centerdistill import CenterDistillModel
from centerdistill.config import BASE_CFG

model = CenterDistillModel.from_pretrained("outputs/centerdistill_seed42")

question = "What are the side effects?"
context  = "Medication A causes nausea. Medication B causes headaches and dizziness."

result = model.predict(question, context)
# result = {
#   "behaviour": "CLARIFY",
#   "clarification": "Which medication are you referring to?",
#   "center_distribution": [0.12, 0.09, 0.38, 0.31, 0.10],
#   "entropy": 1.57
# }
print(result)

Derive Hyperparameters Automatically

from centerdistill.config import BASE_CFG, derive_hyperparameters

cfg = derive_hyperparameters(BASE_CFG, cluster_pool_questions)
# Returns: K=5, tau=10.0, lambda=0.7, tau_conf=0.44, tau_ent=1.51, tau_multi=0.24
print(cfg)

📁 Dataset

CenterDistill uses MLQA (Lewis et al., 2019) — a publicly available extractive QA benchmark in SQuAD format covering 7 languages.

Split	N	Role
MLQA en-en (train)	918	Model training
MLQA en-en (val)	230	Hyperparameter selection (λ)
MLQA en-en (test)	11 590	Baseline QA eval
MLQA en-es (val)	500	—
MLQA en-es (test)	5 253 → first 1 000 used	Primary behaviour eval
MLQA en-de (val)	512	—
MLQA en-de (test)	4 517 → first 500 used	Secondary behaviour eval
Cluster pool	500	Center induction (en-en train subset)

Download:

curl -L -o MLQA_V1.zip https://dl.fbaipublicfiles.com/MLQA/MLQA_V1.zip
unzip -q MLQA_V1.zip && rm MLQA_V1.zip

ℹ️ Data Leakage: check_leakage.py confirms zero train/test ID overlap, zero context overlap across pool and test sets, and a single documented question-text overlap that changes no reported metric.

🔧 Hyperparameters

All six parameters are derived automatically in config.derive_hyperparameters() before any model training. No manual tuning or validation-label grid search is performed.

Parameter	Value	Derivation
K (semantic centres)	5	Silhouette sweep K ∈ {2…8} with cosine affinity; K = 5 maximises score among semantically meaningful partitions (K ≥ 4)
τ (temperature)	10.0	Maximises discrimination score: std(max P_T) × above-chance mass × (1 − concentration penalty)
λ (KL weight)	0.7	Information-theoretic ideal ≈ 0.91; capped at 0.7 to protect extractive QA quality (span F1 stays within 1.5 % of baseline)
τ_conf	0.44	75th percentile of max(P_T) on training pool; guarantees ≥ 20 % routed to ANSWER
τ_ent	1.51	Median entropy of training examples not routed to ANSWER
τ_multi	0.24	60th percentile of second-highest P_T mass among non-answer training examples

Note on K = 6/7 ablation numbers: At K ≥ 6, mean teacher entropy drops below 0.8 nats, making soft labels near-deterministic. The reported 98.5 % and 99.1 % behaviour accuracy reflect label sharpness rather than better semantic structure — silhouette scores confirm the underlying geometry does not improve beyond K = 5.

🔬 Reproducibility

All experiments use seed = 42 locked across Python random, NumPy, PyTorch, and HuggingFace set_seed.

Reproducibility Check	Status
Train / test ID overlap	✅ 0
Cluster pool / test ID overlap	✅ 0
Context text overlap (pool vs. any test set)	✅ 0
Question text overlap (train vs. test)	✅ 0
Spectral clustering stability (n_init = 5)	✅ Identical assignments across all initialisations
Single Colab T4 run reproduces all Table 2–5 numbers exactly	✅

# Verify all of the above on your machine
python scripts/check_leakage.py --mlqa_root MLQA_V1

Training Configuration

backbone:        deepset/xlm-roberta-large-squad2
epochs:          4
batch_size:      8  (grad_accum=4 → effective=32)
optimizer:       AdamW  (lr=3e-5, weight_decay=0.01)
scheduler:       cosine with 10% warmup
precision:       FP16 mixed
gradient_ckpt:   true
seed:            42

📂 Project Structure

centerdistill/
│
├── centerdistill/                  # Importable Python package
│   ├── __init__.py
│   ├── config.py                   # BASE_CFG + derive_hyperparameters()
│   ├── data.py                     # MLQA loaders, tokenisation, QA evaluation
│   ├── cluster.py                  # LaBSE encoding, spectral clustering, teacher distributions
│   ├── model.py                    # CenterDistillModel + CenterDistillTrainer
│   ├── evaluate.py                 # evaluate_behaviour(), bootstrap_ci(), error_analysis()
│   └── visualize.py                # All publication figures (300–900 dpi)
│
├── scripts/
│   ├── run_pipeline.py             # End-to-end: download → embed → train → eval → save
│   ├── evaluate_only.py            # Reproduce all tables from a saved model
│   ├── baselines.py                # Table 3 — all baseline comparisons
│   ├── ablation.py                 # Table 4 — K ablation sweep
│   ├── generate_latex.py           # LaTeX source for Tables 2–5 + confusion matrix
│   └── check_leakage.py            # Data leakage verification
│
├── notebooks/
│   └── CenterDistill_Colab.ipynb   # Original Colab notebook (all 28 cells)
│
├── results/                        # Saved JSON artefacts (populated at runtime)
├── figures/                        # Generated figures (populated at runtime)
├── requirements.txt                # Pinned dependencies
├── setup.py                        # Pip-installable package
└── LICENSE                         # Apache 2.0

🏭 Industry Applications

CenterDistill's three-way behaviour policy addresses a gap in virtually every NLP deployment that surfaces a single answer to users:

Domain	CenterDistill Behaviour	Value
Virtual assistants / chatbots	Routes ambiguous queries to Clarify before hallucinating	Reduces hallucination-driven churn
Enterprise search & knowledge bases	Surfaces Alternatives when a query spans multiple departments	Prevents silent mis-routing
Medical / legal QA	High-stakes Clarify routing when interpretation is uncertain	Reduces liability from overconfident answers
Customer support automation	Distinguishes product-specific vs. policy questions	Improves first-contact resolution
Multilingual helpdesks	Cross-lingual transfer without per-language retraining	Reduces localisation cost

Deployment Considerations

Latency overhead: The center head is a single linear layer on the CLS token — negligible inference overhead over the base XLM-RoBERTa model.
Threshold calibration: All thresholds are derived from training statistics. For new domains, re-run derive_hyperparameters() on a representative in-domain pool (~500 questions is sufficient).
No architectural changes required: CenterDistill is a drop-in wrapper around any HuggingFace extractive QA model via patch_to_hf_qa_model().
Clarification generation: mT5-based generation is optional; the behaviour policy operates independently if generation is not deployed.

🔭 Limitations & Future Work

Current Limitations:

Surface-level answer variation can create spurious centers, contributing to false clarification errors (~13 % of misclassifications).
Evaluation is limited to two high-resource language pairs (en–es, en–de); extension to low-resource or morphologically rich languages remains open.
Fixed thresholds (τ_conf, τ_ent, τ_multi) are the primary source of errors — 97 % of misclassifications occur within margin 0.02 of a threshold boundary.
Behaviour labels are derived from teacher-induced distributions, not independent human annotations; external validation is a prerequisite for high-stakes deployment.
The model uses a larger backbone (~560M params) than published MLQA baselines (~340M); efficiency–behaviour trade-offs are not yet characterised.

Planned / Future Work:

Dynamic threshold calibration (e.g., temperature scaling, Platt calibration)
Extension to low-resource languages (Hindi, Swahili, Arabic)
Human evaluation of behaviour decisions as an external benchmark
Systematic multi-objective ablation of λ
Lightweight distillation to sub-100M student models for on-device deployment
Integration with retrieval-augmented systems (RAG + CenterDistill)

📚 Citation

If you use CenterDistill in your research, please cite:

@inproceedings{chakraborty2026centerdistill,
  title     = {CenterDistill: Weakly-Supervised Distillation for
               Ambiguity-Aware Cross-Lingual Question Answering},
  author    = {Chakraborty, Somyajit and Naskar, Sayak and Paul, Soham
               and Jana, Angshuman and Chakraborty, Nilotpal and Gayen, Avijit},
  booktitle = {Proceedings of EAAAI 2026},
  year      = {2026}
}

🤝 Contributing

Contributions are welcome! Please follow these steps:

Fork the repository and create your branch from main.
Install development dependencies: pip install -e ".[dev]"
Test your changes: pytest tests/ -v
Ensure check_leakage.py still passes with zero overlaps.
Open a Pull Request with a clear description.

For major changes or new language-pair experiments, please open an issue first to discuss scope.

Found a bug? Open an issue with the label bug and include your environment details (python --version, pip list | grep torch).

📬 Contact

Author	Affiliation	Email
Somyajit Chakraborty	University College Cork	123100668@umail.ucc.ie
Sayak Naskar	IIIT Guwahati	sayak.naskar25m@iiitg.ac.in
Soham Paul	IIIT Guwahati	soham.paul25m@iiitg.ac.in
Avijit Gayen	IIIT Guwahati / Techno India University	avijit.gayen@iiitg.ac.in

For questions about the codebase, open a GitHub Issue. For questions about the paper, email the corresponding authors.

Distributed under the Apache 2.0 License. See LICENSE for details.

⭐ If you find this work useful, please consider starring the repository — it helps other researchers discover it.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
centerdistill		centerdistill
figures		figures
notebooks		notebooks
results		results
scripts		scripts
.gitignore		.gitignore
CenterDistill Workflow Pipeline.png		CenterDistill Workflow Pipeline.png
Final.png		Final.png
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

🎯 CenterDistill

Weakly-Supervised Distillation for Ambiguity-Aware Cross-Lingual Question Answering

📌 TL;DR

📖 Table of Contents

🔥 The Problem

✨ Key Contributions

⚙️ How It Works

Conceptual Overview

Training Objective

Three-Stage Center Induction

🔄 Pipeline Overview

📊 Results

Behaviour Selection — English→Spanish (N = 1 000, seed = 42)

Cross-Lingual Transfer

Confusion Matrix — en→es (gold rows × predicted columns)

Cluster Quality — 500-question en-en pool

Qualitative Case Studies

🚀 Quick Start

Option A — Google Colab (recommended, zero setup)

Option B — Local / Slurm

🛠️ Installation

Requirements

Verify Installation

💻 Usage

Reproduce Individual Tables and Figures from a Saved Model

Run Inference on a Custom Example

Derive Hyperparameters Automatically

📁 Dataset

🔧 Hyperparameters

🔬 Reproducibility

Training Configuration

📂 Project Structure

🏭 Industry Applications

Deployment Considerations

🔭 Limitations & Future Work

📚 Citation

🤝 Contributing

📬 Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages