Jérémie Dentan1, Alexi Canesse1, Davide Buscaldi1, 2, Aymen Shabou3, Sonia Vanier1
1LIX (École Polytechnique, IP Paris, CNRS), 2LIPN (Université Sorbonne Paris Nord), 3Crédit Agricole SA
This repository provides source code for the generation and utilization of the MUCH dataset, the first claim-level Uncertainty Quantification benchmark designed for fair and reproducible evaluation of future methods under realistic conditions.
The dataset contains 4,873 samples from the MUCH benchmark, designed for evaluating claim-level Uncertainty Quantification methods. It includes a training set that was annotated automatically (4,673 samples) and a test set that was both automatically and manually annotated by human experts (200 samples).
Alongside this repository, we provide the following resources:
- Our research paper introducing MUCH and describing its generation in detail: arXiv:2511.17081
- The dataset, available on HuggingFace:
  - Main dataset: orailix/MUCH
  - Generation configs: orailix/MUCH-configs
  - Baseline evaluation data: orailix/MUCH-signals
- A PyPI package implementing our claim segmentation algorithm: much-segmenter
This repository serves three main purposes:
- 🛠️ Goal A (reproducible research): provide the source code needed to reproduce the generation of MUCH.
- 📊 Goal B (reproducible research): provide the source code needed to reproduce the evaluation of five baselines on MUCH, as described in our research paper.
- 🚀 Goal C (encourage research): provide researchers with a framework to easily develop, evaluate, and compare new white-box claim-level UQ methods on MUCH.
These three goals are represented by three usage paths described below. They correspond to the three folders in the scripts/ directory.
This repository contains two primary components: a Python module implementing the experiments, and BASH/SLURM scripts to run them in the correct order.
Python environment: the code is expected to run with Python 3.12, using the requirements provided in requirements.txt.
Python module in the src directory:
- The module contains seven submodules, each responsible for a different part of the experiments:
  - utils: various utilities, including path management and the definition of a command-line typer app used by the other submodules.
  - questions: generate and store the questions used for MUCH generation.
  - generation: generate LLM answers for each question.
  - annotation: annotate the LLM answers, either with GPT or manual annotations.
  - signals: implement token-level signals for each baseline evaluated on MUCH.
  - evaluation: compare the signals and the ground-truth annotations of MUCH for evaluation.
  - hf_setup: HuggingFace API, to export or import data from the Hub.
- These modules implement various commands that can be used from the terminal. The list of commands can be obtained via
python -m src --help
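For reference, here is a minimal sketch of how a command can be registered on a typer app in this style. The actual app lives in src/utils; the command and names below are illustrative assumptions, not the repository's real code.

```python
# Hypothetical sketch: registering a command on a typer app, in the style of
# the CLI used by this repository. Names below are assumptions.
import typer

app = typer.Typer()

@app.command()
def say_hello(name: str = "MUCH") -> None:
    """Toy command illustrating the command-line pattern."""
    typer.echo(f"Hello, {name}!")

if __name__ == "__main__":
    app()
```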
BASH/SLURM code in scripts:
- These scripts are organized in folders A/B/C, corresponding to the three user paths.
- Scripts containing SLURM syntax are expected to run on an HPC cluster. They implement GPU-intensive computations such as LLM generation for MUCH and the computation of the CCP and SAR signals (which require an NLI model).
- For each script, you should adapt (1) the project root, via the ROOT variable, and (2) the Python environment, by editing the conda activate ... line.
The scripts needed for this user path are available in scripts/A_generate_much.
A1_compute_questions:
This BASH script downloads and processes the questions from the MU-Shroom dataset:
python -m src get-mushroom-questions
A2_generation:
This SLURM script generates the LLM answers along with the 24 logits for each generated token. It is intended to be deployed on an HPC cluster.
python -m src generate
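For intuition, the sketch below shows one common way to keep the top-24 logits of every generated token with HuggingFace transformers. It is illustrative only and does not reproduce the repository's generation pipeline; the model name is a placeholder.

```python
# Illustrative sketch (not the repository's pipeline): keep the top-24 logits
# of every generated token using HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
)

# out.scores contains one (batch, vocab) logits tensor per generated token.
top24 = [torch.topk(step[0], k=24) for step in out.scores]
print(len(top24), "generated tokens,", top24[0].values.shape, "logits kept per token")
```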
A3_segmentation:
This BASH script relies on much-segmenter to segment all claims of the LLM-generated responses.
python -m src segment-all-generations
At this stage, you have generated the questions, the LLM responses, the 24 logits for each LLM-generated token, and the claims produced from segmenting the answers.
A4_wiki_cache:
To add labels to each claim, you must first download the Wikipedia page associated with each question:
python -m src download-all-wiki-references
We separated the Wikipedia cache from the annotation because parallelism during annotation nullifies the benefits of caching.
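The sketch below illustrates the idea of a simple on-disk cache for the reference pages. The cache location and layout are assumptions, not the repository's actual implementation.

```python
# Minimal on-disk cache sketch (assumed layout): download each reference page
# once, then reuse the local copy during annotation.
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("output/wiki_cache")  # hypothetical location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_cached(wiki_url: str) -> str:
    """Return the page content, downloading it only on the first call."""
    key = hashlib.sha256(wiki_url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if not path.exists():
        with urllib.request.urlopen(wiki_url) as resp:
            path.write_bytes(resp.read())
    return path.read_text(errors="replace")

page = fetch_cached("https://en.wikipedia.org/wiki/Paris")
print(len(page), "characters cached")
```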
A5_gpt_scripts:
We then use GPT-4o and GPT-4.1 to automatically annotate the claims. Note that this command consumed around USD 50 of OpenAI credits during execution. This cost may vary, so you should monitor your credit consumption and set limits.
python -m src add-all-gpt-labels
You can also monitor the cost of annotation on a smaller batch of samples using commands like:
python -m src add-some-gpt-labels --lang es --size 100
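As a rough illustration of what a single annotation call looks like, here is a minimal sketch using the OpenAI Python SDK. The prompt, label vocabulary, and output handling are assumptions and differ from the repository's actual annotation protocol.

```python
# Illustrative only: labeling one claim with the OpenAI SDK. The prompt and
# label set are assumptions, not the repository's actual annotation scheme.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_claim(claim: str, wiki_excerpt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with SUPPORTED or NOT_SUPPORTED only."},
            {"role": "user", "content": f"Claim: {claim}\n\nReference: {wiki_excerpt}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(label_claim("Paris is the capital of France.", "Paris is the capital and largest city of France."))
```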
Human annotations (excluded from scripts):
To add human annotations, use the following command:
python -m src add-human-annotations --lang=fr --size=50 --annotator=an0
Here, an0 represents the annotator's name, which you may change. This will sample 50 responses to annotate in French. The sampling is deterministic, so all annotators receive the same samples. Samples already annotated by an0 are skipped. You can also annotate a specific sample using the option --annot-id=<...>, which overrides lang and size.
A6_export:
This script exports all samples for which GPT-4o and GPT-4.1 agree on all claims.
python -m src export-much
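Conceptually, the agreement rule can be pictured as in the sketch below; the field names and label strings are assumptions, not the dataset's actual schema.

```python
# Sketch of the agreement filter: keep a sample only if GPT-4o and GPT-4.1
# assign the same label to every claim. Field names are assumptions.
def annotators_agree(labels_4o: list[str], labels_41: list[str]) -> bool:
    return len(labels_4o) == len(labels_41) and all(
        a == b for a, b in zip(labels_4o, labels_41)
    )

samples = [
    {"id": "s1", "gpt4o": ["OK", "HALLU"], "gpt41": ["OK", "HALLU"]},
    {"id": "s2", "gpt4o": ["OK", "HALLU"], "gpt41": ["OK", "OK"]},
]
kept = [s for s in samples if annotators_agree(s["gpt4o"], s["gpt41"])]
print([s["id"] for s in kept])  # ['s1']
```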
B1_import_much:
We evaluated five baselines on the MUCH dataset: CCP, SAR, Maximum Likelihood (Max-L), Token Likelihood (T-L), and Token Entropy (T-E). First, you should import the MUCH benchmark from the Hugging Face Hub. We advise you to clean your output repository first.
python -m src import-much
B2_locals:
The lighter baselines can be computed without heavy GPU resources. This BASH script provides the commands to compute them locally.
B3_remote_ccp and B4_remote_sar:
The CCP and SAR baselines require heavy GPU resources, which is why we advise computing them on an HPC cluster. These SLURM scripts provide the commands to deploy their computation. The computation time is logged on the standard output of the job.
Evaluation of the baselines:
Finally, you should evaluate the baselines. Please refer to User Path C below for this part (only steps 4 and 5 are needed if you are not implementing new baselines).
Congratulations, you made an excellent choice in using MUCH to implement and evaluate a new claim-level UQ method 😉
This dataset is designed to evaluate claim-level uncertainty quantification (UQ).
Each sample includes precomputed logits for every generated token, allowing new methods to estimate the factuality of each segmented claim. Your UQ estimator can then be evaluated by comparing its predictions with the ground-truth values provided in the labels field.
The benchmark is intended to reflect realistic production constraints. Consequently, UQ methods should be fast and efficient. They should not rely on external knowledge sources, as such resources are typically unavailable or impractical in real-world deployments.
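To make this concrete, here is a toy sketch of one common way to turn token-level uncertainty scores into claim-level scores. The aggregation (a simple mean) and the span format are illustrative choices, not requirements of the benchmark.

```python
# Toy aggregation sketch: average token-level scores over the tokens of each
# segmented claim. The mean and the span format are illustrative assumptions.
import numpy as np

token_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.3])  # one score per generated token
claim_spans = [(0, 2), (2, 4), (4, 5)]               # token index ranges per claim

claim_scores = [float(token_scores[start:end].mean()) for start, end in claim_spans]
print(claim_scores)  # ~[0.15, 0.85, 0.3] -- higher = more likely hallucinated
```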
Here are our suggestions to facilitate this process.
1- Implementation of your method (a minimal sketch is given after this list):
- Your method should correspond to a new script like src/signals/sig_your_awesome_method.py.
- This script should implement a subclass of src.signals.UQSignal, for example class SignalYourAwesomeMethod(UQSignal).
- This class should at least implement the following methods: signal_name (to specify where to store the token-level signals) and compute_signal_value (the core of your method, where you compute the token-level signal).
- You can check the implementation of more complex signals, such as the CCP baseline in src/signals/sig_ccp.py.
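Here is the minimal sketch announced above. The base class and the two method names come from this README, but their exact signatures and the sample format are assumptions; align them with src/signals/sig_ccp.py before use.

```python
# Sketch of src/signals/sig_your_awesome_method.py. Signatures and field names
# are assumptions; check src/signals/sig_ccp.py for the real interface.
import numpy as np

from src.signals import UQSignal  # provided by this repository


class SignalYourAwesomeMethod(UQSignal):
    def signal_name(self) -> str:
        # Specifies where the token-level signals are stored.
        return "your_awesome_method"

    def compute_signal_value(self, sample) -> np.ndarray:
        # Core of the method: one uncertainty score per generated token,
        # here (as an example) 1 minus the max probability over the stored logits.
        logits = np.asarray(sample["logits"], dtype=np.float64)  # assumed field
        logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        return 1.0 - probs.max(axis=-1)
```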
2- Compare with other baselines: We have already implemented several baselines and computed their token-level signals on MUCH. You can directly download them with the C1_import_baselines.sh script, without re-computing them.
3- Evaluate GPU time: As discussed in our research paper (see above), it's crucial to monitor the runtime of your baseline to ensure it remains realistic for real-world scenarios. If your method is CPU-only, you can directly compare it to the GPU cost of MUCH generation, which is 2,758s (see details in our research paper). If your method uses a GPU, you must evaluate how long MUCH generation would take on your GPU. Not all GPUs are the same, so this may be more or less than 2,758s depending on your machine.
To evaluate this computational cost, you can run C2_estimate_time.sh. It's intended to be deployed on an HPC cluster. This script evaluates computation time on a single GPU. Consequently, your baseline should also be evaluated on a single GPU. However, you likely do not need to worry about this because this is the default behavior of C3_evaluate_signals.sh (see below).
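A back-of-the-envelope comparison then looks like the following; the numbers are made up and only illustrate the ratio you should report.

```python
# Toy comparison with made-up numbers: relate your method's single-GPU runtime
# to the MUCH generation time measured on the same GPU (2,758s on ours).
generation_time_on_your_gpu = 3100.0  # seconds, e.g. measured with C2_estimate_time.sh
your_method_time = 450.0              # seconds, measured on the same single GPU

overhead = your_method_time / generation_time_on_your_gpu
print(f"UQ overhead: {overhead:.1%} of generation time")  # ~14.5%
```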
4- Evaluate your method:
Each method must be evaluated for all four languages (en, fr, de, es), and some require parameter sweeps. You can check in the following script how we evaluate the baselines we implemented, and add your own baseline there.
bash scripts/C3_evaluate_signals.sh
Note that this script evaluates baselines in a single-GPU setting, which is the correct configuration for comparison with MUCH generation. Below is a minimal example showing how to evaluate one signal on one language:
python -m src evaluate-signal max_likelihood --lang en
5- Explore the results:
Finally, you can explore the results by adapting the notebook figures/02_baselines.ipynb. We encourage focusing on low-FPR (typically less than 20%) and high-precision (typically higher than 90%) regions (see discussion at the end of our research paper).
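For instance, a low-FPR analysis can be sketched as follows with scikit-learn; the data here are random and only illustrate how to restrict the ROC curve to FPR below 20%.

```python
# Toy low-FPR analysis with random data: restrict the ROC curve to FPR < 20%.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                    # 1 = hallucinated claim
scores = labels * 0.5 + rng.normal(0.5, 0.3, size=500)   # toy UQ scores

fpr, tpr, _ = roc_curve(labels, scores)
mask = fpr < 0.20
partial_auc = np.trapz(tpr[mask], fpr[mask])             # unnormalized partial AUC
print(f"Partial AUC (FPR < 20%): {partial_auc:.3f}")
```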
🎉 Congrats! You implemented and evaluated your new UQ method! We hope it achieves great results to advance our understanding of LLM UQ 😃
To reproduce the figures of the paper, you can use 01_annotation_quality.ipynb and 02_baselines.ipynb. Note that the first notebook should be used with the full MUCH data, including the "trash" split, which contains the samples that were removed from MUCH due to insufficient quality.
Below, we report the computation times we measured.
- TOTAL for the 6448 samples before filtering: 4540s (01:15:40)
- TOTAL for the 4873 samples after filtering: 2758s (00:45:58) (Normalization: 4540 * (4873/6448) * (4.26/5.30) = 2758; see the quick check after this list)
- TOTAL for 6448 values = 8s
- TOTAL for 4873 values = 6s
- Token Likelihood (CPU - Apple M4 Pro) : 8.2s
- Max Likelihood (CPU - Apple M4 Pro) : 8.2s
- Token Entropy (CPU - Apple M4 Pro) : 9.0s
- CCP (GPU - Nvidia A100 + Intel Xeon 6248)
- CCP-10-3: 4047s
- CCP-10-5: 3230s
- CCP-10-8: 3410s
- CCP-24-3: 3508s
- CCP-24-5: 4268s
- CCP-24-8: 5429s
- SAR (GPU - Nvidia A100 + Intel Xeon 6248)
- SAR-3: 419s
- SAR-5: 510s
- SAR-8: 613s
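As announced above, here is a quick arithmetic check of the normalization used for the post-filtering generation time (the 4.26/5.30 factor is the one quoted in the formula).

```python
# Quick check of the normalization: 4540 * (4873/6448) * (4.26/5.30) ~ 2758.
normalized = 4540 * (4873 / 6448) * (4.26 / 5.30)
print(round(normalized))  # 2758
```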
The prompt, wiki_url, and lang fields of the MUCH samples are extracted from Mu-SHROOM [1], a dataset released under the CC-BY-4.0 license.
This work received financial support from the research chair Trustworthy and Responsible AI at École Polytechnique.
This work was granted access to the HPC resources of IDRIS under the allocation AD011014843R1, made by GENCI.
[1] Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovický, Jussi Karlgren, Shaoxiong Ji, Jindřich Helcl, Liane Guillou, Ona de Gibert, Jaione Bengoetxea, Joseph Attieh, Marianna Apidianaki. SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes. arXiv preprint, 2025. https://arxiv.org/abs/2504.11975
Copyright 2025–present Laboratoire d’Informatique de l’École Polytechnique.
This repository is released under the Apache-2.0 license.
Please cite this dataset as follows:
@misc{dentan_much_2025,
title = {MUCH: A Multilingual Claim Hallucination Benchmark},
author = {Dentan, Jérémie and Canesse, Alexi and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
year = {2025},
url = {https://arxiv.org/abs/2511.17081},
}