Jérémie Dentan1, Alexi Canesse1, Davide Buscaldi1, 2, Aymen Shabou3, Sonia Vanier1
1LIX (École Polytechnique, IP Paris, CNRS), 2LIPN (Université Sorbonne Paris Nord), 3Crédit Agricole SA
This repository provides source code for the generation and utilization of the MUCH dataset, the first claim-level Uncertainty Quantification benchmark designed for fair and reproducible evaluation of future methods under realistic conditions.
The dataset contains 4,873 samples from the MUCH benchmark, designed for evaluating claim-level Uncertainty Quantification methods. It includes a training set that was annotated automatically (4,673 samples) and a test set that was both automatically and manually annotated by human experts (200 samples).
Alongside this repository, we provide the following resources:
- Our research paper introducing MUCH and describing its generation in detail: arXiv:2511.17081
- The dataset, available on HuggingFace:
  - Main dataset: orailix/MUCH
  - Generation configs: orailix/MUCH-configs
  - Baseline evaluation data: orailix/MUCH-signals
- A PyPI package implementing our claim segmentation algorithm: much-segmenter
This repository serves three main purposes:
- 🛠️ Goal A (reproducible research): provide the source code needed to reproduce the generation of MUCH.
- 📊 Goal B (reproducible research): provide the source code needed to reproduce the evaluation of five baselines on MUCH, as described in our research paper.
- 🚀 Goal C (encourage research): provide researchers with a framework to easily develop, evaluate, and compare new white-box claim-level UQ methods on MUCH.
These three goals are represented by three usage paths described below. They correspond to the three folders in the scripts/ directory.
This repository contains two primary components: a Python module implementing the experiments, and BASH/SLURM scripts to run them in the correct order.
Python environment: the code is expected to run with Python 3.12, using the requirements provided in requirements.txt.
Python module in the src directory:
- The module contains seven submodules, each responsible for a different part of the experiments:
  - utils: various utilities, including path management and the definition of a command-line typer app used by the other submodules.
  - questions: generate and store the questions used for MUCH generation.
  - generation: generate LLM answers for each question.
  - annotation: annotate the LLM answers, either with GPT or manual annotations.
  - signals: implement token-level signals for each baseline evaluated on MUCH.
  - evaluation: compare the signals and the ground-truth annotations of MUCH for evaluation.
  - hf_setup: HuggingFace API, to export or import data from the Hub.
- These modules implement various commands that can be used from the terminal. The list of commands can be obtained via
python -m src --help
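For reference, here is a minimal sketch of how a command can be registered on a typer app in this style. The actual app lives in src/utils; the command and names below are illustrative assumptions, not the repository's real code.

```python
# Hypothetical sketch: registering a command on a typer app, in the style of
# the CLI used by this repository. Names below are assumptions.
import typer

app = typer.Typer()

@app.command()
def say_hello(name: str = "MUCH") -> None:
    """Toy command illustrating the command-line pattern."""
    typer.echo(f"Hello, {name}!")

if __name__ == "__main__":
    app()
```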
BASH/SLURM code in scripts:
- These scripts are organized in folders A/B/C, corresponding to the three user paths.
- Scripts containing SLURM syntax are expected to run on an HPC cluster. They implement GPU-intensive computations such as LLM generation for MUCH and the computation of the CCP and SAR signals (which require an NLI model).
- For each script, you should adapt (1) the project root, via the ROOT variable, and (2) the Python environment, by editing the conda activate ... line.
The scripts needed for this user path are available in scripts/A_generate_much.
A1_compute_questions:
This BASH script downloads and processes the questions from the MU-Shroom dataset:
python -m src get-mushroom-questions
A2_generation:
This SLURM script generates the LLM answers along with the 24 logits for each generated token. It is intended to be deployed on an HPC cluster.
python -m src generate
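For intuition, the sketch below shows one common way to keep the top-24 logits of every generated token with HuggingFace transformers. It is illustrative only and does not reproduce the repository's generation pipeline; the model name is a placeholder.

```python
# Illustrative sketch (not the repository's pipeline): keep the top-24 logits
# of every generated token using HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("What is the capital of France?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
    output_scores=True,
    return_dict_in_generate=True,
)

# out.scores contains one (batch, vocab) logits tensor per generated token.
top24 = [torch.topk(step[0], k=24) for step in out.scores]
print(len(top24), "generated tokens,", top24[0].values.shape, "logits kept per token")
```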
A3_segmentation:
This BASH script relies on much-segmenter to segment all claims of the LLM-generated responses.
python -m src segment-all-generations
At this stage, you have generated the questions, the LLM responses, the 24 logits for each LLM-generated token, and the claims produced from segmenting the answers.
A4_wiki_cache:
To add labels to each claim, you must first download the Wikipedia page associated with each question:
python -m src download-all-wiki-references
We separated the Wikipedia cache from the annotation because parallelism during annotation nullifies the benefits of caching.
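The sketch below illustrates the idea of a simple on-disk cache for the reference pages. The cache location and layout are assumptions, not the repository's actual implementation.

```python
# Minimal on-disk cache sketch (assumed layout): download each reference page
# once, then reuse the local copy during annotation.
import hashlib
import pathlib
import urllib.request

CACHE_DIR = pathlib.Path("output/wiki_cache")  # hypothetical location
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def fetch_cached(wiki_url: str) -> str:
    """Return the page content, downloading it only on the first call."""
    key = hashlib.sha256(wiki_url.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.html"
    if not path.exists():
        with urllib.request.urlopen(wiki_url) as resp:
            path.write_bytes(resp.read())
    return path.read_text(errors="replace")

page = fetch_cached("https://en.wikipedia.org/wiki/Paris")
print(len(page), "characters cached")
```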
A5_gpt_scripts:
We then use GPT-4o and GPT-4.1 to automatically annotate the claims. Note that this command consumed around USD 50 of OpenAI credits during execution. This cost may vary, so you should monitor your credit consumption and set limits.
python -m src add-all-gpt-labels
You can also monitor the cost of annotation on a smaller batch of samples using commands like:
python -m src add-some-gpt-labels --lang es --size 100
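As a rough illustration of what a single annotation call looks like, here is a minimal sketch using the OpenAI Python SDK. The prompt, label vocabulary, and output handling are assumptions and differ from the repository's actual annotation protocol.

```python
# Illustrative only: labeling one claim with the OpenAI SDK. The prompt and
# label set are assumptions, not the repository's actual annotation scheme.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_claim(claim: str, wiki_excerpt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with SUPPORTED or NOT_SUPPORTED only."},
            {"role": "user", "content": f"Claim: {claim}\n\nReference: {wiki_excerpt}"},
        ],
    )
    return response.choices[0].message.content.strip()

print(label_claim("Paris is the capital of France.", "Paris is the capital and largest city of France."))
```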
Human annotations (excluded from scripts):
To add human annotations, use the following command:
python -m src add-human-annotations --lang=fr --size=50 --annotator=an0
Here, an0 represents the annotator's name, which you may change. This will sample 50 responses to annotate in French. The sampling is deterministic, so all annotators receive the same samples. Samples already annotated by an0 are skipped. You can also annotate a specific sample using the option --annot-id=<...>, which overrides lang and size.
A6_export:
This script exports all samples for which GPT-4o and GPT-4.1 agree on all claims.
python -m src export-much
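Conceptually, the agreement rule can be pictured as in the sketch below; the field names and label strings are assumptions, not the dataset's actual schema.

```python
# Sketch of the agreement filter: keep a sample only if GPT-4o and GPT-4.1
# assign the same label to every claim. Field names are assumptions.
def annotators_agree(labels_4o: list[str], labels_41: list[str]) -> bool:
    return len(labels_4o) == len(labels_41) and all(
        a == b for a, b in zip(labels_4o, labels_41)
    )

samples = [
    {"id": "s1", "gpt4o": ["OK", "HALLU"], "gpt41": ["OK", "HALLU"]},
    {"id": "s2", "gpt4o": ["OK", "HALLU"], "gpt41": ["OK", "OK"]},
]
kept = [s for s in samples if annotators_agree(s["gpt4o"], s["gpt41"])]
print([s["id"] for s in kept])  # ['s1']
```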
B1_import_much:
We evaluated five baselines on the MUCH dataset: CCP, SAR, Maximum Likelihood (Max-L), Token Likelihood (T-L), and Token Entropy (T-E). First, you should import the MUCH benchmark from the Hugging Face Hub. We advise you to clean your output repository first.
python -m src import-much
B2_locals:
The lighter baselines can be computed without heavy GPU resources. This BASH script provides the commands to compute them locally.
B3_remote_ccp and B4_remote_sar:
The CCP and SAR baselines require heavy GPU resources, which is why we advise computing them on an HPC cluster. These SLURM scripts provide the commands to deploy their computation. The computation time is logged on the standard output of the job.
Evaluation of the baselines:
Finally, you should evaluate the baselines. Please refer to User Path C below for this part (only steps 4 and 5 are needed if you are not implementing new baselines).
Congratulations, you made an excellent choice in using MUCH to implement and evaluate a new claim-level UQ method 😉
This dataset is designed to evaluate claim-level uncertainty quantification (UQ).
Each sample includes precomputed logits for every generated token, allowing new methods to estimate the factuality of each segmented claim. Your UQ estimator can then be evaluated by comparing its predictions with the ground-truth values provided in the labels field.
The benchmark is intended to reflect realistic production constraints. Consequently, UQ methods should be fast and efficient. They should not rely on external knowledge sources, as such resources are typically unavailable or impractical in real-world deployments.
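To make this concrete, here is a toy sketch of one common way to turn token-level uncertainty scores into claim-level scores. The aggregation (a simple mean) and the span format are illustrative choices, not requirements of the benchmark.

```python
# Toy aggregation sketch: average token-level scores over the tokens of each
# segmented claim. The mean and the span format are illustrative assumptions.
import numpy as np

token_scores = np.array([0.1, 0.2, 0.9, 0.8, 0.3])  # one score per generated token
claim_spans = [(0, 2), (2, 4), (4, 5)]               # token index ranges per claim

claim_scores = [float(token_scores[start:end].mean()) for start, end in claim_spans]
print(claim_scores)  # ~[0.15, 0.85, 0.3] -- higher = more likely hallucinated
```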
Here are our suggestions to facilitate this process.
1- Implementation of your method (a minimal sketch is given after this list):
- Your method should correspond to a new script like src/signals/sig_your_awesome_method.py.
- This script should implement a subclass of src.signals.UQSignal, for example class SignalYourAwesomeMethod(UQSignal).
- This class should at least implement the following methods: signal_name (to specify where to store the token-level signals) and compute_signal_value (the core of your method, where you compute the token-level signal).
- You can check the implementation of more complex signals, such as the CCP baseline in src/signals/sig_ccp.py.
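Here is the minimal sketch announced above. The base class and the two method names come from this README, but their exact signatures and the sample format are assumptions; align them with src/signals/sig_ccp.py before use.

```python
# Sketch of src/signals/sig_your_awesome_method.py. Signatures and field names
# are assumptions; check src/signals/sig_ccp.py for the real interface.
import numpy as np

from src.signals import UQSignal  # provided by this repository


class SignalYourAwesomeMethod(UQSignal):
    def signal_name(self) -> str:
        # Specifies where the token-level signals are stored.
        return "your_awesome_method"

    def compute_signal_value(self, sample) -> np.ndarray:
        # Core of the method: one uncertainty score per generated token,
        # here (as an example) 1 minus the max probability over the stored logits.
        logits = np.asarray(sample["logits"], dtype=np.float64)  # assumed field
        logits = logits - logits.max(axis=-1, keepdims=True)     # numerical stability
        probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
        return 1.0 - probs.max(axis=-1)
```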
2- Compare with other baselines: We have already implemented several baselines and computed their token-level signals on MUCH. You can directly download them with the C1_import_baselines.sh script, without re-computing them.
3- Evaluate GPU time: As discussed in our research paper (see above), it's crucial to monitor the runtime of your baseline to ensure it remains realistic for real-world scenarios. If your method is CPU-only, you can directly compare it to the GPU cost of MUCH generation, which is 2,758s (see details in our research paper). If your method uses a GPU, you must evaluate how long MUCH generation would take on your GPU. Not all GPUs are the same, so this may be more or less than 2,758s depending on your machine.
To evaluate this computational cost, you can run C2_estimate_time.sh. It's intended to be deployed on an HPC cluster. This script evaluates computation time on a single GPU. Consequently, your baseline should also be evaluated on a single GPU. However, you likely do not need to worry about this because this is the default behavior of C3_evaluate_signals.sh (see below).
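A back-of-the-envelope comparison then looks like the following; the numbers are made up and only illustrate the ratio you should report.

```python
# Toy comparison with made-up numbers: relate your method's single-GPU runtime
# to the MUCH generation time measured on the same GPU (2,758s on ours).
generation_time_on_your_gpu = 3100.0  # seconds, e.g. measured with C2_estimate_time.sh
your_method_time = 450.0              # seconds, measured on the same single GPU

overhead = your_method_time / generation_time_on_your_gpu
print(f"UQ overhead: {overhead:.1%} of generation time")  # ~14.5%
```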
4- Evaluate your method:
Each method must be evaluated for all four languages (en, fr, de, es), and some require parameter sweeps. You can check in the following script how we evaluate the baselines we implemented, and add your own baseline there.
bash scripts/C3_evaluate_signals.sh
Note that this script evaluates baselines in a single-GPU setting, which is the correct configuration for comparison with MUCH generation. Below is a minimal example showing how to evaluate one signal on one language:
python -m src evaluate-signal max_likelihood --lang en
5- Explore the results:
Finally, you can explore the results by adapting the notebook figures/02_baselines.ipynb. We encourage focusing on low-FPR (typically less than 20%) and high-precision (typically higher than 90%) regions (see discussion at the end of our research paper).
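For instance, a low-FPR analysis can be sketched as follows with scikit-learn; the data here are random and only illustrate how to restrict the ROC curve to FPR below 20%.

```python
# Toy low-FPR analysis with random data: restrict the ROC curve to FPR < 20%.
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)                    # 1 = hallucinated claim
scores = labels * 0.5 + rng.normal(0.5, 0.3, size=500)   # toy UQ scores

fpr, tpr, _ = roc_curve(labels, scores)
mask = fpr < 0.20
partial_auc = np.trapz(tpr[mask], fpr[mask])             # unnormalized partial AUC
print(f"Partial AUC (FPR < 20%): {partial_auc:.3f}")
```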
🎉 Congrats! You implemented and evaluated your new UQ method! We hope it achieves great results to advance our understanding of LLM UQ 😃
To reproduce the figures of the paper, you can use 01_annotation_quality.ipynb and 02_baselines.ipynb. Note that the first notebook should be used with the full MUCH data, including the "trash" split, which contains the samples that were removed from MUCH due to insufficient quality.
Below, we report the computation times we measured.
- TOTAL for the 6448 samples before filtering: 4540s (01:15:40)
- TOTAL for the 4873 samples after filtering: 2758s (00:45:58) (Normalization: 4540 * (4873/6448) * (4.26/5.30) = 2758; see the quick check after this list)
- TOTAL for 6448 values = 8s
- TOTAL for 4873 values = 6s
- Token Likelihood (CPU - Apple M4 Pro) : 8.2s
- Max Likelihood (CPU - Apple M4 Pro) : 8.2s
- Token Entropy (CPU - Apple M4 Pro) : 9.0s
- CCP (GPU - Nvidia A100 + Intel Xeon 6248)
- CCP-10-3: 4047s
- CCP-10-5: 3230s
- CCP-10-8: 3410s
- CCP-24-3: 3508s
- CCP-24-5: 4268s
- CCP-24-8: 5429s
- SAR (GPU - Nvidia A100 + Intel Xeon 6248)
- SAR-3: 419s
- SAR-5: 510s
- SAR-8: 613s
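As announced above, here is a quick arithmetic check of the normalization used for the post-filtering generation time (the 4.26/5.30 factor is the one quoted in the formula).

```python
# Quick check of the normalization: 4540 * (4873/6448) * (4.26/5.30) ~ 2758.
normalized = 4540 * (4873 / 6448) * (4.26 / 5.30)
print(round(normalized))  # 2758
```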
The prompt, wiki_url, and lang fields of the MUCH samples are extracted from Mu-SHROOM [1], a dataset released under the CC-BY-4.0 license.
This work received financial support from the research chair Trustworthy and Responsible AI at École Polytechnique.
This work was granted access to the HPC resources of IDRIS under the allocation AD011014843R1, made by GENCI.
[1] Raúl Vázquez, Timothee Mickus, Elaine Zosa, Teemu Vahtola, Jörg Tiedemann, Aman Sinha, Vincent Segonne, Fernando Sánchez-Vega, Alessandro Raganato, Jindřich Libovický, Jussi Karlgren, Shaoxiong Ji, Jindřich Helcl, Liane Guillou, Ona de Gibert, Jaione Bengoetxea, Joseph Attieh, Marianna Apidianaki. SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared Task on Hallucinations and Related Observable Overgeneration Mistakes. arXiv preprint, 2025. https://arxiv.org/abs/2504.11975
Copyright 2025–present Laboratoire d’Informatique de l’École Polytechnique.
This repository is released under the Apache-2.0 license.
Please cite this dataset as follows:
@misc{dentan_much_2025,
title = {MUCH: A Multilingual Claim Hallucination Benchmark},
author = {Dentan, Jérémie and Canesse, Alexi and Buscaldi, Davide and Shabou, Aymen and Vanier, Sonia},
year = {2025},
url = {https://arxiv.org/abs/2511.17081},
}