- 12.11.2025: We have been accepted to the EIML workshop @ EurIPS 2025!
RewardUQ is a unified framework for training and evaluating uncertainty-aware reward models. Built on top of the Hugging Face ecosystem using 🤗 TRL, 🤗 Transformers, and PyTorch, it provides a variety of state-of-the-art uncertainty quantification methods alongside easily accessible training pipelines.
This repository is designed to function simultaneously as an importable library and a standalone research framework. We want to encourage both usage styles to foster adoption and contribution from the community.
- 📦 As a library: Import specific components (models, functional APIs, utilities) into external projects or production inference pipelines.
- 🧪 As a research framework: Use Hydra configurations and entry points for rapid experimentation with version-controllable configs and seamless hyperparameter sweeps.
| Method | Description | Config Path |
|---|---|---|
| MLP Head Ensemble | Multiple independent MLP heads on a shared frozen backbone. Uncertainty from prediction variance across ensemble members. | `mlp_head_ensemble/` |
| LoRA Ensemble | Ensemble of independent LoRA adapters, each with its own linear head. | `lora_ensemble/` |
| DPO-based MC Dropout | Monte Carlo dropout applied to DPO's implicit reward model. Uncertainty from stochastic forward passes during inference. | `dpo_head_dropout_ensemble/` |
| Bayesian Linear Head | Single linear head with Gaussian posterior via Laplace approximation. | `bayesian_linear_head/` |
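As a rough illustration of how the ensemble-style methods above turn member disagreement into uncertainty (a minimal sketch, not the library's actual implementation; the mean ± one standard deviation reduction is just one possible choice):

```python
import torch

# Toy reward scores from a hypothetical 5-member ensemble for a batch of 8 responses.
member_rewards = torch.randn(8, 5)  # (batch_size, n_members)

reward = member_rewards.mean(dim=-1)             # point estimate
spread = member_rewards.std(dim=-1)              # disagreement across members
lower, upper = reward - spread, reward + spread  # simple one-sigma bounds

print(reward.shape, lower.shape, upper.shape)    # each torch.Size([8])
```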
Install the package via pip:
```bash
pip install rewarduq
```

Clone the repository:

```bash
git clone https://github.com/Florian-toll/rewarduq.git
cd rewarduq
```

We recommend uv to manage dependencies:

```bash
uv sync
```

Alternatively, use pip with the requirements.txt:

```bash
pip install -r requirements.txt
```

After the installation, verify that torch recognizes your device. If you have a CUDA-capable GPU, the following command should return `True`; otherwise, make sure to install the correct version of PyTorch for your system from pytorch.org:

```bash
uv run python -c "import torch; print(torch.cuda.is_available())"
```

For a development setup:

```bash
# Install dev dependencies
uv sync --dev
# Or, install all extras
uv sync --all-extras
# Optionally, install pre-commit hooks (recommended)
uv run pre-commit install
# Optionally, install nbstripout hooks (recommended)
uv run nbstripout --install
```

Run the pre-commit hooks manually with:

```bash
uv run pre-commit run --all-files
```

To use RewardUQ as a library, load a config and pipeline, then train:

```python
from rewarduq import load_pipeline
from rewarduq.utils import get_config
# Load the config
config = get_config("configs/<config_file>.yaml")
# Load the pipeline
rm_pipeline = load_pipeline(config)
# Train the reward model
rm_pipeline.train(train_dataset, eval_dataset)
```

All forward passes of RewardUQ models return a tensor of shape `(batch_size, 3)`, where the first column is the reward, the second column is the lower bound, and the third column is the upper bound.
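A minimal sketch of consuming that output; the `outputs` tensor below is a random stand-in for a real forward pass, and the variable names are illustrative rather than part of the API:

```python
import torch

outputs = torch.randn(4, 3)   # stand-in for a model's (batch_size, 3) output

rewards = outputs[:, 0]       # point-estimate reward per example
lower_bounds = outputs[:, 1]  # lower uncertainty bound
upper_bounds = outputs[:, 2]  # upper uncertainty bound

# Example use: flag examples whose uncertainty interval is unusually wide.
interval_width = upper_bounds - lower_bounds
print(interval_width > interval_width.mean())
```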
RewardUQ uses Hydra for configuration management. For example, to train a model you can use the following command:

```bash
uv run python ./scripts/train.py \
dataset/train=ultrafeedback_binarized \
dataset/eval=ultrafeedback_binarized \
method=mlp_head_ensemble/default \
model.base_model_name_or_path=Qwen/Qwen3-0.6B
```

By default, this uses the configs/ folder in the repository and the train.yaml file. If you run experiments outside of the repository, you must specify the --config-path <absolute_config_path> parameter. You can also change the config file with the --config-name <config_name> parameter, for instance to manage multiple experiments at once.
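For example, to point the script at configs stored outside the repository (the path and config name below are hypothetical):

```bash
uv run python ./scripts/train.py \
--config-path /absolute/path/to/my_configs \
--config-name my_experiment \
method=mlp_head_ensemble/default
```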
In our paper, we use RewardBench as our primary evaluation benchmark. To evaluate on RewardBench with automatic weighted averaging over categories, use the following config override:

```bash
dataset/eval=reward_bench
```

The training script (scripts/train.py) trains uncertainty-aware reward models and is configured via Hydra.
Usage:

```bash
uv run python ./scripts/train.py \
dataset/train=<train_dataset> \
dataset/eval=<eval_dataset> \
method=<method_config> \
[additional_overrides...]
```

Key Parameters:
- `dataset/train`: Training dataset configuration (e.g., `ultrafeedback_binarized`, `tulu_3_8b_preference_mixture`)
- `dataset/eval`: Evaluation dataset configuration (can be `null` for no evaluation)
- `method`: UQ method configuration (e.g., `mlp_head_ensemble/default`)
- `model.base_model_name_or_path`: HuggingFace model identifier or local path
- `resume`: Resume from checkpoint (a path, or `True` for the latest checkpoint; see the sketch after the examples below)
Examples:

```bash
# Train MLP Head Ensemble on Qwen3-0.6B
uv run python ./scripts/train.py \
dataset/train=ultrafeedback_binarized \
dataset/eval=ultrafeedback_binarized \
method=mlp_head_ensemble/qwen3_0.6b
```
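The `resume` parameter listed above can be combined with any of these commands; a minimal sketch, assuming a previous run has already written checkpoints to its output directory:

```bash
# Resume training from the latest checkpoint of the previous run
uv run python ./scripts/train.py \
dataset/train=ultrafeedback_binarized \
dataset/eval=ultrafeedback_binarized \
method=mlp_head_ensemble/qwen3_0.6b \
resume=True
```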
To enable logging, set `trainer.report_to=wandb`. You can also override the entity and project directly in the command:

```bash
# Login to wandb
wandb login
# Run training
uv run python ./scripts/train.py \
trainer.report_to=wandb \
wandb.entity=your-entity \
wandb.project=your-project
```

To run hyperparameter sweeps, for example with the configs in configs/sweeps/:

```bash
# Create sweep
wandb sweep \
--entity <your-entity> \
--project <your-project> \
--name <your-sweep-name> \
./configs/sweeps/sweep_ens_mlp.yaml
# Run sweep
wandb agent --count 1 "<your-entity>/<your-project>/<sweep-id>"
```

The inference script (scripts/run_inference.py) runs inference on custom prompts and completions.
Usage:

```bash
python ./scripts/run_inference.py \
--model <model_path> \
--dataset <dataset_name> <split> \
[--batch-size BATCH_SIZE] \
[--out OUTPUT_DIR] \
[--debug]
```

Parameters:
- `--model`: Path to trained model or HuggingFace identifier (required)
- `--dataset`: Dataset containing prompts and completions (required)
- `--batch-size`: Inference batch size (default: 16)
- `--out`: Output directory for predictions (default: current directory)
- `--debug`: Limit dataset size for quick testing
Output:
Saves predictions as .npy files containing reward scores with uncertainty bounds.
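A minimal sketch for inspecting such a file, assuming a hypothetical output name `predictions.npy` and that it stores the same reward / lower-bound / upper-bound columns returned by the models:

```python
import numpy as np

# Hypothetical file name; use the actual path written to the --out directory.
preds = np.load("predictions.npy")  # expected shape: (num_examples, 3)

rewards, lower, upper = preds[:, 0], preds[:, 1], preds[:, 2]
print(f"mean reward: {rewards.mean():.3f}")
print(f"mean interval width: {(upper - lower).mean():.3f}")
```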
Configs are organized in the configs/ directory:
```
configs/
├── train.yaml                          # Main training config with defaults
├── dataset/
│   ├── train/                          # Training dataset configs
│   │   ├── ultrafeedback_binarized.yaml
│   │   ├── tulu_3_8b_preference_mixture.yaml
│   │   └── ...
│   └── eval/                           # Evaluation dataset configs
│       ├── reward_bench.yaml
│       └── ...
├── method/
│   ├── base.yaml                       # Base config for all methods
│   ├── mlp_head_ensemble/
│   │   ├── default.yaml                # Default config
│   │   ├── qwen3_14b.yaml              # Model-specific tuned config
│   │   └── ...
│   ├── lora_ensemble/
│   ├── dpo_head_dropout_ensemble/
│   └── bayesian_linear_head/
├── accelerate/                         # Examples for distributed training configs
│   ├── default.yaml
│   └── fsdp.yaml
├── paths/                              # Path configurations
│   └── default.yaml
└── hydra/                              # Hydra-specific settings
    └── default.yaml
```
Models inherit from transformers.PreTrainedModel:
```
transformers.PreTrainedModel
└── rewarduq.methods.base.RewardUQModel   (base class for all UQ models)
    ├── MLPHeadEnsembleModel
    ├── LoraEnsembleModel
    ├── DPOHeadDropoutEnsembleModel
    └── BayesianLinearHeadModel
```
Trainers extend TRL's specialized trainers:
```
transformers.Trainer
└── trl.RewardTrainer / trl.DPOTrainer
    └── rewarduq.trainers.TrainerExtension   (adds UQ-specific features)
        ├── rewarduq.trainers.RewardUQTrainer      (extends RewardTrainer)
        └── rewarduq.trainers.DPORewardUQTrainer   (extends DPOTrainer)
```
All forward passes of RewardUQ models return a tensor of shape `(batch_size, 3)`, where the first column is the reward, the second column is the lower bound, and the third column is the upper bound.
We welcome and encourage contributions from the community! Whether you want to add a new uncertainty quantification method, improve existing ones, or fix bugs, your help is appreciated. If you have an idea for a new feature or improvement:
- Check existing Issues or open a new one to discuss your idea.
- Fork the repository and create a feature branch.
- Implement your changes (don't forget tests :D).
- Submit a Pull Request.
If you use RewardUQ in your work, please cite:

```bibtex
@misc{yang2025rewarduq,
author = {Daniel Yang and Samuel Stante and Florian Redhardt and Lena Libon and Barna Pasztor and Parnian Kassraie and Ido Hakimi and Andreas Krause},
title = {RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/lasgroup/rewarduq}}
}
```

This repository's source code is available under the Apache-2.0 License.