This is the official repository for the paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning".
If you find this work relevant or helpful to your research, please kindly cite us:
@article{processbench,
  title={ProcessBench: Identifying Process Errors in Mathematical Reasoning},
  author={Chujie Zheng and Zhenru Zhang and Beichen Zhang and Runji Lin and Keming Lu and Bowen Yu and Dayiheng Liu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2412.06559},
  year={2024}
}
- [12/13/2024] Released the evaluation code for the RLHFlow PRMs
- [12/11/2024] Released the evaluation code and the data on HuggingFace
- [12/10/2024] Released the paper on arXiv
You can use the following code to preview the ProcessBench data:
import json
from datasets import load_dataset
dataset = load_dataset('Qwen/ProcessBench', split='gsm8k')
print(json.dumps(dataset[0], indent=2))
# Expected output:
"""
{
  "id": "gsm8k-0",
  "generator": "Qwen2-7B-Instruct",
  "problem": "Sue lives in a fun neighborhood...",
  "steps": [
    "To find out how many more pink plastic flamingos were out than...",
    ...
  ],
  "final_answer_correct": false,
  "label": 1
}
"""
You can refer to the code folder for the evaluation code and the prompt templates we use in this work.
In TRL v0.13.0, a PRM trainer was introduced. The resulting PRM returns probabilities for the different tokens and works directly with the token-classification pipeline. To evaluate these models, clone this repository and install the requirements-trl.txt dependencies:

uv pip install -r requirements-trl.txt

Now go to the code folder and run the following script:
python run_eval_prm_trl.py \
--model_name "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2" \
--output_dir "./outputs" \
--batch_size 256 \
--sep "\n\n"
Other than the model to evaluate and the token used as a separator, the only relevant argument is the batch size. Internally, the process runs using a transformers pipeline and benefits from bigger batch sizes. For reference, for a 7B model a batch size of 128 should work, taking close to 2 hours to complete the benchmark. The results are saved in the output_dir, and if the command is rerun, it will pick up the saved results and only compute the final metrics.
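Since the resulting PRMs are token-classification models, you can also query one directly with the transformers pipeline outside the evaluation script. The snippet below is a minimal sketch, not the script's implementation: it assumes the model scores each step at the separator token, that "\n\n" matches the separator used during training, and the problem and steps are illustrative placeholders.

from transformers import pipeline

# Minimal sketch: load the PRM as a standard token-classification model
prm = pipeline(
    "token-classification",
    model="plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2",
)

separator = "\n\n"  # must match the separator used during training
problem = "Sue lives in a fun neighborhood..."           # illustrative placeholder
steps = ["First reasoning step ...", "Second step ..."]  # illustrative placeholder

# Join the problem and the steps with the separator; the scores emitted at the
# separator positions are read as the per-step correctness probabilities
text = separator.join([problem] + steps) + separator
outputs = prm(text)
print(outputs)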
The help for the script is shown below:
usage: run_eval_prm_trl.py [-h] [--config {gsm8k,math,olympiadbench,omnimath,all}] --model_name MODEL_NAME [--output_dir OUTPUT_DIR] [--sep SEP] [--batch_size BATCH_SIZE] [--max_elements MAX_ELEMENTS]

options:
  -h, --help            show this help message and exit
  --config {gsm8k,math,olympiadbench,omnimath,all}
                        The configuration to run from the dataset, by default will use 'all'.
  --model_name MODEL_NAME
  --output_dir OUTPUT_DIR
                        The path to save the results to.
  --sep SEP             Separator of the model, ensure it corresponds to the same one used during training.
  --batch_size BATCH_SIZE
                        The number of examples to run in a single batch. Each question has multiple steps, and a batch can contain multiple from different questions to speed up the process.
  --max_elements MAX_ELEMENTS
                        Number of elements to run. Helpful for testing, by default will run the full dataset.
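For a quick smoke test, the --config and --max_elements flags can be combined to run a single subset on a handful of examples (the values below are illustrative):

python run_eval_prm_trl.py \
    --model_name "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2" \
    --config gsm8k \
    --max_elements 20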
- Analyzing the results:
The following output corresponds to plaguss/Qwen2.5-Math-7B-Instruct-PRM-0.2:
Individual Results:
----------------------------------------------------------------------
gsm8k          -> Precision: 22.71  Recall: 93.78  F1 Score: 36.56
math           -> Precision: 38.22  Recall: 70.69  F1 Score: 49.61
olympiadbench  -> Precision: 27.08  Recall: 53.98  F1 Score: 36.07
omnimath       -> Precision: 27.93  Recall: 54.77  F1 Score: 37.00

Weighted Averages:
----------------------------------------------------------------------
Weighted       -> Precision: 30.09  Recall: 63.81  F1 Score: 40.38
It yields the individual results and finally the average weighted by the number of examples in each subset. The weighted F1 score corresponds to the value reported in the reference paper, so it can be compared directly.
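As a rough illustration of what the weighted average means here (the helper below is a hypothetical sketch, not code from the script), each subset's metric is weighted by the number of examples in that subset:

def weighted_average(metric_per_subset, size_per_subset):
    """Hypothetical sketch: weight each subset's metric by its number of examples."""
    total = sum(size_per_subset.values())
    return sum(
        metric_per_subset[name] * size_per_subset[name]
        for name in metric_per_subset
    ) / total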