
ProcessBench

📄 [paper] 🤗 [data]

This is the official repository for the paper "ProcessBench: Identifying Process Errors in Mathematical Reasoning".

If you find this work relevant or helpful, please cite it as:

@article{processbench,
  title={ProcessBench: Identifying Process Errors in Mathematical Reasoning}, 
  author={
    Chujie Zheng and Zhenru Zhang and Beichen Zhang and Runji Lin and Keming Lu and
    Bowen Yu and Dayiheng Liu and Jingren Zhou and Junyang Lin
  },
  journal={arXiv preprint arXiv:2412.06559},
  year={2024}
}

News

  • [12/13/2024] Released the evaluation code for the RLHFlow PRMs
  • [12/11/2024] Released the evaluation code and the data on HuggingFace
  • [12/10/2024] Released the paper on arXiv

Data Usage

You can use the following code to preview the ProcessBench data:

import json
from datasets import load_dataset

dataset = load_dataset('Qwen/ProcessBench', split='gsm8k')
print(json.dumps(dataset[0], indent=2))

# Expected output:
"""
{
  "id": "gsm8k-0",
  "generator": "Qwen2-7B-Instruct",
  "problem": "Sue lives in a fun neighborhood...",
  "steps": [
    "To find out how many more pink plastic flamingos were out than...",
    ...
  ],
  "final_answer_correct": false,
  "label": 1
}
"""

Evaluation

You can refer to the code folder for the evaluation code and the prompt templates we use in this work.

Evaluating TRL based models

TRL v0.13.0 introduced a PRM trainer. The resulting PRM returns per-token probabilities and works directly with the token classification pipeline. To evaluate these models, clone this repository and install the requirements-trl.txt dependencies:

uv pip install -r requirements-trl.txt

Then go to the code folder and run the following script:

python run_eval_prm_trl.py \
    --model_name "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2" \
    --output_dir "./outputs" \
    --batch_size 256 \
    --sep "\n\n"

Other than the model to evaluate and the token used as a separator, the only relevant argument is the batch size. Internally, the script runs a transformers pipeline, which benefits from larger batches. For reference, with a 7B model a batch size of 128 should work and takes close to 2 hours to complete the benchmark. The results are saved in output_dir; if the command is rerun, existing results are reused and only the final metrics are recomputed.
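
For intuition, here is a minimal sketch of how such a PRM can score individual steps through its token classification head. This is an illustration under assumptions, not the script's exact internals: the model name and separator are taken from the command above, while the example problem and the index of the "correct step" class (assumed to be 1) are placeholders to verify on the model card.

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2"
separator = "\n\n"  # must match the separator used during training

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

problem = "Janet has 3 apples and buys 2 more. How many apples does she have now?"
steps = ["Janet starts with 3 apples.", "3 + 2 = 5, so she has 5 apples."]

# Join the problem and steps with the separator; the PRM predicts a label
# distribution for every token, and step scores are read at the separator
# positions that close each step.
text = problem + separator + separator.join(steps) + separator
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)
probs = logits.softmax(dim=-1)

sep_id = tokenizer.encode(separator, add_special_tokens=False)[-1]
step_positions = (inputs["input_ids"][0] == sep_id).nonzero(as_tuple=True)[0][1:]
for i, pos in enumerate(step_positions, start=1):
    print(f"step {i}: P(correct) ~ {probs[0, pos, 1].item():.3f}")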

The help for the script can be found here:

usage: run_eval_prm_trl.py [-h] [--config {gsm8k,math,olympiadbench,omnimath,all}] --model_name MODEL_NAME [--output_dir OUTPUT_DIR] [--sep SEP] [--batch_size BATCH_SIZE] [--max_elements MAX_ELEMENTS]

options:
  -h, --help            show this help message and exit
  --config {gsm8k,math,olympiadbench,omnimath,all}
                        The configuration to run from the dataset, by default will use 'all'.
  --model_name MODEL_NAME
  --output_dir OUTPUT_DIR
                        The path to save the results to.
  --sep SEP             Separator of the model, ensure it corresponds to the same one used during training.
  --batch_size BATCH_SIZE
                        The number of examples to run in a single batch. Each question has multiple steps, and a batch can contain multiple from different questions to speed up the process.
  --max_elements MAX_ELEMENTS
                        Number of elements to run. Helpful for testing, by default will run the full dataset.
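
For a quick smoke test before a full run, the --config and --max_elements flags documented above can restrict the evaluation to a single subset and a handful of examples, for instance:

python run_eval_prm_trl.py \
    --model_name "plaguss/Qwen2.5-Math-1.5B-Instruct-PRM-0.2" \
    --config gsm8k \
    --max_elements 16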

Analyzing the results:

The following output corresponds to plaguss/Qwen2.5-Math-7B-Instruct-PRM-0.2:

Individual Results:
----------------------------------------------------------------------
gsm8k         -> Precision: 22.71  Recall: 93.78  F1 Score: 36.56
math          -> Precision: 38.22  Recall: 70.69  F1 Score: 49.61
olympiadbench -> Precision: 27.08  Recall: 53.98  F1 Score: 36.07
omnimath      -> Precision: 27.93  Recall: 54.77  F1 Score: 37.00
Weighted Averages:
----------------------------------------------------------------------
Weighted      -> Precision: 30.09  Recall: 63.81  F1 Score: 40.38

The script prints the per-subset results and finally the average weighted by the number of examples in each subset. The weighted F1 score is the value to compare against the numbers reported in the paper.
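
For reference, the weighted row is just the per-subset scores averaged with the subset sizes as weights. A minimal sketch, using the F1 values from the table above and the ProcessBench subset sizes (400 / 1000 / 1000 / 1000 examples, which reproduce the 40.38 shown; verify the counts against the dataset card):

f1_scores = {"gsm8k": 36.56, "math": 49.61, "olympiadbench": 36.07, "omnimath": 37.00}
subset_sizes = {"gsm8k": 400, "math": 1000, "olympiadbench": 1000, "omnimath": 1000}

# Average the per-subset F1 scores, weighting each subset by its size.
total = sum(subset_sizes.values())
weighted_f1 = sum(f1_scores[k] * subset_sizes[k] for k in f1_scores) / total
print(f"Weighted F1: {weighted_f1:.2f}")  # 40.38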
