Conversation

@uyzhang
Contributor

@uyzhang uyzhang commented May 11, 2025

R-Bench PR Description

Motivation

This PR adds support for the R-Bench dataset to OpenCompass. R-Bench is a graduate-level, multi-disciplinary benchmark designed to evaluate the complex reasoning capabilities of both large language models (LLMs) and multimodal large language models (MLLMs). Incorporating R-Bench into OpenCompass enables comprehensive evaluation of model performance on challenging reasoning tasks across 19 academic disciplines and over 100 subjects, in both English and Chinese.

Modification

This PR adds the dataset documentation file opencompass/configs/datasets/R-Bench/R-Bench.md, which includes:

  • Detailed introduction of the R-Bench benchmark
  • Links to the official paper and resources
  • Current evaluation results from top models (both text-only and multimodal)
  • Citation information for proper reference

The file follows the standard OpenCompass dataset documentation format, similar to other benchmark configurations like QuALITY.

Use cases

R-Bench can be used to:

  • Evaluate advanced reasoning capabilities of LLMs across multiple disciplines
  • Compare model performance on complex graduate-level problems requiring deep reasoning
  • Test reasoning abilities in both English and Chinese
  • Assess multimodal reasoning through its dedicated multimodal test set
  • Provide a more challenging benchmark that even state-of-the-art models struggle with (the top model achieves only 53.2% on multimodal tasks)
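
As a rough illustration of the per-language comparison mentioned above, the following minimal sketch computes accuracy separately for each language split. The record fields (`language`, `prediction`, `answer`) and the helper `accuracy_by_group` are hypothetical names chosen for this example; they are not the R-Bench schema or an OpenCompass API.

```python
from collections import defaultdict


def accuracy_by_group(records, group_key):
    """Return {group: accuracy in percent} for a list of result dicts."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        total[group] += 1
        # A prediction counts as correct only on an exact label match.
        if rec["prediction"] == rec["answer"]:
            correct[group] += 1
    return {g: 100.0 * correct[g] / total[g] for g in total}


# Hypothetical records; field names are illustrative, not the R-Bench schema.
records = [
    {"language": "en", "prediction": "A", "answer": "A"},
    {"language": "en", "prediction": "B", "answer": "C"},
    {"language": "zh", "prediction": "D", "answer": "D"},
    {"language": "zh", "prediction": "A", "answer": "A"},
]

scores = accuracy_by_group(records, "language")
print(scores)
```

The same helper could group by any other field, for example a discipline or subject column, assuming the evaluation output exposes one.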

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • The documentation has been modified accordingly, like docstring or example tutorials.
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  • Bug fixes are fully covered by unit tests; the case that causes the bug should be added to the unit tests. (Not applicable as this is a new feature)

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
  • CLA has been signed and all committers have signed the CLA in this PR.

@tonysy
Collaborator

tonysy commented May 14, 2025

Please update the lint

@uyzhang
Contributor Author

uyzhang commented May 19, 2025

> Please update the lint

I've updated it. Could you help run CI?

@tonysy
Collaborator

tonysy commented Jun 5, 2025

Hi, have you tried using OpenCompass to reproduce your reported performance?

@tonysy
Collaborator

tonysy commented Jun 5, 2025

Also please check the pre-commit again. Thanks.

@uyzhang
Contributor Author

uyzhang commented Jun 5, 2025

> Hi, have you tried using OpenCompass to reproduce your reported performance?

Yes, we ran the experiments with OpenCompass and reproduced the previously reported results for this PR.

Collaborator

@tonysy tonysy left a comment

LGTM

@tonysy tonysy requested review from MaiziXiao, Myhs-phz and liushz June 6, 2025 16:29
Contributor

@MaiziXiao MaiziXiao left a comment

Tested. LGTM

@MaiziXiao MaiziXiao merged commit 4f42c12 into open-compass:main Jun 13, 2025
8 checks passed
zyc140345 pushed a commit to zyc140345/opencompass that referenced this pull request Oct 23, 2025
* [Dataset] Add R-Bench (ICML 2025)

* fixed lint

* format rbench.py by isort

* rbench fix

* r-bench fix

* update

---------

Co-authored-by: leoyizhang <leoyizhang@tencent.com>
Co-authored-by: Myhs-phz <demarcia2014@126.com>
Co-authored-by: MaiziXiao <xxllcc1993@gmail.com>
4 participants