
🧭 BrowseComp-ZH: Benchmarking the Web Browsing Ability of Large Language Models in Chinese

🇨🇳 Project Introduction in Chinese (中文版项目介绍)

BrowseComp-ZH is the first high-difficulty benchmark specifically designed to evaluate the real-world web browsing and reasoning capabilities of large language models (LLMs) in the Chinese information ecosystem. Inspired by BrowseComp (Wei et al., 2025), BrowseComp-ZH targets the unique linguistic, structural, and retrieval challenges of the Chinese web, including fragmented platforms, implicit linguistic patterns, and content censorship.

📄 Paper Link (arXiv)

👥 Authors

Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua

🌟 Key Features

  • 🔍 Native Chinese Construction: All questions, retrieval chains, and browsing steps are authored directly in Chinese by experts to avoid translation artifacts and ensure authentic search difficulty.
  • 🧩 Reverse-Engineered Multi-Hop Queries: Each task starts from a known factual answer and is crafted with multiple constraints (e.g., time, entity type, description) to ensure high retrieval difficulty and answer uniqueness.
  • 🌐 Tri-Engine Validation and Dual-Stage Quality Control: All questions are verified across Baidu, Bing (China), and Google; a two-stage human-in-the-loop protocol filters out easily retrievable or ambiguous samples.
  • 🤖 Comprehensive Benchmarking: 20+ systems—including open-source LLMs, closed-source APIs, and agentic search systems—are evaluated to diagnose browsing and reasoning capabilities across different architectures.

📁 Repository Structure

BrowseComp-ZH/
├── data/
│   ├── browsecomp-zh-encrypted.xlsx   # Encrypted dataset
│   └── browsecomp-zh-decrypt.py       # Decryption script
├── images/                            # Visualizations and charts
├── paper/                             # Paper and supplementary materials
├── README.md
└── requirements.txt

🛠️ Environment Installation

cd BrowseComp-ZH
conda create --name BrowseComp python=3.12
conda activate BrowseComp
pip install -r requirements.txt

🔐 Dataset Access

The BrowseComp-ZH dataset contains 289 complex multi-hop retrieval and reasoning questions, spanning 11 domains including Film & TV, Technology, Medicine, and History.

To keep the benchmark out of pretraining corpora and preserve its value for evaluation, all samples are encrypted.
To decrypt the dataset:

python data/browsecomp-zh-decrypt.py --input data/browsecomp-zh-encrypted.xlsx --output data/browsecomp-zh-decrypted.xlsx --json_output raw_data/browsecomp-zh-decrypted.json

You will be prompted for a canary token embedded within the file.
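The decryption script likely follows the canary-based scheme from the original BrowseComp release. As a rough illustration only (an assumption about the script's internals, not a copy of data/browsecomp-zh-decrypt.py), the core primitive could derive a keystream from the canary token and XOR it against each base64-encoded cell:

# Hedged sketch of a canary-keyed XOR decryption, modeled on the scheme used
# by the original BrowseComp release. This is an assumption about the script's
# internals; the actual logic lives in data/browsecomp-zh-decrypt.py.
import base64
import hashlib


def derive_key(canary: str, length: int) -> bytes:
    """Stretch the canary token into a keystream of the requested length."""
    digest = hashlib.sha256(canary.encode("utf-8")).digest()
    return (digest * (length // len(digest) + 1))[:length]


def decrypt_cell(ciphertext_b64: str, canary: str) -> str:
    """XOR the base64-decoded ciphertext against the canary-derived keystream."""
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(canary, len(data))
    return bytes(b ^ k for b, k in zip(data, key)).decode("utf-8")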

📊 Evaluation

The evaluation is divided into two parts: model evaluation and result statistics.

cd BrowseComp-ZH
# model evaluation
bash run.sh
# result statistics
python run_acc_calibration_error.py
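As a rough sketch of what the statistics stage computes, assuming each per-question record carries a binary correctness flag and a 0-100 self-reported confidence (the field names "is_correct" and "confidence" are illustrative, not the repo's actual schema): accuracy is the fraction of correct answers, and calibration error can be measured in the expected-calibration-error style over confidence bins.

# Hedged sketch of the statistics stage; field names are illustrative.
def accuracy_and_calibration_error(records, n_bins=10):
    """Return accuracy plus an ECE-style calibration error over confidence bins."""
    acc = sum(r["is_correct"] for r in records) / len(records)
    # Bucket records by self-reported confidence (0-100 mapped to n_bins bins).
    bins = [[] for _ in range(n_bins)]
    for r in records:
        idx = min(int(r["confidence"] / 100 * n_bins), n_bins - 1)
        bins[idx].append(r)
    # Weighted average of |bin accuracy - bin confidence| across bins.
    ece = 0.0
    for b in bins:
        if not b:
            continue
        bin_acc = sum(r["is_correct"] for r in b) / len(b)
        bin_conf = sum(r["confidence"] / 100 for r in b) / len(b)
        ece += (len(b) / len(records)) * abs(bin_acc - bin_conf)
    return acc, ece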

Folder Structure

  • raw_data: evaluation dataset in JSON format
  • predict_data: detailed model responses
  • eval_data: GPT-4o answer-extraction results (see the sketch below)
  • output_data: final evaluation results
  • outcome_data: accuracy and calibration-error statistics
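For the eval_data stage, here is a minimal sketch of how a GPT-4o judge might grade a model response against the gold answer; the prompt wording and the judge() helper are illustrative assumptions, not the repo's exact implementation:

# Hedged sketch of GPT-4o-based answer grading; prompt and helper are
# illustrative, see run.sh and the repo scripts for the actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question: str, gold: str, response: str) -> bool:
    """Ask GPT-4o whether the model response matches the gold answer."""
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Model response: {response}\n"
        "Does the response give the gold answer? Reply with yes or no."
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content.strip().lower().startswith("yes")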

🏆 Model Performance Overview

| Model | Category | Reasoning | Browsing | Accuracy | Calibration Error (%) | Enterprise |
|---|---|---|---|---|---|---|
| DeepSeek-V3 | Open-Source | No | No | 8.7% | 72 | DeepSeek |
| DeepSeek-R1 | Open-Source | Yes | No | 23.2% | 59 | DeepSeek |
| Qwen2.5-72B-Instruct | Open-Source | No | No | 6.6% | 62 | Alibaba |
| QwQ-32B | Open-Source | Yes | No | 11.1% | 64 | Alibaba |
| Qwen3-235B-A22B (Non-Thinking) | Open-Source | No | No | 8.0% | 80 | Alibaba |
| Qwen3-235B-A22B (Thinking) | Open-Source | Yes | No | 13.2% | 67 | Alibaba |
| Llama-4 | Open-Source | No | No | 4.8% | 70 | Meta |
| GPT-4o | Closed-Source | No | No | 6.2% | 73 | OpenAI |
| o1 | Closed-Source | Yes | No | 29.1% | 52 | OpenAI |
| o4-mini | Closed-Source | Yes | No | 15.2% | 42 | OpenAI |
| Claude-3.5-Sonnet | Closed-Source | No | No | 5.5% | 78 | Anthropic |
| Claude-3.7-Sonnet | Closed-Source | Yes | No | 17.7% | 71 | Anthropic |
| Gemini-2.0-Flash | Closed-Source | No | No | 6.9% | 74 | Google |
| Gemini-2.5-Pro | Closed-Source | Yes | No | 27.3% | 59 | Google |
| Qwen2.5-Max | Closed-Source | No | No | 7.6% | 78 | Alibaba |
| OpenAI DeepResearch | AI Search Product | - | Yes | 42.9% | 9 | OpenAI |
| Grok3 (Research) | AI Search Product | - | Yes | 12.9% | 39 | xAI |
| Perplexity (Research) | AI Search Product | - | Yes | 22.6% | 53 | Perplexity |
| Doubao (Deep Search) | AI Search Product | - | Yes | 26.0% | 61 | ByteDance |
| Doubao (Standard) | AI Search Product | - | Yes | 18.7% | 37 | ByteDance |
| Kimi (Deep Think) | AI Search Product | - | Yes | 8.0% | 58 | Moonshot |
| Yuanbao (Hunyuan Model) | AI Search Product | - | Yes | 12.2% | 56 | Tencent |
| DeepSeek (Deep Think) | AI Search Product | - | Yes | 7.6% | 65 | DeepSeek |
| DeepSeek (Standard) | AI Search Product | - | Yes | 4.8% | 66 | DeepSeek |

📊 Key Findings

  • 📉 Most standalone LLMs achieve less than 10% accuracy on BrowseComp-ZH, reflecting the benchmark’s difficulty.
  • 🧠 Models with explicit reasoning capabilities consistently outperform their non-reasoning counterparts (e.g., DeepSeek-R1 vs. DeepSeek-V3, Claude-3.7 vs. Claude-3.5).
  • 🔍 Retrieval-augmented systems significantly outperform pure LLMs, with DeepResearch achieving the highest accuracy (42.9%).
  • 🔄 Multi-hop retrieval pipelines are critical: single-shot retrieval systems (e.g., DeepSeek, Kimi) struggle with the complexity of these tasks.
  • 📈 Calibration error tracks retrieval-reasoning effectiveness: the best-performing system (DeepResearch, 9%) is also the best calibrated, while most standalone LLMs exceed 50%, highlighting the difficulty of confidence estimation during browsing.

📎 Citation

If you use BrowseComp-ZH in your research, please cite:

@article{zhou2025browsecomp,
  title={BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese},
  author={Zhou, Peilin and Leon, Bruce and Ying, Xiang and Zhang, Can and Shao, Yifan and Ye, Qichen and Chong, Dading and Jin, Zhiling and Xie, Chenxuan and Cao, Meng and others},
  journal={arXiv preprint arXiv:2504.19314},
  year={2025}
}

🤝 Contact & Contribution

We welcome questions, suggestions, and contributions!
Please open an issue or contact @PALIN2018.

🛡️ License

BrowseComp-ZH is released under the MIT License.
The dataset is intended solely for academic research purposes and must not be used for sensitive or high-stakes decision-making.
