GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

  • Authors: Jinhao Duan*, Renming Zhang*, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Elias Stengel-Eskin, Mohit Bansal, Tianlong Chen, and Kaidi Xu (*equal contribution)
  • arXiv
  • GTBench HF Leaderboard

[Figure: Overview of GTBench]

Overview

This repo contains the code for our paper: GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations. GTBench is a language-driven environment that evaluates the strategic reasoning limitations of LLMs through game-theoretic tasks. It is built on top of OpenSpiel and supports 10 widely recognized games; the full list is in ./gamingbench/configs/game_configs/*.yaml.

Environment

Dependencies can be installed by running

pip install -r requirements.txt

LLM Inference

GTBench uses LangChain for LLM inference (./gamingbench/chat/chat.py), supporting hosted backends such as OpenAI and DeepInfra. Available models are defined in ./gamingbench/configs/model_configs/*.yaml.
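
For reference, here is a minimal sketch of a LangChain chat call (illustrative only; GTBench's actual wrapper lives in ./gamingbench/chat/chat.py, and the exact class names depend on your LangChain version):

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

# Standalone example; GTBench wraps this logic in its own chat module.
chat = ChatOpenAI(
    model_name="gpt-3.5-turbo-1106",
    temperature=0.0,
    openai_api_key="<YOUR-OPENAI-API-KEY>",
)
response = chat([
    SystemMessage(content="You are playing Tic-Tac-Toe as player X."),
    HumanMessage(content="The board is empty. Choose your move."),
])
print(response.content)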

Scripts

LLM-vs-X

GTBench supports two match settings:

  • LLM-vs-Conventional: an LLM agent competes against conventional solvers such as Monte Carlo Tree Search (MCTS); a minimal sketch of an MCTS opponent follows this list.
  • LLM-vs-LLM: an LLM agent competes against other LLM agents.
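
For intuition, the following sketch builds an MCTS player directly on OpenSpiel (illustrative only; the bot parameters here are assumptions, not the exact solver configuration GTBench uses):

import numpy as np
import pyspiel
from open_spiel.python.algorithms import mcts

game = pyspiel.load_game("tic_tac_toe")
# RandomRolloutEvaluator scores leaf nodes by averaging random playouts.
evaluator = mcts.RandomRolloutEvaluator(
    n_rollouts=20, random_state=np.random.RandomState(0))
bot = mcts.MCTSBot(
    game,
    uct_c=2.0,              # exploration constant (assumed value)
    max_simulations=100,    # simulations per move (assumed value)
    evaluator=evaluator,
    random_state=np.random.RandomState(0),
)

state = game.new_initial_state()
while not state.is_terminal():
    # Here MCTS moves for both seats; in LLM-vs-Conventional, an LLM
    # agent would choose the action for one of the two players.
    action = bot.step(state)
    state.apply_action(action)
print("Returns:", state.returns())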

The following script runs GPT-3.5-turbo-1106 with the Prompt agent against GPT-3.5-turbo-1106 with the CoT agent on Tic-Tac-Toe:

seed=0
output_root="./experiments"
exp_name='test'
num_matches=50 # number of matches
num_workers=20 # run 20 matches in parallel
threshold_matches=100 # maximum number of attempted matches; stopping criterion for low completion rates, e.g., when LLM agents keep generating illegal actions
# supports all the games listed in ./gamingbench/configs/game_configs/*.yaml
game_name='tictactoe'
# supports all the llms defined in ./gamingbench/configs/model_configs/*.yaml
model_config_root='gamingbench/configs/model_configs'
llm_name='gpt-35-turbo-1106'
opponent_llm_name='gpt-35-turbo-1106'
# supports all the reasoning methods defined in ./gamingbench/configs/agent_configs/*.yaml
agent_config_root='gamingbench/configs/agent_configs'
agent_name='prompt_agent'
opponent_agent_name='cot_agent'
declare -a api_keys=("<YOUR-OPENAI-API-KEY>" "<YOUR-DEEPINFRA-KEY>")

python3 -m gamingbench.main \
    --num-matches ${num_matches} \
    --exp-root ${output_root}/${exp_name}/${llm_name} \
    --seed ${seed} \
    --game-name ${game_name} \
    --agent-configs ${agent_config_root}/${agent_name}.yaml ${agent_config_root}/${opponent_agent_name}.yaml \
    --model-configs ${model_config_root}/${llm_name}.yaml ${model_config_root}/${opponent_llm_name}.yaml \
    --api-keys "${api_keys[@]}" \
    --exchange-first-player \
    --num-workers ${num_workers} \
    --threshold-matches ${threshold_matches}
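
Match logs are written under ${output_root}/${exp_name}/${llm_name} (the --exp-root above). The --exchange-first-player flag swaps which agent moves first between matches, so that neither side benefits from always opening the game.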

Customized LLM Agent

Coming soon.

Upload to GTBench HF Leaderboard

Coming soon.

Reference

Please cite our paper as:

@article{duan2024gtbench,
  title={GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations},
  author={Duan, Jinhao and Zhang, Renming and Diffenderfer, James and Kailkhura, Bhavya and Sun, Lichao and Stengel-Eskin, Elias and Bansal, Mohit and Chen, Tianlong and Xu, Kaidi},
  journal={arXiv preprint arXiv:2402.12348},
  year={2024}
}
