Evaluate the social reasoning capabilities of LLM agents in multi-party environments.
Requires Python 3.11+ and uv.
git clone https://github.com/microsoft/social-reasoning-bench.git srbench
cd srbench
uv sync --all-packages --all-groups --all-extras
source .venv/bin/activate

Evaluate the social reasoning ability of your own LLM. For this example, we assume your LLM is served as my-model via an OpenAI-compatible endpoint at http://localhost:8000.
# To reproduce our results, use Gemini as the counterparty.
export GEMINI_API_KEY=<your api key>
# Run the v0.1.0 experiment sweep with your model as the assistant
srbench experiment experiments/v0.1.0 \
--output-base outputs/my-model \
--assistant-model openai/my-model \
--assistant-base-url http://localhost:8000/v1 \
--assistant-api-key none
# To test just a few examples per experiment in the sweep, add this flag to the command above:
# --set limit=10
# View the results
srbench dashboard outputs/my-model

See Installation, Experiments, and LLMs for detailed instructions.
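
If you want to launch the sweep from a script rather than the shell, the command above can be assembled in Python. This is only a minimal sketch: the helper name build_sweep_command is hypothetical and not part of srbench, and it passes through only the flags shown in the example above.

```python
import shlex


def build_sweep_command(model, base_url, output_base, api_key="none", limit=None):
    """Return the argv list for the v0.1.0 sweep shown above.

    Hypothetical helper for illustration; not part of srbench.
    """
    cmd = [
        "srbench", "experiment", "experiments/v0.1.0",
        "--output-base", output_base,
        "--assistant-model", f"openai/{model}",
        "--assistant-base-url", base_url,
        "--assistant-api-key", api_key,
    ]
    if limit is not None:
        # Optional: run only a few examples per experiment.
        cmd += ["--set", f"limit={limit}"]
    return cmd


cmd = build_sweep_command(
    "my-model", "http://localhost:8000/v1", "outputs/my-model", limit=10
)
# Print the command in copy-pasteable shell form.
print(shlex.join(cmd))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) from an environment where srbench is installed and GEMINI_API_KEY is set. Building the command as a list avoids the line-continuation mistakes that are easy to make in a multi-line shell command.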
