Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
When the rules of reality change, even in trivial ways, can AI adapt as seamlessly as humans do?
This project investigates whether Large Language Models (LLMs) can reason effectively in absurd world models through a custom game called Absurd Soccer. By creating variations of soccer rules, we test whether LLMs can still solve small tasks that humans handle with limited memory once the rules deviate from standard expectations.
This work explores whether models can:
- Adapt to rule variations (e.g., "lowest score wins" instead of "highest score wins")
- Handle symbolic role reversals (e.g., "shooting a net into a ball")
- Maintain reasoning capabilities when world models differ from their training data
Absurd Soccer isolates these capabilities by pairing a familiar domain (soccer) with systematically modified rules. Soccer works well as a testbed because of its:
- Universal familiarity - everyone understands the basic rules
- Simple domain - isolates reasoning from domain complexity
- Systematic variations - rule modifications are clear and controlled
- Easy verification - outcomes are deterministic
- Minimal memory - humans solve these tasks effortlessly
Each variant modifies the REAL ruleset through:
- Changing roles of symbols (e.g., shooting a net into a ball)
- Changing score mechanics (e.g., the team with the lowest score wins)
The unmodified REAL ruleset (a minimal simulation sketch follows the table below):
- Two teams of players
- Each team starts with a score of 0
- Five matches per game
- Teams alternate shooting a ball at a net
- Hitting the net increases score by 1
- Highest score wins
| Ruleset | Modification | Change Type |
|---|---|---|
| REAL | No modification | N/A |
| MISSING | Score increases by 1 when a shot misses | Score mechanics |
| LEAST | Team with the lowest score wins | Score mechanics |
| ICE CREAM | Team earns ice cream when hitting the net | Score mechanics |
| CAR | Team earns cars when hitting the net | Score mechanics |
| SWITCH | Teams shoot a net into a ball (roles reversed) | Roles of symbols |
| MISS & SWITCH | Combination of MISSING and SWITCH | Both |
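To make the mechanics concrete, here is a minimal sketch of how a game could be simulated and scored under these rulesets. Everything in it (the function name, the 50% hit probability, team labels, tie handling) is an illustrative assumption, not the actual implementation in `main.py`:

```python
import random
from typing import Optional

def simulate_game(ruleset: str, matches: int = 5, seed: Optional[int] = None) -> str:
    """Simulate one Absurd Soccer game and return the winning team ('A' or 'B').

    Illustrative sketch only; ties are resolved arbitrarily in favor of team A.
    """
    rng = random.Random(seed)
    scores = {"A": 0, "B": 0}
    for _ in range(matches):
        for team in ("A", "B"):          # teams alternate shots
            hit = rng.random() < 0.5     # assumed 50% chance of hitting the target
            # Under MISSING (and MISS & SWITCH), a *miss* scores instead of a hit.
            scored = (not hit) if "MISS" in ruleset else hit
            scores[team] += int(scored)  # points, ice creams, or cars all count as +1
    # Under LEAST the lowest score wins; every other ruleset keeps highest-wins.
    winner = min if ruleset == "LEAST" else max
    return winner(scores, key=scores.get)
```

Note that SWITCH leaves this scoring logic untouched: reversing the roles of ball and net changes how the game is described in the commentary, not which shots count.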
We evaluate models on two tasks.

Determine Outcome (DO)
- Models receive the game commentary and the ruleset
- Task: Determine which team won
- Metric: Accuracy of predicted winner
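As a rough illustration of the DO setup, a prompt might be assembled like this (the wording and structure are assumptions; the actual template lives in `main.py`):

```python
def build_do_prompt(rules: str, commentary: str) -> str:
    """Assemble a Determine Outcome (DO) prompt: ruleset first, then commentary.

    Illustrative sketch; not the exact template used in the experiments.
    """
    return (
        "You are given the rules of a game and the commentary of one match.\n\n"
        f"Rules:\n{rules}\n\n"
        f"Commentary:\n{commentary}\n\n"
        "Which team won? Answer with the team name only."
    )
```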
Determine Outcome Few-Shot (DOFS)
- Models receive three correct question-answer pairs
- They then answer a new question under the same ruleset
- Metric: Accuracy of predicted winner
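The few-shot variant can be sketched the same way, prepending three solved games under the same ruleset (again an assumption about the formatting, not the exact code):

```python
from typing import List, Tuple

def build_dofs_prompt(rules: str,
                      examples: List[Tuple[str, str]],
                      new_commentary: str) -> str:
    """Assemble a DOFS prompt from (commentary, winner) pairs plus a new game."""
    parts = [f"Rules:\n{rules}"]
    for commentary, winner in examples[:3]:   # exactly three solved examples
        parts.append(f"Commentary:\n{commentary}\nWinner: {winner}")
    parts.append(f"Commentary:\n{new_commentary}\nWinner:")
    return "\n\n".join(parts)
```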
We evaluated 17 models across three categories:
- google/gemini-2.0-flash-001
- openai/gpt-4o-mini
- meta-llama/llama-4-maverick-17b-128e-instruct
- qwen/qwen-2.5-72b-instruct
- google/gemini-2.5-flash
- anthropic/claude-3-5-haiku
- nousresearch/hermes-3-llama-3.1-405b
- sao10k/l3.1-euryale-70b
- meta-llama/llama-3.1-405b-instruct
- thedrummer/skyfall-36b-v2
- openai/gpt-4.1-mini
- google/gemma-2-27b-it
- deepseek/deepseek-r1-0528
- mistralai/magistral-small-2506
- nvidia/llama-3.1-nemotron-ultra-253b-v1
- perplexity/sonar-reasoning
To run the experiments:

1. Python Environment: Ensure Python 3.8+ is installed.
2. API Key: Obtain an OpenRouter API key and keep it ready (a quick sanity-check request is sketched after these steps).
3. Run a Single Task: To evaluate a specific ruleset with a set of models, use the `task_1` function in `main.py`. Example:

   ```bash
   python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --ruleset "Switch" --models "cheap"
   ```
4. Run All Rulesets: To evaluate all rulesets for a specific task, use the `run_all_rulesets` function. Example:

   ```bash
   python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --models "all"
   ```
5. Run Full Experiment: To run all tasks and rulesets, use the `run_full_exp` function. Example:

   ```bash
   python main.py --folder_name "experiment_results" --api_key YOUR_API_KEY --num_sims 100 --models "all"
   ```
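Before launching a long run, you can sanity-check your key with a single OpenRouter chat-completion request. A minimal sketch (the model slug and prompt are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"  # your OpenRouter key

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/gpt-4o-mini",  # any slug from the model list above
        "messages": [
            {"role": "user", "content": "Lowest score wins. Team A has 2, Team B has 5. Who wins?"}
        ],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```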
- Results are saved as CSV files in the specified folder.
- Each file contains detailed model responses and outcomes for the experiments.
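Once a run finishes, the CSVs can be aggregated however you like. A minimal sketch with pandas (the folder name and column names are assumptions; inspect one file first, since the actual schema is defined by `main.py`):

```python
import glob
import pandas as pd

# Load every result CSV from the experiment folder.
frames = [pd.read_csv(path) for path in glob.glob("experiment_results/*.csv")]
results = pd.concat(frames, ignore_index=True)

# Accuracy per model and ruleset, assuming 'correct' is a 0/1 column.
print(results.groupby(["model", "ruleset"])["correct"].mean())
```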
