Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
When the rules of reality change, even in trivial ways, can AI adapt as seamlessly as humans do?
This project investigates whether Large Language Models (LLMs) can reason effectively in absurd world models through a custom game called Absurd Soccer. By creating variations of soccer rules, we test whether LLMs can still solve small tasks that humans handle with limited memory once the rules deviate from standard expectations.
This work explores whether models can:
- Adapt to rule variations (e.g., "lowest score wins" instead of "highest score wins")
- Handle symbolic role reversals (e.g., "shooting a net into a ball")
- Maintain reasoning capabilities when world models differ from their training data
Absurd Soccer isolates these capabilities by pairing a familiar domain (soccer) with systematically modified rules. Soccer works well as a testbed because of its:
- Universal familiarity - everyone understands the basic rules
- Simple domain - isolates reasoning from domain complexity
- Systematic variations - rule modifications are clear and controlled
- Easy verification - outcomes are deterministic
- Minimal memory - humans solve these tasks effortlessly
Each variant modifies the REAL ruleset through:
- Changing roles of symbols (e.g., shooting a net into a ball)
- Changing score mechanics (e.g., the team with the lowest score wins)
The unmodified REAL ruleset (a minimal simulation sketch follows the table below):
- Two teams of players
- Each team starts with a score of 0
- Five matches per game
- Teams alternate shooting a ball at a net
- Hitting the net increases score by 1
- Highest score wins
| Ruleset | Modification | Change Type |
|---|---|---|
| REAL | No modification | N/A |
| MISSING | Score increases by 1 when a shot misses | Score mechanics |
| LEAST | Team with the lowest score wins | Score mechanics |
| ICE CREAM | Team earns ice cream when hitting the net | Score mechanics |
| CAR | Team earns cars when hitting the net | Score mechanics |
| SWITCH | Teams shoot a net into a ball (roles reversed) | Roles of symbols |
| MISS & SWITCH | Combination of MISSING and SWITCH | Both |
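To make the mechanics concrete, here is a minimal sketch of how a game could be simulated and scored under these rulesets. Everything in it (the function name, the 50% hit probability, team labels, tie handling) is an illustrative assumption, not the actual implementation in `main.py`:

```python
import random
from typing import Optional

def simulate_game(ruleset: str, matches: int = 5, seed: Optional[int] = None) -> str:
    """Simulate one Absurd Soccer game and return the winning team ('A' or 'B').

    Illustrative sketch only; ties are resolved arbitrarily in favor of team A.
    """
    rng = random.Random(seed)
    scores = {"A": 0, "B": 0}
    for _ in range(matches):
        for team in ("A", "B"):          # teams alternate shots
            hit = rng.random() < 0.5     # assumed 50% chance of hitting the target
            # Under MISSING (and MISS & SWITCH), a *miss* scores instead of a hit.
            scored = (not hit) if "MISS" in ruleset else hit
            scores[team] += int(scored)  # points, ice creams, or cars all count as +1
    # Under LEAST the lowest score wins; every other ruleset keeps highest-wins.
    winner = min if ruleset == "LEAST" else max
    return winner(scores, key=scores.get)
```

Note that SWITCH leaves this scoring logic untouched: reversing the roles of ball and net changes how the game is described in the commentary, not which shots count.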
We evaluate models on two tasks.

Determine Outcome (DO)
- Models receive the game commentary and the ruleset
- Task: Determine which team won
- Metric: Accuracy of predicted winner
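As a rough illustration of the DO setup, a prompt might be assembled like this (the wording and structure are assumptions; the actual template lives in `main.py`):

```python
def build_do_prompt(rules: str, commentary: str) -> str:
    """Assemble a Determine Outcome (DO) prompt: ruleset first, then commentary.

    Illustrative sketch; not the exact template used in the experiments.
    """
    return (
        "You are given the rules of a game and the commentary of one match.\n\n"
        f"Rules:\n{rules}\n\n"
        f"Commentary:\n{commentary}\n\n"
        "Which team won? Answer with the team name only."
    )
```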
Determine Outcome Few-Shot (DOFS)
- Models receive three correct question-answer pairs
- They then answer a new question under the same ruleset
- Metric: Accuracy of predicted winner
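The few-shot variant can be sketched the same way, prepending three solved games under the same ruleset (again an assumption about the formatting, not the exact code):

```python
from typing import List, Tuple

def build_dofs_prompt(rules: str,
                      examples: List[Tuple[str, str]],
                      new_commentary: str) -> str:
    """Assemble a DOFS prompt from (commentary, winner) pairs plus a new game."""
    parts = [f"Rules:\n{rules}"]
    for commentary, winner in examples[:3]:   # exactly three solved examples
        parts.append(f"Commentary:\n{commentary}\nWinner: {winner}")
    parts.append(f"Commentary:\n{new_commentary}\nWinner:")
    return "\n\n".join(parts)
```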
We evaluated 17 models across three categories:
- google/gemini-2.0-flash-001
- openai/gpt-4o-mini
- meta-llama/llama-4-maverick-17b-128e-instruct
- qwen/qwen-2.5-72b-instruct
- google/gemini-2.5-flash
- anthropic/claude-3-5-haiku
- nousresearch/hermes-3-llama-3.1-405b
- sao10k/l3.1-euryale-70b
- meta-llama/llama-3.1-405b-instruct
- thedrummer/skyfall-36b-v2
- openai/gpt-4.1-mini
- google/gemma-2-27b-it
- deepseek/deepseek-r1-0528
- mistralai/magistral-small-2506
- nvidia/llama-3.1-nemotron-ultra-253b-v1
- perplexity/sonar-reasoning
To run the experiments:

1. Python Environment: Ensure Python 3.8+ is installed.
2. API Key: Obtain an OpenRouter API key and keep it ready (a quick sanity-check request is sketched after these steps).
3. Run a Single Task: To evaluate a specific ruleset with a set of models, use the `task_1` function in `main.py`. Example:

   ```bash
   python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --ruleset "Switch" --models "cheap"
   ```
4. Run All Rulesets: To evaluate all rulesets for a specific task, use the `run_all_rulesets` function. Example:

   ```bash
   python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --models "all"
   ```
5. Run Full Experiment: To run all tasks and rulesets, use the `run_full_exp` function. Example:

   ```bash
   python main.py --folder_name "experiment_results" --api_key YOUR_API_KEY --num_sims 100 --models "all"
   ```
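Before launching a long run, you can sanity-check your key with a single OpenRouter chat-completion request. A minimal sketch (the model slug and prompt are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"  # your OpenRouter key

response = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "openai/gpt-4o-mini",  # any slug from the model list above
        "messages": [
            {"role": "user", "content": "Lowest score wins. Team A has 2, Team B has 5. Who wins?"}
        ],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```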
- Results are saved as CSV files in the specified folder.
- Each file contains detailed model responses and outcomes for the experiments.
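Once a run finishes, the CSVs can be aggregated however you like. A minimal sketch with pandas (the folder name and column names are assumptions; inspect one file first, since the actual schema is defined by `main.py`):

```python
import glob
import pandas as pd

# Load every result CSV from the experiment folder.
frames = [pd.read_csv(path) for path in glob.glob("experiment_results/*.csv")]
results = pd.concat(frames, ignore_index=True)

# Accuracy per model and ruleset, assuming 'correct' is a 0/1 column.
print(results.groupby(["model", "ruleset"])["correct"].mean())
```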
