
Absurd World: A Simple Yet Powerful Method to Absurdify the Real World for Probing LLM Reasoning Capabilities

When the rules of reality change, even in trivial ways, can AI adapt as seamlessly as humans do?

License: MIT · Python 3.8+ · Hugging Face Dataset


Try it now: Open In Colab


Overview

This project investigates whether Large Language Models (LLMs) can reason effectively in absurd world models through a custom game called Absurd Soccer. By creating variations of soccer rules, we test whether LLMs can solve small tasks (problems humans can perform with limited working memory) when the rules deviate from standard expectations.

Why This Matters

This work explores whether models can:

  • Adapt to rule variations (e.g., "lowest score wins" instead of "highest score wins")
  • Handle symbolic role reversals (e.g., "shooting a net into a ball")
  • Maintain reasoning capabilities when world models differ from their training data

Absurd Soccer isolates these capabilities by using a familiar domain (soccer) with systematically modified rules.

Why Soccer?

  • Universal familiarity - Everyone understands basic rules
  • Simple domain - Isolates reasoning from complexity
  • Systematic variations - Clear rule modifications
  • Easy verification - Deterministic outcomes
  • Minimal memory - Humans solve these tasks effortlessly

Example Scenario

Absurd Soccer Banner


Methodology

Game: Absurd Soccer

Each variant modifies the REAL ruleset through:

  1. Changing roles of symbols (e.g., shooting a net into a ball)
  2. Changing score mechanics (e.g., team with least score wins)

REAL Ruleset

  • Two teams of players
  • Each team starts with a score of 0
  • Five matches per game
  • Teams alternate shooting a ball at a net
  • Hitting the net increases the shooting team's score by 1
  • The team with the highest score wins
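
To make the mechanics concrete, here is a minimal, illustrative simulation of a game under the REAL ruleset. This is not the repository's implementation; the per-match shot structure and the 50% hit probability are simplifying assumptions:

```python
import random

def play_real_game(num_matches=5, seed=None):
    """Simulate one Absurd Soccer game under the REAL ruleset (toy version)."""
    rng = random.Random(seed)
    scores = {"Team A": 0, "Team B": 0}
    for _ in range(num_matches):
        for team in scores:         # teams alternate shooting a ball at a net
            if rng.random() < 0.5:  # the shot hits the net
                scores[team] += 1   # hitting the net increases the score by 1
    # The team with the highest score wins (ties possible in this toy version).
    return max(scores, key=scores.get)

print(play_real_game(seed=42))
```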

Rule Variants

| Ruleset       | Modification                                   | Change Type      |
|---------------|------------------------------------------------|------------------|
| REAL          | No modification                                | N/A              |
| MISSING       | Score increases by 1 when a shot misses        | Score mechanics  |
| LEAST         | Team with the lowest score wins                | Score mechanics  |
| ICE CREAM     | Team earns ice cream when hitting the net      | Score mechanics  |
| CAR           | Team earns cars when hitting the net           | Score mechanics  |
| SWITCH        | Teams shoot a net into a ball (roles reversed) | Roles of symbols |
| MISS & SWITCH | Combination of MISSING and SWITCH              | Both             |
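
One way to see the variants is as two independent switches over the REAL ruleset: what a shot awards points for, and how the winner is chosen. The encoding below is an illustrative sketch, not the repository's code:

```python
# Each variant as (what scores a point, how the winner is picked).
RULESETS = {
    "REAL":          ("hit",  "highest"),
    "MISSING":       ("miss", "highest"),
    "LEAST":         ("hit",  "lowest"),
    "ICE CREAM":     ("hit",  "highest"),  # points counted as ice creams
    "CAR":           ("hit",  "highest"),  # points counted as cars
    "SWITCH":        ("hit",  "highest"),  # a net is shot into a ball
    "MISS & SWITCH": ("miss", "highest"),  # MISSING + SWITCH combined
}

def decide_winner(scores, ruleset):
    """Pick the winning team from final scores under the given ruleset."""
    _, winner_rule = RULESETS[ruleset]
    pick = max if winner_rule == "highest" else min
    return pick(scores, key=scores.get)

print(decide_winner({"Team A": 3, "Team B": 1}, "LEAST"))  # -> Team B
```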

Tasks

Determine Outcome (DO)

  • Models receive the game commentary and the ruleset
  • Task: Determine which team won
  • Metric: Accuracy of predicted winner
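
A hypothetical sketch of how a DO prompt could be assembled (the exact prompt wording used in the repository may differ):

```python
def build_do_prompt(ruleset_description, commentary):
    """Assemble a Determine Outcome (DO) prompt from a ruleset and commentary."""
    return (
        "You are given the rules of a soccer variant and the commentary "
        "of one game.\n\n"
        "Rules:\n" + ruleset_description + "\n\n"
        "Commentary:\n" + commentary + "\n\n"
        "Question: Which team won the game? Answer with the team name only."
    )
```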

Determine Outcome Few-Shot (DOFS)

  • Models receive three correct question-answer pairs
  • They then answer a new question under the same ruleset
  • Metric: Accuracy of predicted winner
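
DOFS prepends solved examples to the same kind of question; a hypothetical sketch (again, the repository's wording may differ):

```python
def build_dofs_prompt(ruleset_description, examples, new_question):
    """Assemble a DOFS prompt: solved (question, answer) pairs, then a new question."""
    shots = "\n\n".join("Q: %s\nA: %s" % (q, a) for q, a in examples)
    return (
        "Rules:\n" + ruleset_description + "\n\n"
        + shots + "\n\n"
        + "Q: " + new_question + "\nA:"
    )
```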

Models Tested

We evaluated 16 models across three categories:

Cheap Models ($0.1-$0.5 per million input tokens)

  • google/gemini-2.0-flash-001
  • openai/gpt-4o-mini
  • meta-llama/llama-4-maverick-17b-128e-instruct
  • qwen/qwen-2.5-72b-instruct
  • google/gemini-2.5-flash

Expensive Models ($0.5-$1.0 per million input tokens)

  • anthropic/claude-3-5-haiku
  • nousresearch/hermes-3-llama-3.1-405b
  • sao10k/l3.1-euryale-70b
  • meta-llama/llama-3.1-405b-instruct
  • thedrummer/skyfall-36b-v2
  • openai/gpt-4.1-mini
  • google/gemma-2-27b-it

Reasoning Models (specialized for reasoning)

  • deepseek/deepseek-r1-0528
  • mistralai/magistral-small-2506
  • nvidia/llama-3.1-nemotron-ultra-253b-v1
  • perplexity/sonar-reasoning

Running the Experiments

Prerequisites

  1. Python Environment: Ensure Python 3.8+ is installed.
  2. API Key: Obtain an OpenRouter API key and keep it ready.
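
To check that the key works, here is a minimal sketch of a single OpenRouter call through its OpenAI-compatible endpoint (assumes the openai Python package, v1+; the model slug is one from the lists above):

```python
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; point the client at it.
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_API_KEY",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # any slug from the model lists above
    messages=[{"role": "user", "content": "Say hello."}],
)
print(resp.choices[0].message.content)
```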

Steps to Run

  1. Run a Single Task: To evaluate a specific ruleset with a set of models, use the task_1 function in main.py.

     python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --ruleset "Switch" --models "cheap"

  2. Run All Rulesets: To evaluate all rulesets for a specific task, use the run_all_rulesets function.

     python main.py --task DO --api_key YOUR_API_KEY --num_sims 100 --models "all"

  3. Run Full Experiment: To run all tasks and rulesets, use the run_full_exp function.

     python main.py --folder_name "experiment_results" --api_key YOUR_API_KEY --num_sims 100 --models "all"

Output

  • Results are saved as CSV files in the specified folder.
  • Each file contains detailed model responses and outcomes for the experiments.
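
A quick way to summarize the result files with pandas; the folder name matches the example above, but the "correct" column is an assumption, so adjust it to the actual CSV schema:

```python
import glob
import pandas as pd

# Print per-file accuracy; assumes each CSV has a boolean "correct" column.
for path in sorted(glob.glob("experiment_results/*.csv")):
    df = pd.read_csv(path)
    if "correct" in df.columns:
        print("%s: accuracy = %.2f%%" % (path, 100 * df["correct"].mean()))
```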
