k-server-bench

This project aims to automate and accelerate progress toward the k-server conjecture, a central open problem in competitive analysis, using open-ended AI-driven discovery.

The core task is to discover a potential function that satisfies a large, graph-structured system of linear inequalities. The evaluation procedure is sound but incomplete:

  • Any violated inequality definitively refutes a candidate.
  • Satisfying all inequalities does not constitute a formal proof of the conjecture.

Nevertheless, a candidate that satisfies all constraints would provide strong evidence toward a valid proof. To the best of our knowledge, no known potential satisfies all constraints under our formulation for the open circle case with $$k = 4$$.

As such, a successful candidate would already represent a meaningful contribution—and could potentially lead to a full theoretical result when paired with a formal proof.

For additional mathematical background, see the documentation under docs/. A paper link can be added here once the public version is finalized.


Results

Among previously human-discovered solutions for the open circle case with $$k = 4$$, the best known potential has:

  • 17 violations out of ~7 million inequalities

Using the tools in this repository together with coding agents such as Codex, we discovered a potential with:

  • 3 violations out of ~7 million inequalities

One reproducible path to that result is the staged search in examples/search_n7_async_pipeline/: a Codex-developed, human-guided coefficient search for a fixed n=7 index matrix. It first filters candidates on circle_k4_m6, then on hard circle_taxi_k4_m6 edges, then on timeout-limited random taxi edges, and finally runs a full taxi evaluation. Starting from random candidates, this procedure can find a 3-violation coefficient vector in roughly 20-30 minutes.
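The staged idea can be sketched as follows. Everything here is a hedged illustration: `staged_search` and `evaluate_on` are hypothetical stand-ins for the actual scripts in examples/search_n7_async_pipeline/, and only the stage names mirror datasets mentioned above.

```python
# Hypothetical sketch of staged candidate filtering: cheap checks first,
# full evaluation only for survivors. `evaluate_on(candidate, stage)` is
# assumed to return the violation count for that stage.

def staged_search(candidates, evaluate_on, budget=3):
    stages = [
        "circle_k4_m6",        # fast first filter
        "circle_taxi_k4_m6",   # known-hard edges
        "random_taxi_sample",  # timeout-limited random taxi edges
    ]
    survivors = candidates
    for stage in stages:
        # Drop any candidate exceeding the violation budget at this stage.
        survivors = [c for c in survivors if evaluate_on(c, stage) <= budget]
    # Full taxi evaluation on whatever remains; None if nothing survived.
    return min(survivors,
               key=lambda c: evaluate_on(c, "taxi_full"),
               default=None)
```

The point of the staging is cost ordering: each stage is more expensive than the previous one, so most candidates are rejected before the full evaluation ever runs.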

This problem is an extremely challenging optimization task:

  • The search space is vast
  • Improvements are incremental and rare
  • Each evaluation requires checking millions of inequalities

These factors make large-scale exploration computationally and algorithmically demanding.


Target Audience

This repository may be useful as:

  1. Research tooling: for researchers exploring potential-function approaches to the $$k$$-server problem, particularly the open circle $$k = 4$$ case and beyond.

  2. AI discovery benchmark: a benchmark for agentic, code-based mathematical discovery, where progress is measured via automatically computed violations.


Core Concepts

Candidate (Potential Function)

A candidate is typically implemented as a Python class defining a potential function:

import numpy as np


class Potential:

  def __init__(
    self,
    context,   # k-server parameters (metric space, number of servers, etc.)
    **kwargs   # optional hyperparameters
  ):
    self.context = context
    ...

  def __call__(
    self,
    wf: np.ndarray   # representation of a graph node
  ) -> float:
    ...

See examples for both human-designed and AI-discovered potentials. The staged n=7 Codex search example also documents the exact reproduction command and search-family settings.
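For illustration only, here is a minimal candidate that fits the interface above; the class name `ZeroPotential` is our own, and the always-zero behavior matches the sanity-check prompt later in this README.

```python
import numpy as np


class ZeroPotential:
    """Trivial candidate: assigns potential 0 to every graph node."""

    def __init__(self, context, **kwargs):
        self.context = context  # k-server parameters, unused here

    def __call__(self, wf: np.ndarray) -> float:
        # Ignore the node representation entirely.
        return 0.0
```

With $$\Phi \equiv 0$$, every edge inequality reduces to $$0 \geq \nabla(u, r, c)$$, so the evaluator reports a violation exactly on edges where $$\nabla$$ is positive.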


Evaluator

The evaluator is a Python program that:

  • Computes potential values
  • Checks inequalities over a precomputed dataset of instances (see metrics/)

Each instance corresponds to a metric space $$M$$ and defines a graph $$(V, E)$$, where:

  • Nodes $$v \in V$$ are parameterized by vectors $$w \in \mathbb{R}^d$$
  • Edges $$(u, v) \in E$$ are parameterized by requests $$r \in M$$

The goal is to find a potential $$\Phi$$ such that for every edge $$(u, v, r)$$:

$$ \Phi(v) - \Phi(u) \geq \nabla(u, r, c), $$

where:

$$ \nabla(u, r, c) = \max_X \big(w_v(X) - w_u(X)\big) - (c + 1)\Big(\min_X w_v(X) - \min_X w_u(X)\Big). $$

The evaluator reports:

  • Number of violations
  • Additional diagnostic statistics

See tools/ for details.
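Schematically, the per-edge check the evaluator performs reduces to the sketch below. This is a hedged illustration, not the actual evaluator: `count_violations` is a hypothetical helper, edges are assumed to arrive as (w_u, w_v, nabla) triples, and `nabla` stands for the precomputed right-hand side of the formula above.

```python
def count_violations(phi, edges, tol=1e-9):
    """Count edges (w_u, w_v, nabla) where phi(w_v) - phi(w_u) >= nabla fails.

    `phi` maps a node's weight vector to a float; a small tolerance guards
    against spurious floating-point violations.
    """
    violations = 0
    for w_u, w_v, nabla in edges:
        if phi(w_v) - phi(w_u) < nabla - tol:
            violations += 1
    return violations
```

A candidate is refuted as soon as this count is nonzero on any metric instance, which is why cheap instances are worth checking first.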

Quickstart

Git LFS (required for metrics)

This repository uses Git LFS to store large precomputed metric datasets.

Before cloning the repository, ensure Git LFS is installed and initialized:

git lfs install
git clone <repo-url>
cd <repo>
git lfs pull

Without Git LFS, the files in metrics/ will not be downloaded correctly, and the evaluator will not function.


Installation

The evaluator depends on the k-servers package, which can be installed in an environment separate from the agent:

pip install -e ./k-servers

This separation is recommended when running agents in isolated or containerized environments.


Coding Agents

To prepare a workspace with benchmark documentation and source materials:

bash agents/setup.sh <target-dir>

This setup is agent-agnostic and copies only the necessary benchmark-facing docs and papers into the target directory.

Running the evaluator

You have two main options:

1. Direct invocation (simplest). Instruct your coding agent to run:

/abs/path/to/tools/evaluator/evaluate.py

This can be added to agent context files (e.g., AGENTS.md, CLAUDE.md) so that the agent can call the evaluator when needed.


2. MCP integration

Alternatively, you can expose the evaluator via MCP using the entrypoint:

tools/evaluator/evaluate_mcp_server.py

This allows agents to interact with the evaluator as a tool rather than invoking it directly.

For Codex, we provide a convenience setup script that performs both:

  • the base workspace setup
  • MCP configuration

pip install tomli tomli_w mcp
bash agents/codex/setup.sh <target-dir>

For additional details (including split-environment setups), see agents/codex/README.md.

We also provide an optional autoresearch layer that can be applied on top of an existing workspace setup. It adds an AGENTS.md file plus an autoresearch-style program.md.

bash agents/autoresearch/setup.sh <target-dir>

See agents/autoresearch/README.md.


Sanity Check

After setup:

  1. Enter the target directory.
  2. Start the agent.
  3. Run a simple test task.

Example prompt:

Implement a very simple potential that always returns zero and evaluate it on circle_k3_m6.pickle

Expected output (truncated):

{
  "correct": true,
  "error": null,
  "combined_score": 0.7285714285714286,
  "public": {
    "0/violations_k": 570,
    "0/violations_k_score": 0.7285714285714286
  }
}
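A sketch of consuming this report programmatically; the field names are taken from the sample output above, but treat the exact schema as an assumption and check your evaluator's actual output.

```python
import json

# Parse the evaluator's JSON report (sample values from the sanity check).
report = json.loads("""
{
  "correct": true,
  "error": null,
  "combined_score": 0.7285714285714286,
  "public": {
    "0/violations_k": 570,
    "0/violations_k_score": 0.7285714285714286
  }
}
""")

# Gate on correctness before trusting any scores.
assert report["correct"] and report["error"] is None
violations = report["public"]["0/violations_k"]
print(violations)  # 570
```

For a real run, replace the inline string with the evaluator's stdout or report file.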

Discovery Agents

For discovery systems (e.g., AlphaEvolve), the main evaluator entry point is:

tools/evaluator/evaluate.py

Key parameters:

  • --program_path: path to the candidate implementation
  • --metrics_names: comma-separated list of metric files

See:


Tasks, Hints, and Presets

The tasks/ directory contains reusable task definitions, including:

  • Goals
  • Implementation constraints
  • Hints to reduce the search space

Predefined tasks:


Docker

You can build a fully reproducible environment using Docker.

Build images

docker build -t k-server-env:latest -f docker/Dockerfile .

docker build -t generic-k-server:latest -f agents/Dockerfile .

Or, if you specifically want the Codex image:

docker build -t codex-k-server:latest -f agents/codex/Dockerfile .

  • k-server-env: base environment with dependencies and tooling
  • generic-k-server: base environment plus a generic coding-agent workspace bootstrap, usable with tools such as Claude
  • codex-k-server: environment configured for running Codex agents

Run Codex agent container

docker run --rm -it \
  -v $HOME/.codex/auth.json:/run/secrets/codex-auth.json \
  codex-k-server:latest

This mounts your Codex authentication file into the container and launches an interactive session.

To layer autoresearch into the same image, invoke its entrypoint directly:

docker run --rm -it <BASE_IMAGE> /home/kserver/k-server-bench/agents/autoresearch/entrypoint.sh

where <BASE_IMAGE> might be codex-k-server:latest or generic-k-server:latest.


Documentation Guide

Start here:


Repository Structure

About

Executable benchmark for AI-driven potential discovery for the k-server conjecture, with evaluators, metric instances, and released experiment workflows.
