k-server-bench

This project aims to automate and accelerate progress toward the k-server conjecture, a central open problem in competitive analysis, using open-ended AI-driven discovery.

The core task is to discover a potential function that satisfies a large, graph-structured system of linear inequalities. The evaluation procedure is sound but incomplete:

  • Any violated inequality definitively refutes a candidate.
  • Satisfying all inequalities does not constitute a formal proof of the conjecture.

Nevertheless, a candidate that satisfies all constraints would provide strong evidence toward a valid proof. To the best of our knowledge, no known potential satisfies all constraints under our formulation for the open circle case with $$k = 4$$.

As such, a successful candidate would already represent a meaningful contribution—and could potentially lead to a full theoretical result when paired with a formal proof.

For additional mathematical background, see the documentation under docs/. A paper link can be added here once the public version is finalized.


Results

Among previously human-discovered solutions for the open circle case with $$k = 4$$, the best known potential has:

  • 17 violations out of ~7 million inequalities

Using the tools in this repository together with coding agents such as Codex, we discovered a potential with:

  • 3 violations out of ~7 million inequalities

One reproducible path to that result is the staged search in examples/search_n7_async_pipeline/: a Codex-developed, human-guided coefficient search for a fixed n=7 index matrix. It first filters candidates on circle_k4_m6, then on hard circle_taxi_k4_m6 edges, then on timeout-limited random taxi edges, and finally runs a full taxi evaluation. Starting from random candidates, this procedure can find a 3-violation coefficient vector in roughly 20-30 minutes.
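The staged idea can be sketched as follows. Everything here is a hedged illustration: `staged_search` and `evaluate_on` are hypothetical stand-ins for the actual scripts in examples/search_n7_async_pipeline/, and only the stage names mirror datasets mentioned above.

```python
# Hypothetical sketch of staged candidate filtering: cheap checks first,
# full evaluation only for survivors. `evaluate_on(candidate, stage)` is
# assumed to return the violation count for that stage.

def staged_search(candidates, evaluate_on, budget=3):
    stages = [
        "circle_k4_m6",        # fast first filter
        "circle_taxi_k4_m6",   # known-hard edges
        "random_taxi_sample",  # timeout-limited random taxi edges
    ]
    survivors = candidates
    for stage in stages:
        # Drop any candidate exceeding the violation budget at this stage.
        survivors = [c for c in survivors if evaluate_on(c, stage) <= budget]
    # Full taxi evaluation on whatever remains; None if nothing survived.
    return min(survivors,
               key=lambda c: evaluate_on(c, "taxi_full"),
               default=None)
```

The point of the staging is cost ordering: each stage is more expensive than the previous one, so most candidates are rejected before the full evaluation ever runs.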

This problem is an extremely challenging optimization task:

  • The search space is vast
  • Improvements are incremental and rare
  • Each evaluation requires checking millions of inequalities

These factors make large-scale exploration computationally and algorithmically demanding.


Target Audience

This repository may be useful as:

  1. Research tooling: for researchers exploring potential-function approaches to the $$k$$-server problem, particularly the open circle $$k = 4$$ case and beyond.

  2. AI discovery benchmark: a benchmark for agentic, code-based mathematical discovery, where progress is measured via automatically computed violations.


Core Concepts

Candidate (Potential Function)

A candidate is typically implemented as a Python class defining a potential function:

import numpy as np


class Potential:

  def __init__(
    self,
    context,   # k-server parameters (metric space, number of servers, etc.)
    **kwargs   # optional hyperparameters
  ):
    self.context = context
    ...

  def __call__(
    self,
    wf: np.ndarray   # representation of a graph node
  ) -> float:
    ...

See examples for both human-designed and AI-discovered potentials. The staged n=7 Codex search example also documents the exact reproduction command and search-family settings.
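For illustration only, here is a minimal candidate that fits the interface above; the class name `ZeroPotential` is our own, and the always-zero behavior matches the sanity-check prompt later in this README.

```python
import numpy as np


class ZeroPotential:
    """Trivial candidate: assigns potential 0 to every graph node."""

    def __init__(self, context, **kwargs):
        self.context = context  # k-server parameters, unused here

    def __call__(self, wf: np.ndarray) -> float:
        # Ignore the node representation entirely.
        return 0.0
```

With $$\Phi \equiv 0$$, every edge inequality reduces to $$0 \geq \nabla(u, r, c)$$, so the evaluator reports a violation exactly on edges where $$\nabla$$ is positive.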


Evaluator

The evaluator is a Python program that:

  • Computes potential values
  • Checks inequalities over a precomputed dataset of instances (see metrics/)

Each instance corresponds to a metric space $$M$$ and defines a graph $$(V, E)$$, where:

  • Nodes $$v \in V$$ are parameterized by vectors $$w \in \mathbb{R}^d$$
  • Edges $$(u, v) \in E$$ are parameterized by requests $$r \in M$$

The goal is to find a potential $$\Phi$$ such that for every edge $$(u, v, r)$$:

$$ \Phi(v) - \Phi(u) \geq \nabla(u, r, c), $$

where:

$$ \nabla(u, r, c) = \max_X \big(w_v(X) - w_u(X)\big) - (c + 1)\Big(\min_X w_v(X) - \min_X w_u(X)\Big). $$

The evaluator reports:

  • Number of violations
  • Additional diagnostic statistics

See tools/ for details.
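Schematically, the per-edge check the evaluator performs reduces to the sketch below. This is a hedged illustration, not the actual evaluator: `count_violations` is a hypothetical helper, edges are assumed to arrive as (w_u, w_v, nabla) triples, and `nabla` stands for the precomputed right-hand side of the formula above.

```python
def count_violations(phi, edges, tol=1e-9):
    """Count edges (w_u, w_v, nabla) where phi(w_v) - phi(w_u) >= nabla fails.

    `phi` maps a node's weight vector to a float; a small tolerance guards
    against spurious floating-point violations.
    """
    violations = 0
    for w_u, w_v, nabla in edges:
        if phi(w_v) - phi(w_u) < nabla - tol:
            violations += 1
    return violations
```

A candidate is refuted as soon as this count is nonzero on any metric instance, which is why cheap instances are worth checking first.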

Quickstart

Git LFS (required for metrics)

This repository uses Git LFS to store large precomputed metric datasets.

Before cloning the repository, ensure Git LFS is installed and initialized:

git lfs install
git clone <repo-url>
cd <repo>
git lfs pull

Without Git LFS, the files in metrics/ will not be downloaded correctly, and the evaluator will not function.


Installation

The evaluator depends on the k-servers package, which can be installed in an environment separate from the agent:

pip install -e ./k-servers

This separation is recommended when running agents in isolated or containerized environments.


Coding Agents

To prepare a workspace with benchmark documentation and source materials:

bash agents/setup.sh <target-dir>

This setup is agent-agnostic and copies only the necessary benchmark-facing docs and papers into the target directory.

Running the evaluator

You have two main options:

1. Direct invocation (simplest). Instruct your coding agent to run:

/abs/path/to/tools/evaluator/evaluate.py

This can be added to agent context files (e.g., AGENTS.md, CLAUDE.md) so that the agent can call the evaluator when needed.


2. MCP integration

Alternatively, you can expose the evaluator via MCP using the entrypoint:

tools/evaluator/evaluate_mcp_server.py

This allows agents to interact with the evaluator as a tool rather than invoking it directly.

For Codex, we provide a convenience setup script that performs both:

  • the base workspace setup
  • MCP configuration

pip install tomli tomli_w mcp
bash agents/codex/setup.sh <target-dir>

For additional details (including split-environment setups), see agents/codex/README.md.

We also provide an optional autoresearch layer that can be applied on top of an existing workspace setup. It adds an AGENTS.md file plus an autoresearch-style program.md.

bash agents/autoresearch/setup.sh <target-dir>

See agents/autoresearch/README.md.


Sanity Check

After setup:

  1. Enter the target directory.
  2. Start the agent.
  3. Run a simple test task.

Example prompt:

Implement a very simple potential that always returns zero and evaluate it on circle_k3_m6.pickle

Expected output (truncated):

{
  "correct": true,
  "error": null,
  "combined_score": 0.7285714285714286,
  "public": {
    "0/violations_k": 570,
    "0/violations_k_score": 0.7285714285714286
  }
}
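A sketch of consuming this report programmatically; the field names are taken from the sample output above, but treat the exact schema as an assumption and check your evaluator's actual output.

```python
import json

# Parse the evaluator's JSON report (sample values from the sanity check).
report = json.loads("""
{
  "correct": true,
  "error": null,
  "combined_score": 0.7285714285714286,
  "public": {
    "0/violations_k": 570,
    "0/violations_k_score": 0.7285714285714286
  }
}
""")

# Gate on correctness before trusting any scores.
assert report["correct"] and report["error"] is None
violations = report["public"]["0/violations_k"]
print(violations)  # 570
```

For a real run, replace the inline string with the evaluator's stdout or report file.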

Discovery Agents

For discovery systems (e.g., AlphaEvolve), the main evaluator entry point is:

tools/evaluator/evaluate.py

Key parameters:

  • --program_path: path to the candidate implementation
  • --metrics_names: comma-separated list of metric files

See:


Tasks, Hints, and Presets

The tasks/ directory contains reusable task definitions, including:

  • Goals
  • Implementation constraints
  • Hints to reduce the search space

Predefined tasks:


Docker

You can build a fully reproducible environment using Docker.

Build images

docker build -t k-server-env:latest -f docker/Dockerfile .

docker build -t generic-k-server:latest -f agents/Dockerfile .

Or, if you specifically want the Codex image:

docker build -t codex-k-server:latest -f agents/codex/Dockerfile .

  • k-server-env: base environment with dependencies and tooling
  • generic-k-server: base environment plus a generic coding-agent workspace bootstrap, usable with tools such as Claude
  • codex-k-server: environment configured for running Codex agents

Run Codex agent container

docker run --rm -it \
  -v $HOME/.codex/auth.json:/run/secrets/codex-auth.json \
  codex-k-server:latest

This mounts your Codex authentication file into the container and launches an interactive session.

To layer autoresearch into the same image, invoke its entrypoint directly:

docker run --rm -it <BASE_IMAGE> /home/kserver/k-server-bench/agents/autoresearch/entrypoint.sh

where <BASE_IMAGE> might be codex-k-server:latest or generic-k-server:latest.


Documentation Guide

Start here:


Repository Structure

About

Executable benchmark for AI-driven potential discovery for the k-server conjecture, with evaluators, metric instances, and released experiment workflows.
