Anonymous supplementary code for the paper "Compute Allocation in Evolutionary Search: From Depth–Breadth to Multi-Armed Bandits" (under double-blind review).
pip install -r requirements.txt
# Point at any OpenAI-compatible chat endpoint (e.g. vLLM serving Qwen3-8B):
export LLM_API_BASE=http://localhost:8000/v1
export LLM_API_KEY=dummyEvery generation the LLM is asked for N independent full-rewrites of the current best
program in parallel; the best evaluated child replaces the parent if it is strictly
better. Total budget per run is C = T * N LLM calls.
# Circle packing on Qwen3-8B with T=8 generations and N=64 children per generation.
python run_greedy.py --task cp --model qwen3-8b --T 8 --N 64 --seed 40
python run_greedy.py --task mmd --model qwen3-14b --T 16 --N 32 --seed 41
python run_greedy.py --task ht --model llama-8b --T 4 --N 128 --seed 42Results land at results/<task>/<model>/greedy_C<C>_T<T>_N<N>_seed<seed>.json.
The JSON contains the full per-generation trajectory: best fitness, per-child fitness,
parent / child seeds, per-attempt OpenAI usage payload, and finish reasons.
K greedy arms (one seed each) are kept alive in parallel. Each round, the bandit picks
one arm and we advance only that arm by exactly one greedy generation (N children).
The arm's new best fitness is fed back to the bandit as the round's reward. Total LLM
calls = T_total * N regardless of how the bandit splits the rounds across arms.
# Online UCB1 over 10 parallel CP arms; total budget C = 8*64 = 512 LLM calls.
python run_bandit.py --task cp --model qwen3-8b --algo ucb --T 8 --N 64 --n-arms 10
python run_bandit.py --task mmd --model qwen3-8b --algo exp3p --T 8 --N 64 --n-arms 10
python run_bandit.py --task cp --model qwen3-14b --algo thompson --T 8 --N 64 --n-arms 10
python run_bandit.py --task ht --model llama-8b --algo random --T 8 --N 64 --n-arms 10Each run writes one JSON to results/bandit/<task>/<model>/<algo>_C<C>_T<T>_K<n_arms>_seed<bandit_seed>.json
containing: per-round arm selection, per-round reward, running-max curve, per-arm pull
counts, and the full per-arm greedy trajectory.
- Task: drop
initial_program.py+evaluator.pyundertasks/<key>/, then add aTaskSpecentry toallocator/tasks.py(with the system prompt for the LLM). - Model: add a
ModelSpecentry toallocator/models.py. By default all models share theLLM_API_BASE/LLM_API_KEYenv vars; overrideapi_base_env/api_key_envif you need per-model endpoints.
A common companion to this online orchestrator is a post-hoc bandit that replays UCB / EXP3.P / Thompson / Random over already-logged per-seed fitness trajectories. The implementation here is the online counterpart: the bandit makes each choice live, and the chosen arm's next generation is actually rolled out (LLM calls and evaluations happen during the bandit loop).
Same algorithmic FLOPs convention used throughout the paper:
flops_per_attempt = 2 · params_active · (uncached_prompt_tokens + completion_tokens)
uncached_prompt_tokens = usage.prompt_tokens − usage.prompt_tokens_details.cached_tokens
params_activeis per-model and lives inallocator/models.py. Dense Qwen3 / Llama models haveparams_active == params_total; the field exists so MoE variants can be added without a schema change.- Each LLM call records one
attemptper retry (including timed-out / errored ones) with the OpenAIusagepayload andfinish_reason. Attempts are stored on every child of every generation in the run trajectory. - Per-run aggregation supports three timeout policies for attempts with no
usage:worst_case(default — countmax_tokensas the output),exclude(drop), andbounded(median completion amongfinish_reason == "length"attempts in the same run).
Result JSONs include:
- Greedy:
result["flops"] = {worst_case, exclude, bounded}plusresult["flops_per_generation"](worst_case, one float per generation). - Bandit:
result["flops_per_round"],result["cumulative_flops"],result["total_flops_worst_case"], and full per-armflopssummaries underresult["arm_results"][i]["flops"].