| title | CacheForge Environment | |
|---|---|---|
| emoji | ⚡ | |
| colorFrom | indigo | |
| colorTo | blue | |
| sdk | docker | |
| pinned | false | |
| app_port | 8000 | |
| base_path | /web | |
| tags | | |
A production-grade multi-tier cache optimisation environment built on the OpenEnv specification. An RL agent observes live cache health metrics and tunes TTL, capacity, eviction policy, and tier placement to maximise hit rate while minimising latency and memory waste. Includes task-based evaluation with deterministic graders (0.0–1.0 scoring).
Designed to evaluate both reinforcement learning policies and LLM-based decision agents.
This environment is fully compliant with OpenEnv and supports both local Docker execution and remote evaluation via Hugging Face Spaces.
Every large-scale web service — Google, Netflix, Amazon, Cloudflare — relies on multi-tier caching to serve billions of requests per second. Manual cache tuning is often suboptimal and brittle:
- Static TTLs can't adapt to shifting traffic patterns.
- Over-provisioned caches waste memory; under-provisioned ones spike latency.
- Eviction policy choice (LRU vs LFU vs FIFO) depends on workload skew.
CacheForge models this problem as an RL environment: the agent receives real-time cache telemetry and must learn a policy that generalises across traffic patterns of increasing difficulty.
┌─────────┐   observation   ┌───────┐   action    ┌─────────────┐
│  Agent  │ ◄────────────── │ Cache │ ◄────────── │    Agent    │
│  (LLM)  │ ──────────────► │  Env  │ ──────────► │  Decision   │
└─────────┘     reward      └───────┘             └─────────────┘
- Observe: hit rate, latency, memory usage, request distribution
- Act: adjust TTL, resize capacity, set eviction policy, shift tiers
- Reward: composite signal balancing hit rate, latency, and memory
- Repeat for up to 200 steps per episode
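The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's client: `ToyEnv`, `run_episode`, `noop_policy`, and the dict-shaped observations are hypothetical stand-ins. Only the `done` and `reward` fields, the four action fields, and the 200-step cap are taken from this README.

```python
from typing import Any, Callable, Dict

MAX_STEPS = 200  # per-episode cap stated in this README

Observation = Dict[str, Any]
Action = Dict[str, Any]


class ToyEnv:
    """Hypothetical stand-in for the real environment (not CacheForge itself)."""

    def __init__(self, episode_length: int) -> None:
        self.episode_length = episode_length
        self.t = 0

    def reset(self) -> Observation:
        self.t = 0
        return {"hit_rate": 0.5, "reward": 0.0, "done": False}

    def step(self, action: Action) -> Observation:
        self.t += 1
        # A fixed reward of 1.0 per step keeps the example easy to check.
        return {"hit_rate": 0.5, "reward": 1.0,
                "done": self.t >= self.episode_length}


def noop_policy(obs: Observation) -> Action:
    """Leaves the cache configuration unchanged."""
    return {"adjust_ttl": 0, "resize_cache": 0.0,
            "eviction_policy": "LRU", "tier_shift": "none"}


def run_episode(env: ToyEnv, policy: Callable[[Observation], Action],
                max_steps: int = MAX_STEPS) -> float:
    """Observe, act, collect reward; stop on `done` or after max_steps."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs = env.step(action)
        total_reward += obs["reward"]
        if obs["done"]:
            break
    return total_reward
```

Swapping `noop_policy` for an LLM-backed or learned policy leaves the loop itself unchanged.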
| Field | Type | Range | Description |
|---|---|---|---|
| hit_rate | float | 0.0 – 1.0 | Cache hit rate across all tiers |
| miss_rate | float | 0.0 – 1.0 | Cache miss rate (1 - hit_rate) |
| avg_latency | float | ≥ 0.0 | Average request latency (ms) |
| memory_usage | float | ≥ 0.0 | Total memory as fraction of capacity |
| request_rate | int | ≥ 0 | Requests processed this step |
| hot_keys_ratio | float | 0.0 – 1.0 | Fraction of requests hitting hot keys |
| cache_distribution | dict | n/a | Per-tier utilisation {L1, L2, L3} |
| done | bool | n/a | Episode termination flag |
| reward | float | n/a | Step reward value |
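As a rough illustration of the observation fields and their documented ranges, the sketch below models them with a standard-library dataclass. The project's own models are Pydantic classes in models.py; `ObservationSketch` and its validation rules are assumptions made for this example and may not match the real model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ObservationSketch:
    """Hypothetical mirror of the observation table (not the real model)."""

    hit_rate: float
    miss_rate: float
    avg_latency: float
    memory_usage: float
    request_rate: int
    hot_keys_ratio: float
    cache_distribution: Dict[str, float]
    done: bool
    reward: float

    def __post_init__(self) -> None:
        # Fields documented with a 0.0 to 1.0 range.
        for name in ("hit_rate", "miss_rate", "hot_keys_ratio"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        # Fields documented as non-negative.
        if self.avg_latency < 0.0 or self.memory_usage < 0.0:
            raise ValueError("avg_latency and memory_usage must be >= 0")
        if self.request_rate < 0:
            raise ValueError("request_rate must be >= 0")
        # miss_rate is documented as 1 - hit_rate; allow for rounding.
        if abs(self.hit_rate + self.miss_rate - 1.0) > 1e-6:
            raise ValueError("miss_rate should equal 1 - hit_rate")
```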
| Field | Type | Range | Description |
|---|---|---|---|
| adjust_ttl | int | -10 to +10 | Global TTL delta (seconds) |
| resize_cache | float | -0.2 to +0.2 | Relative capacity resize |
| eviction_policy | str | "LRU" / "LFU" / "FIFO" | Eviction strategy |
| tier_shift | str | "none" / "L1→L2" / "L2→L3" | Tier data migration |
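An agent, especially an LLM-based one, may propose values outside the ranges in the action table. The sketch below shows one way to sanitise an action before sending it. `sanitise_action` is a hypothetical helper, and the README does not say whether the server clamps or rejects out-of-range values, so clamping here is an assumption.

```python
from typing import Any, Dict

VALID_POLICIES = ("LRU", "LFU", "FIFO")
VALID_TIER_SHIFTS = ("none", "L1→L2", "L2→L3")


def sanitise_action(adjust_ttl: int, resize_cache: float,
                    eviction_policy: str, tier_shift: str) -> Dict[str, Any]:
    """Clamp numeric fields to their documented ranges; reject unknown enums."""
    if eviction_policy not in VALID_POLICIES:
        raise ValueError(f"unknown eviction_policy: {eviction_policy!r}")
    if tier_shift not in VALID_TIER_SHIFTS:
        raise ValueError(f"unknown tier_shift: {tier_shift!r}")
    return {
        # TTL delta is documented as -10 to +10 seconds.
        "adjust_ttl": max(-10, min(10, int(adjust_ttl))),
        # Relative resize is documented as -0.2 to +0.2.
        "resize_cache": max(-0.2, min(0.2, float(resize_cache))),
        "eviction_policy": eviction_policy,
        "tier_shift": tier_shift,
    }
```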
CacheForge defines 3 tasks of increasing difficulty that map to different workload generators. Each task is evaluated using a deterministic grader returning a score in [0.0, 1.0]. Tasks are automatically selectable via reset(mode=...).
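To make the idea of per-task workload generators concrete, here is a rough sketch of how Zipf-distributed key requests could be drawn for each mode. Only α = 1.2 for the easy mode, the 50-step switching interval, and the "sine wave plus Gaussian noise" shape come from the task descriptions. The second α value, the sine amplitude and period, the noise scale, and every function name are assumptions chosen for illustration, not the environment's actual generator.

```python
import math
import random
from typing import List


def zipf_weights(n_keys: int, alpha: float) -> List[float]:
    """Unnormalised Zipf weights: the key at rank r gets weight 1 / r**alpha."""
    return [1.0 / (rank ** alpha) for rank in range(1, n_keys + 1)]


def alpha_for_step(mode: str, step: int, rng: random.Random) -> float:
    """Return the Zipf exponent used at a given step (constants are assumed)."""
    if mode == "easy":
        return 1.2  # static alpha from the task description
    if mode == "medium":
        # Switch between two distributions every 50 steps; 0.8 is assumed.
        return 1.2 if (step // 50) % 2 == 0 else 0.8
    if mode == "hard":
        # Sine-wave modulation plus Gaussian noise; all constants are assumed.
        alpha = 1.0 + 0.4 * math.sin(2 * math.pi * step / 100)
        alpha += rng.gauss(0.0, 0.05)
        return max(0.1, alpha)
    raise ValueError(f"unknown mode: {mode}")


def sample_requests(mode: str, step: int, n_requests: int,
                    n_keys: int, rng: random.Random) -> List[int]:
    """Draw key indices for one step; index 0 is the hottest key."""
    weights = zipf_weights(n_keys, alpha_for_step(mode, step, rng))
    return rng.choices(range(n_keys), weights=weights, k=n_requests)
```

A larger α concentrates traffic on fewer keys, so a policy tuned on the easy workload has to adapt when the exponent moves in the medium and hard modes.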
| Mode | Workload | Goal | Grader |
|---|---|---|---|
| easy | Static Zipf (α = 1.2), fixed key-space | Maximise cache hit rate | score = clamp(hit_rate / 0.75, 0.001, 0.998) |
| medium | Alternating between two Zipf distributions every 50 steps | Balance hit rate and latency | score = 0.5 × hit_rate + 0.5 × (1 - min(latency / 50, 1)) |
| hard | Sine-wave α modulation + Gaussian noise, continuous drift | Maintain high, stable performance under unpredictable load | 4-component score including stability bonus (hit-rate variance penalty) |

The hard-mode score is:

score = 0.4 × hit_rate
      + 0.2 × (1 - normalised_latency)
      + 0.2 × (1 - memory_penalty)
      + 0.2 × stability_bonus
The stability component discourages agents that spike then crash — consistent performance is rewarded.
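The three grader formulas translate directly into code. The sketch below is an illustration, not the contents of tasks.py: the function names are hypothetical, `memory_penalty` and `stability_bonus` are taken as ready-made inputs because their definitions are not given here, and reusing the easy grader's 0.001 and 0.998 bounds for the medium and hard scores is an assumption.

```python
SCORE_MIN = 0.001  # lower bound from the easy grader formula
SCORE_MAX = 0.998  # upper bound from the easy grader formula


def clamp(value: float, low: float = SCORE_MIN, high: float = SCORE_MAX) -> float:
    """Restrict a raw score to [low, high]."""
    return max(low, min(high, value))


def grade_easy(hit_rate: float) -> float:
    """Easy task: hit rate relative to a 0.75 target."""
    return clamp(hit_rate / 0.75)


def grade_medium(hit_rate: float, latency_ms: float) -> float:
    """Medium task: equal weight on hit rate and latency (normalised by 50 ms)."""
    raw = 0.5 * hit_rate + 0.5 * (1.0 - min(latency_ms / 50.0, 1.0))
    return clamp(raw)  # clamping here is assumed, see lead-in


def grade_hard(hit_rate: float, normalised_latency: float,
               memory_penalty: float, stability_bonus: float) -> float:
    """Hard task: the 4-component weighted score; inputs expected in [0, 1]."""
    raw = (0.4 * hit_rate
           + 0.2 * (1.0 - normalised_latency)
           + 0.2 * (1.0 - memory_penalty)
           + 0.2 * stability_bonus)
    return clamp(raw)  # clamping here is assumed, see lead-in
```

For example, a hit rate of 0.375 on the easy task gives 0.375 / 0.75 = 0.5, while any hit rate of 0.75 or above saturates at the 0.998 cap.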
All graders keep scores strictly inside the open interval (0, 1) to comply with evaluation constraints. The maximum is capped at 0.998 rather than 1.0 so that rounding cannot produce an invalid boundary value (e.g., "1.000") during evaluation.
Per-step reward (continuous, non-sparse):
reward = +2.0 × hit_rate
-1.5 × normalised_latency (latency / 50ms)
-1.0 × memory_overuse (max(0, usage - 0.85))
This provides a meaningful partial-progress signal every step.
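The per-step reward above can be written as a small function. This is a direct transcription of the formula; the name `step_reward` is hypothetical, and the latency term is left uncapped because the formula does not mention a cap.

```python
def step_reward(hit_rate: float, avg_latency_ms: float,
                memory_usage: float) -> float:
    """Composite reward: favour hits, penalise latency and memory overuse."""
    normalised_latency = avg_latency_ms / 50.0       # latency / 50 ms
    memory_overuse = max(0.0, memory_usage - 0.85)   # only usage above 0.85 counts
    return 2.0 * hit_rate - 1.5 * normalised_latency - 1.0 * memory_overuse
```

As a worked example, a hit rate of 0.8, a latency of 10 ms, and a memory usage of 0.9 give 2.0 × 0.8 - 1.5 × 0.2 - 1.0 × 0.05 = 1.25.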
# Install dependencies
uv sync
# Start the environment server
uv run python -m server.app
# Server runs at http://localhost:8000
# API docs at http://localhost:8000/docs

# Build (Dockerfile is at project root)
docker build -t cacheforge-env:latest .
# Run
docker run -p 8000:8000 cacheforge-env:latest

openenv push --repo-id <your-username>/cacheforge

CacheForge is deployed and publicly accessible on Hugging Face Spaces:
👉 https://tuhindev2029-cacheforge.hf.space
curl https://tuhindev2029-cacheforge.hf.space/health

curl -X POST https://tuhindev2029-cacheforge.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{}'

curl -X POST https://tuhindev2029-cacheforge.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{
    "action": {
      "adjust_ttl": 1,
      "resize_cache": 0.05,
      "eviction_policy": "LRU",
      "tier_shift": "none"
    }
  }'

The API is stateless per episode: a /reset call is required before each /step sequence.
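For scripts that cannot use curl or the bundled client, the same `/step` call can be prepared with the Python standard library. The sketch below only builds the request and does not send it. `build_step_request` is a hypothetical helper, and the payload shape is copied from the curl example; the response schema is not documented here, so no parsing is shown.

```python
import json
import urllib.request


def build_step_request(base_url: str, adjust_ttl: int, resize_cache: float,
                       eviction_policy: str, tier_shift: str) -> urllib.request.Request:
    """Build (but do not send) a POST /step request with a JSON action body."""
    payload = {
        "action": {
            "adjust_ttl": adjust_ttl,
            "resize_cache": resize_cache,
            "eviction_policy": eviction_policy,
            "tier_shift": tier_shift,
        }
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it; remember to call `/reset` first, as noted above.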
from client import CacheforgeEnv
from models import CacheforgeAction

with CacheforgeEnv(base_url="http://localhost:8000") as env:
    result = env.reset(mode="easy", seed=42)
    print(f"Initial hit rate: {result.observation.hit_rate}")

    action = CacheforgeAction(
        adjust_ttl=3,
        resize_cache=0.1,
        eviction_policy="LFU",
        tier_shift="none",
    )
    result = env.step(action)
    print(f"Reward: {result.reward:.3f}")

# Reset
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"seed": 42, "mode": "easy"}'
# Step
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{
"action": {
"adjust_ttl": 3,
"resize_cache": 0.1,
"eviction_policy": "LFU",
"tier_shift": "none"
}
}'

The inference.py script runs an LLM agent against the environment using an OpenAI-compatible client (HuggingFace Router). It supports two execution modes:

# Local mode
export HF_TOKEN="your_hf_token"
# Start server first
uv run python -m server.app
# Then run inference
python inference.py

# Remote mode (deployed Space)
export HF_TOKEN="your_hf_token"
export API_BASE_URL="https://tuhindev2029-cacheforge.hf.space"
python inference.py

Default model: Qwen/Qwen2.5-72B-Instruct (configurable via MODEL_NAME).
The script automatically connects to the environment using API_BASE_URL when set, otherwise defaults to local Docker execution.
The agent interacts with the environment via HTTP API calls, making the system compatible with both local and remote deployments.
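The environment-variable handling described above could look roughly like this. It is a sketch of the selection logic only, not the code in inference.py: `resolve_config` is a hypothetical name, the local fallback URL is taken from the local server address given earlier in this README, and treating `HF_TOKEN` as mandatory is an assumption.

```python
from typing import Dict, Mapping

DEFAULT_ENV_URL = "http://localhost:8000"    # local server address from this README
DEFAULT_MODEL = "Qwen/Qwen2.5-72B-Instruct"  # documented default model


def resolve_config(environ: Mapping[str, str]) -> Dict[str, str]:
    """Pick the remote or local environment URL and the model from env vars."""
    token = environ.get("HF_TOKEN")
    if not token:
        # Assumption: the script cannot reach the model router without a token.
        raise RuntimeError("HF_TOKEN must be set")
    return {
        "hf_token": token,
        # API_BASE_URL set -> remote mode; unset or empty -> local mode.
        "env_url": environ.get("API_BASE_URL") or DEFAULT_ENV_URL,
        "model_name": environ.get("MODEL_NAME") or DEFAULT_MODEL,
    }
```

In a real script this would typically be called as `resolve_config(os.environ)`; taking a mapping as the argument keeps the function easy to test.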
| Task | Workload | Score |
|---|---|---|
| Easy | Static Zipf | 0.998 |
| Medium | Mixed Zipf | 0.856 |
| Hard | Dynamic + noise | 0.912 |
Baseline uses an LLM agent (Qwen/Qwen2.5-72B-Instruct via HuggingFace Router) with a fixed seed for deterministic, reproducible runs. Scores demonstrate strong generalisation across all three workload difficulty levels.
All scores are computed using deterministic graders defined in tasks.py, ensuring reproducibility across runs. Scores exceed the success threshold (0.6) across all tasks.
cacheforge/
├── Dockerfile # Container build
├── .dockerignore # Docker build exclusions
├── .gitignore # Git exclusions
├── openenv.yaml # OpenEnv manifest
├── pyproject.toml # Dependencies & metadata
├── uv.lock # Locked dependency versions
├── LICENSE # BSD 3-Clause
├── models.py # Action & Observation Pydantic models
├── client.py # CacheforgeEnv client (WebSocket)
├── tasks.py # Task definitions & graders
├── inference.py # Baseline inference script
├── README.md # This file
└── server/
    ├── __init__.py
    ├── app.py                     # FastAPI server
    ├── cacheforge_environment.py  # Core environment simulation
    └── requirements.txt           # Server dependencies
BSD 3-Clause license. See LICENSE for details.