Problem
Milo's virtualized project control plane allocates dedicated storage infrastructure per project — watchcaches, etcd connections, informer factories — and never reclaims it. Every project that makes an API call permanently increases the server's memory and connection footprint until restart. This model works fine at small scale but degrades sharply as project count grows, and fails well before the scale a SaaS product needs.
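The growth pattern can be sketched in a few lines of Go. This is an illustration only, not Milo's actual code: `storageMux`, `projectStorage`, and `forProject` are hypothetical names, and the real per-project resources (watchcaches, etcd connections, informer factories) are reduced to a placeholder struct.

```go
package main

import (
	"fmt"
	"sync"
)

// projectStorage is a hypothetical stand-in for the per-project resources
// described above (watchcache, etcd connection, informer factory).
type projectStorage struct {
	project string
}

// storageMux (hypothetical) lazily creates one projectStorage per project.
type storageMux struct {
	mu       sync.Mutex
	children map[string]*projectStorage
}

func newStorageMux() *storageMux {
	return &storageMux{children: make(map[string]*projectStorage)}
}

// forProject returns the storage for a project, allocating it on first use.
// There is no release path, so the map only grows until the process restarts.
func (m *storageMux) forProject(project string) *projectStorage {
	m.mu.Lock()
	defer m.mu.Unlock()
	if s, ok := m.children[project]; ok {
		return s
	}
	s := &projectStorage{project: project}
	m.children[project] = s
	return s
}

// size reports how many per-project children are currently held.
func (m *storageMux) size() int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return len(m.children)
}

func main() {
	mux := newStorageMux()
	for i := 0; i < 1000; i++ {
		mux.forProject(fmt.Sprintf("project-%d", i))
	}
	mux.forProject("project-0") // a repeat call reuses the existing child
	fmt.Println("allocated children:", mux.size())
}
```

In this sketch the footprint is proportional to the number of distinct projects ever seen, not to the number of projects currently active, which is the property the success criteria below aim to remove.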
etcd compounds the problem. A single etcd cluster has practical limits on concurrent watch streams, total key count, and write throughput. As project count grows, per-project watch streams push etcd toward these limits independently of the API server, and etcd's operational tasks (compaction, defragmentation, backup/restore) become harder to manage as the dataset grows with tenant count. At sufficient scale, a single etcd cluster is not a viable storage backend regardless of how efficiently the API server uses it.
Success looks like
- Baseline performance benchmarks established so current scalability limits are understood and future improvements are measurable — covering both the API server and etcd
- A single Milo instance can serve tens of thousands of projects without degradation in memory, connection count, or API latency as project count grows
- Adding a new project does not allocate new storage infrastructure (connections, watchcaches, goroutine pools)
- The storage backend can scale independently of the API server, with a path to replace etcd if it becomes the bottleneck
- Existing projects continue to be fully isolated from one another
Context
kplane.dev is an open-source project that solves the same multi-tenant control plane problem at roughly 170× higher density. Their approach embeds tenant identity within the storage key rather than using per-tenant storage instances, which allows watchcaches and etcd connections to be shared across all tenants. Their kplane-dev/storage and kplane-dev/informer libraries directly address the bottlenecks Milo has today, and kplane-dev/kubernetes (a Kubernetes fork) unlocks shared watchcaches across tenants. kplane also provides kplane-dev/spanner, an alternative storage.Interface implementation backed by Google Cloud Spanner, which demonstrates a path beyond etcd for deployments that need to scale storage independently.
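To make the "tenant identity within the storage key" idea concrete, here is a minimal sketch of one possible key layout. The layout `/<root>/<tenant>/<resource>/<namespace>/<name>` and the helper names `tenantKey`, `tenantPrefix`, `tenantOf`, and `belongsTo` are assumptions made for illustration; they are not taken from kplane's libraries, whose actual layout should be checked before any design work.

```go
package main

import (
	"fmt"
	"strings"
)

// tenantKey builds a storage key with the tenant ID embedded after the root.
// Assumed layout: /<root>/<tenant>/<resource>/<namespace>/<name>
func tenantKey(root, tenant, resource, namespace, name string) string {
	return "/" + strings.Join([]string{root, tenant, resource, namespace, name}, "/")
}

// tenantPrefix returns the key prefix covering one tenant's objects of one
// resource. The trailing slash keeps tenant "a" from matching tenant "ab".
func tenantPrefix(root, tenant, resource string) string {
	return "/" + strings.Join([]string{root, tenant, resource}, "/") + "/"
}

// tenantOf extracts the tenant ID from a key, so that events arriving on a
// single shared watch can be routed to the right tenant.
func tenantOf(root, key string) (string, bool) {
	rootPrefix := "/" + root + "/"
	if !strings.HasPrefix(key, rootPrefix) {
		return "", false
	}
	tenant, _, found := strings.Cut(strings.TrimPrefix(key, rootPrefix), "/")
	if !found || tenant == "" {
		return "", false
	}
	return tenant, true
}

// belongsTo reports whether a key falls under a tenant's prefix for a resource.
func belongsTo(root, tenant, resource, key string) bool {
	return strings.HasPrefix(key, tenantPrefix(root, tenant, resource))
}

func main() {
	key := tenantKey("registry", "project-a", "configmaps", "default", "settings")
	fmt.Println(key)
	fmt.Println(belongsTo("registry", "project-a", "configmaps", key))
	fmt.Println(belongsTo("registry", "project", "configmaps", key))
}
```

With a layout like this, one connection and one cache can serve every tenant, and isolation moves from "separate storage instance" to "every read, write, and watch is scoped by the tenant prefix", which is why prefix handling needs to be exact.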
Delivering on the success criteria above will require evaluating and likely adopting parts of kplane's stack, designing a key layout migration for existing projects, understanding etcd's practical ceiling in Milo's deployment model, and determining how Milo's Organization → Project hierarchy maps onto kplane's tenant model.
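As a starting point for the hierarchy question, the sketch below shows one candidate mapping in which an Organization and Project pair is flattened into a single tenant identifier of the form `<organization>/<project>`. The `ProjectRef` type, the `TenantID` and `ParseTenantID` functions, and the choice of `/` as separator are all hypothetical; whether kplane's tenant model accepts identifiers of this shape is one of the things the evaluation would need to confirm.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ProjectRef (hypothetical) identifies a project within its organization.
type ProjectRef struct {
	Organization string
	Project      string
}

// TenantID flattens the hierarchy into "<organization>/<project>".
// Names containing the separator are rejected so the mapping stays reversible.
func (p ProjectRef) TenantID() (string, error) {
	if p.Organization == "" || p.Project == "" {
		return "", errors.New("organization and project must both be set")
	}
	if strings.Contains(p.Organization, "/") || strings.Contains(p.Project, "/") {
		return "", errors.New("organization and project must not contain '/'")
	}
	return p.Organization + "/" + p.Project, nil
}

// ParseTenantID reverses TenantID.
func ParseTenantID(id string) (ProjectRef, error) {
	org, project, ok := strings.Cut(id, "/")
	if !ok || org == "" || project == "" || strings.Contains(project, "/") {
		return ProjectRef{}, fmt.Errorf("invalid tenant ID %q", id)
	}
	return ProjectRef{Organization: org, Project: project}, nil
}

func main() {
	ref := ProjectRef{Organization: "acme", Project: "web"}
	id, err := ref.TenantID()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(id)

	back, err := ParseTenantID(id)
	fmt.Println(back.Organization, back.Project, err)
}
```

A two-segment identifier keeps an organization's projects adjacent in key order, which would allow organization-wide prefix scans. A flat, opaque tenant ID is the obvious alternative and avoids baking the hierarchy into storage keys.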
References
- internal/apiserver/storage/project/: storage mux and child lifecycle
- internal/controllers/garbagecollector/wiring.go: per-project informer factory pattern
- internal/controllers/resourcemanager/project_controller.go: ProjectControlPlane lifecycle