storage: project control plane storage layer does not scale with project count #596

@scotwells

Problem

Milo's virtualized project control plane allocates dedicated storage infrastructure per project (watchcaches, etcd connections, informer factories) and never reclaims it. Every project that makes an API call permanently increases the server's memory and connection footprint until the server restarts. This model works at small scale but degrades sharply as project count grows, and it fails well before the scale a SaaS product needs.
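The allocation pattern described above can be sketched roughly as follows. This is an illustration only, not Milo's actual code: `projectStorage`, `registry`, and `storageFor` are hypothetical names, and the byte slice merely stands in for a watchcache, an etcd connection, and an informer factory.

```go
package main

import (
	"fmt"
	"sync"
)

// projectStorage is a hypothetical stand-in for the per-project
// infrastructure named in the issue (watchcache, etcd connection,
// informer factory).
type projectStorage struct {
	project string
	cache   []byte // placeholder for watchcache memory
}

// registry allocates storage lazily on first use and has no eviction
// path, so its footprint only grows until the process restarts.
type registry struct {
	mu       sync.Mutex
	storages map[string]*projectStorage
}

func newRegistry() *registry {
	return &registry{storages: make(map[string]*projectStorage)}
}

// storageFor returns the storage for a project, creating it on the
// first API call that touches the project.
func (r *registry) storageFor(project string) *projectStorage {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.storages[project]; ok {
		return s
	}
	s := &projectStorage{project: project, cache: make([]byte, 1<<10)}
	r.storages[project] = s
	return s
}

// size reports how many per-project storage instances are held.
func (r *registry) size() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return len(r.storages)
}

func main() {
	r := newRegistry()
	for i := 0; i < 1000; i++ {
		r.storageFor(fmt.Sprintf("project-%d", i))
	}
	// One retained instance per project that was ever touched.
	fmt.Println("retained storage instances:", r.size())
}
```

In this sketch the number of retained instances equals the number of distinct projects ever seen, which is the growth behavior the issue wants to eliminate.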

etcd compounds the problem. A single etcd cluster has practical limits on concurrent watch streams, total key count, and write throughput. As project count grows, the number of per-project watch streams drives etcd toward these limits independently of the API server. etcd's operational model (compaction, defragmentation, backup/restore) also becomes harder to manage as the dataset grows with project count. At sufficient scale, a single etcd cluster is not a viable storage backend regardless of how efficiently the API server uses it.
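A back-of-the-envelope sketch of how watch streams might grow, assuming each per-project watchcache opens one etcd watch per resource type (typical of Kubernetes-style watchcaches, but not confirmed for Milo here). The function names and the figures of 10,000 projects and 50 resource types are illustrative assumptions, not measurements.

```go
package main

import "fmt"

// perProjectWatchStreams estimates etcd watch streams when every project
// has its own watchcache per resource type (assumed model).
func perProjectWatchStreams(projects, resourceTypes int) int {
	return projects * resourceTypes
}

// sharedWatchStreams estimates etcd watch streams when one watchcache per
// resource type is shared by all projects (assumed model).
func sharedWatchStreams(resourceTypes int) int {
	return resourceTypes
}

func main() {
	// Illustrative inputs only.
	projects, resourceTypes := 10000, 50
	fmt.Println("per-project:", perProjectWatchStreams(projects, resourceTypes))
	fmt.Println("shared:     ", sharedWatchStreams(resourceTypes))
}
```

Under these assumptions the per-project model grows linearly with project count, while a shared model stays constant. Real numbers would have to come from the baseline benchmarks.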

Success looks like

  • Baseline performance benchmarks are established for both the API server and etcd, so current scalability limits are understood and future improvements are measurable
  • A single Milo instance can serve tens of thousands of projects without degradation in memory, connection count, or API latency as project count grows
  • Adding a new project does not allocate new storage infrastructure (connections, watchcaches, goroutine pools)
  • The storage backend can scale independently of the API server, with a path to replace etcd if it becomes the bottleneck
  • Existing projects continue to be fully isolated from one another

Context

kplane.dev is an open-source project that solves the same multi-tenant control plane problem at roughly 170× higher density. Their approach embeds tenant identity within the storage key rather than using per-tenant storage instances, which allows watchcaches and etcd connections to be shared across all tenants. Their kplane-dev/storage and kplane-dev/informer libraries directly address the bottlenecks Milo has today, and kplane-dev/kubernetes (a Kubernetes fork) unlocks shared watchcaches across tenants. kplane also provides kplane-dev/spanner, an alternative storage.Interface implementation backed by Google Cloud Spanner, which demonstrates a path beyond etcd for deployments that need to scale storage independently.

Delivering on the success criteria above will require evaluating and likely adopting parts of kplane's stack, designing a key layout migration for existing projects, understanding etcd's practical ceiling in Milo's deployment model, and determining how Milo's Organization → Project hierarchy maps onto kplane's tenant model.

Labels: bug, enhancement
