# Production-Grade Self-Hosted ClearML Deployment (Docker Compose)

This repository implements a production-oriented, self-hosted deployment of ClearML using Docker Compose, designed with infrastructure isolation, service segmentation, and scalability in mind.
## Architecture

The system follows a layered architecture:
- Ingress Layer
- Application Layer
- State & Data Layer
- Execution Layer
## Nginx (Reverse Proxy / Entry Point)

Acts as the single ingress boundary of the system.

Responsibilities:
- Centralized traffic routing
- API/UI request segregation
- Future TLS termination point
- Security enforcement (rate limiting, headers, IP policies)
- Decouples public exposure from internal services
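The routing split above can be sketched as an Nginx configuration. The upstream hostnames and ports here are assumptions based on common ClearML defaults, not necessarily this repository's actual config:

```nginx
# Illustrative only — service names and ports assume ClearML defaults
upstream clearml-api   { server apiserver:8008; }
upstream clearml-web   { server webserver:80; }
upstream clearml-files { server fileserver:8081; }

server {
    listen 80;

    location /api/   { proxy_pass http://clearml-api/; }
    location /files/ { proxy_pass http://clearml-files/; }
    location /       { proxy_pass http://clearml-web/; }
}
```

TLS termination would later land in this same `server` block (`listen 443 ssl;` plus certificate directives), leaving the internal services untouched.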
## ClearML API Server
- Core orchestration and metadata management component
- System control plane
- Experiment lifecycle management
- Task orchestration
- Metadata persistence coordination
- Communication hub between storage, indexing, and execution layers
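A minimal Compose sketch of how the API server wires into the state layer. The image tag and environment variable names follow upstream clearml-server conventions and should be checked against this repository's actual compose file:

```yaml
apiserver:
  image: allegroai/clearml:latest     # assumed tag; pin a version in production
  depends_on:
    - mongo
    - redis
    - elasticsearch
  environment:
    CLEARML_MONGODB_SERVICE_HOST: mongo
    CLEARML_REDIS_SERVICE_HOST: redis
    CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
  restart: unless-stopped
```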
## ClearML Web Server
- Presentation and interaction layer
- Visualization interface for both read-only monitoring and interactive use
- Observability for experiment tracking
- Stateless, relies on API server for control
## ClearML File Server
- Artifact persistence gateway
- Binary artifact storage abstraction
- Centralized model and dataset persistence
- Decouples execution layer from storage
- Scalable artifact management
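Artifact durability hinges on a named volume, so the container can be replaced without losing models or datasets. A sketch, with the mount path assumed rather than taken from this repository:

```yaml
fileserver:
  image: allegroai/clearml:latest
  volumes:
    - fileserver-data:/mnt/fileserver   # assumed path; artifacts survive container replacement
  restart: unless-stopped

volumes:
  fileserver-data:
```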
## MongoDB (Primary Metadata Store)
- System-of-record for structured state
- Persistent metadata storage (task definitions, config state)
- Provides transactional durability and long-term consistency
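As the system-of-record, MongoDB is the one service where a named volume is non-negotiable. A sketch; the pinned version is an assumption and should match what ClearML supports:

```yaml
mongo:
  image: mongo:4.4            # assumed version; verify against ClearML's compatibility notes
  volumes:
    - mongo-data:/data/db     # durable metadata: tasks, projects, config state
  restart: unless-stopped
```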
## Redis (Ephemeral State & Messaging)
- Low-latency coordination layer
- Queue backend, caching, inter-service signaling
- Reduces load on persistent storage
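Because Redis holds only ephemeral queue and cache state, its service definition can stay deliberately minimal; losing it on restart is acceptable by design:

```yaml
redis:
  image: redis:6.2            # assumed version
  restart: unless-stopped
  # no named volume: queue/cache state is safe to lose on restart
```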
## Elasticsearch (Observability & Indexing Engine)
- Search and log analytics subsystem
- High-performance log indexing
- Metric filtering and aggregation
- Operational visibility
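The heap configuration mentioned under resource governance typically looks like the following. This assumes a single-node deployment; the version tag and heap size are placeholders to adjust to the host's memory:

```yaml
elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:7.17.18  # assumed version
  environment:
    discovery.type: single-node
    bootstrap.memory_lock: "true"
    ES_JAVA_OPTS: -Xms2g -Xmx2g   # fixed heap; keep well below the container's memory limit
  ulimits:
    memlock:
      soft: -1
      hard: -1
  volumes:
    - elastic-data:/usr/share/elasticsearch/data
```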
## ClearML Agent Services
- Distributed compute execution engine
- Pull-based task execution
- Containerized experiment runtime
- Reproducible execution environment
- Horizontal scaling of compute workers
- Decouples orchestration from computation
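The pull-based model means workers only need to reach the API server, which is what makes them trivially replicable. A sketch, with the image and environment variable following upstream agent-services conventions:

```yaml
agent-services:
  image: allegroai/clearml-agent-services:latest
  environment:
    CLEARML_API_HOST: http://apiserver:8008   # workers pull tasks from the control plane
  restart: unless-stopped
```

Horizontal scaling then reduces to `docker compose up -d --scale agent-services=3`, since no state ties a task to a particular worker.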
## Design Principles

- Network Segmentation – Public-facing components isolated from data services
- Separation of Concerns – Each layer has well-defined responsibilities
- Horizontal Scaling Readiness – Stateless services can be replicated; execution layer scales independently
- Persistent Storage Strategy – Named Docker volumes, clear separation between ephemeral and durable state
- Production-Oriented Resource Governance – Healthchecks, restart policies, Elasticsearch heap configuration
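The segmentation and storage principles above map directly onto Compose primitives; the network and volume names here are illustrative:

```yaml
networks:
  frontend:                # nginx and the services it exposes
  backend:
    internal: true         # data services publish no ports to the host

volumes:                   # durable state only; Redis stays ephemeral
  mongo-data:
  elastic-data:
  fileserver-data:
```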
## Prerequisites

- Docker ≥ 24.0
- Docker Compose ≥ 2.0
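Assuming Docker and the Compose plugin are installed, the versions can be verified and the stack brought up with:

```shell
docker --version          # expect 24.0 or newer
docker compose version    # expect v2.x
docker compose up -d      # start the full stack in the background
docker compose ps         # confirm all services report healthy
```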