ROADMAP.md

Existing Features

  • Default Components: Provide an extensible "default scheduler".

  • Multiple Pod Types: Support services with heterogeneous node types - e.g., HDFS with Name Nodes, Journal Nodes, and Data Nodes.

  • Configurable State Store Backing Service for Default Scheduler: Enable configuring the state store backing service. Currently ZooKeeper is the only supported state store backing service.

  • Pod APIs: Declarative pod specification via a YAML API, and a Java API based on the builder pattern.

  • Variable Substitution in YAML API: Support variable substitution in YAML API - e.g., type: {{TYPE}}

  • Resources for Pods: CPU and memory resources required for pods.

  • Local Storage Volume Resources for Pods: Support for 'ROOT' and 'MOUNT' persistent storage volumes for stateful services.

  • Resource Reservation and Accounting for Pods: Support for reserving pod resources. Reservations ensure that a pod can always relaunch on a healthy agent with its local persistent storage volume (e.g., after a pod crash or an agent reboot).

  • Task RUNNING Goal State for Pods: Support long running tasks.

  • Binary Assets for Tasks: Define artifacts needed for the tasks - e.g., JRE, Kafka binaries.

  • Configuration Templating for Tasks: Templating for task configuration files with Mustache so that task configuration is customizable.

  • Configurable Environment Variables for Tasks: Support configuring environment variables for tasks.

  • Scale-up/down CPU Resources for Pods: Support increasing and decreasing CPU resources for a pod. Resource reservations are automatically updated accordingly.

  • Scale-up/down Memory Resources for Pods: Support increasing and decreasing memory resources for a pod. Resource reservations are automatically updated accordingly.

  • Scale Out Count for Pods: Support increasing the number of pods. If required by the service, operators must manually rebalance data after the additional pods are running.

  • Maintenance Plans: Allow the definition and on-demand execution of plans with tasks that run inside the context of running pods. This allows for the definition of custom maintenance plans like backup and restore, data rebalance, etc.

  • Default Deploy Plan: Built-in plan for deploying pods with both serial and parallel execution strategies.

  • Task FINISHED Goal State for Tasks within a Pod: Support for tasks that run until finished. This is useful for operations such as formatting a data node volume.

  • Load Balanced Virtual Interface for Tasks: Specify load balanced virtual interfaces for tasks. Virtual addresses define a name, source port, destination port mapping, protocol, and whether the address is externally advertised. Virtual addresses marked for external advertisement are discoverable via the /endpoints HTTP API.

  • Placement Constraints within Pods: Support for the Marathon constraint language for pods (with the exception of the Marathon constraint language for volumes).

  • Auto Suppress/Revive of Resource Offers for Default Scheduler: The scheduler automatically sends SUPPRESS / REVIVE messages to Apache Mesos so that Apache Mesos only sends offers when the scheduler needs them. This enables Apache Mesos to scale to a much larger number of frameworks.

  • Set rlimits for Tasks: Support for defining rlimit settings for a task.

  • Configurable Principal for Default Scheduler: Enable configuring of the principal.

  • Set User Account for Tasks: Support for setting the user account for a task.

  • Custom Strategies for Plans: Java API with documentation for developing custom plan strategies.

  • Basic Docker Image Support for Tasks: Support for a single Docker image per pod. (See also Advanced Docker Image Support for Tasks under Planned Features.)

  • Resource Sets: Multiple different tasks can be run in sequence using the same resources (in particular persistent volumes) to produce a cumulative effect. For example, initialization tasks before main tasks can be modeled in this way.

  • Health Checks for Tasks: Health check commands can be defined on a per-task basis to automatically kill unhealthy tasks.

  • Readiness Checks for Tasks: Readiness checks enable plans that restart pods to wait until data replication has caught up before proceeding.
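
To make the "Variable Substitution in YAML API" item above concrete, here is a minimal Python sketch of how a `{{TYPE}}`-style placeholder could be filled in before a specification is parsed. It illustrates the idea only and is not the SDK's implementation; the `render` function, the `NODE_COUNT` variable, and the choice to fail on missing variables are all assumptions made for this example.

```python
import re

# Matches placeholders such as {{TYPE}} or {{ TYPE }} (hypothetical syntax subset).
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render(template: str, variables: dict) -> str:
    """Replace every {{NAME}} in template with the matching value from variables."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumption for this sketch: a missing variable is an error
            # rather than being silently rendered as an empty string.
            raise KeyError(f"no value supplied for placeholder {name!r}")
        return str(variables[name])

    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical specification fragment; only "type: {{TYPE}}" comes from the text above.
spec = "type: {{TYPE}}\ncount: {{NODE_COUNT}}"
rendered = render(spec, {"TYPE": "data", "NODE_COUNT": 3})
print(rendered)
```

Failing on a missing variable is a design choice for the sketch: it surfaces a misconfigured service early instead of deploying a specification with a blank field.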

Planned Features

  1. DC/OS Networking API Integration for Tasks: Support for integrating with DC/OS Networking. DC/OS supports Container Networking Interface.

  2. External Storage Volumes for Pods: Add support for external storage volumes via Container Storage Interface (CSI).

  3. Enterprise DC/OS Secrets Management API: Integrate with Enterprise DC/OS API for secure distribution of secrets such as certificates, keytab, and other sensitive files.

  4. Enterprise DC/OS Role Based Access Control (RBAC) Integration: With the Enterprise DC/OS Role Based Access Control feature, ACLs are applied to Marathon folders to control access.

  5. Enterprise DC/OS Service Account Configuration: Wrap the process of configuring an Enterprise DC/OS Service Account.

  6. Scheduler and Executor Metrics: Instrument the default scheduler and executor to send performance / health / usage metrics for monitoring and troubleshooting via DC/OS Metrics.

  7. Rack-aware Placement and Data Replication: Many stateful services such as Elasticsearch, Kafka, Cassandra, and HDFS are "rack-aware". This feature depends on Apache Mesos Fault Domains.

  8. Non-reserved Resources for Tasks: Currently the SDK always reserves resources. This is because the initial focus for the SDK is stateful workloads, where reserving resources is essential for safe operations. For other workloads like twelve-factor apps and analytics jobs, reserving resources is generally undesirable.

  9. GPU Resources for Pods: Support for GPU resources to enable analytics workloads such as TensorFlow.

  10. FINISHED Goal State for Default Scheduler: Currently services have an implied RUNNING goal state, which is appropriate for long running services. To support batch workloads such as TensorFlow, a new concept of a service goal state will be introduced with support for RUNNING and FINISHED goal states.

  11. Graceful Shutdown for Tasks: Support for graceful shutdown of services such as Apache Kafka. This is considered an optimization since hardware and software can fail unexpectedly.

  12. Advanced Docker Image Support for Tasks: Support for Docker image per task within a pod.

  13. Updatable Placement Constraints for Pods: Enable operators to update pod placement constraints so that pods can be safely replaced without losing data or violating performance SLAs.

  14. Partition-Aware API Integration: Support for the Apache Mesos partition-aware framework APIs. For more details see MESOS-5344 and MESOS-6394.

  15. Scale-in Count for Pods: Reducing node (pod) count for most data services requires draining data. The SDK currently prevents reducing the pod count to avoid accidental deletion of data. With this feature, a plan could be called to drain data prior to termination. On termination, resources for the pods will be unreserved. Note: scale-out, scale-up, scale-down are already supported.

  16. Read-only Volumes for Pods: Support for read-only volumes (MESOS-4324).

  17. HTTP/HTTPS Health Checks for Tasks: Support for HTTP/HTTPS health checks (MESOS-2533).

  18. TCP Health Checks for Tasks: Support for TCP health checks (MESOS-3567).

  19. Maintenance API: Apache Mesos offers maintenance primitives to notify the scheduler (framework) when cluster agents (servers) will be offline. With this feature, automated recovery plans could consider maintenance windows to make better decisions, and proactive task replacement plans will be possible.
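
The difference between the RUNNING and FINISHED goal states mentioned in item 10 can be sketched as a small relaunch decision. This is a hypothetical Python illustration, not SDK code: the `GoalState` enum, the `should_relaunch` function, and the rule that a FINISHED task is retried after a non-zero exit code are assumptions made for the example.

```python
from enum import Enum


class GoalState(Enum):
    RUNNING = "RUNNING"    # long running: the work should stay up indefinitely
    FINISHED = "FINISHED"  # batch style: the work should run to completion once


def should_relaunch(goal: GoalState, exited: bool, exit_code: int = 0) -> bool:
    """Decide whether exited work needs to be launched again."""
    if not exited:
        # Still running, so nothing to do for either goal state.
        return False
    if goal is GoalState.RUNNING:
        # A RUNNING goal is never satisfied by an exit, clean or not.
        return True
    # Assumption for this sketch: a FINISHED goal is met only by a clean exit.
    return exit_code != 0


print(should_relaunch(GoalState.RUNNING, exited=True, exit_code=0))
print(should_relaunch(GoalState.FINISHED, exited=True, exit_code=0))
```

Under these assumptions, a long running service that exits cleanly is still relaunched, while a batch workload that exits cleanly is considered done.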