@unterstein released this Jan 18, 2017

  • fb4aeb9 Quote some vars and make sed more explicit (#4890) (Lukas Lösche)
  • dafd0df workaround for zip64 incompat with shebang-prefixed jars (James DeFelice)
  • e888d16 Fixes #4637 Allow filtering SSE events by types (#4936) (janisz)
  • eaaa994 Parse JSON event only once before broadcast (#4927) (janisz)
  • 7b346de Split long lines (#4926) (janisz)
  • a7f4138 escape leader-latch lock (targets 1.3) (Brad Peters)
  • a05a01b Releases/1.3 async task tracker (#4912) (janisz)
  • 90843e4 Include TASK_FINISHED as failure state when upgrading (#4865) (fengyehong)
  • d3351ea Cherry-picked Handle spaces in arguments correctly (#4887) (janisz)
  • f72df1b Cherry-picked MigrationTo1_1 class to handle broken app groups (#4711) (#4772) (Aleksey Dukhovniy)
  • 4f75209 Allow Zookeeper Connection Timeout to be configured. (#4685) (Jason Gilanfarr)
  • 850db7f Allow Zookeeper Connection Timeout to be configured. (Jason Gilanfarr)
  • 116992d Update #1428 to use MARATHON_CMD to avoid breaking other integration (Jason Gilanfarr)
  • 51f103e Fixes #1428 Converts command arguments to environment variables and back again (merge to release/1.3) (#4644) (brad-peters)



@aquamatthias released this Jan 11, 2017

Changes from 1.4.0-RC4 to 1.4.0-RC5

Recommended Mesos version is 1.1.0

The complete list of changes in Marathon 1.4 is in the 1.4.0-RC1 release notes.

Fixed Issues:

  • Fixes #4876 - Use 'openjdk:8-jdk' container image instead of 'java:8-jdk'.
  • Fixes #4874 - Document feature flag requirement for TASK_KILLING state.
  • Fixes #4900 - Do not restart already started task during leader abdication.
  • Fixes #4788 - Use async InstanceTracker in DeploymentActor.
  • Fixes #4897 - Fix kill and scale to never scale below 0.
  • Fixes #4146 - docker containerizer now allows relative containerPath starting with mesos 1.0.
  • Port Index validation for health checks is performed for all network based health checks.
  • Escape leader-latch lock to not go into a deadlock situation



@aquamatthias released this Dec 21, 2016

Changes from 1.4.0-RC3 to 1.4.0-RC4

Recommended Mesos version is 1.1.0

The complete list of changes in Marathon 1.4 is in the 1.4.0-RC1 release notes.

Fixed Issues:

  • Fixes #4873 - Tasks with configured Marathon HealthChecks fail HealthChecks after migration to 1.4
  • Fixes #4842 - by also writing deprecated pods to the proto.
  • Fixes #4828 - Use async InstanceTracker methods in SchedulerActor
  • Fixes #4882 - by renaming KillSelection enum values
  • Fixes #4890 - Quote some vars and make sed more explicit
  • Fixes #4872 - Disallow usage of Command Checks on Pods until mesos supports them
  • Fixes #4863 - Inform the rate limiter for pods and apps
  • Fixes #4818 - Update kill selection in App as well.
  • Fixes #4877 - fix various bugs in our RAML specification. add omitEmpty to some fields for cleaner output.



@aquamatthias released this Dec 13, 2016

Changes from 1.4.0-RC2 to 1.4.0-RC3

Recommended Mesos version is 1.1.0

The complete list of changes in Marathon 1.4 is in the 1.4.0-RC1 release notes.

Fixed Issues:

  • D323 - Fixes #4831: The LaunchQueueActor now forwards InstanceUpdate Events and sends Done in case of a non existing actor.
  • D324 - Fixes #4829: Handle spaces in arguments correctly
  • D316 - Extract Instance.update into InstanceUpdater
  • D319 - Renamed and moved InstanceUpdaterTest to correct name and package



@aquamatthias released this Dec 12, 2016

Changes from 1.4.0-RC1 to 1.4.0-RC2

Recommended Mesos version is 1.1.0

The complete list of changes in Marathon 1.4 is in the 1.4.0-RC1 release notes.

Fixed Issues:

  • D273 - Mute ResidentTaskIntegrationTest.
  • D279 - Log plan id instead of plan.
  • D276 - Fix issues with Entrypoint/Cmd in Docker images.
  • D265 - clarify relationship between Condition and MesosTaskState
  • D286 - Fixes #4784 - Differ between AppNotFoundException and PodNotFoundException
  • D288 - Fixes a bug I found while testing default_network_name
  • D290 - Allow Marathon devs to annotate API fields/types as deprecated.
  • D289 - Fixes #4790 - POD instances should not be killable via /v2/tasks
  • D277 - MarathonSchedulerActor now respects launch queue tasks when scaling app
  • 1dfd99e - added version 1.3 information
  • D281 - When Launching lots of tasks Journald is pegged.
  • D293 - Mute unstable tests - according to jenkins loop test.
  • D292 - Protect against garbage mesos versions
  • D295 - Mark ResidentTaskInterationTest.persistent volume will be re-attached and keep state as unstable
  • D296 - Rename unreachable strategy parameters to avoid API stutter.
  • D297 - Fixes #4808, relocate pod unreachbleStrategy to PodSchedulingPolicy type
  • D195 - Rework DeploymentManager logic
  • D287 - changelog for Marathon 1.4.0
  • D294 - Move InstanceUpdateOpResolver to own file
  • efb38b9 - Parse JSON event only once before broadcast (#4771)
  • D300 - Turn off debug for jetty
  • D299 - use patienceConfig for ZK timeouts
  • D301 - Increase default expunge timeout tasks
  • D283 - Output debug logs for marathon integration tests.
  • D304 - add Unreachablestrategy to AppUpdate and apply
  • cc4cf66 - Ss/pods docs
  • 0cc19fc - pods typo
  • e354da0 - Change host port type to ephemeral port number (#4820)
  • a0d2b9f - Properly mark health check props as optional
  • D313 - Fix incorrect conversion from TaskStatus.timestamp to Timestamp
  • D308 - Removed unreachableInstances from QueuedInstanceInfo and QueuedInstanceInfoWithStatistics
  • D314 - Remove additional condition from InstanceChangedEventsGenerator.events
  • D309 - Handle OfferMatching timeouts as well as some performance improvements
  • D315 - Moved InstanceUpdateOpResolver to correct package
  • D317 - Fixes #4824 - Use GroupRepository instead of AppRepository and PodRepository in MarathonSchedulerActor
  • D274 - Test app resources with embedded task failures.
  • D302 - Abort on loss of leadership
  • D320 - Reverted the group function to not use transitiveGroupsById.
  • D321 - Used map instead of mapValues in RootGroup.



@jdef released this Dec 5, 2016

Changes from 1.3.0 to 1.4.0-RC1

Breaking Changes

Plugin API has changed

In order to support the nature of pods, we had to change the plugin interfaces in a backward incompatible fashion.
Plugin writers need to update their plugins in order to use this version.

Health reporting via the event stream

Adding support for pods in Marathon required the internal representation of tasks to be migrated to instances. An instance represents the executor on the Mesos side, and contains a list of tasks. This change is reflected in various parts of the API, which now accordingly reports health status etc for instances, not for tasks.
Until v1.3.x, Marathon published health_status_changed_events via the event stream. With the introduction of instances that can contain multiple tasks, Marathon moved away from that event in favor of instance_health_changed_events.
If you were consuming that event, adjust your tooling to consume the new event instead, e.g.

    {
      "instanceId": "some_app.marathon-49d976d3-9c6f-11e6-93cb-0242216b9f0d",
      "runSpecId": "/some/app",
      "healthy": true,
      "runSpecVersion": "2016-10-18T10:42:47.499Z",
      "timestamp": "2016-10-27T18:00:50.401Z",
      "eventType": "instance_health_changed_event"
    }

Accordingly, the failed_health_check_event now reports an instanceId instead of a taskId:

    {
      "instanceId": "some_app.marathon-49d976d3-9c6f-11e6-93cb-0242216b9f0d",
      "eventType": "failed_health_check_event"
    }

This change affects the following API primitives in a similar way:

  • unhealthy_instance_kill_event (in favor of the previous unhealthy_task_kill_event) provides both the instanceId of the instance that got killed, as well as the taskId designating the task that failed health checks.
  • Health information as reported via the apps and tasks endpoint.
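
As a rough illustration of adjusting tooling to the new event, the sketch below reads one event payload and extracts the instance health. It relies only on the fields shown in the example above; the function name `health_update` is hypothetical and not part of Marathon.

```python
import json


def health_update(raw_event):
    """Return (instance_id, healthy) for an instance_health_changed_event.

    Returns None for every other event type, so callers can ignore
    events they do not care about.
    """
    event = json.loads(raw_event)
    if event.get("eventType") != "instance_health_changed_event":
        return None
    return event["instanceId"], event["healthy"]


# Sample payload copied from the example in these release notes.
sample = json.dumps({
    "instanceId": "some_app.marathon-49d976d3-9c6f-11e6-93cb-0242216b9f0d",
    "runSpecId": "/some/app",
    "healthy": True,
    "runSpecVersion": "2016-10-18T10:42:47.499Z",
    "timestamp": "2016-10-27T18:00:50.401Z",
    "eventType": "instance_health_changed_event",
})
```

Tooling that previously keyed its state on a taskId would need to key on the instanceId instead.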



Pods

A pod is a collection of co-located and co-scheduled containers in a shared context.
The containers of a pod share a network namespace and may share access to the same filesystem(s).
Each pod instance’s containers are individually resource-isolated.

Mesos 1.1 adds support for launching a group of tasks (LAUNCH_GROUP).
A pod instance’s containers are launched via this Mesos primitive.
Mesos provides the executor implementation that Marathon will use to run pod instances.

We created a new primitive, PodDefinition, as well as new API endpoints.
Read more about how to use pods in our Pods Documentation
and the /v2/pods section of the REST API Reference.

Pods are implemented as a new primitive in Marathon.
The general functionality of apps plus the related endpoints are still available.

Mesos-based health checks for HTTP, HTTPS, and TCP

Health checks are an integral part of application monitoring and have been available in Marathon since version 0.7.
At the time that health checks were first added to Marathon, there was no support for health checks in Mesos.
Prior to the availability of Mesos-based health checks, health checks were only performed directly in Marathon. This has the following consequences:

  • Marathon has to share the same network as the tasks to monitor, so it can reach all launched tasks
  • Network partitions could lead to wrong scheduling decisions
  • The health state is not available via the Mesos state
  • Marathon health checks do not scale to large numbers of tasks.

Starting with Mesos 1.1, it is possible to perform network-based health checks directly at the Mesos executor level.
Marathon makes all the Mesos-based health checks available.
See the updated Health Check Documentation,
especially the new protocols: MESOS_HTTP, MESOS_HTTPS, MESOS_TCP.

We strongly recommend Mesos-based health checks over Marathon-based health checks.
Marathon-based health checks are deprecated and will be removed in a future version.

New ZK persistent storage layout

ZooKeeper has a limitation on the number of nodes it can store in a directory node.
Until version 1.3, Marathon used a flat storage layout in ZooKeeper and encountered this limitation with large installations.
The latest version of Marathon uses a nested storage layout, which significantly increases the number of nodes that can be stored.

ZooKeeper also limits the size of a single node (typically 1 MB).
In prior versions, a group was stored with all subgroups and applications.
This could lead to a node size larger than 1 MB, which could not be stored.
The latest version of Marathon stores a group only with references in order to keep node size under 1 MB.

A migration inside Marathon automatically migrates the prior layout to the new one.

Improve Task Lost behaviour

The connection between the Mesos master and an agent can be broken for several reasons (network partition, agent update, etc).
When this happens, there is limited knowledge of the status of the agent's tasks.
Prior versions of Mesos declared such tasks as lost after a timeout and killed the tasks if the agent rejoined the cluster.

Starting with Mesos 1.1, those tasks are declared unreachable, not lost.
The scheduler that launched the tasks decides how to handle unreachable tasks.

Marathon uses this feature and adds an unreachableStrategy to the AppDefinition and PodDefinition, which allows you to define:

  • inactiveAfterSeconds: how long Marathon should wait to start a replacement task.
  • expungeAfterSeconds: how long Marathon should wait for a task to come back.

If a task comes back and the replacement task is already started, Marathon needs to decide which task to kill.
In order to let the user define which task should be taken, a kill selection can be defined.
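
The following sketch shows how these settings might look in an app definition, together with a small model of the two timeouts. The numbers are arbitrary examples, the `killSelection` field name and value are assumptions to verify against the REST API Reference, and `unreachable_phase` is a hypothetical helper that only illustrates the semantics described above, not Marathon's implementation.

```python
import json

# Example app definition fragment; all values are illustrative only.
app = {
    "id": "/some/app",
    "cmd": "sleep 3600",
    "cpus": 0.1,
    "mem": 32,
    "instances": 1,
    "unreachableStrategy": {
        "inactiveAfterSeconds": 300,   # then a replacement task is started
        "expungeAfterSeconds": 600,    # then Marathon stops waiting for the task
    },
    # Assumed field name and value; check the REST API Reference.
    "killSelection": "YOUNGEST_FIRST",
}


def unreachable_phase(seconds_unreachable, strategy):
    """Classify an unreachable task by how long it has been unreachable."""
    if seconds_unreachable >= strategy["expungeAfterSeconds"]:
        return "expunged"
    if seconds_unreachable >= strategy["inactiveAfterSeconds"]:
        return "inactive"
    return "unreachable"


body = json.dumps(app)  # what would be sent to the apps endpoint
```

With these example values, a task that returns after 400 seconds would find a replacement already running, which is the case where the kill selection applies.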

Insights into the Launch Process - AKA: Why isn't my app starting?

Marathon tries to schedule tasks based on app or pod definition, which incorporates resource matching, role matching, constraint matching etc.
There are situations when Marathon cannot fulfill a launch request, since there is no matching offer from Mesos.
Until now, it was very hard for users in this situation to understand why Marathon could not fulfill the request.
This version of Marathon gives insight into the launch process: it analyzes all incoming offers and provides
statistics that make it easy to see why offers were rejected.

The statistics can be fetched via the /v2/queue endpoint. See the REST API Reference.
Marathon shows the offer matching process as a funnel, so it is easy to see how many offers were rejected in which step.
It gives this information for the whole launch attempt as well as the last offer cycle.
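
To make the funnel idea concrete, here is a small sketch that turns per-step rejection counts into funnel rows. The input shape and the step names are hypothetical and chosen for illustration; the actual response format of /v2/queue is described in the REST API Reference.

```python
def funnel(total_offers, rejections):
    """Build funnel rows from an ordered list of (step, rejected_count).

    Each row is (step, offers_entering_step, offers_rejected_in_step).
    Returns the rows and the number of offers that passed every step.
    """
    remaining = total_offers
    rows = []
    for step, rejected in rejections:
        rows.append((step, remaining, rejected))
        remaining -= rejected
    return rows, remaining


# Hypothetical numbers: 10 offers, most of them filtered out along the way.
rows, matched = funnel(10, [("role", 4), ("constraints", 2), ("resources", 3)])
```

Reading the rows top to bottom shows where most offers drop out, which is usually the first thing to check when an app is not starting.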

Improve Deployment logic

Previously, all deployments were restarted from the beginning during a Marathon leader failover.
This can be cumbersome if you have long-running updates and a Marathon failover.
This version of Marathon reconciles the state of a deployment after a failover.
A running deployment is continued on the newly elected leader without being restarted.

Every state change operation via the REST API will now return the deployment identifier as an HTTP response header.
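
A client could pick up the deployment identifier from the response headers roughly as sketched below. The header name `Marathon-Deployment-Id` is an assumption here, since the text above does not name it; verify it against the REST API Reference. The helper `deployment_id` is hypothetical.

```python
# Assumed header name; confirm it in the REST API Reference before relying on it.
DEPLOYMENT_HEADER = "Marathon-Deployment-Id"


def deployment_id(headers):
    """Return the deployment id from a mapping of response headers, or None.

    HTTP header names are case-insensitive, so the lookup ignores case.
    """
    wanted = DEPLOYMENT_HEADER.lower()
    for name, value in headers.items():
        if name.lower() == wanted:
            return value
    return None
```

The returned identifier can then be used to follow the progress of the deployment that the request triggered.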


Deprecate Marathon-based Health Checks

Mesos now supports command-based as well as network-based health checks.
Since those health check types are now also available in Marathon, the Marathon-based health checks are now deprecated.
Do not use health checks with the following protocols: HTTP, HTTPS, and TCP. Instead, use the Mesos equivalents: MESOS_HTTP, MESOS_HTTPS and MESOS_TCP.
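
One way to apply this to existing app definitions is sketched below: it rewrites the protocol of each health check to its Mesos equivalent and leaves everything else untouched. The helper name is hypothetical, and the sketch assumes health checks live under a `healthChecks` list with a `protocol` field; options supported by the two kinds of checks may differ, so consult the Health Check Documentation before migrating.

```python
import copy

# Mapping taken from the deprecation note above.
MESOS_EQUIVALENT = {
    "HTTP": "MESOS_HTTP",
    "HTTPS": "MESOS_HTTPS",
    "TCP": "MESOS_TCP",
}


def use_mesos_health_checks(app):
    """Return a copy of an app definition with Mesos-based health checks.

    Checks with other protocols (or without an explicit protocol) are
    left as they are; the input definition is not modified.
    """
    updated = copy.deepcopy(app)
    for check in updated.get("healthChecks", []):
        protocol = check.get("protocol")
        if protocol in MESOS_EQUIVALENT:
            check["protocol"] = MESOS_EQUIVALENT[protocol]
    return updated
```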

Deprecate Event Callback Subscriptions

Marathon has two ways to subscribe to the internal event bus:

  • HTTP callback events managed via /v2/eventSubscriptions
  • Server-Sent Events via /v2/events (since Marathon 0.9)

We encourage everyone to use the /v2/events SSE stream instead of HTTP Callback listeners.
The event callback subscriptions will be removed in a future version.
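
For consumers moving from callbacks to the SSE stream, the sketch below parses Server-Sent Events framing from an iterable of text lines. It follows the generic SSE wire format (`event:` and `data:` fields, blank line between events) and assumes each data payload is JSON; `parse_sse` is a hypothetical helper, not something Marathon ships.

```python
import json


def parse_sse(lines):
    """Yield (event_type, payload) tuples from decoded SSE lines."""
    event_type, data = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event_type, json.loads("\n".join(data))
            event_type, data = None, []
        elif line.startswith(":"):
            # Comment line, often used as a keep-alive; ignore it.
            continue
        elif line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


# Hand-written sample stream for illustration.
stream = [
    "event: deployment_success\n",
    'data: {"eventType": "deployment_success"}\n',
    "\n",
    ": keep-alive\n",
    "event: instance_health_changed_event\n",
    'data: {"healthy": true}\n',
    "\n",
]
events = list(parse_sse(stream))
```

In practice the lines would come from a streaming HTTP response on /v2/events rather than from a list.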

Forcefully stop a deployment

Deployments in Marathon can be stopped with force.
All actions currently being performed in Marathon will be stopped; the state will not change.
This can lead to an inconsistent state and is dangerous.
We will remove this functionality without replacement.

Removed deprecated command line parameter

Removed the deprecated marathon_store_timeout command line parameter. It had been deprecated since v0.12 and was unused.



@zen-dog released this Dec 2, 2016

Changes from 1.1.4 to 1.1.5

Added a migration that fixes improperly structured app groups. If an app entry is in the wrong group, e.g.
Group( id = /, apps = ["/foo/bar"] ), it will be moved: Group( id = /, Group( id = /foo, apps = ["/foo/bar"] ) ),
so that the groups are structured properly.
Note: if an app has multiple entries with different versions then the newest is kept.
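
The restructuring can be pictured with the sketch below, which rebuilds a group tree so that every app sits in the group matching its path. This is an illustration only, not the code of the actual migration: groups are modeled as plain dicts with hypothetical `id`, `apps` and `groups` keys, apps are represented by their ids, and version handling is left out.

```python
def all_apps(group):
    """Yield every app id found anywhere in the group tree."""
    yield from group.get("apps", [])
    for child in group.get("groups", []):
        yield from all_apps(child)


def place_app(root, app_id):
    """Insert app_id into the group whose id is the app's parent path."""
    group, path = root, ""
    for name in app_id.strip("/").split("/")[:-1]:
        path += "/" + name
        child = next((g for g in group["groups"] if g["id"] == path), None)
        if child is None:
            child = {"id": path, "apps": [], "groups": []}
            group["groups"].append(child)
        group = child
    if app_id not in group["apps"]:
        group["apps"].append(app_id)


def restructure(root):
    """Return a new root group with all apps placed in the proper groups."""
    fixed = {"id": "/", "apps": [], "groups": []}
    for app_id in all_apps(root):
        place_app(fixed, app_id)
    return fixed


broken = {"id": "/", "apps": ["/foo/bar"], "groups": []}
fixed = restructure(broken)
```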