This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Respect minimumHealthCapacity during leader election #7115

Merged
merged 3 commits into from
Jan 30, 2020

Conversation

kamaradclimber
Contributor

Before this patch, a leader election during a deployment would lead
Marathon to not respect the minimumHealthCapacity parameter.
The reason is that TaskReplaceActor used to ignore instance healthiness when
considering instances that could be killed "immediately" upon actor
start.

We now respect minimumHealthCapacity by taking healthy instances
into account.

This patch required adding a property "this app has configured
health checks" to be able to distinguish the following cases (sketched below):

  • app has HC but we don't have a signal yet for the given instance
  • app has HC and we have a healthiness signal for the given instance
  • app has no HC
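
A minimal sketch of that distinction (illustrative types only, not the patch's actual representation):

// Illustrative only: the three states the patch needs to tell apart.
sealed trait InstanceHealthView
case object NoHealthChecksDefined extends InstanceHealthView // app has no HC
case object HealthSignalPending extends InstanceHealthView // app has HC but no signal yet for this instance
final case class HealthSignal(healthy: Boolean) extends InstanceHealthView // app has HC and we have a signal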

It also brings the ability to fix the TODO at
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/instance/Instance.scala#L267
in a future patch, since we now know the difference between "no HC" and "no
information about HC".

Fixes: MARATHON-8716
Change-Id: Ia7b11cbb22f86967b49298f774c6d27fc01a6e58
Signed-off-by: Grégoire Seux g.seux@criteo.com

@kamaradclimber kamaradclimber force-pushed the marathon-8716 branch 2 times, most recently from fbc2fe1 to b110b92 on January 14, 2020 at 13:59
@kamaradclimber
Contributor Author

Thanks @timcharper for your suggestions, I've implemented them and re-pushed.

// in addition to a spec which passed validation, we require:
require(runSpec.instances > 0, s"instances must be > 0 but is ${runSpec.instances}")
require(runningInstancesCount >= 0, s"running instances count must be >= 0 but is $runningInstancesCount")
require(consideredHealthyInstancesCount >= 0, s"considered healthy instances count must be >= 0 but is $consideredHealthyInstancesCount")
Contributor

Hmm, with this change it's possible that all instances are unhealthy and this actor will crash now. We'll need to handle zero appropriately. @jeschkies can you weigh in?

Contributor Author

You're right.
From what I understand, this requirement was a safety check because 0 running instances meant there was no task to replace (only tasks to start, which is handled by a different deployment step).

We can now have 0 healthy instances but > 0 running instances.

I would naively tend to remove that assertion (the following code will lead to the creation of a RestartStrategy with nrToKillImmediately = 0).
Tasks from the old app version will be killed later as we receive healthy-instance messages.
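
To illustrate, a minimal sketch (hypothetical helper, simplified; not the exact RestartStrategy code): with 0 healthy instances the pre-start computation simply degrades to killing nothing immediately.

import scala.math.ceil

// Simplified: how many old instances can be killed right away without
// dropping below minimumHealthCapacity.
def nrToKillImmediately(targetInstances: Int, minimumHealthCapacity: Double, consideredHealthyInstances: Int): Int = {
  val minHealthy = ceil(targetInstances * minimumHealthCapacity).toInt
  math.max(0, consideredHealthyInstances - minHealthy)
}

nrToKillImmediately(3, 0.6, 0) // 0: nothing is killed immediately; old tasks
                               // are replaced later as healthy-instance events arrive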

What do you think?

Contributor

Hmpf, this code is the most brittle we have in all of Marathon. @kamaradclimber, could you give an example with, let's say, three instances to illustrate the issue you've encountered? I don't quite understand the issue you're trying to fix.

Contributor Author

Sure!

Let's assume an application with 3 instances. Instances take 15 minutes to pass their health checks.
The application configuration has a minimumHealthCapacity of 0.6 (allowing 1 instance to be killed by Marathon during deployments) and a maximumOverCapacity of 0 (never go beyond 3 instances).

  • T+0m: a deployment of the application is initiated. We have 3 instances: A1, B1, C1
  • T+0m: Marathon kills one instance (A1) and starts a new one (A2), which is not healthy
  • T+5m: the Marathon leader disappears, a new Marathon leader is elected
  • T+5m: the new Marathon leader kills B1, spawns B2

At this moment, minimumHealthCapacity is violated because the state is the following: A2 (not healthy), B2 (not healthy), C1 (healthy).

Later (at T+15m), A2 will pass its health checks and the deployment will continue.
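
For reference, a quick sketch of the arithmetic (hypothetical helper values, not Marathon's actual code):

import scala.math.ceil

val targetInstances = 3
val minimumHealthCapacity = 0.6
val minHealthy = ceil(targetInstances * minimumHealthCapacity).toInt // 2 instances must stay healthy

// With all 3 instances healthy, at most 3 - minHealthy = 1 may be killed immediately.
// After the new leader kills B1 at T+5m, only C1 is healthy: 1 < minHealthy,
// so minimumHealthCapacity is violated until A2 or B2 passes its health checks.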

Contributor

Thanks. That helps a lot. So, it seems we do not reconcile old instances but only already-started instances. See TaskReplaceActor:201 and ReadinessBehaviour:195. I actually came across this recently and was not sure which entity actually sets instance.state.healthy. We have Mesos and Marathon health checks. I know for sure that Mesos health checks alter the instance.state.healthy state. However, I don't know if Marathon health checks do the same. If they do, we could simplify ReadinessBehaviour and the ignition strategy. Maybe that's too much for now.

Just so that we are not altering the logic too much: what do you think about moving the health filtering to the ignition strategy? Pass the seq instead of an integer.

Contributor

@jeschkies jeschkies Jan 15, 2020

Any specific reason to move the health filtering logic into this method?

Yes. A lot of the other logic depends on activeInstances. I would rather not pull in the health checks there. If we received updates for health changes and had them attached to the instances, it would be a different story.

Contributor Author

I'm not sure I see the logic depending on activeInstances (apart from tests).

Contributor

Oh, you are right. I mixed it up with instancesAlreadyStarted.

Contributor Author

Could we list the remaining actions to do on this PR? I'm a bit lost.

Contributor

Could we list the remaining actions to do on this PR? I'm a bit lost.

It's all good from my side. All you need to do is remove the unused imports I've listed below so that the build passes.

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-4.

@mesosphere-ci mesosphere-ci left a comment

✗ Build of #7115 failed.

See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//testReport) for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@jeschkies
Contributor

There are quite a few unused imports

[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:20:43: Unused import
[warn] import mesosphere.marathon.core.task.Task.Status
[warn]                                           ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:21:44: Unused import
[warn] import mesosphere.marathon.core.task.state.NetworkInfo
[warn]                                            ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:33: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                 ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:50: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                  ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:58: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                          ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:69: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                                     ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:29:55: Unused import
[warn] import org.scalatest.concurrent.PatienceConfiguration.Timeout
[warn]     

That's why the build fails.

@kamaradclimber
Contributor Author

There are quite a few unused imports

Good catch, thanks. I've pushed a fix. I'm using sbt universal:packageZipTarball to build; what was the command to see those warnings? (I'm not familiar at all with sbt.)

@jeschkies
Contributor

Good catch, thanks. I've pushed a fix. I'm using sbt universal:packageZipTarball to build; what was the command to see those warnings? (I'm not familiar at all with sbt.)

Unfortunately we could not enable an error for unused imports without making all warnings errors. That's why we simply check the logs after the build: https://github.com/mesosphere/marathon/blob/master/ci/pipeline#L77. This is super dirty.

@kamaradclimber
Contributor Author

Let me know if I can fix anything else

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-6.

@mesosphere-ci mesosphere-ci left a comment

✗ Build of #7115 failed.

See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//testReport) for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 23, 2020 via email

@jeschkies
Contributor

The following test fails:
mesosphere.marathon.integration.GroupDeployIntegrationTest.GroupDeployment should An upgrade in progress cannot be interrupted without force

Here is the log file PR-7115-6.log.tar.gz

@kamaradclimber
Contributor Author

Thanks, I'll work to reproduce this failing test on my machine and fix it.

@kamaradclimber
Contributor Author

I think I understand why this test is failing.
The setup is pretty simple: the app has 1 instance and 1 health check.

The scenario is the following:

  • v1 of the app is deployed and passes its health check
  • a buggy v2 is deployed (not passing its health check) => the deployment never finishes
  • a v3 is deployed without force => triggers a conflict
  • a v4 is deployed with force (and without the bug)

Marathon is facing the following conditions for the deployment of v4:

  • 1 healthy task in v1
  • 1 unhealthy task in v2

minimumHealthCapacity is 1 (the default value) and maximumOverCapacity is 1. The number of instances defined in the configuration is 1. This allows Marathon to move between 1 and 2 instances (see the sketch after this comment).

Our new code says we cannot kill any instance immediately (because we have 1 healthy instance and we cannot go below 1). We cannot spawn a new one either because of maximumOverCapacity.

Even if we allowed Marathon to kill one instance (which is what used to happen before my patch), the killed one might be the only healthy instance.
I think it should systematically prefer killing unhealthy instances (of the old version).

I'll try to find a solution to those issues.
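
A worked sketch of the bounds in this stuck case (hypothetical helper names, not Marathon's actual code):

import scala.math.{ceil, floor}

// Bounds implied by the upgrade strategy for a given target instance count.
def capacityBounds(instances: Int, minimumHealthCapacity: Double, maximumOverCapacity: Double): (Int, Int) = {
  val minHealthy = ceil(instances * minimumHealthCapacity).toInt
  val maxCapacity = floor(instances * (1 + maximumOverCapacity)).toInt
  (minHealthy, maxCapacity)
}

// The GroupDeployIntegrationTest case: 1 target instance, defaults 1.0 / 1.0.
val (minHealthy, maxCapacity) = capacityBounds(1, 1.0, 1.0) // (1, 2)
// Running: 1 healthy (old version) + 1 unhealthy (buggy version) = 2 = maxCapacity,
// and only 1 healthy instance = minHealthy, so counting health we can neither
// kill an instance nor start a new one: the deployment is stuck.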

The previous patch made TaskReplaceActor's preStart safer by preventing it
from killing too many tasks if we were already below minimumHealthCapacity.

Sadly it led to a blocking situation in cases where an app had more
running instances than its target instance count under the following
conditions:
- minimumHealthCapacity is at its default value (1), maximumOverCapacity is
at its default (1)
- there are 2 instances running: 1 healthy + 1 unhealthy

This situation may arise in case of a buggy deployment where new
instances are not healthy. Upon a forced deployment, the situation described
above is a realistic scenario.

We now correctly destroy enough instances to get back to the target instance
count. A later commit will also pick instances safely (to kill unhealthy
instances first).

Change-Id: Ic6d8e5b55dd796644f1f8444d30914ad3db37c51
@kamaradclimber
Contributor Author

kamaradclimber commented Jan 28, 2020

I've implemented a first version to correct this behavior. Feedback is welcome (I'm running tests on my machine to see if anything breaks).

EDIT: tests are not passing yet. I'll dig into that tomorrow.

During deployment, we now prefer to kill unhealthy instances first.

Consider the following scenario:
- the application targets 1 instance and has the default upgrade strategy
- there is 1 healthy instance and 1 unhealthy instance (the new version of
the app)
- the deployment is stuck (the new instance retries its health check
indefinitely)
- the user creates a deployment with the force flag

The previous commit allows us to kill 1 instance to put us back in the "nominal
situation" where we have 1 instance (and thus can spawn a new instance
while respecting the maximumOverCapacity of 1).

With this patch, we will prefer to kill the unhealthy instance (the
buggy version) instead of the only healthy instance (which would
have left us with 0 healthy instances, below minimumHealthCapacity).

Change-Id: I38203ab79f599574e3e9536a49031f6edf36d9a2
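
A minimal sketch of that ordering (hypothetical types, not the actual TaskReplaceActor code): unhealthy old instances sort first when choosing which to kill.

// Hypothetical, simplified instance model for illustration.
final case class OldInstance(id: String, healthy: Option[Boolean])

// Instances whose health is false or unknown sort before healthy ones.
def pickInstancesToKill(oldInstances: Seq[OldInstance], count: Int): Seq[OldInstance] =
  oldInstances.sortBy(i => i.healthy.contains(true)).take(count)

val old = Seq(OldInstance("v1-healthy", Some(true)), OldInstance("v2-buggy", Some(false)))
pickInstancesToKill(old, 1) // picks v2-buggy, keeping the only healthy instance alive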
@kamaradclimber
Contributor Author

Tests are now passing on my side.
To be precise, I have exactly the same tests failing on my machine before and after my patches:

[error] Error: Total 148, Failed 3, Errors 5, Passed 140
[error] Failed tests:
[error] 	mesosphere.marathon.integration.AppDeployIntegrationTest
[error] 	mesosphere.marathon.integration.ResidentTaskIntegrationTest
[error] 	mesosphere.marathon.integration.GpuSchedulingIntegrationTest
[error] Error during tests:
[error] 	mesosphere.marathon.integration.SharedMemoryIntegrationTest
[error] 	mesosphere.marathon.integration.SeccompIntegrationTest
[error] 	mesosphere.marathon.integration.DockerAppIntegrationTest
[error] 	mesosphere.marathon.integration.UpgradeIntegrationTest
[error] 	mesosphere.marathon.integration.MesosAppIntegrationTest

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-8.

@mesosphere-ci mesosphere-ci left a comment

✔ Build of #7115 completed successfully.


See details at jenkins-marathon-pipelines-PR-7115-8.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.9.128-3951981c1/marathon-1.9.128-3951981c1.tgz",
"sha1": "87b850b78323d79dfa59cb9701a92bcafd75175d"

You can run system integration test changes of this PR against Marathon master:

The job will report back to this PR.

\\ ٩( ᐛ )و //

@jeschkies
Contributor

@timcharper could you take another look? @kamaradclimber, the tests that failed on your machine only run on Linux.

@kamaradclimber
Contributor Author

Thank you both for your reviews. What should I do to get it merged?

@jeschkies jeschkies merged commit e044bd7 into d2iq-archive:master Jan 30, 2020
@jeschkies
Contributor

@kamaradclimber, nothing. I landed it. Thanks for your endurance.

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 31, 2020 via email

@jeschkies
Contributor

My teammates are likely to submit another patch soon for a similar bug we have observed in production.

This is great news! Which version of Marathon are you running?

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 31, 2020

We are still running 1.6.x (with a few custom patches to support a network bandwidth resource and a safety feature to avoid killing too many instances if all instances of an app stop passing their health checks).
However, the bug I'm referring to (we should open a ticket in Jira to describe it) affects 1.8 and master as well.

@kamaradclimber
Contributor Author

It seems this PR only works for applications with Mesos health checks.

A simple application with only Marathon health checks and maximumOverCapacity set to 0 cannot be updated anymore with this patch.

To reproduce:

  • app has a simple health check and 10 instances
  • maximumOverCapacity is 0
  • minimumHealthCapacity is set to any value > 0 (0.5 for instance)

When TaskReplaceActor starts, it looks at the instance tracker to count healthy instances. It seems that the instance tracker is not aware of instance health check status (for Marathon health checks), so no instance is considered healthy.
As a result, TaskReplaceActor does not kill any task "immediately" and is blocked forever.
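
A sketch of the counting problem (assuming, as in the Instance.scala TODO referenced above, that instance health is an Option[Boolean] populated only from Mesos status updates):

// Simplified stand-in for what the instance tracker exposes.
final case class InstanceState(healthy: Option[Boolean])
final case class TrackedInstance(id: String, state: InstanceState)

def consideredHealthyCount(instances: Seq[TrackedInstance]): Int =
  instances.count(_.state.healthy.contains(true))

// With only Marathon health checks, the tracker never gets a health signal,
// so every instance reports healthy = None:
val instances = (1 to 10).map(i => TrackedInstance(s"app.$i", InstanceState(healthy = None)))
consideredHealthyCount(instances) // 0 => nothing may be killed "immediately", the deployment blocks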

@kamaradclimber
Contributor Author

I suggest reverting my PR, since the fix for this bug does not seem obvious to me. I'm thinking about it, but any guidance would be appreciated.

(I can also open a proper ticket if you prefer.)

@kamaradclimber
Contributor Author

After discussion with my teammates, here is how I understand the issue.

My PR initially used the instance tracker to get the list of instances (and their health status). The instance tracker only knows about Mesos status.
The only way to know the health status of instances (with Marathon health checks) is to be an actor and subscribe to health events.

There are two scenarios that we should handle.
In both cases we consider an app with maximumOverCapacity at 0 and 1 Marathon health check.

Scenario 1:

  • the Marathon leader has been running for a while
  • the app has been running for a while, all instances have a known health status
  • a new version of the app is being deployed

Scenario 2:

  • a new version of the app is being deployed
  • the Marathon leader just crashed and a new one is taking over

To cope with scenario 1, my suggestion is for the TaskReplaceActor to send a message to another actor (the HealthCheckActor) to ask for a list of healthy instances. Based on that information and the instance tracker (to deal with Mesos health checks), we can decide how many instances to kill immediately.

Scenario 2 introduces an additional challenge since, at leader startup, the HealthCheckActor does not have a complete view of all instance health statuses.
The proposal for scenario 1 would fall short since we would decide not to kill any instance.
To cope with this, I propose that the TaskReplaceActor periodically ask the HealthCheckActor for the healthy-instance count and launch instance killing based on that information.

This would also allow us to deal with another bug (not yet reported) that was present in Marathon (before my PR) due to https://github.com/mesosphere/marathon/blob/v1.8.232/src/main/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActor.scala#L194. At this line, we kill an old instance whenever we receive an event about another instance being ready/healthy.
This can be dangerous (putting the app below minimumHealthCapacity) for two reasons:

  • another task might have died for some other reason (and its eventual replacement might not be started/healthy yet)
  • we can receive the event of a given task being healthy several times (that's the condition we've observed in production)

To summarize, the TaskReplaceActor should have the following behavior (see the sketch after this list):

  • at startup: do nothing
  • periodically (at a regular interval and upon task-healthy events): send a message to the HealthCheckActor to get the list of healthy instances and, based on that information, kill some tasks if it is safe
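
A rough sketch of that loop (hypothetical actor and message names, assuming Akka 2.6's Timers API; the real HealthCheckActor protocol may differ):

import scala.concurrent.duration._
import akka.actor.{Actor, ActorRef, Timers}

case object Tick
final case class GetHealthyInstances(appId: String)
final case class HealthyInstances(ids: Set[String])

class ReplacementLoop(healthCheckActor: ActorRef, appId: String, minHealthy: Int) extends Actor with Timers {
  // At startup: do nothing except schedule the periodic check.
  timers.startTimerWithFixedDelay("tick", Tick, 5.seconds)

  def receive: Receive = {
    case Tick =>
      healthCheckActor ! GetHealthyInstances(appId)
    case HealthyInstances(ids) =>
      // Decide from the *current* healthy count whether killing an old
      // instance is safe, instead of reacting to individual healthy events.
      if (ids.size > minHealthy) {
        // kill one old instance here
      }
  }
}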

What do you think?

@jeschkies
Contributor

To cope with scenario 1, my suggestion is for the TaskReplaceActor to send a message to another actor (the HealthCheckActor) to ask for a list of healthy instances.

I really do not like that we keep the health results in a separate actor. Ideally the instances would be updated in the instance tracker with their health results. This would simplify things quite a bit.

Lqp1 pushed a commit to criteo-forks/marathon that referenced this pull request Jun 30, 2020

komuta pushed a commit to criteo-forks/marathon that referenced this pull request Aug 31, 2020

komuta pushed a commit to criteo-forks/marathon that referenced this pull request Aug 31, 2020

Lqp1 pushed a commit to criteo-forks/marathon that referenced this pull request Sep 16, 2020