Respect minimumHealthCapacity during leader election #7115
Conversation
Force-pushed from e5e173d to 5eab3bc.
Resolved (outdated) review comment on src/main/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActor.scala
Resolved (outdated) review comment on src/main/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActor.scala
Force-pushed from fbc2fe1 to b110b92.
Thanks @timcharper for your suggestions, I've implemented them and repushed.
// in addition to a spec which passed validation, we require:
require(runSpec.instances > 0, s"instances must be > 0 but is ${runSpec.instances}")
require(runningInstancesCount >= 0, s"running instances count must be >= 0 but is $runningInstancesCount")
require(consideredHealthyInstancesCount >= 0, s"considered healthy instances count must be >= 0 but is $consideredHealthyInstancesCount")
Hmm, with this change it's possible that all instances are unhealthy and this actor will crash now. We'll need to handle zero appropriately. @jeschkies can you weigh in?
You're right.
From what I understand, this requirement was a safety check: 0 running instances meant there was no task to replace (only tasks to start, which is handled by a different deployment step).
We can now have 0 healthy instances but >0 running instances.
I would naively tend to remove that assertion (the following code will lead to the creation of a RestartStrategy with nrToKillImmediately = 0).
Tasks from the old app version will be killed later as we receive healthy-instance messages.
What do you think?
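For illustration, here is a minimal sketch of what that relaxed precondition could look like; the names (RestartStrategy, nrToKillImmediately, consideredHealthyInstancesCount) mirror the discussion, but this is a simplified stand-in rather than the actual Marathon code:

```scala
// Simplified stand-in for the discussion above, not Marathon's real implementation.
case class RestartStrategy(nrToKillImmediately: Int, maxCapacity: Int)

def computeRestartStrategy(
    targetInstances: Int,
    consideredHealthyInstancesCount: Int,
    minimumHealthCapacity: Double,
    maximumOverCapacity: Double): RestartStrategy = {
  require(targetInstances > 0, s"instances must be > 0 but is $targetInstances")
  // No lower bound beyond zero on healthy instances: zero healthy instances simply
  // means nothing can be killed immediately, instead of crashing the actor.
  require(consideredHealthyInstancesCount >= 0)

  val minHealthy  = math.ceil(minimumHealthCapacity * targetInstances).toInt
  val maxCapacity = targetInstances + math.ceil(maximumOverCapacity * targetInstances).toInt
  val nrToKillImmediately = math.max(0, consideredHealthyInstancesCount - minHealthy)
  RestartStrategy(nrToKillImmediately, maxCapacity)
}
```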
Hmpf, this code is the most brittle we have in all of Marathon. @kamaradclimber, could you give an example with, let's say, three instances to illustrate the issue you've encountered? I don't quite understand the issue you're trying to fix.
Sure!
Let's assume an application with 3 instances. Instances take 15 minutes to pass their health checks.
The application configuration has a minimumHealthCapacity of 0.6 (allowing 1 instance to be killed by Marathon during deployments) and a maximumOverCapacity of 0 (never go beyond 3 instances).
- T+0m: a deployment of the application is initiated. We have 3 instances: A1, B1, C1
- T+0m: Marathon kills one instance (A1) and starts a new one (A2), which is not yet healthy
- T+5m: the Marathon leader disappears and a new Marathon leader is elected
- T+5m: the new Marathon leader kills B1 and spawns B2.
At this moment, minimumHealthCapacity is violated because the state is the following: A2 (not healthy), B2 (not healthy), C1 (healthy).
Later (at T+15m), A2 will pass its health checks and the deployment will continue.
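To make the arithmetic explicit, here is the same scenario as a small worked example (illustrative numbers only, not actual Marathon code):

```scala
// Worked numbers for the scenario above (illustration only).
val target = 3
val minimumHealthCapacity = 0.6
val minHealthy = math.ceil(minimumHealthCapacity * target).toInt // 2 instances must stay healthy

// State seen by the new leader at T+5m: A2 (not healthy yet), B1 (healthy), C1 (healthy)
val running = 3
val healthy = 2

// Health-ignoring behaviour (roughly the old logic): 3 - 2 = 1 instance is killed
// immediately (B1), leaving only C1 healthy and violating minimumHealthCapacity.
val killIgnoringHealth = running - minHealthy                // 1

// Health-aware behaviour: 2 - 2 = 0, so nothing is killed until A2 becomes healthy.
val killRespectingHealth = math.max(0, healthy - minHealthy) // 0
```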
Thanks. That helps a lot. So, it seems we do not reconcile old instances but only already started instances; see TaskReplaceActor:201 and ReadinessBehaviour:195. I actually came across this recently and was not sure which entity actually sets instance.state.healthy. We have Mesos and Marathon health checks. I know for sure that Mesos health checks alter the instance.state.healthy state. However, I don't know if Marathon health checks do the same. If they do, we could simplify ReadinessBehaviour and the ignition strategy. Maybe that's too much for now.
Just so that we are not altering the logic too much: what do you think about moving the health filtering into the ignition strategy? Pass the seq instead of an integer.
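A hypothetical sketch of that suggestion, with simplified types that are not Marathon's actual Instance model or strategy API: instead of handing the strategy a pre-computed count, pass the instances themselves and let it do the health filtering.

```scala
// Hypothetical, simplified types — not the actual Marathon API.
final case class Instance(id: String, healthy: Option[Boolean])

// Before: the strategy received a pre-computed running-instances count (an Int).
// After:  it receives the instances themselves and filters on health internally.
def nrToKillImmediately(targetInstances: Int,
                        minimumHealthCapacity: Double,
                        instances: Seq[Instance]): Int = {
  val consideredHealthy = instances.count(_.healthy.contains(true))
  val minHealthy = math.ceil(minimumHealthCapacity * targetInstances).toInt
  math.max(0, consideredHealthy - minHealthy)
}
```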
Any specific reason to move the health filtering logic into this method?
Yes. A lot of the other logic depends on activeInstances. I would rather not pull the health checks in there. If we received updates for health changes and had them attached to the instances, it would be a different story.
I'm not sure I see the logic depending on activeInstances (apart from tests).
Oh, you are right. I mixed it up with instancesAlreadyStarted.
Could we list the remaining actions to do on this PR? I'm a bit lost.
> Could we list the remaining actions to do on this PR? I'm a bit lost.

It's all good from my side. All you need to do is remove the unused imports I've listed below so that the build passes.
I'm building your change at jenkins-marathon-pipelines-PR-7115-4.
✗ Build of #7115 failed.
See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//testReport) for details. Error message:
Stage Compile and Test failed.
(๑′°︿°๑)
There are quite a few unused imports. That's why the build fails.
Force-pushed from b110b92 to a2696f5.
Good catch, thanks. I've pushed a fix. I'm using …
Unfortunately we cannot enable an error for unused imports unless all warnings are treated as errors. That's why we simply check the logs after the build: https://github.com/mesosphere/marathon/blob/master/ci/pipeline#L77. This is super dirty.
Let me know if I can fix anything else.
I'm building your change at jenkins-marathon-pipelines-PR-7115-6.
✗ Build of #7115 failed.
See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//testReport) for details. Error message:
Stage Compile and Test failed.
(๑′°︿°๑)
Can anyone give me the error (or tell me how to reproduce it)?
The following test fails. Here is the log file: PR-7115-6.log.tar.gz
Thanks, I'll work to reproduce this failing test on my machine and fix it.
Before this patch, during a deployment, a leader election would lead Marathon to not respect the minimumHealthCapacity parameter. The reason is that TaskReplaceActor used to ignore instance healthiness when considering instances that could be killed "immediately" upon actor start. We now respect minimumHealthCapacity by taking healthy instances into account.
This patch required adding a property "this app has configured health checks" to be able to distinguish the following cases:
- app has HC but we don't have a signal yet for a given instance
- app has HC and we have a signal of healthiness for a given instance
- app has no HC
It also brings the ability to fix the TODO at https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/instance/Instance.scala#L267 in a future patch, since we now know the difference between no HC and no information about HC.
Fixes: MARATHON-8716
Change-Id: Ia7b11cbb22f86967b49298f774c6d27fc01a6e58
Signed-off-by: Grégoire Seux <g.seux@criteo.com>
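As a rough illustration of the "this app has configured health checks" distinction described in the commit message — the type and the counting rule below are made up for the example and are not the actual Marathon model:

```scala
// Illustrative only: one way to model the three cases the commit message lists.
sealed trait ConsideredHealth
case object NoHealthChecksDefined extends ConsideredHealth // app has no HC at all
case object HealthUnknown         extends ConsideredHealth // app has HC but no signal yet for this instance
case object Healthy               extends ConsideredHealth // app has HC and the instance reported healthy

// One plausible counting rule (an assumption, not Marathon's documented behaviour):
// instances without health checks count as healthy for capacity purposes, while
// instances whose health is still unknown do not.
def countsTowardsMinimumHealthCapacity(h: ConsideredHealth): Boolean = h match {
  case NoHealthChecksDefined | Healthy => true
  case HealthUnknown                   => false
}
```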
I think I understand why this test is failing. The scenario is the following:
Marathon is facing the following conditions for the deployment of v4:
minimumHealthCapacity is 1 (the default value) and maximumOverCapacity is 1. The number of instances defined in the configuration is 1. This allows Marathon to move between 1 and 2 instances. Our new code says we cannot kill any instance immediately (because we have 1 healthy instance and we cannot go below 1). We cannot spawn a new one either because of the maximumOverCapacity. Even if we allowed Marathon to kill one instance (which is what used to happen before my patch), the killed one might be the only healthy instance. I'll try to find a solution to those issues.
Previous patch made TaskReplaceActor preStart safer by preventing it from killing too many tasks if we were already below minimumHealthCapacity. Sadly, it led to a blocking situation in cases where an app had more running instances than its target instance count under the following conditions:
- minimumHealthCapacity is at the default value (1), maxOverCapacity is the default (1)
- there are 2 instances running: 1 healthy + 1 non-healthy
This situation may arise in the case of a buggy deployment where new instances are not healthy. Upon a forced deployment, the situation described above is a realistic scenario. We now correctly destroy enough instances to get back to the target instance count. A later commit will also pick instances safely (to kill unhealthy instances first).
Change-Id: Ic6d8e5b55dd796644f1f8444d30914ad3db37c51
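A worked sketch of the blocking situation and of the direction this commit takes, with illustrative numbers; how the two kill rules are combined below is an assumption for the example, not the exact Marathon formula:

```scala
// Illustrative numbers for the blocking scenario in the commit message above.
val target  = 1
val minimumHealthCapacity = 1.0 // default
val maximumOverCapacity   = 1.0 // default
val running = 2                 // 1 healthy old instance + 1 unhealthy new one
val healthy = 1

val minHealthy  = math.ceil(minimumHealthCapacity * target).toInt        // 1
val maxCapacity = target + math.ceil(maximumOverCapacity * target).toInt // 2

// Health-only rule: 1 - 1 = 0 instances may be killed, yet running == maxCapacity,
// so no new instance can be started either -> the deployment is stuck.
val killByHealth = math.max(0, healthy - minHealthy)                     // 0

// The commit's fix: also destroy enough instances to get back to the target count,
// so the deployment can make progress again.
val killToReachTarget   = math.max(0, running - target)                  // 1
val nrToKillImmediately = math.max(killByHealth, killToReachTarget)      // 1
```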
Force-pushed from a2696f5 to 854c2a1.
I've implemented a first version to correct this behavior. Feedback is welcome (I'm running tests on my machine to see if anything breaks). EDIT: tests are not passing yet. I'll dig into that tomorrow.
During deployment, we now prefer to kill unhealthy instances first. For the following scenario:
- the application targets 1 instance and has the default upgrade strategy
- there is 1 healthy instance and 1 unhealthy instance (the new version of the app)
- the deployment is stuck (the new instance has unlimited retries on its health check)
- the user creates a deployment with the force flag
The previous commit allows us to kill 1 instance to put us back in a "nominal situation" where we have 1 instance (and thus can spawn a new instance while respecting a maximumOverCapacity of 1). With this patch, we will prefer to kill the unhealthy instance (the buggy version) instead of killing the only healthy instance (which would have left us with 0 healthy instances, below minimumHealthCapacity).
Change-Id: I38203ab79f599574e3e9536a49031f6edf36d9a2
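A small hedged sketch of the "kill unhealthy instances first" ordering; the types and the selection helper are simplified stand-ins, not Marathon's actual selection logic:

```scala
// Simplified, illustrative types — not the actual Marathon implementation.
final case class Instance(id: String, healthy: Option[Boolean])

// When n instances must be killed, pick unhealthy (or health-unknown) ones first so
// that the only healthy instance is never sacrificed if it can be avoided.
def pickInstancesToKill(instances: Seq[Instance], n: Int): Seq[Instance] =
  instances.sortBy(_.healthy.contains(true)).take(n) // false (unhealthy/unknown) sorts before true

// Example matching the commit message: 1 healthy + 1 unhealthy, 1 instance to kill.
val toKill = pickInstancesToKill(
  Seq(Instance("old-healthy", Some(true)), Instance("new-unhealthy", Some(false))),
  n = 1
) // -> Seq(Instance("new-unhealthy", Some(false)))
```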
Force-pushed from 854c2a1 to 3951981.
Tests are now passing on my side.
I'm building your change at jenkins-marathon-pipelines-PR-7115-8.
✔ Build of #7115 completed successfully.
See details at jenkins-marathon-pipelines-PR-7115-8.
You can create a DC/OS with your patched Marathon by creating a new pull request with the following changes in buildinfo.json:
"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.9.128-3951981c1/marathon-1.9.128-3951981c1.tgz",
"sha1": "87b850b78323d79dfa59cb9701a92bcafd75175d"
You can run system integration test changes of this PR against Marathon master:
The job will report back to this PR.
\\ ٩( ᐛ )و //
@timcharper could you take another look? @kamaradclimber, the tests that failed on your machine only run on Linux.
Thank you both for your reviews. What should I do to get it merged?
@kamaradclimber, nothing. I landed it. Thanks for your endurance.
Thank you both for your advice. My teammates are likely to submit another patch soon for a similar bug we have observed in production.
This is great news! Which version of Marathon are you running?
We are still running 1.6.x (with a few custom patches to support a network bandwidth resource and a safety feature that avoids killing too many instances if all instances of an app stop passing their health checks).
It seems this PR only works for applications with Mesos health checks. A simple application with only Marathon health checks and maximumOverCapacity set to 0 cannot be updated anymore with this patch. To reproduce:
When TaskReplaceActor starts, it looks at the instanceTracker to count healthy instances. It seems that the instanceTracker is not aware of the instances' health-check status (for Marathon health checks), so no instance is considered healthy.
I suggest reverting my PR since the fix for this bug does not seem obvious to me. I'm thinking about it, but any guidance would be appreciated. (I can also open a proper ticket if you prefer.)
After discussion with my teammates, here is how I understand the issue. My PR initially used the instanceTracker to get the list of instances (and eventually their health status). The instance tracker only knows about Mesos status. There are two scenarios that we should handle. Scenario 1:
Scenario 2:
To cope with scenario 1, my suggestion is for the TaskReplaceActor to send a message to another actor (the HealthCheckActor) to ask for the list of healthy instances. Based on that information and the instanceTracker (to deal with Mesos health checks), we can decide how many instances to kill immediately. Scenario 2 introduces an additional challenge since, at leader startup, the HealthCheckActor does not have a complete view of all instance health statuses. This would also allow us to deal with another bug (not yet reported) that was present in Marathon (before my PR) due to https://github.com/mesosphere/marathon/blob/v1.8.232/src/main/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActor.scala#L194. At this line, we kill an old instance whenever we receive an event about another instance being ready/healthy.
To summarize, the TaskReplaceActor should have the following behavior:
What do you think?
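A minimal sketch of the ask-based query described above, assuming hypothetical message types (GetHealthyInstances / HealthyInstances) and Akka's standard ask pattern; the real Marathon actor protocol may look quite different:

```scala
// Hypothetical message protocol and helper — illustrative only, not Marathon's API.
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.{ExecutionContext, Future}
import scala.concurrent.duration._

final case class GetHealthyInstances(appId: String)
final case class HealthyInstances(instanceIds: Set[String])

// Combine Marathon-health knowledge (health-check actor) with Mesos-health knowledge
// (instance tracker) before deciding how many instances may be killed immediately.
def consideredHealthyIds(healthCheckActor: ActorRef,
                         mesosHealthyIds: Set[String],
                         appId: String)(implicit ec: ExecutionContext): Future[Set[String]] = {
  implicit val timeout: Timeout = 5.seconds
  (healthCheckActor ? GetHealthyInstances(appId))
    .mapTo[HealthyInstances]
    .map(reply => mesosHealthyIds ++ reply.instanceIds)
}
```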
I really do not like that we keep the health results in a separate actor. Ideally the instances would be updated in the instance tracker with their health results. This would simplify things quite a bit.
Before this patch, during a deployment, a leader election would lead Marathon to not respect the minimumHealthCapacity parameter.
The reason is that TaskReplaceActor used to ignore instance healthiness when considering instances that could be killed "immediately" upon actor start.
We now respect minimumHealthCapacity by taking healthy instances into account.
This patch required adding a property "this app has configured health checks" to be able to distinguish the following cases:
- app has HC but we don't have a signal yet for a given instance
- app has HC and we have a signal of healthiness for a given instance
- app has no HC
It also brings the ability to fix the TODO at
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/instance/Instance.scala#L267
in a future patch, since we now know the difference between no HC and no information about HC.
Fixes: MARATHON-8716
Change-Id: Ia7b11cbb22f86967b49298f774c6d27fc01a6e58
Signed-off-by: Grégoire Seux g.seux@criteo.com