This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

Respect minimumHealthCapacity during leader election #7115

Merged
merged 3 commits into from
Jan 30, 2020

Conversation

kamaradclimber
Contributor

Before this patch, a leader election during a deployment would lead
Marathon to not respect the minimumHealthCapacity parameter.
The reason is that TaskReplaceActor used to ignore instance healthiness when
considering instances that could be killed "immediately" upon actor
start.

We now respect minimumHealthCapacity by taking healthy instances
into account.

This patch required adding a property "this app has configured
health checks" to be able to distinguish the following cases (sketched below):

  • app has HC but we don't have a signal yet for the given instance
  • app has HC and we have a healthiness signal for the given instance
  • app has no HC
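
A minimal sketch of that distinction (illustrative types only, not the patch's actual representation):

// Illustrative only: the three states the patch needs to tell apart.
sealed trait InstanceHealthView
case object NoHealthChecksDefined extends InstanceHealthView // app has no HC
case object HealthSignalPending extends InstanceHealthView // app has HC but no signal yet for this instance
final case class HealthSignal(healthy: Boolean) extends InstanceHealthView // app has HC and we have a signal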

It also brings the ability to fix the TODO at
https://github.com/mesosphere/marathon/blob/master/src/main/scala/mesosphere/marathon/core/instance/Instance.scala#L267
in a future patch, since we now know the difference between "no HC" and "no
information about HC".

Fixes: MARATHON-8716
Change-Id: Ia7b11cbb22f86967b49298f774c6d27fc01a6e58
Signed-off-by: Grégoire Seux g.seux@criteo.com

@kamaradclimber kamaradclimber force-pushed the marathon-8716 branch 2 times, most recently from fbc2fe1 to b110b92 on January 14, 2020 at 13:59
@kamaradclimber
Contributor Author

Thanks @timcharper for your suggestions, I've implemented them and re-pushed.

// in addition to a spec which passed validation, we require:
require(runSpec.instances > 0, s"instances must be > 0 but is ${runSpec.instances}")
require(runningInstancesCount >= 0, s"running instances count must be >= 0 but is $runningInstancesCount")
require(consideredHealthyInstancesCount >= 0, s"considered healthy instances count must be >= 0 but is $consideredHealthyInstancesCount")
Contributor

Hmm, with this change it's possible that all instances are unhealthy and this actor will crash now. We'll need to handle zero appropriately. @jeschkies can you weigh in?

Contributor Author

You're right.
From what I understand, this requirement was a safety check because 0 running instances meant there was no task to replace (only tasks to start, which is handled by a different deployment step).

We can now have 0 healthy instances but > 0 running instances.

I would naively tend to remove that assertion (the following code will lead to the creation of a RestartStrategy with nrToKillImmediately = 0).
Tasks from the old app version will be killed later as we receive healthy-instance messages.
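
To illustrate, a minimal sketch (hypothetical helper, simplified; not the exact RestartStrategy code): with 0 healthy instances the pre-start computation simply degrades to killing nothing immediately.

import scala.math.ceil

// Simplified: how many old instances can be killed right away without
// dropping below minimumHealthCapacity.
def nrToKillImmediately(targetInstances: Int, minimumHealthCapacity: Double, consideredHealthyInstances: Int): Int = {
  val minHealthy = ceil(targetInstances * minimumHealthCapacity).toInt
  math.max(0, consideredHealthyInstances - minHealthy)
}

nrToKillImmediately(3, 0.6, 0) // 0: nothing is killed immediately; old tasks
                               // are replaced later as healthy-instance events arrive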

What do you think?

Contributor

Hmpf, this code is the most brittle we have in all of Marathon. @kamaradclimber, could you give an example with, let's say, three instances to illustrate the issue you've encountered? I don't quite understand the issue you're trying to fix.

Contributor Author

Sure!

Let's assume an application with 3 instances. Instances take 15 minutes to pass their health checks.
The application configuration has a minimumHealthCapacity of 0.6 (allowing 1 instance to be killed by Marathon during deployments) and a maximumOverCapacity of 0 (never go beyond 3 instances).

  • T+0m: a deployment of the application is initiated. We have 3 instances: A1, B1, C1
  • T+0m: Marathon kills one instance (A1) and starts a new one (A2), which is not healthy
  • T+5m: the Marathon leader disappears, a new Marathon leader is elected
  • T+5m: the new Marathon leader kills B1, spawns B2

At this moment, minimumHealthCapacity is violated because the state is the following: A2 (not healthy), B2 (not healthy), C1 (healthy).

Later (at T+15m), A2 will pass its health checks and the deployment will continue.
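
For reference, a quick sketch of the arithmetic (hypothetical helper values, not Marathon's actual code):

import scala.math.ceil

val targetInstances = 3
val minimumHealthCapacity = 0.6
val minHealthy = ceil(targetInstances * minimumHealthCapacity).toInt // 2 instances must stay healthy

// With all 3 instances healthy, at most 3 - minHealthy = 1 may be killed immediately.
// After the new leader kills B1 at T+5m, only C1 is healthy: 1 < minHealthy,
// so minimumHealthCapacity is violated until A2 or B2 passes its health checks.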

Contributor

Thanks. That helps a lot. So, it seems we do not reconcile old instances but only already-started instances. See TaskReplaceActor:201 and ReadinessBehaviour:195. I actually came across this recently and was not sure which entity actually sets instance.state.healthy. We have Mesos and Marathon health checks. I know for sure that Mesos health checks alter the instance.state.healthy state. However, I don't know if Marathon health checks do the same. If they do, we could simplify ReadinessBehaviour and the ignition strategy. Maybe that's too much for now.

Just so that we are not altering the logic too much: what do you think about moving the health filtering to the ignition strategy? Pass the seq instead of an integer.

Contributor

@jeschkies jeschkies Jan 15, 2020

Any specific reason to move the health filtering logic into this method?

Yes. A lot of the other logic depends on activeInstances. I would rather not pull in the health checks there. If we received updates for health changes and had them attached to the instances, it would be a different story.

Contributor Author

I'm not sure I see the logic depending on activeInstances (apart from tests).

Contributor

Oh, you are right. I mixed it up with instancesAlreadyStarted.

Contributor Author

Could we list the remaining actions to do on this PR? I'm a bit lost.

Contributor

Could we list the remaining actions to do on this PR? I'm a bit lost.

It's all good from my side. All you need to do is remove the unused imports I've listed below so that the build passes.

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-4.

@mesosphere-ci mesosphere-ci left a comment

✗ Build of #7115 failed.

See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/4//testReport) for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@jeschkies
Contributor

There are quite a few unused imports

[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:20:43: Unused import
[warn] import mesosphere.marathon.core.task.Task.Status
[warn]                                           ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:21:44: Unused import
[warn] import mesosphere.marathon.core.task.state.NetworkInfo
[warn]                                            ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:33: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                 ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:50: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                  ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:58: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                          ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:28:69: Unused import
[warn] import org.apache.mesos.Protos.{CheckStatusInfo, TaskID, TaskState, TaskStatus}
[warn]                                                                     ^
[warn] /home/admin/workspace/marathon-pipelines_PR-7115/src/test/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActorTest.scala:29:55: Unused import
[warn] import org.scalatest.concurrent.PatienceConfiguration.Timeout
[warn]     

That's why the build fails.

@kamaradclimber
Contributor Author

There are quite a few unused imports

Good catch, thanks. I've pushed a fix. I'm using sbt universal:packageZipTarball to build; what was the command to see those warnings? (I'm not familiar at all with sbt.)

@jeschkies
Contributor

Good catch, thanks. I've pushed a fix. I'm using sbt universal:packageZipTarball to build; what was the command to see those warnings? (I'm not familiar at all with sbt.)

Unfortunately we could not enable an error for unused imports without making all warnings errors. That's why we simply check the logs after the build: https://github.com/mesosphere/marathon/blob/master/ci/pipeline#L77. This is super dirty.

@kamaradclimber
Contributor Author

Let me know if I can fix anything else

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-6.

@mesosphere-ci mesosphere-ci left a comment

✗ Build of #7115 failed.

See the [logs](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//console) and [test results](https://jenkins.mesosphere.com/service/jenkins/job/marathon-pipelines/job/PR-7115/6//testReport) for details.

Error message:

Stage Compile and Test failed.

(๑′°︿°๑)

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 23, 2020 via email

@jeschkies
Contributor

The following test fails:
mesosphere.marathon.integration.GroupDeployIntegrationTest.GroupDeployment should An upgrade in progress cannot be interrupted without force

Here is the log file PR-7115-6.log.tar.gz

@kamaradclimber
Contributor Author

Thanks, I'll work to reproduce this failing test on my machine and fix it.

@kamaradclimber
Contributor Author

I think I understand why this test is failing.
The setup is pretty simple: the app has 1 instance and 1 health check.

The scenario is the following:

  • v1 of the app is deployed and passes its health check
  • a buggy v2 is deployed (not passing its health check) => the deployment never finishes
  • a v3 is deployed without force => triggers a conflict
  • a v4 is deployed with force (and without the bug)

Marathon is facing the following conditions for the deployment of v4:

  • 1 healthy task in v1
  • 1 unhealthy task in v2

minimumHealthCapacity is 1 (the default value) and maximumOverCapacity is 1. The number of instances defined in the configuration is 1. This allows Marathon to move between 1 and 2 instances (see the sketch after this comment).

Our new code says we cannot kill any instance immediately (because we have 1 healthy instance and we cannot go below 1). We cannot spawn a new one either because of maximumOverCapacity.

Even if we allowed Marathon to kill one instance (which is what used to happen before my patch), the killed one might be the only healthy instance.
I think it should systematically prefer killing unhealthy instances (of the old version).

I'll try to find a solution to those issues.
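
A worked sketch of the bounds in this stuck case (hypothetical helper names, not Marathon's actual code):

import scala.math.{ceil, floor}

// Bounds implied by the upgrade strategy for a given target instance count.
def capacityBounds(instances: Int, minimumHealthCapacity: Double, maximumOverCapacity: Double): (Int, Int) = {
  val minHealthy = ceil(instances * minimumHealthCapacity).toInt
  val maxCapacity = floor(instances * (1 + maximumOverCapacity)).toInt
  (minHealthy, maxCapacity)
}

// The GroupDeployIntegrationTest case: 1 target instance, defaults 1.0 / 1.0.
val (minHealthy, maxCapacity) = capacityBounds(1, 1.0, 1.0) // (1, 2)
// Running: 1 healthy (old version) + 1 unhealthy (buggy version) = 2 = maxCapacity,
// and only 1 healthy instance = minHealthy, so counting health we can neither
// kill an instance nor start a new one: the deployment is stuck.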

The previous patch made TaskReplaceActor's preStart safer by preventing it
from killing too many tasks if we were already below minimumHealthCapacity.

Sadly it led to a blocking situation in cases where an app had more
running instances than its target instance count under the following
conditions:
- minimumHealthCapacity is at its default value (1), maximumOverCapacity is
at its default (1)
- there are 2 instances running: 1 healthy + 1 unhealthy

This situation may arise in case of a buggy deployment where new
instances are not healthy. Upon a forced deployment, the situation described
above is a realistic scenario.

We now correctly destroy enough instances to get back to the target instance
count. A later commit will also pick instances safely (to kill unhealthy
instances first).

Change-Id: Ic6d8e5b55dd796644f1f8444d30914ad3db37c51
@kamaradclimber
Contributor Author

kamaradclimber commented Jan 28, 2020

I've implemented a first version to correct this behavior. Feedback is welcome (I'm running tests on my machine to see if anything breaks).

EDIT: tests are not passing yet. I'll dig into that tomorrow.

During deployment, we now prefer to kill unhealthy instances first.

Consider the following scenario:
- the application targets 1 instance and has the default upgrade strategy
- there is 1 healthy instance and 1 unhealthy instance (the new version of
the app)
- the deployment is stuck (the new instance retries its health check
indefinitely)
- the user creates a deployment with the force flag

The previous commit allows us to kill 1 instance to put us back in the "nominal
situation" where we have 1 instance (and thus can spawn a new instance
while respecting the maximumOverCapacity of 1).

With this patch, we will prefer to kill the unhealthy instance (the
buggy version) instead of the only healthy instance (which would
have left us with 0 healthy instances, below minimumHealthCapacity).

Change-Id: I38203ab79f599574e3e9536a49031f6edf36d9a2
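
A minimal sketch of that ordering (hypothetical types, not the actual TaskReplaceActor code): unhealthy old instances sort first when choosing which to kill.

// Hypothetical, simplified instance model for illustration.
final case class OldInstance(id: String, healthy: Option[Boolean])

// Instances whose health is false or unknown sort before healthy ones.
def pickInstancesToKill(oldInstances: Seq[OldInstance], count: Int): Seq[OldInstance] =
  oldInstances.sortBy(i => i.healthy.contains(true)).take(count)

val old = Seq(OldInstance("v1-healthy", Some(true)), OldInstance("v2-buggy", Some(false)))
pickInstancesToKill(old, 1) // picks v2-buggy, keeping the only healthy instance alive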
@kamaradclimber
Contributor Author

Tests are now passing on my side.
To be precise, I have exactly the same tests failing on my machine before and after my patches:

[error] Error: Total 148, Failed 3, Errors 5, Passed 140
[error] Failed tests:
[error] 	mesosphere.marathon.integration.AppDeployIntegrationTest
[error] 	mesosphere.marathon.integration.ResidentTaskIntegrationTest
[error] 	mesosphere.marathon.integration.GpuSchedulingIntegrationTest
[error] Error during tests:
[error] 	mesosphere.marathon.integration.SharedMemoryIntegrationTest
[error] 	mesosphere.marathon.integration.SeccompIntegrationTest
[error] 	mesosphere.marathon.integration.DockerAppIntegrationTest
[error] 	mesosphere.marathon.integration.UpgradeIntegrationTest
[error] 	mesosphere.marathon.integration.MesosAppIntegrationTest

@mesosphere-ci mesosphere-ci left a comment

I'm building your change at jenkins-marathon-pipelines-PR-7115-8.

@mesosphere-ci mesosphere-ci left a comment

✔ Build of #7115 completed successfully.


See details at jenkins-marathon-pipelines-PR-7115-8.

You can create a DC/OS with your patched Marathon by creating a new pull
request with the following changes in buildinfo.json:

"url": "https://s3.amazonaws.com/downloads.mesosphere.io/marathon/builds/1.9.128-3951981c1/marathon-1.9.128-3951981c1.tgz",
"sha1": "87b850b78323d79dfa59cb9701a92bcafd75175d"

You can run system integration test changes of this PR against Marathon master:

The job will report back to this PR.

\\ ٩( ᐛ )و //

@jeschkies
Contributor

@timcharper could you take another look? @kamaradclimber, the tests that failed on your machine only run on Linux.

@kamaradclimber
Contributor Author

Thank you both for your reviews. What should I do to get it merged?

@jeschkies jeschkies merged commit e044bd7 into d2iq-archive:master Jan 30, 2020
@jeschkies
Contributor

@kamaradclimber, nothing. I landed it. Thanks for your endurance.

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 31, 2020 via email

@jeschkies
Contributor

My teammates are likely to submit another patch soon for a similar bug we have observed in production.

This is great news! Which version of Marathon are you running?

@kamaradclimber
Contributor Author

kamaradclimber commented Jan 31, 2020

We are still running 1.6.x (with a few custom patches to support a network bandwidth resource and a safety feature to avoid killing too many instances if all instances of an app stop passing their health checks).
However, the bug I'm referring to (we should open a ticket in Jira to describe it) affects 1.8 and master as well.

@kamaradclimber
Contributor Author

It seems this PR only works for applications with Mesos health checks.

A simple application with only Marathon health checks and maximumOverCapacity set to 0 cannot be updated anymore with this patch.

To reproduce:

  • app has a simple health check and 10 instances
  • maximumOverCapacity is 0
  • minimumHealthCapacity is set to any value > 0 (0.5 for instance)

When TaskReplaceActor starts, it looks at the instance tracker to count healthy instances. It seems that the instance tracker is not aware of instance health check status (for Marathon health checks), so no instance is considered healthy.
As a result, TaskReplaceActor does not kill any task "immediately" and is blocked forever.
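
A sketch of the counting problem (assuming, as in the Instance.scala TODO referenced above, that instance health is an Option[Boolean] populated only from Mesos status updates):

// Simplified stand-in for what the instance tracker exposes.
final case class InstanceState(healthy: Option[Boolean])
final case class TrackedInstance(id: String, state: InstanceState)

def consideredHealthyCount(instances: Seq[TrackedInstance]): Int =
  instances.count(_.state.healthy.contains(true))

// With only Marathon health checks, the tracker never gets a health signal,
// so every instance reports healthy = None:
val instances = (1 to 10).map(i => TrackedInstance(s"app.$i", InstanceState(healthy = None)))
consideredHealthyCount(instances) // 0 => nothing may be killed "immediately", the deployment blocks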

@kamaradclimber
Contributor Author

I suggest reverting my PR, since the fix for this bug does not seem obvious to me. I'm thinking about it, but any guidance would be appreciated.

(I can also open a proper ticket if you prefer.)

@kamaradclimber
Contributor Author

After discussion with my teammates, here is how I understand the issue.

My PR initially used the instance tracker to get the list of instances (and their health status). The instance tracker only knows about Mesos status.
The only way to know the health status of instances (with Marathon health checks) is to be an actor and subscribe to health events.

There are two scenarios that we should handle.
In both cases we consider an app with maximumOverCapacity at 0 and 1 Marathon health check.

Scenario 1:

  • the Marathon leader has been running for a while
  • the app has been running for a while, all instances have a known health status
  • a new version of the app is being deployed

Scenario 2:

  • a new version of the app is being deployed
  • the Marathon leader just crashed and a new one is taking over

To cope with scenario 1, my suggestion is for the TaskReplaceActor to send a message to another actor (the HealthCheckActor) to ask for a list of healthy instances. Based on that information and the instance tracker (to deal with Mesos health checks), we can decide how many instances to kill immediately.

Scenario 2 introduces an additional challenge since, at leader startup, the HealthCheckActor does not have a complete view of all instance health statuses.
The proposal for scenario 1 would fall short since we would decide not to kill any instance.
To cope with this, I propose that the TaskReplaceActor periodically ask the HealthCheckActor for the healthy-instance count and launch instance killing based on that information.

This would also allow us to deal with another bug (not yet reported) that was present in Marathon (before my PR) due to https://github.com/mesosphere/marathon/blob/v1.8.232/src/main/scala/mesosphere/marathon/core/deployment/impl/TaskReplaceActor.scala#L194. At this line, we kill an old instance whenever we receive an event about another instance being ready/healthy.
This can be dangerous (putting the app below minimumHealthCapacity) for two reasons:

  • another task might have died for some other reason (and its eventual replacement might not be started/healthy yet)
  • we can receive the event of a given task being healthy several times (that's the condition we've observed in production)

To summarize, the TaskReplaceActor should have the following behavior (see the sketch after this list):

  • at startup: do nothing
  • periodically (at a regular interval and upon task-healthy events): send a message to the HealthCheckActor to get the list of healthy instances and, based on that information, kill some tasks if it is safe
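
A rough sketch of that loop (hypothetical actor and message names, assuming Akka 2.6's Timers API; the real HealthCheckActor protocol may differ):

import scala.concurrent.duration._
import akka.actor.{Actor, ActorRef, Timers}

case object Tick
final case class GetHealthyInstances(appId: String)
final case class HealthyInstances(ids: Set[String])

class ReplacementLoop(healthCheckActor: ActorRef, appId: String, minHealthy: Int) extends Actor with Timers {
  // At startup: do nothing except schedule the periodic check.
  timers.startTimerWithFixedDelay("tick", Tick, 5.seconds)

  def receive: Receive = {
    case Tick =>
      healthCheckActor ! GetHealthyInstances(appId)
    case HealthyInstances(ids) =>
      // Decide from the *current* healthy count whether killing an old
      // instance is safe, instead of reacting to individual healthy events.
      if (ids.size > minHealthy) {
        // kill one old instance here
      }
  }
}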

What do you think?

@jeschkies
Contributor

To cope with scenario 1, my suggestion is for the TaskReplaceActor to send a message to another actor (the HealthCheckActor) to ask for a list of healthy instances.

I really do not like that we keep the health results in a separate actor. Ideally the instances would be updated in the instance tracker with their health results. This would simplify things quite a bit.

Lqp1 pushed a commit to criteo-forks/marathon that referenced this pull request Jun 30, 2020

komuta pushed a commit to criteo-forks/marathon that referenced this pull request Aug 31, 2020

komuta pushed a commit to criteo-forks/marathon that referenced this pull request Aug 31, 2020

Lqp1 pushed a commit to criteo-forks/marathon that referenced this pull request Sep 16, 2020