Marathon has a cap at 2664 apps. #4802
Comments
Also of interest: we cap at 115,685 instances for a single app too.
What happens when you schedule more? Any hypotheses? ZK limits? Actor/thread limits? Is this a global limit across all apps, or on apps in one group?
Please describe your setup and steps to reproduce, as well as your initial insights.
This is in the shakedown suite as well. All I'm doing is creating stupidly simple apps in a loop and separately checking the count of items in /v2/tasks and /v2/apps. My current suspect is the launch queue and possibly the instance tracker. Once we hit this magical cap, everything times out with an ask timeout. @janisz it's unlikely to be ZK; when I tested the new persistence layer I could get to 100k apps.
Did a bunch more investigation trying to find the root cause and split off #4813 as a blocker. We likely want to just revert, but I moved the discussion for that over there. Test script:
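The script itself isn't reproduced above, but based on the description earlier in the thread (stupidly simple apps created in a loop, counts checked via /v2/apps and /v2/tasks), a minimal hypothetical sketch might look like the following; the app payload, the /cap-test ID prefix, and the localhost:8080 endpoint are all assumptions:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object CapRepro {
  val base   = "http://localhost:8080" // assumed local Marathon
  val client = HttpClient.newHttpClient()

  def post(path: String, json: String): Int = {
    val req = HttpRequest.newBuilder(URI.create(base + path))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(json))
      .build()
    client.send(req, HttpResponse.BodyHandlers.ofString()).statusCode()
  }

  def get(path: String): String =
    client.send(
      HttpRequest.newBuilder(URI.create(base + path)).GET().build(),
      HttpResponse.BodyHandlers.ofString()
    ).body()

  def main(args: Array[String]): Unit =
    for (i <- 0 until 5000) {
      // A "stupidly simple" app: one sleeping task with minimal resources.
      val app =
        s"""{"id": "/cap-test/app-$i", "cmd": "sleep 3600",
           | "cpus": 0.01, "mem": 16.0, "instances": 1}""".stripMargin
      val status = post("/v2/apps", app)
      if (i % 100 == 0) {
        // Crude count: occurrences of our ID prefix in the /v2/apps response.
        val appsSeen = "\"/cap-test/".r.findAllIn(get("/v2/apps")).length
        println(s"created=$i lastStatus=$status appsSeen=$appsSeen")
      }
    }
}
```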
We should also repeat this same thing with, say, 10K applications and 400,000 tasks. I don't see a huge reason why we can't get further.

I created https://phabricator.mesosphere.com/D305 with a bunch of random stuff that I was using for investigation. These are just my notes/temporary patches and made little effective change (the variance without nested groups was roughly +/- 300). My last experiment disabled scaling, which made Marathon half-decent when you approached the cap and sped up deployment, but it didn't affect the cap much. I also disabled some timers out of curiosity (no effect).

I'm using the simulator and have hacked it a little to resemble a 100-node cluster with basically unlimited resources, and made it send offers every 5 seconds (instead of every second), since this more closely resembles what Mesos actually does. I also made it respond to LaunchRequests faster and removed all the probabilities in there.

In addition, I started messing around with some defaults; some are probably actually good, like making the LaunchQueue ask timeout 10s instead of 1s. This didn't have a big effect. Investigating whether maxTasksPerOffer had something to do with the problem revealed very little.

I put Kamon in so we can dive in via JMX to find out about actor queueing, processing time, errors, etc., and found it really handy to combine with the profile data. A few things do stand out:
Some other things stand out as well, but I'm at a bit of a loss as to where we queue the updates to launch tokens, etc. There seems to be a lot of contention for updates from Mesos (for example, has Mesos acked a launch request?), and if this is also an actor, it should probably not be.

Other interesting symptoms: as we start to approach the scale cap, Marathon is fighting with itself, launching tasks and killing them while trying to reach a stable state. And once you get to the scale cap, more or less nothing works at all (this is somewhat inconsistent, but at the very least all deployment plans time out).

@aquamatthias Could you please have someone (or multiple people) in Hamburg investigate this further and suggest some patches? I did about as much investigation as I was able to handle today. I'm happy to help tomorrow with the little time I have.
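For reference, the LaunchQueue ask-timeout tweak mentioned above is the standard Akka ask pattern; a sketch with a placeholder actor handle and message, not Marathon's actual wiring:

```scala
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._

object LaunchQueueTuning {
  // Hypothetical handle to the launch queue actor.
  def queuedCount(launchQueue: ActorRef): Future[Int] = {
    // 10s instead of 1s: an overloaded actor gets a chance to answer
    // rather than failing every caller with an AskTimeoutException.
    implicit val timeout: Timeout = Timeout(10.seconds)
    (launchQueue ? "count").mapTo[Int] // placeholder message and reply type
  }
}
```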
I reproduced the problem on my side with the simulator. Before I touched the code base, I updated the simulator so it does not create a timer per task status update. Problems that popped up:
Settings for running the Simulator:
After applying the patches to Marathon and the Simulator, I did the following tests:
The reason I run into thousands of threads in Marathon:
The more interesting question is: why does Marathon abdicate? What I see in the logs:
I increased the session timeout (to 30 seconds).
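For illustration, here is how a 30-second session timeout could be set on a Curator-based ZooKeeper client; whether Marathon exposes this via a flag or builds the client itself isn't shown here, so treat the connection string and retry policy as assumptions:

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object ZkSessionConfig {
  // A longer session timeout lets the leader survive pauses (GC, thread storms)
  // without ZooKeeper expiring the session and forcing abdication.
  val zkClient = CuratorFrameworkFactory.builder()
    .connectString("zk1:2181,zk2:2181,zk3:2181") // assumed ensemble
    .sessionTimeoutMs(30000)                     // 30 seconds
    .retryPolicy(new ExponentialBackoffRetry(1000, 3))
    .build()

  def main(args: Array[String]): Unit = zkClient.start()
}
```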
#4828 fixes the issue mentioned above:
After the fixes to the RootGroupTree were applied, I could launch 5000 apps.
Blocking InstanceTracker calls are also in DeploymentActor. This is fixed in #4788, but to reduce threads we also need to replace the blocking calls in TaskStartActor (see the sketch below).
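To make the thread issue concrete, here is a sketch of the sync-vs-async pattern; the trait and method names are modeled on, not copied from, Marathon's InstanceTracker API:

```scala
import akka.actor.{Actor, ActorLogging}
import akka.pattern.pipe
import scala.concurrent.Future

// Hypothetical slice of the tracker API, for illustration only.
trait Tracker {
  def specInstancesSync(appId: String): Seq[String]     // blocks the calling thread
  def specInstances(appId: String): Future[Seq[String]] // asynchronous
}

class StartActor(tracker: Tracker, appId: String) extends Actor with ActorLogging {
  import context.dispatcher

  private case class InstancesLoaded(instances: Seq[String])

  // Anti-pattern: every blocked call pins a thread for the duration of the
  // lookup; thousands of concurrent deployments means thousands of threads.
  private def countBlocking(): Int = tracker.specInstancesSync(appId).size

  // Preferred: stay asynchronous and pipe the result back as a message.
  override def preStart(): Unit =
    tracker.specInstances(appId).map(InstancesLoaded).pipeTo(self)

  def receive: Receive = {
    case InstancesLoaded(instances) =>
      log.info(s"${instances.size} instances for $appId")
  }
}
```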
I deployed 8000 apps successfully on my machine. This was only possible without task reconciliation.
The root cause of creating too many threads is the reconciliation logic for health checks. After this patch is applied, I could deploy 8000 apps successfully in the simulator with a reconciliation interval of 2 minutes (to force the original problem early).
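The shape of the fix, sketched with illustrative names (not the actual MarathonHealthCheckManager code): chain Futures across the reconciliation round instead of awaiting each app's instances, so no thread is parked per app:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Illustrative stand-ins for Marathon's real types.
final case class AppId(value: String)
trait Tracker { def specInstances(app: AppId): Future[Seq[String]] }

class HealthCheckReconciler(tracker: Tracker)(implicit ec: ExecutionContext) {

  private def reconcileApp(app: AppId): Future[Unit] =
    tracker.specInstances(app).map { instances =>
      // ... diff running health checks against live instances, add/remove as needed
    }

  // One Future chain for the whole round: no Await, no thread parked per app,
  // regardless of whether there are 80 apps or 8000.
  def reconcile(apps: Seq[AppId]): Future[Unit] =
    Future.traverse(apps)(reconcileApp).map(_ => ())
}
```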
@jasongilanfarr please retest with D333 applied.
We hit 10,258 apps with 10 instances each using the simulator against latest master. Took like 4 hours... but success.
* Use async TaskTracker in SchedulerActor (#4828). Related to #4802, #3031, #4693.
  * Do not block during reconciliation of health checks. The MarathonHealthCheckManager used the sync version of the InstanceTracker.
  * Use asynchronous calls to TaskTracker.
  * Use async InstanceTracker in DeploymentActor.
  * Use async TaskTracker in TaskKiller.
  * Fixes updating health checks.
  * Fixes logging.
  * Handle spaces in arguments correctly.

  Test Plan: Start Marathon via this script with spaces in arguments.
  Reviewers: lukas, jeschkies
  Reviewed By: lukas, jeschkies
  Subscribers: jenkins, marathon-team
  Differential Revision: https://phabricator.mesosphere.com/D324
* Kill tasks before starting new ones.