
Marathon has a cap at 2664 apps. #4802

Closed
jasongilanfarr opened this issue Dec 7, 2016 · 14 comments
@jasongilanfarr
Contributor

jasongilanfarr commented Dec 7, 2016

No description provided.

@jasongilanfarr jasongilanfarr added this to the Marathon 1.4 milestone Dec 7, 2016
@jasongilanfarr
Contributor Author

Also of interest, we cap at 115,685 instances for a single app too.

@janisz
Contributor

janisz commented Dec 7, 2016

What happens when you schedule more? Any hypothesis? ZK limits? Actor/thread limits? Is this a global limit across all apps, or per group?

@aquamatthias
Contributor

Please describe your setup and steps to reproduce, as well as your initial insights.

@jasongilanfarr
Contributor Author

This is in the shakedown suite as well. All I'm doing is creating stupidly simple apps in a loop and separately checking the count of items in /v2/tasks and /v2/apps. My current suspects are the launch queue and possibly the instance tracker. Once we hit this magical cap, everything times out with an ask timeout.

@janisz it's unlikely to be ZK; when I tested the new persistence layer I could get to 100k apps.

@jasongilanfarr
Contributor Author

jasongilanfarr commented Dec 9, 2016

Did a bunch more investigation trying to find the root cause and split off #4813 as a blocker. We likely want to just revert, but moved the discussion for that over there.

test script:

for i in {1..5000}; do
http POST :8888/v2/apps <<EOF
{
  "id" : "/simple-app-$i",
  "cmd": "sleep 1000",
  "instances": 1,
  "cpus": 0.01,
  "mem": 1,
  "disk": 0
}
EOF
done

We should also repeat this same test but with, say, 10, 100, and 1000 instances per app. Pre-pods, I was able to get to approximately the following numbers:

10K applications with 400,000 tasks. I don't see a huge reason why we can't get further.

I created https://phabricator.mesosphere.com/D305 with a bunch of random stuff that I was using for investigation. These are just my notes/temporary patches and they had little effective change (the variance without nested groups was roughly +/- 300). My last experiment disabled scaling, which made Marathon half-decent when approaching the cap and sped up deployment, but it didn't affect the cap much. I also disabled some timers out of curiosity (no effect).

I'm using the simulator and I've hacked it a little to resemble a 100-node cluster with basically unlimited resources. I made it send offers every 5 seconds (instead of every second), as this more closely resembles what Mesos actually does, made it respond to LaunchRequests faster, and removed all the probabilities in there.

In addition, I started messing around with some defaults. Some are probably actually good, like making the LaunchQueue ask timeout 10s instead of 1s, but this didn't have a big effect. I also suspected maxTasksPerOffer might have something to do with the problem, but investigating it revealed very little.

I put Kamon in, since we can dive in via JMX to find out about actor queueing, processing time, errors, etc., and found it really handy to combine with the profiling data.

A few things do stand out:

  • InstanceTracker could be reduced in complexity and "easily" have the actor removed. It's essentially a TrieMap[PathId, TrieMap[Instance.Id, Instance]], so readers could look up the latest state directly, and the update processor could be simplified into a KeyedLock[PathId] that serializes updates for a given path id (see the sketch after this list). This would remove the need for the Future-based code that fetches the latest value of the data in the task tracker.
  • MarathonSchedulerActor could use a bit more thought. It really should be using the Akka Scheduler, and scale(), for example, is inefficient and essentially blocks the actor from doing anything else once there is a sufficient number of applications, because all of it takes too long and there is a lot of "hidden" blocking code inside actors.
  • LaunchQueue itself should probably not spawn an actor for every launch; this could easily be represented as either a Future or a Source of TaskState (or the Marathon equivalent).
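
A minimal, self-contained sketch of the first point (the PathId/Instance stand-ins and the KeyedLock shape are simplified assumptions for illustration, not Marathon's actual types):

import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

// Simplified stand-ins for Marathon's PathId / Instance types (assumed for this sketch).
final case class PathId(path: String)
final case class InstanceId(value: String)
final case class Instance(id: InstanceId, state: String)

// Serializes asynchronous updates per key by chaining each update behind the previous one.
final class KeyedLock[K](implicit ec: ExecutionContext) {
  private val tails = TrieMap.empty[K, Future[Any]]
  def apply[T](key: K)(block: => Future[T]): Future[T] = synchronized {
    val previous = tails.getOrElse(key, Future.successful(()))
    val next = previous.recover { case _ => () }.flatMap(_ => block)
    tails.put(key, next)
    next
  }
}

// Actor-free instance tracker: reads hit the TrieMap directly, writes are serialized per app.
final class SimpleInstanceTracker(implicit ec: ExecutionContext) {
  private val instances = TrieMap.empty[PathId, TrieMap[InstanceId, Instance]]
  private val lock = new KeyedLock[PathId]

  def specInstances(appId: PathId): Iterable[Instance] =
    instances.get(appId).map(_.values.toVector).getOrElse(Vector.empty)

  def update(appId: PathId, instance: Instance): Future[Unit] =
    lock(appId) {
      Future {
        instances.getOrElseUpdate(appId, TrieMap.empty).put(instance.id, instance)
        ()
      }
    }
}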

Some other things stand out as well, but I'm at a bit of a loss as to where we queue the updates to launch tokens, etc. There seems to be a lot of contention around updates from Mesos (for example, has Mesos acked a launch request?), and if this is also handled by an actor, it probably should not be.

Other interesting symptoms: as we start to approach the scale cap, Marathon fights with itself, launching tasks and killing them while trying to reach a stable state. Once you hit the cap, more or less nothing works at all (this is somewhat inconsistent, but at the very least all deployment plans time out).

@aquamatthias Could you please have someone (or multiple people) in Hamburg investigate this further and suggest some patches? I did about as much investigation as I was able to handle today. I'm happy to help tomorrow with the little time I have.

@aquamatthias
Contributor

I reproduced the problem on my side with the simulator.

Before I touched the code base, I updated the simulator to not create a timer per task status update.
Patch is here: https://phabricator.mesosphere.com/D311

Problems that popped up:

  1. Offer matching timeout: if all actors are currently processing offers, Marathon can be flooded.
  2. QueuedInstanceInfo is returned with every InstanceUpdate.
  3. OfferProcessorImpl.saveTasks: under load I see a lot of TimeoutReached messages.

1) and 2) are fixed in https://phabricator.mesosphere.com/D309
To avoid running into 3), I adapted the default values (see the Simulator settings below).

Setting for running the Simulator:

$> runMain mesosphere.mesos.simulation.SimulateMesosMain --master foo --max_tasks_per_offer 100  --launch_tokens 1000 --offer_matching_timeout 5000 --save_tasks_to_launch_timeout 10000

After applying the patches to Marathon and the Simulator, I ran the following tests:

  • start one app with 50K tasks
    This runs successfully. It becomes clear that offer matching (InstanceOps/offer) slows down over time.
  • start 5k apps with 1 task each
    I hit a java.lang.OutOfMemoryError: Unable to create new native thread on my OSX machine at about 4100 apps. The number of threads seemed pretty constant at about 80 up to that point.

@aquamatthias
Contributor

The reason I run into thousands of threads in Marathon:

  • Marathon abdicates
  • We do not cleanly tear down all services
  • MarathonSchedulerActor tries to reconcile tasks: it calls InstanceTracker.specInstancesSync which calls Await.result
    Once we stop the process when Marathon abdicates, this should not happen any longer.

The more interesting question is: why does Marathon abdicate?
Answer: there are GC cycles that take longer than 10 seconds!
The leader member node is an ephemeral node that gets deleted server-side when the session becomes invalid. The session is considered expired if it is not renewed within ⅔ of the session timeout, so with the 10-second timeout negotiated below, a single GC pause of roughly 7 seconds or more is enough to lose the session.

What I see in the logs:

[2016-12-12 08:25:53,733] INFO  Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,734] INFO  Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,734] INFO  Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x158d36d984f00ca, negotiated timeout = 10000 (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,735] INFO  State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:ForkJoinPool-2-worker-11-EventThread)
[2016-12-12 08:25:53,749] ERROR error while getting current leader (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-8-thread-1)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /marathon/leader-curator/_c_ffcbcd6f-3e56-42b4-8688-36b8e8395ea1-latch-0000000000

I increased the session timeout to 30 seconds and no longer run into this problem.
While constantly creating apps, Marathon now spends a lot of time in GC.
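
For reference, a minimal sketch of raising a Curator client's session timeout (the connect string and values here are illustrative, not Marathon's actual wiring):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Request a 30s session timeout instead of the 10s negotiated in the logs above.
// Note: the ZooKeeper server may cap the negotiated value via its own
// minSessionTimeout/maxSessionTimeout settings.
val client = CuratorFrameworkFactory.builder()
  .connectString("localhost:2181")
  .sessionTimeoutMs(30000)
  .connectionTimeoutMs(10000)
  .retryPolicy(new ExponentialBackoffRetry(1000, 3))
  .build()

client.start()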

@janisz
Contributor

janisz commented Dec 12, 2016

#4828 fixes the issue mentioned above:

MarathonSchedulerActor tries to reconcile tasks: it calls InstanceTracker.specInstancesSync which calls Await.result

janisz added a commit to janisz/marathon that referenced this issue Dec 12, 2016
janisz added a commit to janisz/marathon that referenced this issue Dec 13, 2016
janisz added a commit to janisz/marathon that referenced this issue Dec 13, 2016
@aquamatthias
Contributor

With the fixes to the RootGroupTree applied, I could launch 5000 apps.
I still see a jump in the number of threads during reconciliation.
The reconciliation logic inside the Simulator is probably too simplistic: it sends all updates in one batch.
The patch from @janisz will help to reduce this number, but it covers only one place; the jump in threads is still there.

@janisz
Contributor

janisz commented Dec 13, 2016

Blocking InstanceTracker calls are also in DeploymentActor. This is fixed in #4788, but to reduce threads we also need to replace the blocking calls in TaskStartActor.
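
To illustrate the pattern being replaced (not the actual Marathon code; the trait and method names below are assumptions for the example): a synchronous lookup parks a thread in Await.result, while the async variant composes on the Future instead.

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

final case class Instance(id: String)

// Hypothetical tracker interface with an asynchronous lookup.
trait Tracker {
  def specInstances(appId: String): Future[Seq[Instance]]
}

object BlockingVsAsync {
  // Blocking style: every caller parks a thread until the result arrives.
  // If this happens once per app during reconciliation, threads pile up.
  def countSync(tracker: Tracker, appId: String): Int =
    Await.result(tracker.specInstances(appId), 5.seconds).size

  // Non-blocking style: compose on the Future; no thread is parked while waiting.
  def countAsync(tracker: Tracker, appId: String)(implicit ec: ExecutionContext): Future[Int] =
    tracker.specInstances(appId).map(_.size)
}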

@aquamatthias
Contributor

I deployed 8000 apps successfully on my machine:
Time: 1 hour
ThreadCount: 114

This was only possible without task reconciliation.
Next step would be to look into task reconciliation.

@aquamatthias
Contributor

The root cause of the excessive thread creation is the reconciliation logic for health checks: blocking code that is executed once per application.
I fixed this in D333.

After this patch is applied I could deploy 8000 apps successfully in the simulator with a reconciliation interval of 2 minutes (to force the original problem early).
Time: 1 hour
ThreadCount: 139
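
For illustration only (this is not the actual D333 change; the types and names are assumed): instead of looping over applications and blocking on each health check reconciliation, the per-app futures can be traversed and combined asynchronously.

import scala.concurrent.{ExecutionContext, Future}

final case class App(id: String)

object HealthReconciliation {
  // Hypothetical per-app reconciliation step that completes asynchronously.
  def reconcileApp(app: App)(implicit ec: ExecutionContext): Future[Unit] =
    Future.successful(())

  // One combined Future for all apps; no thread is parked per application.
  def reconcileAll(apps: Seq[App])(implicit ec: ExecutionContext): Future[Unit] =
    Future.traverse(apps)(app => reconcileApp(app)).map(_ => ())
}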

@aquamatthias
Contributor

@jasongilanfarr please retest with D333 applied.

@jasongilanfarr
Contributor Author

We hit 10,258 apps with 10 instances each using the simulator against latest master. Took like 4 hours... but success.

aquamatthias pushed a commit that referenced this issue Dec 21, 2016
jeschkies pushed a commit that referenced this issue Jan 2, 2017
* Use async TaskTracker in SchedulerActor (#4828)

Related to #4802, #3031, #4693

* Do not block during reconciliation of health checks.

The MarathonHealthCheckManager used the sync version of the
InstanceTracker.

* Use asynchronous call to TaskTracker

* Use async InstanceTracker in DeploymentActor

* Use async TaskTracker in TaskKiller

* Fixes updating healthchecks

* Fixes logging.

* Handle spaces in arguments correctly.

Test Plan: Start marathon via this script with spaces in arguments.

Reviewers: lukas, jeschkies

Reviewed By: lukas, jeschkies

Subscribers: jenkins, marathon-team

Differential Revision: https://phabricator.mesosphere.com/D324

* Kill tasks before start new
@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017