
Marathon has a cap at 2664 apps. #4802

Closed
jasongilanfarr opened this issue Dec 7, 2016 · 14 comments
@jasongilanfarr
Contributor

jasongilanfarr commented Dec 7, 2016

No description provided.

@jasongilanfarr jasongilanfarr added this to the Marathon 1.4 milestone Dec 7, 2016
@jasongilanfarr
Contributor Author

Also of interest, we cap at 115,685 instances for a single app too.

@janisz
Contributor

janisz commented Dec 7, 2016

What happens when you schedule more? Any hypothesis? ZK limits? Actor/thread limits? Is this a global limit across all apps, or per group?

@aquamatthias
Contributor

Please describe your setup and steps to reproduce, as well as your initial insights.

@jasongilanfarr
Contributor Author

This is in the shakedown suite as well. All I'm doing is creating stupidly simple apps in a loop and separately checking the count of items in /v2/tasks and /v2/apps. My current suspects are the launch queue and possibly the instance tracker. Once we hit this magical cap, everything times out with an ask timeout.

@janisz it's unlikely to be ZK; when I tested the new persistence layer I could get to 100k apps.

@jasongilanfarr
Contributor Author

jasongilanfarr commented Dec 9, 2016

Did a bunch more investigation trying to find the root cause and split off #4813 as a blocker. We likely want to just revert, but moved the discussion for that over there.

test script:

for i in {1..5000}; do
http POST :8888/v2/apps <<EOF
{
  "id" : "/simple-app-$i",
  "cmd": "sleep 1000",
  "instances": 1,
  "cpus": 0.01,
  "mem": 1,
  "disk": 0
}
EOF
done

We should also repeat this same test but with, say, 10, 100, and 1000 instances per app. Pre-pods, I was able to get to approximately the following numbers:

10K applications with 400,000 tasks. I don't see a huge reason why we can't get further.

I created https://phabricator.mesosphere.com/D305 with a bunch of random stuff that I was using for investigation. These are just my notes/temporary patches and they had little effective change (the variance without nested groups was roughly +/- 300). My last experiment disabled scaling, which made Marathon half-decent when approaching the cap and sped up deployment, but it didn't affect the cap much. I also disabled some timers out of curiosity (no effect).

I'm using the simulator and I've hacked it a little to resemble a 100-node cluster with basically unlimited resources. I made it send offers every 5 seconds (instead of every second), as this more closely resembles what Mesos actually does, made it respond to LaunchRequests faster, and removed all the probabilities in there.

In addition, I started messing around with some defaults. Some are probably actually good, like making the LaunchQueue ask timeout 10s instead of 1s, but this didn't have a big effect. I also suspected maxTasksPerOffer might have something to do with the problem, but investigating it revealed very little.

I put Kamon in, since we can dive in via JMX to find out about actor queueing, processing time, errors, etc., and found it really handy to combine with the profiling data.

A few things do stand out:

  • InstanceTracker could be reduced in complexity and "easily" have the actor removed. It's essentially a TrieMap[PathId, TrieMap[Instance.Id, Instance]], so readers could look up the latest state directly, and the update processor could be simplified into a KeyedLock[PathId] that serializes updates for a given path id (see the sketch after this list). This would remove the need for the Future-based code that fetches the latest value of the data in the task tracker.
  • MarathonSchedulerActor could use a bit more thought. It really should be using the Akka Scheduler, and scale(), for example, is inefficient and essentially blocks the actor from doing anything else once there is a sufficient number of applications, because all of it takes too long and there is a lot of "hidden" blocking code inside actors.
  • LaunchQueue itself should probably not spawn an actor for every launch; this could easily be represented as either a Future or a Source of TaskState (or the Marathon equivalent).
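
A minimal, self-contained sketch of the first point (the PathId/Instance stand-ins and the KeyedLock shape are simplified assumptions for illustration, not Marathon's actual types):

import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future}

// Simplified stand-ins for Marathon's PathId / Instance types (assumed for this sketch).
final case class PathId(path: String)
final case class InstanceId(value: String)
final case class Instance(id: InstanceId, state: String)

// Serializes asynchronous updates per key by chaining each update behind the previous one.
final class KeyedLock[K](implicit ec: ExecutionContext) {
  private val tails = TrieMap.empty[K, Future[Any]]
  def apply[T](key: K)(block: => Future[T]): Future[T] = synchronized {
    val previous = tails.getOrElse(key, Future.successful(()))
    val next = previous.recover { case _ => () }.flatMap(_ => block)
    tails.put(key, next)
    next
  }
}

// Actor-free instance tracker: reads hit the TrieMap directly, writes are serialized per app.
final class SimpleInstanceTracker(implicit ec: ExecutionContext) {
  private val instances = TrieMap.empty[PathId, TrieMap[InstanceId, Instance]]
  private val lock = new KeyedLock[PathId]

  def specInstances(appId: PathId): Iterable[Instance] =
    instances.get(appId).map(_.values.toVector).getOrElse(Vector.empty)

  def update(appId: PathId, instance: Instance): Future[Unit] =
    lock(appId) {
      Future {
        instances.getOrElseUpdate(appId, TrieMap.empty).put(instance.id, instance)
        ()
      }
    }
}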

Some other things stand out as well, but I'm at a bit of a loss as to where we queue the updates to launch tokens, etc. There seems to be a lot of contention around updates from Mesos (for example, has Mesos acked a launch request?), and if this is also handled by an actor, it probably should not be.

Other interesting symptoms: as we start to approach the scale cap, Marathon fights with itself, launching tasks and killing them while trying to reach a stable state. Once you hit the cap, more or less nothing works at all (this is somewhat inconsistent, but at the very least all deployment plans time out).

@aquamatthias Could you please have someone (or multiple people) in Hamburg investigate this further and suggest some patches? I did about as much investigation as I was able to handle today. I'm happy to help tomorrow with the little time I have.

@aquamatthias
Contributor

I reproduced the problem on my side with the simulator.

Before I touched the code base, I updated the simulator to not create a timer per task status update.
Patch is here: https://phabricator.mesosphere.com/D311

Problems that popped up:

  1. Offer matching timeout: if all actors are currently processing offers, Marathon can be flooded.
  2. QueuedInstanceInfo is returned with every InstanceUpdate.
  3. OfferProcessorImpl.saveTasks: under load I see a lot of TimeoutReached messages.

1) and 2) are fixed in https://phabricator.mesosphere.com/D309
To avoid running into 3), I adapted the default values (see the Simulator settings below).

Setting for running the Simulator:

$> runMain mesosphere.mesos.simulation.SimulateMesosMain --master foo --max_tasks_per_offer 100  --launch_tokens 1000 --offer_matching_timeout 5000 --save_tasks_to_launch_timeout 10000

After applying the patches to Marathon and the Simulator, I ran the following tests:

  • start one app with 50K tasks
    This runs successfully. It becomes clear that offer matching (InstanceOps/offer) slows down over time.
  • start 5k apps with 1 task each
    I hit a java.lang.OutOfMemoryError: Unable to create new native thread on my OSX machine at about 4100 apps. The number of threads seemed pretty constant at about 80 up to that point.

@aquamatthias
Contributor

The reason I run into thousands of threads in Marathon:

  • Marathon abdicates
  • We do not cleanly tear down all services
  • MarathonSchedulerActor tries to reconcile tasks: it calls InstanceTracker.specInstancesSync which calls Await.result
    Once we stop the process when Marathon abdicates, this should not happen any longer.

The more interesting question is: why does Marathon abdicate?
Answer: there are GC cycles that take longer than 10 seconds!
The leader member node is an ephemeral node that gets deleted server-side when the session becomes invalid. The session is considered expired if it is not renewed within ⅔ of the session timeout, so with the 10-second timeout negotiated below, a single GC pause of roughly 7 seconds or more is enough to lose the session.

What I see in the logs:

[2016-12-12 08:25:53,733] INFO  Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,734] INFO  Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,734] INFO  Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x158d36d984f00ca, negotiated timeout = 10000 (org.apache.zookeeper.ClientCnxn:ForkJoinPool-2-worker-11-SendThread(localhost:2181))
[2016-12-12 08:25:53,735] INFO  State change: RECONNECTED (org.apache.curator.framework.state.ConnectionStateManager:ForkJoinPool-2-worker-11-EventThread)
[2016-12-12 08:25:53,749] ERROR error while getting current leader (mesosphere.marathon.core.election.impl.CuratorElectionService:pool-8-thread-1)
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /marathon/leader-curator/_c_ffcbcd6f-3e56-42b4-8688-36b8e8395ea1-latch-0000000000

I increased the session timeout to 30 seconds and no longer run into this problem.
While constantly creating apps, Marathon now spends a lot of time in GC.
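
For reference, a minimal sketch of raising a Curator client's session timeout (the connect string and values here are illustrative, not Marathon's actual wiring):

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Request a 30s session timeout instead of the 10s negotiated in the logs above.
// Note: the ZooKeeper server may cap the negotiated value via its own
// minSessionTimeout/maxSessionTimeout settings.
val client = CuratorFrameworkFactory.builder()
  .connectString("localhost:2181")
  .sessionTimeoutMs(30000)
  .connectionTimeoutMs(10000)
  .retryPolicy(new ExponentialBackoffRetry(1000, 3))
  .build()

client.start()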

@janisz
Contributor

janisz commented Dec 12, 2016

#4828 fixes the issue mentioned above:

MarathonSchedulerActor tries to reconcile tasks: it calls InstanceTracker.specInstancesSync which calls Await.result

janisz added a commit to janisz/marathon that referenced this issue Dec 12, 2016
janisz added a commit to janisz/marathon that referenced this issue Dec 13, 2016
janisz added a commit to janisz/marathon that referenced this issue Dec 13, 2016
@aquamatthias
Contributor

With the fixes to the RootGroupTree applied, I could launch 5000 apps.
I still see a jump in the number of threads during reconciliation.
The reconciliation logic inside the Simulator is probably too simplistic: it sends all updates in one batch.
The patch from @janisz will help to reduce this number, but it covers only one place; the jump in threads is still there.

@janisz
Contributor

janisz commented Dec 13, 2016

Blocking InstanceTracker calls are also in DeploymentActor. This is fixed in #4788, but to reduce threads we also need to replace the blocking calls in TaskStartActor.
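
To illustrate the pattern being replaced (not the actual Marathon code; the trait and method names below are assumptions for the example): a synchronous lookup parks a thread in Await.result, while the async variant composes on the Future instead.

import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

final case class Instance(id: String)

// Hypothetical tracker interface with an asynchronous lookup.
trait Tracker {
  def specInstances(appId: String): Future[Seq[Instance]]
}

object BlockingVsAsync {
  // Blocking style: every caller parks a thread until the result arrives.
  // If this happens once per app during reconciliation, threads pile up.
  def countSync(tracker: Tracker, appId: String): Int =
    Await.result(tracker.specInstances(appId), 5.seconds).size

  // Non-blocking style: compose on the Future; no thread is parked while waiting.
  def countAsync(tracker: Tracker, appId: String)(implicit ec: ExecutionContext): Future[Int] =
    tracker.specInstances(appId).map(_.size)
}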

@aquamatthias
Contributor

I deployed 8000 apps successfully on my machine:
Time: 1 hour
ThreadCount: 114

This was only possible without task reconciliation.
Next step would be to look into task reconciliation.

@aquamatthias
Contributor

The root cause of the excessive thread creation is the reconciliation logic for health checks: blocking code that is executed once per application.
I fixed this in D333.

After this patch is applied I could deploy 8000 apps successfully in the simulator with a reconciliation interval of 2 minutes (to force the original problem early).
Time: 1 hour
ThreadCount: 139
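
For illustration only (this is not the actual D333 change; the types and names are assumed): instead of looping over applications and blocking on each health check reconciliation, the per-app futures can be traversed and combined asynchronously.

import scala.concurrent.{ExecutionContext, Future}

final case class App(id: String)

object HealthReconciliation {
  // Hypothetical per-app reconciliation step that completes asynchronously.
  def reconcileApp(app: App)(implicit ec: ExecutionContext): Future[Unit] =
    Future.successful(())

  // One combined Future for all apps; no thread is parked per application.
  def reconcileAll(apps: Seq[App])(implicit ec: ExecutionContext): Future[Unit] =
    Future.traverse(apps)(app => reconcileApp(app)).map(_ => ())
}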

@aquamatthias
Contributor

@jasongilanfarr please retest with D333 applied.

@jasongilanfarr
Contributor Author

We hit 10,258 apps with 10 instances each using the simulator against latest master. Took like 4 hours... but success.

aquamatthias pushed a commit that referenced this issue Dec 21, 2016
jeschkies pushed a commit that referenced this issue Jan 2, 2017
* Use async TaskTracker in SchedulerActor (#4828)

Related to #4802, #3031, #4693

* Do not block during reconciliation of health checks.

The MarathonHealthCheckManager used the sync version of the
InstanceTracker.

* Use asynchronous call to TaskTracker

* Use async InstanceTracker in DeploymentActor

* Use async TaskTracker in TaskKiller

* Fixes updating healthchecks

* Fixes logging.

* Handle spaces in arguments correctly.

Test Plan: Start marathon via this script with spaces in arguments.

Reviewers: lukas, jeschkies

Reviewed By: lukas, jeschkies

Subscribers: jenkins, marathon-team

Differential Revision: https://phabricator.mesosphere.com/D324

* Kill tasks before start new
@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017