
Deployment cannot process if deployment data exceeds 1024kb #2101

Closed
nashasha1 opened this issue Aug 26, 2015 · 18 comments

@nashasha1
In 0.8.2 we created an app and Marathon returned success, but the app was never actually created; it showed up neither in the deployments nor in the queue.
So we upgraded to 0.9.2.
It now shows this error:
Could not modify Group with key: root:2015-07-31T08:50:18.342Z

We removed some ZooKeeper logs/snapshots and things were OK again, but after we restored some apps the error reappeared.

@meichstedt
Contributor

Could you please provide some more information from the marathon logfile? There should be an exception logged with that message.

@meichstedt added and removed the bug label Aug 26, 2015
@nashasha1
Author

Aug 26 09:21:17 ip-**--- marathon[27448]: [2015-08-26 09:21:17,252] WARN (mesosphere.marathon.api.MarathonExceptionMapper:30)
Aug 26 09:21:17 ip-**--- marathon[27448]: mesosphere.marathon.StoreCommandFailedException: Could not fetch AppDefinition with key: 5ecb8fba-afdf-4691-9f2f-f0111106b4ed
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:75)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:74)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:215)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.util.Try$.apply(Try.scala:191)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.util.Failure.recover(Try.scala:215)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at java.lang.Thread.run(Thread.java:745)
Aug 26 09:21:17 ip-**--- marathon[27448]: Caused by: mesosphere.marathon.StoreCommandFailedException: Could not load key app:5ecb8fba-afdf-4691-9f2f-f0111106b4ed
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:83)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:82)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011... 10 more
Aug 26 09:21:17 ip-**--- marathon[27448]: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /marathon/state/app:5ecb8fba-afdf-4691-9f2f-f0111106b4ed
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at com.twitter.zk.AsyncCallbackPromise$class.process(AsyncCallbackPromise.scala:18)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at com.twitter.zk.DataCallbackPromise.process(AsyncCallbackPromise.scala:53)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at com.twitter.zk.DataCallbackPromise.processResult(AsyncCallbackPromise.scala:56)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:564)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Aug 26 09:21:17 ip-**--- marathon[27448]: [2015-08-26 09:21:17,252] WARN Deployment failed for change: 2015-08-26T09:21:04.810Z (mesosphere.marathon.state.GroupManager:146)
Aug 26 09:21:17 ip-**--- marathon[27448]: mesosphere.marathon.StoreCommandFailedException: Could not modify Group with key: root:2015-07-30T00:07:26.834Z
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:75)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:74)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
Aug 26 09:21:17 ip-**--- marathon[27448]: #011at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:215)

@nashasha1
Author

We ran into this when the number of our containers grew beyond 1000.

Does Marathon behave unusually when it manages 1000+ containers?

@mingqi

mingqi commented Aug 27, 2015

@nashasha1 and I are from the same company. I'm posting another log with better formatting below. The exception happens when an app is created or updated; it says "Can not create entity group:root:2015-07-31T08:50:18.342Z" and "Could not modify Group with key: root:2015-07-31T08:50:18.342Z". The app creation and update happened yesterday, i.e. 2015-08-26, so it is very strange that Marathon tried to create and modify a group with the 2015-07-31 version. That version was deleted by Marathon long ago, automatically, because --zk_max_versions is set to 3 in our Marathon.

Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,210] WARN (mesosphere.marathon.api.MarathonExceptionMapper:30)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: mesosphere.marathon.StoreCommandFailedException: Could not modify Group with key: root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:75)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:74)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:215)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Try$.apply(Try.scala:191)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Failure.recover(Try.scala:215)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.lang.Thread.run(Thread.java:745)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: Caused by: mesosphere.marathon.StoreCommandFailedException: Can not create entity group:root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:83)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:82)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011... 10 more
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /marathon/state/group:root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.AsyncCallbackPromise$class.process(AsyncCallbackPromise.scala:18)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.StringCallbackPromise.process(AsyncCallbackPromise.scala:23)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.StringCallbackPromise.processResult(AsyncCallbackPromise.scala:25)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,210] WARN Deployment failed for change: 2015-08-26T10:04:58.043Z (mesosphere.marathon.state.GroupManager:146)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: mesosphere.marathon.StoreCommandFailedException: Could not modify Group with key: root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:75)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.marathon.state.MarathonStore$$anonfun$mesosphere$marathon$state$MarathonStore$$exceptionTransform$1.applyOrElse(MarathonStore.scala:74)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:215)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Try$.apply(Try.scala:191)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.util.Failure.recover(Try.scala:215)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at java.lang.Thread.run(Thread.java:745)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: Caused by: mesosphere.marathon.StoreCommandFailedException: Can not create entity group:root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:83)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at mesosphere.util.state.zk.ZKStore$$anonfun$exceptionTransform$1.applyOrElse(ZKStore.scala:82)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011... 10 more
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /marathon/state/group:root:2015-07-31T08:50:18.342Z
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.AsyncCallbackPromise$class.process(AsyncCallbackPromise.scala:18)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.StringCallbackPromise.process(AsyncCallbackPromise.scala:23)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at com.twitter.zk.StringCallbackPromise.processResult(AsyncCallbackPromise.scala:25)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: #011at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [INFO] [08/26/2015 10:05:09.214] [marathon-akka.actor.default-dispatcher-4] [akka://marathon/user/$b] POSTing to all endpoints.
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,215] INFO 36.110.16.178 - - [26/Aug/2015:10:04:58 +0000] "POST /v2/apps/ HTTP/1.1" 500 76 "-" "curl/7.37.1" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,320] INFO Opening socket connection to server 172.31.8.66/172.31.8.66:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn:975)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,320] INFO Socket connection established to 172.31.8.66/172.31.8.66:2181, initiating session (org.apache.zookeeper.ClientCnxn:852)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,322] INFO Session establishment complete on server 172.31.8.66/172.31.8.66:2181, sessionid = 0x14f68ce878b000f, negotiated timeout = 1800000 (org.apache.zookeeper.ClientCnxn:1235)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,328] INFO Setting framework ID to 20150713-183426-1459822508-5050-27691-0000 (mesosphere.marathon.MarathonSchedulerService:86)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [INFO] [08/26/2015 10:05:09.328] [pool-2-thread-1] [akka://marathon/user/$b] Sending POST to:http://172.31.11.226:8080/v1/apps/events/
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,328] INFO Setting framework ID to 20150713-183426-1459822508-5050-27691-0000 (mesosphere.marathon.MarathonSchedulerService:86)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [INFO] [08/26/2015 10:05:09.329] [pool-2-thread-1] [akka://marathon/user/$b] Sending POST to:https://puck.alauda.club:8443/v1/slark
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [INFO] [08/26/2015 10:05:09.329] [pool-2-thread-1] [akka://marathon/user/$b] Sending POST to:https://puck.alauda.club:8443/v1/mirana
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,330] INFO 172.31.3.143 - - [26/Aug/2015:10:05:08 +0000] "GET /v2/info HTTP/1.1" 200 977 "-" "ELB-HealthChecker/1.0" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,330] INFO 36.110.16.178 - - [26/Aug/2015:10:05:09 +0000] "GET /v2/apps//c30495b3-bf6c-4693-a13a-a0805a78804f HTTP/1.1" 200 1625 "http://54.223.169.203:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,331] INFO 172.31.7.243 - - [26/Aug/2015:10:05:08 +0000] "GET /v2/info HTTP/1.1" 200 977 "-" "ELB-HealthChecker/1.0" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,398] INFO 36.110.16.178 - - [26/Aug/2015:10:05:09 +0000] "GET /v2/deployments HTTP/1.1" 200 2 "http://54.223.169.203:8080/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.89 Safari/537.36" (mesosphere.chaos.http.ChaosRequestLog:15)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,429] INFO started processing 1 offers, launching at most 1 tasks per offer and 1000 tasks in total (mesosphere.marathon.tasks.IterativeOfferMatcher$:129)
Aug 26 10:05:09 ip-172-31-11-94 marathon[4884]: [2015-08-26 10:05:09,429] INFO Offer [20150720-062308-1577787308-5050-21885-O36819081]. Decline with default filter refuseSeconds (use --decline_offer_duration to configure) (mesosphere.marathon.tasks.IterativeOfferMatcher$:228)
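
One way to double-check whether that 2015-07-31 group version really still exists in ZooKeeper is to list the state path with zkCli; a quick sketch, assuming the ZK address and the /marathon chroot shown in the log above:

bin/zkCli.sh -server 172.31.8.66:2181
[zk: 172.31.8.66:2181(CONNECTED) 0] ls /marathon/state

Look for children named group:root:<timestamp> in the ls output; with --zk_max_versions=3 only the three most recent versioned group entries should remain.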

@mingqi

mingqi commented Aug 27, 2015

Another odd thing is that the group version returned by /v2/groups is always 2015-07-13T20:47:41.034Z. I looked into the content of the ZooKeeper node /marathon/state/group:root and the version there is indeed 2015-07-13T20:47:41.034Z; it seems the version has never changed since it was first created.

If the problem is related to a big group: we have a thousand apps in the root group, and the ZooKeeper node /marathon/state/group:root is now larger than 500K. Is this problem related to the size of the group? We don't actually use Marathon's group feature. Should we create a separate group for each app to avoid one large group? Thanks.

@mingqi

mingqi commented Aug 28, 2015

Any ideas about this?

@meichstedt
Contributor

@mingqi: In general, 500k is not a size where I'd expect this kind of problem. See information on jute.maxbuffer in the ZooKeeper Admin Doc

But yes, we've run into problems with big node sizes – the problem, however, is not the size of the node itself, but the packet size when fetching the node. When trying to delete such a node via the zkCli, the output looks like this:

Welcome to ZooKeeper!
2015-08-13 10:47:41,229 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2015-08-13 10:47:41,299 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@852] - Socket connection established to localhost/127.0.0.1:2181, initiating session
2015-08-13 10:47:41,308 [myid:] - INFO  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1235] - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14f25f4aa2200df, negotiated timeout = 30000

WATCHER::

WatchedEvent state:SyncConnected type:None path:null
[zk: localhost:2181(CONNECTED) 0] rmr /universe
2015-08-13 10:47:45,908 [myid:] - WARN  [main-SendThread(localhost:2181):ClientCnxn$SendThread@1102] - Session 0x14f25f4aa2200df for server localhost/127.0.0.1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Packet len4198383 is out of range!
    at org.apache.zookeeper.ClientCnxnSocket.readLength(ClientCnxnSocket.java:112)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:79)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

WATCHER::

WatchedEvent state:Disconnected type:None path:null
Exception in thread "main" org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /universe/marathon-demo/state
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1500)
    at org.apache.zookeeper.ZKUtil.listSubTreeBFS(ZKUtil.java:114)
    at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:49)
    at org.apache.zookeeper.ZooKeeperMain.processZKCmd(ZooKeeperMain.java:703)
    at org.apache.zookeeper.ZooKeeperMain.processCmd(ZooKeeperMain.java:588)
    at org.apache.zookeeper.ZooKeeperMain.executeLine(ZooKeeperMain.java:360)
    at org.apache.zookeeper.ZooKeeperMain.run(ZooKeeperMain.java:323)
    at org.apache.zookeeper.ZooKeeperMain.main(ZooKeeperMain.java:282)
matthias@master-zuob ~ $ 

Notice the Packet len4198383 is out of range! in the stack trace – I have not seen that in the one you provided, so I don't think that you're running into node size problems (at least not yet). However, it seems to be a zk-specific problem. From a marathon point of view, storing state fails, which means that deploying must fail.

Questions

  • Have you updated any AppDefinitions since the root group's version timestamp?
  • How many child nodes does the root group have?
  • Can you try to manually (non-destructively) modify the zNode via the zkCli?
  • Can you please adjust the jute.maxbuffer size (see the sketch below) and see if that helps?
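
A minimal sketch of raising jute.maxbuffer; the value below is illustrative, and since this is a plain Java system property it needs to be set on the ZooKeeper servers as well as on every client JVM (including Marathon) to take effect. Where exactly you set it depends on your ZooKeeper packaging:

# on each ZooKeeper server, e.g. via JVMFLAGS/SERVER_JVMFLAGS in conf/zookeeper-env.sh
export JVMFLAGS="-Djute.maxbuffer=4194304"        # 4 MB instead of the ~1 MB default

# for zkCli
export CLIENT_JVMFLAGS="-Djute.maxbuffer=4194304"
bin/zkCli.sh -server localhost:2181

# for Marathon, pass the same -Djute.maxbuffer=... to its JVM via your start script or init config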

@philipnrmn removed their assignment Sep 10, 2015
@aameek
Contributor

aameek commented Dec 8, 2015

Any update/ideas on this? We are seeing a similar issue though our group size or packet size is not that big...

curl --silent -X GET http://marathon/v2/groups | python -mjson.tool | wc
39871 65801 1626350

@aquamatthias
Contributor

Hey @mingqi, if the group reaches a size of more than 500KB, any update to any app or group will fail with a Marathon version before 0.13.0. For every change to an AppDefinition we create and store a Deployment in ZK, which contains the group before the change and the group after the change. So if your group is larger than 500KB, the deployment object will be roughly twice that size, which exceeds the default jute.maxbuffer of 1MB.

For that reason we introduced ZK compression, which is available in Marathon 0.13. We successfully started 2000 apps simultaneously with this version. Can you try whether this version solves your problem?
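
As a rough back-of-the-envelope check (sizes illustrative, assuming a root group a little over 500KB):

  group:root (original version)                ~  512 KB
  group:root (target version)                  ~  512 KB
  deployment plan ≈ original + target group    ~ 1024 KB  >= default jute.maxbuffer (1 MB)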

@aquamatthias
Contributor

@aameek the group is stored as protobuf, so the size of the JSON representation is very different. Can you look into ZK and check how big the object actually is?
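
One way to do that (a sketch; host and path assumed from earlier in this thread) is zkCli's stat command, whose dataLength field reports the znode payload size in bytes:

bin/zkCli.sh -server <zk-host>:2181
[zk: <zk-host>:2181(CONNECTED) 0] stat /marathon/state/group:root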

@mcclurmc

mcclurmc commented May 1, 2016

Just adding my 👍 to this issue. My team and I worked with @jgarcia-mesosphere over the last two weeks to narrow down a problem we were seeing in our dev cluster to this bug. The issue is that Marathon stores group defs in ZK, and the root group seems to store all the data for all child groups. When we got to a point where our /marathon/group:root znode went over about 500kb, Marathon wouldn't be able to process new deployments (we'd see either "conflicting deployment" or "futures timed out" errors in the logs). The max for any znode is 1mb, but Marathon is updating the znode with old group + deployment metadata + updated group, so the current znode doesn't even have to approach 1mb before you start seeing this problem.

Once we understood that Marathon was trying to update the group:root znode in a way that would surpass the hard limit of 1mb per znode, we were able to destroy enough of our apps to get below 500kb for that group:root znode. That got our dev cluster stable, but it leaves us with an upper limit on the number of app groups we can run concurrently.

It looks like Marathon is scalable up to thousands of tasks, but those tasks need to have small definitions. We've built a "Heroku-like" system on top of Marathon for our devs, and that means that we've got a high number of apps per group, and each app has around 100 environment variables. So when your devs create multiple groups with this kind of configuration, you'll run into scalability issues sooner than you'd expect. For instance, we currently have 323 apps defined in our dev Marathon (only 120 of them actually running), but our group:root znode is over 400kb.

We're going to be rearchitecting our applications to fetch their environments themselves, instead of injecting them via the env section of the app definition. It would be good if Marathon would at least catch the exceptional case where it attempts to stuff more in the group:root znode than it can handle, and long-term to break up these potentially large znodes into parent/child znodes, or use a different data store, or something even more clever that I haven't thought of yet.

For anyone experiencing this issue: try deleting your smallest app or app group, then the next smallest, until you have control of your cluster again. That got us moving again.

@aquamatthias
Contributor

@mcclurmc thanks for the info. We are very aware of this limitation.

We introduced a command line parameter --max_apps, so you get an error if you try to create more apps than that number. This configuration is mostly a band-aid to fail fast and does not take the ZK node size into account (which depends on how much data you store at the app level).
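
For illustration, a hypothetical invocation (the master/zk URLs are placeholders):

./bin/start --master zk://zk-host:2181/mesos --zk zk://zk-host:2181/marathon --max_apps 500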

Step 1) We will refactor how the group is stored (keeping only references to the apps).
Step 2) We will support more capable persistent stores.

@jgarcia-mesosphere
Contributor

To check for this, pull Marathon statistics from http://<marathon>/metrics and check for the "service.mesosphere.marathon.state.MarathonStore.DeploymentPlan.write-data-size" key to see the largest deploy your Marathon has tried to write.
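
For example, something along these lines (the same curl-plus-json.tool approach as above; the grep is just a convenience to show the metric and the values nested under it):

curl --silent http://<marathon>/metrics | python -mjson.tool | grep -A 4 "DeploymentPlan.write-data-size"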

@jgarcia-mesosphere changed the title from "Could not modify Group with key" to "Deployment cannot process if deployment data exceeds 1024kb" May 2, 2016
@kopax

kopax commented Jun 14, 2016

Is this fixed?

@gkleiman
Contributor

@kopax not yet

@kopax

kopax commented Jun 15, 2016

@gkleiman and will it be? This issue is almost a year old now. I just need to know whether I should adopt a different strategy for my deployments.

@jasongilanfarr
Contributor

There is a new persistence layer in 1.4-SNAPSHOT, currently enabled by default, that stores data in ZK in a much more scalable fashion. I was able to store 2,500 apps (it easily scales further) with 450,000 tasks. The theoretical limit is about a million "objects" of a given type.

@jasongilanfarr
Contributor

Fixed by #4178

@mesosphere locked and limited conversation to collaborators Mar 27, 2017