Deployment cannot process if deployment data exceeds 1024kb #2101
Comments
Could you please provide some more information from the marathon logfile? There should be an exception logged with that message. |
Aug 26 09:21:17 ip-**--- marathon[27448]: [2015-08-26 09:21:17,252] WARN (mesosphere.marathon.api.MarathonExceptionMapper:30) |
We found this when the number of our containers exceeded 1000. Does Marathon behave unusually when it manages 1000+ containers? |
@nashasha1 and I are from the same company. I am posting another log with better formatting below. The exception happened when an app was created or updated. The exception says
|
Another odd thing is that the group version returned by /v2/groups is always 2015-07-13T20:47:41.034Z. I looked at the content of the ZooKeeper node /marathon/state/group:root, and the version there is indeed 2015-07-13T20:47:41.034. It seems the version has not changed since the group was first created. We have a thousand apps in the root group, and the size of the ZooKeeper node /marathon/state/group:root is now more than 500K. Is this problem related to the large group? We don't actually use Marathon's group feature. Should we create a separate group for each app to avoid a large group? Thanks |
any idea about this? |
@mingqi: In general, 500k is not a size where I'd expect this kind of problem. See information on But yes, we've run into problems with big node sizes – the problem, however, is not the size of the node itself, but the packet size when fetching the node. When trying to delete such a node via the zkCli, the output looks like this:
Notice the Questions
|
Any update/ideas on this? We are seeing a similar issue though our group size or packet size is not that big... curl --silent -X GET http://marathon/v2/groups | python -mjson.tool | wc |
Hey @mingqi, if the group reaches a size of more than 500KB, any update to any app or group will fail with a version of Marathon before 0.13.0. For every change of an AppDefinition we create and store a Deployment in ZK, which stores the group before the change and the group after the change. If your group is greater than 500KB, the deployment object will be twice this size. The default For that reason we introduced ZKCompression, which is available in Marathon 0.13. We successfully started 2000 apps simultaneously with this version. Can you check whether this version solves your problem? |
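The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration, not Marathon's code: it assumes the node limit is ZooKeeper's default `jute.maxbuffer` of 0xfffff bytes (your cluster may be configured differently), it ignores any serialization overhead on top of the two group copies, and the function names are made up for the example.

```python
# Assumption: ZooKeeper's default jute.maxbuffer (0xfffff bytes, just under 1 MiB).
ZK_DEFAULT_MAX_BYTES = 0xFFFFF


def deployment_size(group_bytes: int) -> int:
    """A deployment stores the group before and after the change: two copies."""
    return 2 * group_bytes


def fits_in_znode(group_bytes: int, limit: int = ZK_DEFAULT_MAX_BYTES) -> bool:
    """True if a deployment for a group of this size stays within the limit."""
    return deployment_size(group_bytes) <= limit


if __name__ == "__main__":
    for size_kib in (300, 512, 600):
        group_bytes = size_kib * 1024
        print(size_kib, "KiB group ->", deployment_size(group_bytes),
              "byte deployment, fits:", fits_in_znode(group_bytes))
```

Under these assumptions a root group of roughly half the node limit is already enough to make every deployment too large to store.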
@aameek the group is stored as protobuf, so the size of the JSON object can be very different. Can you check in ZK how big the object is? |
Just adding my 👍 to this issue. My team and I worked with @jgarcia-mesosphere over the last two weeks to narrow down a problem we were seeing in our dev cluster to this bug. The issue is that Marathon stores group defs in ZK, and the root group seems to store all the data for all child groups. When we got to a point where our Once we understood that Marathon was trying to update the It looks like Marathon is scalable up to thousands of tasks, but those tasks need to have small definitions. We've built a "Heroku-like" system on top of Marathon for our devs, and that means that we've got a high number of apps per group, and each app has around 100 environment variables. So when your devs create multiple groups with this kind of configuration, you'll run into scalability issues sooner than you'd expect. For instance, we currently have 323 apps defined in our dev Marathon (only 120 of them actually running), but our We're going to be rearchitecting our applications to fetch their environments themselves, instead of injecting them via the For anyone experiencing this issue: try deleting your smallest app or app group, then the next smallest, until you have control of your cluster again. That got us moving again. |
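For the "delete your smallest app first" workaround described above, a rough way to rank apps by definition size might look like the sketch below. It assumes app definitions are available as Python dicts with an `id` field (for example, parsed from the JSON that Marathon's REST API returns), and it uses the JSON byte length only as a proxy, since the group is stored in ZK as protobuf. The sample data and the `apps_by_size` helper are hypothetical.

```python
import json


def apps_by_size(apps):
    """Return (app_id, json_bytes) pairs, smallest definition first.

    JSON length is only an approximation of the stored (protobuf) size.
    """
    sized = [(app["id"], len(json.dumps(app).encode("utf-8"))) for app in apps]
    return sorted(sized, key=lambda pair: pair[1])


# Hypothetical sample definitions; real ones would come from your cluster.
apps = [
    {"id": "/big", "env": {f"VAR_{i}": "x" * 40 for i in range(100)}},
    {"id": "/small", "env": {"A": "1"}},
    {"id": "/medium", "env": {f"VAR_{i}": "x" for i in range(10)}},
]

if __name__ == "__main__":
    for app_id, size in apps_by_size(apps):
        print(app_id, size)
```

The same ranking read in reverse shows which apps contribute most to the group size, for example those carrying around 100 environment variables each.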
@mcclurmc thanks for the info. We are very aware of this limitation. We introduced a command line parameter Step 1) we will refactor how the group is stored (only references to the app) |
To check for this, pull Marathon statistics from |
Is this fixed? |
@kopax not yet |
@gkleiman and will it be? This issue is almost a year old now. I just need to know whether I need to adopt a different strategy for my deployment. |
There is a new persistence layer in 1.4-SNAPSHOT that is currently enabled by default that stores in ZK in a much more scalable fashion. I was able to store 2,500 apps (easily scales more) with 450,000 tasks. The theoretical limit is about a million "objects" of a given type. |
Fixed by #4178 |
In 0.8.2, we create an app and Marathon returns success, but it never actually creates it. The app appears neither in the deployments nor in the queue.
So we upgraded to 0.9.2.
It shows this error:
Could not modify Group with key: root:2015-07-31T08:50:18.342Z
We removed some ZooKeeper logs/snapshots, and then it was OK.
But after we recovered some apps, the error appeared again.