High disk usage #16
Hello @mce! Sorry for the late response. May I ask if that happens after sending some metrics data to the image, or just by starting it up? I just started to dig into the space requirements of this image. With the current storage schema of keeping 7 days at 10 second resolution, and the number of metrics that StatsD reports per real-world metric (upper, lower, means, percentiles and so on, for a total of 35 metrics right now), this gives us a total of 25.5 MB of storage needed for each real-world metric. That space is allocated when the metric is created for the first time and remains the same for the lifetime of the metric. Is it possible that you are sending so many real-world metrics that you are hitting the limit of your current storage? (about 1200 real-world metrics)
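Those numbers can be sanity-checked with a bit of arithmetic, assuming whisper's usual 12 bytes per stored (timestamp, value) datapoint:

```shell
# 7 days at 10 s resolution, 12 bytes per whisper datapoint,
# 35 derived series per real-world metric (StatsD percentiles etc.)
points=$(( 7 * 86400 / 10 ))             # 60480 datapoints per archive
file_bytes=$(( points * 12 ))            # 725760 bytes per .wsp file
per_metric=$(( file_bytes * 35 ))        # 25401600 bytes, i.e. ~25.4 MB
fit_30g=$(( 30000000000 / per_metric ))  # ~1181 metrics fill a 30 GB disk
echo "$per_metric bytes per metric, $fit_30g metrics per 30 GB"
```

This lines up with the quoted 25.5 MB per metric, and with "about 1200 real-world metrics" filling a 30 GB volume like the droplet in the original report.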
When attaching to the running container I found output flushing to my terminal. I was unable to attach a new tty (docker 1.2) to the running dashboard, but my host says:

How can I maintain and manage the container?
@behrad oh God, I'm not precisely the best person to answer these questions :s, but let's see if I can help.

First, there is no old data on Graphite: each metric has a single file that is created with the required size to store all the metric's information for the configured storage schemas. I guess there should be ways to wipe the values, but the files will remain there. On the other hand, you might have unused data, data from applications that are no longer relevant, that you could remove entirely. Apparently the right way to delete data is just to go to the storage folders (/var/lib/graphite/storage/whisper/ for our master branch, /opt/graphite/storage/whisper/ for our grafana-upgrade-1.9.1 branch) and delete the .wsp files yourself. I was able to jump into the container filesystem using

So, I guess the first thing to do in your case would be to delete some metrics to gain free space, get your container back to life, and investigate the storage issue. Is it possible that you set some sort of image size limit for your docker container?

Finally, we recommend using the docker image for development purposes only, because when something like this happens and things go wrong, well, just kill, remove and start a fresh copy! Since you are using it in prod, it is probably a good idea to mount the whisper storage folder and the supervisord logs folder in a volume visible from outside the container, for backup and cleanup purposes. Also, try to upgrade to what we have on the grafana-upgrade-1.9.1 branch, which is more stable. Hope that helps!
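A sketch of what that volume setup could look like. The image name, host paths and in-container log directory here are placeholders, not taken from this repository; adjust them to the branch you actually run:

```shell
# Hypothetical run command: keep whisper data and supervisord logs on the
# host so they survive container replacement and can be backed up or
# cleaned from outside the container.
docker run -d \
  -v /srv/graphite/whisper:/var/lib/graphite/storage/whisper \
  -v /srv/graphite/logs:/var/log/supervisor \
  your-grafana-graphite-image
```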
So, according to this post, the default disk usage limit for each container is 10 GB; maybe you are hitting that limit?
Thank you again for your valuable info @ivantopo. My problem is actually my docker version, which doesn't support attaching with a new tty :( And I'll surely try the grafana 1.9 branch tomorrow.
The changes on the
@ivantopo I am currently eating up to 35 GB with 3 nodes (around 500 actors in total under timers/actor and 1000 under counters/actor). I'll however remove many of my actors from my Kamon config in production.
I'm also interested in this; in our case 400 GB are full after ~5 days. It's on my list to improve the configuration (submit less data, aggregate data in graphite at shorter intervals), but I'm not sure when I'll have time for this. If anybody comes up with some advice I'd also be thankful.
FYI, after recreating our container (this time it took 9 days to fill 400 GB) we changed storage-schemas.conf to
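The exact schema from that comment didn't survive formatting. Purely as an illustration, a storage-schemas.conf entry that adds a coarser second archive might look like this (the values are hypothetical, not necessarily what was used):

```
[default]
pattern = .*
retentions = 10s:7d,1m:30d
```

Keep in mind that carbon only applies schema changes to newly created .wsp files; existing files keep their original layout until resized (e.g. with whisper's resize tool) or deleted.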
Hello @behrad! The retentions can be configured in the storage-schemas.conf file; you can find more detail on the meaning of each item in the carbon docs. Also, if you want to change the configured percentiles in StatsD, you can change its config.js file as you wish.

That being said, I'd like to comment on aggregations with Graphite. As you guys already know, graphite doesn't hold the entire dataset that led to the values you see, only summary datapoints for each interval. That's "nice" in the sense that it keeps storage usage in acceptable ranges and makes querying much faster, but you pay the price when you want to aggregate. If you add, as Martin did, a second retention schema for 7 days with 1 minute resolution, then all the metrics except for counts, sums, min and max are lies: graphite doesn't have the appropriate information to summarize percentiles and will simply average the values, turning your precious data into lies :(. That's why our default config doesn't have aggregations, but we included some aggregation schemas so that, in case people decided to aggregate, at least some metrics would be correct.

If you want fewer percentiles, I would recommend keeping 90, 99 and 99.9, and always plotting them close to the max as well. Sorry that I can't offer much better solutions, but I guess that's the best you can do with the tool at hand. Regards!
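The "averaging percentiles lies" point is easy to demonstrate with a toy dataset (nearest-rank p90 via sort/awk; the numbers are made up):

```shell
# Interval A saw latencies 1..100, interval B saw 100 zeros.
# Compare the true p90 over all 200 samples with the average of the
# two per-interval p90s -- the value a rolled-up archive would hold.
p90() { sort -n | awk '{ v[NR] = $1 } END { print v[int(NR * 0.9)] }'; }
a=$(seq 1 100)
b=$(yes 0 | head -100)
true_p90=$(printf '%s\n%s\n' "$a" "$b" | p90)                  # 80
avg_p90=$(( ( $(echo "$a" | p90) + $(echo "$b" | p90) ) / 2 )) # (90 + 0) / 2 = 45
echo "true=$true_p90 averaged=$avg_p90"
```

Counts, sums, min and max survive this kind of roll-up; rank statistics like percentiles do not, which is exactly the caveat above.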
And @magro, if your container's storage usage keeps growing over time, it means that you are actually adding new entities to the mix every day. Is it possible that some of your actors have variable names that change over time? If your app is always the same and running on the same hosts, then the storage should also remain the same. I don't know if graphite has some sort of cleanup tool to delete metrics that have not been updated recently; maybe a simple bash script would do the job. I added investigating this to my ever-growing todo list; if you find something, please do me a favor and share it here :D. Regards!
@ivantopo Yes, it was erroneous metric creation because of a variable metric name. (I work together with @magro.) Just in case somebody needs the clean-up stanza for copy and paste into their crontab:

It will delete all the metrics that were not updated for 3 days.
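The stanza itself got lost in formatting above; a typical find-based equivalent looks like this (a hypothetical reconstruction — the whisper path assumes the master branch image, and the 3-day threshold matches the comment):

```shell
# Delete whisper files that have not been written for more than 3 days,
# then prune directories left empty (guarded so it is a no-op elsewhere).
WHISPER="${WHISPER:-/var/lib/graphite/storage/whisper}"
if [ -d "$WHISPER" ]; then
  find "$WHISPER" -type f -name '*.wsp' -mtime +3 -delete
  find "$WHISPER" -mindepth 1 -type d -empty -delete
fi
```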
👍
Thanks for stepping in and sharing this, @geekq! :-) The erroneous metrics were created by the system of another team (which I've been supporting for 2 weeks); they're using kamon 0.4.1-6354021533319790ba675c2b9e36fb439a8ea06f. The play trace name generator of that version already uses requestHeader.tags(Routes.ROUTE_PATTERN) (falling back to request.uri), therefore I was wondering why the uri-based trace names were used. The hypothesis would be that these HEAD requests "emulated" by play (by invoking the corresponding GET route) don't have Routes.ROUTE_PATTERN set in requestHeader.tags. Probably play should set Routes.ROUTE_PATTERN for such emulated HEAD requests; might be worth a pull request :-)
@ivantopo @dpsoft I just looked into the HEAD thing; in 2.3.x it's handled like this:

```scala
def onRequestReceived(request: RequestHeader): (RequestHeader, Handler) = {
  val notFoundHandler = Action.async(BodyParsers.parse.empty)(this.onHandlerNotFound)
  val (routedRequest, handler) = onRouteRequest(request) map {
    case handler: RequestTaggingHandler => (handler.tagRequest(request), handler)
    case otherHandler => (request, otherHandler)
  } getOrElse {
    // We automatically permit HEAD requests against any GETs without the need to
    // add an explicit mapping in Routes
    val missingHandler: Handler = request.method match {
      case HttpVerbs.HEAD =>
        new HeadAction(onRouteRequest(request.copy(method = HttpVerbs.GET)).getOrElse(notFoundHandler))
      case _ =>
        notFoundHandler
    }
    (request, missingHandler)
  }
  (routedRequest, doFilter(rh => handler)(routedRequest))
}
```

According to this and my understanding of the related aspectj annotations, this would have to be discussed on the play mailing list / issue tracker, but for me it's not that important to invest this time.
Yeah, totally agree with you, this needs attention on our side. We should at least avoid generating undesired trace metrics when we hit that code path. |
@ivantopo Do you think it might cause issues that beforeRouteRequest is called twice? Should I submit an issue for this? As I already said, I'd not expect kamon to handle this HEAD-does-not-have-a-tagged-request issue; I'd think this should be discussed with the play guys first. Or shall I submit an issue for this to kamon?
To be honest, I'm not fully aware of what is going on in there.. @dpsoft is the Play! guy :) and from a conversation we just had, it seems like this definitely needs an issue on our side, so please go ahead and then we (mostly Diego) will jump into that :D
@ivantopo @magro sorry for the late response, but I agree completely: I think we are creating undesired trace metrics for HEAD requests. It's also a good idea to ask the play guys what the expected behavior for HEAD requests is, or whether a workaround exists so that, when no handler is found and it's a HEAD request, we look up a handler for the equivalent GET request, or something like that.
Hi guys, I remember I deployed these same containers around 5 months back and they were not generating data like today. IIRC my 500 GB volume lasted for 4 months or so, so I guessed there was something wrong with the docker image. I tried creating a fresh docker image using the provided Dockerfile, and right now my monitoring service is up and I am not seeing any high disk usage. It's been 7 hours and only 2 GB of data has been generated by 23 collectd clients. So if you are still seeing high disk usage, try creating the docker image from the file. #24 FYI
Hello,
I am running the docker image on a DigitalOcean droplet. However, after some time (5-10 minutes) disk usage becomes this:

```
/dev/vda1  30G  30G  0  100%  /
```
I had the same issue on my laptop with the docker version 1.4.1 which makes me think there's a problem with the image.
The docker version is `Docker version 1.0.1, build 990021a`.
These two directories share most of the disk, by the way: `/var/lib/docker/aufs/diff` and `/var/lib/docker/aufs/mnt`.
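To see which layers actually account for the usage, a quick per-directory breakdown helps (run as root; the path is taken from the directories above):

```shell
# Largest entries under the aufs diff directory (per-layer writable data).
AUFS_DIFF="${AUFS_DIFF:-/var/lib/docker/aufs/diff}"
if [ -d "$AUFS_DIFF" ]; then
  du -sk "$AUFS_DIFF"/* 2>/dev/null | sort -rn | head -5
fi
```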
Any thoughts on what might be the problem?