
High disk usage #16
Closed
mce opened this issue Jan 9, 2015 · 21 comments


mce commented Jan 9, 2015

Hello,

I am running the docker image on a DigitalOcean droplet. However, after some time (5-10 minutes) disk usage looks like this: /dev/vda1 30G 30G 0 100% /

I had the same issue on my laptop with Docker 1.4.1, which makes me think there's a problem with the image.

Docker version is 1.0.1, build 990021a.

These two directories use most of the disk, by the way: /var/lib/docker/aufs/diff and /var/lib/docker/aufs/mnt

Any thoughts on what might be the problem?
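A minimal sketch for narrowing this down (the paths assume the aufs storage driver mentioned above and may differ on other setups):

# Overall usage of the Docker data directory
df -h /var/lib/docker
# Per-layer usage under aufs; the largest entries point at the offending container/image layers
du -sh /var/lib/docker/aufs/diff/* 2>/dev/null | sort -h | tail -20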

@ivantopo
Contributor

Hello @mce! Sorry for the late response... may I ask if that happens after sending some metrics data to the image, or just by starting it up?

I just started to dig into the space requirements of this image. With the current storage schema of keeping 7 days at 10-second resolution, and the number of metrics that StatsD reports per real-world metric (upper, lower, means, percentiles and so on, for a total of 35 metrics right now), this gives us a total of 25.5 MB of storage needed for each real-world metric. That space is allocated when the metric is created for the first time and remains the same for the lifetime of the metric. Is it possible that you are sending so many real-world metrics that you are hitting the limit of your current storage? (About 1200 real-world metrics.)
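For reference, a back-of-the-envelope check of that 25.5 MB figure, assuming whisper's roughly 12 bytes per datapoint and ignoring the small file headers:

# 7 days at 10 s resolution -> 7 * 24 * 3600 / 10 = 60,480 datapoints per .wsp file
# 60,480 datapoints * ~12 bytes           -> ~726 KB per .wsp file
# 35 .wsp files per real-world metric     -> ~25.4 MB per real-world metric
# 30 GB of disk / ~25.4 MB per metric     -> roughly 1,200 real-world metrics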


behrad commented Feb 10, 2015

When attaching to the running container, I found this flushing to my terminal:

Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 84, in emit
    self.flush()
  File "/usr/lib/python2.7/dist-packages/supervisor/loggers.py", line 64, in flush
    self.stream.flush()
IOError: [Errno 28] No space left on device

I was unable to attach with a new tty (docker 1.2) to the running dashboard, but my host says:

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        38G   28G  8.4G  77% /
tmpfs           7.8G     0  7.8G   0% /dev/shm

How can I maintain and manage the container?
How do I remove old metrics data in production?
Need your help @ivantopo 😍

@ivantopo
Contributor

@behrad oh God, I'm not precisely the best person to answer these questions :s, but let's see if I can help:

First, there is no "old data" in Graphite: each metric has a single file that is created with the required size to store all of the metric's information for the configured storage schemas. I guess there should be ways to wipe the values, but the files will remain there. On the other hand, you might have unused data, data from applications that are no longer relevant, that you could remove entirely. Apparently the right way to delete data is to go to the storage folders (/var/lib/graphite/storage/whisper/ for our master branch, /opt/graphite/storage/whisper/ for our grafana-upgrade-1.9.1 branch) and delete the .wsp files yourself.

I was able to jump into the container filesystem using docker exec -it CONTAINER_ID bash, but I'm not sure if that is available in your version of Docker. I also found myself looking at the data in /var/lib/docker/aufs/diff/CONTAINER_ID, but I never deleted anything from there; I'm not even sure whether doing so is "safe".
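A minimal sketch of that manual cleanup, assuming docker exec is available (Docker 1.3+) and the master-branch storage path; the container ID and the metric subtree are placeholders:

# Jump into the running container
docker exec -it CONTAINER_ID bash
# Inside the container: see which metric trees are the biggest...
du -sh /var/lib/graphite/storage/whisper/* | sort -h
# ...and delete the .wsp files of metrics you no longer need (placeholder path)
rm -r /var/lib/graphite/storage/whisper/stats/old-unused-app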

So, I guess the first thing to do in your case would be to delete some metrics to gain free space, get your container back to life and investigate the storage issue... is it possible that you set some sort of image size limit for your docker container?

Finally, we recommend using the docker image for development purposes only, because when something like this happens and things go wrong, well, you can just kill, remove and start a fresh copy! Since you are using it in production, it is probably a good idea to mount the whisper storage folder and the supervisord logs folder as volumes visible from outside the container, for backup and cleanup purposes. Also, try to upgrade to what we have on the grafana-upgrade-1.9.1 branch, which is more stable... hope that helps!
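A hedged example of that volume setup; the host paths, the in-container log path, the published ports and the image name are assumptions to be checked against the Dockerfile of the branch you run:

# Keep whisper data and supervisord logs outside the container so they can be
# backed up and cleaned without recreating the container (paths/ports are examples).
docker run -d \
  -v /data/whisper:/var/lib/graphite/storage/whisper \
  -v /data/log/supervisor:/var/log/supervisor \
  -p 80:80 -p 8125:8125/udp -p 8126:8126 \
  --name grafana-dashboard \
  kamon/grafana_graphite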

@ivantopo
Contributor

So, according to this post the default disk usage limit for each container is 10GB, maybe you are hitting that limit?
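If the devicemapper storage driver is in use (common on CentOS at the time), that 10 GB base size can reportedly be raised via a daemon storage option; this is only a sketch for older Docker versions, and changing it typically requires recreating the storage, so check the docs for your version:

# The 10 GB base size only applies to devicemapper; confirm which driver the daemon uses
docker info | grep -i 'storage driver'
# Start the daemon with a larger per-container base size (old `docker -d` daemon syntax)
docker -d --storage-opt dm.basesize=20G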


behrad commented Feb 10, 2015

Thank you again for your valuable info @ivantopo

My problem is actually my Docker version, which doesn't support attaching with a new tty :(
I should upgrade to 1.3 on my CentOS. And yeah, I was actually wondering where that 10GB came from, which your post confirmed 👍 I am not a Docker expert :p

And I'll be sure to try the grafana 1.9 branch tomorrow

@ivantopo
Contributor

The changes on the grafana-upgrade-1.9.1 branch have already been merged to master and published to the docker registry, so I guess there is no reason to keep this issue open. If any related issue arises, please let us know.


behrad commented Apr 7, 2015

with the current storage schema of keeping 7 days at 10-second resolution, and the number of metrics that StatsD reports per real-world metric (upper, lower, means, percentiles and so on, for a total of 35 metrics right now), this gives us a total of 25.5 MB of storage needed for each real-world metric. That space is allocated when the metric is created for the first time and remains the same for the lifetime of the metric...

@ivantopo I am currently eating up to 35G with 3 nodes (around 500 actors in total under timers/actor and 1000 under counters/actor).
How can I configure the 7-day window, and disable some metrics like *_75, 90, 95, 98, 99, ... that I'm not using?

I will, however, remove many of my actors from my Kamon config in production.


magro commented Apr 7, 2015

I'm also interested in this; in our case 400 GB are full after ~5 days. It's on my list to improve the configuration (submit less data, aggregate data in graphite at shorter intervals), but I'm not sure when I'll have time for this. If anybody comes up with some advice, I'd also be thankful.


magro commented Apr 9, 2015

FYI, we recreated our container (this time it took 9 days to fill 400 GB) and changed storage-schemas.conf to retentions = 10s:1d,1m:7d,1h:1m (as a quick fix).

@ivantopo
Contributor

Hello @behrad! The retentions can be configured in the storage-schemas.conf file; you can find more detail on the meaning of each item in the carbon docs. Also, if you want to change the configured percentiles in StatsD, you can change its config.js file as you wish.

That being said, I'd like to comment on aggregations with Graphite. As you guys already know, graphite doesn't hold the entire dataset that led to the values you see, only summary datapoints for each interval; that's "nice" in the sense that it makes the storage usage fall into acceptable ranges and makes querying much faster, but you pay the price when you want to aggregate. If you add, as Martin did, a second retention schema for 7 days at 1-minute resolution, then all the metrics except for counts, sums, min and max are lies: graphite doesn't have the appropriate information to summarize percentiles and will simply average the values, turning your precious data into lies :(. That's why our default config doesn't have aggregations, but we included some aggregation schemas so that, in case people decided to aggregate, at least some metrics would be correct.

If you want to have fewer percentiles, I would recommend keeping 90, 99 and 99.9, and always plotting them together with the max as well. Sorry that I can't offer better solutions, but I guess that's the best you can do with the tool at hand. Regards!
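For reference, hedged snippets of both changes; the exact file locations depend on the image/branch, and the values shown are only illustrative:

# storage-schemas.conf (carbon): keep only the 10 s resolution, for 7 days
[default]
pattern = .*
retentions = 10s:7d

// config.js (StatsD), partial: report fewer percentiles
{
  percentThreshold: [90, 99, 99.9]
}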

@ivantopo
Contributor

And @magro, if your container's storage usage keeps growing over time, it means that you are actually adding new entities to the mix every day. Is it possible that some of your actors have variable names that change over time? If your app is always the same and running on the same hosts, then the storage should also remain the same... I don't know whether graphite has some sort of cleanup tool to delete metrics that have not been updated recently; maybe a simple bash script would do the job. I've added investigating this to my ever-growing todo list. If you find something, please do me a favor and share it here :D Regards!


geekq commented Jul 24, 2015

@ivantopo Yes, it was erroneous metric creation because of variable metric names. (I work together with @magro.)

Just in case somebody needs the clean-up stanza for copying and pasting into their crontab:

find /opt/graphite/storage/whisper/stats -type f -mtime +3 -delete && find /opt/graphite/storage/whisper/stats -type d -empty -delete

This will delete all the metrics that were not updated for 3 days.
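As a hedged example, the same stanza dropped into a crontab; the schedule is arbitrary and the path should match wherever your whisper data actually lives:

# Run the cleanup nightly at 03:00 (crontab -e on the host or inside the container)
0 3 * * * find /opt/graphite/storage/whisper/stats -type f -mtime +3 -delete && find /opt/graphite/storage/whisper/stats -type d -empty -delete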


behrad commented Jul 24, 2015

👍


magro commented Jul 24, 2015

Thanks for stepping in and sharing this, @geekq! :-)

The erroneous metrics were created by the system of another team (which I've been supporting for 2 weeks); they're using kamon 0.4.1-6354021533319790ba675c2b9e36fb439a8ea06f. The Play trace name generator of that version already uses requestHeader.tags(Routes.ROUTE_PATTERN) (falling back to request.uri), so I was wondering why the uri-based trace names were used.
IIRC these were all traces for HEAD requests, and I just checked their routes file, which simply doesn't contain entries for such HEAD requests.

Therefore the hypothesis would be that these HEAD requests "emulated" by Play (by invoking the corresponding GET route) don't have the Routes.ROUTE_PATTERN set in requestHeader.tags.

Probably Play should set the Routes.ROUTE_PATTERN for such emulated HEAD requests; might be worth a pull request :-)
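Until that is changed upstream, one hedged workaround is to declare the HEAD routes explicitly in conf/routes next to the existing GET entries, so the route pattern gets tagged like any other route; the path and controller below are placeholders:

# conf/routes (illustrative)
GET     /status     controllers.Status.index
HEAD    /status     controllers.Status.index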

@ivantopo
Contributor

Thanks for sharing that @geekq! And, with regards to the HEAD issue... mmmm, maybe @dpsoft can take a look at it? As you might guess, I'm not a Play fan :) If that is an actual problem, let's open a new issue and discuss it over there!


magro commented Jul 27, 2015

@ivantopo @dpsoft I just looked into the HEAD thing, and in 2.3.x GlobalSettings.onRequestReceived looks like this:

  def onRequestReceived(request: RequestHeader): (RequestHeader, Handler) = {
    val notFoundHandler = Action.async(BodyParsers.parse.empty)(this.onHandlerNotFound)
    val (routedRequest, handler) = onRouteRequest(request) map {
      case handler: RequestTaggingHandler => (handler.tagRequest(request), handler)
      case otherHandler => (request, otherHandler)
    } getOrElse {
      // We automatically permit HEAD requests against any GETs without the need to
      // add an explicit mapping in Routes
      val missingHandler: Handler = request.method match {
        case HttpVerbs.HEAD =>
          new HeadAction(onRouteRequest(request.copy(method = HttpVerbs.GET)).getOrElse(notFoundHandler))
        case _ =>
          notFoundHandler
      }
      (request, missingHandler)
    }

    (routedRequest, doFilter(rh => handler)(routedRequest))
  }

According to this and my understanding of the related aspectj annotations, kamon.play.instrumentation.RequestInstrumentation.beforeRouteRequest should be called twice for HEAD requests if there's no explicit route for HEAD (but anyway, this is not what's causing the issue). The issue, AFAICS, is that in the HttpVerbs.HEAD case the routedRequest is not tagged. And I could imagine that this is intentional, because at least the ROUTE_VERB tag would be wrong with respect to the original request.

This would have to be discussed on the Play mailing list / issue tracker, but for me it's not important enough to invest the time.

@ivantopo
Contributor

Yeah, totally agree with you, this needs attention on our side. We should at least avoid generating undesired trace metrics when we hit that code path.


magro commented Jul 27, 2015

@ivantopo Do you think it might cause issues that beforeRouteRequest is called twice? Should I submit an issue for this? As I already said, I wouldn't expect kamon to handle this HEAD-does-not-have-a-tagged-request issue; I'd think this should be discussed with the Play guys first. Or shall I submit an issue for this to kamon?

@ivantopo
Contributor

To be honest, I'm not fully aware of what is going on in there... @dpsoft is the Play! guy :) and from a conversation we just had, it seems like this definitely needs an issue on our side, so please go ahead, and then we (mostly Diego) will jump into it :D

@dpsoft
Contributor

dpsoft commented Jul 27, 2015

@ivantopo @magro sorry for the late response, but I agree completely: I think we are creating undesired trace metrics for HEAD requests. It's also a good idea to ask the Play guys what the expected behavior for HEAD requests is, or whether there is some workaround so that, when no handler is found and it's a HEAD request, we look up a handler for the equivalent GET request, or something like that.


ksingh7 commented Aug 11, 2016

Hi Guys,
I'd like to share my way of fixing this problem. I cloned this repo today and built up a docker container. I have 10 collectd clients pushing their metrics to this new docker-grafana-graphite service. Within just 30 minutes my 500GB mount point got filled up, which is kind of crazy.

I remember I deployed these same containers around 5 months back and they were not generating data like today. IIRC my 500GB volume lasted for 4 months or so.

So I guessed there was something wrong with the docker image, and I tried creating a fresh docker image using the provided Dockerfile. Right now my monitoring service is up and I am not seeing any high disk usage. It's been 7 hours and only 2GB of data has been generated by 23 collectd clients.

So if you're still seeing high disk usage, try creating the docker image from the Dockerfile.

#24 FYI
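For anyone wanting to reproduce that, the rebuild is roughly the following, run from a checkout of this repository; the image tag and container name are just examples:

# Build the image locally from the Dockerfile instead of pulling it from the registry
docker build -t my/grafana-graphite .
# Run the freshly built image
docker run -d --name grafana-dashboard-local my/grafana-graphite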
