Investigate higher disk i/o utilization on 1.4.2. #9201

Closed · rbetts opened this issue Dec 5, 2017 · 28 comments · Fixed by #9204
@rbetts (Contributor) commented Dec 5, 2017

  • Not yet able to reproduce higher io util. WIP.
@szibis commented Dec 5, 2017

I have this issue too: I now see the same high number of reads as writes, whereas previously only the writes were high. This is very bad, and I am seeing 30% I/O wait right now.

I can add more IOPS, but before 1.4.2, on 1.3.6, what I have was enough.

@jwilder (Contributor) commented Dec 5, 2017

@szibis How many CPU cores do you have available and also how many IOPS can your disks support?
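
(As a quick reference for anyone else hitting this, a minimal sketch for answering both questions on a Linux host; nproc comes from coreutils and iostat from the sysstat package, and neither is specific to InfluxDB.)

# Core count available to influxd
nproc

# Per-device IOPS and utilization, sampled once a second, five reports
iostat -dx 1 5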

@szibis commented Dec 5, 2017

We are using m4 instances on AWS with 16 cores, plus an additional 3 TB gp2 EBS volume. Traffic is about 60k+ writes per second.

It is currently using twice as much memory as on 1.3.6 - about 50-58 GB.
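
(For rough context on the available headroom: gp2 volumes provision on the order of 3 IOPS per GiB, so a 3 TB volume gives roughly 9,000 baseline IOPS under 2017-era gp2 limits; the arithmetic below is only a sanity check, not an AWS-confirmed figure.)

# Rough gp2 baseline IOPS estimate (assumption: ~3 IOPS per provisioned GiB)
echo $(( 3 * 1024 * 3 ))   # 3 TiB x 3 IOPS/GiB = 9216 baseline IOPS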

@jwilder (Contributor) commented Dec 6, 2017

@szibis Since you have 16 cores or more, can you try setting:

[data]
cache-snapshot-memory-size = 1073741824
max-concurrent-compactions = 4

You may need to increase cache-max-memory-size = 2147483648 or higher if it's still the default. This won't resolve the issue fully, but should reduce disk util a bit. You can also experiment with setting max-concurrent-compactions as low as 2, but you'll need to monitor the compaction backlog to make sure it is still keeping up over time.
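
(If it helps anyone following along, a minimal sketch of applying this on a stock package install; the config path /etc/influxdb/influxdb.conf and the systemd unit name influxdb are assumptions and may differ on your system.)

# Print the current [data] section (up to the next section header) to confirm the settings
sudo sed -n '/^\[data\]/,/^\[/p' /etc/influxdb/influxdb.conf

# The new values only take effect after a restart
sudo systemctl restart influxdb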

@ghost added the review label Dec 6, 2017
@ghost removed the review label Dec 7, 2017
@jwilder (Contributor) commented Dec 7, 2017

@szibis I have merged a fix to master and 1.4 branch. If you are able to test it out, that would be great.

@szibis commented Dec 7, 2017

@jwilder Is this available as a nightly build, or do I need to build the package on my own?

@jwilder (Contributor) commented Dec 7, 2017

@szibis If you want to test the nightly, it should be in there tonight. That is based off of master/1.5, though; the current nightly is using fd11e20. To test on 1.4, you would need to build from the 1.4 branch.
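
(In case it saves someone a step, a rough build-from-source sketch for that era's GOPATH-style workflow; the project's CONTRIBUTING.md is the authoritative reference, and the paths below are assumptions.)

# Fetch the source and switch to the 1.4 branch (GOPATH layout assumed)
go get -d github.com/influxdata/influxdb
cd "$GOPATH/src/github.com/influxdata/influxdb"
git checkout 1.4

# Produces an influxd binary in $GOPATH/bin
go install ./cmd/influxd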

@szibis commented Dec 7, 2017

Using this version:

influxd version
InfluxDB v1.4.2 (git: 1.4 50063f9ecdab80f3b1afc5ad36dff0e2c5e9cdee)

I just set, as you suggested:

cache-snapshot-memory-size = 1073741824
max-concurrent-compactions = 4

On InfluxDB 1.3.6, before the config changes:
[screenshot]

On InfluxDB 1.4.2, after the code fixes and config changes:
[screenshot]

iowait looks better; it is low now:
[screenshot]

Previously, before the problems:
[screenshot]

Waiting for some longer-interval stats.

@jwilder (Contributor) commented Dec 7, 2017

@szibis I just realized that, if you set cache-snapshot-memory-size = 1073741824, you will also need to bump up cache-max-memory-size to something higher than 1073741824 (the default). Setting it to 2 or 3gb might be appropriate for you. That will prevent getting cache-max-memory-size limit errors on writes when the cache fills up.

If you are running the change in 1.4, you shouldn't need the config changes though. I suggested those as a workaround for 1.4.2 in lieu of the fixes. They won't hurt if you want to keep them though.
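
(If you want to confirm whether the cache limit is being hit, the write errors mention the cache-max-memory-size setting by name, so a grep like the sketch below against the service logs should surface them; the journald unit name is an assumption, and the exact error wording may vary by version.)

# Surface cache-limit write errors in recent logs (systemd install assumed)
journalctl -u influxdb --since "1 hour ago" | grep -i "cache-max-memory-size"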

@szibis commented Dec 7, 2017

Yes, I have had this bumped from the beginning:

cache-max-memory-size = 3048576000

I will keep them changed.

Thanks for the fix and the help.

@lpic10 commented Dec 8, 2017

@jwilder I can also see a big improvement over the previous nightly builds. I had noticed this bump in disk I/O when going from 1.3 to 1.4+.

[screenshots: cpu and disk I/O]

@jwilder (Contributor) commented Dec 8, 2017

@lpic10 Thanks for the update. What do you have set for cache-snapshot-memory-size?

@lpic10 commented Dec 8, 2017

> @lpic10 Thanks for the update. What do you have set for cache-snapshot-memory-size?

I don't have it set; I suppose it is the default value.

@rbetts added this to the 1.4.3 milestone Dec 11, 2017
@lpic10 commented Dec 13, 2017

@jwilder Not sure if it's related, but I've noticed a big increase in memory consumption since the update (blue mark):

[screenshot: influxdb internals]

Let me know if you want me to run a profile (or whether I should comment here or open a different issue).

@jwilder (Contributor) commented Dec 13, 2017

@lpic10 Can you try setting cache-snapshot-memory-size = "25m"? I increased the default in the fix, but I believe that is the cause of the increased memory usage, and I will likely need to revert it.

@szibis commented Dec 13, 2017

@jwilder Everything has returned to the bad state. I/O waits are now high, and the number of reads is the same as the number of writes with cache-snapshot-memory-size set to 25m. OOMs are very frequent, and memory usage is much higher than on 1.3.x.

Memory and OOMs:
[screenshot]

I/O wait:
[screenshot]
[screenshot]

IOPS:
[screenshot]

@lpic10 commented Dec 13, 2017

@jwilder I just changed it here; I will give an update tomorrow.

@jwilder (Contributor) commented Dec 13, 2017

@szibis Can you grab some profiles via the /debug/pprof/all endpoint?
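
(For reference, the profile bundle can be pulled with a single request; this sketch assumes influxd is listening on the default localhost:8086 and that your build supports the cpu query parameter.)

# Fetch a tarball of runtime profiles from a locally running influxd;
# cpu=true also captures a CPU profile, which makes the request take
# roughly 30 extra seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"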

@jwilder reopened this Dec 13, 2017
@szibis commented Dec 13, 2017

We have a larger number of measurements right now (1100+) because of a bad reporting issue, but that was also the case on 1.3.x and it worked much better than it does now on 1.4.x. Removing them is now impossible, even one by one. Our standard number of measurements is under 200.

With such a big DB, every config change and restart takes 30+ minutes.
It is very hard to get these profiles on this box/boxes while InfluxDB is operating, or trying to operate.

pprof.contentions.delay.001.pb.gz
pprof.influxd.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
pprof.influxd.samples.cpu.001.pb.gz

@jwilder (Contributor) commented Dec 13, 2017

@szibis I've attached a build of the current 1.4 branch plus some changes I'm testing to resolve these issues. Would you be able to try this out and see if it improves your situation?
influxd.gz

I can push up a branch if you prefer to build the binary yourself.
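
(A minimal sketch of swapping in an attached binary like this on a package install; the /usr/bin/influxd path and the systemd unit name are assumptions, and backing up the packaged binary first is advisable.)

# Unpack the attached test binary and make it executable
gunzip influxd.gz
chmod +x influxd

# Swap it in for the packaged binary (path and unit name are assumptions)
sudo systemctl stop influxdb
sudo cp /usr/bin/influxd /usr/bin/influxd.bak
sudo mv influxd /usr/bin/influxd
sudo systemctl start influxdb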

@szibis commented Dec 13, 2017

Testing on one node.

@jwilder mentioned this issue Dec 13, 2017
@szibis commented Dec 14, 2017

Looks better:

[screenshots]

@lpic10 commented Dec 15, 2017

Changing to cache-snapshot-memory-size = "25m" made the memory usage look a bit more stable, but it is still much higher than before.

I'll give the next nightly with the latest fixes a try.

[screenshot: influxdb internals]

@jwilder (Contributor) commented Dec 15, 2017

@lpic10 @szibis @oiooj I've merged some fixes to the 1.4 branch and master. If any of you are able to test those builds out, that would be really helpful. With these changes, you should leave cache-snapshot-memory-size commented out in your config so that the default is used.

@lpic10 commented Dec 20, 2017

[screenshot: influxdb internals]

Trying the latest nightly with cache-snapshot-memory-size commented out; some conclusions for now:

  • it is rather stable (just a few days, but no crashes and no OOMs);
  • memory consumption is still higher than when this issue was filed, but lower than with the initial fix;
  • disk utilization is about half of what it was when this issue was filed, but higher than with the initial fix: it is about ~50 IOPS now, was >150 IOPS initially, and ~30 IOPS after the initial fix.

Unsurprisingly, the changes that reduced disk I/O increased the need for memory, and vice versa.

@szibis commented Dec 21, 2017

For me, the latest 1.4 branch is OK. It looks better; graphs from the last two days are below.

Currently, my only issues are with long-running read queries taking a huge amount of memory and CPU.

ii  influxdb                              1.4.2~50063f9-0                            amd64        Distributed time-series database

[screenshots]

@oiooj (Contributor) commented Dec 21, 2017

[screenshots]

Looks good. Compared with v1.2.4, only disk.io.util increased, and only by 2%.
I will upgrade when it is released.

@rbetts (Contributor, Author) commented Jan 2, 2018

@jwilder should this be closed?

@jwilder closed this as completed Jan 2, 2018