Investigate higher disk i/o utilization on 1.4.2. #9201

Closed · rbetts opened this issue Dec 5, 2017 · 28 comments · Fixed by #9204
@rbetts (Contributor) commented Dec 5, 2017

  • Not yet able to reproduce higher io util. WIP.
@szibis commented Dec 5, 2017

I have this issue too: I now see the same high number of reads as writes, whereas previously only the writes were high. This is very bad, and I am seeing 30% I/O wait right now.

I can add more IOPS, but before 1.4.2, on 1.3.6, what I have was enough.

@jwilder (Contributor) commented Dec 5, 2017

@szibis How many CPU cores do you have available and also how many IOPS can your disks support?
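
(As a quick reference for anyone else hitting this, a minimal sketch for answering both questions on a Linux host; nproc comes from coreutils and iostat from the sysstat package, and neither is specific to InfluxDB.)

# Core count available to influxd
nproc

# Per-device IOPS and utilization, sampled once a second, five reports
iostat -dx 1 5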

@szibis commented Dec 5, 2017

We are using m4 instances on AWS with 16 cores, plus an additional 3 TB gp2 EBS volume. Traffic is about 60k+ writes per second.

It is currently using twice as much memory as on 1.3.6 - about 50-58 GB.
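
(For rough context on the available headroom: gp2 volumes provision on the order of 3 IOPS per GiB, so a 3 TB volume gives roughly 9,000 baseline IOPS under 2017-era gp2 limits; the arithmetic below is only a sanity check, not an AWS-confirmed figure.)

# Rough gp2 baseline IOPS estimate (assumption: ~3 IOPS per provisioned GiB)
echo $(( 3 * 1024 * 3 ))   # 3 TiB x 3 IOPS/GiB = 9216 baseline IOPS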

@jwilder (Contributor) commented Dec 6, 2017

@szibis Since you have 16 cores or more, can you try setting:

[data]
cache-snapshot-memory-size = 1073741824
max-concurrent-compactions = 4

You may need to increase cache-max-memory-size = 2147483648 or higher if it's still the default. This won't resolve the issue fully, but should reduce disk util a bit. You can also experiment with setting max-concurrent-compactions as low as 2, but you'll need to monitor the compaction backlog to make sure it is still keeping up over time.
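
(If it helps anyone following along, a minimal sketch of applying this on a stock package install; the config path /etc/influxdb/influxdb.conf and the systemd unit name influxdb are assumptions and may differ on your system.)

# Print the current [data] section (up to the next section header) to confirm the settings
sudo sed -n '/^\[data\]/,/^\[/p' /etc/influxdb/influxdb.conf

# The new values only take effect after a restart
sudo systemctl restart influxdb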

@ghost added the review label Dec 6, 2017
@ghost removed the review label Dec 7, 2017
@jwilder (Contributor) commented Dec 7, 2017

@szibis I have merged a fix to master and 1.4 branch. If you are able to test it out, that would be great.

@szibis commented Dec 7, 2017

@jwilder Is this available as a nightly build, or do I need to build the package on my own?

@jwilder (Contributor) commented Dec 7, 2017

@szibis If you want to test the nightly, it should be in there tonight. That is based off of master/1.5, though; the current nightly is using fd11e20. To test on 1.4, you would need to build from the 1.4 branch.
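
(In case it saves someone a step, a rough build-from-source sketch for that era's GOPATH-style workflow; the project's CONTRIBUTING.md is the authoritative reference, and the paths below are assumptions.)

# Fetch the source and switch to the 1.4 branch (GOPATH layout assumed)
go get -d github.com/influxdata/influxdb
cd "$GOPATH/src/github.com/influxdata/influxdb"
git checkout 1.4

# Produces an influxd binary in $GOPATH/bin
go install ./cmd/influxd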

@szibis commented Dec 7, 2017

Using this version:

influxd version
InfluxDB v1.4.2 (git: 1.4 50063f9ecdab80f3b1afc5ad36dff0e2c5e9cdee)

I just set, as you suggested:

cache-snapshot-memory-size = 1073741824
max-concurrent-compactions = 4

On InfluxDB 1.3.6, before the config changes:
[screenshot]

On InfluxDB 1.4.2, after the code fixes and config changes:
[screenshot]

iowait looks better; it is low now:
[screenshot]

Previously, before the problems:
[screenshot]

Waiting for some longer-interval stats.

@jwilder (Contributor) commented Dec 7, 2017

@szibis I just realized that, if you set cache-snapshot-memory-size = 1073741824, you will also need to bump up cache-max-memory-size to something higher than 1073741824 (the default). Setting it to 2 or 3gb might be appropriate for you. That will prevent getting cache-max-memory-size limit errors on writes when the cache fills up.

If you are running the change in 1.4, you shouldn't need the config changes though. I suggested those as a workaround for 1.4.2 in lieu of the fixes. They won't hurt if you want to keep them though.
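
(If you want to confirm whether the cache limit is being hit, the write errors mention the cache-max-memory-size setting by name, so a grep like the sketch below against the service logs should surface them; the journald unit name is an assumption, and the exact error wording may vary by version.)

# Surface cache-limit write errors in recent logs (systemd install assumed)
journalctl -u influxdb --since "1 hour ago" | grep -i "cache-max-memory-size"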

@szibis commented Dec 7, 2017

Yes, I have had this bumped from the beginning:

cache-max-memory-size = 3048576000

I will keep them changed.

Thanks for the fix and the help.

@lpic10 commented Dec 8, 2017

@jwilder I can also see a big improvement over the previous nightly builds. I had noticed this bump in disk I/O when going from 1.3 to 1.4+.

[screenshots: cpu and disk I/O]

@jwilder (Contributor) commented Dec 8, 2017

@lpic10 Thanks for the update. What do you have set for cache-snapshot-memory-size?

@lpic10 commented Dec 8, 2017

> @lpic10 Thanks for the update. What do you have set for cache-snapshot-memory-size?

I don't have it set; I suppose it is the default value.

@rbetts added this to the 1.4.3 milestone Dec 11, 2017
@lpic10 commented Dec 13, 2017

@jwilder Not sure if it's related, but I've noticed a big increase in memory consumption since the update (blue mark):

[screenshot: influxdb internals]

Let me know if you want me to run a profile (or whether I should comment here or open a different issue).

@jwilder (Contributor) commented Dec 13, 2017

@lpic10 Can you try setting cache-snapshot-memory-size = "25m"? I increased the default in the fix, but I believe that is the cause of the increased memory usage, and I will likely need to revert it.

@szibis commented Dec 13, 2017

@jwilder Everything has returned to the bad state. I/O waits are now high, and the number of reads is the same as the number of writes with cache-snapshot-memory-size set to 25m. OOMs are very frequent, and memory usage is much higher than on 1.3.x.

Memory and OOMs:
[screenshot]

I/O wait:
[screenshot]
[screenshot]

IOPS:
[screenshot]

@lpic10 commented Dec 13, 2017

@jwilder I just changed it here; I will give an update tomorrow.

@jwilder (Contributor) commented Dec 13, 2017

@szibis Can you grab some profiles via the /debug/pprof/all endpoint?
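
(For reference, the profile bundle can be pulled with a single request; this sketch assumes influxd is listening on the default localhost:8086 and that your build supports the cpu query parameter.)

# Fetch a tarball of runtime profiles from a locally running influxd;
# cpu=true also captures a CPU profile, which makes the request take
# roughly 30 extra seconds.
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all?cpu=true"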

@jwilder reopened this Dec 13, 2017
@szibis commented Dec 13, 2017

We have a larger number of measurements right now (1100+) because of a bad reporting issue, but that was also the case on 1.3.x and it worked much better than it does now on 1.4.x. Removing them is now impossible, even one by one. Our standard number of measurements is under 200.

With such a big DB, every config change and restart takes 30+ minutes.
It is very hard to get these profiles on this box/boxes while InfluxDB is operating, or trying to operate.

pprof.contentions.delay.001.pb.gz
pprof.influxd.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
pprof.influxd.samples.cpu.001.pb.gz

@jwilder (Contributor) commented Dec 13, 2017

@szibis I've attached a build of the current 1.4 branch plus some changes I'm testing to resolve these issues. Would you be able to try this out and see if it improves your situation?
influxd.gz

I can push up a branch if you prefer to build the binary yourself.
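
(A minimal sketch of swapping in an attached binary like this on a package install; the /usr/bin/influxd path and the systemd unit name are assumptions, and backing up the packaged binary first is advisable.)

# Unpack the attached test binary and make it executable
gunzip influxd.gz
chmod +x influxd

# Swap it in for the packaged binary (path and unit name are assumptions)
sudo systemctl stop influxdb
sudo cp /usr/bin/influxd /usr/bin/influxd.bak
sudo mv influxd /usr/bin/influxd
sudo systemctl start influxdb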

@szibis commented Dec 13, 2017

Testing on one node.

@jwilder mentioned this issue Dec 13, 2017
@szibis commented Dec 14, 2017

Looks better:

[screenshots]

@lpic10 commented Dec 15, 2017

Changing to cache-snapshot-memory-size = "25m" made the memory usage look a bit more stable, but it is still much higher than before.

I'll give the next nightly with the latest fixes a try.

[screenshot: influxdb internals]

@jwilder (Contributor) commented Dec 15, 2017

@lpic10 @szibis @oiooj I've merged some fixes to the 1.4 branch and master. If any of you are able to test those builds out, that would be really helpful. With these changes, you should leave cache-snapshot-memory-size commented out in your config so that the default is used.

@lpic10 commented Dec 20, 2017

[screenshot: influxdb internals]

Trying the latest nightly with cache-snapshot-memory-size commented out; some conclusions for now:

  • it is rather stable (just a few days, but no crashes and no OOMs);
  • memory consumption is still higher than when this issue was filed, but lower than with the initial fix;
  • disk utilization is about half of what it was when this issue was filed, but higher than with the initial fix: it is about ~50 IOPS now, was >150 IOPS initially, and ~30 IOPS after the initial fix.

Unsurprisingly, the changes that reduced disk I/O increased the need for memory, and vice versa.

@szibis commented Dec 21, 2017

For me, the latest 1.4 branch is OK. It looks better; graphs from the last two days are below.

Currently, my only issues are with long-running read queries taking a huge amount of memory and CPU.

ii  influxdb                              1.4.2~50063f9-0                            amd64        Distributed time-series database

[screenshots]

@oiooj (Contributor) commented Dec 21, 2017

[screenshots]

Looks good. Compared with v1.2.4, only disk.io.util increased, and only by 2%.
I will upgrade when it is released.

@rbetts (Contributor, Author) commented Jan 2, 2018

@jwilder should this be closed?

@jwilder closed this as completed Jan 2, 2018