
Shard deletion triggers infinite memory usage due to cache lock contention #7226

Closed
stevenh opened this issue Aug 28, 2016 · 24 comments

@stevenh (Contributor) commented Aug 28, 2016

Bug report

InfluxDB: 1.0.0-beta3
OS: Linux 3.10.0-327.el7.x86_64

Steps to reproduce:

  1. Ingest data into InfluxDB at a decent rate
  2. Wait for 7 day retention policy to trigger
  3. Watch memory rise indefinitely

Expected behaviour:
Compaction should have little impact on memory

Actual behaviour:
Memory increases until it triggers an out-of-memory condition

Additional info:
Looking into the details from pprof, it's clear that the issue is serious contention on the cache lock, enough so that all queries (read/write) to InfluxDB stall almost indefinitely while compaction takes place.

When incoming requests fall behind, the problem spirals out of control, with more and more memory used for query goroutines.

When we captured it we had ~100k goroutines waiting on the cache lock in tsm1.newIntegerAscendingCursor, accounting for 129GB of active memory; this is very clear to see in heap.pdf below.
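
For context on how the pile-up happens: goroutines taking a read lock on a sync.RWMutex queue behind any goroutine holding (or waiting for) the write lock, so a long write-side hold parks every incoming cursor. A minimal illustration of that shape (plain Go, not InfluxDB code):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	var wg sync.WaitGroup

	mu.Lock() // simulate a slow write-side operation holding the lock

	// Every "reader" parks on RLock until the writer releases the lock,
	// the same shape as cursors piling up on the cache lock.
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.RLock()
			mu.RUnlock()
		}()
	}

	time.Sleep(100 * time.Millisecond)
	fmt.Println("readers parked; releasing the write lock")
	mu.Unlock()

	wg.Wait()
	fmt.Println("all readers completed")
}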

In our case we have a master/slave setup for reads, with this node being the slave, so it's only taking writes from influx-relay unless the primary is down, which it wasn't; we therefore believe all of this is write traffic, not read traffic.

Once we realised this, we disabled writes by taking the node out of influx-relay, and it recovered in ~30 minutes, even from 312GB. At the same point we saw a query complete which had taken 13555 minutes.


Machine Spec

  • 2 x Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
  • 168GB RAM
  • Disk provided by Purestorage SSD-backed storage over 10Gbps iSCSI

Logs:
block.txt
goroutine2.txt
heap2.txt
vars.txt
iostat.txt
heap.pdf


@stevenh (Contributor, Author) commented Aug 28, 2016

Just put in a PR which should help with this: #7228

@stevenh (Contributor, Author) commented Aug 29, 2016

Another thing that could help significantly is introducing the concept of handlers. This would allow the amount of contention on the lock to be controlled, for example by having 10 write handlers, 1 snapshot handler and 50 read handlers.

Each type would be processed by sending requests to a buffered channel which the relevant handlers read from and process. If the buffer is full, the action could be rejected, preventing excessive backlog build-up, which in turn would ensure that memory usage stays bounded.
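
A rough sketch of what I mean (purely illustrative; the names, sizes and types are made up and not InfluxDB APIs): a fixed pool of handlers drains a bounded channel, and submissions are rejected once the buffer is full, so the backlog, and therefore memory, stays bounded.

package main

import (
	"errors"
	"fmt"
	"sync"
)

// writeReq stands in for a batch of points to be written.
type writeReq struct{ points []string }

var errBusy = errors.New("write queue full")

// writePool is a fixed number of handlers reading from one bounded channel.
type writePool struct {
	queue chan writeReq
	wg    sync.WaitGroup
}

func newWritePool(workers, depth int, process func(writeReq)) *writePool {
	p := &writePool{queue: make(chan writeReq, depth)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for req := range p.queue {
				process(req)
			}
		}()
	}
	return p
}

// Submit enqueues a request, or rejects it immediately when the buffer is
// full, so callers back off instead of piling up in memory.
func (p *writePool) Submit(req writeReq) error {
	select {
	case p.queue <- req:
		return nil
	default:
		return errBusy
	}
}

func (p *writePool) Close() {
	close(p.queue)
	p.wg.Wait()
}

func main() {
	pool := newWritePool(10, 100, func(r writeReq) {
		fmt.Println("processed", len(r.points), "points")
	})
	if err := pool.Submit(writeReq{points: []string{"cpu value=1"}}); err != nil {
		fmt.Println("rejected:", err)
	}
	pool.Close()
}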

In addition to this, sharding the cache and looking to eliminate the impact of GC should also be done, as the current cache is likely to cause significant GC pauses due to its use of pointers; see http://allegro.tech/2016/03/writing-fast-cache-service-in-go.html for details.
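
For the sharding part, the idea is roughly the following (a sketch only, not the tsm1 cache): keys hash to one of N shards, each with its own lock, so writes to different series stop contending on a single mutex. Addressing the GC side would additionally need pointer-free storage along the lines of the linked article, which this sketch does not attempt.

package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 256

// shard is an independently locked slice of the key space.
type shard struct {
	mu   sync.RWMutex
	data map[string][]byte
}

type shardedCache struct {
	shards [shardCount]*shard
}

func newShardedCache() *shardedCache {
	c := &shardedCache{}
	for i := range c.shards {
		c.shards[i] = &shard{data: make(map[string][]byte)}
	}
	return c
}

// shardFor hashes the key so contention spreads across shardCount locks
// instead of one global lock.
func (c *shardedCache) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return c.shards[h.Sum32()%shardCount]
}

func (c *shardedCache) Put(key string, value []byte) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.data[key] = value
	s.mu.Unlock()
}

func (c *shardedCache) Get(key string) ([]byte, bool) {
	s := c.shardFor(key)
	s.mu.RLock()
	v, ok := s.data[key]
	s.mu.RUnlock()
	return v, ok
}

func main() {
	c := newShardedCache()
	c.Put("cpu,host=server01", []byte("98.2"))
	if v, ok := c.Get("cpu,host=server01"); ok {
		fmt.Println(string(v))
	}
}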

@jackzampolin (Contributor) commented Sep 12, 2016

@stevenh Have you tested out 1.0 yet? There was a pull merged which may help with this: #7258. There is also some in flight work that might help: #7165

@stevenh (Contributor, Author) commented Sep 12, 2016

Yes, we're currently testing 1.0, but we haven't let a compaction happen yet.

Given the description of the problem in #7258, I don't believe that's the cause of this: it describes an irrecoverable situation, whereas stopping the incoming data allowed our issue to recover, so it seems unlikely to be the cause.

@stevenh (Contributor, Author) commented Sep 12, 2016

Also, I've just read through #7165 and it doesn't seem to be related; neither of these seems to address the underlying issue of the high lock contention or the data race which will cause invalid cache size restriction.

@jwilder (Contributor) commented Sep 13, 2016

@stevenh What size of writes are you sending? Are you batching writes or just sending many small ones? Can you provide some sample writes in line protocol format?
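
For reference, a batched write in line protocol posted to the 1.x HTTP /write endpoint looks roughly like the sketch below (the measurement, tag, field and database names are made up):

package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// One point per line: measurement,tags fields timestamp (ns).
	batch := strings.Join([]string{
		"cpu,host=server01,region=us-west usage_idle=98.2 1472342400000000000",
		"cpu,host=server02,region=us-west usage_idle=87.5 1472342400000000000",
	}, "\n")

	// Sending many points per POST is far cheaper than many tiny writes.
	resp, err := http.Post(
		"http://localhost:8086/write?db=mydb&precision=ns",
		"text/plain",
		strings.NewReader(batch),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("write status:", resp.Status)
}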

Also, compactions are not the same thing as retention policy expiring shards. Compactions happen all the time while writes are occurring on a shard. Snapshot compactions are the only ones that lock the cache, but they do not hold onto the locks while the compactions are run.

The large heap allocations from queries look like they may be querying against a large number of series, which may indicate a cardinality issue with your tags or a query that is not selective enough. There are also a number of goroutines gzipping responses, which is odd. Perhaps a slow query reader?

You may want to try enabling some of these limits to help mitigate some of the issues as well as pinpoint any problem queries:

[http]
  max-connection-limit = 20

[coordinator]
  max-concurrent-queries = 20
  query-timeout = "1m"
  log-queries-after = "30s"
  max-select-series = 100000

@liv3d commented Sep 15, 2016

@jwilder I'll add the settings to our setup later on today :)


In the meantime, I grabbed a backup of production data, loaded it into 1.0 in our dev environment, and set the retention down to 7 days. After ~30 minutes the retention kicked in and deleted ~271,000 series (kind of expected). From our graphing, the retention took 8 minutes and 30 seconds to run, but during this time it completely stopped writing data to both the _internal database and the other(s).

[graphs: heap graph, tsm metrics, writes]

I'm going to retest this shortly and grab heap, goroutine, blocks and vars while the node isn't graphing.

@liv3d commented Sep 15, 2016

block.txt
goroutine.txt
heap.txt
vars.txt

I managed to grab these while reducing the retention policy on the same backup we used during the test yesterday.

@jwilder (Contributor) commented Sep 15, 2016

@liv3d I think I see part of the issue in one of those traces now. When the shard deletion kicks in, the shard is Closed and a write lock is held during this time. The Close takes a while, and during that time stats collection gets blocked, which is why you don't see any data in _internal.

Not sure why other writes are getting blocked yet, though.
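
The usual way to avoid that kind of stall, sketched very roughly below (not the actual tsdb code), is to detach the shard under a short lock so nothing new routes to it, and then do the slow Close/unindex work without holding the lock:

package main

import "sync"

// Shard stands in for a tsdb shard; Close represents the slow
// close/unindex work.
type Shard struct{ id uint64 }

func (s *Shard) Close() error { return nil }

// Store stands in for the shard store guarded by one lock.
type Store struct {
	mu     sync.RWMutex
	shards map[uint64]*Shard
}

func (st *Store) DeleteShard(id uint64) error {
	// Short critical section: detach the shard so stats collection,
	// queries and writes no longer see it.
	st.mu.Lock()
	sh, ok := st.shards[id]
	if !ok {
		st.mu.Unlock()
		return nil
	}
	delete(st.shards, id)
	st.mu.Unlock()

	// The slow work runs outside the store lock, so everything else keeps going.
	return sh.Close()
}

func main() {
	st := &Store{shards: map[uint64]*Shard{1: {id: 1}}}
	_ = st.DeleteShard(1)
}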

@jwilder (Contributor) commented Sep 16, 2016

@liv3d I have a possible fix for this in #7314. Would you be able to test that PR out in your environment to see if it resolves this issue?

@stevenh (Contributor, Author) commented Sep 16, 2016

With regards to other writes getting blocked, the initial traces indicated that they weren't totally blocked; they were just progressing VERY slowly due to the cache lock contention.

@liv3d commented Sep 16, 2016

@jwilder running that build, I get panics when querying _internal or doing SHOW STATS - these are attached here

@jwilder (Contributor) commented Sep 16, 2016

@liv3d I think that panic is due to a recent change in master unrelated to this. I'll rebase the PR off of the 1.0 branch while we track down the cause of that panic.

@jwilder (Contributor) commented Sep 16, 2016

The panic is a regression due to #7296

@jwilder (Contributor) commented Sep 16, 2016

@liv3d The fix for that panic is in master now. If you pull down the updated #7314 PR, it should have that fix in there now.

@liv3d commented Sep 16, 2016

@jwilder I'm not convinced that panic was unrelated. I tried SHOW STATS again and it still panics.

[danof@influxdb-1.labs ~]$ influxd version
InfluxDB v1.0.0 (git: jw-delete-shard 0063ca4842a83eb0290c06c3e48b9783c63d1869)
[danof@influxdb-1.labs ~]$ influx -execute "SHOW STATS"
ERR: SHOW STATS [panic:assignment to entry in nil map]
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
SHOW STATS [panic:assignment to entry in nil map]
[danof@influxdb-1.labs ~]$

My box on 1.0 GA works:

[danof@influxdb-2.labs ~]$ influxd version
InfluxDB v1.0.0 (git: master 37992377a55fbc138b2c01edd4deffed64b53989)
[danof@influxdb-2.labs ~]$ influx -execute "SHOW STATS"
name: cq
--------
queryFail   queryOk
0       13972
...

jwilder added a commit that referenced this issue Sep 16, 2016
When deleting a shard, the shard is locked and then removed from the
index.  Removal from the index can be slow if there are a lot of
series.  During this time, the shard is still expected to exist by
the meta store and tsdb store so stats collections, queries and writes
could all be run on this shard while it's locked.  This can cause everything
to lock up until the unindexing completes and the shard can be unlocked.

Fixes #7226
@jwilder (Contributor) commented Sep 16, 2016

@liv3d That's actually a different panic in the stats collection for subscribers. It was introduced in #7177 and is now fixed with #7318. I rebased on top of that fix now. master has many changes for 1.1 right now so it's more unstable. Can you test again?

@liv3d commented Sep 16, 2016

The good news @jwilder is I can indeed run SHOW STATS. I'm going to grab some food and retest in a bit.

@liv3d commented Sep 16, 2016

@jwilder This looks way better from my quick test of altering the retention:

[graph: series drop]

I'm also seeing memory usage at roughly a tenth of what it was on 1.0 GA.

[graph: memory]

@stevenh We'll need to do more testing during the week, but this actually looks good

@jwilder self-assigned this Sep 16, 2016
@jwilder (Contributor) commented Sep 20, 2016

@liv3d @stevenh Have you had a chance to test further?

@liv3d commented Sep 21, 2016

@jwilder I've done some more testing and I'm pretty happy with the change. Hopefully this can make the next release.
I did notice that during the shard retention inserts go from ~0.09s to ~2.50s; however, none of them failed.

@jwilder changed the title from "Compaction triggers infinite memory usage due to cache lock contention" to "Shard deletion triggers infinite memory usage due to cache lock contention" Sep 21, 2016
@dswarbrick

I'm a little bit late to the party on this one, but I've also been having problems dropping series / measurements on a roughly 75 GB database with just under 2M series. I tried to drop a measurement earlier today, which took a while, but eventually the command returned to the prompt. However, in the following minutes, influxd started to eat a ton of memory (32 GB + 4 GB swap) and eventually crashed. I've been very hesitant about deleting stuff because this behaviour certainly is not new.

Here's hoping that this patch improves things.

@jwilder (Contributor) commented Sep 21, 2016

@dswarbrick #7165 may also help you out. If you are able to test it out, that would be very helpful.

@jwilder (Contributor) commented Sep 21, 2016

That PR needs to be rebased w/ master though.

jwilder added a commit that referenced this issue Sep 23, 2016
@timhallinflux modified the milestones: 1.1.0, 1.0.1 Dec 19, 2016