data loss during high ingest rate #7330

Closed
phemmer opened this Issue Sep 20, 2016 · 9 comments

phemmer commented Sep 20, 2016

Bug report

System info:
influxdb 1.0
Linux & MacOS

Steps to reproduce:

  1. Start second influxdb on port 8087
  2. from influx shell: create database test
  3. from influx shell: create subscription "sub0" on "test"."autogen" destinations all 'http://localhost:8087'
  4. for ((i=0; i<30; i++)); do ( for ((n=0; n<300; n++)); do echo "insert test,foo=bar value=${i}i"; done | influx -database test ) & done; wait
  5. from influx -database test shell: select sum(value) from test
  6. from influx -database test -port 8087 shell: select sum(value) from test

Expected behavior:
130500
130500
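
The expected value follows from the loop in step 4: 30 background writers, where writer i inserts the value i exactly 300 times. A small sketch of that arithmetic (the function name `expectedSum` is just for illustration):

```go
package main

import "fmt"

// expectedSum mirrors the shell loop in step 4: `writers` background loops,
// where loop i inserts the integer value i exactly `perWriter` times.
func expectedSum(writers, perWriter int) int {
	total := 0
	for i := 0; i < writers; i++ {
		total += i * perWriter
	}
	return total
}

func main() {
	// 300 * (0 + 1 + ... + 29) = 300 * 435
	fmt.Println(expectedSum(30, 300)) // prints 130500
}
```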

Actual behavior:
130488
20188

Additional info:
The first number is only sometimes off, and only by a small amount.
The second number (from the second influxdb) is always off, by a huge amount.

This is causing a problem for me as kapacitor is missing large amounts of data.

mark-rushakoff commented Sep 20, 2016

If subscription writes are being dropped (which it sounds like they are in your case), you should see the subWriteDrop stat increase.

> show stats for 'write'
name: write
-----------
pointReq        pointReqHH      pointReqLocal   pointReqRemote  req     subWriteDrop    subWriteOk      writeError      writeOk writePartial    writeTimeout
265             0               265             0               33      0               33              0               33      0               0

> select subWriteDrop from _internal.."write" where time > now() - 1m
name: write
-----------
time                    subWriteDrop
2016-09-20T04:07:50Z    0
2016-09-20T04:08:00Z    0
2016-09-20T04:08:10Z    0
2016-09-20T04:08:20Z    0
2016-09-20T04:08:30Z    0
2016-09-20T04:08:40Z    0

You might be able to reduce subscription write drops by adjusting your batch size while keeping an eye on that stat.

phemmer commented Sep 20, 2016

Negative:

> select subWriteDrop from _internal.."write" where time > now() - 1m
name: write
-----------
time            subWriteDrop
1474348810000000000 0
1474348820000000000 0
1474348830000000000 0
1474348840000000000 0
1474348850000000000 0
1474348860000000000 0

(show stats for 'write' panics)

phemmer commented Sep 24, 2016

Anything further on this? Data loss is kinda a big deal.

nathanielc commented Sep 26, 2016

@phemmer That is definitely not expected if the subWriteDrop count is 0. Are there any errors in the InfluxDB logs?

phemmer commented Sep 26, 2016

No errors.

jwilder added this to the 1.0.2 milestone Sep 26, 2016

jwilder commented Sep 26, 2016

I was able to reproduce this. show stats for 'subscriber' shows dropped points. Points are getting dropped in the subscriber service here: https://github.com/influxdata/influxdb/blob/master/services/subscriber/service.go#L241

The sends into the cw.writeRequests channel are coming in faster than the receiver can read and process them off the channel. The test script uses 30 concurrent writers, but the reader appears to be a single goroutine: https://github.com/influxdata/influxdb/blob/master/services/subscriber/service.go#L299
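
To illustrate the failure mode, here is a minimal sketch that assumes the drop comes from a non-blocking send (`select` with a `default` case) into a buffered channel. It does not reproduce the linked service code; `trySend` and `countDrops` are hypothetical names, and the reader is left out entirely to model a receiver that cannot keep up:

```go
package main

import "fmt"

// trySend attempts a non-blocking send. It reports false when the buffer is
// full, meaning the point is dropped instead of waiting for the reader.
func trySend(ch chan<- int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

// countDrops sends `points` values into a channel with `buffer` slots while
// nothing is draining it, and returns how many sends were dropped.
func countDrops(points, buffer int) int {
	ch := make(chan int, buffer)
	dropped := 0
	for v := 0; v < points; v++ {
		if !trySend(ch, v) {
			dropped++
		}
	}
	return dropped
}

func main() {
	fmt.Println(countDrops(30, 10)) // prints 20: only the first 10 fit
}
```

With this pattern the sender never blocks, so the write itself can still succeed while the forwarded copy is lost.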

jwilder added the kind/bug label Sep 26, 2016

phemmer commented Sep 26, 2016

While I'm not sure what the fix here is going to be, can we also make the max sizes of any chan buffers involved configurable? It would also help to have a way of reporting metrics on the sizes of those chan buffers.
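
As a rough idea of what such a metric could look like, Go's built-in `len` and `cap` report the queued items and the capacity of a buffered channel. This is only a sketch; `bufferUsage` is a hypothetical helper and the buffer size of 8 stands in for a value that would come from configuration:

```go
package main

import "fmt"

// bufferUsage reports how many items are queued in a buffered channel and
// its total capacity, using Go's built-in len and cap.
func bufferUsage(ch chan int) (queued, capacity int) {
	return len(ch), cap(ch)
}

func main() {
	ch := make(chan int, 8) // assumed size; would be a config option
	for v := 0; v < 3; v++ {
		ch <- v
	}
	queued, capacity := bufferUsage(ch)
	fmt.Printf("queued=%d capacity=%d\n", queued, capacity) // prints queued=3 capacity=8
}
```

With concurrent senders and receivers the `len` value is only a snapshot, so it is better suited to periodic sampling than to making decisions.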

jwilder commented Sep 27, 2016

The fix is to increase the number of readers processing points from the channel.

jwilder added the performance label Sep 27, 2016

jwilder added a commit that referenced this issue Oct 4, 2016

Fix subscriber service dropping writes under high write load
The subscriber write goroutine would drop points if the write load
was higher than it could process.  This could happen with just a
few writers to the server.

Instead, process the channel with multiple writers to avoid dropping
writes so easily.  This also adds some config options to control how
large the channel buffer is as well as how many goroutines are started.

Fixes #7330
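
The approach described in the commit message can be sketched as several goroutines draining one buffered channel. This is not the code from the commit: `drain` and its `buffer` and `workers` parameters are hypothetical stand-ins for the new config options, and the sketch uses a blocking send so the result is deterministic, whereas the commit message suggests the real service can still drop writes, just less readily.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// drain starts `workers` goroutines reading from a channel with `buffer`
// slots, sends `points` values, and returns how many values were processed.
func drain(points, buffer, workers int) int64 {
	ch := make(chan int, buffer)
	var processed int64
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range ch {
				atomic.AddInt64(&processed, 1)
			}
		}()
	}
	for v := 0; v < points; v++ {
		ch <- v // blocking send: waits for a free slot instead of dropping
	}
	close(ch) // lets the workers exit once the buffer is empty
	wg.Wait()
	return processed
}

func main() {
	fmt.Println(drain(9000, 16, 4)) // prints 9000
}
```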

jwilder commented Oct 5, 2016

Fixed via #7407
