data loss during high ingest rate #7330

phemmer · 2016-09-20T00:21:58Z

Bug report

System info:
influxdb 1.0
Linux & MacOS

Steps to reproduce:

Start second influxdb on port 8087
from influx shell: create database test
from influx shell: create subscription "sub0" on "test"."autogen" destinations all 'http://localhost:8087'
for ((i=0; i<30; i++)); do ( for ((n=0; n<300; n++)); do echo "insert test,foo=bar value=${i}i"; done | influx -database test ) & done; wait
from influx -database test shell: select sum(value) from test
from influx -database test -port 8087 shell: select sum(value) from test

Expected behavior:
130500
130500

Actual behavior:
130488
20188
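
For reference, the expected total follows directly from the reproduction loop: 30 background writers, each inserting 300 points whose value is the writer's index (0 through 29). A minimal Go sketch of that arithmetic is below; the function name `expectedSum` is only illustrative and is not part of InfluxDB.

```go
package main

import "fmt"

// expectedSum returns the total that "select sum(value)" should report when
// `writers` writers (indexed 0..writers-1) each insert `pointsPerWriter`
// points whose value equals the writer's index.
func expectedSum(writers, pointsPerWriter int) int {
	total := 0
	for i := 0; i < writers; i++ {
		total += i * pointsPerWriter
	}
	return total
}

func main() {
	// Matches the shell loop above: i in [0,30), n in [0,300).
	fmt.Println(expectedSum(30, 300)) // prints 130500
}
```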

Additional info:
The first number is only sometimes off, and only by a small amount.
The second number (from the second influxdb) is always off, by a huge amount.

This is causing a problem for me as kapacitor is missing large amounts of data.


mark-rushakoff · 2016-09-20T04:18:03Z

If subscription writes are being dropped (which it sounds like they are in your case), you should see the subWriteDrop stat increase.

> show stats for 'write'
name: write
-----------
pointReq        pointReqHH      pointReqLocal   pointReqRemote  req     subWriteDrop    subWriteOk      writeError      writeOk writePartial    writeTimeout
265             0               265             0               33      0               33              0               33      0               0

> select subWriteDrop from _internal.."write" where time > now() - 1m
name: write
-----------
time                    subWriteDrop
2016-09-20T04:07:50Z    0
2016-09-20T04:08:00Z    0
2016-09-20T04:08:10Z    0
2016-09-20T04:08:20Z    0
2016-09-20T04:08:30Z    0
2016-09-20T04:08:40Z    0

You might be able to reduce subscription write drops by adjusting your batch size while keeping an eye on that stat.

phemmer · 2016-09-20T05:22:35Z

Negative:

> select subWriteDrop from _internal.."write" where time > now() - 1m
name: write
-----------
time            subWriteDrop
1474348810000000000 0
1474348820000000000 0
1474348830000000000 0
1474348840000000000 0
1474348850000000000 0
1474348860000000000 0

(show stats for 'write' panics)

phemmer · 2016-09-24T01:04:48Z

Anything further on this? Data loss is kinda a big deal.

nathanielc · 2016-09-26T15:28:54Z

@phemmer That is definitely not expected if the subWriteDrop count is 0. Are there any errors in the InfluxDB logs?

phemmer · 2016-09-26T15:41:58Z

No errors.

jwilder · 2016-09-26T23:04:04Z

I was able to reproduce this. show stats for 'subscriber' shows dropped points. Points are getting dropped in the subscriber service here: https://github.com/influxdata/influxdb/blob/master/services/subscriber/service.go#L241

The sends into the cw.writeRequests channel are coming in faster than the receiver can read and process them off the channel. The test script uses 30 concurrent writers, but the reader appears to be a single goroutine: https://github.com/influxdata/influxdb/blob/master/services/subscriber/service.go#L299
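
To illustrate the failure mode described here, the sketch below assumes the send into the channel is non-blocking (a `select` with a `default` branch), so a point is discarded whenever the buffer is full. This is a simplified stand-in, not the code from service.go; `trySend` and `countDrops` are hypothetical names, and the reader is omitted entirely to make the drop count deterministic.

```go
package main

import "fmt"

// trySend attempts a non-blocking send. If the channel buffer is full and no
// receiver is ready, the point is dropped and false is returned.
func trySend(ch chan<- int, point int) bool {
	select {
	case ch <- point:
		return true
	default:
		return false
	}
}

// countDrops sends n points into a buffered channel that nothing is reading
// from and returns how many of them were dropped.
func countDrops(bufSize, n int) int {
	ch := make(chan int, bufSize)
	dropped := 0
	for i := 0; i < n; i++ {
		if !trySend(ch, i) {
			dropped++
		}
	}
	return dropped
}

func main() {
	// A buffer of 2 accepts the first two points; the remaining three are lost.
	fmt.Println(countDrops(2, 5)) // prints 3
}
```

With a slow single reader the effect is the same in kind: once writers outpace the reader and the buffer fills, every further send falls through to the drop branch.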

phemmer · 2016-09-26T23:50:29Z

While I'm not sure what the fix here is going to be, can we also make the max sizes of any chan buffers involved configurable? And also provide a way of reporting metrics on the sizes of those chan buffers?
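
As a rough idea of what such a metric could look like: Go's built-in `len` and `cap` report a buffered channel's current fill level and its capacity. The sketch below is only an assumption about how this might be exposed; `bufferStats` and `statsFor` are made-up names and do not correspond to any existing InfluxDB stat.

```go
package main

import "fmt"

// bufferStats is a hypothetical snapshot of a channel buffer.
type bufferStats struct {
	Len int // points currently queued
	Cap int // maximum points the buffer can hold
}

// statsFor reads the current fill level and capacity of a buffered channel.
// The values are a point-in-time snapshot and may be stale immediately
// afterwards if other goroutines are sending or receiving.
func statsFor(ch chan int) bufferStats {
	return bufferStats{Len: len(ch), Cap: cap(ch)}
}

func main() {
	ch := make(chan int, 4)
	ch <- 1
	ch <- 2
	ch <- 3
	fmt.Printf("%+v\n", statsFor(ch)) // prints {Len:3 Cap:4}
}
```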

jwilder · 2016-09-27T17:15:16Z

The fix is to increase the number of readers processing points from the channel.

The subscriber write goroutine would drop points if the write load was higher than it could process. This could happen with just a few writers to the server. Instead, process the channel with multiple writers to avoid dropping writes so easily. This also adds some config options to control how large the channel buffer is as well as how many goroutines are started. Fixes #7330

jwilder · 2016-10-05T19:22:39Z

Fixed via #7407

jwilder added this to the 1.0.2 milestone Sep 26, 2016

jwilder added the kind/bug label Sep 26, 2016

jwilder added the area/performance label Sep 27, 2016

jwilder mentioned this issue Oct 4, 2016

Fix subscriber service dropping writes under high write load #7407

Merged

jwilder closed this as completed Oct 5, 2016

jwilder mentioned this issue Nov 16, 2017

InfluxDB sends points to subscriber out of order #8932

Closed
