
InfluxDB 1.7.5 stops responding while ingesting data, 1.7.4 does not #13010

Closed
RedShift1 opened this issue Mar 29, 2019 · 30 comments
@RedShift1

While running InfluxDB 1.7.5, some time after startup while ingesting data, the Influx daemon stops responding. Writes via the HTTP endpoint time out, SELECT queries can't be run, and in the influx CLI any commands that perform reads (show measurements, show tag keys, etc.) hang. There are no log messages, no CPU usage, no memory exhaustion, etc. when this happens.

Stopping influxdb leads to a hard shutdown:

Mar 29 07:30:21 influx1 systemd: Stopping InfluxDB is an open-source, distributed, time series database...
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179652Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179766Z lvl=info msg="Waiting for clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179892Z lvl=info msg="Listener closed" log_id=0ESuf_v0000 service=snapshot
Mar 29 07:30:51 influx1 influxd: ts=2019-03-29T07:30:51.179939Z lvl=info msg="Time limit reached, initializing hard shutdown" log_id=0ESuf_v0000
Mar 29 07:30:51 influx1 systemd: Stopped InfluxDB is an open-source, distributed, time series database.

Note that this instance is in the process of backfilling with out-of-order data. I've now downgraded to 1.7.4 and so far it has not hung.
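
For reference, this is roughly how I check whether the HTTP endpoint is still responding (a minimal sketch; the database name mydb and the default port 8086 are placeholders for my setup):

# Write a single point over the HTTP API; on 1.7.5 this times out once the hang starts
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu,host=server01 value=0.64'
# A read through the same HTTP API; this also hangs instead of returning
curl -G 'http://localhost:8086/query' --data-urlencode 'db=mydb' --data-urlencode 'q=SHOW MEASUREMENTS'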

@aaronjwood

I second this. When I upgraded Influx from 1.7.4-alpine to 1.7.5-alpine, all of my writes returned a 500. Additionally, I saw a bunch of these log messages while bringing things up and down a few times:

Mar 29 01:00:12 myhost docker[30915]: ts=2019-03-29T08:00:12.110995Z lvl=info msg="Write failed" log_id=0EU5vEvG000 service=write shard=748 error="store is closed"

I tried backing up my current Influx dir and starting fresh like I had never spun up Influx before. Even with that things broke in the same way. Downgrading to 1.7.4-alpine brings things back to a working state.

@conet

conet commented Mar 29, 2019

I had the same experience: a couple of minutes after upgrading from 1.7.4, queries started hanging and data stopped being written; basically the entire service seemed to be blocked. Even internal stats stopped being written. I rolled back to 1.7.4. This is the second upgrade since 1.7.0 that has failed me 😞 .

@jmurrayufo

Ran into the same issue here. I spin up an InfluxDB instance, set up some CQs and RPs on the database, and everything seems to work fine until a new host writes data in (the tags contain host names). At that point everything hangs until I restart the service. Every new host hangs the system.
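
For context, the RP/CQ setup is nothing exotic; it is roughly along these lines (a sketch only, the database, policy, and query names below are placeholders, not the real ones):

# Placeholder retention policy and continuous query, created through the influx CLI
influx -execute 'CREATE RETENTION POLICY "one_week" ON "metrics" DURATION 7d REPLICATION 1 DEFAULT'
influx -execute 'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN SELECT mean("value") AS "value" INTO "metrics"."one_week"."cpu_5m" FROM "cpu" GROUP BY time(5m), * END'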

@e-dard
Contributor

e-dard commented Apr 1, 2019

@RedShift1 @conet @jmurrayufo @aaronjwood could you please provide details of the index you're running, and if possible a stack dump of goroutines? You can SIGQUIT the process once it deadlocks.
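
If you're not sure which index you're on, a quick way to check is below (a sketch, assuming the default config location of a package install; index-version lives under the [data] section and defaults to inmem, with tsi1 as the alternative):

# Show the configured index from the config file
grep index-version /etc/influxdb/influxdb.conf
# Or print the effective configuration of the installed binary
influxd config | grep index-version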

@RedShift1
Author

Running the default index (I did not change the default configuration file). I will try to capture a stack dump of the goroutines.

@e-dard
Contributor

e-dard commented Apr 2, 2019

Update from our side.

We think we know which change introduced this deadlock. The best course of action if you encounter it is to roll back to 1.7.4.

However, we would really appreciate seeing a stack trace from someone who is deadlocked on this issue. You can send the process a SIGQUIT via kill -s SIGQUIT <process_id>, or, if influxd is running in the foreground, with Ctrl-\.
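
For example, something along these lines should work on a typical systemd install (a sketch; the unit name influxdb and journald capturing stderr are assumptions about your setup):

# Ask the Go runtime to dump all goroutine stacks (note: this also terminates the process)
kill -s SIGQUIT $(pgrep influxd)
# The dump is written to stderr, so on a systemd setup it should end up in the journal
journalctl -u influxdb --since "10 minutes ago" > influxdb-goroutines.log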

@conet

conet commented Apr 3, 2019

Here it is: influxdb-stacktace.log

@benbjohnson
Contributor

@conet Thank you for the stack trace. Are you seeing any panics in your log before the SIGQUIT?

@conet

conet commented Apr 3, 2019

@benbjohnson I can't find any occurrences of the word panic in the logs preceding the deadlock (or anywhere in the logs). I can approximate the moment of the deadlock based on the fact that write requests start to fail with a status of 500.

@karel-rehor
Contributor

I've come across what appears to be the same issue.

influx release: 1.7.5 data and meta

ts=2019-04-03T14:54:05.632738Z lvl=info msg="InfluxDB Meta starting" log_id=0E_vw1G0000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8 tags=unknown
...
ts=2019-04-03T14:54:07.875194Z lvl=info msg="InfluxDB starting" log_id=0E_vwA0l000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8

telegraf release: checked with both 1.9.4 and 1.10.2

Use case

I'm updating the installation script to use 1.7.5. It creates a test cluster for InfluxDB Enterprise using Docker instances and adds the rest of the TICK stack. Current script found here -

Local version attached.
setup_test_environment_enterprise_version.sh.txt

After seeing httpd respond with 204 to Telegraf writes, I run the first basic Selenium test, connection_wizard.

Roughly every second run, at the end of the connection_wizard test, httpd starts responding with 500 to Telegraf writes.

Other Chronograf tests using the data explorer or dashboards cannot be executed because queries to InfluxDB fail.

Console logs (taken from docker) attached.
console-tele-1_9_4.log
console-tele-1_10_2.log

Screenshots
influxdb-telegraf-no-response

@johnbatty

I hit this when spinning up a new InfluxDB instance on a new machine using Docker: docker run ... influxdb. I managed to resolve it by explicitly specifying the version tag influxdb:1.7.4. However, anyone running the InfluxDB Docker container with the tag latest (or no tag), 1.7, or 1.7-alpine is liable to run into this. You might want to consider changing these tags to refer to 1.7.4 rather than 1.7.5 until the issue is fixed.
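
In the meantime, pinning the tag when starting the container avoids silently picking up 1.7.5 (a sketch; the port and volume are just whatever my setup uses):

# Pull and run a pinned version instead of latest/1.7, which currently resolve to 1.7.5
docker pull influxdb:1.7.4
docker run -d --name influxdb -p 8086:8086 -v influxdb_data:/var/lib/influxdb influxdb:1.7.4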

@benbjohnson
Contributor

@conet We added what we believe is a fix (#13150). Would you be able to build a369cf4 and verify that it fixes the issue?

@conet

conet commented Apr 4, 2019

@benbjohnson if you give me an RPM I can test it; building from source... is too much to ask.

@benbjohnson
Contributor

@conet No problem. I built an RPM for that commit and posted it here:

https://s3-us-west-1.amazonaws.com/support-sftp/home/influxdb13010/influxdb-v1.7.6_a369cf4.x86_64.rpm

@conet

conet commented Apr 4, 2019

@benbjohnson I'd say the fix is good. It has been running for an hour now without any issues, whereas previously it would hang within the first couple of minutes. Nevertheless, I'll leave it running and follow up with another report tomorrow.

@benbjohnson
Contributor

@conet Great, thanks for taking a look so quickly. I appreciate it.

@conet

conet commented Apr 5, 2019

The fix is still good after more than 10 hours; please go ahead and release it.

@benbjohnson
Contributor

Thanks @conet!

@conet

conet commented Apr 9, 2019

When is 1.7.6 going to be released? 1.7.5 is still advertised as the latest stable release, and issues about 1.7.5 keep piling up; some examples: #13256 and this comment.

@conet

conet commented Apr 9, 2019

At least remove 1.7.5 so that others will not be affected by this.

@bndw

bndw commented Apr 10, 2019

We just burned a lot of time debugging this in production. Please update your Docker Hub and GitHub releases with at least a disclaimer.

@timhallinflux
Contributor

We did land a notification of this in the release notes here: https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/

Happy to take feedback on where else you would like to find this notification if the docs location is not somewhere you visit.

@DerMika

DerMika commented Apr 16, 2019

We are experiencing the error ERR: max-concurrent-queries limit exceeded(20, 20) in a prototype situation where it is highly unlikely that 20 concurrent queries are actually running.

We have also configured INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s.

There is nothing showing in the logs.

Could this also be caused by this bug?
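
For reference, these are the coordinator settings as we pass them to the official Docker image (a sketch; the max-concurrent-queries variable name is my assumption, following the same naming pattern as INFLUXDB_COORDINATOR_QUERY_TIMEOUT above):

# Coordinator limits passed as environment variables to the official image
docker run -d \
  -e INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s \
  -e INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES=20 \
  influxdb:1.7.5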

@benbjohnson
Contributor

@DerMika Yes, it is likely that the lock issue is causing queries to block and back up.

@DerMika

DerMika commented Apr 16, 2019

Thanks, I don't seem to have the problem after downgrading to 1.7.4.

@bndw

bndw commented Apr 16, 2019

Please update your Docker Hub and GitHub releases with at least a disclaimer.

@timhallinflux

@timhallinflux
Contributor

timhallinflux commented Apr 17, 2019

Thanks @bndw. 1.7.6 is being published now.
https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/

@bndw

bndw commented Apr 17, 2019

@timhallinflux is there an ETA for this landing in Dockerhub?

@timhallinflux
Contributor

Should be there now. There is usually a <24-hour delay between when we create the build and when releases appear in Docker Hub.

@dandv
Contributor

dandv commented Jul 12, 2019

I see random "Internal server errors" while ingesting data with 1.7.7. If I retry writing points, the operation succeeds. /var/log/influxdb/ is empty.
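
For now I work around it with a blunt retry on the write call, along these lines (a sketch; the endpoint, database, and point are placeholders):

# Retry a write a few times when the server does not return 204 (success for writes)
for i in 1 2 3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -XPOST 'http://localhost:8086/write?db=mydb' \
    --data-binary 'cpu,host=server01 value=0.64')
  [ "$code" = "204" ] && break
  sleep 1
done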
