
InfluxDB 1.7.5 stops responding while ingesting data, 1.7.4 does not #13010

Closed
RedShift1 opened this issue Mar 29, 2019 · 30 comments
@RedShift1

While running InfluxDB 1.7.5, some time after startup while ingesting data, the Influx daemon stops responding. Writes via the HTTP endpoint time out, SELECT queries can't be run, and in the influx CLI any commands that perform reads (show measurements, show tag keys, etc.) hang. There are no log messages, no CPU usage, no memory exhaustion, etc. when this happens.

Stopping influxdb leads to a hard shutdown:

Mar 29 07:30:21 influx1 systemd: Stopping InfluxDB is an open-source, distributed, time series database...
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179652Z lvl=info msg="Signal received, initializing clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179766Z lvl=info msg="Waiting for clean shutdown..." log_id=0ESuf_v0000
Mar 29 07:30:21 influx1 influxd: ts=2019-03-29T07:30:21.179892Z lvl=info msg="Listener closed" log_id=0ESuf_v0000 service=snapshot
Mar 29 07:30:51 influx1 influxd: ts=2019-03-29T07:30:51.179939Z lvl=info msg="Time limit reached, initializing hard shutdown" log_id=0ESuf_v0000
Mar 29 07:30:51 influx1 systemd: Stopped InfluxDB is an open-source, distributed, time series database.

Note that this instance is in the process of backfilling with out-of-order data. I've now downgraded to 1.7.4 and so far it has not hung.
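
For reference, this is roughly how I check whether the HTTP endpoint is still responding (a minimal sketch; the database name mydb and the default port 8086 are placeholders for my setup):

# Write a single point over the HTTP API; on 1.7.5 this times out once the hang starts
curl -i -XPOST 'http://localhost:8086/write?db=mydb' --data-binary 'cpu,host=server01 value=0.64'
# A read through the same HTTP API; this also hangs instead of returning
curl -G 'http://localhost:8086/query' --data-urlencode 'db=mydb' --data-urlencode 'q=SHOW MEASUREMENTS'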

@aaronjwood

I second this. When I upgraded Influx from 1.7.4-alpine to 1.7.5-alpine, all of my writes returned a 500. Additionally, I saw a bunch of these log messages while bringing things up and down a few times:

Mar 29 01:00:12 myhost docker[30915]: ts=2019-03-29T08:00:12.110995Z lvl=info msg="Write failed" log_id=0EU5vEvG000 service=write shard=748 error="store is closed"

I tried backing up my current Influx dir and starting fresh like I had never spun up Influx before. Even with that things broke in the same way. Downgrading to 1.7.4-alpine brings things back to a working state.

@conet

conet commented Mar 29, 2019

I had the same experience: a couple of minutes after upgrading from 1.7.4, queries started hanging and data stopped being written; basically the entire service seemed to be blocked. Even internal stats stopped being written. I rolled back to 1.7.4. This is the second upgrade since 1.7.0 that has failed me 😞 .

@jmurrayufo

Ran into the same issue here. I spin up an InfluxDB instance, set up some CQs and RPs on the database, and everything seems to work fine until a new host writes data in (the tags contain host names). At that point everything hangs until I restart the service. Every new host hangs the system.
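
For context, the RP/CQ setup is nothing exotic; it is roughly along these lines (a sketch only, the database, policy, and query names below are placeholders, not the real ones):

# Placeholder retention policy and continuous query, created through the influx CLI
influx -execute 'CREATE RETENTION POLICY "one_week" ON "metrics" DURATION 7d REPLICATION 1 DEFAULT'
influx -execute 'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN SELECT mean("value") AS "value" INTO "metrics"."one_week"."cpu_5m" FROM "cpu" GROUP BY time(5m), * END'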

@e-dard
Contributor

e-dard commented Apr 1, 2019

@RedShift1 @conet @jmurrayufo @aaronjwood could you please provide details of the index you're running, and if possible a stack dump of goroutines? You can SIGQUIT the process once it deadlocks.
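
If you're not sure which index you're on, a quick way to check is below (a sketch, assuming the default config location of a package install; index-version lives under the [data] section and defaults to inmem, with tsi1 as the alternative):

# Show the configured index from the config file
grep index-version /etc/influxdb/influxdb.conf
# Or print the effective configuration of the installed binary
influxd config | grep index-version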

@RedShift1
Author

Running the default index (I did not change the default configuration file). I will try to capture a stack dump of the goroutines.

@e-dard
Contributor

e-dard commented Apr 2, 2019

Update from our side.

We think we know which change introduced this deadlock. The best course of action if you encounter it is to roll back to 1.7.4.

However, we would really appreciate seeing a stack trace from someone who is deadlocked on this issue. You can send the process a SIGQUIT via kill -s SIGQUIT <process_id>, or, if influxd is running in the foreground, with Ctrl-\.
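
For example, something along these lines should work on a typical systemd install (a sketch; the unit name influxdb and journald capturing stderr are assumptions about your setup):

# Ask the Go runtime to dump all goroutine stacks (note: this also terminates the process)
kill -s SIGQUIT $(pgrep influxd)
# The dump is written to stderr, so on a systemd setup it should end up in the journal
journalctl -u influxdb --since "10 minutes ago" > influxdb-goroutines.log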

@conet

conet commented Apr 3, 2019

Here it is: influxdb-stacktace.log

@benbjohnson
Contributor

@conet Thank you for the stack trace. Are you seeing any panics in your log before the SIGQUIT?

@conet

conet commented Apr 3, 2019

@benbjohnson I can't find any occurrences of the word panic in the logs preceding the deadlock (or anywhere in the logs). I can approximate the moment of the deadlock based on the fact that write requests start to fail with a status of 500.

@karel-rehor
Contributor

I've come across what appears to be the same issue.

influx release: 1.7.5 data and meta

ts=2019-04-03T14:54:05.632738Z lvl=info msg="InfluxDB Meta starting" log_id=0E_vw1G0000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8 tags=unknown
...
ts=2019-04-03T14:54:07.875194Z lvl=info msg="InfluxDB starting" log_id=0E_vwA0l000 version=1.7.5-c1.7.5 branch=1.7 commit=dae1326b1dde5c36d1dfdb754787acbe8b7447f8

telegraf release: checked with both 1.9.4 and 1.10.2

Use case

I'm updating the installation script to use 1.7.5. It creates a test cluster for InfluxDB Enterprise using Docker instances and adds the rest of the TICK stack. Current script found here -

Local version attached.
setup_test_environment_enterprise_version.sh.txt

After seeing httpd respond with 204 to Telegraf writes, I run the first basic Selenium test, connection_wizard.

Roughly every second run, at the end of the connection_wizard test, httpd starts responding with 500 to Telegraf writes.

Other Chronograf tests using the data explorer or dashboards cannot be executed because queries to InfluxDB fail.

Console logs (taken from docker) attached.
console-tele-1_9_4.log
console-tele-1_10_2.log

Screenshots
influxdb-telegraf-no-response

@johnbatty

I hit this when spinning up a new InfluxDB instance on a new machine using Docker: docker run ... influxdb. I managed to resolve it by explicitly specifying the version tag influxdb:1.7.4. However, anyone running the InfluxDB Docker container with the tag latest (or no tag), 1.7, or 1.7-alpine is liable to run into this. You might want to consider changing these tags to refer to 1.7.4 rather than 1.7.5 until the issue is fixed.
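
In the meantime, pinning the tag when starting the container avoids silently picking up 1.7.5 (a sketch; the port and volume are just whatever my setup uses):

# Pull and run a pinned version instead of latest/1.7, which currently resolve to 1.7.5
docker pull influxdb:1.7.4
docker run -d --name influxdb -p 8086:8086 -v influxdb_data:/var/lib/influxdb influxdb:1.7.4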

@benbjohnson
Contributor

@conet We added what we believe is a fix (#13150). Would you be able to build a369cf4 and verify that it fixes the issue?

@conet

conet commented Apr 4, 2019

@benbjohnson if you give me an RPM I can test it; building from source... is too much to ask.

@benbjohnson
Contributor

@conet No problem. I built an RPM for that commit and posted it here:

https://s3-us-west-1.amazonaws.com/support-sftp/home/influxdb13010/influxdb-v1.7.6_a369cf4.x86_64.rpm

@conet

conet commented Apr 4, 2019

@benbjohnson I'd say the fix is good. It has been running for an hour now without any issues, whereas previously it would hang within the first couple of minutes. Nevertheless, I'll leave it running and follow up with another report tomorrow.

@benbjohnson
Contributor

@conet Great, thanks for taking a look so quickly. I appreciate it.

@conet

conet commented Apr 5, 2019

The fix is still good after more than 10 hours; please go ahead and release it.

@benbjohnson
Contributor

Thanks @conet!

@conet

conet commented Apr 9, 2019

When is 1.7.6 going to be released? 1.7.5 is still advertised as the latest stable release, and issues about 1.7.5 keep piling up; some examples: #13256 and this comment.

@conet

conet commented Apr 9, 2019

At least remove 1.7.5 so that others will not be affected by this.

@bndw

bndw commented Apr 10, 2019

We just burned a lot of time debugging this in production. Please update your Docker Hub and GitHub releases with at least a disclaimer.

@timhallinflux
Contributor

We did land a notification of this in the release notes here: https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/

Happy to take feedback on where else you would like to find this notification if the docs location is not somewhere you visit.

@DerMika

DerMika commented Apr 16, 2019

We are experiencing the error ERR: max-concurrent-queries limit exceeded(20, 20) in a prototype situation where it is highly unlikely that 20 concurrent queries are actually running.

We have also configured INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s.

There is nothing showing in the logs.

Could this also be caused by this bug?
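
For reference, these are the coordinator settings as we pass them to the official Docker image (a sketch; the max-concurrent-queries variable name is my assumption, following the same naming pattern as INFLUXDB_COORDINATOR_QUERY_TIMEOUT above):

# Coordinator limits passed as environment variables to the official image
docker run -d \
  -e INFLUXDB_COORDINATOR_QUERY_TIMEOUT=5s \
  -e INFLUXDB_COORDINATOR_MAX_CONCURRENT_QUERIES=20 \
  influxdb:1.7.5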

@benbjohnson
Contributor

@DerMika Yes, it is likely that the lock issue is causing queries to block and back up.

@DerMika

DerMika commented Apr 16, 2019

Thanks, I don't seem to have the problem after downgrading to 1.7.4.

@bndw

bndw commented Apr 16, 2019

Please update your Docker Hub and GitHub releases with at least a disclaimer.

@timhallinflux

@timhallinflux
Contributor

timhallinflux commented Apr 17, 2019

Thanks @bndw. 1.7.6 is being published now.
https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/

@bndw

bndw commented Apr 17, 2019

@timhallinflux is there an ETA for this landing in Dockerhub?

@timhallinflux
Contributor

Should be there now. There is usually a <24-hour delay between when we create the build and when releases appear in Docker Hub.

@dandv
Contributor

dandv commented Jul 12, 2019

I see random "Internal server errors" while ingesting data with 1.7.7. If I retry writing points, the operation succeeds. /var/log/influxdb/ is empty.
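
For now I work around it with a blunt retry on the write call, along these lines (a sketch; the endpoint, database, and point are placeholders):

# Retry a write a few times when the server does not return 204 (success for writes)
for i in 1 2 3; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -XPOST 'http://localhost:8086/write?db=mydb' \
    --data-binary 'cpu,host=server01 value=0.64')
  [ "$code" = "204" ] && break
  sleep 1
done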
