Influxdb 1.7.5 stops responding while ingesting data, 1.7.4 does not #13010
I second this. When I upgraded Influx from 1.7.4-alpine to 1.7.5-alpine all of my writes gave a 500. Additionally, I saw a bunch of these logs when I was bringing things up and down a few times:
I tried backing up my current Influx dir and starting fresh, as if I had never spun up Influx before. Even then, things broke in the same way. Downgrading to 1.7.4-alpine brings things back to a working state.
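A failing write like the one described can be checked directly against the InfluxDB 1.x HTTP write endpoint. This is a sketch; the database name `mydb`, the `cpu` measurement, and the host tag are illustrative placeholders, not values from this thread.

```shell
# Probe the InfluxDB 1.x write endpoint and report the HTTP status code.
# "mydb", the "cpu" measurement, and the host tag are placeholders.
write_probe() {
  curl -s -o /dev/null -w '%{http_code}' \
    -XPOST 'http://localhost:8086/write?db=mydb' \
    --data-binary "cpu,host=server01 value=0.5 $(date +%s%N)"
}
# A healthy instance answers 204 No Content; the broken state described
# above answers 500 (or the request never returns at all).
```

Running `write_probe` before and after an upgrade gives a quick, repeatable signal of whether writes are being accepted.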
I had the same experience a couple of minutes after upgrading.
Ran into the same issue here. Spin up an Influx DB, set up some CQs and RPs on the database, and everything seems to work fine until a new host writes data in (the tags contain host names). At that point everything hangs until I restart the service. Every new host hangs the system.
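For context, a setup along the lines described (a retention policy plus a continuous query) can be created as below. Every name here (`metrics`, `one_week`, `cq_cpu_5m`, `cpu`) is a hypothetical placeholder, and the helper falls back to printing the statement when the influx 1.x CLI is not available.

```shell
# Execute an InfluxQL statement if the influx 1.x CLI is available;
# otherwise just print it so the sketch still reads as documentation.
run_influxql() {
  if command -v influx >/dev/null 2>&1; then
    influx -execute "$1" || echo "influx CLI failed (server down?): $1"
  else
    echo "influx CLI not found, statement: $1"
  fi
}

run_influxql 'CREATE DATABASE "metrics"'
run_influxql 'CREATE RETENTION POLICY "one_week" ON "metrics" DURATION 7d REPLICATION 1'
run_influxql 'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN SELECT mean("value") AS "value" INTO "metrics"."one_week"."cpu_5m" FROM "cpu" GROUP BY time(5m), * END'
```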
@RedShift1 @conet @jmurrayufo @aaronjwood could you please provide details of the index you're running and, if possible, a stack dump of goroutines?
Running the default index (I did not change the default configuration file); I will try to make a stack dump of goroutines.
Update from our side. We think we know which change introduced this deadlock. The best course of action if you encounter it is to roll back. However, we would really appreciate a stack trace from someone who is deadlocked on this issue; you can obtain one by sending the process a signal.
Here it is: influxdb-stacktace.log
@conet Thank you for the stack trace. Are you seeing any panics in your log before the hang?
@benbjohnson I can't find any occurrences of the word "panic".
I've come across what appears to be the same issue.
influx release: 1.7.5 (data and meta)
telegraf release: checked with both 1.9.4 and 1.10.2
Use case: I'm updating the installation script to use 1.7.5. It creates a test cluster for InfluxDB Enterprise using Docker instances and adds the rest of the TICK stack. Current script found here - local version attached. After seeing httpd respond with 204 to telegraf, I run the first basic Selenium test, connection_wizard. Roughly every second run, at the end of the connection_wizard test, httpd starts responding with 500 on telegraf writes. Other Chronograf tests using the data explorer or dashboards cannot be executed because queries to influxdb fail. Console logs (taken from docker) attached.
I hit this when spinning up a new InfluxDB instance on a new machine using docker:
@benbjohnson if you give me an RPM I can test that; building from sources is too much to ask.
@conet No problem. I built an RPM for that commit and posted it here:
@benbjohnson I'd say the fix is good. It has been running for an hour now without any issues, whereas previously it would hang in the first couple of minutes. Nevertheless, I'll leave it running and report again tomorrow.
@conet Great, thanks for taking a look so quickly. I appreciate it. |
The fix is still good after more than 10 hours, please go ahead and release it.
Thanks @conet!
When is the fixed release coming?
At least remove the broken 1.7.5 release in the meantime.
We just burned a lot of time debugging this in production. Please update your Dockerhub and Github releases with at least a disclaimer. |
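Until a fixed image lands, one mitigation is to pin deployments to the last known-good tag reported in this thread rather than a floating tag. A sketch, guarded so it is a no-op without Docker:

```shell
# Pin to the tag reporters here rolled back to successfully, so a broken
# release is not picked up silently by a floating tag like "latest".
IMAGE="influxdb:1.7.4-alpine"
if command -v docker >/dev/null 2>&1; then
  docker pull "$IMAGE" && docker run -d --name influxdb -p 8086:8086 "$IMAGE" \
    || echo "docker is present but pull/run failed; run these commands manually"
fi
```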
We did land a notification of this in the release notes here: https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/ Happy to take feedback as to where "else" you would like to find this notification if the doc location is not somewhere you are visiting. |
We are experiencing an error, and there is nothing showing in the logs. Could this also be caused by this bug?
@DerMika Yes, it is likely that the lock issue is causing queries to block and back up. |
Thanks, I don't seem to have the problem after downgrading to 1.7.4. |
Thanks @bndw 1.7.6 is being published now. |
@timhallinflux is there an ETA for this landing in Dockerhub? |
It should be there now. There is usually a delay of up to 24 hours between when we create the build and when releases appear on Docker Hub.
I see random "Internal server errors" while ingesting data with 1.7.7. If I retry writing points, the operation succeeds. |
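The retry behavior described can be sketched as a small wrapper around the write call. The endpoint, retry count, and backoff values below are assumptions, not values from this thread.

```shell
# Retry a line-protocol write when the server returns something other
# than 204. URL and data are supplied by the caller; placeholders only.
write_with_retry() {
  url=$1; data=$2
  attempt=1
  while [ "$attempt" -le 5 ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' -XPOST "$url" --data-binary "$data")
    [ "$code" = "204" ] && return 0   # write accepted
    sleep "$attempt"                  # simple linear backoff
    attempt=$((attempt + 1))
  done
  return 1                            # still failing after 5 attempts
}
```

Usage would look like `write_with_retry 'http://localhost:8086/write?db=mydb' 'cpu value=1'`; a success returns 0, persistent 500s return 1.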
While running InfluxDB 1.7.5, some time after start, while ingesting data, the influxd daemon stops responding. Writes via the HTTP endpoint time out, no SELECT queries can be run, and through the influx cli any commands that perform reads (show measurements, show tag keys, etc.) hang. There are no log messages, no CPU usage, no memory exhaustion, etc. when this happens.
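A hang like this (reads block forever with no log output) can be detected with a bounded probe; the database name and timeout below are assumptions:

```shell
# Run a read that normally returns instantly, but bound it so a
# deadlocked server cannot hang the probe itself (timeout exits 124).
if ! timeout 10 influx -database mydb -execute 'SHOW MEASUREMENTS' >/dev/null 2>&1; then
  echo "reads are hanging or failing; capture a goroutine dump before restarting"
fi
```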
Stopping influxdb leads to a hard shutdown:
Note that this instance is in the process of backfilling with out-of-order data. I've now downgraded to 1.7.4 and so far it has not hung.