Blocked until restarted 'influxd.exe' #13425

Tancen · 2019-04-16T02:17:51Z

OS: Windows 7 x64
InfluxDB version: 1.7.5-1

Console 1:

D:\influxdb\influxdb-1.7.5-1>influx -username root -password 123456
Connected to http://localhost:8086 version 1.7.5
InfluxDB shell version: 1.7.5
Enter an InfluxQL query

use nkdata_3
Using database nkdata_3
INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}

INSERT DR_E_RAW_HOUR_20190415_1,STRUUID={cf31e80b-971f-4742-b2bf-1c399ce012ae},MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 LOAD_TIME=2019-04-15 18:48:45.007,POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"unable to parse 'DR_E_RAW_HOUR_20190415_1,STRUUID={cf31e80b-971f-4742-b2bf-1c399ce012ae},MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 LOAD_TIME=2019-04-15 18:48:45.007,POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000': invalid number"}

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}

Console 2:

D:\influxdb\influxdb-1.7.5-1>influx -username root -password 123456
Connected to http://localhost:8086 version 1.7.5
InfluxDB shell version: 1.7.5
Enter an InfluxQL query

use nkdata_3
Using database nkdata_3
show
ERR: error parsing query: found EOF, expected CONTINUOUS, DATABASES, DIAGNOSTICS, FIELD, GRANTS, MEASUREMENT, MEASUREMENTS, QUERIES, RETENTION, SERIES, SHARD, SHARDS, STATS, SUBSCRIPTIONS, TAG, USERS at line 1, char 6
show MEASUREMENTS

After restarting 'influxd.exe':
Console 1:

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"partial write: field type conflict: input field "POS_P_E_TOTAL" on measurement "DR_E_RAW_HOUR_20190415_1" is type float, already exists as type string dropped=1"}

Console 2:

show MEASUREMENTS
name: measurements
name

DR_E_RAW_HOUR_20190415_1
load
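
A note on the two non-timeout errors in the transcript above. The "invalid number" error is consistent with LOAD_TIME being sent as an unquoted value that contains a space; in InfluxDB line protocol, string field values are enclosed in double quotes. The "field type conflict" after the restart suggests that POS_P_E_TOTAL had already been stored as a string at some point. Below is a minimal Python sketch of building a line with consistently typed fields. The helper names (`to_line_protocol`, `format_field_value`) are hypothetical and not part of any InfluxDB client, and the sketch does not address the hang itself.

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` that occurs in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field_value(value):
    """Render one field value using line-protocol type syntax."""
    if isinstance(value, bool):      # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"           # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)           # floats are written bare
    # Anything else is treated as a string: quote it and escape \ and "
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields):
    """Build one line: measurement,tag=value,... field=value,..."""
    head = _escape(measurement, ", ")
    for key, val in sorted(tags.items()):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(val), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field_value(val)
        for key, val in fields.items()
    )
    return f"{head} {body}"


line = to_line_protocol(
    "DR_E_RAW_HOUR_20190415_1",
    {"MP_ID": "326733358851", "DATA_FLAG": "1", "UPLOADSTATUS": "0"},
    {"LOAD_TIME": "2019-04-15 18:48:45.007", "POS_P_E_TOTAL": 0.0},
)
print(line)
```

Keeping each field's Python type stable (always `float` for POS_P_E_TOTAL, always `str` for LOAD_TIME) is what avoids the type conflict; a field that already exists as a string will still reject float writes.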

ghost · 2019-04-16T03:01:00Z

I ran into the same problem: after certain sensitive operations, some databases become unavailable. It has happened to me twice. The first time, I used the influx_inspect tool to export the data of one database while InfluxDB was running; afterwards, inserts and queries against that database hung as if deadlocked, while other databases were unaffected. The second time was during a write test in which I tried to change a value's type; the command returned a timeout error, and the database was unavailable from then on.

However, this does not happen every time, and many attempts to reproduce it were unsuccessful. My guess is that it is related to synchronizing modifications while tsm or tsi files are being merged.

Zanthras · 2019-04-17T00:58:42Z

Edit: probably the same as #13010.

I believe I am hitting the same issue as well. I set up a new InfluxDB server on 1.7.5 and a single Telegraf agent to send some metrics. Some time later it stopped responding to any query. The InfluxDB HTTP access logs show the exact point at which it started failing:

x.x.x.x - - [16/Apr/2019:14:21:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" c4796750-607c-11e9-89c7-00505681237f 4721
x.x.x.x - - [16/Apr/2019:14:21:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ca6f4669-607c-11e9-89c8-00505681237f 12943
x.x.x.x - - [16/Apr/2019:14:21:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" d0655e02-607c-11e9-89c9-00505681237f 4292
x.x.x.x - - [16/Apr/2019:14:21:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" d65b03e4-607c-11e9-89ca-00505681237f 4118
x.x.x.x - - [16/Apr/2019:14:21:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" dc50ed11-607c-11e9-89cb-00505681237f 3962
x.x.x.x - - [16/Apr/2019:14:21:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" e246ca03-607c-11e9-89cc-00505681237f 7066
x.x.x.x - - [16/Apr/2019:14:22:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" e83cb16e-607c-11e9-89cd-00505681237f 3873
x.x.x.x - - [16/Apr/2019:14:22:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ee328f21-607c-11e9-89ce-00505681237f 5843
x.x.x.x - - [16/Apr/2019:14:22:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" f42868bc-607c-11e9-89cf-00505681237f 10001554
x.x.x.x - - [16/Apr/2019:14:22:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" fa1f3cba-607c-11e9-89d0-00505681237f 10002427
x.x.x.x - - [16/Apr/2019:14:22:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 001541c3-607d-11e9-89d1-00505681237f 10001571
x.x.x.x - - [16/Apr/2019:14:23:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 0bfff675-607d-11e9-89d2-00505681237f 10007079
x.x.x.x - - [16/Apr/2019:14:23:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 120a45ba-607d-11e9-89d3-00505681237f 10009110
x.x.x.x - - [16/Apr/2019:14:23:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 17ec5657-607d-11e9-89d4-00505681237f 10010260
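
The transition point in the log above can also be located programmatically. The following is a rough Python sketch that parses access-log lines of this shape and reports the first server error. It assumes the trailing number on each line is the request duration in microseconds, which would make the failing requests last about 10 seconds each; the function names are illustrative only.

```python
import re

# Matches the tail of an access-log line as shown in this thread:
# [time] "METHOD uri proto" status size "referer" "agent" request-id duration
ACCESS_LOG_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<request_id>\S+) (?P<duration_us>\d+)$'
)


def parse_access_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    match = ACCESS_LOG_RE.search(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # Assumption: the last field is a duration in microseconds.
    entry["duration_s"] = int(entry.pop("duration_us")) / 1_000_000
    return entry


def first_server_error(lines):
    """Return the first parsed entry with a 5xx status, or None."""
    for line in lines:
        entry = parse_access_line(line)
        if entry is not None and entry["status"] >= 500:
            return entry
    return None


SAMPLE = [
    'x.x.x.x - - [16/Apr/2019:14:22:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ee328f21-607c-11e9-89ce-00505681237f 5843',
    'x.x.x.x - - [16/Apr/2019:14:22:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" f42868bc-607c-11e9-89cf-00505681237f 10001554',
]

failure = first_server_error(SAMPLE)
print(failure["time"], failure["status"], round(failure["duration_s"], 1))
```

Run against the full log, this would point at the 14:22:20 request as the first failure. A duration that sits just above 10 seconds on every failing request would be consistent with a 10-second write timeout on the server, though that is an inference about the configuration and not something the logs state.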

Restarting InfluxDB would fix the problem for a while, but it kept coming back. To see whether Telegraf was the cause, I downgraded Telegraf without restarting InfluxDB. That changed nothing, so I'm fairly sure the problem is not related to what Telegraf was sending. I also captured packets of the requests coming from Telegraf, and they looked completely normal.

x.x.x.x - - [16/Apr/2019:15:15:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 6781bc77-6084-11e9-80e6-00505681237f 10009953
x.x.x.x - - [16/Apr/2019:15:15:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 6d779d9a-6084-11e9-80e7-00505681237f 10009764
x.x.x.x - - [16/Apr/2019:15:16:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 736d9c53-6084-11e9-80e8-00505681237f 10011813
x.x.x.x - - [16/Apr/2019:15:16:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 796372cf-6084-11e9-80e9-00505681237f 10011408
x.x.x.x - - [16/Apr/2019:15:16:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 854f00d8-6084-11e9-80ea-00505681237f 10010158
x.x.x.x - - [16/Apr/2019:15:16:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 913ac67d-6084-11e9-80eb-00505681237f 10011445
x.x.x.x - - [16/Apr/2019:15:16:57 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 95ec286c-6084-11e9-80ec-00505681237f 10005155
x.x.x.x - - [16/Apr/2019:15:17:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" a31cbc47-6084-11e9-80ed-00505681237f 10001548
x.x.x.x - - [16/Apr/2019:15:17:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" a9127342-6084-11e9-80ee-00505681237f 10002787
x.x.x.x - - [16/Apr/2019:15:17:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" af08712c-6084-11e9-80ef-00505681237f 10003098
x.x.x.x - - [16/Apr/2019:15:17:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" b4fe92f1-6084-11e9-80f0-00505681237f 10001541
x.x.x.x - - [16/Apr/2019:15:18:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" baf40a66-6084-11e9-80f1-00505681237f 10013277
x.x.x.x - - [16/Apr/2019:15:18:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" c0e9ee78-6084-11e9-80f3-00505681237f 10008575
x.x.x.x - - [16/Apr/2019:15:18:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" c6dfec5a-6084-11e9-80f4-00505681237f 10010578

The non-access logs just show repeated timeouts:
Apr 16 15:18:20 influxserver influxd[29406]: ts=2019-04-16T20:18:20.010291Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:30 influxserver influxd[29406]: ts=2019-04-16T20:18:30.013048Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:40 influxserver influxd[29406]: ts=2019-04-16T20:18:40.014316Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:50 influxserver influxd[29406]: ts=2019-04-16T20:18:50.012605Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:00 influxserver influxd[29406]: ts=2019-04-16T20:19:00.011822Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:05 influxserver influxd[29406]: ts=2019-04-16T20:19:05.009505Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:30 influxserver influxd[29406]: ts=2019-04-16T20:19:30.004496Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:40 influxserver influxd[29406]: ts=2019-04-16T20:19:40.004559Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd

I downgraded InfluxDB to 1.7.2 (deployed elsewhere here with the same configuration) and haven't seen the issue again.

timhallinflux · 2019-04-17T01:25:39Z

1.7.6 should be available now.
https://docs.influxdata.com/influxdb/v1.7/about_the_project/releasenotes-changelog/

stale · 2019-07-23T04:32:07Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 23, 2019

Tancen closed this as completed Jul 23, 2019
