Blocked until restarted 'influxd.exe' #13425

Closed
Tancen opened this issue Apr 16, 2019 · 5 comments

Tancen commented Apr 16, 2019

OS: Windows 7 x64
InfluxDB version: 1.7.5-1

Console 1:

D:\influxdb\influxdb-1.7.5-1>influx -username root -password 123456
Connected to http://localhost:8086 version 1.7.5
InfluxDB shell version: 1.7.5
Enter an InfluxQL query

use nkdata_3
Using database nkdata_3
INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}

INSERT DR_E_RAW_HOUR_20190415_1,STRUUID={cf31e80b-971f-4742-b2bf-1c399ce012ae},MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 LOAD_TIME=2019-04-15 18:48:45.007,POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"unable to parse 'DR_E_RAW_HOUR_20190415_1,STRUUID={cf31e80b-971f-4742-b2bf-1c399ce012ae},MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 LOAD_TIME=2019-04-15 18:48:45.007,POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000': invalid number"}

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"timeout"}
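
A note on the "invalid number" error in the Console 1 output: the most likely cause is that LOAD_TIME=2019-04-15 18:48:45.007 is written as a bare value, and line protocol expects string field values to be double-quoted. This is separate from the timeouts. The sketch below shows one way to build such a line in Python; the helper names (escape_key, format_field, to_line) are hypothetical and not part of any InfluxDB client library, and the measurement name is assumed to need no escaping.

```python
def escape_key(text: str) -> str:
    """Backslash-escape commas, equals signs and spaces in tag/field keys and tag values."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value) -> str:
    """Render a field value using line protocol's type syntax."""
    if isinstance(value, bool):          # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"               # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    # Anything else is written as a double-quoted string field.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict) -> str:
    """Build one line: measurement, comma-joined tags, a space, comma-joined fields."""
    tag_part = "".join(f",{escape_key(k)}={escape_key(v)}" for k, v in tags.items())
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{measurement}{tag_part} {field_part}"


line = to_line(
    "DR_E_RAW_HOUR_20190415_1",
    {"MP_ID": "326733358851", "DATA_FLAG": "1", "UPLOADSTATUS": "0"},
    {"LOAD_TIME": "2019-04-15 18:48:45.007", "POS_P_E_TOTAL": 0.0},
)
print(line)
```

With the value quoted, LOAD_TIME is stored as a string field; whether that or a point timestamp is the better fit for the data is a separate design choice.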

Console 2:

D:\influxdb\influxdb-1.7.5-1>influx -username root -password 123456
Connected to http://localhost:8086 version 1.7.5
InfluxDB shell version: 1.7.5
Enter an InfluxQL query

use nkdata_3
Using database nkdata_3
show
ERR: error parsing query: found EOF, expected CONTINUOUS, DATABASES, DIAGNOSTICS, FIELD, GRANTS, MEASUREMENT, MEASUREMENTS, QUERIES, RETENTION, SERIES, SHARD, SHARDS, STATS, SUBSCRIPTIONS, TAG, USERS at line 1, char 6
show MEASUREMENTS

After restarting 'influxd.exe':
Console 1:

INSERT DR_E_RAW_HOUR_20190415_1,MP_ID=326733358851,DATA_FLAG=1,UPLOADSTATUS=0 POS_P_E_TOTAL=0.000000,REV_P_E_TOTAL=0.000000,GROUP_Q_E_1=0.000000,GROUP_Q_E_2=0.000000
ERR: {"error":"partial write: field type conflict: input field "POS_P_E_TOTAL" on measurement "DR_E_RAW_HOUR_20190415_1" is type float, already exists as type string dropped=1"}

Console 2:

show MEASUREMENTS
name: measurements
name
----
DR_E_RAW_HOUR_20190415_1
load
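
The "field type conflict" error after the restart indicates that POS_P_E_TOTAL already exists as a string field in that measurement, so float values for it are rejected. A client could catch this before writing by comparing outgoing values against the known field types. The following is only a sketch: the known_types dictionary is assumed to have been filled in beforehand (for example from the output of SHOW FIELD KEYS), and influx_type and find_conflicts are hypothetical helper names.

```python
def influx_type(value) -> str:
    """Map a Python value to the InfluxDB field type it would be written as."""
    if isinstance(value, bool):          # bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"


def find_conflicts(known_types: dict, fields: dict) -> list:
    """Return (field, existing_type, new_type) for every field whose type differs."""
    conflicts = []
    for key, value in fields.items():
        new_type = influx_type(value)
        existing = known_types.get(key)  # None means the field does not exist yet
        if existing is not None and existing != new_type:
            conflicts.append((key, existing, new_type))
    return conflicts


# Assumed schema: POS_P_E_TOTAL was first written as a string, as the error reports.
known_types = {"POS_P_E_TOTAL": "string"}
conflicts = find_conflicts(known_types, {"POS_P_E_TOTAL": 0.0, "REV_P_E_TOTAL": 0.0})
print(conflicts)
```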

ghost commented Apr 16, 2019

I ran into the same problem: after certain sensitive operations, a database becomes unavailable. It has happened to me twice. The first time, I used the influx_inspect tool to export the data of one database while InfluxDB was running; after that, writes and queries against that database hung as if deadlocked, while other databases were unaffected. The second time was during a write test in which I tried to change a field's value type; the command returned a timeout error, and the database was unavailable from then on.

However, this does not happen every time, and my many attempts to reproduce it failed. My guess is that it is related to a synchronization problem with modifications made while TSM or TSI files are being merged.

Zanthras commented Apr 17, 2019

edit: probably same as #13010

I believe I am hitting the same issue. I set up a new InfluxDB server on 1.7.5 and a single Telegraf agent to write some metrics. Some time later the server stopped responding to any query. The InfluxDB HTTP access logs show the exact point where it started failing:

x.x.x.x - - [16/Apr/2019:14:21:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" c4796750-607c-11e9-89c7-00505681237f 4721
x.x.x.x - - [16/Apr/2019:14:21:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ca6f4669-607c-11e9-89c8-00505681237f 12943
x.x.x.x - - [16/Apr/2019:14:21:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" d0655e02-607c-11e9-89c9-00505681237f 4292
x.x.x.x - - [16/Apr/2019:14:21:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" d65b03e4-607c-11e9-89ca-00505681237f 4118
x.x.x.x - - [16/Apr/2019:14:21:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" dc50ed11-607c-11e9-89cb-00505681237f 3962
x.x.x.x - - [16/Apr/2019:14:21:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" e246ca03-607c-11e9-89cc-00505681237f 7066
x.x.x.x - - [16/Apr/2019:14:22:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" e83cb16e-607c-11e9-89cd-00505681237f 3873
x.x.x.x - - [16/Apr/2019:14:22:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ee328f21-607c-11e9-89ce-00505681237f 5843
x.x.x.x - - [16/Apr/2019:14:22:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" f42868bc-607c-11e9-89cf-00505681237f 10001554
x.x.x.x - - [16/Apr/2019:14:22:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" fa1f3cba-607c-11e9-89d0-00505681237f 10002427
x.x.x.x - - [16/Apr/2019:14:22:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 001541c3-607d-11e9-89d1-00505681237f 10001571
x.x.x.x - - [16/Apr/2019:14:23:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 0bfff675-607d-11e9-89d2-00505681237f 10007079
x.x.x.x - - [16/Apr/2019:14:23:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 120a45ba-607d-11e9-89d3-00505681237f 10009110
x.x.x.x - - [16/Apr/2019:14:23:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 17ec5657-607d-11e9-89d4-00505681237f 10010260
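
To locate the point where writes start failing in a longer log, lines like the ones above can be scanned programmatically. The sketch below assumes the layout visible in these lines and, as a further assumption, treats the last column as the request duration in microseconds (which would put the 500 responses at roughly 10 seconds each). The names LINE_RE, parse_line and first_failure are hypothetical.

```python
import re

# Layout assumed from the log lines above:
# host ident user [time] "request" status size "referer" "agent" request-id duration
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<request_id>\S+) (?P<duration_us>\d+)$'
)


def parse_line(line: str):
    """Return a dict of fields for one access log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["duration_us"] = int(entry["duration_us"])
    return entry


def first_failure(lines):
    """Return the first parsed entry with a 5xx status, or None if there is none."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["status"] >= 500:
            return entry
    return None


sample = [
    'x.x.x.x - - [16/Apr/2019:14:22:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 204 0 "-" "Telegraf/1.10.2" ee328f21-607c-11e9-89ce-00505681237f 5843',
    'x.x.x.x - - [16/Apr/2019:14:22:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" f42868bc-607c-11e9-89cf-00505681237f 10001554',
]
failure = first_failure(sample)
print(failure["time"], failure["status"], failure["duration_us"] / 1_000_000)
```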

Restarting InfluxDB fixed the problem for a while, but it kept coming back. To see whether Telegraf was the cause, I downgraded Telegraf (from 1.10.2 to 1.9.5, as the user agent in the log below shows) without restarting InfluxDB. That changed nothing, so I am fairly sure the problem is not related to what Telegraf was sending. I also captured packets of the requests coming from Telegraf, and they looked completely normal.

x.x.x.x - - [16/Apr/2019:15:15:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 6781bc77-6084-11e9-80e6-00505681237f 10009953
x.x.x.x - - [16/Apr/2019:15:15:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 6d779d9a-6084-11e9-80e7-00505681237f 10009764
x.x.x.x - - [16/Apr/2019:15:16:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 736d9c53-6084-11e9-80e8-00505681237f 10011813
x.x.x.x - - [16/Apr/2019:15:16:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 796372cf-6084-11e9-80e9-00505681237f 10011408
x.x.x.x - - [16/Apr/2019:15:16:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 854f00d8-6084-11e9-80ea-00505681237f 10010158
x.x.x.x - - [16/Apr/2019:15:16:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 913ac67d-6084-11e9-80eb-00505681237f 10011445
x.x.x.x - - [16/Apr/2019:15:16:57 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.10.2" 95ec286c-6084-11e9-80ec-00505681237f 10005155
x.x.x.x - - [16/Apr/2019:15:17:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" a31cbc47-6084-11e9-80ed-00505681237f 10001548
x.x.x.x - - [16/Apr/2019:15:17:30 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" a9127342-6084-11e9-80ee-00505681237f 10002787
x.x.x.x - - [16/Apr/2019:15:17:40 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" af08712c-6084-11e9-80ef-00505681237f 10003098
x.x.x.x - - [16/Apr/2019:15:17:50 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" b4fe92f1-6084-11e9-80f0-00505681237f 10001541
x.x.x.x - - [16/Apr/2019:15:18:00 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" baf40a66-6084-11e9-80f1-00505681237f 10013277
x.x.x.x - - [16/Apr/2019:15:18:10 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" c0e9ee78-6084-11e9-80f3-00505681237f 10008575
x.x.x.x - - [16/Apr/2019:15:18:20 -0500] "POST /write?consistency=any&db=network&rp=HighResolution HTTP/1.1" 500 20 "-" "Telegraf/1.9.5" c6dfec5a-6084-11e9-80f4-00505681237f 10010578

The non-access logs show only repeated timeouts:
Apr 16 15:18:20 influxserver influxd[29406]: ts=2019-04-16T20:18:20.010291Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:30 influxserver influxd[29406]: ts=2019-04-16T20:18:30.013048Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:40 influxserver influxd[29406]: ts=2019-04-16T20:18:40.014316Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:18:50 influxserver influxd[29406]: ts=2019-04-16T20:18:50.012605Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:00 influxserver influxd[29406]: ts=2019-04-16T20:19:00.011822Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:05 influxserver influxd[29406]: ts=2019-04-16T20:19:05.009505Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:30 influxserver influxd[29406]: ts=2019-04-16T20:19:30.004496Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd
Apr 16 15:19:40 influxserver influxd[29406]: ts=2019-04-16T20:19:40.004559Z lvl=error msg="[500] - "timeout"" log_id=0Eqw_hMl000 service=httpd

I downgraded InfluxDB to 1.7.2 (deployed elsewhere here with the same configuration) and haven't seen the issue since.

Contributor

timhallinflux commented Apr 17, 2019

stale bot commented Jul 23, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jul 23, 2019
Tancen closed this as completed Jul 23, 2019