Missed deadline after restart #26

Open
tilsche opened this issue Jul 6, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@tilsche
Contributor

tilsche commented Jul 6, 2021

Not sure what's happening here, but about 10 minutes after a restart, the service logged a burst of missed deadlines and then segfaulted.

Jul 06 03:55:22 igel metricq-db-hta[6010]: [2021-07-06 03:55:22.397019183 CEST][ WARN]: [taurus.BC31.power] skipped 1 NaNs of 1 values
Jul 06 03:57:29 igel metricq-db-hta[6010]: [2021-07-06 03:57:29.046435000 CEST][metricq][ERROR]: [Data connection] write failed: stream truncated
Jul 06 03:57:29 igel metricq-db-hta[6010]: [2021-07-06 03:57:29.046518851 CEST][metricq][ERROR]: data channel error: stream truncated
Jul 06 03:58:13 igel metricq-db-hta[6010]: [2021-07-06 03:58:13.279092252 CEST][metricq][ INFO]: sink data queue consume finalize
Jul 06 03:58:13 igel metricq-db-hta[6010]: [2021-07-06 03:58:13.279154269 CEST][metricq][ INFO]: sink history queue consume finalize
Jul 06 03:58:13 igel metricq-db-hta[6010]: [2021-07-06 03:58:13.314637191 CEST][ERROR]: Unhandled exception: ConnectionHandler::onError: write failed
Jul 06 03:58:13 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jul 06 03:58:13 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Unit entered failed state.
Jul 06 03:58:13 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Failed with result 'exit-code'.
Jul 06 03:58:23 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Service hold-off time over, scheduling restart.
Jul 06 03:58:23 igel systemd[1]: Stopped MetricQ HTA DB for LZR PDU metrics.
Jul 06 03:58:23 igel systemd[1]: Started MetricQ HTA DB for LZR PDU metrics.
Jul 06 03:58:23 igel metricq-db-hta[29979]: [2021-07-06 03:58:23.541688204 CEST][metricq][ INFO]: connecting to management server: amqps://***:***@rabbitmq.metricq.zih.tu-dresden.de/
Jul 06 03:58:23 igel metricq-db-hta[29979]: [2021-07-06 03:58:23.960014611 CEST][ INFO]: Couldn't parse logging section of the config: [json.exception.out_of_range.403] key 'logging' not found
Jul 06 03:58:23 igel metricq-db-hta[29979]: [2021-07-06 03:58:23.975337388 CEST][ INFO]: setting up HTA::Directory
Jul 06 03:58:24 igel metricq-db-hta[29979]: [2021-07-06 03:58:24.036267692 CEST][metricq][ INFO]: setting up data queue, messages 260543, consumers 0
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.282117242 CEST][ WARN]: [LZR.E98.1806B.B84.L3] skipped 1 non-monotonic of 1 values
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287115435 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:25+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287188862 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:26+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287211809 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:27+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287222528 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:28+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287232664 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:29+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.287242326 CEST][metricq][ WARN]: Missed deadline 2021-07-06T03:58:30+0200, it is now 2021-07-06T04:07:45+0200

[...]

Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.292719771 CEST][metricq][ WARN]: Missed deadline 2021-07-06T04:07:42+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.292729085 CEST][metricq][ WARN]: Missed deadline 2021-07-06T04:07:43+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:45 igel metricq-db-hta[29979]: [2021-07-06 04:07:45.292738762 CEST][metricq][ WARN]: Missed deadline 2021-07-06T04:07:44+0200, it is now 2021-07-06T04:07:45+0200
Jul 06 04:07:49 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Main process exited, code=dumped, status=11/SEGV
Jul 06 04:07:49 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Unit entered failed state.
Jul 06 04:07:49 igel systemd[1]: metricq-db-hta-lzr-pdu.service: Failed with result 'core-dump'.
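
To quantify how far behind the timer was, the "Missed deadline" warnings can be parsed and the lag between each deadline and the current time computed. The following is a minimal sketch, not part of metricq-db-hta; the function name `deadline_lags` and the regular expression are assumptions made for illustration, and the sample uses only the message part of the first and last warning shown above.

```python
import re
from datetime import datetime

# Matches the "Missed deadline <deadline>, it is now <now>" part of a log line.
MISSED = re.compile(r"Missed deadline (\S+), it is now (\S+)")
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S%z"  # e.g. 2021-07-06T03:58:25+0200


def deadline_lags(lines):
    """Return the lag in seconds (now - deadline) for each matching line."""
    lags = []
    for line in lines:
        match = MISSED.search(line)
        if match is None:
            continue  # not a "Missed deadline" warning
        deadline = datetime.strptime(match.group(1), ISO_FORMAT)
        now = datetime.strptime(match.group(2), ISO_FORMAT)
        lags.append((now - deadline).total_seconds())
    return lags


# Message part of the first and last warning from the log above.
sample = [
    "[metricq][ WARN]: Missed deadline 2021-07-06T03:58:25+0200, it is now 2021-07-06T04:07:45+0200",
    "[metricq][ WARN]: Missed deadline 2021-07-06T04:07:44+0200, it is now 2021-07-06T04:07:45+0200",
]
lags = deadline_lags(sample)
print(lags)  # [560.0, 1.0]
```

The oldest deadline was 560 seconds overdue, which roughly matches the gap between "setting up data queue" at 03:58:24 and the first log output at 04:07:45.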
tilsche added the bug label Jul 6, 2021
@tilsche
Contributor Author

tilsche commented Jul 6, 2021

I suspect the file open phase takes too long, so the stats timers miss their deadlines, and then something else goes wrong that causes the segfault.

Core dumps for debugging are available on igel.

TIME                            PID   UID   GID SIG COREFILE EXE
Tue 2021-07-06 04:07:45 CEST  29979     0     0  11 present  /home/service/metricq-db-hta/build/metricq-db-hta
Tue 2021-07-06 08:09:05 CEST  46469     0     0  11 present  /home/service/metricq-db-hta/build/metricq-db-hta
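
The suspected mechanism (a long blocking phase followed by a burst of missed timer deadlines) can be illustrated with a small simulation. This is a hypothetical model, not MetricQ's actual timer code: it assumes a fixed-rate timer that advances its deadline by one interval at a time without skipping ahead, and that the elided log lines continue the same one-second pattern. The function `missed_deadlines` and its parameters are made up for this sketch.

```python
def missed_deadlines(first_deadline, interval, now):
    """Deadlines of a fixed-rate timer that have already passed at `now`.

    Assumes the timer advances its deadline by `interval` without skipping
    ahead (hypothetical model, not MetricQ's actual code).
    """
    missed = []
    deadline = first_deadline
    while deadline < now:
        missed.append(deadline)
        deadline += interval
    return missed


# Seconds relative to the restart at 03:58:23: the first deadline is
# 03:58:25 (+2 s), and the process logs again at 04:07:45 (+562 s).
missed = missed_deadlines(first_deadline=2, interval=1, now=562)
print(len(missed), missed[0], missed[-1])  # 560 2 561
```

Under these assumptions, one blocked phase of about nine minutes produces 560 warnings, the last one for 04:07:44 (+561 s), which is consistent with the log. A timer that re-arms relative to the current time would instead report a single late expiry, so the burst itself may be harmless and the segfault may have a separate cause.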
