New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Abort the agent if a single shutdown step takes more than 60 seconds. #17060
Conversation
@Ferroin FYI (just because we set TimeoutStopSec=150 on systemd systems). |
On environment without root@hades:/home/thiago/Netdata/tests_netdata# ps aux| grep netdata
netdata 29244 0.5 0.4 477520 67648 ? SNl 15:18 0:08 netdata
netdata 29247 0.0 0.1 107888 17908 ? SNl 15:18 0:00 /usr/sbin/netdata --special-spawn-server
root 32366 0.0 0.0 4124 2364 pts/1 S+ 15:45 0:00 grep netdata
root@hades:/home/thiago/Netdata/tests_netdata# date
Mon Feb 26 15:45:05 UTC 2024 These are few logs I have on it: time=2024-02-26T15:34:18.984+00:00 comm=netdata source=daemon level=info tid=29520 thread=DAEMON_COMMAND msg="Shutting down command loop complete."
.....
time=2024-02-26T15:34:27.664+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Waiting 1 threads to finish..."
time=2024-02-26T15:34:27.765+00:00 comm=netdata source=daemon level=error tid=29241 thread=netdata msg="Main thread h2o takes too long to exit. Giving up..."
time=2024-02-26T15:34:27.765+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:hades' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:potinho' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:archlinux' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Sending a shutdown command"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Waiting for shutdown ACK"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29457 thread=METASYNC msg="Shutting down metadata thread"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29457 thread=METASYNC msg="Thread 29457: Cleaning prepared statement on 0x280cd18"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29457 thread=METASYNC msg="Thread 29457: Cleaning prepared statement on 0x280ba58"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Shutdown complete"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Closing context SQLite database"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Closing SQLite database"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="No statements pending to finalize" Please, take a look that between the first messages processes were stopping until I ran command date, we have almost 12 minutes. |
After a chat with @vkalintiris , we observed that the previous issue does not have relationship with this PR. We used [4:58](https://netdata-cloud.slack.com/archives/D01JE74U54J/p1708966735076449)
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7f901e556940 (LWP 18019) "netdata" 0x00007f901d91991e in epoll_pwait () from /lib64/libc.so.6
2 Thread 0x7f901e556940 (LWP 18031) "iou-sqp-18019" 0x0000000000000000 in ?? ()
3 Thread 0x7f9018fff6c0 (LWP 18032) "netdata" 0x00007f901d891566 in __futex_abstimed_wait_cancelable64 ()
from /lib64/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f9018fff6c0 (LWP 18032))]
#0 0x00007f901d891566 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f901d891566 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
#1 0x00007f901d893d2e in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2 0x00007f901e624689 in uv_cond_wait () from /usr/lib64/libuv.so.1
#3 0x0000000000704d8f in wait_children (arg=<optimized out>)
at /home/thiago/Netdata/tests_netdata/src/spawn/spawn_server.c:123
#4 0x00007f901d894d57 in start_thread () from /lib64/libc.so.6
#5 0x00007f901d9196a8 in clone3 () from /lib64/libc.so.6 This PR helped to explicit it. When I used current master, it does not appear so quickly. It was also observed that we have a random issue. It is not appearing every time I run. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After to debug, and to confirm that the issue #17060 (comment) does not have relationship with this PR, LGTM!
@thiagoftsm can you approve again this one? I added d5390ab which will use timed wait only for builds that have sentry enabled. Otherwise, we'll follow the same behavior as the one we currently have on master. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything is working after last commit, LGTM!
…netdata#17060) * Add timed-wait for completion. * Abort if any shutdown step takes more than 60 seconds to complete. * Timeout only on sentry builds.
Summary
SSIA.
Test Plan
sleep(x > 60)
in between the shutdown steps.