Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abort the agent if a single shutdown step takes more than 60 seconds. #17060

Merged
merged 3 commits into from Feb 27, 2024

Conversation

vkalintiris
Copy link
Contributor

Summary

SSIA.

Test Plan
  • CI jobs
  • Start/stop the agent with/without an explicit sleep(x > 60) in between the shutdown steps.

@vkalintiris
Copy link
Contributor Author

@Ferroin FYI (just because we set TimeoutStopSec=150 on systemd systems).

@thiagoftsm
Copy link
Contributor

On environment without systemd, netdata is not stopping:

root@hades:/home/thiago/Netdata/tests_netdata# ps aux| grep netdata
netdata  29244  0.5  0.4 477520 67648 ?        SNl  15:18   0:08 netdata
netdata  29247  0.0  0.1 107888 17908 ?        SNl  15:18   0:00 /usr/sbin/netdata --special-spawn-server
root     32366  0.0  0.0   4124  2364 pts/1    S+   15:45   0:00 grep netdata
root@hades:/home/thiago/Netdata/tests_netdata# date
Mon Feb 26 15:45:05 UTC 2024

These are few logs I have on it:

time=2024-02-26T15:34:18.984+00:00 comm=netdata source=daemon level=info tid=29520 thread=DAEMON_COMMAND msg="Shutting down command loop complete."

.....

time=2024-02-26T15:34:27.664+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Waiting 1 threads to finish..."
time=2024-02-26T15:34:27.765+00:00 comm=netdata source=daemon level=error tid=29241 thread=netdata msg="Main thread h2o takes too long to exit. Giving up..."
time=2024-02-26T15:34:27.765+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:hades' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:potinho' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="RRD: 'host:archlinux' stopping data collection..."
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Sending a shutdown command"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Waiting for shutdown ACK"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29457 thread=METASYNC msg="Shutting down metadata thread"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29457 thread=METASYNC msg="Thread 29457: Cleaning prepared statement on 0x280cd18"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29457 thread=METASYNC msg="Thread 29457: Cleaning prepared statement on 0x280ba58"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=debug tid=29241 thread=netdata msg="METADATA: Shutdown complete"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Closing context SQLite database"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="Closing SQLite database"
time=2024-02-26T15:34:27.766+00:00 comm=netdata source=daemon level=info tid=29241 thread=netdata msg="No statements pending to finalize"

Please, take a look that between the first messages processes were stopping until I ran command date, we have almost 12 minutes.

@thiagoftsm
Copy link
Contributor

After a chat with @vkalintiris , we observed that the previous issue does not have relationship with this PR. We used gdb to dig into it, and we observed that the issue has relationship with spawn server:

[4:58](https://netdata-cloud.slack.com/archives/D01JE74U54J/p1708966735076449)
(gdb) info threads
  Id   Target Id                                         Frame 
* 1    Thread 0x7f901e556940 (LWP 18019) "netdata"       0x00007f901d91991e in epoll_pwait () from /lib64/libc.so.6
  2    Thread 0x7f901e556940 (LWP 18031) "iou-sqp-18019" 0x0000000000000000 in ?? ()
  3    Thread 0x7f9018fff6c0 (LWP 18032) "netdata"       0x00007f901d891566 in __futex_abstimed_wait_cancelable64 ()
   from /lib64/libc.so.6

(gdb) thread 3
[Switching to thread 3 (Thread 0x7f9018fff6c0 (LWP 18032))]
#0  0x00007f901d891566 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f901d891566 in __futex_abstimed_wait_cancelable64 () from /lib64/libc.so.6
#1  0x00007f901d893d2e in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007f901e624689 in uv_cond_wait () from /usr/lib64/libuv.so.1
#3  0x0000000000704d8f in wait_children (arg=<optimized out>)
    at /home/thiago/Netdata/tests_netdata/src/spawn/spawn_server.c:123
#4  0x00007f901d894d57 in start_thread () from /lib64/libc.so.6
#5  0x00007f901d9196a8 in clone3 () from /lib64/libc.so.6

This PR helped to explicit it. When I used current master, it does not appear so quickly.

It was also observed that we have a random issue. It is not appearing every time I run.

thiagoftsm
thiagoftsm previously approved these changes Feb 26, 2024
Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After to debug, and to confirm that the issue #17060 (comment) does not have relationship with this PR, LGTM!

@vkalintiris
Copy link
Contributor Author

@thiagoftsm can you approve again this one? I added d5390ab which will use timed wait only for builds that have sentry enabled. Otherwise, we'll follow the same behavior as the one we currently have on master.

Copy link
Contributor

@thiagoftsm thiagoftsm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything is working after last commit, LGTM!

@vkalintiris vkalintiris merged commit 700d77b into netdata:master Feb 27, 2024
129 of 133 checks passed
ilyam8 pushed a commit to ilyam8/netdata that referenced this pull request Feb 28, 2024
…netdata#17060)

* Add timed-wait for completion.

* Abort if any shutdown step takes more than 60 seconds to complete.

* Timeout only on sentry builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants