100% cpu load on all nodes #3237

Closed

shrikus opened this issue Feb 23, 2020 · 6 comments
@shrikus commented Feb 23, 2020

We have 8+ Grin nodes (3.0.0) running, and every 4-20 days one of them hangs the entire server at 100% CPU usage across all available cores. The logs do not show anything significant, and strace -f shows no activity for the busy threads either. top -H -p displays this:

[screenshot: top -H -p output]

htop:

[screenshot: htop output]

@bluimes commented Feb 23, 2020

I can vouch for this issue. It is the main reason the system becomes unstable. ITX machine, dedicated node.

[screenshot]

@antiochp (Member) commented Feb 24, 2020

Thanks for the detailed report. We are investigating.

@antiochp (Member) commented Feb 24, 2020

I see a node with similar behavior -

[screenshot: Screen Shot 2020-02-24 at 10 00 39 AM]

This suggests the peer_write threads are getting jammed up somewhere, consuming a lot of CPU and not letting it go.
I'll see if I can get a stack trace of what is going on in these threads.

@antiochp (Member) commented Feb 24, 2020

Still attempting to reproduce on a node configured to give decent stack traces.

My gut feeling is that this is related to blocking IO: we are retrying (likely writing to a socket) in a tight loop in the peer_write thread, in a situation that is not recoverable.
i.e. a non-recoverable error is encountered and we keep retrying even though the socket will never recover.

Related - #3154
We may not have been seeing this prior to #3154 due to blocking/non-blocking ambiguity.

Note: this is just speculation; there is no evidence yet to indicate this is actually the underlying issue.


Specifically, our call to write_message is wrapped in try_break!, which attempts to "retry" on WouldBlock. I suspect we only see WouldBlock in non-blocking IO when we are in a non-recoverable state (i.e. the connection has dropped). If our end of the TCP stream is not aware that the connection has dropped, then we may continue to retry indefinitely (albeit with a small 10ms sleep in the loop).
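
For illustration, a minimal sketch of the kind of retry loop being described; the function name, loop structure and 10ms sleep are assumptions for the sketch, not the actual grin code:

    use std::io::{self, Write};
    use std::net::TcpStream;
    use std::thread;
    use std::time::Duration;

    // Hypothetical sketch: a non-blocking write loop that retries on WouldBlock.
    // If the peer has silently gone away and our end never observes a hard error,
    // this loop can keep spinning (with short sleeps) indefinitely.
    fn write_message(stream: &mut TcpStream, msg: &[u8]) -> io::Result<()> {
        let mut written = 0;
        while written < msg.len() {
            match stream.write(&msg[written..]) {
                Ok(n) => written += n,
                // Socket not ready: back off briefly and retry. If the connection
                // is actually dead we may never make progress here.
                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                    thread::sleep(Duration::from_millis(10));
                }
                // Any other error is treated as fatal and ends the loop.
                Err(e) => return Err(e),
            }
        }
        Ok(())
    }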

Pinging @hashmap for another pair of eyes on this.

@antiochp (Member) commented Feb 25, 2020

I also bet that if the peer_write thread gets locked up like this, we can no longer reliably request compact blocks when we receive new block headers during broadcast.
In that case our node is very likely to start falling behind and getting out of sync (see the gitter chat and various other seemingly unrelated bug reports).

@antiochp (Member) commented Feb 25, 2020

A bunch of threads are stuck in a loop doing the following -

(lldb) thread backtrace 21
  thread #21, name = 'peer_write', stop reason = signal SIGSTOP
    frame #0: 0x00007f9b96ac1ed9 libpthread.so.0`__pthread_cond_timedwait + 649
    frame #1: 0x00005644005a7181 grin`std::sys::unix::condvar::Condvar::wait_timeout::h9b1adf8fddc8b04c at condvar.rs:99
    frame #2: 0x0000564400595242 grin`std::thread::park_timeout::h1be532d836107941 [inlined] std::sys_common::condvar::Condvar::wait_timeout::h47c7d44ad248cf0c at condvar.rs:51
    frame #3: 0x0000564400595236 grin`std::thread::park_timeout::h1be532d836107941 [inlined] std::sync::condvar::Condvar::wait_timeout::hf9a83dd977ab4265 at condvar.rs:405
    frame #4: 0x000056440059521c grin`std::thread::park_timeout::h1be532d836107941 at mod.rs:1003
    frame #5: 0x00005644005a167d grin`std::sync::mpsc::blocking::WaitToken::wait_max_until::h29132948868d6846 at blocking.rs:75
    frame #6: 0x0000564400343ecc grin`std::sync::mpsc::sync::Packet$LT$T$GT$::recv::hef7c654ce2ea359f + 572
    frame #7: 0x000056440033c5aa grin`std::sync::mpsc::Receiver$LT$T$GT$::recv_timeout::h55d849c5bb0607a6 + 2250
    frame #8: 0x000056440033f6b3 grin`std::sys_common::backtrace::__rust_begin_short_backtrace::h83fa8a0c37ae5ebc + 275
    frame #9: 0x0000564400346785 grin`std::panicking::try::do_call::h7cd3f34e61222b1a (.llvm.2480943539170131300) + 37
    frame #10: 0x00005644005ab0fa grin`__rust_maybe_catch_panic at lib.rs:78
    frame #11: 0x0000564400358609 grin`core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hbe008f988e800a1d + 137
    frame #12: 0x000056440059406f grin`_$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h7bfd6e98bcf6c67e at boxed.rs:1022
    frame #13: 0x00005644005aa1f0 grin`std::sys::unix::thread::Thread::new::thread_start::hbc75223688f972da [inlined] _$LT$alloc..boxed..Box$LT$F$GT$$u20$as$u20$core..ops..function..FnOnce$LT$A$GT$$GT$::call_once::h9979661149bbc525 at boxed.rs:1022
    frame #14: 0x00005644005aa1e4 grin`std::sys::unix::thread::Thread::new::thread_start::hbc75223688f972da [inlined] std::sys_common::thread::start_thread::had0bb4166c065276 at thread.rs:13
    frame #15: 0x00005644005aa16a grin`std::sys::unix::thread::Thread::new::thread_start::hbc75223688f972da at thread.rs:80
    frame #16: 0x00007f9b96abb6db libpthread.so.0`start_thread + 219
    frame #17: 0x00007f9b965cc88f libc.so.6`clone + 63

So this is the point where the peer_write thread is reading a message off the mpsc channel receiver.
We do this via recv_timeout(CHANNEL_TIMEOUT), but if this errors out we immediately retry.
This likely explains the tight loop and the CPU usage.

https://doc.rust-lang.org/std/sync/mpsc/struct.Receiver.html#method.recv_timeout

I need to look at the code again, but I think we may not be handling error conditions correctly when we call recv_timeout, specifically the case where "no more messages can ever be received on this channel."

If the corresponding Sender has disconnected, or it disconnects while this call is blocking, this call will wake up and return Err to indicate that no more messages can ever be received on this channel. However, since channels are buffered, messages sent before the disconnect will still be properly received.

https://doc.rust-lang.org/std/sync/mpsc/enum.RecvTimeoutError.html

Timeout: This channel is currently empty, but the Sender(s) have not yet disconnected, so data may yet become available.

Disconnected: The channel's sending half has become disconnected, and there will never be any more data received on it.
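
A minimal sketch of the distinction being described, assuming a plain mpsc receive loop (the channel setup and timeout value are illustrative, not the actual peer_write code): a Timeout can safely be retried, but Disconnected means no message will ever arrive, so retrying just spins.

    use std::sync::mpsc::{self, RecvTimeoutError};
    use std::thread;
    use std::time::Duration;

    // Timeout value is illustrative, not grin's actual CHANNEL_TIMEOUT.
    const CHANNEL_TIMEOUT: Duration = Duration::from_millis(15);

    fn main() {
        let (tx, rx) = mpsc::sync_channel::<Vec<u8>>(10);

        // Simulated sender that disconnects after sending a few messages.
        thread::spawn(move || {
            for i in 0..3u8 {
                tx.send(vec![i]).unwrap();
            }
            // tx is dropped here; the receiver will eventually see Disconnected.
        });

        loop {
            match rx.recv_timeout(CHANNEL_TIMEOUT) {
                Ok(msg) => println!("got message: {:?}", msg),
                // Channel empty but sender still alive: retrying is fine.
                Err(RecvTimeoutError::Timeout) => continue,
                // Sender gone: no message will ever arrive. Retrying here
                // would spin forever, so exit the loop instead.
                Err(RecvTimeoutError::Disconnected) => break,
            }
        }
    }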
