Unexpectedly high memory usage seemingly due to a "stuck" client #2870

Closed

nfrisby opened this issue Jan 14, 2021 · 5 comments

Labels: bug (Something isn't working), consensus (issues related to ouroboros-consensus)

Comments


nfrisby commented Jan 14, 2021

This Issue arose from the debugging/triage efforts of Issue IntersectMBO/cardano-node#2235. A server has a space leak that seems to be related to two problematic clients. Both clients have followed the chain up to just before the first Allegra block, and both repeatedly disconnect and reconnect without making any progress. We have theories about why the clients are doing that (one is too old to understand Allegra; the other may be running out of memory in the epoch-boundary computation or something similar), but it appears that they are inducing unacceptable resource usage in the server.

The goal of this Issue is to reproduce that interaction in a minimal, controlled setup and debug it.

From this perspective, the relevant facts are as follows.

  • We don't yet know which release is running on the server we have the most anecdata from.

  • The V_2 client is sending the same FindIntersect once per second. The V_5 client is sending it about once every 100/3 ≈ 33 seconds, and it also fetches some blocks for about 10 seconds each time before it disconnects.

  • (It may be that only one of them is causing the memory leak.)

  • Both peer-to-peer connections are killed once per FindIntersect (the V_2 connection by the server, the V_5 connection by the client). And the FindIntersect is always the same: the expected points for a client whose current chain ends at the block just before Allegra started. (A minimal simulation sketch follows this list.)

  • There is also a space leak of “PINNED” memory, which is not related to the number of clients the server has. It is possible that it is the only space leak in play.

  • And the GitHub Issue (cardano-node#2235) has network-traffic logs that show bursts of high outbound traffic (i.e. the repeated bulk sync?).
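
For the reproduction attempt, a loop along the following lines could approximate the V_2 behaviour: reconnect roughly once per second and always send the same FindIntersect. This is only a sketch under that assumption; Connection, Point, connectToServer, closeConnection, sendFindIntersect, and lastPreAllegraPoints are hypothetical stubs, not the real ouroboros-network chain-sync API.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module StuckClientSim where

import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, bracket, try)
import Control.Monad (forever)

-- Hypothetical stand-ins for a real connection and chain point.
data Connection = Connection
data Point = Point

connectToServer :: IO Connection
connectToServer = pure Connection    -- stub: open a connection to the server under test

closeConnection :: Connection -> IO ()
closeConnection _ = pure ()          -- stub: tear the connection down

sendFindIntersect :: Connection -> [Point] -> IO ()
sendFindIntersect _ _ = pure ()      -- stub: send the chain-sync FindIntersect message

-- The points never change: the simulated client's chain ends at the block
-- just before the first Allegra block.
lastPreAllegraPoints :: [Point]
lastPreAllegraPoints = [Point]

-- Reconnect roughly once per second, always sending the same FindIntersect,
-- mirroring the observed V_2 client behaviour.
simulateStuckV2Client :: IO ()
simulateStuckV2Client = forever $ do
  result <- try $ bracket connectToServer closeConnection $ \conn ->
              sendFindIntersect conn lastPreAllegraPoints
  case result of
    Left (_ :: SomeException) -> pure ()  -- the connection was killed; just loop and retry
    Right ()                  -> pure ()
  threadDelay 1000000                     -- about one second between reconnects
```

Pointing such a loop at a locally running server and watching its heap profile should show whether this traffic pattern alone reproduces the leak.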

nfrisby added the bug and consensus labels on Jan 14, 2021

nfrisby commented Jan 14, 2021

Thanks @karknu for patiently catching me up so I could write the above. Do you or @mrBliss see any corrections or omissions?


rphair commented Jan 15, 2021

Just commenting again here to note that I'll keep our production stake pool relay, which still demonstrates the problem, running as-is for another 12 hours in case any information needs to be dug out of it. I'll be watching for email on this issue, and @nfrisby, you and the other devs can also contact me on Telegram, as @karknu did. Here is the software revision on that node:

relay-sgp1$ cardano-node --version
cardano-node 1.24.2 - linux-x86_64 - ghc-8.10
git rev 400d18092ce604352cf36fe5f105b0d7c78be074


karknu commented Jan 18, 2021

#2880 is an example of how to fix the problem. Care should be taken to go through all the other mini-protocols and make sure that any state they allocate is cleaned up even in the case of an exception.
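
As a sketch of that cleanup discipline, the allocation of any per-connection state can be wrapped in Control.Exception.bracket so the release action runs even if the handler throws or the peer disconnects mid-protocol. ProtocolState, allocateState, and releaseState below are hypothetical placeholders, not the actual handler code touched by #2880.

```haskell
module MiniProtocolCleanup where

import Control.Exception (bracket)

-- Hypothetical per-connection state a mini-protocol handler might allocate
-- (e.g. a registered follower or an in-memory queue).
data ProtocolState = ProtocolState

allocateState :: IO ProtocolState
allocateState = pure ProtocolState   -- stub: acquire the state when the connection starts

releaseState :: ProtocolState -> IO ()
releaseState _ = pure ()             -- stub: deregister / free the state

-- 'bracket' guarantees 'releaseState' runs whether the handler returns
-- normally, throws, or is killed when the peer disconnects.
runMiniProtocolWithState :: (ProtocolState -> IO a) -> IO a
runMiniProtocolWithState handler = bracket allocateState releaseState handler
```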


rphair commented Jan 18, 2021

Three days after blocking incoming TCP connections from any nodes generating HardForkEncoderDisabledEra messages (so no more obsolete peers are arriving), and making no other changes, the memory problem has not returned.

nfrisby unpinned this issue on Jan 18, 2021

karknu commented Jan 19, 2021

Fixed in #2880.

karknu closed this as completed on Jan 19, 2021
nfrisby added commits referencing this issue on Jan 22 and Jan 26, 2021, each with the message:
"We intend for this change to make it more difficult to repeat the mistake underlying Issue #2870."