Unexpectedly high memory usage seemingly due to a "stuck" client #2870

Closed

nfrisby opened this issue Jan 14, 2021 · 5 comments

Labels: bug (Something isn't working), consensus (issues related to ouroboros-consensus)

Comments


nfrisby commented Jan 14, 2021

This Issue arose from the debugging/triage efforts of Issue IntersectMBO/cardano-node#2235. A server has a space leak that seems to be related to two problematic clients. Both clients have followed the chain up to just before the first Allegra block, and both repeatedly disconnect and reconnect without making any progress. We have theories about why the clients are doing that (one is too old to understand Allegra; the other may be running out of memory in the epoch-boundary computation or something similar), but it appears that they are inducing unacceptable resource usage in the server.

The goal of this Issue is to reproduce that interaction in a minimal, controlled setup and debug it.

From this perspective, the relevant facts are as follows.

  • We don't yet know which release is running on the server we have the most anecdata from.

  • The V_2 client is sending the same FindIntersect once per second. The V_5 client is sending it about once every 100/3 ≈ 33 seconds, and it also fetches some blocks for about 10 seconds each time before it disconnects.

  • (It may be that only one of them is causing the memory leak.)

  • Both peer-to-peer connections are killed once per FindIntersect (the V_2 connection by the server, the V_5 connection by the client). And the FindIntersect is always the same: the expected points for a client whose current chain ends at the block just before Allegra started. (A minimal simulation sketch follows this list.)

  • There is also a space leak of “PINNED” memory, which is not related to the number of clients the server has. It is possible that it is the only space leak in play.

  • And the GitHub Issue (cardano-node#2235) has network-traffic logs that show bursts of high outbound traffic (i.e. the repeated bulk sync?).
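
For the reproduction attempt, a loop along the following lines could approximate the V_2 behaviour: reconnect roughly once per second and always send the same FindIntersect. This is only a sketch under that assumption; Connection, Point, connectToServer, closeConnection, sendFindIntersect, and lastPreAllegraPoints are hypothetical stubs, not the real ouroboros-network chain-sync API.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module StuckClientSim where

import Control.Concurrent (threadDelay)
import Control.Exception (SomeException, bracket, try)
import Control.Monad (forever)

-- Hypothetical stand-ins for a real connection and chain point.
data Connection = Connection
data Point = Point

connectToServer :: IO Connection
connectToServer = pure Connection    -- stub: open a connection to the server under test

closeConnection :: Connection -> IO ()
closeConnection _ = pure ()          -- stub: tear the connection down

sendFindIntersect :: Connection -> [Point] -> IO ()
sendFindIntersect _ _ = pure ()      -- stub: send the chain-sync FindIntersect message

-- The points never change: the simulated client's chain ends at the block
-- just before the first Allegra block.
lastPreAllegraPoints :: [Point]
lastPreAllegraPoints = [Point]

-- Reconnect roughly once per second, always sending the same FindIntersect,
-- mirroring the observed V_2 client behaviour.
simulateStuckV2Client :: IO ()
simulateStuckV2Client = forever $ do
  result <- try $ bracket connectToServer closeConnection $ \conn ->
              sendFindIntersect conn lastPreAllegraPoints
  case result of
    Left (_ :: SomeException) -> pure ()  -- the connection was killed; just loop and retry
    Right ()                  -> pure ()
  threadDelay 1000000                     -- about one second between reconnects
```

Pointing such a loop at a locally running server and watching its heap profile should show whether this traffic pattern alone reproduces the leak.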

nfrisby added the bug and consensus labels on Jan 14, 2021

nfrisby commented Jan 14, 2021

Thanks @karknu for patiently catching me up so I could write the above. Do you or @mrBliss see any corrections or omissions?


rphair commented Jan 15, 2021

Just commenting again here to note that I'll keep our production stake pool relay, which still demonstrates the problem, running as-is for another 12 hours in case any information needs to be dug out of it. I'll be watching for email on this issue, and @nfrisby, you and the other devs can also contact me on Telegram, as @karknu did. Here is the software revision on that node:

relay-sgp1$ cardano-node --version
cardano-node 1.24.2 - linux-x86_64 - ghc-8.10
git rev 400d18092ce604352cf36fe5f105b0d7c78be074


karknu commented Jan 18, 2021

#2880 is an example of how to fix the problem. Care should be taken to go through all the other mini-protocols and make sure that any state they allocate is cleaned up even in the case of an exception.
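
As a sketch of that cleanup discipline, the allocation of any per-connection state can be wrapped in Control.Exception.bracket so the release action runs even if the handler throws or the peer disconnects mid-protocol. ProtocolState, allocateState, and releaseState below are hypothetical placeholders, not the actual handler code touched by #2880.

```haskell
module MiniProtocolCleanup where

import Control.Exception (bracket)

-- Hypothetical per-connection state a mini-protocol handler might allocate
-- (e.g. a registered follower or an in-memory queue).
data ProtocolState = ProtocolState

allocateState :: IO ProtocolState
allocateState = pure ProtocolState   -- stub: acquire the state when the connection starts

releaseState :: ProtocolState -> IO ()
releaseState _ = pure ()             -- stub: deregister / free the state

-- 'bracket' guarantees 'releaseState' runs whether the handler returns
-- normally, throws, or is killed when the peer disconnects.
runMiniProtocolWithState :: (ProtocolState -> IO a) -> IO a
runMiniProtocolWithState handler = bracket allocateState releaseState handler
```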


rphair commented Jan 18, 2021

Three days after blocking incoming TCP connections from any nodes generating HardForkEncoderDisabledEra messages (so no more obsolete peers are arriving), and making no other changes, the memory problem has not returned.

nfrisby unpinned this issue on Jan 18, 2021

karknu commented Jan 19, 2021

Fixed in #2880.

karknu closed this as completed on Jan 19, 2021
nfrisby added commits referencing this issue on Jan 22 and Jan 26, 2021, each with the message:
"We intend for this change to make it more difficult to repeat the mistake underlying Issue #2870."