Direct memory / data source leak if teacher becomes inaccessible during reconnect #9477

Closed
artemananiev opened this issue Oct 25, 2023 · 3 comments · Fixed by #9780

Comments

@artemananiev
Member

This problem has been recently reported in perfnet during longevity tests. From the logs, here is what happened:

  • A node falls behind the network and starts a reconnect as a learner
  • While reconnect is in progress, the teacher becomes inaccessible (killed, OOME, whatever)
  • The learner destroys the incomplete reconnect state and starts a reconnect from a different node

What we observe is that the reconnect state is sometimes not fully destroyed. For example, only 1 of 5 virtual maps is released. This results in a data source leak (data sources are closed when the corresponding virtual root node is destroyed), which in turn leads to direct memory leaks and perhaps other leaks.

@artemananiev
Member Author

In the current logs, there is a log message about the data source being closed:

Closing Data Source uniqueTokenStore

but the database files are still on disk. This indicates that something happened (an exception?) in MerkleDbDataSource.close(), but the exception is not logged anywhere.
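
One way to make a failure like this visible is to wrap the close call so that any exception is logged before cleanup continues. The following is only a minimal sketch under stated assumptions, not the actual MerkleDbDataSource code: FakeDataSource, closeAndLog, and the logger name are hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class CloseLoggingSketch {
    // Hypothetical logger name, not taken from the platform code.
    private static final Logger LOG = Logger.getLogger("reconnect.sketch");

    /** Hypothetical stand-in for a data source whose close() may fail part-way. */
    interface FakeDataSource {
        String name();

        void close() throws IOException;
    }

    /**
     * Closes the data source. Returns true on success; on failure the
     * exception is logged (instead of being lost) and false is returned.
     */
    static boolean closeAndLog(FakeDataSource ds) {
        LOG.info("Closing Data Source " + ds.name());
        try {
            ds.close();
            return true;
        } catch (IOException | RuntimeException e) {
            LOG.log(Level.SEVERE, "Failed to close data source " + ds.name(), e);
            return false;
        }
    }

    public static void main(String[] args) {
        FakeDataSource failing = new FakeDataSource() {
            @Override
            public String name() {
                return "uniqueTokenStore";
            }

            @Override
            public void close() throws IOException {
                // Simulates a failure after the "Closing" message was printed.
                throw new IOException("simulated failure while deleting files");
            }
        };
        System.out.println("closed cleanly: " + closeAndLog(failing));
    }
}
```

With a wrapper of this kind, the "Closing Data Source" message would be followed by a logged stack trace whenever close() fails, rather than by silence.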

@poulok
Member

poulok commented Oct 25, 2023

This was observed in release 43.

@OlegMazurov
Contributor

Here are the factors that lead to the direct memory / data source leak in the learner when it aborts a reconnect session:

  • VirtualRootNode is instantiated with a basic constructor in LearnerThread.handleCustomRootInitialLesson(). As a consequence, it's instantiated with VirtualPipeline being null. Although VirtualRootNode.postInit() is called when the node is added to VirtualMap, it's a no-op because learnerTreeView is not null.
  • VirtualRootNode.setupWithOriginalNode() creates a copy of the original data source and makes it primary
  • After a VirtualRootNode has been successfully synchronized, postInit() is invoked again from VirtualRootNode.endLearnerReconnect(). This time learnerTreeView is set to null and a VirtualPipeline is instantiated for the node.
  • If reconnect is aborted for any reason, LearningSynchronizer.abort() releases newRoot, which is supposed to remove all new copies of VirtualMaps with their internal structures including all new data sources.
  • Those VirtualRootNodes that have not yet been synchronized still have pipeline == null. For them, VirtualRootNode.destroyNode() becomes a no-op and the data source is not closed, which creates the leak.
  • Any VirtualRootNode that has been synchronized correctly closes its data source. However, the database table is not deleted by MerkleDb.closeDataSource() because it's designated as primary (primaryTables.contains(tableId) == true).
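
The two failure modes in the list above can be reproduced with a small standalone model. This is a simplified sketch under stated assumptions, not the platform code and not the fix merged in #9780: the classes Db, DataSource, and Node, and the method destroyNodeSafely(), are hypothetical and only mimic the behavior described in the bullets.

```java
import java.util.HashSet;
import java.util.Set;

class ReconnectLeakSketch {
    /** Hypothetical database: tracks tables on disk and which are primary. */
    static class Db {
        final Set<Integer> tablesOnDisk = new HashSet<>();
        final Set<Integer> primaryTables = new HashSet<>();

        void closeDataSource(int tableId) {
            // As described above: a primary table is never deleted on close.
            if (!primaryTables.contains(tableId)) {
                tablesOnDisk.remove(tableId);
            }
        }
    }

    /** Hypothetical data source backed by one table in Db. */
    static class DataSource {
        final Db db;
        final int tableId;
        boolean closed;

        DataSource(Db db, int tableId, boolean primary) {
            this.db = db;
            this.tableId = tableId;
            db.tablesOnDisk.add(tableId);
            if (primary) {
                db.primaryTables.add(tableId);
            }
        }

        void close() {
            closed = true;
            db.closeDataSource(tableId);
        }
    }

    /** Hypothetical root node; pipeline stays null until sync completes. */
    static class Node {
        final DataSource dataSource;
        Runnable pipeline; // null while the node is still being synchronized

        Node(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        void endLearnerReconnect() {
            // Models postInit() creating the pipeline after a successful sync.
            pipeline = dataSource::close;
        }

        void destroyNode() {
            if (pipeline == null) {
                return; // no-op: the data source is never closed (the leak)
            }
            pipeline.run();
        }

        /** Illustrative alternative: close directly when no pipeline exists. */
        void destroyNodeSafely() {
            if (pipeline != null) {
                pipeline.run();
            } else {
                dataSource.close();
            }
        }
    }

    public static void main(String[] args) {
        Db db = new Db();

        Node unsynced = new Node(new DataSource(db, 1, true));
        unsynced.destroyNode();
        System.out.println("unsynced closed: " + unsynced.dataSource.closed);

        Node synced = new Node(new DataSource(db, 2, true));
        synced.endLearnerReconnect();
        synced.destroyNode();
        System.out.println("synced closed: " + synced.dataSource.closed
                + ", table still on disk: " + db.tablesOnDisk.contains(2));
    }
}
```

In this model, the unsynchronized node leaves its data source open, and the synchronized node closes its data source but leaves the primary table on disk. Both outcomes match the observations reported earlier in the thread.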

artemananiev added a commit that referenced this issue Nov 10, 2023
…e during reconnect (#9780)

Fixes: #9477
Reviewed-by: Oleg Mazurov <oleg.mazurov@swirldslabs.com>
Signed-off-by: Artem Ananev <artem.ananev@swirldslabs.com>
ilko-iliev-lime pushed a commit that referenced this issue Nov 10, 2023
…e during reconnect (#9780)

Fixes: #9477
Reviewed-by: Oleg Mazurov <oleg.mazurov@swirldslabs.com>
Signed-off-by: Artem Ananev <artem.ananev@swirldslabs.com>
nickpoorman pushed a commit that referenced this issue Nov 22, 2023
…e during reconnect (#9780)

Fixes: #9477
Reviewed-by: Oleg Mazurov <oleg.mazurov@swirldslabs.com>
Signed-off-by: Artem Ananev <artem.ananev@swirldslabs.com>
Signed-off-by: Nick Poorman <nick@swirldslabs.com>