Direct memory / data source leak if teacher becomes inaccessible during reconnect #9477

artemananiev · 2023-10-25T18:26:12Z

This problem was recently reported in perfnet during longevity tests. From the logs, here is what happened:

1. A node falls behind the network and starts a reconnect as a learner.
2. While the reconnect is in progress, the teacher becomes inaccessible (killed, OOME, whatever).
3. The learner destroys the incomplete reconnect state and starts a reconnect from a different node.

What we observe is that sometimes the reconnect state is not fully destroyed. For example, only 1 of 5 virtual maps is released. This results in a data source leak (data sources are closed when the corresponding virtual root node is destroyed), which in turn leads to direct memory leaks and perhaps other leaks.

artemananiev · 2023-10-25T18:31:39Z

In the current logs, there is a message about the data source being closed:

Closing Data Source uniqueTokenStore

but the database files are still on disk. This indicates that something happened (an exception?) in MerkleDbDataSource.close(), but the exception is not logged anywhere.
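
To illustrate the concern about a silent failure in close(), here is a minimal sketch of a close wrapper that always logs the exception. It is not the MerkleDbDataSource code: the class LoggedCloseSketch, the interface ClosableSource, and the method closeAndLog are hypothetical names introduced for this example, and java.util.logging stands in for whatever logger the platform actually uses.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch; not the real MerkleDbDataSource implementation.
final class LoggedCloseSketch {
    private static final Logger LOG = Logger.getLogger(LoggedCloseSketch.class.getName());

    /** Hypothetical stand-in for a data source; only close() matters here. */
    interface ClosableSource {
        void close() throws IOException;
    }

    /** Closes the source. On failure, logs the exception and returns false instead of hiding it. */
    static boolean closeAndLog(String name, ClosableSource source) {
        LOG.info("Closing Data Source " + name);
        try {
            source.close();
            return true;
        } catch (IOException | RuntimeException e) {
            // Without this log line the failure would be invisible, as in the report above.
            LOG.log(Level.SEVERE, "Failed to close data source " + name, e);
            return false;
        }
    }

    public static void main(String[] args) {
        // A source whose close() fails partway, e.g. before its files are deleted.
        boolean closed = closeAndLog("uniqueTokenStore", () -> {
            throw new IOException("simulated failure while deleting files");
        });
        System.out.println("closed cleanly: " + closed);
    }
}
```

With a wrapper like this, a failed close would leave a stack trace next to the "Closing Data Source" message, which is the evidence missing from the logs described above.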

poulok · 2023-10-25T18:34:49Z

This was observed in release 43.

OlegMazurov · 2023-11-09T00:01:29Z

Here are the factors that lead to direct memory / data source leak in the learner when it aborts a reconnect session:

1. VirtualRootNode is instantiated with a basic constructor in LearnerThread.handleCustomRootInitialLesson(). As a consequence, it's instantiated with VirtualPipeline being null. Although VirtualRootNode.postInit() is called when the node is added to VirtualMap, it's a no-op because learnerTreeView is not null.
2. VirtualRootNode.setupWithOriginalNode() creates a copy of the original data source and makes it primary.
3. After a VirtualRootNode has been successfully synchronized, postInit() is invoked again from VirtualRootNode.endLearnerReconnect(). This time learnerTreeView is set to null and a VirtualPipeline is instantiated for the node.
4. If reconnect is aborted for any reason, LearningSynchronizer.abort() releases newRoot, which is supposed to remove all new copies of VirtualMaps with their internal structures, including all new data sources.
5. Those VirtualRootNodes that have not yet been synchronized still have pipeline == null. VirtualRootNode.destroyNode() becomes a no-op and the data source is not closed, creating the leak.
6. Any VirtualRootNode that has been synchronized correctly closes its data source. However, the database table is not deleted by MerkleDb.closeDataSource() because it's designated as primary (primaryTables.contains(tableId) == true).
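
The pipeline == null factor above can be sketched with a toy model. This is a simplified illustration, not the platform code: ReconnectLeakSketch, FakeDataSource, FakeRootNode, destroyLeaky, destroyGuarded, and openAfterAbort are hypothetical names, and the guarded variant is only one possible way to avoid the leak, not necessarily what the fix in #9780 does. The primary-table factor is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the factors above; none of these are the real platform classes.
final class ReconnectLeakSketch {

    /** Stand-in for a data source that owns off-heap memory and files. */
    static final class FakeDataSource {
        private boolean closed;

        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Stand-in for a learner-side virtual root node. */
    static final class FakeRootNode {
        private final FakeDataSource dataSource = new FakeDataSource();
        // Assumption: cleanup is driven by the pipeline, which stays null until synchronization completes.
        private Object pipeline;

        void endLearnerReconnect() {
            pipeline = new Object();
        }

        /** Mirrors the reported behavior: without a pipeline, destroy does nothing. */
        void destroyLeaky() {
            if (pipeline != null) {
                dataSource.close();
            }
        }

        /** One possible guard: close the data source whether or not a pipeline exists. */
        void destroyGuarded() {
            dataSource.close();
        }
    }

    /** Aborts a reconnect of total maps, synced of which completed; returns how many data sources stay open. */
    static int openAfterAbort(int total, int synced, boolean guarded) {
        List<FakeRootNode> roots = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            FakeRootNode root = new FakeRootNode();
            if (i < synced) {
                root.endLearnerReconnect();
            }
            roots.add(root);
        }
        int open = 0;
        for (FakeRootNode root : roots) {
            if (guarded) {
                root.destroyGuarded();
            } else {
                root.destroyLeaky();
            }
            if (!root.dataSource.isClosed()) {
                open++;
            }
        }
        return open;
    }

    public static void main(String[] args) {
        System.out.println("open after leaky abort:   " + openAfterAbort(5, 1, false));
        System.out.println("open after guarded abort: " + openAfterAbort(5, 1, true));
    }
}
```

With 5 maps of which 1 was synchronized before the abort, the leaky variant leaves 4 data sources open, which matches the "only 1 of 5 virtual maps is released" observation in the original report. The guarded variant leaves none open.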

…e during reconnect (#9780) Fixes: #9477 Reviewed-by: Oleg Mazurov <oleg.mazurov@swirldslabs.com> Signed-off-by: Artem Ananev <artem.ananev@swirldslabs.com>

…e during reconnect (#9780) Fixes: #9477 Reviewed-by: Oleg Mazurov <oleg.mazurov@swirldslabs.com> Signed-off-by: Artem Ananev <artem.ananev@swirldslabs.com> Signed-off-by: Nick Poorman <nick@swirldslabs.com>

artemananiev added Platform Reconnect Platform Virtual Map Platform Data Structures labels Oct 25, 2023

artemananiev mentioned this issue Oct 25, 2023

Add more logs to debug virtual map reconnect issues #9479

Closed

OlegMazurov mentioned this issue Nov 4, 2023

Reconnect protocol violation causes havoc on the learner side. #9678

Closed

artemananiev mentioned this issue Nov 8, 2023

9477: Direct memory / data source leak if teacher becomes inaccessible during reconnect #9780

Merged

artemananiev self-assigned this Nov 8, 2023

artemananiev closed this as completed in #9780 Nov 10, 2023

artemananiev mentioned this issue Nov 10, 2023

Backport the fix for #9477 to release 0.44 #9828

Closed
