
Reduce memory consumption in node creation with many threads #2316

Merged: 4 commits merged into nest:master on Mar 7, 2022

Conversation

hakonsbm (Contributor)

Creating nodes of different models while using many threads consumes an excessive amount of memory. For example, creating 10 populations of 5000 nodes each, with a different model for each population:

Threads | Memory [MB]
------- | -----------
1       | 135
256     | 2255

This is because node creation relies on memory management with sli::pool:

std::vector< sli::pool > memory_;

By replacing the sli::pool with an std::vector, the memory consumption is kept at a more manageable level:

Threads | Memory [MB]
------- | -----------
1       | 129
256     | 306
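
For illustration, a rough sketch of what the change amounts to; the allocate() body below is simplified and assumed, not the exact diff (see the review excerpts further down for the real change):

// Before: one pool allocator per thread, owning the node objects.
// std::vector< sli::pool > memory_;

// After: one vector of node pointers per thread.
std::vector< std::vector< Node* > > memory_;

inline Node*
Model::allocate( thread t )
{
  // Simplified sketch: create a new node for thread t and remember its
  // pointer, so the model keeps per-thread ownership of its nodes.
  Node* node = allocate_( t );
  memory_[ t ].push_back( node );
  return node;
}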

@hakonsbm added labels on Feb 25, 2022: S: Normal (handle this with default priority), T: Maintenance (work to keep up the quality of the code and documentation), I: No breaking change (previously written code will work as before; no one should notice anything changing aside from the fix).
@hakonsbm added this to PRs in progress in Kernel via automation on Feb 25, 2022.
@heplesser (Contributor)

@hakonsbm That is an impressive reduction in memory use. It would be interesting to see construction and simulation times for realistic network sizes as well. And could memory consumption be reduced even further using BlockVector without noticeable performance loss?

@hakonsbm (Contributor, Author) commented on Mar 1, 2022

@heplesser For connection and simulation the memory_ vector is actually not used. The node pointer is added to local_nodes_, which is the one used for connection and simulation:

// Create the node via the model's per-thread storage.
Node* node = model.allocate( t );
node->set_node_id_( node_id );
node->set_nc_( nc_ptr );
node->set_model_id( model.get_model_id() );
node->set_thread( t );
node->set_vp( kernel().vp_manager.thread_to_vp( t ) );
node->set_local_device_id( num_thread_local_devices_[ t ] - 1 );
node->set_initialized();
// Register the node for use during connection and simulation.
local_nodes_[ t ].add_local_node( *node );

Therefore, connection and simulation times should be unaffected, which is supported by a run of hpc_benchmark at scale 20.

Using BlockVector instead of std::vector actually uses more memory (696 MB with BlockVector vs. 306 MB with std::vector). That is because a BlockVector initializes 1024 elements when it is created, and this happens 256 times (once for each thread) for every model, leading to a large overhead.
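
As a rough illustration of that overhead: 1024 pre-allocated elements × 256 threads is already 262,144 elements per model before a single node has been created.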

@heplesser (Contributor)

@hakonsbm Good to know that building and simulation times are not affected. But I think memory_ is used in the sense that this is where the neuron objects are stored, so if the memory layout had an effect on performance, we might have seen changes. Given that vector stores objects contiguously, this is presumably the optimal memory layout.

@heplesser (Contributor) left a review:

Looks generally good, just a few comments/questions.

* Initialize the pool allocator with the Node specific values.
*/
virtual void init_memory_( sli::pool& ) = 0;

/**
* Allocate a new object at the specified memory position.
Review comment (Contributor):
Comment is outdated.

@@ -250,22 +232,17 @@ class Model
   /**
    * Memory for all nodes sorted by threads.
    */
-  std::vector< sli::pool > memory_;
+  std::vector< std::vector< Node* > > memory_;
 };


inline Node*
Model::allocate( thread t )
{
Review comment (Contributor):

Are allocate() and allocate_() actually good names? Wouldn't create or clone be more readable?


#pragma omp parallel
{
  const index t = kernel().vp_manager.get_thread_id();

  try
  {
    model.reserve_additional( t, max_new_per_thread );
Review comment (Contributor):
Could it make sense to keep reserve_additional to avoid unnecessary automatic vector resizings?
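
If it were kept, a minimal sketch of reserve_additional on top of the new std::vector-based storage could look like this (signature and body are assumptions, not part of this PR):

void
Model::reserve_additional( thread t, size_t additional_nodes )
{
  // Reserve capacity for the nodes about to be created on thread t, so
  // the push_backs during node creation do not trigger repeated
  // automatic reallocations of the underlying vector.
  memory_[ t ].reserve( memory_[ t ].size() + additional_nodes );
}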

@jougs (Contributor) left a review:

I'm fine with the changes here. Many thanks!

More fundamentally, I'm wondering if we wouldn't be better off with simple thread-local vectors of nodes of a certain type in the GenericModel, instead of packing them into custom pool-based allocators that were never really tested or designed for performance... But that's probably for another time.
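
Purely as an illustration of that idea (not part of this PR; GenericModel exists in the code base, but the member shown here and the template parameter name are assumptions):

template < typename ElementT >
class GenericModel : public Model
{
  // ... existing interface unchanged ...

  // Hypothetical: store the concrete node objects themselves, one vector
  // per thread, instead of going through a pool-based allocator. Pointer
  // stability across resizes would need care, e.g. reserving capacity up
  // front or using a block-based container.
  std::vector< std::vector< ElementT > > nodes_;
};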

Review comment on nestkernel/model.h (outdated, resolved).
Kernel automation moved this from PRs in progress to PRs approved Mar 7, 2022
Co-authored-by: Jochen Martin Eppler <jougs@gmx.net>
@heplesser merged commit bd79d7c into nest:master on Mar 7, 2022
Kernel automation moved this from PRs approved to Done (PRs and issues) Mar 7, 2022
@hakonsbm deleted the sli_pool_replacement branch on Mar 8, 2022