
Add RankMap #12630

Merged
merged 2 commits into next from rank_map_12629 on May 17, 2019
Conversation

friedmud
Contributor

Added a new utility object, RankMap, and used it to fix a bug in MemoryUsage and to add some new capability to WorkBalance. Also added a HardwareIDAux that can paint the assignment of elements to compute nodes in the cluster into a field.

Here's an example using WorkBalance and VectorPostprocessorVisualizationAux to look at the amount of internode communication using two different partitioners:

[screenshot: internode communication compared for two partitioners, visualized with WorkBalance and VectorPostprocessorVisualizationAux]

And here is an example of using HardwareIDAux to diagnose how MPI process placement on the nodes affects METIS and a hierarchical partitioner (for better descriptions of both of these cases, see the documentation added here).

[screenshot: HardwareIDAux showing MPI process placement under METIS vs. a hierarchical partitioner]

I also think this object will be useful in helping with placement of apps within a MultiApp in the future...
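For readers skimming the diff below, here is a minimal sketch of how the RankMap accessors might be used from another MOOSE object. The calls to _app.rankMap(), hardwareID() and ranks() come from this PR; processor_id(), Moose::out, and the loop body are illustrative assumptions.

// Sketch only: assumes a MooseObject-like context where _app and processor_id() are available.
const RankMap & rank_map = _app.rankMap();

// Which physical compute node does my rank live on?
unsigned int my_hardware_id = rank_map.hardwareID(processor_id());

// Which ranks share that node with me?
for (processor_id_type rank : rank_map.ranks(my_hardware_id))
  Moose::out << "Rank " << rank << " is on node " << my_hardware_id << std::endl;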

closes #12629

@friedmud friedmud force-pushed the rank_map_12629 branch 3 times, most recently from 703cac9 to 94ea949 Compare December 27, 2018 21:00
@@ -0,0 +1,65 @@
#include "RankMap.h"
Member

Header

*/
const std::vector<processor_id_type> & ranks(unsigned int hardware_id) const
{
return _hardware_id_to_ranks.at(hardware_id);
Member

Are you sure you want to use at()? It throws and doesn't terminate MPI "well" in parallel. If you intend for this to be a MOOSE utility, it probably should be changed to provide a better error message.

Member

I don't think this will be widely used outside of the MOOSE team so it's fine if you want to leave it.

Contributor Author

I'm ok with it throwing... but you can tell me if you'd rather have me change it.

It has to be .at() or some find() thing though because the function is const.

Member

Right, we've switched a few of these to this:

auto item = map.find(key);
if (item == map.end())
  mooseError("Item not found");
return item->second;

If you ever hit one of these at() calls in parallel it can be infuriating. If a non-zero rank gets there first, your application will just quit and you won't get any errors in the error log, nothing...

Contributor Author

Sure - I'll change it 👍

{
current_id = next_id++;

world_rank_to_hardware_id.emplace_hint(it, world_rank, current_id);
Member

👍

@dschwen
Member

dschwen commented Jan 10, 2019

CSV differ errors out with KeyError: 'partition_harwdware_id_surface_area'. Did you forget to update a gold file?

@friedmud friedmud force-pushed the rank_map_12629 branch 2 times, most recently from c458789 to ecd632a Compare January 10, 2019 00:42
@moosebuild
Contributor

moosebuild commented Jan 10, 2019

Job Documentation on 86c1dd8 wanted to post the following:

View the site here

This comment will be updated on new commits.

@@ -47,18 +47,24 @@ validParams<WorkBalance>()
WorkBalance::WorkBalance(const InputParameters & parameters)
: GeneralVectorPostprocessor(parameters),
_system(getParam<MooseEnum>("system")),
_rank_map(_app.rankMap()),
Member

Is there no way to design this with composition or inheritance? I'd prefer not to have to add a VPP right in the core of MOOSE when most people probably won't use it.

Member

Perhaps you could make a unique_ptr to RankMap instead. Then when it's requested through the accessor you can build it? That would avoid having that object active for every simulation on the planet.
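A rough sketch of that lazy-construction idea, with hypothetical member and constructor arguments (the real MooseApp interface may differ):

// Header (hypothetical): built only on first request.
// std::unique_ptr<RankMap> _rank_map;

const RankMap &
MooseApp::rankMap()
{
  if (!_rank_map)
    _rank_map = std::make_unique<RankMap>(_comm, _perf_graph); // constructor args illustrative
  return *_rank_map;
}

That would keep the object out of every simulation that never asks for it.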

@stale

stale bot commented Mar 3, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale PRs that have reached or exceeded 90 days with no activity label Mar 3, 2019
@stale

stale bot commented Mar 4, 2019

Closing due to 30 days of inactivity. See http://mooseframework.org/moose/framework_development/patch_to_code.html

@stale stale bot closed this Mar 4, 2019
@friedmud friedmud reopened this May 8, 2019
@stale stale bot removed the stale PRs that have reached or exceeded 90 days with no activity label May 8, 2019
@friedmud friedmud force-pushed the rank_map_12629 branch 5 times, most recently from cc06377 to 021d92b Compare May 8, 2019 19:59
…titioning and fix a bug in MemoryUsageReporter closes idaholab#12629
Member

@permcody permcody left a comment

A few minor comments and a few things that probably should be cleaned up. Thanks

// Each process on the same node will end up with the same world_rank
processor_id_type world_rank = 0;

#ifdef LIBMESH_HAVE_MPI
Member

Splitting is supported in the libMesh communicator object. You can drop this ifdef and the raw MPI here.

Member

OK - I opened a ticket so we can eventually pull this back out:
libMesh/libmesh#2128

Contributor Author

MPI_Comm_split_type is not in libMesh
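For context, the raw-MPI node detection under discussion presumably boils down to something like this minimal sketch (not the PR's exact code):

#include <mpi.h>

// Split MPI_COMM_WORLD into one sub-communicator per shared-memory node.
MPI_Comm shmem_raw_comm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmem_raw_comm);

// Every rank on the same node ends up in the same sub-communicator; those groups
// can then be numbered consecutively to produce the "hardware IDs".
int node_local_rank = 0;
MPI_Comm_rank(shmem_raw_comm, &node_local_rank);

MPI_Comm_free(&shmem_raw_comm);

Until libMesh/libmesh#2128 is resolved, the ifdef guard around the raw MPI calls has to stay.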

// Assign a contiguous unique numerical id to each shared memory group
unsigned int next_id = 0;

for (auto pid = beginIndex(world_ranks); pid < world_ranks.size(); pid++)
Member

beginIndex -> MooseIndex

Contributor Author

👍
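If I remember the macro correctly, the rewritten loop would read roughly as follows; this assumes MooseIndex(container) expands to an index type matching the container's size type.

// Index type derived from the container, replacing the beginIndex helper.
for (MooseIndex(world_ranks) pid = 0; pid < world_ranks.size(); pid++)
{
  // ... assign hardware IDs exactly as in the original loop body ...
}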

PerfID _construct_timer;

/// Map of hardware_id -> ranks on that node
std::map<unsigned int, std::vector<processor_id_type>> _hardware_id_to_ranks;
Member

possible unordered_map here?

Contributor Author

👍
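For reference, the switched declaration would simply be the following (same key and value types, hash-based lookup instead of ordered traversal):

#include <unordered_map>

/// Map of hardware_id -> ranks on that node
std::unordered_map<unsigned int, std::vector<processor_id_type>> _hardware_id_to_ranks;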

* Returns the "hardware ID" (a unique ID given to each physical compute node in the job)
* for a given processor ID (rank)
*/
unsigned int hardwareID(processor_id_type pid) const { return _rank_to_hardware_id[pid]; }
Member

You probably should have a mooseAssert() - yes I realize that it will always be sized by num_processors, but if somebody passes something stupid like a DOF number to it... Just sayin' 😉
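Something like the following is presumably what's being asked for; a sketch only, with the exact assertion message left to the author:

unsigned int hardwareID(processor_id_type pid) const
{
  // Guard against nonsense inputs (e.g. a DOF number); the vector is sized by the number of processors.
  mooseAssert(pid < _rank_to_hardware_id.size(), "PID out of range in RankMap::hardwareID()");
  return _rank_to_hardware_id[pid];
}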

}

// Free up the memory
MPI_Comm_free(&shmem_raw_comm);
Member

This can go too if you use the communicator wrapper.

framework/src/utils/RankMap.C (2 resolved review threads)
const std::vector<unsigned int> & rankHardwareIds() const { return _rank_to_hardware_id; }

protected:
PerfID _construct_timer;
Member

I've been making these const. May as well, they never change.

Contributor Author

👍

We should update the documentation to recommend this.
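For anyone following along, the const version would look roughly like this, assuming the timer is registered through PerfGraphInterface in the constructor's initializer list (the section name and level are guesses):

// Header: the PerfID never changes after registration, so it can be const.
const PerfID _construct_timer;

// Source: initialize in the member-initializer list.
RankMap::RankMap(/* ... */)
  : /* ..., */ _construct_timer(registerTimedSection("construct", 2))
{
}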

friedmud added a commit to friedmud/moose that referenced this pull request May 9, 2019
@permcody permcody merged commit 1e51abf into idaholab:next May 17, 2019
Successfully merging this pull request may close: Issue in MemoryUsageReporter.
4 participants