Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement convenience workflow for using distributed mesh #10623

Merged
merged 4 commits into from
Feb 20, 2018

Conversation

rwcarlsen
Copy link
Contributor

A few parts of this implementation feel a bit hacky - suggestions on improving it are certainly welcome.

Fixes #10458. Note that the mesh splitting won't work right until we
get another libmesh update in :-(

@moosebuild
Copy link
Contributor

moosebuild commented Jan 26, 2018

Job Documentation on 59b8479 wanted to post the following:

View the site here

This comment will be updated on new commits.

// splits s with whitespace and commas as delimiters and converts the split elements to integers
// returned in a vector.
std::vector<unsigned int>
splitstr(const std::string & s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How convenient... I'll switch.

@@ -562,20 +600,45 @@ MooseApp::runInputFile()

_action_warehouse.executeAllActions();

if (isParamValid("mesh_only"))
if (getParam<std::string>("split_mesh") != "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend that you break out this new logic into a subroutine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, you probably don't want to run all Actions when doing the splits. Building the equation system might be really expensive for instance. A much better design would be to run the subset of Actions needed to build the mesh and then do the splits.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll piggy back on mesh-only logic.

"",
"comma-separated list of numbers of chunks to split the mesh into");

params.addCommandLineParam<std::string>("split_file",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this parameter should be pluralized? split_files?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, but the user only passes a single file name as the argument to that flag - so it is a bit weird no matter how you do the pluralization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ok with split_file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In MOOSE we call these file_bases. Perhaps that would be a better name (more in line with what users see in the Outputs block.

Copy link
Contributor Author

@rwcarlsen rwcarlsen Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but there is actually a file that gets created that is named exactly what the user types in there. As currently implemented, it is left to the user to type in the ".cpr" extension. This was done to make is easy for tab completion to work if the user is trying to use an already split mesh. I think it would be confusing to call it [bla]-base if we expect the user to type in a file extension.

* Returns the name of the mesh file read to produce this mesh if any or an empty string
* otherwise.
*/
virtual std::string getFileName() const { return ""; }
Copy link
Member

@permcody permcody Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable new API to add, but I know we have several other mesh formats that read files spread all over (there maybe another in the framework, there are several in the modules, etc). You should probably take a look and make them all override this new interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look. Not sure what to do about the meshes that read multiple files though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done.

* Removes any file extension from the fiven string s (i.e. any ".[extension]" suffix of s) and
* returns the result.
*/
std::string stripExtension(const std::string & s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't strip extensions - it splits file name from prefix dirpaths.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the method called stripExtensions doesn't actually strip extensions 😛 ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - the function you linked (splitFileName) doesn't strip extensions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@friedmud friedmud self-assigned this Jan 26, 2018
Copy link
Contributor

@friedmud friedmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good - just a few things.

// splits s with whitespace and commas as delimiters and converts the split elements to integers
// returned in a vector.
std::vector<unsigned int>
splitstr(const std::string & s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have MooseUtils::tokenize() that does this.

@@ -562,20 +600,45 @@ MooseApp::runInputFile()

_action_warehouse.executeAllActions();

if (isParamValid("mesh_only"))
if (getParam<std::string>("split_mesh") != "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not isParamValid()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That always returns true for parameters that have a default. I suppose split_mesh doesn't need a default though - I'll take a look.

auto & mesh = _action_warehouse.mesh();

if (mesh->getFileName() == "" && getParam<std::string>("split_file") == "")
mooseError("must specify an output mesh file name (--split-file) when splitting "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

processor_id_type n = splits[i];
Moose::out << "splitting " << n << " ways..." << std::endl;

auto cpr = libMesh::split_mesh(*mesh, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the reset to 1 partition happen inside split_mesh()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: maybe the libMesh interface should really just take the vector of splits... and do this loop internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I thought about that, but this allows the callers of the API more flexibility on how they want to handle the checkpoint object - they can switch it into ascii mode, etc. without having to pass a bajillion options into the splitting API. The goal was to make the libmesh splitting API do as close to one thing as possible and still preserve flexibility for callers.

for (std::size_t i = 0; i < splits.size(); i++)
{
processor_id_type n = splits[i];
Moose::out << "splitting " << n << " ways..." << std::endl;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize s?

Copy link
Member

@permcody permcody left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good.

if (_app.parameters().get<bool>("use_split"))
{
auto split_file = _app.parameters().get<std::string>("split_file");
if (_type != "FileMesh")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think this is right. We have several other FileMesh derived types. I know you haven't instantiated the object here yet so using a virtual method or some other normal trick doesn't work but we might be able to make this all more robust with the old validParams trick (embedding a "base" type in validParams).

auto mesh = _app.actionWarehouse().mesh();

if (mesh->getFileName() == "" && _app.parameters().get<std::string>("split_file") == "")
mooseError("must specify an output mesh file name (--split-file) when splitting "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't in the style guide but, we typically write sentences (start with a capital letter) in all of our error messaging throughout MOOSE - please change.


auto splitstr = _app.parameters().get<std::string>("split_mesh");
std::vector<unsigned int> splits;
bool success = MooseUtils::tokenizeAndConvert(splitstr, splits, ", ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Moose::out << "splitting " << n << " ways..." << std::endl;

auto cpr = libMesh::split_mesh(*mesh, n);
Moose::out << " * writing " << cpr->current_processor_ids().size() << " files per process..."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: OK, but we really should revisit the whole console stream interface so that we can use it more consistently. I know you can't here because your messages would be buffered.

for (std::size_t i = 0; i < splits.size(); i++)
{
processor_id_type n = splits[i];
Moose::out << "splitting " << n << " ways..." << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More lower case messages coming out here. We might need to have a discussion on this.

{
_syntax.registerTaskName("split_mesh", true);
_syntax.addDependency("split_mesh", "setup_mesh_complete");
_action_warehouse.setFinalTask("split_mesh");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -527,6 +542,12 @@ MooseApp::setupOptions()
_syntax.addDependency("mesh_only", "setup_mesh_complete");
_action_warehouse.setFinalTask("mesh_only");
}
else if (getParam<std::string>("split_mesh") != "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does isParamValid() not work here and in other usages of this new parameter?

* Removes any file extension from the fiven string s (i.e. any ".[extension]" suffix of s) and
* returns the result.
*/
std::string stripExtension(const std::string & s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

rwcarlsen added a commit to rwcarlsen/moose that referenced this pull request Feb 2, 2018
This is necessary because checkpoint files will be in different
locations and named differently depending on mesh splitting and
processor number.  The naive file check we did before won't work.  So
for now we skip the check for cpr/cpa files.

This is necessary for an upcoming libmesh PR that is necessary for the
upcoming idaholab#10623 moose PR.
@rwcarlsen rwcarlsen force-pushed the split-mesh branch 2 times, most recently from 000ecc5 to f46b090 Compare February 2, 2018 19:40
@rwcarlsen rwcarlsen force-pushed the split-mesh branch 2 times, most recently from 82279e7 to 874c694 Compare February 12, 2018 19:37
@rwcarlsen
Copy link
Contributor Author

This should be ready for another look now that the new libmesh stuff is in.

new_pars.set<MeshFileName>("file") = split_file;
new_pars.set<MooseApp *>("_moose_app") = _moose_object_pars.get<MooseApp *>("_moose_app");
new_pars.set<MooseEnum>("parallel_type") = _moose_object_pars.get<MooseEnum>("parallel_type");
new_pars.set<MooseEnum>("parallel_type") = "distributed";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is correct, two sets in a row for the same parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

{
if (split_file == "")
mooseError("cannot use split mesh with non-file mesh without specifying --split-file on "
"command line");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is OK, it would be cool if we could use the virtual function but that might be a little tricky before the object is created.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For generated mesh, we don't know the name of the pre-split files. There is no way we could divine it. The autosplit generated mesh and run in one go is a separate thing that will have to be implemented later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @rwcarlsen here - this is the right thing to do.

@@ -152,6 +153,34 @@ SetupMeshAction::setupMesh(MooseMesh * mesh)
void
SetupMeshAction::act()
{
// switch non-file meshes to be a file-mesh if using a pre-split mesh configuration.
if (_app.parameters().get<bool>("use_split"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't need to run this logic twice (or more). You only need to run it during the "setup_mesh" task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you are saying here. Could you clarify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Take a look at the code below this if statement (e.g. if _current_task == ...). This Action gets fired multiple time to perform different tasks. You only need to run this new logic for the setup task, not the init task.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, it's not about efficiency or anything, just getting it right logically so we don't look back at this later and wonder why this block runs for every task.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - I see - good catch. It is easy to forget that multiple tasks can use/run the same action.

permcody
permcody previously approved these changes Feb 13, 2018
@rwcarlsen
Copy link
Contributor Author

Just waiting for a review update from @friedmud.

@friedmud
Copy link
Contributor

Grabbing this now...

Fixes idaholab#10458.  Note that the mesh splitting won't work right until we
get another libmesh update in :-(
Copy link
Contributor

@friedmud friedmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add .md documentation on how to use this.

@friedmud
Copy link
Contributor

When you didn't specify --split-file you get an error from every processor:

mpiexec -n 4 ../../../moose_test-opt -i simple_diffusion.i  --split-mesh "2,4"





*** ERROR ***
Output mesh file name must be specified (with --split-file) when splitting non-file-based meshes






*** ERROR ***
Output mesh file name must be specified (with --split-file) when splitting non-file-based meshes






*** ERROR ***
Output mesh file name must be specified (with --split-file) when splitting non-file-based meshes


application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2




*** ERROR ***
Output mesh file name must be specified (with --split-file) when splitting non-file-based meshes


application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

If you're running on thousands of procs that is going to suck. Let's just output that error on proc 0.

@friedmud
Copy link
Contributor

When we specify a filename using --split-file should we append .cpr on the name of it? It's a little funky to just get a directory out named "stuff" or whatever... but it could be ok

@permcody
Copy link
Member

So I've often thought about what to do about parallel mooseErrors. It's actually a really big pain because we have mooseErrors sprinkled all over the code and more often than not, they are in common code paths so this probably isn't isolated. How many times have you submitted an input file with a simple problem only to get a thousand lines of mooseErrors from every rank?

Outputting on processor zero is actually reasonable when we know it's on a common code path. Maybe we can give this some thought and be smart about when errors come out? I guess for now for this very common and specific error we can add one more if (processor_id() == 0) branch, but we should think about making errors more useful as we continue to evolve MOOSE.

@friedmud
Copy link
Contributor

Actually: we should reject the filename if you don't put .cpr on the end of it!

It turns out we can't read it!

[gastdr][~/projects/moose/test/tests/kernels/simple_diffusion]> mpiexec -n 4 ../../../moose_test-opt -i simple_diffusion.i  --split-file junk --use-split

 ERROR: Unrecognized file extension: junk
   I understand the following:

     *.e    -- Sandia's ExodusII format
     *.exd  -- Sandia's ExodusII format
     *.gmv  -- LANL's General Mesh Viewer format
     *.mat  -- Matlab triangular ASCII file
     *.n    -- Sandia's Nemesis format
     *.nem  -- Sandia's Nemesis format
     *.off  -- OOGL OFF surface format
     *.ucd  -- AVS's ASCII UCD format
     *.unv  -- I-deas Universal format
     *.vtu  -- Paraview VTK format
     *.inp  -- Abaqus .inp format
     *.xda  -- libMesh ASCII format
     *.xdr  -- libMesh binary format
     *.gz   -- any above format gzipped
     *.bz2  -- any above format bzip2'ed
     *.xz   -- any above format xzipped
     *.cpa  -- libMesh Checkpoint ASCII format
     *.cpr  -- libMesh Checkpoint binary format

[0] src/mesh/namebased_io.C, line 261, compiled nodate at notime




*** ERROR ***
 ERROR: Unrecognized file extension: junk
   I understand the following:

     *.e    -- Sandia's ExodusII format
     *.exd  -- Sandia's ExodusII format
     *.gmv  -- LANL's General Mesh Viewer format
     *.mat  -- Matlab triangular ASCII file
     *.n    -- Sandia's Nemesis format
     *.nem  -- Sandia's Nemesis format
     *.off  -- OOGL OFF surface format
     *.ucd  -- AVS's ASCII UCD format
     *.unv  -- I-deas Universal format
     *.vtu  -- Paraview VTK format
     *.inp  -- Abaqus .inp format
     *.xda  -- libMesh ASCII format
     *.xdr  -- libMesh binary format
     *.gz   -- any above format gzipped
     *.bz2  -- any above format bzip2'ed
     *.xz   -- any above format xzipped
     *.cpa  -- libMesh Checkpoint ASCII format
     *.cpr  -- libMesh Checkpoint binary format



application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

@friedmud
Copy link
Contributor

@permcody what about just having a mooseErrorOnProcessor0() macro?

@friedmud
Copy link
Contributor

Or - I guess, just an argument to mooseError()

@permcody
Copy link
Member

permcody commented Feb 14, 2018

The problem with that is that it relies on the developer using it correctly which might not always be as easy as it seems. You run the risk of flying right through an error condition if for some reason processor zero doesn't encounter the error but some of the other ranks do, but are told to just carry on.

I can think of two other alternatives that might be better (might not). We could use the libMesh "parallel_only" macro that does an assert to make sure the error is in a parallel region. The problem with that, is the assert goes away in your big parallel run when you actually need it to error and you still miss the error in the worst case.

Another alternative would be to still call abort on the non-zero ranks but not print? The downside to that is that on a really big job you run the risk of just terminating with no error message what-so-ever which would probably be way worse than seeing a thousand of the same message.

I'm not sure I like any alternative.

cpr->binary() = true;
auto fname = MooseUtils::stripExtension(mesh->getFileName()) + ".cpr";
if (mesh->getFileName() == "")
fname = _app.parameters().get<std::string>("split_file");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few ways to go here. Line 52 could be before line 50... but that would mean that running the same command-line twice might not work.

Probably better just to error if the split_file does not have a .cpr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually - let's add the .cpr here... and then make sure it gets added properly when we're reading in too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible that it is cpa or the user renamed it to something random though. I think if the user uses the --split-file flag we should just use that wholesale - this especially makes sense with tab completion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't work though - have you tried it?

If the user names it anything other than .cpr it doesn't work. I think we should make sure that .cpr is on the end (if the user passes it just leave it alone - if they don't then append it... for both writing and reading)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add the check for the extension when writing.

@permcody
Copy link
Member

Option #3: Just get in the habit of using separate file outputs for big jobs or ignoring your error stream instead of combining it into your output stream. This is what I've done recently because I almost never want to see errors.

@friedmud
Copy link
Contributor

@permcody : well I guess it's not something to solve here

@rwcarlsen
Copy link
Contributor Author

Recommendations for where (what file(s)) to put docs? This isn't really a system/object so it can't really follow the "docs go in the file corresponding to this class/object" pattern.

@friedmud
Copy link
Contributor

Add a short section here: http://mooseframework.org/moose/documentation/systems/Mesh/index.html

Then link to a longer explanation.

@rwcarlsen
Copy link
Contributor Author

Alright - I pushed up some docs.

Copy link
Contributor

@friedmud friedmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation is 👍

But we need to get the name extension thing right...

cpr->binary() = true;
auto fname = MooseUtils::stripExtension(mesh->getFileName()) + ".cpr";
if (mesh->getFileName() == "")
fname = _app.parameters().get<std::string>("split_file");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually doesn't work though - have you tried it?

If the user names it anything other than .cpr it doesn't work. I think we should make sure that .cpr is on the end (if the user passes it just leave it alone - if they don't then append it... for both writing and reading)

Copy link
Contributor

@friedmud friedmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to add the .cpr for both writing and reading or you lose symmetry in the commands:

[gastdr][~/projects/moose/test/tests/kernels/simple_diffusion]> ../../../moose_test-opt -i simple_diffusion.i --split-mesh 4 --split-file junk

Splitting 4 ways...
    - writing 4 files per process...

[gastdr][~/projects/moose/test/tests/kernels/simple_diffusion]> ls
gold			junk.cpr		simple_diffusion.i	simple_diffusion_out.e	speedtests		tests
[gastdr][~/projects/moose/test/tests/kernels/simple_diffusion]> mpiexec -n 4 ../../../moose_test-opt -i simple_diffusion.i --use-split --split-file junk

*** ERROR ***
cannot locate mesh file 'junk'

@rwcarlsen
Copy link
Contributor Author

Done.

@friedmud
Copy link
Contributor

I'll merge this when the testing is complete!

Thanks for working on this... it's really slick!

@friedmud friedmud merged commit 94266fa into idaholab:devel Feb 20, 2018
rwcarlsen added a commit to rwcarlsen/moose that referenced this pull request Feb 21, 2018
rwcarlsen added a commit to rwcarlsen/moose that referenced this pull request Feb 21, 2018
rwcarlsen added a commit to rwcarlsen/moose that referenced this pull request Feb 21, 2018
rwcarlsen added a commit to rwcarlsen/moose that referenced this pull request Feb 21, 2018
@rwcarlsen rwcarlsen deleted the split-mesh branch March 12, 2018 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants