Use CheckpointIO For Mesh Splitting #7752
Comments
@idaholab/moose-developers take note! I got this all working... and it's awesome. It depends on: libMesh/libmesh#1103. Here is the summary of how awesome: this is for a problem with ~10M Hex8 elements and ~700M DoFs (but I'm not solving a linear system). These numbers are for running with 240 MPI processes. What I did is just run up to the point where ...
Yep... those numbers are real. ~40x faster startup! ~17x less RAM! Holy crap. This is a game changer for me... (BTW: I also have numbers for 720 procs for Parallel Checkpoint: ~20 s and ~250 MB RAM/process.) Here are the raw timing numbers for Exodus and Parallel Checkpoint: ...
I guess it depends on how you spin it. If you are using 240 processors each ... Before you start celebrating too soon, that RAM and startup savings isn't ... @idaholab/moose-team
@friedmud How is this set up? In lots of our codes, we assumed a replicated mesh. I want to see how many tests will fail if the mesh is made parallel. I really like this and want it.
@YaqiWang, it should be fine if you do not use mesh adaptivity much.
@permcody true on the efficiency :-) but who cares? 700 million DoFs in 400 MB. Anything under 1 GB per process is awesome. I'm also happy to see it go down even more after spreading it out more. There will always be some fixed-size overhead... but, honestly, I'm amazed it's this low.

As for being slower with parallel Mesh... that's going to be problem dependent. For normal stuff there won't be any impact. For mesh adaptivity there is quite a bit of overhead. If you're wanting to output Exodus there will be a big overhead (the Mesh has to be serialized... and so does the solution vector). Also, contact stuff may be a bit slower. Other than that it shouldn't impact solve speed. What were you doing with it?

For my application it runs EXACTLY the same speed with and without distributed Mesh. My algorithm is already domain decomposed... it doesn't matter if there are extra non-local elements there or not. (Already confirmed this last night.) BTW: I was doing this with the Mesh files on ...

Oh: this also makes threading even less useful. The only time threading used to be a good idea is when you were RAM limited... this removes some of the times you would be in that situation.

@YaqiWang I'm going to finalize the process for this in the next couple of days so you can try it out.
Yeah I was being facetious. I guess we're going to have to quit telling people to knock it off when they start using DistributedMesh now!
Now you are starting to sound more like the PETSc guys. You don't need threading, you just sometimes need "shared memory". The new version of MPI allows that: hierarchical communicators with shared-memory support.

@YaqiWang - You can try out DistributedMesh by using the "parallel_type" option in the Mesh block. However, that's only half of the battle. Derek is pre-splitting his meshes and reading them in already split, so there are really two steps here. All the pieces you need aren't merged yet, so give us some time. There are still a few bugs (assertions) that we're seeing with DistributedMesh, so you probably don't want to go nuts with it until we know everything is working properly.
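For reference, the MPI-3 shared-memory calls in question look roughly like this (the buffer size and names are just illustrative):

```c++
#include <mpi.h>

int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);

  // Split the world communicator into one communicator per shared-memory node.
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);

  // Allocate a window of memory that every rank on this node can map directly.
  MPI_Win win;
  double * my_segment;
  MPI_Aint local_bytes = 1024 * sizeof(double); // illustrative size
  MPI_Win_allocate_shared(local_bytes, sizeof(double), MPI_INFO_NULL,
                          node_comm, &my_segment, &win);

  // ... ranks on the same node can now share data without threads ...

  MPI_Win_free(&win);
  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```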
@permcody Maybe so! I'll definitely lower my threshold for when you should go to DistributedMesh... it's now planted at around 1 million elements. One of the reasons why I've always warned people away from it is that we didn't have a reliable tool for pre-splitting the meshes.

My new splitter is awesome. You can run with any number of MPI processes and write out partitions for any number of processors (like use 10 MPI processes to write out files for 1000 processes). That makes a huge difference over our old tools. Also: since it's using our existing partitioner infrastructure we can use any of our partitioners with it easily (or make our own).

Where do you think I should put the splitter? Should it be in our contrib? Should it be one of those libMesh executables that always gets built with libMesh? Should there not be a separate executable... and maybe a command-line option to any MOOSE-based executable should automatically invoke the splitting? (I kind of like that last one.) I think I'll try to implement that last one and we can see what it looks like...
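Roughly, the splitter boils down to something like this (a sketch only, not the actual splitter source: the file names and target processor count are made up, the `parallel()`/`binary()` flags are from memory, and writing N pieces from fewer ranks leans on the new CheckpointIO plumbing in the libMesh PR mentioned above):

```c++
#include "libmesh/libmesh.h"
#include "libmesh/replicated_mesh.h"
#include "libmesh/metis_partitioner.h"
#include "libmesh/checkpoint_io.h"

using namespace libMesh;

int main(int argc, char ** argv)
{
  LibMeshInit init(argc, argv);

  // Read the unsplit mesh once (this is the expensive part).
  ReplicatedMesh mesh(init.comm());
  mesh.read("big_mesh.e"); // hypothetical input file

  // Partition for the target number of processors -- not necessarily
  // the number of ranks the splitter itself is running on.
  const unsigned int target_procs = 1000;
  MetisPartitioner partitioner;
  partitioner.partition(mesh, target_procs);

  // Write the split mesh in checkpoint format so a DistributedMesh run
  // can later read just its own piece.
  CheckpointIO cpio(mesh, /*binary=*/true);
  cpio.parallel() = true;
  cpio.write("big_mesh_split.cpr"); // hypothetical output name

  return 0;
}
```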
I'd love it as src/apps/meshsplit.C in libMesh.
I agree with Roy; let's put it in libMesh.
Ok - I may do both ;-)
@permcody we were WAY off on the parallel efficiency. You can't compare a ReplicatedMesh run at 240 procs to a DistributedMesh run at 240 procs.... you need to compare both to a truly serial run... I just did that. On one processor this problem takes ~85,000 MB of RAM (written that way for easy computation)... So... to add to the table I had earlier... here is the memory scaling efficiency:
"Perfect" memory scaling would have been 354 MB. The 400MB number was kind of an "eyeball" average anyway... some of the processes were less than that... and a few were more (just looking at the head node for the job) At any rate... our memory scaling efficiency is actually REALLY good. I'm going to be generating plots of this for my upcoming paper which will make it easier to see. |
This is very interesting. I was just messing with you and I was only referring to timing. The memory scaling is actually more important here anyway. So for the ReplicatedMesh run, I'm having a hard time understanding why one process would require 85 GB of RAM on the mesh but drop to 6.5 GB of RAM per process when we are reading it in on 240 procs. If it's replicated, shouldn't it be almost equal per process? We aren't considering the EQ objects or anything else, right? Is this just the memory required to hold the mesh? What tool are you using to measure memory usage?
No - all of these numbers are the total memory for the process (which is everything, not just the mesh).
Oh, OK. Since all of the pieces of the EQs are distributed, that would explain the huge difference between the parallel and serial case. From a frameworks perspective, we really should care about both. This helps give us more information about when DistributedMesh should be used versus Replicated.
I think in this case you can basically look at it like there is a 6 GB overhead for using ReplicatedMesh when running on 240 processors. That's a lot of overhead!
I guess there are cases where ReplicatedMesh is necessary. I do want to make DistributedMesh the default though.
No - not the default, at least not universally. When running small meshes ...
Agreed - definitely don't want Distributed to be the default. In particular... it requires extra steps to split your mesh before running. That's just unnecessary for 90% of the runs out there with MOOSE.
It's possible to use distributed mesh if you don't split first. You'll lose ...
Depends on your problem. For most of my problems the initial memory spike kills my runs... I really think the utility of DistributedMesh is only unlocked if you split first.
Really? I'm running 180^3 without any problems in ReplicatedMesh mode. Are you running that much bigger meshes? I think the Mesh structure is around 5-6 GB per process at that size. Maybe down the road, we should think about a live-splitter mechanism. It should be feasible to create a sub-communicator using maybe 10% of the total procs to read and split the mesh. Then we could launch the run, all online. Perhaps overkill, I don't know. With your new utility maybe it's not so much of a pain, but if you are running on a cluster, having to schedule two jobs (one to split, then one to run) is not always convenient.
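For what it's worth, the sub-communicator part of that idea could look roughly like this in plain MPI (the 10% split and the names here are just illustrative):

```c++
#include <mpi.h>

int main(int argc, char ** argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Put roughly the first 10% of the ranks into a "reader" communicator
  // that would read and split the mesh; everyone else waits.
  const int n_readers = (size + 9) / 10;
  const int color = (rank < n_readers) ? 0 : 1;

  MPI_Comm sub_comm;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &sub_comm);

  if (color == 0)
  {
    // ... readers: read the mesh, partition it for 'size' processors,
    //     and hand the pieces out ...
  }

  // ... then everyone proceeds with the actual solve on MPI_COMM_WORLD ...

  MPI_Comm_free(&sub_comm);
  MPI_Finalize();
  return 0;
}
```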
At 6 GB you already can't utilize every core in a node. That's what I'm talking about. Yes, I can run my problems... but I waste a lot of cores. It's a pretty big issue when you're trying to run on ~10k cores (like one of the jobs I currently have queued). If I needed to reserve 15k-20k cores to actually use 10k... that sucks. With pre-splitting and DistributedMesh that's not the case.
@roystgnr I'm running into something here. I can't seem to be able to partition the same Mesh object twice... do you think that should work? It runs into this assert: https://github.com/libMesh/libmesh/blob/master/src/mesh/mesh_communication_global_indices.C#L785

The issue seems to be because I'm partitioning for more processors than I'm running on. The weird thing is that it works the first time... only on the second time does it hit that assert. I don't understand either thing: why it should work the first time... and/or why it doesn't the second time. Is there something I should do to "clear" the mesh before re-partitioning it?

BTW: My purpose here is to try to avoid reading the same mesh multiple times (as that can take many minutes)... but be able to write out partitionings for multiple numbers of processors. i.e. read once but output partitionings for 24, 48, 96, etc. processors.
@roystgnr I figured out a workaround... I just "reset" the partitioning before each partitioning by partitioning it for one processor. Like this:

```c++
for (auto n_procs : all_n_procs)
{
  // "Reset" the partitioning first -- without this, the second call to
  // partition() trips the assert linked above.
  partitioner.partition(mesh, 1);

  // Partition for the processor count we actually want.
  partitioner.partition(mesh, n_procs);

  // ... write out the split mesh for n_procs ...
}
```

The nice thing about resetting it to 1 is ... So... kind of ugly... but still much faster than needing to do a separate run / read of the mesh to create multiple partitionings.
So, I'm pretty sure I have this working perfectly now, but I don't know how to properly test that it's working perfectly. Ideally we want to have a test which:

- runs the splitter to pre-split a mesh,
- runs a MOOSE app with DistributedMesh on the pre-split mesh, and
- exodiffs the output against a gold file.
We can't exactly add such a test to libMesh since moose apps aren't available there, though I suppose we could just run a libMesh example code instead. Can we do this easily in the MOOSE test harness? The "RunCommand" tester looks like it's flexible enough to call a python script, which could in turn run the splitter-mooseapp-exodiff sequence we'd like, but the only example I can see of that tester is with command='echo Hello World', which is less impressive. |
Yes, you can absolutely run an arbitrary command. The ...
That should be everything we need to describe the test. However, I suspect that you really won't need the second part as you'll define success or failure in the script itself and return the appropriate exit code (which we also look at). |
Summary of changes in this update:

- Remove 'old' Makefiles from contrib and unit test directories
- ASCII option, naming fixes for apps/splitter
- Small update to RBEIMConstruction
- Put sideset/nodeset names in CheckpointIO header
- Checkpoint N->M restart fixes
- Check for exact file extensions in NameBasedIO::is_parallel_file_format()
- Update TetGenIO code for reading .ele files
- allgather(vector<string>) overload
- Store integer data type in CheckpointIO files

This is necessary to support idaholab#9782 in service of idaholab#9700, and to add more robustness to support of idaholab#7752
Based heavily on @friedmud's test in idaholab#8472; adds test coverage for issue idaholab#7752, for both ASCII and binary CheckpointIO pre-split mesh reads.
Closing this issue, as it has been completed.
Description of the enhancement or error report
Forget #7744 and #7745. Screw libMesh/libmesh#1087.
I'm over it.
We don't really need to use Nemesis for reading split meshes. What incentive do we have? All of the tools that used to create split Nemesis meshes have bit-rotted at this point.
Instead: let's just use our own format. We already have `CheckpointIO`... it just needs a few tweaks and then it should work.

Rationale for the enhancement or information for reproducing the error

We need a reliable method for creating split meshes and using them in simulations with `DistributedMesh`.

Identified impact
The ability to partition and run truly huge problems.
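As a rough sketch of what the reading side would look like once those tweaks are in (the file name is hypothetical and the exact CheckpointIO read options may differ; treat this as the idea, not the final API):

```c++
#include "libmesh/libmesh.h"
#include "libmesh/distributed_mesh.h"
#include "libmesh/checkpoint_io.h"

using namespace libMesh;

int main(int argc, char ** argv)
{
  LibMeshInit init(argc, argv);

  // Each processor reads only its own piece of the pre-split mesh.
  DistributedMesh mesh(init.comm());

  CheckpointIO cpio(mesh, /*binary=*/true);
  cpio.read("big_mesh_split.cpr"); // hypothetical pre-split file from the splitter

  mesh.prepare_for_use();

  // ... set up EquationSystems and solve as usual ...

  return 0;
}
```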