use current_local_solution for build_solution_vector #190

Merged
merged 11 commits into libMesh:master from friedmud/eq_io_opt_pr on Jan 30, 2014

Conversation

Owner

friedmud commented Jan 22, 2014

This is not a complete solution, yet... but I wanted to start a discussion.

As I've mentioned before, EquationSystems::build_solution_vector() is a PIG... and it is now keeping us from running some of our larger problems (anything over 100 million DoFs takes an enormous amount of memory during build_solution_vector()).

There are multiple problems, but this patch addresses an easy one. There is a vector called sys_soln that gets created in build_solution_vector. The current behavior is to serialize the solution vector of each system to all processors before starting to build the nodal solution vector that ultimately gets passed to an output class.

So, if you have a system with 100 million DoFs, this is creating a vector that is nearly 1GB in size ON EVERY PROCESSOR. So if you are trying to use 32 MPI ranks per node on your cluster... your memory usage per node just went up by nearly 32GB! Considering our cluster has only 64GB of RAM per node... your job most likely just crashed...
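
To put numbers on that (assuming 8-byte real scalars): 100M DoFs x 8 bytes is roughly 0.8 GB per serialized copy, and one copy on each of 32 ranks is roughly 25-30 GB of extra memory on every node, just for this one temporary vector.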

So... to get around this, this patch just uses the current_local_solution vector that we already have hanging around, which is already properly distributed in parallel. The only weirdness is that current_local_solution might not be up to date after the solve, so you need to call system.update()... but build_solution_vector() is a "const" function... so some const_cast()ing needs to happen. It's kinda ugly, but that's why I want to start a conversation.
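
For concreteness, the core of the change looks roughly like this (a sketch of the idea, not the literal diff; the system reference name is illustrative):

// Sketch only. build_solution_vector() is const, but current_local_solution
// may be stale after a solve, so constness has to be cast away to call update().
System & non_const_sys = const_cast<System &>(system);
non_const_sys.update();   // re-localizes the solution into current_local_solution

// Each rank now reads only its own (plus ghosted) entries instead of a fully
// serialized copy of the global solution:
const NumericVector<Number> & sys_soln = *system.current_local_solution;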

What other ways can we solve this issue?

While we're thinking about that, start thinking about the next problem, which is the number_of_nodes*number_of_vars sized vector that we create on every processor in this same function. It has the very same problem... but in this case it's even worse, because if you're using a serial outputter (like Exodus) you really only need that vector on processor 0... but we allocate it on every processor! Crazy!

friedmud added some commits Jan 22, 2014

more build_solution_vector() optimization. Move the solution vector and 'repeat' vector to being actual NumericVectors and have them only localize_to_one() at the end so we don't ever get a full copy of this solution vector on all of the processors (only on processor 0, which can't be avoided because we're using a solution outputter most of the time).
Owner

friedmud commented Jan 24, 2014

Ok - this is getting real now. I refactored build_solution_vector() to use NumericVectors internally so that it never makes a full copy of the solution vector on any one processor... except at the very end, where it uses localize_to_one() to get a full copy on processor 0 so that our serial output types (like Exodus) still get the full std::vector they expect to work with.
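
In outline, the new flow is something like this (a sketch, not the literal diff; sizing and fill details are elided, and names like mesh, n_vars, and soln are illustrative):

// Build a distributed vector instead of a per-processor std::vector
// (mirrors the build().release() style used in the branch).
NumericVector<Number> & parallel_soln = *(NumericVector<Number>::build().release());
parallel_soln.init(mesh.n_nodes()*n_vars, false, PARALLEL);

// ... each processor set()s only the entries for its own nodes, using values
// read from current_local_solution ...
parallel_soln.close();

// Only at the very end does a full copy exist anywhere, and only on
// processor 0, which the serial Exodus writer needs anyway:
parallel_soln.localize_to_one(soln, 0);   // soln is the std::vector<Number> handed to the outputter

delete &parallel_soln;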

A couple of things:

  1. I had to add a new operator/= method to NumericVector that does a "pointwise" divide based on the entries of another NumericVector (i.e. it does u_i = u_i / v_i for every entry in both vectors; see the sketch after this list). Currently I have only implemented this for PetscVector because I wanted to get some feedback before I go doing it for everything else.

That said - I do know how to do it for all of the other NumericVector types as well, and it won't be hard to implement if this branch gets some traction.

Adding that method made the algorithm pretty straightforward and keeps parallel communication down (there is now only parallel communication at the very end of EquationSystems::build_solution_vector()).

  2. In MeshOutput::_build_variable_names_and_solution_vector() I had to put a broadcast() of the solution vector in there for parallel output formats (like Nemesis). This is just a short-term fix until we do a bigger redesign. Note that this does exactly the OPPOSITE of what you would want to do for parallel output! But... there's no way around it, as the Nemesis outputter currently depends on receiving a std::vector with the full solution in it on every processor.
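
To illustrate item 1: the 'repeat' vector counts how many elements contributed a value to each nodal entry, so the pointwise divide turns the accumulated sums into averaged nodal values in a single call (variable names here are illustrative):

// parallel_soln holds the summed nodal contributions; repeat_count holds how
// many contributions each entry received. The entrywise divide averages them:
parallel_soln /= repeat_count;   // parallel_soln_i = parallel_soln_i / repeat_count_i

// In PetscVector this maps naturally onto PETSc's VecPointwiseDivide(); the
// other NumericVector implementations can just loop over their local entries.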

All of this brings me to the next thing to tackle: I believe that build_solution_vector() has the wrong interface. It really needs to NOT be building a std::vector at all. It should take a NumericVector to fill up... and then we should just pass that NumericVector to the output types directly and they can do what they need to with it (like localize_to_one(), etc.). But before I make that change I really want to get some buy-in from @benkirk and @roystgnr...
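
To make the proposal concrete, here's roughly the kind of interface I have in mind (purely illustrative - none of this exists yet, and the name is made up):

// Hypothetical interface, for discussion only - not existing libMesh API.
// EquationSystems fills a distributed NumericVector...
void build_parallel_solution_vector (NumericVector<Number> & soln) const;

// ...and each output class decides what it needs from it. A serial writer
// like Exodus would do its own gather onto processor 0:
//   std::vector<Number> serialized;
//   soln.localize_to_one(serialized, 0);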

Owner

friedmud commented Jan 28, 2014

Here are some plots showing the difference this branch makes...

These are for a 200M DoF problem run on 100 nodes with 8 MPI ranks each (800 total MPI procs). The amounts reported are the aggregate amounts over all 800 MPI procs. This is a steady state problem that also writes an initial condition out... so there is one write, then the solve, then another write, then the end.

To start, here's a comparison of the current libMesh master to this branch:
[Plot: master vs. branch, aggregate memory usage]

First, it's using ~30x less memory overall. Second, it also runs 2x faster...

And here's some detail of just what the memory profile looks like for the branch.
[Plot: branch only, aggregate memory profile]

I'm also going to show the difference on just the head node with some other plots... stay tuned

Owner

benkirk commented Jan 28, 2014

All of this brings me to the next thing to tackle: I believe that build_solution_vector() has the wrong interface. It really needs to NOT be building a std::vector at all. It should take a NumericVector to fill up... and then we should just pass that NumericVector to the output types directly and they can do what they need to with it (like localize_to_one(), etc.). But before I make that change I really want to get some buy-in from @benkirk and @roystgnr...

Why don't we create a new EquationSystems::build_solution_vector() with the desired signature and gradually deprecate the old one, rather than removing it entirely right away?

I definitely agree this is important to do!

Owner

friedmud commented Jan 28, 2014

Well - it's actually not going to be a very tough change to implement. I'm halfway through it now... let me work it up and you can see what you think.

What do you think about those memory plots!

Owner

benkirk commented Jan 28, 2014

Compelling!

Owner

roystgnr commented Jan 28, 2014

"I can't wait for the plots!" harassment via IM uncovered that peak memory usage on even the head node was cut in half. I'd be happy to delay the next release to get this in if you guys agreed.

Owner

friedmud commented Jan 28, 2014

This is quite a bit of new code in a very critical section of the library... I wouldn't mind waiting a release on this to let the code settle out a bit.

That said - I am going to push this to completion right now (it's my top priority) and we'll definitely be deploying it to MOOSE users soon (hopefully next week). So I understand wanting to get it out the door...

Owner

friedmud commented Jan 28, 2014

Results are in for the head node. This is the same problem... we're just looking at the aggregate memory usage over the 8 processes on the head node. The results are still pretty spectacular... it's a 10x decrease in peak memory....

Here's master vs branch:

[Plot: master vs. branch, head-node memory usage]

Here's just the branch usage so you can see the characteristic of the spike:

[Plot: branch only, head-node memory profile]

A bit of explanation...

For 200M DoFs the solution vector is ~1.6GB in size. I use the PETSc API to actually bring it all down to processor 0... then I copy each entry into a std::vector on processor zero. So the total spike on processor zero should be ~3GB (which it is).
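
(The arithmetic, assuming 8-byte real scalars: 200M x 8 bytes is ~1.6 GB for the gathered NumericVector, plus another ~1.6 GB for the std::vector it gets copied into, since both exist at once... so a peak of roughly 3.2 GB on processor 0.)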

Once I change the API so that we are just passing a NumericVector directly to the outputters, this will go down even further...

Owner

friedmud commented Jan 28, 2014

The great part about the memory usage on the head node is that it now depends only on the total number of DoFs... so if I double the number of processes on the head node the spike will still be just 3GB.

Note that the spike on the head node with master was nearly 45 GB! That means that with master I can't really increase the number of MPI processes I'm using on the head node (64GB per node on this machine), so I would be stuck at 8...

Owner

friedmud commented Jan 29, 2014

Ok, @roystgnr (and others here) have swayed me. Instead of continuing down the path of changing the API over to NumericVector... these results are compelling enough that I'm just going to work to polish this off so we can get it out the door.

I'll go ahead and make this work with the other linear algebra types... and test as much as I can and then we can merge it...

Owner

benkirk commented Jan 29, 2014

On Jan 28, 2014, at 6:15 PM, Derek Gaston notifications@github.com wrote:

I'll go ahead and make this work with the other linear algebra types... and test as much as I can and then we can merge it…

OK. I can help with the NumericVector<> part refactoring after that, if you'd like.

Owner

friedmud commented Jan 29, 2014

Ok - I think this is pretty complete at this point. I just added operator/=() to all of the other NumericVector implementations... but I don't have a great way to test all of those.
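
For reference, the kind of check I'd want run against each backend is trivial; something along these lines (just a sketch, assuming an initialized LibMeshInit whose communicator is available as comm):

#include "libmesh/numeric_vector.h"

using namespace libMesh;

// Sanity check for the new operator/= on whichever backend is configured.
AutoPtr<NumericVector<Number> > u = NumericVector<Number>::build(comm);
AutoPtr<NumericVector<Number> > v = NumericVector<Number>::build(comm);
u->init(100);
v->init(100);

for (numeric_index_type i = u->first_local_index(); i < u->last_local_index(); ++i)
  {
    u->set(i, 6.);
    v->set(i, 2.);
  }
u->close();
v->close();

*u /= *v;   // pointwise divide: u_i = u_i / v_i

for (numeric_index_type i = u->first_local_index(); i < u->last_local_index(); ++i)
  libmesh_assert_equal_to ((*u)(i), 3.);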

@benkirk @roystgnr do you guys have enough configurations of libMesh in your buildbots to be able to check Laspack, Eigen and Trilinos configurations to make sure that stuff is working? If so, I would prefer to test things that way.

Other than that, if no one sees any other issues I think this may be ready to merge...

Owner

roystgnr commented Jan 29, 2014

I'm autorunning master in Laspack/Eigen/Trilinos configurations through the unit tests and example programs.

Only with SerialMesh, though. Are all the new code paths going to be hit there?

As awesome as this is, I suppose it is probably too big a merge to slip into a -rc2, but we should definitely merge it right after the final release.

Owner

friedmud commented Jan 29, 2014

I think that will hit all the major pathways. I already did extensive testing on OSX and Linux with multiple (MOOSE-based) applications and everything is good. I've even tested with ParallelMesh.

The only gap is the Laspack/Eigen/Trilinos testing...

When is the "final" release going to come out? Can we influence that so it happens "soonish"? I would like to get this in so we can get some testing on it and get it deployed to our users. We have quite a few who are eagerly awaiting this ;-)

Owner

roystgnr commented Jan 29, 2014

Do we still need to announce rc1 on the mailing lists? Tarballs are up on the releases page but I don't recall making an announcement or seeing one from Ben.

Owner

benkirk commented Jan 29, 2014

Yes, we (I) do.

src/systems/equation_systems.C
- // multiple processors.
- this->comm().sum(node_conn);
+ // Create a NumericVector to hold the parallel solution
+ NumericVector<Number> & parallel_soln = *(NumericVector<Number>::build().release());
@benkirk

benkirk Jan 29, 2014

Owner

Why not catch these in an AutoPtr<> rather than dangerously releasing them here and requiring the manual delete below?

Why not

AutoPtr<NumericVector<Number> > parallel_soln_ptr = NumericVector<Number>::build();
NumericVector<Number> &parallel_soln = *parallel_soln_ptr;
@friedmud

friedmud Jan 29, 2014

Owner

Because I hate AutoPtr's (not just our implementation, but the entire idea) so I try to get rid of them as soon as possible ;-)

Actually, the same idea occurred to me... especially after the second time of forgetting to delete the vectors! ;-)

This is a good opportunity to use the new ability to modify files directly on GitHub and create a new commit - let me try...

@roystgnr

roystgnr Jan 29, 2014

Owner

Maybe libMesh::UniquePtr should be one of our "stuff from C++11 to shim or reimplement" targets? We can't give it a move constructor in C++03, naturally, but it would still be usable as a local variable, just not as an argument or return value. On the other hand, we wouldn't want C++11 users to try to use it in moves only to discover that their code broke on others' systems with older compilers.

@jwpeterson

jwpeterson Jan 29, 2014

Owner


The boost subset that comes with libmesh already provides boost::unique_ptr if you want to start using it in the library...

John

Owner

benkirk commented Jan 30, 2014

When is the "final" release going to come out? Can we influence that so it happens "soonish"? I would like to get this in so we can get some testing on it and get it deployed to our users. We have quite a few who are eagerly awaiting this ;-)

Not sure what you're asking - do you want 0.9.3 final soonish and then merge this into master?

I've got a tag for v0.9.3-rc1, and since there have been no issues reported so far, I'm not expecting any major deltas between that and v0.9.3. I'd be happy merging these things into master if you like, and backporting any other fixes that may come up.

Owner

friedmud commented Jan 30, 2014

If that's the case - let's merge.

You say "yes"... and I hit the green button...

Owner

roystgnr commented Jan 30, 2014

We need to stick some libmesh_example_assert() in whichever reduced_basis examples no longer support ParallelMesh.

I think I found a long-standing bug in one of the FEMSystem features.

But yeah, agreed, no large deltas between here and release. I'll try and get those little fixes done shortly.

friedmud added a commit that referenced this pull request Jan 30, 2014

Merge pull request #190 from friedmud/eq_io_opt_pr
use current_local_solution for build_solution_vector

friedmud merged commit 850a340 into libMesh:master on Jan 30, 2014
