
Possible bug with fetch_points/fetch_values #70

Closed
chrisrichardson opened this issue Feb 21, 2021 · 4 comments

@chrisrichardson
Contributor

I have been trying to adapt example 8 (fetchall.cpp) to my purposes, but found the following strange behaviour. When running the code below with mpirun -np 1 ./fetchall mpi://domain1/ifs 0.618 : -np 1 ./fetchall mpi://domain2/ifs 1.414 it crashes with:

terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::_M_create

However, it works OK for smaller amounts of data (e.g. n=1000)

Code (minimal example based on fetchall.cpp), compiled with g++ 10 and MPICH, using the same Makefile as the fetchall example:

#include "../mui/mui.h"

int main( int argc, char ** argv ) {
    if ( argc < 3 ) {
        printf( "USAGE: mpirun -np n1 %s URI1 value1 : -np n2 %s URI2 value2\n\n"
                "n1, n2     : number of ranks for each 'subdomain'\n"
                "URI format : mpi://domain-identifier/interface-identifier\n"
                "value      : an arbitrary number\n\n"
                "EXAMPLE: mpirun -np 1 %s mpi://domain1/ifs 0.618 : -np 1 %s "
                "mpi://domain2/ifs 1.414\n\n",
                argv[0], argv[0], argv[0], argv[0] );
        exit( 0 );
    }

    mui::uniface3d interface( argv[1] );

    if (std::string(argv[1]) == "mpi://domain1/ifs")
    {
      int n = 3000;
      std::vector<mui::point3d> push_locs( n );

      printf( "domain %s pushed %d values %s\n", argv[1], n, argv[2] );

      // Push the value stored in "argv[2]" to the MUI interface
      for ( size_t i = 0; i < n; i++ ) { // Define n push locations and push the value
        push_locs[i] = { i*2.1, i/10.0, i/20.0 };
        interface.push( "data", push_locs[i], atof( argv[2] ) );
      }

      // Commit (transmit by MPI) the values at time=0
      interface.commit( 0 );
    }
    else
    {
      // Fetch the values from the interface using the fetch_points and fetch_values methods
      // (blocking until data at "t=0" exists according to the chrono_sampler)
      int time = 0;

      mui::chrono_sampler_exact1d chrono_sampler;
      std::vector<mui::point3d> fetch_locs = interface.fetch_points<double>( "data", time, chrono_sampler ); // Extract the locations stored in the interface at time=0
      std::vector<double> fetch_vals = interface.fetch_values<double>( "data", time, chrono_sampler ); // Extract the values stored in the interface at time=0

      // Print returned values
      for ( size_t i = 0; i < fetch_locs.size(); i++ )
        printf( "domain %s fetched value %lf at location %lf\n", argv[1], fetch_vals[i], fetch_locs[i][0] );
    }

    return 0;
}
@SLongshaw
Member

Thank you for the bug report.

This is confirmed using g++ 9 and Open MPI, so it is not compiler- or MPI-library-specific.

The crash is triggered by the fetch_points/fetch_values call in your example, but it in fact stems from the creation of the internal message structure within the library, triggered by the barrier() function, so it is likely representative of a more deeply-rooted bug.

This will be looked into as a matter of urgency and this report updated once a fix is found.

A quick point for the moment: there is a mismatch in the chrono_sampler used. When using the specialisms approach to declaring MUI objects it is important to match types; the interface and point types are 3d (3-dimensional, double data type) while the chrono_sampler is 1d. This isn't the underlying problem, though.
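For reference, matched declarations would look like the following (a sketch based on the example above; assumes MUI's 3d typedef specialisms from mui.h):

```cpp
mui::uniface3d interface( argv[1] );         // 3d interface, double data type
mui::chrono_sampler_exact3d chrono_sampler;  // 3d sampler to match the interface

std::vector<mui::point3d> fetch_locs =
    interface.fetch_points<double>( "data", time, chrono_sampler );
std::vector<double> fetch_vals =
    interface.fetch_values<double>( "data", time, chrono_sampler );
```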

@chrisrichardson
Contributor Author

Thanks @SLongshaw. I will update my chrono_sampler dimensionality. Good luck finding the bug.

@SLongshaw
Member

Hi @chrisrichardson I think this should now be resolved in both branches.

To summarise, you managed to uncover a situation that doesn't normally arise when using MUI: the sending rank was able to exit before the receiving rank had completed its fetch. Because MUI relies on a non-blocking send / blocking receive design, this meant MPI buffers etc. were lost as the sending rank exited. Coupled designs normally mean this tends not to happen (admittedly not specifically by design).

In reality, though, this represented a bug in the general design of MUI: there should always be a corresponding MPI wait (or equivalent) call (which there was), but nothing actually blocked on that wait to ensure completion.

The solution implemented is a new blocking wait on MPI_Isend completion but with an inbuilt timer to avoid deadlock.
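The pattern of that timer-guarded wait can be sketched generically as follows. This is a simplified, self-contained illustration, not MUI's actual implementation; in MUI the completion test would wrap MPI_Test on the outstanding MPI_Isend request, and a timeout would emit a user warning.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Poll a completion test in a loop, but give up after a timeout instead of
// deadlocking. The test is a plain std::function here so the sketch is
// self-contained; in MUI's case it would wrap MPI_Test on the send request.
bool wait_with_timeout( const std::function<bool()>& test_complete,
                        double timeout_seconds ) {
    const auto start = std::chrono::steady_clock::now();
    while ( !test_complete() ) {
        const double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start ).count();
        if ( elapsed > timeout_seconds )
            return false;  // timed out: bail out (with a warning) rather than deadlock
        std::this_thread::sleep_for( std::chrono::milliseconds( 1 ) );
    }
    return true;  // the operation completed within the timeout
}
```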

The problem here is that the original design was the safest and most general approach but inherently had this potential problem. The new design is more "correct" in MPI terms but reduces the generality of the library; for most scenarios this shouldn't make any difference, but for those few where it does, the addition of a timeout and user warning messages is the (inelegant) solution.

This change will remain permanent as long as further test cases don't highlight any unforeseen major issues with the new solution. If they do, there is a work-around using the original approach: ensure that the sending side finishes with a call to the barrier() function at a time stamp where it hasn't yet issued a fetch() command and, on the receiving side, send a final empty commit() at that time to unblock both sides, in effect adding a synchronisation barrier.
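That work-around can be sketched as follows (hypothetical time stamp t=1, chosen so that neither side ever issues a fetch() at it; `interface` is the uniface3d object from the example above):

```cpp
// Sending side: after the final commit(), just before exiting
interface.barrier( 1 );   // block until the receiving side reaches t=1

// Receiving side: after the final fetch()
interface.commit( 1 );    // final empty commit at t=1 unblocks both sides
```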

Please try this change out and feel free to re-open this issue if it does not resolve the problem for you; for now, though, I will close it.

@SLongshaw
Member

The fix for this problem eventually caused a runtime problem in a specific case, where the new blocking wait added significant overhead. The non-blocking send has therefore been re-instated; however, a final blocking test loop has been added to the destructor of the MPI comm class. This has been tested against the issue raised here: the new solution still offers a fix, while removing the need for a blocking MPI test loop after every send.
