Struggling to set RoundRobin parameter with SST
#3721
Comments
Thanks for the report. Can you please run with the environment variable "SstVerbose" set to a numeric value of 2 or more? That should let us know if SST is seeing the parameter. The output will be something like this:
eisen@Endor build % export SstVerbose=2
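When the writer is launched from a script rather than an interactive shell, the same variable can be set for that one process only. Below is a minimal Python sketch of that idea. The helper name `run_with_sst_verbose` and the demo command are hypothetical and not part of ADIOS2; the child process here only echoes the variable and stands in for the real writer command line.

```python
import os
import subprocess
import sys


def run_with_sst_verbose(cmd, level=2):
    """Run cmd with SstVerbose set in the child's environment only."""
    # Copy the current environment and add SstVerbose for the child process.
    env = dict(os.environ, SstVerbose=str(level))
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Placeholder child that just echoes the variable it received;
    # replace with the real writer command (assumption: any argv list works).
    demo_cmd = [sys.executable, "-c", "import os; print(os.environ['SstVerbose'])"]
    result = run_with_sst_verbose(demo_cmd, level=4)
    print(result.stdout.strip())
```

Setting the variable per process keeps the extra output on the writer side only, which matches the suggestion that verbosity is mostly useful there.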
Thanks for the tip, here is the output:
This seems to indicate that RoundRobin is in fact set on the backend.
Interesting... I just checked that our CI test covering RoundRobin distribution is still working, and it seems to be. You might kick that SstVerbose parameter up to '4', which should give you more detailed information about timestep distribution. It is probably only necessary to do that on the writer side. Here's what a portion of the output looks like for our CI test; you can see the Round Robin distribution info and where each step was sent:
Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 5 (ref count 1), one to each reader
Ok, I increased the writer verbosity as you suggested. It produced the following output. I notice that I only have
Follow-up question: when using RoundRobin with ADIOS, do all connected readers need to
I am trying to work around the issue above by running the round robin on the reader side, deciding which reader should read from the writer. But it seems that it still wants all readers to read all timesteps. I realize this is likely the wrong workaround, but I am unsure how else to achieve RoundRobin with our current setup. Perhaps our setup is unusual or incorrect (although we are following the explicit instructions from #3675 (reply in thread)).
We have reproduced our exact configuration in an MWE for you to test, in case you'd like to see how we are trying to use Adios2:
Ah, we may have a conceptual disconnect. It looks like you just have a single MPI reader application connected to the writer. That reader has multiple ranks, but since ADIOS is designed for communication between MPI applications, it assumes that all the writer/reader ranks in an application act cooperatively. None of SST's distribution modes come into play because there is only one reader application and it gets all the timesteps. Each of the reader's ranks might select different parts of the incoming arrays, but they will all come from the same set of data that the writer ranks created for that timestep.
The RoundRobin distribution mode was designed to scatter created timesteps to multiple reader applications. There is a test in ADIOS that does this and you can try it by first running the writer like this:
Note that I didn't run with MPI above, so we only have a single rank for the writer and each of the two readers. They could each be MPI applications.
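The distinction above can be illustrated with a toy model in plain Python. This is not ADIOS code: `distribute_steps` is a hypothetical function, and the rule "step i goes to reader application i modulo the number of reader applications" is an assumption made for illustration (the real assignment order is decided inside SST). The point it shows is that distribution happens per reader application, not per rank.

```python
def distribute_steps(n_steps, n_reader_apps, mode):
    """Toy model: return {reader_app_index: [steps delivered to that app]}."""
    delivered = {app: [] for app in range(n_reader_apps)}
    for step in range(n_steps):
        if mode == "AllToAll":
            # Every reader application receives every step.
            for app in delivered:
                delivered[app].append(step)
        elif mode == "RoundRobin":
            # Assumed rule: steps are dealt out to applications in turn.
            delivered[step % n_reader_apps].append(step)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return delivered


if __name__ == "__main__":
    # One reader application (however many ranks it has) still sees all steps.
    print(distribute_steps(6, 1, "RoundRobin"))
    # Two separate reader applications split the steps between them.
    print(distribute_steps(6, 2, "RoundRobin"))
```

With a single reader application, RoundRobin and AllToAll give the same result in this model, which is consistent with the behaviour reported in this issue.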
Sorry, I haven't had time to go through your demo, and may not yet today. But generally, if you pass an MPI communicator in to ADIOS initialization, then a bunch of things in ADIOS are collective operations. Every rank has to do Open(), BeginStep, EndStep, etc. However, you might get to where you want to be by NOT passing the MPI communicator in to ADIOS. Then each rank will operate completely independently, as if it were its own separate 1-rank application. That may be good or bad depending upon exactly what you're trying to do (i.e. if you want everything to run sort of in lock-step, this isn't the way).
Ok, thanks for conveying the internal philosophy of the Adios2 round robin distribution method. Unfortunately, our application depends on all readers sitting in the same MPI application. I will try your suggestion of not passing the MPI communicator to ADIOS. Thanks for the tip!
Just an update, I have managed to get
Now it is "working" in our toy example. But I am wondering, what are the ramifications on the ADIOS backend? You say "if you want everything to run in lock step then it isn't the way." Maybe I misunderstand, but we are still using our own MPI communicator on our side, so we have full control over the lock-step nature of our reading (in case we want or don't want that). So are you referring to something intrinsic to the ADIOS backend? For example, without the communicator, is there some undefined behavior possible in the step distribution on ADIOS' side?
In any case, thanks a lot for your assistance. I think your previous tips enable us to move away from the toy example and try integration into our software.
WRT what I meant by that comment, I'd go back to ADIOS' origins. It was designed to pass information between timestep-oriented simulation and analysis jobs where the prominent data structures were global arrays decomposed across the writer ranks, with different portions of them consumed by each reader rank. In that context, ADIOS makes sure that the reader ranks are all working on the same timestep at the same time, etc. You just have a bit more of a novel use case, so ADIOS isn't in that role. I don't think there should be any undefined behavior (at least WRT MPI). Hopefully the more defined behavior is also appropriate for your situation.
Reader-side ADIOS BeginStep() without a timeout will block until it gets data, which may hold up one of your ranks until its turn to get data sent to it (which might in turn hold up your whole application, because your own collective MPI operations might wait for that rank to run again). There is a timeout parameter to BeginStep that you can use to help manage that, but with RoundRobin, data sent to a particular reader is that reader's to consume and won't be available to any other reader. So one reader that didn't do BeginStep for a while could have a queue while another might have run through all its data. Maybe that's not a problem, because it just doesn't matter or your outside-of-ADIOS synchronization keeps that sort of thing in check. If it was a problem, you might also consider the
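The queueing behaviour described in this comment can be sketched with a toy model in plain Python. This is not the ADIOS API: a standard `queue.Queue` stands in for the steps SST has assigned to one reader, and the `timeout` on `get` plays the role of the BeginStep timeout. The `poll_step` helper and the two-reader setup are hypothetical.

```python
import queue


def poll_step(reader_queue, timeout_s):
    """Return the next step for this reader, or None if none arrives in time."""
    try:
        return reader_queue.get(timeout=timeout_s)
    except queue.Empty:
        # Analogous to a timed-out step request: nothing ready, do not block.
        return None


# Toy model: under RoundRobin each reader owns its own queue of steps.
queues = [queue.Queue(), queue.Queue()]
for step in range(6):
    queues[step % 2].put(step)

# Reader 0 polls promptly and drains its own queue.
consumed_by_0 = []
while (step := poll_step(queues[0], timeout_s=0.01)) is not None:
    consumed_by_0.append(step)

# Reader 1 has not polled yet, so its steps are still waiting; reader 0
# cannot take them even though it has run out of work.
backlog_of_1 = queues[1].qsize()

print(consumed_by_0, backlog_of_1)
```

In this model, a reader that polls with a timeout can return to the application's own MPI synchronization instead of blocking, while the backlog on the slower reader stays where it is.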
Hello,
I am trying to set the parameter RoundRobin in my SST writer, but it appears that the default AllToAll is always used no matter how I try to set the parameter.
Extra context: We started investigating the use of this library in another Adios discussion here.
To Reproduce
We have set up our minimal environment for you. In summary, we have N number of clients, each one is a writer. We have M number of server processes, each one is a reader. We are using SST for the engine, and we successfully run AllToAll communications.
However, when I try to set the StepDistributionMode to RoundRobin, nothing changes. All M servers receive all steps.
We tried to set the parameter using a variety of methods:
But neither of these methods changes the behavior of the writer.
Here you can clone the minimal working repository at https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo
And to test it you can run:
or
The M server timesteps collected are saved to time_step_<rank>.json in the top directory. As you will see, the same output is produced for both, meaning all M server processes got all steps from all simulations.
Expected behavior
RoundRobin should follow the documented description from the Adios documentation:
Desktop (please complete the following information):