Struggling to set RoundRobin parameter with SST #3721

Open
robcaulk opened this issue Jul 28, 2023 · 10 comments

@robcaulk

Hello,

I am trying to set the parameter RoundRobin in my SST writer, but it appears that the default AllToAll is always used no matter how I try to set the parameter.

Extra context: We started investigating the use of this library in another Adios discussion here.

To Reproduce
We have set up a minimal environment for you. In summary, we have N clients, each of which is a writer, and M server processes, each of which is a reader. We are using SST for the engine, and AllToAll communication runs successfully.

However, when I try to set the StepDistributionMode to RoundRobin, nothing changes. All M servers receive all steps.

We tried to set the parameter using a variety of methods:

    adios = adios2.ADIOS(comm=comm)
    io = adios.DeclareIO("writerIO")
    io.SetEngine("SST")
    print(f"Setting distribution mode to {args.step_mode}")
    io.SetParameters({"StepDistributionMode": args.step_mode})
    # io.SetParameter("StepDistributionMode", args.step_mode)

But neither of these methods changes the behavior of the writer.

Here you can clone the minimal working repository at https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo

And to test it you can run:

python3 launcher.py --server_np=2 --n_client=4 --client_np=2 --n_step=100 --thread_data --step_mode RoundRobin

or

python3 launcher.py --server_np=2 --n_client=4 --client_np=2 --n_step=100 --thread_data --step_mode AllToAll

The M server timesteps collected are saved to time_step_<rank>.json in the top directory. As you will see, the same output is produced for both, meaning all M server processes got all steps from all simulations.
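
One way to make that comparison concrete is to check whether the per-rank step lists are identical or disjoint. This is a minimal sketch, assuming the steps from each time_step_<rank>.json have already been loaded into plain Python lists (the file layout is not shown in this issue, so loading is left out). `classify_distribution` is a hypothetical helper, not part of ADIOS2, and the sample data is made up.

```python
from itertools import combinations


def classify_distribution(steps_by_rank):
    """Classify how steps were spread over reader ranks.

    steps_by_rank maps a reader rank to the list of step indices it saw.
    """
    step_sets = [set(steps) for steps in steps_by_rank.values()]
    if len(step_sets) < 2:
        return "single reader"
    if all(s == step_sets[0] for s in step_sets):
        return "every rank saw every step"
    if all(a.isdisjoint(b) for a, b in combinations(step_sets, 2)):
        return "steps were partitioned"
    return "mixed"


# Hypothetical data shaped like the behaviour reported above.
print(classify_distribution({0: [0, 1, 2, 3], 1: [0, 1, 2, 3]}))
# Hypothetical data shaped like the behaviour RoundRobin was expected to give.
print(classify_distribution({0: [0, 2], 1: [1, 3]}))
```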

Expected behavior
RoundRobin should follow the documented description from the Adios documentation:

“RoundRobin”, each step is delivered only to a single reader, determined in a round-robin fashion based upon the number of readers who have opened the stream at the time the step is submitted.

Desktop (please complete the following information):

  • OS/Platform: Ubuntu 22.04
  • Build: compiled from source
@eisenhauer
Member

Thanks for the report. Can you please run with the environment variable "SstVerbose" set to a numeric value of 2 or more? That should let us know if SST is seeing the parameter, output will be something like this:

eisen@Endor build % export SstVerbose=2
eisen@Endor build % bin/TestCommonWrite sst tmp StepDistributionMode=RoundRobin
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from CommonWriteTest
[ RUN ] CommonWriteTest.ADIOS2CommonWrite
Nx is set to 10 on Rank 0
Selecting DataPlane "evpath", priority 1 for use
Opening Stream "tmp"
Writer stream params are:
Param - RegistrationMethod=File
Param - RendezvousReaderCount=1
Param - QueueLimit=0 (unlimited)
Param - QueueFullPolicy=Block
Param - StepDistributionMode=StepsRoundRobin
Param - DataTransport=evpath
Param - ControlTransport=sockets
Param - NetworkInterface=(default)
Param - ControlInterface=(default to NetworkInterface if applicable)
Param - DataInterface=(default to NetworkInterface if applicable)
Param - CompressionMethod=None
Param - CPCommPattern=Min
Param - MarshalMethod=BP5
Param - FirstTimestepPrecious=False
Param - IsRowMajor=1 (not user settable)
Param - OpenTimeoutSecs=60 (seconds)
Param - SpeculativePreloadMode=Auto
Param - SpecAutoNodeThreshold=1
Param - ControlModule=select
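
When the writers are started from a Python launcher rather than a shell, the same variable can be placed in the child environment. This is a minimal sketch, assuming the launcher spawns its processes with subprocess. `sst_env` is a hypothetical helper name, and the child command is a stand-in that only echoes the variable; only the SstVerbose variable itself comes from this thread.

```python
import os
import subprocess
import sys


def sst_env(verbosity, base_env=None):
    """Return a copy of the environment with SstVerbose set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SstVerbose"] = str(verbosity)
    return env


# Stand-in child process that only echoes the variable back; a real launcher
# would start the writer here instead.
child = [sys.executable, "-c", "import os; print(os.environ['SstVerbose'])"]
result = subprocess.run(child, env=sst_env(2), capture_output=True, text=True)
print(result.stdout.strip())
```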

@robcaulk
Author

Thanks for the tip, here is the output:

Opening Reader Stream.
Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsRoundRobin
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Reader stream params are:
Param -   RegistrationMethod=File
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   AlwaysProvideLatestTimestep=False
Param -   OpenTimeoutSecs=1 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select

This seems to indicate that RoundRobin is in fact set on the backend.

@eisenhauer
Member

Interesting... I just checked to see that our CI test that covers RoundRobin distribution is still working, and it seems to be. You might kick that SstVerbose parameter up to '4', which should get you more detailed information about timestep distribution. Probably only necessary to do that on the writer side. Here's what a portion of the output looks like for our CI test, you can see the Round Robin distribution info and where each step was sent:

Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 5 (ref count 1), one to each reader
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 2
Writer 0 (0x15af1d620): Sent timestep 5 to reader cohort 2
Writer 0 (0x15af1d620): ADDING timestep 5 to sent list for reader cohort 2, READER 0x600002053400, reference count is now 2
Writer 0 (0x15af1d620): PRELOADMODE for timestep 5 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x15af1d620): Per reader registration for timestep 5, preload mode 1
DP Writer 0 (0x15af1d620): Sending Speculative Preload messages, reader 0x600001b44900, timestep 5
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
Writer 0 (0x15af1d620): Reader sent timestep list 0x600000c44210, trying to release 5
Writer 0 (0x15af1d620): Writer tagging timestep 3 as expired
DP Writer 0 (0x15af1d620): Releasing timestep 3
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): Remove queue Entries removing Timestep 3 (exp 1, Prec 0, Ref 0), Count now 2
Writer 0 (0x15af1d620): QueueMaintenance complete
DP Writer 0 (0x15af1d620): ProvideTimestep, registering timestep 6, data 0x15b046e00, fprint 41070373fd07d306
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
Writer 0 (0x15af1d620): Sending TimestepMetadata for timestep 6 (ref count 1), one to each reader
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 0
Writer 0 (0x15af1d620): Sent timestep 6 to reader cohort 0
Writer 0 (0x15af1d620): ADDING timestep 6 to sent list for reader cohort 0, READER 0x600002053200, reference count is now 2
Writer 0 (0x15af1d620): PRELOADMODE for timestep 6 non-default for reader , active at timestep 0, mode 1
DP Writer 0 (0x15af1d620): Per reader registration for timestep 6, preload mode 1
DP Writer 0 (0x15af1d620): Sending Speculative Preload messages, reader 0x600001b44840, timestep 6
Writer 0 (0x15af1d620): Removing dead entries
Writer 0 (0x15af1d620): QueueMaintenance complete
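
For longer runs it can be easier to tally these lines than to read them. This is a minimal sketch that counts, per reader, the "Round Robin Distribution, step sent to reader N" lines in a captured SstVerbose log. The regular expression is based only on the line format shown above and may need adjusting for other ADIOS2 versions; `count_round_robin_targets` is a hypothetical helper.

```python
import re
from collections import Counter

# Pattern taken from the verbose output shown in this thread.
SENT_RE = re.compile(r"Round Robin Distribution, step sent to reader (\d+)")


def count_round_robin_targets(log_text):
    """Count how many steps the writer reports sending to each reader."""
    return Counter(int(m.group(1)) for m in SENT_RE.finditer(log_text))


sample = """\
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 2
Writer 0 (0x15af1d620): Sent timestep 5 to reader cohort 2
Writer 0 (0x15af1d620): Round Robin Distribution, step sent to reader 0
Writer 0 (0x15af1d620): Sent timestep 6 to reader cohort 0
"""
print(count_round_robin_targets(sample))
```

A tally with a single key, such as only reader 0, suggests that the writer only ever saw one reader.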

@robcaulk
Author

Ok, I increased the writer verbosity as you suggested. It produced the following output. I notice that my steps are only ever sent to reader 0, whereas yours are sent to reader 2 and reader 0 (I also only have cohort 0, while your output shows cohort 2 as well as cohort 0). This suggests that I may not be initializing the readers correctly. Do you have a minimal working example of initializing a round robin reader configuration?

Writer 0 (0x55715c02fcf0): Sst set to use sockets as a Control Transport
DP Writer 0 (0x55715c02fcf0): Considering DataPlane "evpath" for possible use, priority is 1
DP Writer 0 (0x55715c02fcf0): Selecting DataPlane "evpath", priority 1 for use
Writer 0 (0x55715c02fcf0): Opening Stream "melissa.sid-0"
Writer 0 (0x55715c02fcf0): Writer stream params are:
Param -   RegistrationMethod=File
Param -   RendezvousReaderCount=1
Param -   QueueLimit=0 (unlimited)
Param -   QueueFullPolicy=Block
Param -   StepDistributionMode=StepsRoundRobin
Param -   DataTransport=evpath
Param -   ControlTransport=sockets
Param -   NetworkInterface=(default)
Param -   ControlInterface=(default to NetworkInterface if applicable)
Param -   DataInterface=(default to NetworkInterface if applicable)
Param -   CompressionMethod=None
Param -   CPCommPattern=Min
Param -   MarshalMethod=BP5
Param -   FirstTimestepPrecious=False
Param -   IsRowMajor=1  (not user settable) 
Param -   OpenTimeoutSecs=60 (seconds)
Param -   SpeculativePreloadMode=Auto
Param -   SpecAutoNodeThreshold=1
Param -   ControlModule=select
Writer 0 (0x55715c02fcf0): Stream "melissa.sid-0" waiting for 1 readers
Writer 0 (0x55715c02fcf0): Beginning writer-side reader open protocol
Writer 0 (0x55715c02fcf0): Finish writer-side reader open protocol for reader 0x55715c0488a0, reader ready response pending
Writer 0 (0x55715c02fcf0): (PID cb87, TID 7fbee9fed000) Waiting for Reader ready on WSR 0x55715c0488a0.
Writer 0 (0x55715c02fcf0): Reader Activate message received for Stream 0x55715c0488a0.  Setting state to Established.
Writer 0 (0x55715c02fcf0): Parent stream reader count is now 1.
Writer 0 (0x55715c02fcf0): Reader ready on WSR 0x55715c0488a0, Stream established, Starting 0 LastProvided 0.
Writer 0 (0x55715c02fcf0): Finish opening Stream "melissa.sid-0"
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 0, data 0x55715c04e970, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 0 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 0 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 0 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 0, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 1, data 0x55715c0568d0, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 1 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 1 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 1 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 1, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
DP Writer 0 (0x55715c02fcf0): ProvideTimestep, registering timestep 2, data 0x55715c056e40, fprint 5222418180845f8
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Sending TimestepMetadata for timestep 2 (ref count 1), one to each reader
Writer 0 (0x55715c02fcf0): Round Robin Distribution, step sent to reader 0
Writer 0 (0x55715c02fcf0): Sent timestep 2 to reader cohort 0
Writer 0 (0x55715c02fcf0): ADDING timestep 2 to sent list for reader cohort 0, READER 0x55715c0488a0, reference count is now 2
DP Writer 0 (0x55715c02fcf0): Per reader registration for timestep 2, preload mode 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): SstWriterClose, Sending Close at Timestep 2, one to each reader
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04f910, trying to release 0
Writer 0 (0x55715c02fcf0): Writer tagging timestep 0 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 0
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 0 (exp 1, Prec 0, Ref 0), Count now 2
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04e7b0, trying to release 1
Writer 0 (0x55715c02fcf0): Writer tagging timestep 1 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 1
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 1 (exp 1, Prec 0, Ref 0), Count now 1
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): Waiting for timesteps to be released in WriterClose
Writer 0 (0x55715c02fcf0): Reader sent timestep list 0x55715c04e4d0, trying to release 2
Writer 0 (0x55715c02fcf0): Writer tagging timestep 2 as expired
DP Writer 0 (0x55715c02fcf0): Releasing timestep 2
Writer 0 (0x55715c02fcf0): Removing dead entries
Writer 0 (0x55715c02fcf0): Remove queue Entries removing Timestep 2 (exp 1, Prec 0, Ref 0), Count now 0
Writer 0 (0x55715c02fcf0): QueueMaintenance complete
Writer 0 (0x55715c02fcf0): 
Stream "melissa.sid-0" (0x55715c02fcf0) summary info:
Writer 0 (0x55715c02fcf0): 	Duration (secs) = 0.624963
Writer 0 (0x55715c02fcf0): 	Timesteps Created = 3
Writer 0 (0x55715c02fcf0): 	Timesteps Delivered = 3
Writer 0 (0x55715c02fcf0): 
Writer 0 (0x55715c02fcf0): All timesteps are released in WriterClose
Writer 0 (0x55715c02fcf0): Destroying stream 0x55715c02fcf0, name melissa.sid-0
Writer 0 (0x55715c02fcf0): Reference count now zero, Destroying process SST info cache
Writer 0 (0x55715c02fcf0): Freeing LastCallList
Writer 0 (0x7ffeb3258700): SstStreamDestroy successful, returning

@robcaulk
Author

Follow up question:

When using RoundRobin with adios, do all connected readers need to BeginStep and EndStep for all timesteps still? I assume the writer is deciding who gets the data in this case.

I am trying to work around the issue above by running the round robin on the reader side by deciding which reader should read from the writer. But it seems that it still wants all readers to read all timesteps.

I realize this is likely the incorrect workaround, but I am unsure how else to achieve RoundRobin with our current setup. Perhaps our setup is unique or incorrect (although we are following the explicit instructions from #3675 (reply in thread)).

Here, we have created our exact configuration in a MWE for you to test out, in case you'd like to see how we are trying to use Adios2:

https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo

@eisenhauer
Member

Ah, we may have a conceptual disconnect. It looks like you just have a single MPI reader application connected to the writer. That reader has multiple ranks, but since ADIOS is designed for communication between MPI applications, it assumes that all the writer/reader ranks in an application act cooperatively. None of SST's distribution modes come into play because there is only one reader application and it gets all the timesteps. Each of the reader's ranks might select different parts of the incoming arrays, but they will all come from the same set of data that the writer ranks created for that timestep. The RoundRobin distribution mode was designed to scatter created timesteps to multiple reader applications. There is a test in ADIOS that does this, and you can try it by first running the writer like this:
bin/TestDistributionWrite SST RR.sst RendezvousReaderCount=2 --round_robin
This should wait for two readers to connect to it.
Then start up two separate terminal windows, cd to the same directory, and in each one run:
bin/TestDistributionRead SST RR.sst --round_robin
If you have SstVerbose turned on, you should see the timesteps alternating with respect to which reader application they are delivered to.

Note that I didn't run with MPI above, so we only have a single rank for the writer and each of the two readers. They could each be MPI applications.
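
The distinction can be illustrated with a toy model in plain Python. This is not ADIOS2 code: it only assumes, as described above, that a step is delivered to one reader application and that every rank of that application then sees it. The function and variable names are made up for the illustration, and the real starting reader and ordering may differ.

```python
def round_robin_delivery(n_steps, ranks_per_reader_app):
    """Toy model: hand each step to one reader application in turn.

    ranks_per_reader_app lists how many ranks each reader application has.
    Returns {(app, rank): [steps seen]}.
    """
    seen = {
        (app, rank): []
        for app, n_ranks in enumerate(ranks_per_reader_app)
        for rank in range(n_ranks)
    }
    n_apps = len(ranks_per_reader_app)
    for step in range(n_steps):
        app = step % n_apps  # the application whose turn it is
        for rank in range(ranks_per_reader_app[app]):
            seen[(app, rank)].append(step)  # all ranks of that app see the step
    return seen


# One reader application with two ranks: both ranks see every step.
print(round_robin_delivery(4, [2]))
# Two single-rank reader applications: the steps alternate between them.
print(round_robin_delivery(4, [1, 1]))
```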

@eisenhauer
Member

> Follow up question:
>
> When using RoundRobin with adios, do all connected readers need to BeginStep and EndStep for all timesteps still? I assume the writer is deciding who gets the data in this case.
>
> I am trying to work around the issue above by running the round robin on the reader side by deciding which reader should read from the writer. But it seems that it still wants all readers to read all timesteps.
>
> I realize this is likely the incorrect workaround, but I am unsure how else to achieve RoundRobin with our current setup. Perhaps our setup is unique or incorrect (although we are following the explicit instructions from #3675 (reply in thread)).
>
> Here, we have created our exact configuration in a MWE for you to test out, in case you'd like to see how we are trying to use Adios2:
>
> https://gitlab.inria.fr/mschoule/adios2-melissa-simple-demo

Sorry, I haven't had time to go through your demo, and may not yet today. But generally, if you pass an MPI communicator in to ADIOS initialization, then a bunch of things in ADIOS are collective operations. Every rank has to do Open(), BeginStep, EndStep, etc. However, you might get to where you want to be by NOT passing the MPI communicator in to ADIOS. Then each rank will operate completely independently, as if it were its own separate 1-rank application. That may be good or bad depending upon exactly what you're trying to do (i.e., if you want everything to run sort of in lock-step, this isn't the way).

@robcaulk
Author

Ok, thanks for conveying the internal philosophy of the Adios2 round robin distribution method.

Unfortunately - our application depends on all readers sitting on the same MPI application.

I will try your suggestion of not passing the mpi communicator to adios, thanks for the tip!

@robcaulk
Author

robcaulk commented Aug 1, 2023

Just an update, I have managed to get RoundRobin working for our configuration by taking your suggestion and removing the MPI communicator from the reader side adios2.ADIOS() initialization. The client side still takes its own client communicator (since our clients are individual MPI applications).

Now it is "working" in our toy example. But I am wondering, what are the ramifications on the ADIOS backend? You say "if you want everything to run in lock step then it isn't the way." Maybe I misunderstand, but we are still using our own MPI communicator on our side, so we have full control over the lock-step nature of our reading (in case we want/don't want that). So are you referring to something intrinsic to the ADIOS backend? For example, without the communicator, is there some undefined behavior possible in the step distribution on ADIOS's side?

In any case, thanks a lot for your assistance. I think your previous tips enable us to move away from the toy example and try integration into our software.

@eisenhauer
Member

WRT what I meant by that comment, I'd go back to ADIOS' origins. It was designed to pass information between timestep-oriented simulation and analysis jobs where the prominent data structures were global arrays decomposed across the writer ranks, with different portions of them consumed by each reader rank. In that context, ADIOS makes sure that the reader ranks are all working on the same timestep at the same time, etc. You just have a bit more of a novel use case, so ADIOS isn't in that role. I don't think there should be any undefined behavior (at least WRT MPI). Hopefully the more defined behavior is also appropriate for your situation.

Reader-side ADIOS BeginStep() without a timeout will block until it gets data, which may hold up one of your ranks until its turn to get data sent to it (which might in turn hold up your whole application, because your own collective MPI operations might wait for that rank to run again). There is a timeout parameter to BeginStep that you can use to help manage that, but with RoundRobin, data sent to a particular reader is his to consume and won't be available to any other reader. So one reader that didn't do BeginStep for a while could have a queue while another might have run through all his data. Maybe that's not a problem because it just doesn't matter, or your outside-of-ADIOS synchronization keeps that sort of thing in check. If it were a problem, you might also consider the OnDemand distribution mode, where each writer-side timestep is sent to the next reader that asks for it, rather than to specific readers in sequence as in RoundRobin. But again, it depends upon your use case. Happy to chat more if anything seems weird when you're integrating.
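
The difference between the two modes can be sketched with a toy model in plain Python. This is not ADIOS2 code and makes simplifying assumptions: all steps already exist before any reader asks, a request that finds nothing is simply skipped (a real BeginStep would block or time out), and RoundRobin starts with reader 0. `simulate` and its arguments are hypothetical names.

```python
from collections import deque


def simulate(mode, n_steps, n_readers, requests):
    """Toy model of step hand-out for two distribution modes.

    requests is the order in which readers ask for a step, e.g. [0, 0, 1].
    Returns (steps consumed per reader, steps still queued).
    """
    consumed = {r: [] for r in range(n_readers)}
    if mode == "RoundRobin":
        # Each step is bound to one reader when it is submitted.
        queues = {r: deque() for r in range(n_readers)}
        for step in range(n_steps):
            queues[step % n_readers].append(step)
    elif mode == "OnDemand":
        # One shared pool; the next reader that asks takes the next step.
        shared = deque(range(n_steps))
        queues = {r: shared for r in range(n_readers)}
    else:
        raise ValueError(f"unknown mode: {mode}")
    for reader in requests:
        if queues[reader]:
            consumed[reader].append(queues[reader].popleft())
    # Collect whatever was never consumed, without double counting the pool.
    left = sorted({s for q in queues.values() for s in q})
    return consumed, left


# Reader 0 asks five times before reader 1 asks once.
requests = [0, 0, 0, 0, 0, 1]
print(simulate("RoundRobin", 6, 2, requests))
print(simulate("OnDemand", 6, 2, requests))
```

In this model the slow reader leaves steps queued under RoundRobin, while under OnDemand the fast reader drains the pool.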
