
No more use of unordered reactions in federated programs #191

Merged: 51 commits merged into main on Jul 28, 2023

Conversation

Collaborator
@arengarajan99 arengarajan99 commented Apr 2, 2023

The current runtime support for federated execution relies on a set of blocking, unordered "control reactions," one per input port, that block the handling of reactions until the required port statuses are known. The drawbacks of this design are discussed in Rengarajan 2023, along with a solution that no longer relies on blocking unordered reactions. This PR implements that solution.

This work was initiated by @arengarajan99 and completed by @petervdonovan.

- Place unsafe reactions in a hashset; when signaled, probe the hashset and remove any reactions whose upstream port statuses are now known, enqueuing them safely
@arengarajan99 arengarajan99 added enhancement Enhancement of existing feature help wanted Extra attention is needed federated refactoring labels Apr 2, 2023
@petervdonovan petervdonovan changed the title Remove Unordered/Spin Lock Reactions from Federated Runtime Remove Unordered Reactions from Federated Runtime Jul 12, 2023
This also removes two other fields of the reaction_t struct which are
not necessary.
After this change I was able to run Python's StopAtShutdown 95 times
without observing any deadlock, whereas before I could consistently
reproduce the deadlock (when using the macOS runner from CI) by running
it several times.

It is not entirely clear what this deadlock has to do with the changes
in this branch. Although I have not reproduced it in master (I have
tried, and after running it 95 times I did not see deadlock), it does
seem like a programming mistake to allow the stop tag to increase.
Besides, the error was very flaky, non-reproducible on Ubuntu, and quite
possibly present even if it is not observed.
So much for failing fast. It seems like this is the behavior that is
expected by our tests, so for now maybe it is what we will do.
Thanks to Jacky Kwok for finding this.
The ability to do this is the initial motivation for this PR.
@petervdonovan petervdonovan marked this pull request as ready for review July 25, 2023 20:36
Member

@lhstrh lhstrh left a comment
This is looking very good! Only some comments missing, but other than that, this looks ready to merge...

(Three review threads on core/federated/federate.c, all resolved.)
petervdonovan and others added 2 commits July 25, 2023 20:40
Co-authored-by: Marten Lohstroh <marten@berkeley.edu>
@petervdonovan
Contributor
petervdonovan commented Jul 26, 2023

I think the timeout in ParallelSourcesMultiport is likely to be a (legitimate) deadlock, although it is hard to be sure since the output was truncated. I do not remember seeing that test deadlock before, so this must be something new. I am trying to reproduce now.

Edit: Failed to reproduce locally on Ubuntu after running 348 times with log level LOG.

Edit: Also failed to reproduce locally on Ubuntu after running 348 times with log level DEBUG.

Edit: Also failed to reproduce on a macOS GH actions runner after running 265 times with log level DEBUG.

@petervdonovan
Contributor
petervdonovan commented Jul 26, 2023

Could we maybe pretend that everything is fine? I have run ParallelSourcesMultiport hundreds of times locally and in a macOS GH actions runner without being able to reproduce the latest failure; additionally, I have never seen ParallelSourcesMultiport fail when running it on Wessel with random delays inserted.

For the record, this is all the information we got from CI about the apparent deadlock:

Execution output:
    Federate ParallelSourcesMultiport in Federation ID '3da11054f538fe46ef345bbdfe648c7c7e52d705cef6ded8'
    #### Launching the federate federate__s1.
    #### Launching the federate federate__s2.
    #### Launching the federate federate__d1.
    #### Launching the federate federate__t4.
    #### Bringing the RTI back to foreground so it can receive Control-C.
    RTI -i ${FEDERATION_ID} -n 4 -c init exchanges-per-interval 10

It looks like the program never completed startup. It is possible to imagine that it is somehow not a true deadlock (some resource not available on GH's server, etc.) but I cannot say with confidence that it is not a true deadlock.

@lhstrh
Member
lhstrh commented Jul 26, 2023

Sounds fair to me. Should this reappear we'll deal with it then.

@lhstrh lhstrh changed the title Remove Unordered Reactions from Federated Runtime No more use of unordered reactions in federated programs Jul 27, 2023
@lhstrh lhstrh merged commit 8c69304 into main Jul 28, 2023
25 checks passed
3 participants