Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for STP violation #257

Merged
merged 2 commits into from
Aug 14, 2023
Merged

Fix for STP violation #257

merged 2 commits into from
Aug 14, 2023

Conversation

petervdonovan
Copy link
Contributor

@petervdonovan petervdonovan commented Aug 8, 2023

This fixes yet another bug reported by Jacky.

The commit message is reproduced here for convenience:

The communication protocol remains undocumented AFAIK, but my working
assumption has been that if a federate receives a PTAG, then that means
that all messages that are expected over delayed connections have
arrived. Therefore, delayed connections are not guarded from STP
violations by the MLAA; this property is acceptable because delayed connections impose
no deadlock risk and in some cases (startup) this property is necessary to avoid
deadlocks, but it requires some special care when sending a PTAG.

Regarding testing: Although Jacky has provided an example program for reproducing the issue, it is not obvious to me whether we should turn it into a test since the failure is not reproduced reliably without running the program for some seconds (and that is locally, on a laptop, with the time.sleeps removed). I do not think that our current method of running tests in GH Actions can be treated as an effective or efficient way to detect races of this kind.

The same could be said of the other STP violation that Jacky reported.

NOTE: Previously an irrelevant change was included in this PR that actually introduced a bug. It has since been deleted (I rebased and force-pushed).

The communication protocol remains undocumented AFAIK, but my working
assumption has been that if a federate receives a PTAG, then that means
that all messages that are expected over delayed connections have
arrived. Therefore, delayed connections are not guarded from STP
violations by the MLAA; this property is acceptable because they impose
no deadlock risk and in some cases (startup) it is necessary to avoid
deadlocks, but it requires some special care when sending a PTAG.
Copy link
Contributor

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@petervdonovan
Copy link
Contributor Author

The timeout might just be flakiness 🙄 but the real reason why this cannot be merged yet is that it is not clear whether the problem has been fixed. I no longer reproduce it locally, but @jackykwok2024 can reproduce it with the latest RTI checked out.

It's hard for me to reproduce the problem locally (either in master or in my branch) unless I comment out the sleeps, so I had to comment out the sleeps in order to determine if my branch had fixed anything. In master, with the sleeps commented out, I was able to pretty consistently reproduce the problem within (say) 200 rounds, but in my branch I was not able to reproduce after running for more than 200 rounds a few different times.

This seems like it could be dependent on the timing behavior imposed by the hardware/operating system, so I tried to reproduce it on Wessel. I couldn't reproduce it, not even on master, although I did get this in master, which looks like a different bug:

Server 10 50
Server 10 50
Federation ID for executable /home/peter/lingua-franca/test/Python/fed-gen/fedfail/src-gen/federate__client1/federate__client1.py: a5969099fea749c6961bf6d0202e9adefcf1bada4f562de5
---- Start execution at time Thu Aug 10 22:07:43 2023
---- plus 349306463 nanoseconds
Federate 0: Successfully connected to RTI.
Federate 0: Connected to RTI at localhost:15045.
Federate 0: Starting timestamp is: 1691730464369868596.
Federate 0: Environment 0: ---- Spawning 5 workers.
_pickle.UnpicklingError: invalid load key, '\x00'.
Federate 0: FATAL ERROR: Could not deserialize deserialized_message.
WARNING: RTI: Destination federate 0 is no longer connected. Dropping message.
WARNING: RTI: Destination federate 0 is no longer connected. Dropping message.

In short I'm not sure how to proceed because although the changes in this branch fixes the problem on my machine, I'm no longer able to reproduce the behavior that Jacky is observing.

@jackyk02
Copy link
Contributor

jackyk02 commented Aug 11, 2023

It's hard for me to reproduce the problem locally (either in master or in my branch) unless I comment out the sleeps, so I had to comment out the sleeps in order to determine if my branch had fixed anything. In master, with the sleeps commented out, I was able to pretty consistently reproduce the problem within (say) 200 rounds, but in my branch I was not able to reproduce after running for more than 200 rounds a few different times.

In short I'm not sure how to proceed because although the changes in this branch fixes the problem on my machine, I'm no longer able to reproduce the behavior that Jacky is observing.

Hi Peter, thank you so much for the effort you've dedicated to addressing this bug. I've set up an AWS EC2 Instance and checked out the most recent RTI (with the updates for zero-delay connection). This branch seem to have addressed the issue, as I'm unable to reproduce it on the AWS instance. However, the problem still remains on my laptop. It does seem like the issue is tied to the timing behavior of the hardware/OS. I'll further explore this using my machine…

@petervdonovan
Copy link
Contributor Author

OK, I expect the test to pass now.

Even though the exact error reported might not have been reproduced, I think we should merge this because it does demonstrably fix an actual bug. These error reports from @jackykwok2024 have been really helpful. It would be reasonable to re-raise the original issue if someone who can reproduce it can produce a file with full debug logs from both the RTI and federates, and ideally also a trace from fedsd.

@lhstrh lhstrh merged commit 809c454 into main Aug 14, 2023
25 checks passed
@lhstrh lhstrh deleted the another-stp-violation-fix branch August 14, 2023 23:11
@lhstrh lhstrh added the bugfix label Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants