Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TAN messages no longer used and in-transit messages recorded in the RTI #1074

Merged
merged 29 commits into from Jun 11, 2022

Conversation

Soroosh129
Copy link
Contributor

@Soroosh129 Soroosh129 commented Apr 4, 2022

This PR removes TAN messages entirely.

Instead, a federate with a physical action (that is connected to a network output) is going to periodically create a dummy event (with the period controlled by coordination-options: {advance-message-interval: 10 msec}) which forces the federate to advance its tag and allow downstream federates to make progress.

After fixing this bug, another bug was exposed in the RTI, in which the RTI could potentially lose track of a federate's actual earliest next event (see #1074 (comment) comment for more detail). This caused the RTI to grant incorrect tag advance grant (TAG) messages. This bug was fixed by adding a queue to the RTI that keeps a record of all currently in-transit messages.

@Soroosh129
Copy link
Contributor Author

Soroosh129 commented May 30, 2022

I've been wrestling with this issue for the past few days. I think I have identified two possible race conditions, one of which I think I found a fix for (but not the other).

Imagine the following federated program (PassThrough and TestCount are simple library reactors with trivial implementation that is elided for the sake of simplicity):

target C { 
    timeout: 4 sec
};

import PassThrough from "../lib/PassThrough.lf"
import TestCount from "../lib/TestCount.lf"

reactor WithLogicalAction {
    output out:int;
    state thread_id:lf_thread_t(0);
    state counter:int(1);
    logical action act(0):int;

    reaction(startup, act) -> act, out {=
        SET(out, self->counter);
        schedule_int(act, MSEC(1), self->counter++);
    =}
}

federated reactor {
    a = new WithLogicalAction();
    test = new TestCount(num_inputs=21);

    passThroughs = new PassThrough();

    a.out, passThroughs.out -> 
    passThroughs.in, test.in
}

diagram

Race condition 1: NET with FOREVER tag from the federate after message forwarding by the RTI but before delivery

Example scenario:

  • ...
  • Federate WithLogicalAction sends a message with tag 1 sec to the RTI
  • The RTI sees this message and forwards it to PassThrough
    • As part of its forwarding, the RTI checks if the tag of the message is smaller than the recorded next event of PassThrough. In our case, assume that it is (1 sec < FOREVER), so the RTI will record 1 sec as the next event for PassThrough.
  • While the message is in transit, PassThrough sends a(nother) NET with tag FOREVER to the RTI (because it has no other local events). I'm not sure why this happens.
  • Upon receiving this NET, the RTI incorrectly replaces the next event of the PassThrough from the in-transit message to FOREVER.
  • Race condition: At this point (while the message is in-transit), if TestCount asks for a TAG of FOREVER, it will be granted because the RTI incorrectly thinks that PassThrough has a next event of FOREVER.

Fix: RTI checks that a recorded next event is already completed before replacing it with a larger next event.

Race condition 2: NET with smaller tag from the federate before message forwarding by the RTI

Example scenario:

  • ...
  • Federate WithLogicalAction sends a message with tag 1 sec to the RTI
  • ...
  • Federate PassThrough starts processing the message with tag 1 sec with a properly sent NET(1 sec) and received TAG(1 sec)
  • Federate WithLogicalAction sends a message with tag 1001 msec to the RTI
  • The RTI forwards the message to PassThrough
    • The recorded next event for PassThrough is 1 sec, so the RTI will not replace that with 1001 msec.
  • While the message is in-transit, PassThrough finishes processing the message with tag 1 sec and sends a NET of FOREVER (because it has no other local events and the message is still in-transit).
  • Race condition: At this point, if TestCount asks for a TAG of FOREVER, it will be granted because the RTI incorrectly thinks that PassThrough has a next event of FOREVER (even though a message with tag 1001 msec is in transit).

Fix: ?

Any suggestion on how to fix the second race condition?

@edwardalee
Copy link
Collaborator

It seems that maybe the RTI needs to keep a list of tags of messages it has forwarded to a federate for which it has not yet received an LTC. When it receives an LTC, it removes items from this list with tags <= the LTC. When it receives a NET, the RTI checks and sets its local record of the NET to the minimum of the NET it received and the tags in the list.

This is unfortunate because, although this list is likely to be small in practice, there is no upper bound on its size, so the implementation inevitably will require a malloc. But it probably could be optimized to minimize the likelihood of a malloc.

This whole protocol is screaming to be formally verified...

@Soroosh129
Copy link
Contributor Author

It seems that maybe the RTI needs to keep a list of tags of messages it has forwarded to a federate for which it has not yet received an LTC. When it receives an LTC, it removes items from this list with tags <= the LTC. When it receives a NET, the RTI checks and sets its local record of the NET to the minimum of the NET it received and the tags in the list.

This was my initial instinct as well. I implemented a prototype using a vector but I quickly realized that this list of tags needs to be sorted to be efficient. I think our existing pqueue sounds like the right structure for this.

This whole protocol is screaming to be formally verified...

I agree. Tagging @lsk567 to see if he would be interested in continuing our unfinished work toward formally verifying the federated execution.

@Soroosh129
Copy link
Contributor Author

Here is an implementation for the queue of in-transit messages: lf-lang/reactor-c@98dd2c2

@Soroosh129 Soroosh129 changed the title Hotfix: TAN message are not working as intended in federated execution Remove TAN messages Jun 3, 2022
@Soroosh129 Soroosh129 marked this pull request as ready for review June 3, 2022 22:33
Copy link
Collaborator

@edwardalee edwardalee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Soroosh129 Soroosh129 changed the title Remove TAN messages Remove TAN messages and record in-transit messages in the RTI Jun 4, 2022
@Soroosh129 Soroosh129 merged commit f890aec into master Jun 11, 2022
@Soroosh129 Soroosh129 deleted the hotfix-C-TAN branch June 11, 2022 05:28
@lhstrh lhstrh added the bug Something isn't working label Jul 7, 2022
@lhstrh lhstrh changed the title Remove TAN messages and record in-transit messages in the RTI TAN messages no longer used and in-transit messages recorded in the RTI Jul 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants