Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Various bugfixes and cleanups in the support for federated programs #323

Merged
merged 86 commits into from
Jan 22, 2024

Conversation

edwardalee
Copy link
Contributor

@edwardalee edwardalee commented Dec 21, 2023

This replaces PR #319 and #317 with a cleaner commit history. There is a companion PR in lingua-franca.

This PR removes absent messages except where they are absolutely needed, namely in zero-delay cycles.

This PR also improves the port search done by the RTI if it cannot bind to the specified or default port. This previously did not work and resulted in federates taking a long time to fail as they searched over possible ports.

This PR also fixes two issues that can cause the RTI to deadlock during shutdown, particularly if a task is kill with a SIGINT during execution. Both of these issues had been previously found by @erlingrj, but fixed only the federate side, not on the RTI side.

The first is that if a write to a socket fails because of a broken pipe, a SIGPIPE signal is issued, and, by default, the process is terminated. The reason for this is that a common use of sockets is to pipe one process's output to another, e.g. foo | bar, and if the second process crashes, you want the first process to exit. However, in the RTI (and the federates), there is a mutex lock that is held when writing to an outgoing socket (to prevent multiple threads from writing simultaneously to the socket). So, what can happen is that the process acquires the mutex and calls write(), but then write() fails, and before it can be return, a registered termination function is called. That termination function tries to acquire the same mutex in order to close the sockets. The result is a deadlock.

The second problem is similar, but only occurs if tracing is turned on. When a process exits the termination function tries to acquire a lock to flush tracing output, but that lock may be held by a thread that has exited anomalously while in a critical section.

It is possible that either of these problems could cause CI tests to hang rather than time out. When they time out, the CI tools try to kill the process with a SIGINT, and hence could trigger one of the above deadlocks, thereby failing to kill the process. The CI job will then run for the full six hours before being cancelled.

This PR also simplifies what is done when an abnormal termination occurs, e.g. SIGINT. In particular, it avoids acquiring any mutex locks and does not bother freeing memory (the memory will be freed by the OS anyway). The reason for freeing memory on normal termination is so that valgrind or similar tools can be used to check for memory leaks, but checking for memory leaks in a program that gets terminated with Control-C is not reasonable.

This PR also fixes some problems that could occur when safe-to-process violations occur and, in case problem resurface in the future, causes the program to error out rather than to deadlock.

Prevent exiting via SIGPIPE on socket write failure and instead handle the error.
Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former.
Remove unnecessary absent messages using EIMT and EIMT_strict.
Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause.
Also, calloc instead of malloc federate info so that pointers are reliably NULL.
Also, added lf_print_error_system_failure utility to print system call error information.
Distinguish normal termination from interrupted termination and avoid mutexes in the latter and avoid lf_print in the former.
Remove unnecessary absent messages using EIMT and EIMT_strict.
Eliminate the bogus port search algorithm (which never worked) and just use DEFAULT_PORT or an override given on the command line or an `at` clause.
Replace last_time field on a trigger with last_tag.
Also, implement and use lf_print_error_system_failure to report system call errors.
@edwardalee edwardalee added enhancement Enhancement of existing feature bugfix federated labels Dec 21, 2023
@lhstrh lhstrh changed the title Federated cleanup Various bugfixes and clean ups in the support for federated programs Dec 21, 2023
Copy link
Member

@lhstrh lhstrh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for the clean up...

This was referenced Dec 21, 2023
@lhstrh lhstrh changed the title Various bugfixes and clean ups in the support for federated programs Various bugfixes and cleanups in the support for federated programs Dec 21, 2023
@petervdonovan
Copy link
Contributor

Is clock synchronization broken in this branch? It looks like here it is assumed that the return value of write_to_socket is the number of bytes written, which seems to be no longer true due to refactoring in net_util.c.

@edwardalee
Copy link
Contributor Author

Is clock synchronization broken in this branch? It looks like here it is assumed that the return value of write_to_socket is the number of bytes written, which seems to be no longer true due to refactoring in net_util.c.

My bad. I've pushed a fix. Thanks for spotting this. Not sure how I missed it.

Copy link
Contributor

@petervdonovan petervdonovan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need to review federate.c, but so far I agree that this PR is bringing many improvements.

core/federated/RTI/main.c Outdated Show resolved Hide resolved
core/federated/RTI/main.c Show resolved Hide resolved
core/federated/RTI/rti_common.c Outdated Show resolved Hide resolved
core/federated/RTI/rti_common.c Outdated Show resolved Hide resolved
core/federated/RTI/rti_common.c Outdated Show resolved Hide resolved
core/trace.c Outdated Show resolved Hide resolved
core/utils/util.c Show resolved Hide resolved
core/reactor_common.c Show resolved Hide resolved
core/reactor_common.c Show resolved Hide resolved
core/reactor_common.c Show resolved Hide resolved
Copy link
Contributor

@petervdonovan petervdonovan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, finally signing off on the review of this PR. Thanks for waiting, I wanted to make sure I understood what changed.

Overall the changes look like a clear improvement.

core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Outdated Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Show resolved Hide resolved
core/federated/federate.c Outdated Show resolved Hide resolved
@edwardalee edwardalee merged commit 36d3249 into main Jan 22, 2024
28 checks passed
@edwardalee edwardalee deleted the federated-cleanup branch January 22, 2024 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix enhancement Enhancement of existing feature federated
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants