Add redundancy to the traffic schedule node: Add monitor node, synchronisation for traffic schedule node data, and fail-over functionality #61
Conversation
This is looking like a really nice, robust fail-over approach.

I have various comments about how the implementation could be improved, but most of it can wait for a future PR. The one thing I think we should really consider addressing before merging is the concern I brought up here. It probably won't be a big deal in our ordinary use cases (since we're almost always using `query_all` anyhow), but I don't want it to take us by surprise if anyone starts using fancy queries and their performance gets killed.
```diff
@@ -282,11 +282,11 @@ class MirrorManager::Implementation
   register_query_request.query = convert(query);
   register_query_client->async_send_request(
     std::make_shared<RegisterQuery::Request>(register_query_request),
-    [&](const RegisterQueryFuture response)
+    [this](const RegisterQueryFuture response)
```
Just to be clear, this is still effectively capture by reference, in the sense that `this` is a raw pointer, making it equivalent to an unsafe reference. The original version of the code is functionally equivalent, because the `this` pointer is always captured by value, and `this->` is always implied when using a member variable or function within a lambda.
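To illustrate why capturing `this` (whether via `[&]` or `[this]`) can be unsafe, here is a minimal, self-contained example that is not taken from this codebase; `Widget` and `make_callback` are made-up names. If the callback outlives the object, the captured pointer dangles:

```cpp
#include <functional>
#include <iostream>

struct Widget  // hypothetical class, purely for illustration
{
  int value = 42;

  std::function<void()> make_callback()
  {
    // [&] and [this] are equivalent here: both capture the raw `this`
    // pointer, and `value` below really means `this->value`.
    return [this]() { std::cout << value << std::endl; };
  }
};

int main()
{
  std::function<void()> cb;
  {
    Widget w;
    cb = w.make_callback();
  }  // w is destroyed here, so the captured `this` pointer now dangles

  cb();  // undefined behavior: the lambda dereferences a dangling pointer
  return 0;
}
```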
What would be the correct way to safely capture here?
There are two options for safe capturing:

1. Every member field that needs to be used in the callback should be stored as a `std::shared_ptr<T>`, and each relevant `std::shared_ptr<T>` should be copied into the capture list.
2. The whole `Implementation` class should have a nested `Shared` class that contains all the fields, and you should copy a `std::weak_ptr<Shared>` into the capture list. You can find an example of this in `rmf_traffic::schedule::Participant::Implementation::Shared`.
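A minimal sketch of option 2, using made-up names (`Implementation`, `Shared`, `handle`) rather than the actual `rmf_traffic` types, just to illustrate the capture pattern:

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <string>

class Implementation  // hypothetical, not the real rmf_traffic class
{
private:
  // All state touched by asynchronous callbacks lives in Shared so that
  // its lifetime can be tracked with shared_ptr/weak_ptr.
  struct Shared
  {
    std::string last_response;

    void handle(const std::string& response)
    {
      last_response = response;
      std::cout << "handled: " << response << std::endl;
    }
  };

  std::shared_ptr<Shared> shared_ = std::make_shared<Shared>();

public:
  std::function<void(const std::string&)> make_callback() const
  {
    // Copy a weak_ptr into the capture list instead of the raw `this`.
    std::weak_ptr<Shared> weak = shared_;
    return [weak](const std::string& response)
      {
        // If the Implementation (and its Shared state) has already been
        // destroyed, lock() returns nullptr and the callback does nothing.
        if (const auto shared = weak.lock())
          shared->handle(response);
      };
  }
};
```

The same idea would apply to the `RegisterQuery` callback above: a captured `weak_ptr` keeps the callback safe even if the service response arrives after the owning object has been destroyed.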
I've rebased to fix the merge conflicts.
I've pushed a change that removes the need for a separate call to the
I've tested this using open-rmf/rmf_demos#49 and it's working great! When I kill the original schedule node, the monitor node appears to take over seamlessly.
Unfortunately CI won't be able to pass until this is merged: open-rmf/rmf_internal_msgs#16
We have to choose between:

1. Use admin privileges to force-merge this PR simultaneously with the `rmf_internal_msgs` PR.
2. Merge the `rmf_internal_msgs` PR, then rerun the CI of this PR and wait for it to turn green.

The problem with (2) is that the `rmf_internal_msgs` changes break API+ABI, which means anyone who happens to clone the repos while we wait for CI will get a failed build.
It would be nice if GitHub Actions allowed us to expose variables that we could use to set the upstream branches for specific runs, but I don't see that ability anywhere.

I'll go with option (1) and just be diligent about merging this PR as soon as the build passes.
Hi, when the original schedule node (the primary) comes back online, does the monitor node become the backup node again, resuming its monitoring role?
Sorry, I just noticed this question was never answered. In the current out-of-the-box implementation, the monitor node will simply become the new schedule node. That said, we've tried to design the library so that many different strategies could be developed with the monitor node. It was made to be a reusable class, so you could conceivably write an application whose fail-over callback forks the process (a rough sketch of that idea follows).
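As a very rough sketch of that forking idea on a POSIX system (all names here, `run_schedule_node`, `run_monitor_node`, and `on_fail_over`, are hypothetical placeholders, not APIs provided by this PR), the callback might look something like this:

```cpp
#include <unistd.h>

#include <cstdlib>

// Placeholder roles; a real application would hook these up to the actual
// schedule-node and monitor-node implementations.
void run_schedule_node() { /* promote this process to the active schedule node */ }
void run_monitor_node() { /* watch the newly promoted schedule node */ }

// One conceivable fail-over callback: fork so the parent takes over as the
// schedule node while the child becomes a fresh monitor for it.
void on_fail_over()
{
  const pid_t pid = fork();
  if (pid < 0)
  {
    run_schedule_node();  // fork failed; at least take over the schedule role
  }
  else if (pid == 0)
  {
    run_monitor_node();   // child process: monitor the new schedule node
    std::exit(0);
  }
  else
  {
    run_schedule_node();  // parent process: become the active schedule node
  }
}
```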
That being said, we didn't design it to work this way automatically, because we expect the most likely cause of a fail-over to be the loss of a network connection, which means it wouldn't be very helpful for the monitor node to exist on the same server as the active schedule node. Instead, we anticipate strategies where the monitor node always runs on a different server than the schedule node. There are also open questions about if and how multiple monitor nodes should run on multiple different servers, and, if so, how they would decide which one becomes the active schedule node. We have some ideas floating around, but we haven't converged on one right way to handle that.
New feature implementation
Implemented feature
This PR:
Resolves open-rmf/rmf#57.
Requires open-rmf/rmf_internal_msgs#16. Required by open-rmf/rmf_demos#49.
Implementation description
The traffic schedule node has been modified to add several new functions. Additionally, some of the code that manages query topics has been refactored.

A monitor node has been added. This node listens to the data broadcasts from the schedule node. It contains a database mirror that receives a constantly-updated copy of the database, and it also receives and stores the registered query data from the schedule node.

The monitor node listens to the heartbeat from the schedule node. When it receives a notification that the heartbeat has not arrived, indicating that the schedule node has probably died, it performs the steps needed to start a replacement traffic schedule node.

Entities that use the schedule node have had functionality added to listen for the fail-over event notification from the monitor node. When they receive a notification, they reconnect to the services provided by the replacement schedule node. If this were not done, these user nodes would not be able to register or unregister participants and queries.
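For a sense of how heartbeat detection could work, here is a hypothetical sketch using ROS 2 liveliness QoS events; the topic name, message type, and QoS settings are assumptions for illustration and are not necessarily what this PR implements:

```cpp
#include <rclcpp/rclcpp.hpp>
#include <std_msgs/msg/empty.hpp>

#include <memory>

// Hypothetical monitor-side heartbeat watcher. The schedule node would need
// to offer a publisher with compatible liveliness QoS on the same topic.
class HeartbeatWatcher : public rclcpp::Node
{
public:
  HeartbeatWatcher()
  : Node("heartbeat_watcher")
  {
    rclcpp::QoS qos(1);
    qos.liveliness(RMW_QOS_POLICY_LIVELINESS_AUTOMATIC)
      .liveliness_lease_duration(rmw_time_t{1, 0});  // 1 second lease

    rclcpp::SubscriptionOptions options;
    options.event_callbacks.liveliness_callback =
      [this](rclcpp::QOSLivelinessChangedInfo& event)
      {
        // alive_count drops to zero when the publisher's liveliness lease
        // expires, i.e. the schedule node has probably died.
        if (event.alive_count == 0)
        {
          RCLCPP_ERROR(
            get_logger(), "Heartbeat lost; starting fail-over sequence");
          // ...start the replacement schedule node here...
        }
      };

    heartbeat_sub_ = create_subscription<std_msgs::msg::Empty>(
      "schedule_heartbeat", qos,
      [](std_msgs::msg::Empty::SharedPtr) { /* heartbeat payload is unused */ },
      options);
  }

private:
  rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr heartbeat_sub_;
};

int main(int argc, char** argv)
{
  rclcpp::init(argc, argv);
  rclcpp::spin(std::make_shared<HeartbeatWatcher>());
  rclcpp::shutdown();
  return 0;
}
```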
How to test
Redundancy functionality
ros2 run rmf_demos_tasks dispatch_loop -s coe -f lounge -n 3 --use_sim_time
Mirror recovery from lost query
Edit `common.launch.xml` in the `rmf_demos` repository to replace the launched executable with `missing_query_schedule_node`.

Mirror recovery from wrong query

Edit `common.launch.xml` in the `rmf_demos` repository to replace the launched executable `rmf_traffic_schedule_node` with `wrong_query_schedule_node`.