Add redundancy to the traffic schedule node: Add monitor node, synchronisation for traffic schedule node data, and fail-over functionality #61

Merged
merged 24 commits into main from geoff/redundant-schedule-node on Jul 14, 2021

Conversation

Collaborator

@gbiggs gbiggs commented May 28, 2021

New feature implementation

Implemented feature

This PR:

  • adds a monitor node that monitors the state of the traffic schedule node,
  • adds data synchronisation from the traffic schedule node to the monitor node,
  • adds the launching of a replacement traffic schedule node, kick-started with the synchronised data, when the monitor node detects the death of the traffic schedule node,
  • ensures that the newly-created replacement traffic schedule node starts in a running state as close as possible to the state the dead node was in,
  • ensures that the newly-created replacement traffic schedule node is still under the control of ros2 launch, and
  • adds functionality to mirrors that makes them robust to lost queries and to receiving incorrect updates.

Resolves open-rmf/rmf#57.

Requires open-rmf/rmf_internal_msgs#16. Required by open-rmf/rmf_demos#49.

Implementation description

The traffic schedule node is modified to add the following functions.

  • Broadcasting of a heartbeat signal (a sketch of one way to do this follows this list).
  • Broadcasting of registered queries, along with their IDs and subscriber counts. (This information could be usefully visualised in rviz, if anyone wants to take up that challenge.)
  • Construction of a schedule node from an existing database and a set of registered queries, so that the schedule node starts in a running state as if it had already been part of a system.
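
As a rough illustration of how a heartbeat broadcast with middleware-assisted death detection can work, here is a minimal sketch using a liveliness QoS. The class name, topic name, message type, and QoS values are assumptions for illustration only, not necessarily what this PR implements.

    #include <chrono>

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>

    // Hypothetical heartbeat broadcaster. With a liveliness QoS, the middleware
    // notifies subscribers when the publisher disappears, so the monitor does
    // not need to poll for missed messages itself.
    class HeartbeatBroadcaster
    {
    public:
      explicit HeartbeatBroadcaster(rclcpp::Node& node)
      {
        rclcpp::QoS qos(1);
        qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC)
          .liveliness_lease_duration(rclcpp::Duration::from_seconds(1.0));

        publisher_ = node.create_publisher<std_msgs::msg::Empty>(
          "schedule_heartbeat", qos);  // topic name is a placeholder

        // Assert liveliness well within the lease duration so the publisher is
        // considered alive for as long as this node keeps spinning.
        timer_ = node.create_wall_timer(
          std::chrono::milliseconds(300),
          [this]() { publisher_->assert_liveliness(); });
      }

    private:
      rclcpp::Publisher<std_msgs::msg::Empty>::SharedPtr publisher_;
      rclcpp::TimerBase::SharedPtr timer_;
    };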

Additionally, some refactoring of the code that manages query topics has been done.

A monitor node has been added. This node listens to the data broadcasts from the schedule node. It contains a database mirror that receives a constantly updated copy of the database, and it also receives and stores the registered query information broadcast by the schedule node.

The monitor node listens to the heartbeat from the schedule node. When it is notified that the heartbeat has not arrived, indicating that the schedule node has probably died, it does the following to start a replacement traffic schedule node (a rough sketch of the watchdog side follows the list).

  1. Create a new schedule node.
  2. Initialise the new schedule with the data received from the old schedule node (set of registered queries and a new database forked from the monitor's mirror).
  3. Broadcast a fail-over event announcement to all entities interested in when the schedule node has died and a replacement started.
  4. Stop itself and start spinning the replacement schedule node.
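
For a flavour of how the monitor's heartbeat watchdog could be wired up, here is a minimal, self-contained sketch. The class, topic, and QoS details are illustrative assumptions, and the fail-over steps above are only indicated by comments rather than implemented.

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>

    // Hypothetical monitor-side watchdog. It subscribes to the heartbeat topic
    // with a matching liveliness QoS and reacts when the middleware reports
    // that the publisher (the schedule node) is no longer alive.
    class HeartbeatWatchdog : public rclcpp::Node
    {
    public:
      HeartbeatWatchdog()
      : Node("heartbeat_watchdog")
      {
        rclcpp::QoS qos(1);
        qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC)
          .liveliness_lease_duration(rclcpp::Duration::from_seconds(1.0));

        rclcpp::SubscriptionOptions options;
        options.event_callbacks.liveliness_callback =
          [this](rclcpp::QOSLivelinessChangedInfo& event)
          {
            if (event.alive_count == 0 && event.not_alive_count_change > 0)
            {
              RCLCPP_ERROR(
                get_logger(), "Schedule node heartbeat lost; starting fail-over");
              // Steps 1-4 above would run here: construct the replacement
              // schedule node from the mirrored database and stored queries,
              // broadcast the fail-over event, then spin the replacement.
            }
          };

        heartbeat_sub_ = create_subscription<std_msgs::msg::Empty>(
          "schedule_heartbeat", qos,
          [](std_msgs::msg::Empty::SharedPtr) {},  // the payload is irrelevant
          options);
      }

    private:
      rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr heartbeat_sub_;
    };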

Entities that use the schedule node have had functionality added to listen for the fail-over event notification from the monitor node. When they receive the notification, they reconnect to the services provided by the replacement schedule node. Without this, these user nodes would be unable to register or unregister participants and queries.
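
A minimal sketch of what that reconnection could look like on the client side. The topic name, the std_srvs/Trigger service type, and the class are placeholders standing in for the schedule's real interfaces (such as the RegisterQuery service).

    #include <memory>

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>
    #include <std_srvs/srv/trigger.hpp>

    // Hypothetical client-side reaction to the fail-over announcement: drop the
    // stale service clients and create new ones pointed at the replacement
    // schedule node's services.
    class ScheduleClientReconnector
    {
    public:
      explicit ScheduleClientReconnector(rclcpp::Node::SharedPtr node)
      : node_(std::move(node))
      {
        remake_clients();
        fail_over_sub_ = node_->create_subscription<std_msgs::msg::Empty>(
          "schedule_fail_over_event", rclcpp::SystemDefaultsQoS(),
          [this](std_msgs::msg::Empty::SharedPtr)
          {
            remake_clients();  // reconnect to the replacement node's services
          });
      }

    private:
      void remake_clients()
      {
        // std_srvs/Trigger stands in for the schedule's real service types.
        register_query_client_ =
          node_->create_client<std_srvs::srv::Trigger>("register_query");
      }

      rclcpp::Node::SharedPtr node_;
      rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr fail_over_sub_;
      rclcpp::Client<std_srvs::srv::Trigger>::SharedPtr register_query_client_;
    };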

How to test

Redundancy functionality

  1. Using open-rmf/rmf_demos#49, launch the office demonstration.
  2. Using the console or the web interface, start one or more tasks running.
    $ ros2 run rmf_demos_tasks dispatch_loop -s coe -f lounge -n 3 --use_sim_time
  3. Once the tasks are running, kill the primary traffic schedule node.
    $ ps -ax | grep rmf_traffic_schedule
    $ kill [pid of the node named rmf_traffic_schedule_primary]
    
  4. Observe in the console output that the death of the primary node is detected and handled by the monitor.
  5. Observe in rviz that the traffic schedule continues to be updated, particularly at the end of loops when a new route is posted.

Mirror recovery from lost query

  1. Build with testing on to compile the mock schedule nodes.
  2. Edit the file common.launch.xml in the rmf_demos repository to replace the launched executable rmf_traffic_schedule_node with missing_query_schedule_node.
  3. Launch the office demo.
  4. Observe the console output to see the mirrors discovering that their query is not known by the schedule node and re-registering it.

Mirror recovery from wrong query

  1. Build with testing on to compile the mock schedule nodes.
  2. Edit the file common.launch.xml in the rmf_demos repository to replace the launched executable rmf_traffic_schedule_node with wrong_query_schedule_node.
  3. Launch the office demo.
  4. Observe the console output to see the mirrors discovering that they are receiving updates for a different query than the one they registered, and re-registering their queries.

@gbiggs gbiggs requested a review from mxgrey May 28, 2021 08:21
@gbiggs gbiggs self-assigned this May 28, 2021
@gbiggs gbiggs added the enhancement (New feature or request) label May 28, 2021
@gbiggs gbiggs added this to In Review in Research & Development via automation May 28, 2021
Contributor

@mxgrey mxgrey left a comment


This is looking like a really nice, robust fail-over approach.

I have various comments about how the implementation could be improved, but most of it can wait for a future PR. The one thing I think we should really consider addressing before merging is the concern I brought up here. It probably won't be a big deal in our ordinary use cases (since we're almost always using query_all anyhow), but I don't want it to take us by surprise if anyone starts using fancy queries and their performance gets killed.

    @@ -282,11 +282,11 @@ class MirrorManager::Implementation
         register_query_request.query = convert(query);
         register_query_client->async_send_request(
           std::make_shared<RegisterQuery::Request>(register_query_request),
    -      [&](const RegisterQueryFuture response)
    +      [this](const RegisterQueryFuture response)
Contributor

Just to be clear, this is still effectively capture by reference in the sense that this is a raw pointer, making it equivalent to an unsafe reference. The original version of the code is functionally equivalent, because this pointer is always captured by value, and this-> is always implied when using a member variable/function within a lambda.
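
For illustration, here is a standalone example (not code from this PR) showing that both capture forms store the raw this pointer by value and dangle once the owning object is destroyed:

    #include <functional>
    #include <iostream>

    struct Widget
    {
      int value = 42;

      std::function<void()> make_callback()
      {
        // [&] and [this] behave identically here: each stores the raw `this`
        // pointer by value, and `value` is accessed through that pointer.
        return [this]() { std::cout << value << "\n"; };
      }
    };

    int main()
    {
      std::function<void()> callback;
      {
        Widget widget;
        callback = widget.make_callback();
      }  // widget is destroyed here
      // callback();  // would dereference the dangling `this`: undefined behaviour
    }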

Collaborator Author

What would be the correct way to safely capture here?

Contributor

There are two options for safe capturing:

  1. Every member field that needs to be used in the callback should be stored as a std::shared_ptr<T> and each relevant std::shared_ptr<T> should be copied into the capture list.
  2. The whole Implementation class should have a nested Shared class that contains all the fields, and you should copy a std::weak_ptr<Shared> into the capture list. You can find an example of this in rmf_traffic::schedule::Participant::Implementation::Shared.
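
A simplified, hypothetical sketch of option 2 follows; the nested Shared struct here is far smaller than the real rmf_traffic::schedule::Participant::Implementation::Shared, and the field name is invented for illustration.

    #include <memory>

    class Implementation
    {
    public:
      // All state the callback needs lives in a nested Shared struct that is
      // owned through a std::shared_ptr.
      struct Shared
      {
        int registered_query_id = 0;  // illustrative field
      };

      Implementation()
      : shared_(std::make_shared<Shared>())
      {}

      auto make_response_callback()
      {
        // Capture only a weak_ptr: the callback cannot keep the object alive,
        // and it can safely detect that the object has already been destroyed.
        return [weak = std::weak_ptr<Shared>(shared_)]()
        {
          const auto shared = weak.lock();
          if (!shared)
            return;  // the Implementation is gone; do nothing

          // Safe to use shared->registered_query_id etc. here.
        };
      }

    private:
      std::shared_ptr<Shared> shared_;
    };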

gbiggs added 20 commits June 30, 2021 10:20

@gbiggs gbiggs force-pushed the geoff/redundant-schedule-node branch from 7c00cdf to fd0e309 on June 30, 2021 01:23
@gbiggs
Collaborator Author

gbiggs commented Jun 30, 2021

I've rebased to fix the merge conflicts.

@gbiggs
Collaborator Author

gbiggs commented Jul 5, 2021

I've pushed a change that removes the need for a separate call to the setup() member function when constructing the ScheduleNode. Now there is a set of three constructors that fall through to each other. Calling the lowest one directly will require a separate call to setup(), while the other two will call it automatically. There is also a path using a fourth, tagged constructor to force the setup() method not to be called even when using the simplest constructor. This is required to satisfy the needs of the "wrong query" test case.
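
As a rough, simplified sketch of that constructor arrangement (the class name and arguments are placeholders; the real ScheduleNode constructors take ROS and database arguments):

    #include <string>
    #include <utility>

    class ScheduleNodeSketch
    {
    public:
      // Tag type used to request that setup() not be called automatically.
      struct NoAutomaticSetup {};

      // Lowest-level constructor: the caller must call setup() separately.
      ScheduleNodeSketch(NoAutomaticSetup, std::string database)
      : database_(std::move(database))
      {}

      // Falls through to the constructor above, then runs setup() automatically.
      explicit ScheduleNodeSketch(std::string database)
      : ScheduleNodeSketch(NoAutomaticSetup{}, std::move(database))
      {
        setup();
      }

      // Simplest constructor: starts from an empty database; setup() still runs
      // via the delegation chain.
      ScheduleNodeSketch()
      : ScheduleNodeSketch(std::string{})
      {}

      // Tagged variant of the simplest constructor: skips setup(), which is the
      // sort of path a test like the "wrong query" case needs.
      explicit ScheduleNodeSketch(NoAutomaticSetup tag)
      : ScheduleNodeSketch(tag, std::string{})
      {}

      void setup()
      {
        // Register services, start broadcasting queries and the heartbeat, etc.
      }

    private:
      std::string database_;
    };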

Contributor

@mxgrey mxgrey left a comment


I've tested this using open-rmf/rmf_demos#49 and it's working great! When I kill the original schedule node, the monitor node appears to take over seamlessly.

Unfortunately CI won't be able to pass until this is merged: open-rmf/rmf_internal_msgs#16

We have to choose between:

  1. Use admin privileges to force-merge this PR simultaneously with the rmf_internal_msgs PR.
  2. Merge the rmf_internal_msgs PR and rerun the CI of this PR and wait for it to turn green.

The problem with (2) is that the rmf_internal_msgs changes break API+ABI, which means anyone who happens to clone the repos while we wait for CI will get a failed build.

It would be nice if GitHub Actions allowed us to expose variables that we could use to set the upstream branches for specific runs, but I don't see that ability anywhere.

I'll go with option (1) and just be diligent about merging this PR as soon as the build passes.

@codecov

codecov bot commented Jul 14, 2021

Codecov Report

Merging #61 (5499f58) into main (fcb3123) will decrease coverage by 0.42%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
- Coverage   22.46%   22.03%   -0.43%     
==========================================
  Files         200      410     +210     
  Lines       16128    32872   +16744     
  Branches     7899    16058    +8159     
==========================================
+ Hits         3623     7244    +3621     
- Misses       8592    17792    +9200     
- Partials     3913     7836    +3923     
Flag     Coverage Δ
tests    22.03% <ø> (-0.43%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files
..._adapter/rmf_rxcpp/include/rmf_rxcpp/Transport.hpp
...cs/rmf_ros2/rmf_fleet_adapter_python/src/tests.cpp
...adapter/src/rmf_fleet_adapter/phases/DockRobot.cpp
...rmf_traffic_ros2/src/rmf_traffic_blockade/main.cpp
.../rmf_task_ros2/src/rmf_task_ros2/action/Server.cpp
...ffic_ros2/src/rmf_traffic_ros2/blockade/Writer.cpp
...rxcpp/RxCpp-4.1.0/Rx/v2/src/rxcpp/rx-operators.hpp
...s2/src/rmf_traffic_ros2/schedule/convert_Patch.cpp
...xcpp/RxCpp-4.1.0/Rx/v2/src/rxcpp/rx-observable.hpp
...2/rmf_traffic_ros2/src/update_participant/main.cpp
... and 600 more

@mxgrey mxgrey merged commit 41ede21 into main Jul 14, 2021
@mxgrey mxgrey deleted the geoff/redundant-schedule-node branch July 14, 2021 07:11
Research & Development automation moved this from In Review to Done Jul 14, 2021
@cwrx777
Contributor

cwrx777 commented Aug 17, 2021

Hi,

When the original schedule node (primary) comes back online, does the monitor node become the backup again and resume its monitoring role?

@mxgrey
Contributor

mxgrey commented Sep 24, 2021

Sorry, I just noticed this question was never answered.

In the current out-of-the-box implementation, the monitor node will simply become the new schedule node.

That being said, we've tried to design the library so that many different strategies could be developed with the monitor node. It was made to be a reusable class so you could conceivably write an application whose callback forks the process into:

  1. spinning the newly created schedule node
  2. starting up a new monitor node that's ready to fail over again

That said, we didn't design it to work this way automatically, because we expect the most likely cause of a fail-over to be loss of a network connection, which means it wouldn't be very helpful for the monitor node to live on the same server as the active schedule node. Instead, we anticipate strategies where the monitor node always runs on a different server than the schedule node.

There are also open questions about whether and how multiple monitor nodes should run on different servers, and if so, how they would decide which one becomes the active schedule node. We have some ideas floating around, but we haven't converged on one right way to handle that.

arjo129 pushed a commit that referenced this pull request Oct 12, 2021
Add redundancy to the traffic schedule node: Add monitor node, synchronisation for traffic schedule node data, and fail-over functionality (#61)

luca-della-vedova pushed a commit that referenced this pull request Jan 10, 2023