Add redundancy to the traffic schedule node: Add monitor node, synchronisation for traffic schedule node data, and fail-over functionality #61

Merged
merged 24 commits into main from geoff/redundant-schedule-node on Jul 14, 2021

Conversation

Collaborator

@gbiggs gbiggs commented May 28, 2021

New feature implementation

Implemented feature

This PR:

  • adds a monitor node that monitors the state of the traffic schedule node,
  • adds data synchronisation from the traffic schedule node to the monitor node,
  • adds the launching of a replacement traffic schedule node, kick-started with the synchronised data, when the monitor node detects the death of the traffic schedule node,
  • ensures that the newly-created replacement traffic schedule node starts in a running state as close as possible to the state the dead node was in,
  • ensures that the newly-created replacement traffic schedule node is still under the control of ros2 launch, and
  • adds functionality to mirrors that makes them robust to lost queries and to receiving incorrect updates.

Resolves open-rmf/rmf#57.

Requires open-rmf/rmf_internal_msgs#16. Required by open-rmf/rmf_demos#49.

Implementation description

The traffic schedule node is modified to add the following functions.

  • Broadcasting of a heartbeat signal (a sketch of one way to do this follows this list).
  • Broadcasting of registered queries, along with their IDs and subscriber counts. (This information could be usefully visualised in rviz, if anyone wants to take up that challenge.)
  • Construction of a schedule node from an existing database and a set of registered queries, so that the schedule node starts in a running state as if it had already been part of a system.
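
As a rough illustration of how a heartbeat broadcast with middleware-assisted death detection can work, here is a minimal sketch using a liveliness QoS. The class name, topic name, message type, and QoS values are assumptions for illustration only, not necessarily what this PR implements.

    #include <chrono>

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>

    // Hypothetical heartbeat broadcaster. With a liveliness QoS, the middleware
    // notifies subscribers when the publisher disappears, so the monitor does
    // not need to poll for missed messages itself.
    class HeartbeatBroadcaster
    {
    public:
      explicit HeartbeatBroadcaster(rclcpp::Node& node)
      {
        rclcpp::QoS qos(1);
        qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC)
          .liveliness_lease_duration(rclcpp::Duration::from_seconds(1.0));

        publisher_ = node.create_publisher<std_msgs::msg::Empty>(
          "schedule_heartbeat", qos);  // topic name is a placeholder

        // Assert liveliness well within the lease duration so the publisher is
        // considered alive for as long as this node keeps spinning.
        timer_ = node.create_wall_timer(
          std::chrono::milliseconds(300),
          [this]() { publisher_->assert_liveliness(); });
      }

    private:
      rclcpp::Publisher<std_msgs::msg::Empty>::SharedPtr publisher_;
      rclcpp::TimerBase::SharedPtr timer_;
    };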

Additionally, some refactoring of the code that manages query topics has been done.

A monitor node has been added. This node listens to the data broadcasts from the schedule node. It contains a database mirror that receives a constantly updated copy of the database, and it also receives and stores the registered query information broadcast by the schedule node.

The monitor node listens to the heartbeat from the schedule node. When it is notified that the heartbeat has not arrived, indicating that the schedule node has probably died, it does the following to start a replacement traffic schedule node (a rough sketch of the watchdog side follows the list).

  1. Create a new schedule node.
  2. Initialise the new schedule with the data received from the old schedule node (set of registered queries and a new database forked from the monitor's mirror).
  3. Broadcast a fail-over event announcement to all entities interested in when the schedule node has died and a replacement started.
  4. Stop itself and start spinning the replacement schedule node.
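
For a flavour of how the monitor's heartbeat watchdog could be wired up, here is a minimal, self-contained sketch. The class, topic, and QoS details are illustrative assumptions, and the fail-over steps above are only indicated by comments rather than implemented.

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>

    // Hypothetical monitor-side watchdog. It subscribes to the heartbeat topic
    // with a matching liveliness QoS and reacts when the middleware reports
    // that the publisher (the schedule node) is no longer alive.
    class HeartbeatWatchdog : public rclcpp::Node
    {
    public:
      HeartbeatWatchdog()
      : Node("heartbeat_watchdog")
      {
        rclcpp::QoS qos(1);
        qos.liveliness(RMW_QOS_POLICY_LIVELINESS_MANUAL_BY_TOPIC)
          .liveliness_lease_duration(rclcpp::Duration::from_seconds(1.0));

        rclcpp::SubscriptionOptions options;
        options.event_callbacks.liveliness_callback =
          [this](rclcpp::QOSLivelinessChangedInfo& event)
          {
            if (event.alive_count == 0 && event.not_alive_count_change > 0)
            {
              RCLCPP_ERROR(
                get_logger(), "Schedule node heartbeat lost; starting fail-over");
              // Steps 1-4 above would run here: construct the replacement
              // schedule node from the mirrored database and stored queries,
              // broadcast the fail-over event, then spin the replacement.
            }
          };

        heartbeat_sub_ = create_subscription<std_msgs::msg::Empty>(
          "schedule_heartbeat", qos,
          [](std_msgs::msg::Empty::SharedPtr) {},  // the payload is irrelevant
          options);
      }

    private:
      rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr heartbeat_sub_;
    };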

Entities that use the schedule node have had functionality added to listen for the fail-over event notification from the monitor node. When they receive the notification, they reconnect to the services provided by the replacement schedule node. Without this, these user nodes would be unable to register or unregister participants and queries.
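
A minimal sketch of what that reconnection could look like on the client side. The topic name, the std_srvs/Trigger service type, and the class are placeholders standing in for the schedule's real interfaces (such as the RegisterQuery service).

    #include <memory>

    #include <rclcpp/rclcpp.hpp>
    #include <std_msgs/msg/empty.hpp>
    #include <std_srvs/srv/trigger.hpp>

    // Hypothetical client-side reaction to the fail-over announcement: drop the
    // stale service clients and create new ones pointed at the replacement
    // schedule node's services.
    class ScheduleClientReconnector
    {
    public:
      explicit ScheduleClientReconnector(rclcpp::Node::SharedPtr node)
      : node_(std::move(node))
      {
        remake_clients();
        fail_over_sub_ = node_->create_subscription<std_msgs::msg::Empty>(
          "schedule_fail_over_event", rclcpp::SystemDefaultsQoS(),
          [this](std_msgs::msg::Empty::SharedPtr)
          {
            remake_clients();  // reconnect to the replacement node's services
          });
      }

    private:
      void remake_clients()
      {
        // std_srvs/Trigger stands in for the schedule's real service types.
        register_query_client_ =
          node_->create_client<std_srvs::srv::Trigger>("register_query");
      }

      rclcpp::Node::SharedPtr node_;
      rclcpp::Subscription<std_msgs::msg::Empty>::SharedPtr fail_over_sub_;
      rclcpp::Client<std_srvs::srv::Trigger>::SharedPtr register_query_client_;
    };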

How to test

Redundancy functionality

  1. Using open-rmf/rmf_demos#49, launch the office demonstration.
  2. Using the console or the web interface, start one or more tasks running.
    $ ros2 run rmf_demos_tasks dispatch_loop -s coe -f lounge -n 3 --use_sim_time
  3. Once the tasks are running, kill the primary traffic schedule node.
    $ ps -ax | grep rmf_traffic_schedule
    $ kill [pid of the node named rmf_traffic_schedule_primary]
    
  4. Observe in the console output that the death of the primary node is detected and handled by the monitor.
  5. Observe in rviz that the traffic schedule continues to be updated, particularly at the end of loops when a new route is posted.

Mirror recovery from lost query

  1. Build with testing on to compile the mock schedule nodes.
  2. Edit the file common.launch.xml in the rmf_demos repository to replace the launched executable rmf_traffic_schedule_node with missing_query_schedule_node.
  3. Launch the office demo.
  4. Observe the console output to see the mirrors discovering that their query is not known by the schedule node and re-registering it.

Mirror recovery from wrong query

  1. Build with testing on to compile the mock schedule nodes.
  2. Edit the file common.launch.xml in the rmf_demos repository to replace the launched executable rmf_traffic_schedule_node with wrong_query_schedule_node.
  3. Launch the office demo.
  4. Observe the console output to see the mirrors discovering that they are receiving updates for a different query than the one they registered, and re-registering their queries.

@gbiggs gbiggs requested a review from mxgrey May 28, 2021 08:21
@gbiggs gbiggs self-assigned this May 28, 2021
@gbiggs gbiggs added the enhancement (New feature or request) label May 28, 2021
@gbiggs gbiggs added this to In Review in Research & Development via automation May 28, 2021
Contributor

@mxgrey mxgrey left a comment


This is looking like a really nice, robust fail-over approach.

I have various comments about how the implementation could be improved, but most of it can wait for a future PR. The one thing I think we should really consider addressing before merging is the concern I brought up here. It probably won't be a big deal in our ordinary use cases (since we're almost always using query_all anyhow), but I don't want it to take us by surprise if anyone starts using fancy queries and their performance gets killed.

    @@ -282,11 +282,11 @@ class MirrorManager::Implementation
         register_query_request.query = convert(query);
         register_query_client->async_send_request(
           std::make_shared<RegisterQuery::Request>(register_query_request),
    -      [&](const RegisterQueryFuture response)
    +      [this](const RegisterQueryFuture response)
Contributor

Just to be clear, this is still effectively capture by reference in the sense that this is a raw pointer, making it equivalent to an unsafe reference. The original version of the code is functionally equivalent, because this pointer is always captured by value, and this-> is always implied when using a member variable/function within a lambda.
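
For illustration, here is a standalone example (not code from this PR) showing that both capture forms store the raw this pointer by value and dangle once the owning object is destroyed:

    #include <functional>
    #include <iostream>

    struct Widget
    {
      int value = 42;

      std::function<void()> make_callback()
      {
        // [&] and [this] behave identically here: each stores the raw `this`
        // pointer by value, and `value` is accessed through that pointer.
        return [this]() { std::cout << value << "\n"; };
      }
    };

    int main()
    {
      std::function<void()> callback;
      {
        Widget widget;
        callback = widget.make_callback();
      }  // widget is destroyed here
      // callback();  // would dereference the dangling `this`: undefined behaviour
    }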

Collaborator Author

What would be the correct way to safely capture here?

Contributor

There are two options for safe capturing:

  1. Every member field that needs to be used in the callback should be stored as a std::shared_ptr<T> and each relevant std::shared_ptr<T> should be copied into the capture list.
  2. The whole Implementation class should have a nested Shared class that contains all the fields, and you should copy a std::weak_ptr<Shared> into the capture list. You can find an example of this in rmf_traffic::schedule::Participant::Implementation::Shared.
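
A simplified, hypothetical sketch of option 2 follows; the nested Shared struct here is far smaller than the real rmf_traffic::schedule::Participant::Implementation::Shared, and the field name is invented for illustration.

    #include <memory>

    class Implementation
    {
    public:
      // All state the callback needs lives in a nested Shared struct that is
      // owned through a std::shared_ptr.
      struct Shared
      {
        int registered_query_id = 0;  // illustrative field
      };

      Implementation()
      : shared_(std::make_shared<Shared>())
      {}

      auto make_response_callback()
      {
        // Capture only a weak_ptr: the callback cannot keep the object alive,
        // and it can safely detect that the object has already been destroyed.
        return [weak = std::weak_ptr<Shared>(shared_)]()
        {
          const auto shared = weak.lock();
          if (!shared)
            return;  // the Implementation is gone; do nothing

          // Safe to use shared->registered_query_id etc. here.
        };
      }

    private:
      std::shared_ptr<Shared> shared_;
    };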

gbiggs added 20 commits June 30, 2021 10:20

@gbiggs gbiggs force-pushed the geoff/redundant-schedule-node branch from 7c00cdf to fd0e309 on June 30, 2021 01:23
@gbiggs
Collaborator Author

gbiggs commented Jun 30, 2021

I've rebased to fix the merge conflicts.

@gbiggs
Collaborator Author

gbiggs commented Jul 5, 2021

I've pushed a change that removes the need for a separate call to the setup() member function when constructing the ScheduleNode. Now there is a set of three constructors that fall through to each other. Calling the lowest one directly will require a separate call to setup(), while the other two will call it automatically. There is also a path using a fourth, tagged constructor to force the setup() method not to be called even when using the simplest constructor. This is required to satisfy the needs of the "wrong query" test case.
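
As a rough, simplified sketch of that constructor arrangement (the class name and arguments are placeholders; the real ScheduleNode constructors take ROS and database arguments):

    #include <string>
    #include <utility>

    class ScheduleNodeSketch
    {
    public:
      // Tag type used to request that setup() not be called automatically.
      struct NoAutomaticSetup {};

      // Lowest-level constructor: the caller must call setup() separately.
      ScheduleNodeSketch(NoAutomaticSetup, std::string database)
      : database_(std::move(database))
      {}

      // Falls through to the constructor above, then runs setup() automatically.
      explicit ScheduleNodeSketch(std::string database)
      : ScheduleNodeSketch(NoAutomaticSetup{}, std::move(database))
      {
        setup();
      }

      // Simplest constructor: starts from an empty database; setup() still runs
      // via the delegation chain.
      ScheduleNodeSketch()
      : ScheduleNodeSketch(std::string{})
      {}

      // Tagged variant of the simplest constructor: skips setup(), which is the
      // sort of path a test like the "wrong query" case needs.
      explicit ScheduleNodeSketch(NoAutomaticSetup tag)
      : ScheduleNodeSketch(tag, std::string{})
      {}

      void setup()
      {
        // Register services, start broadcasting queries and the heartbeat, etc.
      }

    private:
      std::string database_;
    };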

Contributor

@mxgrey mxgrey left a comment


I've tested this using open-rmf/rmf_demos#49 and it's working great! When I kill the original schedule node, the monitor node appears to take over seamlessly.

Unfortunately CI won't be able to pass until this is merged: open-rmf/rmf_internal_msgs#16

We have to choose between:

  1. Use admin privileges to force-merge this PR simultaneously with the rmf_internal_msgs PR.
  2. Merge the rmf_internal_msgs PR and rerun the CI of this PR and wait for it to turn green.

The problem with (2) is that the rmf_internal_msgs changes break API+ABI, which means anyone who happens to clone the repos while we wait for CI will get a failed build.

It would be nice if GitHub Actions allowed us to expose variables that we could use to set the upstream branches for specific runs, but I don't see that ability anywhere.

I'll go with option (1) and just be diligent about merging this PR as soon as the build passes.

@codecov

codecov bot commented Jul 14, 2021

Codecov Report

Merging #61 (5499f58) into main (fcb3123) will decrease coverage by 0.42%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main      #61      +/-   ##
==========================================
- Coverage   22.46%   22.03%   -0.43%     
==========================================
  Files         200      410     +210     
  Lines       16128    32872   +16744     
  Branches     7899    16058    +8159     
==========================================
+ Hits         3623     7244    +3621     
- Misses       8592    17792    +9200     
- Partials     3913     7836    +3923     
Flag     Coverage Δ
tests    22.03% <ø> (-0.43%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files
..._adapter/rmf_rxcpp/include/rmf_rxcpp/Transport.hpp
...cs/rmf_ros2/rmf_fleet_adapter_python/src/tests.cpp
...adapter/src/rmf_fleet_adapter/phases/DockRobot.cpp
...rmf_traffic_ros2/src/rmf_traffic_blockade/main.cpp
.../rmf_task_ros2/src/rmf_task_ros2/action/Server.cpp
...ffic_ros2/src/rmf_traffic_ros2/blockade/Writer.cpp
...rxcpp/RxCpp-4.1.0/Rx/v2/src/rxcpp/rx-operators.hpp
...s2/src/rmf_traffic_ros2/schedule/convert_Patch.cpp
...xcpp/RxCpp-4.1.0/Rx/v2/src/rxcpp/rx-observable.hpp
...2/rmf_traffic_ros2/src/update_participant/main.cpp
... and 600 more

@mxgrey mxgrey merged commit 41ede21 into main Jul 14, 2021
@mxgrey mxgrey deleted the geoff/redundant-schedule-node branch July 14, 2021 07:11
Research & Development automation moved this from In Review to Done Jul 14, 2021
@cwrx777
Contributor

cwrx777 commented Aug 17, 2021

Hi,

When the original schedule node (primary) comes back online, does the monitor node become the backup again and resume its monitoring role?

@mxgrey
Contributor

mxgrey commented Sep 24, 2021

Sorry, I just noticed this question was never answered.

In the current out-of-the-box implementation, the monitor node will simply become the new schedule node.

That being said, we've tried to design the library so that many different strategies could be developed with the monitor node. It was made to be a reusable class so you could conceivably write an application whose callback forks the process into:

  1. spinning the newly created schedule node
  2. starting up a new monitor node that's ready to fail over again

That said, we didn't design it to work this way automatically, because we expect the most likely cause of a fail-over to be loss of a network connection, which means it wouldn't be very helpful for the monitor node to live on the same server as the active schedule node. Instead, we anticipate strategies where the monitor node always runs on a different server than the schedule node.

There are also open questions about whether and how multiple monitor nodes should run on different servers, and if so, how they would decide which one becomes the active schedule node. We have some ideas floating around, but we haven't converged on one right way to handle that.

arjo129 pushed a commit that referenced this pull request Oct 12, 2021
Add redundancy to the traffic schedule node: Add monitor node, synchronisation for traffic schedule node data, and fail-over functionality (#61)

luca-della-vedova pushed a commit that referenced this pull request Jan 10, 2023