Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fuse -> ROS 2 fuse_core : Nodes and Waitables #284

Merged
merged 17 commits into from Nov 12, 2022

Conversation

methylDragon
Copy link
Collaborator

@methylDragon methylDragon commented Nov 7, 2022

See: #276
This depends on: #283

  • When merging this, @methylDragon should check that it doesn't lead to Time regressions...

Description

This PR ports the ROS 1 nodes to ROS 2 node pointers, and also uses Brett's solution for CallbackWrappers and Waitables to support async behavior.

Whether it works or not remains to be seen, we'll likely need to circle back if we see deadlocks happening.

Some node handles have been replaced with node pointers or node interfaces. A future PR will introduce the concept of a NodeInterfaceHandle and change the signatures to support it.

PR Layout

Because I was cherry-picking Brett's commits here, I'd recommend reviewing the PR by looking at the merged changes instead of per commit.

Additional Notes

All subordinate plugins (e.g. sensors, motion models, etc.) now start an internal node using the default global context. We need to first make sure that configuration works first before we try to improve it, so it's a place that's likely to be circled back to.. Hopefully...

Pinging @svwilliams for visibility.

methylDragon and others added 7 commits November 7, 2022 13:57
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
Signed-off-by: methylDragon <methylDragon@gmail.com>
Authored-by: Brett Downing
@methylDragon methylDragon changed the title fuse -> ROS 2 : Nodes and Waitables fuse -> ROS 2 fuse_core : Nodes and Waitables Nov 7, 2022
Signed-off-by: methylDragon <methylDragon@gmail.com>
@methylDragon
Copy link
Collaborator Author

methylDragon commented Nov 7, 2022

fuse_core builds if #283 is merged and this is rebased (with conflicts resolved)
I'm going to mark this PR as ready.

@methylDragon methylDragon marked this pull request as ready for review November 7, 2022 23:47
@sloretz
Copy link
Collaborator

sloretz commented Nov 9, 2022

fuse_core builds if #283 is merged and this is rebased (with conflicts resolved)

🎉 That's great! That means it's a great time to start re-enabling tests. I opened a PR against the ros-params branch to enable the gtest ones in methylDragon#1 . Once #283 merges, mind cherry-picking that commit to this branch?

Is enough of fuse_core ported with this PR to enable all of the tests? (the rest of the tests will need to be migrated to launch_pytest).

@methylDragon
Copy link
Collaborator Author

fuse_core builds if #283 is merged and this is rebased (with conflicts resolved)

tada That's great! That means it's a great time to start re-enabling tests. I opened a PR against the ros-params branch to enable the gtest ones in methylDragon#1 . Once #283 merges, mind cherry-picking that commit to this branch?

Is enough of fuse_core ported with this PR to enable all of the tests? (the rest of the tests will need to be migrated to launch_pytest).

This PR should migrate everything except ceres options (and its interactions with ROS 2 parameters)
I was gonna get the params first, but that might take awhile, so enabling tests first and fixing them probably makes more sense

fuse_core/CMakeLists.txt Show resolved Hide resolved
fuse_core/include/fuse_core/async_motion_model.h Outdated Show resolved Hide resolved
ros::NodeHandle private_node_handle_; //!< A node handle in the private namespace using the local callback queue
ros::AsyncSpinner spinner_; //!< A single/multi-threaded spinner assigned to the local callback queue

rclcpp::Node::SharedPtr node_; //!< The node for this motion model
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the node is unused except for adding the waitable to the executor.

Maybe we could pass the node or node interfaces via MotionModel::Initialize()? That eliminates the extra context and node, and it looks doable higher in the stack where MotionModel::initialize() is called by an Optimizer which does have a Node.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node here is mostly used to structure parameters in the config files
It's reasonable to run the graph with dozens of named motion_models, and each one needs a unique name and parameters, each motion model can be associated with a different physical component that may or may not be included in any given deployment.

The separated node also allows complex motion_models (usually associated with a single robot chassis) to be targeted by ros cli tools as a separate entity in the ROS graph

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BrettRD would you say that it's preferable then to keep the nodes separated? This was one of the places I planned to circle back on to try to combine it all into a single node (with the iffy bit being what happens when all the callback queues are merged), but the point you raised about CLI tools is pretty valid..

I'm not sure of the implications on lifecycle nodes and node components though.

Copy link
Collaborator

@BrettRD BrettRD Nov 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what's better. (edit: it's a pretty huge API divergence to change the shape of Fuse on the node graph)

Have a think about sensor_models and fuse_publishers too, they will need to interact with sensor nodes and planning tools that will definitely appear under ros namespaces, so it would be nice for a plugin to exist in the node graph under the relevant namespace

One node wouldn't be a disaster.
You can still isolate device-specific configs into separate conf-only packages and load multiple config files into the optimiser node at launch.
However, if it's one node, it's going to make a blob in the node graph so dense it'll become a running joke. edit: (I have this exact problem exposing gstreamer properties on a pipeline node, it's not unmanageable)

I was expecting to run separate nodes and have to do some fancy footwork to allow fuse_optimiser to side-load its plugins into its host composable node container, and then have a lot of trouble getting access to the parameters.
Either that, or have fuse_optimiser inherit from container, and cause a headache when any other package tries to do the same
I haven't studied lifecycle nodes enough to comment there

I don't think the callback queues will be a problem; executors are built to cope with multiple queues, and most of the fuse plugins are sharing the one optimiser queue

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The node here is mostly used to structure parameters in the config files
It's reasonable to run the graph with dozens of named motion_models, and each one needs a unique name and parameters, each motion model can be associated with a different physical component that may or may not be included in any given deployment.

I think we can still structure parameters in the config file by prepending name_ + '.param_name' to each parameter name.

The separated node also allows complex motion_models (usually associated with a single robot chassis) to be targeted by ros cli tools as a separate entity in the ROS graph

I can think of a few reasons to avoid multiple nodes, but I'm not familiar enough with fuse to judge how useful it would be to a node per async motion model/publisher/sensor model. @svwilliams what are your thoughts?

Reasons to avoid multiple nodes:

  • Nodes come with extra overhead like a graph guard condition, a rosout publisher, a parameter events publisher, and services for setting and getting parameters.
  • Atomically changing parameters is scoped to a single node. A single node would allow someone to change parameters to two motion models atomically, such that if one failed (maybe from an invalid value) none of the changes would go through
  • Building the nodes internally prevents choosing whether they use regular or lifecycle nodes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where my lack of experience with ROS2 is going to hinder me offering sound advice.

I can discuss the design intent in ROS1 and see if that helps.

  • The motion models and sensors models act very node-like. They subscribe to sensor topics, process the data, and "publish" the results for consumption by the optimizer.
  • The reason the sensor models and motion models are not stand-alone nodes or nodelets in ROS1 is the each model "publishes" a derived class object instead of a message. Each model creates derived Constraints and Variables for use by the Optimizer. New user-derived models may create Constraints and Variables the Optimizer node cannot know about at compile-time. I could not think of an elegant solution to transmit derived objects using the standard pub-sub mechanisms in ROS1.
  • Using ROS1 plugins, however, I can define an interface to the Optimizer that sends derived object pointers around, so that was what I implemented.
  • But there is a downside to the plugin system. The Optimizer class creates the sensor and motion model instances as class variables. The Optimizer can call methods of the loaded plugins, so it would be easy to poll each loaded sensor and motion model for new data. But the sensor models and motion models are producing data for the Optimizer. In an ideal world, the models would decide for themselves when they had new data and would push that into the Optimizer. It is difficult for the plugins to call methods of the Optimizer.
  • To solve the polling/pushing issues, there are a series callbacks registered between the Optimizer and the Models. This results in rather complex threading implications.
  • To hide all all of that complexity, the AsyncModels each run their own thread(s) and callback spinner. This makes them act a lot like a conventional ROS node or nodelet. Each derived Model can define the number of threads used to service its callbacks, it gets a private node handle in its namespace for parameter reading, etc.

Does any of that help?

fuse_core/include/fuse_core/callback_wrapper.h Outdated Show resolved Hide resolved


private:
std::recursive_mutex reentrant_mutex_; //!< mutex to allow this callback to be added to multiple callback groups simultaneously
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does "multiple callback groups simultaneously" mean adding the same Waitable instance to multiple executors? That sounds undesirable. If two executors are trying to execute the same thing at the same time then one is going to be blocked, when it could have been executing something else.

Can this be removed?

Copy link
Collaborator

@BrettRD BrettRD Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment is mistaken, (oops)
reentrant_mutex is called to protect rcl_wait_set_add_guard_condition, which inserts a pointer into a list of pointers and is not thread-safe. This allows multiple threads to insert callbacks into one queue without trashing RCL's internal data structures

queue_mutex is the one you're worried about, it allows multiple executors to pull jobs from the queue.
If someone managed to put the wrapper into multiple executors without the mutex, it would be a nightmare to debug.

The lock is released before the callback is run. An executor will only be blocked by the time it takes for another executor to pull a pointer from a queue.

{
std::lock_guard<std::recursive_mutex> lock(queue_mutex_);
if(!callback_queue_.empty()){
cb_wrapper = callback_queue_.front();
callback_queue_.pop_front();
}
}
//the lock is released and the callback is no longer associated with the queue, run it.
if(cb_wrapper) {
cb_wrapper->call();
}

Executors won't try to execute the same code twice, and they won't wait for a callback to finish either.

edit: grammar

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment fb52543

fuse_core/src/callback_wrapper.cpp Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
Signed-off-by: methylDragon <methylDragon@gmail.com>
Signed-off-by: methylDragon <methylDragon@gmail.com>
…lling-waitables

Signed-off-by: methylDragon <methylDragon@gmail.com>
Co-Authored-by: Brett Downing
Copy link
Collaborator

@sloretz sloretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In-progress review. Github is now showing all the Time changes in this PR. I'll try an old trick of changing the base and back to see if that fixes it.

fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
@sloretz sloretz changed the base branch from rolling to devel November 11, 2022 00:40
@sloretz sloretz changed the base branch from devel to rolling November 11, 2022 00:40
sloretz and others added 2 commits November 10, 2022 16:40
Signed-off-by: Shane Loretz <sloretz@osrfoundation.org>
Signed-off-by: methylDragon <methylDragon@gmail.com>
fuse_core/src/async_motion_model.cpp Show resolved Hide resolved
);
auto result = callback->getFuture();
callback_queue_->addCallback(callback);
result.wait();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the intent of the separate queue and executor is to not wait for the result here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it everywhere where it made sense to d3b6c94

There is only one wait() remaining, in bool AsyncMotionModel::apply(Transaction& transaction)

I commented them out and added a TODO though, just in case we need to wait due to race conditions/etc. when everything compiles

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The futures and waits are all about thread management.

The AsyncMotionModel is running its own callback spinner thread. So if a derived implementation created a Timer or subscribed to a sensor topic, those callbacks will run in the AsyncMotionModel spinner thread. However, the apply() method here is going to be called by the Optimizer, and hence it will run in the Optimizer's thread. This means any derived motion model must be very careful about thread synchronization. And it is not very obvious what callbacks will run in what threads. To make it easier to write derived motion model classes, I implemented the following:

The apply() method will be called by the Optimizer thread

bool AsyncMotionModel::apply(Transaction& transaction)
{

Wrap the user-defined implementation, applyCallback(), inside a promise.

  auto callback = boost::make_shared<CallbackWrapper<bool>>(
    std::bind(&AsyncMotionModel::applyCallback, this, std::ref(transaction)));

Create a future from the wrapper so that the apply() function may receive the results of the applyCallback() function.

auto result = callback->getFuture();

Insert the applyCallback() call into the callback queue of the AsyncMotionModel. This means that the applyCallback() method will be executed by the AsyncMotionModel spinner thread.

callback_queue_.addCallback(callback, reinterpret_cast<uint64_t>(this));

Wait for the AsyncMotionModel spinner thread to complete the applyCallback() call.

result.wait();

Return the results of the applyCallback() method out of the apply() method.

return result.get();

This was done so that the user-implementation of the apply() business logic, implemented in the applyCallback() method, will execute synchronously with any other callbacks (timers, subscriptions, services, etc) defined in the motion model. Typically the author of a derived AsyncMotionModel does not need to implement any mutexes or other thread synchronization mechanisms even though the execution of a motion model involves multiple threads created in different parts of the fuse ecosystem.

fuse_core/src/async_publisher.cpp Show resolved Hide resolved
fuse_core/src/async_publisher.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_sensor_model.cpp Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
Signed-off-by: methylDragon <methylDragon@gmail.com>
Signed-off-by: methylDragon <methylDragon@gmail.com>
fuse_core/include/fuse_core/callback_wrapper.h Outdated Show resolved Hide resolved
fuse_core/src/async_motion_model.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_motion_model.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_publisher.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_publisher.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_sensor_model.cpp Outdated Show resolved Hide resolved
fuse_core/src/async_sensor_model.cpp Outdated Show resolved Hide resolved
fuse_core/src/callback_wrapper.cpp Outdated Show resolved Hide resolved
Signed-off-by: methylDragon <methylDragon@gmail.com>
Copy link
Collaborator

@sloretz sloretz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM (noting TODO to decide on multiple nodes vs single node).

Builds locally and all tests pass except for the linters.

@methylDragon methylDragon merged commit 9f9395e into locusrobotics:rolling Nov 12, 2022
@methylDragon methylDragon deleted the rolling-waitables branch November 12, 2022 02:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants