Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/rosbag #47

Merged
merged 35 commits into from
Feb 13, 2023
Merged

Feature/rosbag #47

merged 35 commits into from
Feb 13, 2023

Conversation

mvukov
Copy link
Owner

@mvukov mvukov commented Jan 21, 2023

Adds support for rosbag2.

@mvukov
Copy link
Owner Author

mvukov commented Jan 21, 2023

So, //ros2/test/rosbag:tests works, meaning to me that rosbag2 stuff works. However, the cli app segfaults. Example:

  • Run in the examples folder: bazel run //chatter
  • And in the repo root: bazel run //ros2:ros2_bag -- record /topic. It looks like the app starts up but on CTRL+C segfaults:
bazel run //ros2:ros2_bag -- record /topic
INFO: Analyzed target //ros2:ros2_bag (49 packages loaded, 11500 targets configured).
INFO: Found 1 target...
Target //ros2:ros2_bag up-to-date:
  bazel-bin/ros2/ros2_bag.py
  bazel-bin/ros2/ros2_bag
INFO: Elapsed time: 8.299s, Critical Path: 7.25s
INFO: 10 processes: 1 internal, 9 linux-sandbox.
INFO: Build completed successfully, 10 total actions
INFO: Running command line: bazel-bin/ros2/ros2_bag record /topic
[ERROR] [1674339231.732949368] [rosbag2_cpp]: Unable to create class loader instance: package 'rosbag2_cpp' not found, searching: [ros2/ros2_bag_ament_setup]  # Ignore this for the moment
[INFO] [1674339231.739594184] [rosbag2_recorder]: Press SPACE for pausing/resuming
[INFO] [1674339231.741430007] [rosbag2_storage]: Opened database 'rosbag2_2023_01_21-23_13_51/rosbag2_2023_01_21-23_13_51_0.db3' for READ_WRITE.
[INFO] [1674339231.741541053] [rosbag2_recorder]: Event publisher thread: Starting
[INFO] [1674339231.741749758] [rosbag2_recorder]: Listening for topics...
[INFO] [1674339231.742830969] [rosbag2_recorder]: Subscribed to topic '/topic'
[INFO] [1674339231.743211045] [rosbag2_recorder]: All requested topics are subscribed. Stopping discovery...
[INFO] [1674339234.384837410] [rclcpp]: signal_handler(signum=2)
Segmentation fault (core dumped)

The problem seems to be in the cleanup of subscriptions: https://github.com/ros2/rosbag2/blob/0.15.3/rosbag2_transport/src/rosbag2_transport/recorder.cpp#L115.

@ahans I am digging into this, but another pair of eyes on this would help.

Except this issue, I have to clean up some Bazel-related macros before this lands in main. + fix some tests.

ros2/BUILD.bazel Outdated
@@ -35,3 +38,13 @@ py_binary(
"@ros2cli//:ros2param",
],
)

ros2_bag(
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a tmp example app.

@ahans
Copy link
Contributor

ahans commented Jan 23, 2023

So, //ros2/test/rosbag:tests works, meaning to me that rosbag2 stuff works. However, the cli app segfaults.

Thanks for this! However, I can't repro the segfault. Tried on Ubuntu 20.04 as well as 22.04 hosts. Have you pushed some commits after writing this?

What I did notice:

  1. I got a compilation error complaining about a missing sqlite3.h include. First I thought you hadn't given that as a Bazel dep and just used the system one, but turns out your sqlite.BUILD.bazel is just missing a line includes = ["."],. I think it'd be a good idea to also put copts = ["-w"], to silence the warnings it generates (or only `-Wno-misleading-indentation to be more selective).
  2. Upon startup there's this error message:
    [ERROR] [1674461886.354162316] [rosbag2_cpp]: Unable to create class loader instance: package 'rosbag2_cpp' not     found, searching: [ros2/ros2_bag_ament_setup]
    
    A brief look at your changes indicates that this is expected. Would definitely be nice to fix that for a final solution, though.
  3. When creating a rosbag with ros2 bag record, there's a file metadata.yaml next to the .db3 file. The recording made via //ros2:ros2_bag doesn't have that. So it looks like despite the apparent absence of the segfault for me, something's not quite right yet when shutting down.

@ahans
Copy link
Contributor

ahans commented Jan 23, 2023

Ok, after digging a bit, turns out what I report under 3. above seems to be related to what you see as explicit segfault. When I hit Ctrl-C, it doesn't shut down cleanly. This is the reason why metadata.yaml isn't written. If there are messages not written to disk yet, they would be lost as well. So this definitely needs to be fixed.

@ahans
Copy link
Contributor

ahans commented Jan 23, 2023

I can confirm it seems to be related to cleanup of subscriptions. More specifically, when I comment out the call to rcl_subscription_fini here, it terminates cleanly and metadata.yaml gets written. I followed that rcl_subscription_fini() call a bit further, it leads to rmw_destroy_subscription() and from there to destroy_subscription(), where this call to dds_delete is the one that segfaults. That is already part of CycloneDDS. I suspect something's wrong before and this is only the first visible symptom. I tried to run it under asan, but the fact that everything is a Python extension and the Python binary the main driver makes this hard. Asan barks already during what seems to be its setup. Not sure if that's real or a false alarm.

I will keep digging, but maybe this info already helps to give you some idea? You know better what's different here compared to a standard ROS 2 build. Maybe it has to do with shared vs static libraries? What are you building statically here that would be a shared library in the standard ROS setup?

@ahans
Copy link
Contributor

ahans commented Jan 23, 2023

So the issues with running asan also occur with stock ROS2, so that's unrelated.

@mvukov
Copy link
Owner Author

mvukov commented Jan 23, 2023

Many thanks for digging into this. Regarding 1: hm, this is fishy. It probably means that compiler pulls sqlite from /usr folder somehow -- both on my local machine and on the CI; will double-check this. Regarding 2: yeah, this will be fixed before the merge: I'll convert those error statements into debug statements.

I am leaning towards Python bindings doing something fishy given that the unit test runs just fine. An alternative is that my unit test isn't representative enough and in that case the problem lies somewhere else. So far, my understanding was that there is no static init involved with the recorder node, but I have to double-check this. And I'll go again over compilation flags for those Python bindings. I'll keep you posted.

@ahans
Copy link
Contributor

ahans commented Jan 23, 2023

Yeah, it picks the sqlite3.h from /usr/include. Since you use the host compiler, this is expected behavior. With a custom toolchain it would break. The GitHub build image has a lot of tools and libs pre-installed, sqlite3 is apparently one of them. Just put includes = ["."] and it's fine.

Regarding the more "interesting issue": I was able to run this under asan now, this comment was gold! However, the only things it finds are unmatched delete/new (where it would always free less than it allocated before) and leaks. When disabling those two, only the segfault remains. I checked the obvious things like pointers being valid and the dds_entity_t (or something like that), which is what you get when you create a subscription and pass when you want to delete it, and those match (the handle is just an integer).

Will also do some more digging, but this seems to be quite a tough one ... ;-)

@mvukov
Copy link
Owner Author

mvukov commented Jan 29, 2023

Remove the MessageCounter node and also recorder.cc segfaults. That node kept the shared library around. It probably uses some other subscription mechanism that doesn't have that ts_lib bug.

I tried that locally, but I couldn't reproduce the segfault with the test.

Now, what is different between the way stock ros2 bag record works and your setup? Does the default ament setup do something that makes the libraries stick around?

ament_setup just lays out file structure such that ament_index_cpp can find necessary files. Nothing fancy.

Root cause for the segfault is the fact that stock ROS 2 builds two typesupport libraries, e.g., for std_msgs those would be
libstd_msgs__rosidl_typesupport_cpp.so and libstd_msgs__rosidl_typesupport_introspection_cpp.so, but rules_ros2 puts the functionality of both into the _typesupport_cpp.so one. As outlined above, *_typesupport_cpp.so lives as a std::shared_ptrrcpputils::SharedLibrary in GenericSubscription, but *_typesupport_introspection_cpp.so is loaded here. There's no matching delete for that new, so that one stays loaded for the lifetime of the process.

When libstd_msgs__rosidl_typesupport_cpp.so needs to be loaded, ament_index_cpp is called to get a path to the library. If it's not present on the ament_prefix_path -> an exception is thrown. So, if e.g. libstd_msgs__rosidl_typesupport_introspection_cpp.so would be needed, I'd expect I should know about that by now by seeing an exception coming from ament_index_cpp code. BTW, I put some debug fprintf statements in that get_typesupport_handle_function from your link, but it looks like it's not called -- still have to dig a bit deeper. At the moment I have the feeling ros doesn't dynamically load typessupport introspection libs, only typesupport libs.

One more thing: type_support depends on typesupport_introspection, at least auto-generated code tells that.

I committed your patch in d9ca94c: please let me know if you want me to decorate it with extra text.

Many thanks for your help! :)

The PR still needs some cleanup before it can be merged, besides possibly digging deeper into dynamic loading stuff.

@mvukov
Copy link
Owner Author

mvukov commented Jan 29, 2023

IMO, the rclcpp patch is OKish at the moment.

I also tried splitting interface generation stuff, but it turned out not to be so trivial -- I know what I need to do but would take more time given the current code struct that I would need to split. Plus, given that typesupport depends on introspection code, I am not sure yet this is the path to go. It would help if we can definitely rule out whether introspection libs are dyn loaded, or not.

@mvukov
Copy link
Owner Author

mvukov commented Jan 29, 2023

@ahans WDYT?

@mvukov
Copy link
Owner Author

mvukov commented Jan 29, 2023

In fact, from what I can see, get_typesupport_handle_function doesn't get called at all when I run bazel run //ros2:ros2_bag -- record /topic when I inject

fprintf(stderr, "get_typesupport_handle_function: %s", identifier);
::exit(1);

at the top of the function.

@ahans
Copy link
Contributor

ahans commented Jan 29, 2023

I'm fairly certain that splitting the libraries is the proper solution here. The introspection one is loaded in stock ROS from SubscriptionBase, and it is loaded in a way that it is only unloaded when the proceas terminates. The non-introspection one is unloaded just the same way as we do here in rules_ros2, but in stock ROS that's fine, because the introspection one remains loaded. I can give you the exact stack traces tomorrow.

Why we don't even try to load it with yours, I don't know. My suspicion still is that it has to do with symbols lookup and in the case of the all-in-one Liv the requested symbols are already there. I tried to follow the path as well in the debugger, but at some point we call into C code that gdb cannot properly resolve. Interestingly, though, for the standard ROS build this is not an issue at all.

@ahans
Copy link
Contributor

ahans commented Jan 30, 2023

Remove the MessageCounter node and also recorder.cc segfaults. That node kept the shared library around. It probably uses some other subscription mechanism that doesn't have that ts_lib bug.

I tried that locally, but I couldn't reproduce the segfault with the test.

When testing before, I had the talker run in parallel and manually executed the recorder binary. But with this diff I can also make the recorder die with -11. By only using GenericSubscription, we trigger the issue with unloading the shared library too early.

diff --git a/ros2/test/rosbag/recorder.cc b/ros2/test/rosbag/recorder.cc
index da8f46c..1971966 100644
--- a/ros2/test/rosbag/recorder.cc
+++ b/ros2/test/rosbag/recorder.cc
@@ -16,22 +16,23 @@
 #include "rosbag2_storage/storage_options.hpp"
 #include "rosbag2_transport/reader_writer_factory.hpp"
 #include "rosbag2_transport/record_options.hpp"
-#include "std_msgs/msg/string.hpp"
 
 constexpr auto kTopicName = "topic";
 
 class MessageCounter : public rclcpp::Node {
  public:
   MessageCounter() : Node("message_counter") {
-    subscription_ = create_subscription<std_msgs::msg::String>(
-        kTopicName, 10,
-        [this](std_msgs::msg::String::SharedPtr /*msg*/) { ++msg_count_; });
+    subscription_ = create_generic_subscription(
+        kTopicName, "std_msgs/msg/String", rclcpp::SensorDataQoS{},
+        [this](std::shared_ptr<rclcpp::SerializedMessage> message) {
+          ++msg_count_;
+        });
   }
 
   auto msg_count() const { return msg_count_; }
 
  private:
-  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr subscription_;
+  std::shared_ptr<rclcpp::GenericSubscription> subscription_;
   int msg_count_ = 0;
 };
 

@ahans
Copy link
Contributor

ahans commented Jan 30, 2023

IMO, the rclcpp patch is OKish at the moment.

It's a workaround, but not the proper solution. Having a cc_library that has the two .so files as outputs (or two cc_library targets with one per .so) would be the way forward IMHO: A rosbag rule would depend on those in its data attribute, while for the cc_binary that a ros2_cpp_binary produces it could end up on either data or deps, but in either case be loaded dynamically. Putting it in deps would have the advantage that we can probably get away without having to rely on ament index, because Bazel would put it into the NEEDED section and set RPATH properly.

In any case, a standard ROS build loads type support dynamically from shared objects, no matter if it's a C++ binary for a user node or a rosbag recorder. I think we should try to stay as close as possible to the standard ROS structure.

Going with my patch for rclcpp for now I can live with. But there should be a TODO saying that this is tech debt that needs to be fixed eventually.

I also tried splitting interface generation stuff, but it turned out not to be so trivial -- I know what I need to do but would take more time given the current code struct that I would need to split. Plus, given that typesupport depends on introspection code, I am not sure yet this is the path to go.

Since this is what the standard ROS build does, it is the way to go IMHO.

It would help if we can definitely rule out whether introspection libs are dyn loaded, or not.

Introspection libs are definitely loaded dynamically. I was able to confirm my previous observations using a standard ROS build.

Based on test_generic_pubsub.cpp I implemented a test like this:

TEST_F(RclcppGenericNodeFixture, generic_pubsub_cleanup)
{
  using namespace std::chrono_literals;
  std::string topic_name = "/topic";
  std::string topic_type = "std_msgs/msg/String";
  {
    auto subscription = node_->create_generic_subscription(
        topic_name, topic_type, rclcpp::SensorDataQoS{},
        [](std::shared_ptr<rclcpp::SerializedMessage>/* message */) {std::cerr << "message" << std::endl;});
    for (auto i = 0; i < 10; ++i) {
      rclcpp::spin_some(node_);
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }
  ASSERT_TRUE(true);
}

The only really relevant part is that we call create_generic_subscription() and eventually destroy the subscription object.

Then we load the non-instrospection library from create_generic_subscription. Full stack trace is this:


#0  rcpputils::SharedLibrary::SharedLibrary (this=0x5555556eaf40, library_path=...) at /home/ahans/src/ros2_rolling/src/ros2/rcpputils/src/shared_library.cpp:28
#1  0x00007ffff7dfb59b in __gnu_cxx::new_allocator<rcpputils::SharedLibrary>::construct<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff822f, __p=0x5555556eaf40) at /usr/include/c++/11/ext/new_allocator.h:162
#2  0x00007ffff7dfb42c in std::allocator_traits<std::allocator<rcpputils::SharedLibrary> >::construct<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (__a=..., __p=0x5555556eaf40) at /usr/include/c++/11/bits/alloc_traits.h:516
#3  0x00007ffff7dfb214 in std::_Sp_counted_ptr_inplace<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x5555556eaf30, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:519
#4  0x00007ffff7dfae58 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8498, __p=@0x7fffffff8490: 0x0, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:650
#5  0x00007ffff7dfabdc in std::__shared_ptr<rcpputils::SharedLibrary, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8490, __tag=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
#6  0x00007ffff7dfa9a5 in std::shared_ptr<rcpputils::SharedLibrary>::shared_ptr<std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (this=0x7fffffff8490, __tag=...) at /usr/include/c++/11/bits/shared_ptr.h:409
#7  0x00007ffff7dfa7dd in std::allocate_shared<rcpputils::SharedLibrary, std::allocator<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
#8  0x00007ffff7dfa61a in std::make_shared<rcpputils::SharedLibrary, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> () at /usr/include/c++/11/bits/shared_ptr.h:879
#9  0x00007ffff7df9e81 in rclcpp::get_typesupport_library (type=..., typesupport_identifier=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/typesupport_helpers.cpp:101
#10 0x00005555555b92d9 in rclcpp::create_generic_subscription<std::allocator<void> >(std::shared_ptr<rclcpp::node_interfaces::NodeTopicsInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (topics_interface=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/create_generic_subscription.hpp:60
#11 0x00005555555b7e8b in rclcpp::Node::create_generic_subscription<std::allocator<void> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x55555568d300, topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/node_impl.hpp:238
#12 0x00005555555b3549 in RclcppGenericNodeFixture_generic_pubsub_cleanup_Test::TestBody (this=0x55555568d1a0) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/test/rclcpp/test_generic_pubsub.cpp:278
[testing framework functions omitted]

Two lines later in that function, we construct GenericSubscription, which triggers the load of the type support introspection library:

#0  rcpputils::SharedLibrary::SharedLibrary (this=0x555555710570, library_path=...) at /home/ahans/src/ros2_rolling/src/ros2/rcpputils/src/shared_library.cpp:26
#1  0x00007ffff6b34634 in rosidl_typesupport_cpp::get_typesupport_handle_function<rosidl_message_type_support_t> (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl_typesupport/rosidl_typesupport_cpp/src/type_support_dispatch.hpp:74
#2  0x00007ffff6b34400 in rosidl_typesupport_cpp::get_message_typesupport_handle_function (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl_typesupport/rosidl_typesupport_cpp/src/message_type_support_dispatch.cpp:27
#3  0x00007ffff6b3c6d1 in get_message_typesupport_handle (handle=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, identifier=0x7ffff718a000 "rosidl_typesupport_introspection_cpp") at /home/ahans/src/ros2_rolling/src/ros2/rosidl/rosidl_runtime_c/src/message_type_support.c:28
#4  0x00007ffff696e660 in get_typesupport (type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:1955
#5  0x00007ffff697291c in create_cdds_subscription (dds_ppant=662357872, dds_sub=115765071, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff4590, ignore_local_publications=false) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2811
#6  0x00007ffff697303b in create_subscription (dds_ppant=662357872, dds_sub=115765071, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff4590, subscription_options=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2895
#7  0x00007ffff69738d0 in rmw_create_subscription (node=0x5555556ae730, type_supports=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x55555570b330 "/topic", qos_policies=0x7fffffff7cf0, subscription_options=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_cyclonedds/rmw_cyclonedds_cpp/src/rmw_node.cpp:2983
#8  0x00007ffff6cc51c8 in rmw_create_subscription (v5=0x5555556ae730, v4=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, v3=0x55555570b330 "/topic", v2=0x7fffffff7cf0, v1=0x7fffffff7d70) at /home/ahans/src/ros2_rolling/src/ros2/rmw_implementation/rmw_implementation/src/functions.cpp:393
#9  0x00007ffff71da541 in rcl_subscription_init (subscription=0x55555570b2f0, node=0x555555690dc0, type_support=0x7ffff0b573c0 <std_msgs::msg::rosidl_typesupport_cpp::String_message_type_support_handle>, topic_name=0x7fffffff8550 "/topic", options=0x7fffffff7cf0) at /home/ahans/src/ros2_rolling/src/ros2/rcl/rcl/src/rcl/subscription.c:101
#10 0x00007ffff7d62469 in rclcpp::SubscriptionBase::SubscriptionBase (this=0x555555701050, node_base=0x55555568d770, type_support_handle=..., topic_name=..., subscription_options=..., is_serialized=true) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/src/rclcpp/subscription_base.cpp:68
#11 0x00005555555be34f in rclcpp::GenericSubscription::GenericSubscription<std::allocator<void> >(rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x555555701050, node_base=0x55555568d770, ts_lib=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/generic_subscription.hpp:87
#12 0x00005555555bdfe4 in __gnu_cxx::new_allocator<rclcpp::GenericSubscription>::construct<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::GenericSubscription*, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff80ff, __p=0x555555701050) at /usr/include/c++/11/ext/new_allocator.h:162
#13 0x00005555555bdcb7 in std::allocator_traits<std::allocator<rclcpp::GenericSubscription> >::construct<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription>&, rclcpp::GenericSubscription*, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (__a=..., __p=0x555555701050) at /usr/include/c++/11/bits/alloc_traits.h:516
#14 0x00005555555bd911 in std::_Sp_counted_ptr_inplace<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x555555701040, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:519
#15 0x00005555555bcf6d in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::GenericSubscription*&, std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e8, __p=@0x7fffffff85e0: 0x0, __a=...) at /usr/include/c++/11/bits/shared_ptr_base.h:650
#16 0x00005555555bc968 in std::__shared_ptr<rclcpp::GenericSubscription, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e0, __tag=...) at /usr/include/c++/11/bits/shared_ptr_base.h:1342
#17 0x00005555555bbf49 in std::shared_ptr<rclcpp::GenericSubscription>::shared_ptr<std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::_Sp_alloc_shared_tag<std::allocator<rclcpp::GenericSubscription> >, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x7fffffff85e0, __tag=...) at /usr/include/c++/11/bits/shared_ptr.h:409
#18 0x00005555555bb2bc in std::allocate_shared<rclcpp::GenericSubscription, std::allocator<rclcpp::GenericSubscription>, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(std::allocator<rclcpp::GenericSubscription> const&, rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (__a=...) at /usr/include/c++/11/bits/shared_ptr.h:863
#19 0x00005555555ba34a in std::make_shared<rclcpp::GenericSubscription, rclcpp::node_interfaces::NodeBaseInterface*, std::shared_ptr<rcpputils::SharedLibrary>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&>(rclcpp::node_interfaces::NodeBaseInterface*&&, std::shared_ptr<rcpputils::SharedLibrary>&&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>&, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) () at /usr/include/c++/11/bits/shared_ptr.h:879
#20 0x00005555555b935c in rclcpp::create_generic_subscription<std::allocator<void> >(std::shared_ptr<rclcpp::node_interfaces::NodeTopicsInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (topics_interface=..., topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/create_generic_subscription.hpp:63
#21 0x00005555555b7e8b in rclcpp::Node::create_generic_subscription<std::allocator<void> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rclcpp::QoS const&, std::function<void (std::shared_ptr<rclcpp::SerializedMessage>)>, rclcpp::SubscriptionOptionsWithAllocator<std::allocator<void> > const&) (this=0x55555568d300, topic_name=..., topic_type=..., qos=..., callback=..., options=...) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/include/rclcpp/node_impl.hpp:238
#22 0x00005555555b3549 in RclcppGenericNodeFixture_generic_pubsub_cleanup_Test::TestBody (this=0x55555568d1a0) at /home/ahans/src/ros2_rolling/src/ros2/rclcpp/rclcpp/test/rclcpp/test_generic_pubsub.cpp:278
[...]

Frame #1 is where we do the new SharedLibrary that has no matching delete.

When I delete/rename libstd_msgs__rosidl_typesupport_introspection_cpp.so, the test fails.

When I replace the original libstd_msgs__rosidl_typesupport_cpp.so with the one built via rules_ros2, I get the exact same behavior as in rules_ros2: libstd_msgs__rosidl_typesupport_introspection_cpp.so is not loaded and upon shutdown there's a segfault.

Interestingly, in the ROS build I can follow the entire execution flow in the debugger. It basically confirmed that if we use the merged library, we don't even try to load the introspection one.

@ahans
Copy link
Contributor

ahans commented Jan 30, 2023

So the decision whether to load the library or if it is considered to be already loaded happens here:

  • When the library is not loaded, func points at rosidl_typesupport_cpp::get_typesupport_handle_function, where we eventually load the introspection library. In the last stacktrace above the call to func happens in frame 3.
  • When we have the library loaded, func points at the get_message_typesupport_handle_function() that is defined further down in the same file here. For the _cpp version that will then return handle. When having loaded the "merged" library, also also stay in the same file and never call the get_message_typesupport_handle_function that loads the introspection library.

I haven't figured out where those function pointers are set, but overall it makes sense to me.

@mvukov
Copy link
Owner Author

mvukov commented Feb 5, 2023

Thanks for looking into this this deep, I really appreciate this.

It's a workaround, but not the proper solution. Having a cc_library that has the two .so files as outputs (or two cc_library targets with one per .so) would be the way forward IMHO: A rosbag rule would depend on those in its data attribute, while for the cc_binary that a ros2_cpp_binary produces it could end up on either data or deps, but in either case be loaded dynamically. Putting it in deps would have the advantage that we can probably get away without having to rely on ament index, because Bazel would put it into the NEEDED section and set RPATH properly.

In any case, a standard ROS build loads type support dynamically from shared objects, no matter if it's a C++ binary for a user node or a rosbag recorder. I think we should try to stay as close as possible to the standard ROS structure.

In deps you put stuff that you want linked with a linker with your libs/binary. Runtime .so plugins typically go to the data section (from what I know so far).

And this brings us to the point that: if you want to link ros nodes/libs against IDL libs it's fine to use deps. If you want to use IDL as plugins (generic subscriptions, rosbag), well, then ament index kicks in and generation of shared libs for all those plugins.

After quite some thinking, I think that we want here that e.g. cpp_ros2_interface_library behaves more as a cc_library, such that we can generate .so libs (that can be used as plugins) on demand that are dynamically linked to it's dependencies. For static linking things are solved (for now), but for dynamic linking there is more work to be done. Let's say we have std_msgs cpp library. Besides the fact we want it's introspection lib to be a shared lib and that corresponding typesupport shared lib dynamically links against it, we want that it's dependency builtin_interfaces is also linked as a shared lib. I think this should be possible to do in Starlark, but... it's an unexplored territory for me :) Once this works, a separate rule+aspect can be used to generate an ament-compatible file tree -- this is just a hunch at the moment TBH.

And then there is indeed rpath handling which should be done correctly by Bazel.

At the moment, I think I'll just finish up this PR, merge it, and then open an issue for the tech debt. The correct approach will possibly require some serious work (and experimentation).

Sadly, this isn't the first time that it was revealed that ROS relies on linking mechanism to work correctly. The very first patch #2 / ros2/rmw_cyclonedds#320 is (possibly) related to linking.

@ahans
Copy link
Contributor

ahans commented Feb 7, 2023

In deps you put stuff that you want linked with a linker with your libs/binary. Runtime .so plugins typically go to the data section (from what I know so far).

This is true for the stock ROS setup. However, when assembling entire launches in BUILD files, we know at compile/link time exactly what we need at runtime. Having the linker put the reference to the .so plugin in that case I would find reasonable, if it makes our life easier in some other place. But yes, if we want to replicate what stock ROS does as closely as possible, data is the way to go.

And this brings us to the point that: if you want to link ros nodes/libs against IDL libs it's fine to use deps. If you want to use IDL as plugins (generic subscriptions, rosbag), well, then ament index kicks in and generation of shared libs for all those plugins.

I was under the impression that there are different mechanisms in place. One that explicitly builds a library's path and explicitly wants that through a dlopen call. And another, that first checks if the library/symbol is already present. For the former, we still have to make ament happy, agreed. However, even in that case I think it would be good enough to do the bare minimum, which seems to be having some directory under $AMENT_PREFIX_PATH/**/share/** in place. The standard mechanism involving dlopen will respect $LD_LIBRARY_PATH, which means it will also consider libraries already loaded due to being mentioned in the NEEDED section.

After quite some thinking, I think that we want here that e.g. cpp_ros2_interface_library behaves more as a cc_library, such that we can generate .so libs (that can be used as plugins) on demand that are dynamically linked to it's dependencies. For static linking things are solved (for now), but for dynamic linking there is more work to be done. Let's say we have std_msgs cpp library. Besides the fact we want it's introspection lib to be a shared lib and that corresponding typesupport shared lib dynamically links against it, we want that it's dependency builtin_interfaces is also linked as a shared lib. I think this should be possible to do in Starlark, but... it's an unexplored territory for me :) Once this works, a separate rule+aspect can be used to generate an ament-compatible file tree -- this is just a hunch at the moment TBH.

Sounds good! I totally agree about the cc_library part. Honestly, I was a bit confused that running bazel build on a cpp_ros2_interface_library did not cause any compilation action. ;-) Having cpp_ros2_interface_library be a cc_library in the end that produces .a and/or .so files would be less surprising, at least to me.

About the dependencies, why wouldn't it work to put a dependency's cpp_ros2_interface_library (which in the end is another cc_library) in the cc_library's deps section? Then Bazel would make sure the dep is built and also properly linked against (RPATH and all).

At the moment, I think I'll just finish up this PR, merge it, and then open an issue for the tech debt. The correct approach will possibly require some serious work (and experimentation).

That sounds acceptable. It's good that we have (almost?) fully understood what's happening. In that case we're not blindly applying a workaround, but are well aware of the implications.

Sadly, this isn't the first time that it was revealed that ROS relies on linking mechanism to work correctly. The very first patch #2 / ros2/rmw_cyclonedds#320 is (possibly) related to linking.

Yeah, it looks like ROS relies heavily on dynamic linking at runtime, which offers quite a chances for bugs going unnoticed, because there are so many situations that in theory should work, but only so many (probably just a single constellation) are routinely tested in practice.

I think ROS also heavily profits from user testing (who do this willingly or not). AFAIU, their CI setup doesn't even test optimized builds, which may surface issues that in a debug build go unnoticed. When you then deviate from the standard setup (as we do here), you can rely a lot less on the indirect user testing. That's part of the reason why I said above (or elsewhere) that we should try to stay as close as possible to what standard ROS does.

@mvukov
Copy link
Owner Author

mvukov commented Feb 13, 2023

Honestly, I was a bit confused that running bazel build on a cpp_ros2_interface_library did not cause any compilation action. ;-)

It wasn't essential at the time I implemented the relevant rules, so, it's still on my todo list.

About the dependencies, why wouldn't it work to put a dependency's cpp_ros2_interface_library (which in the end is another cc_library) in the cc_library's deps section? Then Bazel would make sure the dep is built and also properly linked against (RPATH and all).

That works for static linking today and to some extent (IIUC) it should work with dynamic loading. For some part of dynamic loading ament setup is still required. Bazel has it's own system to lay out folder structure for .so files, and we need to at least symlink those libs (one day) to a file tree that ament can process.

@mvukov mvukov marked this pull request as ready for review February 13, 2023 19:45
@mvukov mvukov merged commit 2d147f4 into main Feb 13, 2023
@mvukov mvukov deleted the feature/rosbag branch February 13, 2023 19:45
@mvukov
Copy link
Owner Author

mvukov commented Feb 13, 2023

@ahans Many thanks once again for all your help!

@ahans ahans mentioned this pull request May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants