-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/rosbag #47
Feature/rosbag #47
Conversation
Conflicts: ros2/plugin.bzl ros2/test/pluginlib/tests.py
Conflicts: repositories/repositories.bzl repositories/ros2_repo_mappings.yaml repositories/ros2_repositories_impl.bzl
So,
The problem seems to be in the cleanup of subscriptions: https://github.com/ros2/rosbag2/blob/0.15.3/rosbag2_transport/src/rosbag2_transport/recorder.cpp#L115. @ahans I am digging into this, but another pair of eyes on this would help. Except this issue, I have to clean up some Bazel-related macros before this lands in main. + fix some tests. |
ros2/BUILD.bazel
Outdated
@@ -35,3 +38,13 @@ py_binary( | |||
"@ros2cli//:ros2param", | |||
], | |||
) | |||
|
|||
ros2_bag( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a tmp example app.
Thanks for this! However, I can't repro the segfault. Tried on Ubuntu 20.04 as well as 22.04 hosts. Have you pushed some commits after writing this? What I did notice:
|
Ok, after digging a bit, turns out what I report under 3. above seems to be related to what you see as explicit segfault. When I hit Ctrl-C, it doesn't shut down cleanly. This is the reason why |
I can confirm it seems to be related to cleanup of subscriptions. More specifically, when I comment out the call to I will keep digging, but maybe this info already helps to give you some idea? You know better what's different here compared to a standard ROS 2 build. Maybe it has to do with shared vs static libraries? What are you building statically here that would be a shared library in the standard ROS setup? |
So the issues with running asan also occur with stock ROS2, so that's unrelated. |
Many thanks for digging into this. Regarding 1: hm, this is fishy. It probably means that compiler pulls sqlite from /usr folder somehow -- both on my local machine and on the CI; will double-check this. Regarding 2: yeah, this will be fixed before the merge: I'll convert those error statements into debug statements. I am leaning towards Python bindings doing something fishy given that the unit test runs just fine. An alternative is that my unit test isn't representative enough and in that case the problem lies somewhere else. So far, my understanding was that there is no static init involved with the recorder node, but I have to double-check this. And I'll go again over compilation flags for those Python bindings. I'll keep you posted. |
Yeah, it picks the Regarding the more "interesting issue": I was able to run this under asan now, this comment was gold! However, the only things it finds are unmatched delete/new (where it would always free less than it allocated before) and leaks. When disabling those two, only the segfault remains. I checked the obvious things like pointers being valid and the Will also do some more digging, but this seems to be quite a tough one ... ;-) |
I tried that locally, but I couldn't reproduce the segfault with the test.
ament_setup just lays out file structure such that ament_index_cpp can find necessary files. Nothing fancy.
When libstd_msgs__rosidl_typesupport_cpp.so needs to be loaded, ament_index_cpp is called to get a path to the library. If it's not present on the ament_prefix_path -> an exception is thrown. So, if e.g. libstd_msgs__rosidl_typesupport_introspection_cpp.so would be needed, I'd expect I should know about that by now by seeing an exception coming from ament_index_cpp code. BTW, I put some debug fprintf statements in that get_typesupport_handle_function from your link, but it looks like it's not called -- still have to dig a bit deeper. At the moment I have the feeling ros doesn't dynamically load typessupport introspection libs, only typesupport libs. One more thing: type_support depends on typesupport_introspection, at least auto-generated code tells that. I committed your patch in d9ca94c: please let me know if you want me to decorate it with extra text. Many thanks for your help! :) The PR still needs some cleanup before it can be merged, besides possibly digging deeper into dynamic loading stuff. |
d9ca94c
to
fee8556
Compare
IMO, the rclcpp patch is OKish at the moment. I also tried splitting interface generation stuff, but it turned out not to be so trivial -- I know what I need to do but would take more time given the current code struct that I would need to split. Plus, given that typesupport depends on introspection code, I am not sure yet this is the path to go. It would help if we can definitely rule out whether introspection libs are dyn loaded, or not. |
@ahans WDYT? |
In fact, from what I can see, get_typesupport_handle_function doesn't get called at all when I run
at the top of the function. |
I'm fairly certain that splitting the libraries is the proper solution here. The introspection one is loaded in stock ROS from SubscriptionBase, and it is loaded in a way that it is only unloaded when the proceas terminates. The non-introspection one is unloaded just the same way as we do here in rules_ros2, but in stock ROS that's fine, because the introspection one remains loaded. I can give you the exact stack traces tomorrow. Why we don't even try to load it with yours, I don't know. My suspicion still is that it has to do with symbols lookup and in the case of the all-in-one Liv the requested symbols are already there. I tried to follow the path as well in the debugger, but at some point we call into C code that gdb cannot properly resolve. Interestingly, though, for the standard ROS build this is not an issue at all. |
When testing before, I had the talker run in parallel and manually executed the recorder binary. But with this diff I can also make the recorder die with -11. By only using GenericSubscription, we trigger the issue with unloading the shared library too early.
|
It's a workaround, but not the proper solution. Having a In any case, a standard ROS build loads type support dynamically from shared objects, no matter if it's a C++ binary for a user node or a rosbag recorder. I think we should try to stay as close as possible to the standard ROS structure. Going with my patch for rclcpp for now I can live with. But there should be a
Since this is what the standard ROS build does, it is the way to go IMHO.
Introspection libs are definitely loaded dynamically. I was able to confirm my previous observations using a standard ROS build. Based on test_generic_pubsub.cpp I implemented a test like this: TEST_F(RclcppGenericNodeFixture, generic_pubsub_cleanup)
{
using namespace std::chrono_literals;
std::string topic_name = "/topic";
std::string topic_type = "std_msgs/msg/String";
{
auto subscription = node_->create_generic_subscription(
topic_name, topic_type, rclcpp::SensorDataQoS{},
[](std::shared_ptr<rclcpp::SerializedMessage>/* message */) {std::cerr << "message" << std::endl;});
for (auto i = 0; i < 10; ++i) {
rclcpp::spin_some(node_);
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
}
ASSERT_TRUE(true);
} The only really relevant part is that we call Then we load the non-instrospection library from create_generic_subscription. Full stack trace is this:
Two lines later in that function, we construct
Frame #1 is where we do the When I delete/rename When I replace the original Interestingly, in the ROS build I can follow the entire execution flow in the debugger. It basically confirmed that if we use the merged library, we don't even try to load the introspection one. |
So the decision whether to load the library or if it is considered to be already loaded happens here:
I haven't figured out where those function pointers are set, but overall it makes sense to me. |
Thanks for looking into this this deep, I really appreciate this.
In deps you put stuff that you want linked with a linker with your libs/binary. Runtime .so plugins typically go to the data section (from what I know so far). And this brings us to the point that: if you want to link ros nodes/libs against IDL libs it's fine to use deps. If you want to use IDL as plugins (generic subscriptions, rosbag), well, then ament index kicks in and generation of shared libs for all those plugins. After quite some thinking, I think that we want here that e.g. cpp_ros2_interface_library behaves more as a cc_library, such that we can generate .so libs (that can be used as plugins) on demand that are dynamically linked to it's dependencies. For static linking things are solved (for now), but for dynamic linking there is more work to be done. Let's say we have std_msgs cpp library. Besides the fact we want it's introspection lib to be a shared lib and that corresponding typesupport shared lib dynamically links against it, we want that it's dependency builtin_interfaces is also linked as a shared lib. I think this should be possible to do in Starlark, but... it's an unexplored territory for me :) Once this works, a separate rule+aspect can be used to generate an ament-compatible file tree -- this is just a hunch at the moment TBH. And then there is indeed rpath handling which should be done correctly by Bazel. At the moment, I think I'll just finish up this PR, merge it, and then open an issue for the tech debt. The correct approach will possibly require some serious work (and experimentation). Sadly, this isn't the first time that it was revealed that ROS relies on linking mechanism to work correctly. The very first patch #2 / ros2/rmw_cyclonedds#320 is (possibly) related to linking. |
This is true for the stock ROS setup. However, when assembling entire launches in
I was under the impression that there are different mechanisms in place. One that explicitly builds a library's path and explicitly wants that through a
Sounds good! I totally agree about the About the dependencies, why wouldn't it work to put a dependency's
That sounds acceptable. It's good that we have (almost?) fully understood what's happening. In that case we're not blindly applying a workaround, but are well aware of the implications.
Yeah, it looks like ROS relies heavily on dynamic linking at runtime, which offers quite a chances for bugs going unnoticed, because there are so many situations that in theory should work, but only so many (probably just a single constellation) are routinely tested in practice. I think ROS also heavily profits from user testing (who do this willingly or not). AFAIU, their CI setup doesn't even test optimized builds, which may surface issues that in a debug build go unnoticed. When you then deviate from the standard setup (as we do here), you can rely a lot less on the indirect user testing. That's part of the reason why I said above (or elsewhere) that we should try to stay as close as possible to what standard ROS does. |
Conflicts: repositories/ros2_repo_mappings.yaml repositories/ros2_repositories_impl.bzl ros2/test.bzl
It wasn't essential at the time I implemented the relevant rules, so, it's still on my todo list.
That works for static linking today and to some extent (IIUC) it should work with dynamic loading. For some part of dynamic loading ament setup is still required. Bazel has it's own system to lay out folder structure for .so files, and we need to at least symlink those libs (one day) to a file tree that ament can process. |
@ahans Many thanks once again for all your help! |
Adds support for rosbag2.