Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SensorManager deadlock in PublishPerformanceMetrics if multiple SensorContainers have sensors #2902

Closed
scpeters opened this issue Dec 10, 2020 · 2 comments

Comments

@scpeters
Copy link
Member

I've noticed a deadlock in SensorManager::PublishPerformanceMetrics (added in #2819) when subscribing to /gazebo/performance_metrics with a world that contains sensors in more than one SensorContainer. Recalling that each SensorContainer has its own thread and manages sensors of the same SensorCategory (IMAGE, RAY, or OTHER), this could happen with a model containing a CameraSensor and a RaySensor or a CameraSensor and an ImuSensor. For example, if you load the PR2 world and open the Image viewer from Window -> Topic Visualization, the images will update until you also open the /gazebo/performance_metrics topic in the viewer.

I've collected a backtrace with two SensorContainer objects calling SensorManager::GetSensor at the same time and waiting on mutexes:

Thread 49 (Thread 0x7fd162ffc700 (LWP 2572)):
#0  0x00007fd24ed2b1fd in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd24ed240f4 in __GI___pthread_mutex_lock (mutex=0x7fd251212370 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t+16>) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007fd24f472bd0 in boost::recursive_mutex::lock() (this=<optimized out>) at /usr/include/boost/thread/pthread/recursive_mutex.hpp:113
#3  0x00007fd24f472bd0 in boost::unique_lock<boost::recursive_mutex>::lock() (this=this@entry=0x7fd162ffa6b0) at /usr/include/boost/thread/lock_types.hpp:346
#4  0x00007fd24f46b63f in boost::unique_lock<boost::recursive_mutex>::unique_lock(boost::recursive_mutex&) (m_=..., this=0x7fd162ffa6b0) at /usr/include/boost/thread/lock_types.hpp:121
#5  0x00007fd24f46b63f in gazebo::sensors::SensorManager::GetSensor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const (this=0x7fd251212360 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>, _name="default::viper::base_link::left_camera") at gazebo/sensors/SensorManager.cc:538
#6  0x00007fd24f462540 in gazebo::sensors::get_sensor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (_name="default::viper::base_link::left_camera") at gazebo/common/SingletonT.hh:48
#7  0x00007fd24f46dc8e in PublishPerformanceMetrics() () at gazebo/sensors/SensorManager.cc:231
#8  0x00007fd24f46eb6c in gazebo::sensors::SensorManager::SensorContainer::Update(bool) (this=0x5609ed9396c0, _force=false) at gazebo/sensors/SensorManager.cc:856
#9  0x00007fd24f46af8f in gazebo::sensors::SensorManager::SensorContainer::RunLoop() (this=0x5609ed9396c0) at gazebo/sensors/SensorManager.cc:806
#10 0x00007fd24c39cbcd in  () at /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.65.1
#11 0x00007fd24ed216db in start_thread (arg=0x7fd162ffc700) at pthread_create.c:463
#12 0x00007fd24f7f071f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Thread 1 (Thread 0x7fd251364a00 (LWP 2493)):
#0  0x00007fd24ed2b1fd in __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd24ed240f4 in __GI___pthread_mutex_lock (mutex=0x5609ed9396f0) at ../nptl/pthread_mutex_lock.c:115
#2  0x00007fd24f472bd0 in boost::recursive_mutex::lock() (this=<optimized out>) at /usr/include/boost/thread/pthread/recursive_mutex.hpp:113
#3  0x00007fd24f472bd0 in boost::unique_lock<boost::recursive_mutex>::lock() (this=this@entry=0x7ffeb1ca0860) at /usr/include/boost/thread/lock_types.hpp:346
#4  0x00007fd24f46b482 in boost::unique_lock<boost::recursive_mutex>::unique_lock(boost::recursive_mutex&) (m_=..., this=0x7ffeb1ca0860) at /usr/include/boost/thread/lock_types.hpp:121
#5  0x00007fd24f46b482 in gazebo::sensors::SensorManager::SensorContainer::GetSensor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) const (this=0x5609ed9396c0, _name="default::viper::base_link::imu_sensor", _useLeafName=false) at gazebo/sensors/SensorManager.cc:876
#6  0x00007fd24f46b6a0 in gazebo::sensors::SensorManager::GetSensor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const (this=0x7fd251212360 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>, _name="default::viper::base_link::imu_sensor") at gazebo/sensors/SensorManager.cc:548
#7  0x00007fd24f462540 in gazebo::sensors::get_sensor(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (_name="default::viper::base_link::imu_sensor") at gazebo/common/SingletonT.hh:48
#8  0x00007fd24f46dc8e in PublishPerformanceMetrics() () at gazebo/sensors/SensorManager.cc:231
#9  0x00007fd24f46eb6c in gazebo::sensors::SensorManager::SensorContainer::Update(bool) (this=0x5609edb22dc0, _force=false) at gazebo/sensors/SensorManager.cc:856
#10 0x00007fd24f46cb31 in gazebo::sensors::SensorManager::Update(bool) (this=0x7fd251212360 <SingletonT<gazebo::sensors::SensorManager>::GetInstance()::t>, _force=<optimized out>) at gazebo/sensors/SensorManager.cc:373
#11 0x00007fd250fdd647 in gazebo::Server::Run() (this=0x5609ed1c9060) at gazebo/Server.cc:619
#12 0x00005609ec032289 in main (argc=12, argv=0x7ffeb1ca0fd8) at gazebo/server_main.cc:45
#13 0x00007fd24f6f0bf7 in __libc_start_main (main=0x5609ec032190 <main>, argc=12, argv=0x7ffeb1ca0fd8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffeb1ca0fc8) at ../csu/libc-start.c:310
#14 0x00005609ec0326ba in _start () at /usr/include/ignition/math4/ignition/math/Vector3.hh:85
scpeters added a commit to scpeters/gazebo_ros_pkgs that referenced this issue Dec 12, 2020
We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in ros-simulation#1175 and
gazebosim/gazebo-classic#2902.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo_ros_pkgs that referenced this issue Dec 12, 2020
We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in ros-simulation#1175 and
gazebosim/gazebo-classic#2902.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Dec 15, 2020
We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a GAZEBO_ROS_HAS_PERFORMANCE_METRICS
macro that reduces duplication of the version checking logic for
performance metrics in gazebo and adds fixes some doc-string and
typos in existing code

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo_ros_pkgs that referenced this issue Dec 22, 2020
We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
This changes gazebo_ros_init to only subscribe to the gazebo topic
if there are any subscribers to the corresponding ROS topic.
While advertiser callbacks are used in ROS 1 but are not yet in ROS2,
here we use polling in the GazeboRosInitPrivate::PublishSimTime
callback to check for subscribers since it is called for each Gazebo
time step.

This also helps workaround the deadlock documented in ros-simulation#1175 and
gazebosim/gazebo-classic#2902.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Dec 22, 2020
Backport of #1202 to melodic-devel.

We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a GAZEBO_ROS_HAS_PERFORMANCE_METRICS
macro that reduces duplication of the version checking logic for
performance metrics in gazebo and adds fixes some doc-string and
typos in existing code

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Dec 22, 2020
…1205)

We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
This changes gazebo_ros_init to only subscribe to the gazebo topic
if there are any subscribers to the corresponding ROS topic.
While advertiser callbacks are used in ROS 1 but are not yet in ROS2,
here we use polling in the GazeboRosInitPrivate::PublishSimTime
callback to check for subscribers since it is called for each Gazebo
time step.

This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a macro to reduce duplication of the version checking
logic.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo_ros_pkgs that referenced this issue Dec 22, 2020
We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in ros-simulation#1175 and
gazebosim/gazebo-classic#2902.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Dec 23, 2020
Backport of #1202 to kinetic-level.

We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
The link_states and model_states topics currently use an advertise
mechanism with callbacks when a subscriber connects or disconnects,
so I've used that same pattern for the performance_metrics topic.
This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a GAZEBO_ROS_HAS_PERFORMANCE_METRICS
macro that reduces duplication of the version checking logic for
performance metrics in gazebo and adds fixes some doc-string and
typos in existing code

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo that referenced this issue Jan 17, 2021
Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in gazebosim#2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by only the SensorManager thread
instead of all the SensorContainer threads. It is also
left unprotected by a mutex, as it may not be needed.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo that referenced this issue Jan 20, 2021
Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in gazebosim#2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by only the SensorManager thread
instead of all the SensorContainer threads. It is also
left unprotected by a mutex, as it may not be needed.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo that referenced this issue Jan 20, 2021
Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in gazebosim#2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by the SensorManager thread instead
of by each SensorContainer thread. This change will avoid the
deadlock.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
@scpeters
Copy link
Member Author

An easy way to reproduce this is to load the pr2.world with --lockstep. The entire simulation will hang if you subscribe to /gazebo/performance_metrics

gazebo --verbose --lockstep worlds/pr2.world

@scpeters
Copy link
Member Author

test and fix in #2917

scpeters added a commit that referenced this issue Jan 21, 2021
* Add test for performance metrics deadlock

Load the pr2 world with --lockstep and then subscribe to
/gazebo/performance_metrics to reproduce the deadlock.

* Fix deadlock when publishing performance metrics

Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in #2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by the SensorManager thread instead
of by each SensorContainer thread. This change will avoid the
deadlock.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo that referenced this issue Jan 26, 2021
* Add test for performance metrics deadlock

Load the pr2 world with --lockstep and then subscribe to
/gazebo/performance_metrics to reproduce the deadlock.

* Fix deadlock when publishing performance metrics

Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in gazebosim#2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by the SensorManager thread instead
of by each SensorContainer thread. This change will avoid the
deadlock.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit to scpeters/gazebo that referenced this issue Jan 26, 2021
* Add test for performance metrics deadlock

Load the pr2 world with --lockstep and then subscribe to
/gazebo/performance_metrics to reproduce the deadlock.

* Fix deadlock when publishing performance metrics

Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in gazebosim#2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by the SensorManager thread instead
of by each SensorContainer thread. This change will avoid the
deadlock.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
scpeters added a commit that referenced this issue Jan 28, 2021
* Add test for performance metrics deadlock

Load the pr2 world with --lockstep and then subscribe to
/gazebo/performance_metrics to reproduce the deadlock.

* Fix deadlock when publishing performance metrics

Currently the PublishPerformanceMetrics function is called within
SensorManager.cc by each SensorContainer with protection by the
mutex of each SensorContainer. This leads to a deadlock when a
world has sensors in more than one container (i.e. more than one
category of sensors) as described in #2902.

Since the PublishPerformanceMetrics function is not a member
function of SensorContainer and uses only global variables,
the function can be called by the SensorManager thread instead
of by each SensorContainer thread. This change will avoid the
deadlock.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
jacobperron pushed a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Feb 3, 2021
…1205)

We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
This changes gazebo_ros_init to only subscribe to the gazebo topic
if there are any subscribers to the corresponding ROS topic.
While advertiser callbacks are used in ROS 1 but are not yet in ROS2,
here we use polling in the GazeboRosInitPrivate::PublishSimTime
callback to check for subscribers since it is called for each Gazebo
time step.

This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a macro to reduce duplication of the version checking
logic.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
jacobperron pushed a commit to ros-simulation/gazebo_ros_pkgs that referenced this issue Mar 11, 2021
…1205)

We are currently subscribing to the /gazebo/performance_metrics topic
even if there are no subscribers to the ROS topic forwarding this data.
This changes gazebo_ros_init to only subscribe to the gazebo topic
if there are any subscribers to the corresponding ROS topic.
While advertiser callbacks are used in ROS 1 but are not yet in ROS2,
here we use polling in the GazeboRosInitPrivate::PublishSimTime
callback to check for subscribers since it is called for each Gazebo
time step.

This also helps workaround the deadlock documented in #1175 and
gazebosim/gazebo-classic#2902.

This also adds a macro to reduce duplication of the version checking
logic.

Signed-off-by: Steve Peters <scpeters@openrobotics.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant