Fix for lock order inversion between uri_map_mutex and conn_mutex #185

Merged
gautvenk merged 3 commits into master from os-deadlock-8-10 on Aug 11, 2020

Conversation

gautvenk
Contributor

  • removing connection mutex lock in sendToOne() to remove order dependency
  • protecting reference vectors using a new mutex instead of uri_map_mutex

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>

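A minimal sketch of the shape those two changes take (illustrative only, not the merged diff; the actual hunks are quoted in the review below, and the surrounding class declarations from the tree are assumed). sendToOne() stops taking conn_mutex, and the pending-update reference vectors in on_policy_update_async() are guarded by a dedicated ref_vec_mutex instead of uri_map_mutex:

// Sketch only: member and function names come from this PR, bodies are abbreviated.
void OpflexListener::sendToOne(OpflexServerConnection* conn,
                               OpflexMessage* message) {
    std::unique_ptr<OpflexMessage> messagep(message);
    // conn_mutex is no longer acquired here, so this path can no longer
    // take conn_mutex after uri_map_mutex.
    if (!active) return;
    conn->sendMessage(message->clone());
}

void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
    OpflexServerConnection* conn = (OpflexServerConnection*)handle->data;
    // The replace/merge/deleted reference vectors get their own lock
    // (ref_vec_mutex) instead of reusing uri_map_mutex.
    std::lock_guard<std::mutex> lock(conn->ref_vec_mutex);
    // ... build and send the pending policy update as before ...
}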
@gautvenk
Contributor Author

Adding more details on the locks and their ordering:

T14 (Policy processing/deserializing thread):

B) Mutex M1775 acquired here while holding mutex M1599 in thread T14:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x153c76)
#2 std::mutex::lock() /usr/include/c++/9/bits/std_mutex.h:100 (libopflex.so.0+0x153c76)
#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x153c76) <-- Needed. Added as part of GbpServer integration - 2eb1b4c
#4 opflex::engine::internal::OpflexServerConnection::getUri(opflex::modb::URI const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:170 (libopflex.so.0+0x153c76)
#5 opflex::engine::internal::OpflexListener::addPendingUpdate(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:198 (libopflex.so.0+0x155a74)
#6 opflex::engine::internal::GbpOpflexServerImpl::remoteObjectUpdated(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:349 (libopflex.so.0+0x1628c1)
#7 opflex::engine::internal::MOSerializer::deserialize(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, opflex::modb::mointernal::StoreClient&, bool, std::unordered_map<opflex::modb::URI, unsigned long, std::hash<opflex::modb::URI>, std::equal_to<opflex::modb::URI>, std::allocator<std::pair<opflex::modb::URI const, unsigned long> > >) /home/noiro/work/opflex/libopflex/engine/MOSerializer.cpp:343 (libopflex.so.0+0x10250a)
#8 opflex::engine::internal::OpflexServerHandler::handleEPDeclareReq(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerHandler.cpp:433 (libopflex.so.0+0x17904a)
#9 yajr::rpc::InbReq<&yajr::rpc::method::endpoint_declare>::process() const /home/noiro/work/opflex/libopflex/engine/OpflexHandler.cpp:214 (libopflex.so.0+0x131551)
#10 yajr::comms::internal::CommunicationPeer::readBufferZ(char const*, unsigned long) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:279 (libopflex.so.0+0xe7359)
#11 yajr::comms::internal::CommunicationPeer::readBuffer(char*, unsigned long, bool) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:256 (libopflex.so.0+0xe75ae)
#12 yajr::transport::Cb<yajr::transport::PlainText>::on_read(uv_stream_s*, long, uv_buf_t const*) transport/PlainText.cpp:85 (libopflex.so.0+0xc3e38)
#13 (libuv.so.1+0x155ce)
#14 (libtsan.so.0+0x2aba6)

A) Mutex M1599 previously acquired by the same thread here: <-- Called to give policy update to each agent connection
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x1559fd)
#2 __gthread_recursive_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811 (libopflex.so.0+0x1559fd)
#3 std::recursive_mutex::lock() /usr/include/c++/9/mutex:106 (libopflex.so.0+0x1559fd)
#4 std::lock_guard<std::recursive_mutex>::lock_guard(std::recursive_mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x1559fd) <-- Needed, fixed recently via 8af14ce
#5 opflex::engine::internal::OpflexListener::addPendingUpdate(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:196 (libopflex.so.0+0x1559fd)
#6 opflex::engine::internal::GbpOpflexServerImpl::remoteObjectUpdated(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:349 (libopflex.so.0+0x1628c1)
#7 opflex::engine::internal::MOSerializer::deserialize(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, opflex::modb::mointernal::StoreClient&, bool, std::unordered_map<opflex::modb::URI, unsigned long, std::hash<opflex::modb::URI>, std::equal_to<opflex::modb::URI>, std::allocator<std::pair<opflex::modb::URI const, unsigned long> > >) /home/noiro/work/opflex/libopflex/engine/MOSerializer.cpp:343 (libopflex.so.0+0x10250a)
#8 opflex::engine::internal::OpflexServerHandler::handleEPDeclareReq(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerHandler.cpp:433 (libopflex.so.0+0x17904a)
#9 yajr::rpc::InbReq<&yajr::rpc::method::endpoint_declare>::process() const /home/noiro/work/opflex/libopflex/engine/OpflexHandler.cpp:214 (libopflex.so.0+0x131551)
#10 yajr::comms::internal::CommunicationPeer::readBufferZ(char const*, unsigned long) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:279 (libopflex.so.0+0xe7359)
#11 yajr::comms::internal::CommunicationPeer::readBuffer(char*, unsigned long, bool) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:256 (libopflex.so.0+0xe75ae)
#12 yajr::transport::Cb<yajr::transport::PlainText>::on_read(uv_stream_s*, long, uv_buf_t const*) transport/PlainText.cpp:85 (libopflex.so.0+0xc3e38)
#13 (libuv.so.1+0x155ce)
#14 (libtsan.so.0+0x2aba6)

T16 (Thread to update policies for every agent connection):

B) Mutex M1599(conn_mutex) acquired here while holding mutex M1775 in thread T16:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 opflex::engine::internal::OpflexListener::sendToOne(opflex::engine::internal::OpflexServerConnection*, opflex::engine::internal::OpflexMessage*) /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x156b39)
#2 opflex::engine::internal::GbpOpflexServerImpl::policyUpdate(opflex::engine::internal::OpflexServerConnection*, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:278 (libopflex.so.0+0x16318a)
#3 opflex::engine::internal::OpflexServerConnection::on_policy_update_async(uv_async_s*) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:242 (libopflex.so.0+0x152137)
#4 (libuv.so.1+0xb3b3)
#5 (libtsan.so.0+0x2aba6)

185 void OpflexListener::sendToOne(OpflexServerConnection* conn, OpflexMessage* message) {
186 std::unique_ptr<OpflexMessage> messagep(message);
187 const std::lock_guard<std::recursive_mutex> lock(conn_mutex); <-- We don't need locking here
188 if (!active) return;
189 conn->sendMessage(message->clone());

A) Mutex M1775 (uri_map_mutex) previously acquired by the same thread here:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x1520db)
#2 std::mutex::lock() /usr/include/c++/9/bits/std_mutex.h:100 (libopflex.so.0+0x1520db)
#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x1520db)
#4 opflex::engine::internal::OpflexServerConnection::on_policy_update_async(uv_async_s*) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:237 (libopflex.so.0+0x1520db)
#5 (libuv.so.1+0xb3b3)
#6 (libtsan.so.0+0x2aba6)

233 void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
234 OpflexServerConnection* conn = (OpflexServerConnection *)handle->data;
235 GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
236 (conn->listener->getHandlerFactory());
237 std::lock_guard<std::mutex> lock(conn->uri_map_mutex); <-- This was changed from uri_update_mutex to uri_map_mutex in 3351d64.
238
239 if (conn->replace.empty() && conn->merge.empty() && conn->deleted.empty())
240 return;
241
242 server->policyUpdate(conn, conn->replace, conn->merge, conn->deleted);
243
244 conn->replace.clear();
245 conn->merge.clear();
246 conn->deleted.clear();
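Condensing the two reports above into the opposite acquisition orders they describe (a sketch added for readability; the mutex IDs and call sites are from the TSAN output, but the function bodies are stand-ins, not opflex code):

#include <mutex>

std::recursive_mutex conn_mutex;   // M1599, owned by OpflexListener
std::mutex uri_map_mutex;          // M1775, owned by OpflexServerConnection

// T14: handleEPDeclareReq -> remoteObjectUpdated -> addPendingUpdate -> getUri
void t14_path() {
    std::lock_guard<std::recursive_mutex> c(conn_mutex);  // addPendingUpdate
    std::lock_guard<std::mutex> u(uri_map_mutex);         // getUri
}

// T16: on_policy_update_async -> policyUpdate -> sendToOne
void t16_path() {
    std::lock_guard<std::mutex> u(uri_map_mutex);         // on_policy_update_async
    std::lock_guard<std::recursive_mutex> c(conn_mutex);  // sendToOne
}

// If T14 holds conn_mutex while T16 holds uri_map_mutex, each thread then
// blocks on the lock the other already owns: a lock order inversion.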

Thread T14 (tid=7642, running) created by main thread at:
#0 pthread_create (libtsan.so.0+0x2d3be)
#1 uv_thread_create (libuv.so.1+0x17cc0)
#2 opflex::engine::internal::GbpOpflexServerImpl::start() /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:127 (libopflex.so.0+0x165b62)
#3 opflex::test::GbpOpflexServer::start() /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:50 (libopflex.so.0+0x165c32)
#4 main server/opflex_server.cpp:249 (opflex_server+0x1f7fa)

Thread T16 (tid=7644, running) created by thread T14 at:
#0 pthread_create (libtsan.so.0+0x2d3be)
#1 uv_thread_create (libuv.so.1+0x17cc0)
#2 opflex::engine::internal::OpflexListener::on_new_connection(yajr::Listener*, void*, int) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:161 (libopflex.so.0+0x15690e)
#3 yajr::comms::internal::ListeningPeer::getConnectionHandlerData() ../include/opflex/yajr/internal/comms.hpp:1116 (libopflex.so.0+0xb9ce8)
#4 yajr::comms::internal::ListeningTcpPeer::getNewPassive() /home/noiro/work/opflex/libopflex/comms/passive_listener.cpp:333 (libopflex.so.0+0xb9ce8)
#5 yajr::comms::internal::on_passive_connection(uv_stream_s*, int) /home/noiro/work/opflex/libopflex/comms/passive_listener.cpp:296 (libopflex.so.0+0xb74b4)
#6 uv__server_io (libuv.so.1+0x1679b)
#7 (libtsan.so.0+0x2aba6)

@@ -184,7 +184,6 @@ void OpflexListener::sendToAll(OpflexMessage* message) {

void OpflexListener::sendToOne(OpflexServerConnection* conn, OpflexMessage* message) {
std::unique_ptr<OpflexMessage> messagep(message);
-    const std::lock_guard<std::recursive_mutex> lock(conn_mutex);
Contributor

This needs a lock when we have multiple agents.

Contributor

Just checked the code: it needs the lock even with a single agent, since we could be closing the conn while sendToOne is in progress (similar to sendToAll). I kind of follow what TSAN is complaining about. Will see how to fix it and get back.
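To illustrate the concern (a hypothetical, self-contained sketch; these names are stand-ins, not the opflex code): without the lock, the connection can be torn down on the I/O loop while another thread is inside sendToOne() still using the same pointer.

#include <mutex>

struct Conn { void sendMessage() {} };   // stand-in for OpflexServerConnection

std::recursive_mutex conn_mutex;

// Thread A: pushing a policy update to one agent
void send_to_one(Conn* conn) {
    // If conn_mutex is not held here (or by the caller), the teardown
    // path below can delete 'conn' while this call is in flight.
    conn->sendMessage();
}

// Thread B: connection close on the I/O loop
void on_disconnect(Conn* conn) {
    std::lock_guard<std::recursive_mutex> lock(conn_mutex);
    delete conn;   // any unsynchronized concurrent send_to_one now races
}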

Contributor Author

Discussing this with Madhu. If we need to protect access to conn, then we need to put the lock outside sendToOne(), where conn continues to be accessed - in OpflexServerConnection::on_policy_update_async(). I will also enforce the lock order of conn_mutex and ref_vec_mutex to address the deadlock (a sketch of that direction follows below). However, the critical section will be longer here. We are going to discuss further and see how to address this cleanly. Currently we are passing connection pointers in the handle; maybe we need to pass a conn ID instead, check that it is a valid connection every time we access it, and reduce the critical section. Will raise an issue to track the cleanup/large critical section.
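A sketch of the direction described here (the shape only, not the merged diff; names are from the discussion above, and access to the listener's conn_mutex from the connection is an assumption about how the real code exposes it):

// Fixed acquisition order: conn_mutex first, then ref_vec_mutex, on every
// path that needs both. sendToOne() itself no longer locks, so it relies on
// the caller already holding conn_mutex.
void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
    OpflexServerConnection* conn = (OpflexServerConnection*)handle->data;
    GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
        (conn->listener->getHandlerFactory());

    std::lock_guard<std::recursive_mutex> conn_lock(conn->listener->conn_mutex);
    std::lock_guard<std::mutex> vec_lock(conn->ref_vec_mutex);

    if (conn->replace.empty() && conn->merge.empty() && conn->deleted.empty())
        return;

    server->policyUpdate(conn, conn->replace, conn->merge, conn->deleted);
    conn->replace.clear();
    conn->merge.clear();
    conn->deleted.clear();
}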

@@ -234,7 +234,7 @@ void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
OpflexServerConnection* conn = (OpflexServerConnection *)handle->data;
GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
(conn->listener->getHandlerFactory());
-    std::lock_guard<std::mutex> lock(conn->uri_map_mutex);
+    std::lock_guard<std::mutex> lock(conn->ref_vec_mutex);
Contributor

This is fine, but I am not sure why a single lock is not good enough. Is it due to your other commit?

Contributor Author

This just avoids the inconsistency of using uri_map_mutex for the reference vectors. It was changed from uri_update_mutex (original commit) to uri_map_mutex in 3351d64; this change just restores the original separation, using ref_vec_mutex instead of uri_update_mutex for clarity. This won't address the lock ordering issue on its own, though.

- I am enforcing the ordering now: conn_mutex is obtained before ref_vec_mutex. This will take care of the deadlock.
- The critical section for conn_mutex is larger. We will discuss further and optimize the critical section, and possibly move to a connID-->connPtr map for more safety (sketched below).
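For the possible follow-up mentioned in the last bullet, a hypothetical sketch of the connID-->connPtr idea (none of these names exist in the tree; it only illustrates the indirection):

#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Conn;  // stand-in for OpflexServerConnection

// Registry mapping connection IDs to live connections. Handing async
// callbacks an ID instead of a raw pointer turns a stale handle into a
// harmless lookup miss and keeps each critical section short.
class ConnRegistry {
public:
    void add(uint64_t id, Conn* c) {
        std::lock_guard<std::mutex> g(mtx_);
        map_[id] = c;
    }
    void remove(uint64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        map_.erase(id);
    }
    // Returns nullptr if the connection has already gone away.
    Conn* find(uint64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        auto it = map_.find(id);
        return it == map_.end() ? nullptr : it->second;
    }
private:
    std::mutex mtx_;
    std::unordered_map<uint64_t, Conn*> map_;
};

In practice the pointer returned by find() would still need to be used under a lock (or held via a shared_ptr) to be fully safe; the sketch only shows the ID indirection that bounds the critical section.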

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>
…ar to OpflexServerConnection::on_policy_update_async()

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>
@gautvenk gautvenk merged commit 8187930 into master Aug 11, 2020
@gautvenk gautvenk deleted the os-deadlock-8-10 branch August 11, 2020 01:19
@gautvenk
Contributor Author

Opflex-cni-test passed as well.
