Fix for lock order inversion between uri_map_mutex and conn_mutex #185

Merged
gautvenk merged 3 commits into master from os-deadlock-8-10 on Aug 11, 2020

Conversation

gautvenk
Contributor

  • removing connection mutex lock in sendToOne() to remove order dependency
  • protecting reference vectors using a new mutex instead of uri_map_mutex

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>

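A minimal sketch of the shape those two changes take (illustrative only, not the merged diff; the actual hunks are quoted in the review below, and the surrounding class declarations from the tree are assumed). sendToOne() stops taking conn_mutex, and the pending-update reference vectors in on_policy_update_async() are guarded by a dedicated ref_vec_mutex instead of uri_map_mutex:

// Sketch only: member and function names come from this PR, bodies are abbreviated.
void OpflexListener::sendToOne(OpflexServerConnection* conn,
                               OpflexMessage* message) {
    std::unique_ptr<OpflexMessage> messagep(message);
    // conn_mutex is no longer acquired here, so this path can no longer
    // take conn_mutex after uri_map_mutex.
    if (!active) return;
    conn->sendMessage(message->clone());
}

void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
    OpflexServerConnection* conn = (OpflexServerConnection*)handle->data;
    // The replace/merge/deleted reference vectors get their own lock
    // (ref_vec_mutex) instead of reusing uri_map_mutex.
    std::lock_guard<std::mutex> lock(conn->ref_vec_mutex);
    // ... build and send the pending policy update as before ...
}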
@gautvenk
Contributor Author

Adding more details on the locks and their ordering:

T14 (Policy processing/deserializing thread):

B) Mutex M1775 acquired here while holding mutex M1599 in thread T14:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x153c76)
#2 std::mutex::lock() /usr/include/c++/9/bits/std_mutex.h:100 (libopflex.so.0+0x153c76)
#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x153c76) <-- Needed. Added as part of GbpServer integration - 2eb1b4c
#4 opflex::engine::internal::OpflexServerConnection::getUri(opflex::modb::URI const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:170 (libopflex.so.0+0x153c76)
#5 opflex::engine::internal::OpflexListener::addPendingUpdate(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:198 (libopflex.so.0+0x155a74)
#6 opflex::engine::internal::GbpOpflexServerImpl::remoteObjectUpdated(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:349 (libopflex.so.0+0x1628c1)
#7 opflex::engine::internal::MOSerializer::deserialize(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, opflex::modb::mointernal::StoreClient&, bool, std::unordered_map<opflex::modb::URI, unsigned long, std::hash<opflex::modb::URI>, std::equal_to<opflex::modb::URI>, std::allocator<std::pair<opflex::modb::URI const, unsigned long> > >) /home/noiro/work/opflex/libopflex/engine/MOSerializer.cpp:343 (libopflex.so.0+0x10250a)
#8 opflex::engine::internal::OpflexServerHandler::handleEPDeclareReq(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerHandler.cpp:433 (libopflex.so.0+0x17904a)
#9 yajr::rpc::InbReq<&yajr::rpc::method::endpoint_declare>::process() const /home/noiro/work/opflex/libopflex/engine/OpflexHandler.cpp:214 (libopflex.so.0+0x131551)
#10 yajr::comms::internal::CommunicationPeer::readBufferZ(char const*, unsigned long) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:279 (libopflex.so.0+0xe7359)
#11 yajr::comms::internal::CommunicationPeer::readBuffer(char*, unsigned long, bool) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:256 (libopflex.so.0+0xe75ae)
#12 yajr::transport::Cb<yajr::transport::PlainText>::on_read(uv_stream_s*, long, uv_buf_t const*) transport/PlainText.cpp:85 (libopflex.so.0+0xc3e38)
#13 (libuv.so.1+0x155ce)
#14 (libtsan.so.0+0x2aba6)

A) Mutex M1599 previously acquired by the same thread here: <-- Called to give policy update to each agent connection
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x1559fd)
#2 __gthread_recursive_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:811 (libopflex.so.0+0x1559fd)
#3 std::recursive_mutex::lock() /usr/include/c++/9/mutex:106 (libopflex.so.0+0x1559fd)
#4 std::lock_guard<std::recursive_mutex>::lock_guard(std::recursive_mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x1559fd) <-- Needed, fixed recently via 8af14ce
#5 opflex::engine::internal::OpflexListener::addPendingUpdate(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:196 (libopflex.so.0+0x1559fd)
#6 opflex::engine::internal::GbpOpflexServerImpl::remoteObjectUpdated(unsigned long, opflex::modb::URI const&, opflex::gbp::PolicyUpdateOp) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:349 (libopflex.so.0+0x1628c1)
#7 opflex::engine::internal::MOSerializer::deserialize(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, opflex::modb::mointernal::StoreClient&, bool, std::unordered_map<opflex::modb::URI, unsigned long, std::hash<opflex::modb::URI>, std::equal_to<opflex::modb::URI>, std::allocator<std::pair<opflex::modb::URI const, unsigned long> > >) /home/noiro/work/opflex/libopflex/engine/MOSerializer.cpp:343 (libopflex.so.0+0x10250a)
#8 opflex::engine::internal::OpflexServerHandler::handleEPDeclareReq(rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, rapidjson::GenericValue<rapidjson::UTF8, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&) /home/noiro/work/opflex/libopflex/engine/OpflexServerHandler.cpp:433 (libopflex.so.0+0x17904a)
#9 yajr::rpc::InbReq<&yajr::rpc::method::endpoint_declare>::process() const /home/noiro/work/opflex/libopflex/engine/OpflexHandler.cpp:214 (libopflex.so.0+0x131551)
#10 yajr::comms::internal::CommunicationPeer::readBufferZ(char const*, unsigned long) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:279 (libopflex.so.0+0xe7359)
#11 yajr::comms::internal::CommunicationPeer::readBuffer(char*, unsigned long, bool) /home/noiro/work/opflex/libopflex/comms/CommunicationPeer.cpp:256 (libopflex.so.0+0xe75ae)
#12 yajr::transport::Cb<yajr::transport::PlainText>::on_read(uv_stream_s*, long, uv_buf_t const*) transport/PlainText.cpp:85 (libopflex.so.0+0xc3e38)
#13 (libuv.so.1+0x155ce)
#14 (libtsan.so.0+0x2aba6)

T16 (Thread to update policies for every agent connection):

B) Mutex M1599(conn_mutex) acquired here while holding mutex M1775 in thread T16:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 opflex::engine::internal::OpflexListener::sendToOne(opflex::engine::internal::OpflexServerConnection*, opflex::engine::internal::OpflexMessage*) /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x156b39)
#2 opflex::engine::internal::GbpOpflexServerImpl::policyUpdate(opflex::engine::internal::OpflexServerConnection*, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&, std::vector<std::pair<unsigned long, opflex::modb::URI>, std::allocator<std::pair<unsigned long, opflex::modb::URI> > > const&) /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:278 (libopflex.so.0+0x16318a)
#3 opflex::engine::internal::OpflexServerConnection::on_policy_update_async(uv_async_s*) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:242 (libopflex.so.0+0x152137)
#4 (libuv.so.1+0xb3b3)
#5 (libtsan.so.0+0x2aba6)

185 void OpflexListener::sendToOne(OpflexServerConnection* conn, OpflexMessage* message) {
186 std::unique_ptr<OpflexMessage> messagep(message);
187 const std::lock_guard<std::recursive_mutex> lock(conn_mutex); <-- We don't need locking here
188 if (!active) return;
189 conn->sendMessage(message->clone());

A) Mutex M1775 (uri_map_mutex) previously acquired by the same thread here:
#0 pthread_mutex_lock (libtsan.so.0+0x41d5b)
#1 __gthread_mutex_lock /usr/include/x86_64-linux-gnu/c++/9/bits/gthr-default.h:749 (libopflex.so.0+0x1520db)
#2 std::mutex::lock() /usr/include/c++/9/bits/std_mutex.h:100 (libopflex.so.0+0x1520db)
#3 std::lock_guard<std::mutex>::lock_guard(std::mutex&) /usr/include/c++/9/bits/std_mutex.h:159 (libopflex.so.0+0x1520db)
#4 opflex::engine::internal::OpflexServerConnection::on_policy_update_async(uv_async_s*) /home/noiro/work/opflex/libopflex/engine/OpflexServerConnection.cpp:237 (libopflex.so.0+0x1520db)
#5 (libuv.so.1+0xb3b3)
#6 (libtsan.so.0+0x2aba6)

233 void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
234 OpflexServerConnection* conn = (OpflexServerConnection *)handle->data;
235 GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
236 (conn->listener->getHandlerFactory());
237 std::lock_guard<std::mutex> lock(conn->uri_map_mutex); <-- This was changed from uri_update_mutex to uri_map_mutex in 3351d64.
238
239 if (conn->replace.empty() && conn->merge.empty() && conn->deleted.empty())
240 return;
241
242 server->policyUpdate(conn, conn->replace, conn->merge, conn->deleted);
243
244 conn->replace.clear();
245 conn->merge.clear();
246 conn->deleted.clear();
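Condensing the two reports above into the opposite acquisition orders they describe (a sketch added for readability; the mutex IDs and call sites are from the TSAN output, but the function bodies are stand-ins, not opflex code):

#include <mutex>

std::recursive_mutex conn_mutex;   // M1599, owned by OpflexListener
std::mutex uri_map_mutex;          // M1775, owned by OpflexServerConnection

// T14: handleEPDeclareReq -> remoteObjectUpdated -> addPendingUpdate -> getUri
void t14_path() {
    std::lock_guard<std::recursive_mutex> c(conn_mutex);  // addPendingUpdate
    std::lock_guard<std::mutex> u(uri_map_mutex);         // getUri
}

// T16: on_policy_update_async -> policyUpdate -> sendToOne
void t16_path() {
    std::lock_guard<std::mutex> u(uri_map_mutex);         // on_policy_update_async
    std::lock_guard<std::recursive_mutex> c(conn_mutex);  // sendToOne
}

// If T14 holds conn_mutex while T16 holds uri_map_mutex, each thread then
// blocks on the lock the other already owns: a lock order inversion.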

Thread T14 (tid=7642, running) created by main thread at:
#0 pthread_create (libtsan.so.0+0x2d3be)
#1 uv_thread_create (libuv.so.1+0x17cc0)
#2 opflex::engine::internal::GbpOpflexServerImpl::start() /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:127 (libopflex.so.0+0x165b62)
#3 opflex::test::GbpOpflexServer::start() /home/noiro/work/opflex/libopflex/engine/GbpOpflexServer.cpp:50 (libopflex.so.0+0x165c32)
#4 main server/opflex_server.cpp:249 (opflex_server+0x1f7fa)

Thread T16 (tid=7644, running) created by thread T14 at:
#0 pthread_create (libtsan.so.0+0x2d3be)
#1 uv_thread_create (libuv.so.1+0x17cc0)
#2 opflex::engine::internal::OpflexListener::on_new_connection(yajr::Listener*, void*, int) /home/noiro/work/opflex/libopflex/engine/OpflexListener.cpp:161 (libopflex.so.0+0x15690e)
#3 yajr::comms::internal::ListeningPeer::getConnectionHandlerData() ../include/opflex/yajr/internal/comms.hpp:1116 (libopflex.so.0+0xb9ce8)
#4 yajr::comms::internal::ListeningTcpPeer::getNewPassive() /home/noiro/work/opflex/libopflex/comms/passive_listener.cpp:333 (libopflex.so.0+0xb9ce8)
#5 yajr::comms::internal::on_passive_connection(uv_stream_s*, int) /home/noiro/work/opflex/libopflex/comms/passive_listener.cpp:296 (libopflex.so.0+0xb74b4)
#6 uv__server_io (libuv.so.1+0x1679b)
#7 (libtsan.so.0+0x2aba6)

@@ -184,7 +184,6 @@ void OpflexListener::sendToAll(OpflexMessage* message) {

void OpflexListener::sendToOne(OpflexServerConnection* conn, OpflexMessage* message) {
std::unique_ptr<OpflexMessage> messagep(message);
-    const std::lock_guard<std::recursive_mutex> lock(conn_mutex);
Contributor

This needs a lock when we have multiple agents.

Contributor

Just checked the code: it needs the lock even with a single agent, since we could be closing the conn while sendToOne is in progress (similar to sendToAll). I kind of follow what TSAN is complaining about. Will see how to fix it and get back.
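To illustrate the concern (a hypothetical, self-contained sketch; these names are stand-ins, not the opflex code): without the lock, the connection can be torn down on the I/O loop while another thread is inside sendToOne() still using the same pointer.

#include <mutex>

struct Conn { void sendMessage() {} };   // stand-in for OpflexServerConnection

std::recursive_mutex conn_mutex;

// Thread A: pushing a policy update to one agent
void send_to_one(Conn* conn) {
    // If conn_mutex is not held here (or by the caller), the teardown
    // path below can delete 'conn' while this call is in flight.
    conn->sendMessage();
}

// Thread B: connection close on the I/O loop
void on_disconnect(Conn* conn) {
    std::lock_guard<std::recursive_mutex> lock(conn_mutex);
    delete conn;   // any unsynchronized concurrent send_to_one now races
}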

Contributor Author

Discussing this with Madhu. If we need to protect access to conn, then we need to put the lock outside sendToOne(), where conn continues to be accessed - in OpflexServerConnection::on_policy_update_async(). I will also enforce the lock order of conn_mutex and ref_vec_mutex to address the deadlock (a sketch of that direction follows below). However, the critical section will be longer here. We are going to discuss further and see how to address this cleanly. Currently we are passing connection pointers in the handle; maybe we need to pass a conn ID instead, check that it is a valid connection every time we access it, and reduce the critical section. Will raise an issue to track the cleanup/large critical section.
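A sketch of the direction described here (the shape only, not the merged diff; names are from the discussion above, and access to the listener's conn_mutex from the connection is an assumption about how the real code exposes it):

// Fixed acquisition order: conn_mutex first, then ref_vec_mutex, on every
// path that needs both. sendToOne() itself no longer locks, so it relies on
// the caller already holding conn_mutex.
void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
    OpflexServerConnection* conn = (OpflexServerConnection*)handle->data;
    GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
        (conn->listener->getHandlerFactory());

    std::lock_guard<std::recursive_mutex> conn_lock(conn->listener->conn_mutex);
    std::lock_guard<std::mutex> vec_lock(conn->ref_vec_mutex);

    if (conn->replace.empty() && conn->merge.empty() && conn->deleted.empty())
        return;

    server->policyUpdate(conn, conn->replace, conn->merge, conn->deleted);
    conn->replace.clear();
    conn->merge.clear();
    conn->deleted.clear();
}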

@@ -234,7 +234,7 @@ void OpflexServerConnection::on_policy_update_async(uv_async_t* handle) {
OpflexServerConnection* conn = (OpflexServerConnection *)handle->data;
GbpOpflexServerImpl* server = dynamic_cast<GbpOpflexServerImpl*>
(conn->listener->getHandlerFactory());
-    std::lock_guard<std::mutex> lock(conn->uri_map_mutex);
+    std::lock_guard<std::mutex> lock(conn->ref_vec_mutex);
Contributor

This is fine, but I am not sure why a single lock is not good enough. Is it due to your other commit?

Contributor Author

This just avoids the inconsistency of using uri_map_mutex for the reference vectors. It was changed from uri_update_mutex (original commit) to uri_map_mutex in 3351d64; this change just restores the original separation, using ref_vec_mutex instead of uri_update_mutex for clarity. This won't address the lock ordering issue on its own, though.

- I am enforcing the ordering now: conn_mutex is obtained before ref_vec_mutex. This will take care of the deadlock.
- The critical section for conn_mutex is larger. We will discuss further and optimize the critical section, and possibly move to a connID-->connPtr map for more safety (sketched below).
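For the possible follow-up mentioned in the last bullet, a hypothetical sketch of the connID-->connPtr idea (none of these names exist in the tree; it only illustrates the indirection):

#include <cstdint>
#include <mutex>
#include <unordered_map>

struct Conn;  // stand-in for OpflexServerConnection

// Registry mapping connection IDs to live connections. Handing async
// callbacks an ID instead of a raw pointer turns a stale handle into a
// harmless lookup miss and keeps each critical section short.
class ConnRegistry {
public:
    void add(uint64_t id, Conn* c) {
        std::lock_guard<std::mutex> g(mtx_);
        map_[id] = c;
    }
    void remove(uint64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        map_.erase(id);
    }
    // Returns nullptr if the connection has already gone away.
    Conn* find(uint64_t id) {
        std::lock_guard<std::mutex> g(mtx_);
        auto it = map_.find(id);
        return it == map_.end() ? nullptr : it->second;
    }
private:
    std::mutex mtx_;
    std::unordered_map<uint64_t, Conn*> map_;
};

In practice the pointer returned by find() would still need to be used under a lock (or held via a shared_ptr) to be fully safe; the sketch only shows the ID indirection that bounds the critical section.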

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>
…ar to OpflexServerConnection::on_policy_update_async()

Signed-off-by: Gautam Venkataramanan <gautam.chennai@gmail.com>
@gautvenk gautvenk merged commit 8187930 into master Aug 11, 2020
@gautvenk gautvenk deleted the os-deadlock-8-10 branch August 11, 2020 01:19
@gautvenk
Contributor Author

Opflex-cni-test passed as well.
