Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

absl: Potential Mutex deadlock - when using signal handler #24884

Closed
hollste opened this issue Dec 2, 2020 · 18 comments
Closed

absl: Potential Mutex deadlock - when using signal handler #24884

hollste opened this issue Dec 2, 2020 · 18 comments

Comments

@hollste
Copy link

hollste commented Dec 2, 2020

What version of gRPC and what language are you using?

1.33.2 / C++

What operating system (Linux, Windows,...) and version?

Ubuntu 18.04

What runtime / compiler are you using (e.g. python version or version of gcc)

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

What did you do?

Tried to implement shutdown of a grpc::Server with a signal handler when pressing ctrl+c (SIGINT).
See repro case below.

#include <csignal>

static std::unique_ptr<grpc::Server> _server;

void shutdown(int)
{
    _server->Shutdown();
}

int main()
{
    signal(SIGINT, &shutdown);

    grpc::ServerBuilder builder;
    builder.AddListeningPort("127.0.0.1:5001", grpc::InsecureServerCredentials());
    testing::EchoTestService::Service service;
    builder.RegisterService(&service);
    _server = builder.BuildAndStart();
    _server->Wait();
    return 0;
}

What did you expect to see?

Normal shutdown and exit.

What did you see instead?

When built with dbg

^C[mutex.cc : 1356] RAW: Potential Mutex deadlock: 
	@ 0x55fa744bdc5a absl::lts_2020_02_25::DebugOnlyDeadlockCheck()
	@ 0x55fa744bdf30 absl::lts_2020_02_25::Mutex::Lock()
	@ 0x55fa744a5dc2 gpr_mu_lock
	@ 0x55fa742f677e grpc::CoreCodegen::gpr_mu_lock()
	@ 0x55fa74124077 grpc::internal::MutexLock::MutexLock()
	@ 0x55fa742da51b grpc::Server::ShutdownInternal()
	@ 0x55fa741216d2 grpc::ServerInterface::Shutdown()
	@ 0x55fa7412140f shutdown()
	@ 0x7f15c2bb3f20 
	@ 0x55fa744c435b absl::lts_2020_02_25::synchronization_internal::Waiter::Wait()
	@ 0x55fa744c4097 AbslInternalPerThreadSemWait
	@ 0x55fa744c2a92 absl::lts_2020_02_25::synchronization_internal::PerThreadSem::Wait()
	@ 0x55fa744c2e30 absl::lts_2020_02_25::Mutex::DecrementSynchSem()
	@ 0x55fa744c1c11 absl::lts_2020_02_25::CondVar::WaitCommon()
	@ 0x55fa744c1db9 absl::lts_2020_02_25::CondVar::Wait()
	@ 0x55fa744a5e9c gpr_cv_wait
	@ 0x55fa742f6822 grpc::CoreCodegen::gpr_cv_wait()
	@ 0x55fa742dc11b grpc::internal::CondVar::Wait()
	@ 0x55fa742dc0b5 grpc::internal::CondVar::Wait()
	@ 0x55fa742da9f6 grpc::Server::Wait()
	@ 0x55fa74121552 main
	@ 0x7f15c2b96b97 __libc_start_main

[mutex.cc : 1366] RAW: Acquiring 0x55fa75cabe78    Mutexes held:  0x55fa75cabe78
[mutex.cc : 1367] RAW: Cycle: 
[mutex.cc : 1381] RAW: mutex@0x55fa75cabe78 stack: 
	@ 0x55fa744bdc5a absl::lts_2020_02_25::DebugOnlyDeadlockCheck()
	@ 0x55fa744bdf30 absl::lts_2020_02_25::Mutex::Lock()
	@ 0x55fa744a5dc2 gpr_mu_lock
	@ 0x55fa742f677e grpc::CoreCodegen::gpr_mu_lock()
	@ 0x55fa74124077 grpc::internal::MutexLock::MutexLock()
	@ 0x55fa742da51b grpc::Server::ShutdownInternal()
	@ 0x55fa741216d2 grpc::ServerInterface::Shutdown()
	@ 0x55fa7412140f shutdown()
	@ 0x7f15c2bb3f20 
	@ 0x55fa744c435b absl::lts_2020_02_25::synchronization_internal::Waiter::Wait()
	@ 0x55fa744c4097 AbslInternalPerThreadSemWait
	@ 0x55fa744c2a92 absl::lts_2020_02_25::synchronization_internal::PerThreadSem::Wait()
	@ 0x55fa744c2e30 absl::lts_2020_02_25::Mutex::DecrementSynchSem()
	@ 0x55fa744c1c11 absl::lts_2020_02_25::CondVar::WaitCommon()
	@ 0x55fa744c1db9 absl::lts_2020_02_25::CondVar::Wait()
	@ 0x55fa744a5e9c gpr_cv_wait
	@ 0x55fa742f6822 grpc::CoreCodegen::gpr_cv_wait()
	@ 0x55fa742dc11b grpc::internal::CondVar::Wait()
	@ 0x55fa742dc0b5 grpc::internal::CondVar::Wait()
	@ 0x55fa742da9f6 grpc::Server::Wait()
	@ 0x55fa74121552 main
	@ 0x7f15c2b96b97 __libc_start_main

[mutex.cc : 1386] RAW: dying due to potential deadlock
Aborted (core dumped)

When built with opt


^C[mutex.cc : 2043] RAW: Check waitp == nullptr || waitp->thread->waitp == nullptr || waitp->thread->suppress_fatal_errors failed: detected illegal recursion into Mutex code
Aborted (core dumped)

Anything else we should know about your project / environment?

Works as expected on grpc 1.28.1 which was the previous version used before upgrading to 1.33.2.

@li-wu
Copy link

li-wu commented Dec 7, 2020

I have the same issue.

  • gcc (Debian 10.2.0-16) 10.2.0
  • grpc 1.33.1/C++

@veblush
Copy link
Contributor

veblush commented Dec 23, 2020

This is reproducible on the HEAD and it appears that this happens because the custom shutdown callback which is going to call server's shutdown on the its wait-thread which already holds the its mutex. Server's shutdown also tries to get the same mutex and that's why abseil complaint. (This isn't reproducible on Mac, though)

@vjpai Do you think this needs a fix? and is there a way to handle shutdown using signal properly?

@vjpai
Copy link
Member

vjpai commented Jan 11, 2021

This is interesting. AFAIK the actual CV waiting process should be releasing the mutex (since that would block out any further progress in any case) but in general signal handlers can be processed at any thread and if we happen to be in the moment between entering the CV::wait and the time that the mu is released, we could still see this deadlock. Maybe we need to investigate mutex-free shutdown or waiting.

@vjpai
Copy link
Member

vjpai commented Jan 11, 2021

A workaround is to have the signal handling function do the actual work in a new detached thread so I'm comfortable calling it a P2 or even P3.

@vjpai
Copy link
Member

vjpai commented Jan 12, 2021

Note that in general, even for libc, you can't use arbitrary library functions inside signal handlers. For example, Linux only supports the use of certain standard functions: http://manpages.ubuntu.com/manpages/bionic/man7/signal-safety.7.html . You should just assume that all gRPC functions are not async-signal-safe. If we find something that is async-signal-safe, we can document that at some point. Otherwise, if you want to do real gRPC work via signals, a safe way to do so is to start a "signal handler thread" at the beginning of main that you join just before exit and use signal masks to make sure that your signals get delivered only to that thread (being careful to only perform activities that are at least multithreaded-safe wrt other gRPC activities at the time). Since this is by no means a gRPC-specific issue, I'd recommend closing. @veblush I'll let you decide on closure.

@vjpai vjpai removed their assignment Jan 12, 2021
@veblush
Copy link
Contributor

veblush commented Mar 4, 2021

Closing this since calling gRPC functions in the signal handler is discouraged.

@orzel
Copy link

orzel commented May 7, 2021

I'm hit by this as well. I've had exactly this code (^c -> signal handler -> _server->Shutdown(); ), for very long. It has worked until grpc 1.32.0 included, and we're stuck with this version because of this bug.

It seems like I've misunderstood something important, and that I'm not alone. Hence two important questions I have:

  1. What's the recommended way to stop a server ? To send a "quit" message from a client ??
  2. if we want to be able to stop the server, either by ^C from a user, or from an init script, what's the recommende method ?

@vjpai
Copy link
Member

vjpai commented May 11, 2021

Your handler for SIGINT should just set some global variable, let's call it sigint_received . When you enter main, spawn a thread that does nothing but check the value of sigint_received (maybe sleep for 1 second between checks) and do your shutdown processing once it becomes true. You can also check for any other exit conditions that you want in that thread. And then remember to join that thread before exit'ing main. Or if possible, just do all that work in main itself, if main doesn't do any blocking operations. Those are all operations that are async-safe from within the signal handler, and then you synchronize it against the main thread with that check and the join.

@vjpai
Copy link
Member

vjpai commented May 11, 2021

Note that if directly calling shutdown from signal handler ever worked, it was mere coincidence. Even libc doesn't have async-signal-safe except for a fraction of functions, and we haven't made that guarantee for any of our functions.

@orzel
Copy link

orzel commented Apr 9, 2022

For the record, here is what I do, and it works with recent grpc (1.43.0).

bool shutdown_required = false;
void handle_signal(int sig)
{
    std::cout << "Got signal: " << strsignal(sig) << std::endl;
    shutdown_required = true;
}
void thread_check_shutdown(void)
{
    while (!shutdown_required) sleep(1);
    server->Shutdown();
}

int main(int, char *[])
{
    signal(SIGINT, handle_signal); // Interrupt from keyboard ^C
    signal(SIGQUIT, handle_signal); // Dump core
    signal(SIGTERM, handle_signal); // 'termination signal'... ??
    std::thread t(thread_check_shutdown);
    RunServer();
    t.join();
    return 0;
}

@HasanMdArik
Copy link

HasanMdArik commented Oct 9, 2022

For the record, here is what I do, and it works with recent grpc (1.43.0).

bool shutdown_required = false;
void handle_signal(int sig)
{
    std::cout << "Got signal: " << strsignal(sig) << std::endl;
    shutdown_required = true;
}
void thread_check_shutdown(void)
{
    while (!shutdown_required) sleep(1);
    server->Shutdown();
}

int main(int, char *[])
{
    signal(SIGINT, handle_signal); // Interrupt from keyboard ^C
    signal(SIGQUIT, handle_signal); // Dump core
    signal(SIGTERM, handle_signal); // 'termination signal'... ??
    std::thread t(thread_check_shutdown);
    RunServer();
    t.join();
    return 0;
}

Your handler for SIGINT should just set some global variable, let's call it sigint_received . When you enter main, spawn a thread that does nothing but check the value of sigint_received (maybe sleep for 1 second between checks) and do your shutdown processing once it becomes true. You can also check for any other exit conditions that you want in that thread. And then remember to join that thread before exit'ing main. Or if possible, just do all that work in main itself, if main doesn't do any blocking operations. Those are all operations that are async-safe from within the signal handler, and then you synchronize it against the main thread with that check and the join.

@vjpai @orzel Running a sort of infinite while loop seems very inefficient. The task can be easily achieved by conditional variables. Here is my more efficient approach...

bool shutdown_required = false;
std::mutex mutex;
std::condition_variable cv;

void handleSignal(int sig)
{
    std::cout << "Got signal: " << strsignal(sig) << std::endl;
    shutdown_required = true;
    cv.notify_one();
}
void shutdownCheckingThread(void)
{
    std::unique_lock<std::mutex> lock(mutex);
    cv.wait(lock, []()
            { return shutdown_required; });
    server->Shutdown();
}

int main(int, char *[])
{
    signal(SIGINT, handleSignal); 
    signal(SIGQUIT, handleSignal); 
    signal(SIGTERM, handleSignal);
    std::thread t(shutdownCheckingThread);
    RunServer();
    t.join();
    return 0;
}

@orzel
Copy link

orzel commented Oct 11, 2022

@vjpai @orzel Running a sort of infinite while loop seems very inefficient. The task can be easily achieved by conditional variables. Here is my more efficient approach...

Hi @MohammadArik, thanks for your input.

I'm not familiar with those conditional variables. But according to cppreference, you should also acquire a mutex in handle_signal before modifying the variable. Which makes sense, your mutex was used in only one place, quite useless, isn't it ?

So i have this now, and it kinda seems to work (just did a few quick tests):

void handle_signal(int sig)
{
    std::cout << "Got signal: " << strsignal(sig) << std::endl;
    const std::lock_guard lock(mutex);
    shutdown_required = true;
    cv.notify_one();
}

(and you can remove the <std::mutex> in shutdownCheckingThread.)

@HasanMdArik
Copy link

HasanMdArik commented Oct 12, 2022

@vjpai @orzel Running a sort of infinite while loop seems very inefficient. The task can be easily achieved by conditional variables. Here is my more efficient approach...

Hi @MohammadArik, thanks for your input.

I'm not familiar with those conditional variables. But according to cppreference, you should also acquire a mutex in handle_signal before modifying the variable. Which makes sense, your mutex was used in only one place, quite useless, isn't it ?

So i have this now, and it kinda seems to work (just did a few quick tests):

void handle_signal(int sig)
{
    std::cout << "Got signal: " << strsignal(sig) << std::endl;
    const std::lock_guard lock(mutex);
    shutdown_required = true;
    cv.notify_one();
}

(and you can remove the <std::mutex> in shutdownCheckingThread.)

Hello @orzel! Nice to see your response.

It's actually not necessary as it could have be done without mutex. There is no scope of data-race. Using the mutex just to use condition_variable as the wait method requires a unique_lock. So not required...

@orzel
Copy link

orzel commented Oct 13, 2022

It's actually not necessary as it could have be done without mutex. There is no scope of data-race. Using the mutex just to use condition_variable as the wait method requires a unique_lock. So not required...

Oh, you think ? That's because shutdown_required can only be modified in one place, and only once, anyway ?

@vjpai
Copy link
Member

vjpai commented Oct 13, 2022

It's actually not necessary as it could have be done without mutex. There is no scope of data-race. Using the mutex just to use condition_variable as the wait method requires a unique_lock. So not required...

Oh, you think ? That's because shutdown_required can only be modified in one place, and only once, anyway ?

I haven't responded to this thread despite being tagged since I'm not in gRPC anymore, but I feel strongly about data races so I wanted to put this out there. The suggestion about dropping the mutex is not correct because the resulting code snippet has a race on it between the write of shutdown_required in your mutex-less handle_signal code, and the read in the CV predicate (even though that has a lock). In general, any read and write of a non-atomic variable that could happen in different threads must be protected under the same lock. But also recall that mutex is not async-signal safe in general.

@vjpai
Copy link
Member

vjpai commented Oct 13, 2022

And, to clarify that I'm not just in the doom and gloom business, you can solve some of this using atomic. In most systems, atomic<bool> is mutex-free (I've never seen any to the contrary) and you can static_assert that in your code so that any system to the contrary will fail to build. Or if you're on C++20, you can wait on an atomic_flag which is always mutex-free (that class exists in C++11 too but not with wait functionality).

@HasanMdArik
Copy link

And, to clarify that I'm not just in the doom and gloom business, you can solve some of this using atomic. In most systems, atomic<bool> is mutex-free (I've never seen any to the contrary) and you can static_assert that in your code so that any system to the contrary will fail to build. Or if you're on C++20, you can wait on an atomic_flag which is always mutex-free (that class exists in C++11 too but not with wait functionality).

Yes the atomic var. with wait is better. I missed it. Thanks @vjpai

@fboranek
Copy link

Well, my workaround is

std::signal(SIGINT, [](int) {
    std::async(std::launch::async, [] {
        if (_server) _server->Shutdown();
    }).get();
});
_server->Wait();

not pretty, but less code. Call Shutdown() from new thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants