jack_client_close hangs when the jack server is periodically restarted #395
Comments
Hi, I can't reproduce this. I use the test program above plus this non-DBus JACK start/stop script:
It seems to run "forever".
I can still reproduce this bug.
Occasionally it takes quite some time before the bug shows up.
I can't test it with DBus ATM. Do you get the same result when removing every pthread-related line, e.g. all the spin locks? It seems to work in practically the same way: I let it run for a long time, without any signs of a halt.
Also: do you get the same result when starting/stopping jackd directly? This could narrow down the possible sources of the issue.
Tested it without anything pthread-related: same result.
The bug still appears.
Interesting find.
Here's a possibly related way to create the deadlock (only tested on a dbus-enabled jackd):

```
while true; do parallel jack_wait ::: -w -w -w -w -w -w -w -w -w || break; done
```

This requires GNU parallel. Run it while the jack server is running. After a couple of seconds one of the jack_wait processes will hang with precisely the same backtraces (IIRC; definitely the second one of the traces is precisely the same).
Happens also in non-dbus-enabled jackd.
Can you tell me how you debug this? I am not used to the parallel tool.
Well, you can just install GNU parallel (it's packaged by every major distribution) and then run the above command. On this Raspberry Pi 4 it took about 15 seconds until one of them got stuck, and in turn parallel got stuck, so no more "server is running" output. Now just head over to another console and attach gdb to the only remaining jack_wait process.
OK, here's a more detailed bt:

```
Attaching to process 17634
[New LWP 17637]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libthread_db.so.1".
0x0000007f9e91a6f0 in __lll_lock_wait () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
(gdb) thread apply all bt

Thread 2 (Thread 0x7f9e48c1c0 (LWP 17637)):
#0  0x0000007f9e917014 in pthread_cond_wait@@GLIBC_2.17 () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#1  0x0000007f9e9ef0fc in Jack::JackPosixProcessSync::Wait (this=this@entry=0x1e288288) at ../posix/JackPosixProcessSync.cpp:81
#2  0x0000007f9e9e6be0 in Jack::JackMessageBuffer::Execute (this=0x1e280050) at ../common/JackMessageBuffer.cpp:104
#3  0x0000007f9e9ee5fc in Jack::JackPosixThread::ThreadHandler (arg=0x1e288268) at ../posix/JackPosixThread.cpp:63
#4  0x0000007f9e910834 in start_thread () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#5  0x0000007f9e795a3c in thread_start () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libc.so.6

Thread 1 (Thread 0x7f9ea42010 (LWP 17634)):
#0  0x0000007f9e91a6f0 in __lll_lock_wait () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#1  0x0000007f9e912df0 in pthread_mutex_lock () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#2  0x0000007f9e9ef8ec in Jack::JackPosixMutex::Lock (this=<optimized out>) at ../posix/JackPosixMutex.cpp:112
#3  0x0000007f9e9d1240 in Jack::JackClient::Close (this=0x1e288490) at ../common/JackClient.cpp:118
#4  0x0000007f9e9f1600 in jack_client_close (ext_client=0x1e288490) at ../common/JackLibAPI.cpp:211
#5  0x0000000000400d90 in main (argc=<optimized out>, argv=<optimized out>) at ../example-clients/wait.c:114
(gdb)
```
Observation: the deadlock happens more quickly with a smaller period size. At 128/2 it happens in about 15 seconds; at 1024/2 never at all (in my experiments). It's also easier to trigger on my Raspberry Pi 4 than on my i7 desktop PC.
Observation: the above report is for version 1.9.14. The current git branch develop behaves differently. This is what happens when I run jack_wait -w often enough (see the GNU parallel usage above):

```
[ogfx@ogfx-dev:~/nixpkgs]$ JACK_NO_AUDIO_RESERVATION=1 jackd -d alsa -d hw:iXR -p 128 -n 2
jackdmp 1.9.14
Copyright 2001-2005 Paul Davis and others.
Copyright 2004-2016 Grame.
Copyright 2016-2019 Filipe Coelho.
jackdmp comes with ABSOLUTELY NO WARRANTY
This is free software, and you are welcome to redistribute it
under certain conditions; see the file COPYING for details
no message buffer overruns
no message buffer overruns
no message buffer overruns
JACK server starting in realtime mode with priority 10
self-connect-mode is "Don't restrict self connect requests"
creating alsa driver ... hw:iXR|hw:iXR|128|2|48000|0|0|nomon|swmeter|-|32bit
configuring for 48000Hz, period = 128 frames (2.7 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
ALSA: poll time out, polled for 3037 usecs, Retrying with a recovery, retry cnt = 1
JackPosixProcessSync::LockedTimedWait error usec = 10664 err = Connection timed out
JackEngine::ClientCloseAux wait error ref = 3
ALSA: poll time out, polled for 3053 usecs, Retrying with a recovery, retry cnt = 2
ALSA: poll time out, polled for 3047 usecs, Retrying with a recovery, retry cnt = 3
ALSA: poll time out, polled for 3052 usecs, Retrying with a recovery, retry cnt = 4
ALSA: poll time out, polled for 3039 usecs, Retrying with a recovery, retry cnt = 5
ALSA: poll time out, polled for 3112 usecs, Reached max retry cnt = 5, Exiting
JackAudioDriver::ProcessAsync: read error, stopping...
JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Driver is not running
Cannot create new client
^CJack main caught signal 2
JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Driver is not running
Cannot create new client
[ogfx@ogfx-dev:~/nixpkgs]$
```

The server starts up fine and is silent after "ALSA: use 2 periods for playback". Once jack_wait hangs, the error messages start coming in. The console with jack_wait looks like this:

```
server is available
server is available
server is available
server is available
Cannot open wait client
JackShmReadWritePtr1::~JackShmReadWritePtr1 - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
jack_client_open() failed, status = 0x21
Cannot read socket fd = 3 err = Connection reset by peer
CheckRes error
Could not read result type = 22
Client name = wait conflits with another running client
Cannot connect to the server
```
Hmm, if I increase poll_timeout_ms in linux/alsa/alsa_driver.c to 2.5f * (driver->period_usecs / 1000.0f), then the deadlocking behaviour from 1.9.14 reappears.
How could a client closing ever cause a poll on an ALSA PCM file descriptor to time out? I also once tried increasing MAX_RETRY_COUNT in alsa_driver.c to 500; it just took longer for the driver to fail... Very weird.
This PR #611, together with this shell script:

```
while true; do for n in {1..20}; do echo "-n $(uuidgen) "; done | parallel jack_wait -w; done
```

fixes the server failure in the ALSA driver, but the deadlock in jack_client_close() still appears after a while.
I was able to reproduce the issue with parallel jack_wait on a non-DBus jackd. Moreover, it appears that the mutex involved is being held by a dead thread:
Note that the mutex owner is thread 15877, but that thread no longer exists.
The issue disappears if you move the The reason being, |
Hi, I'm working on an application which must automatically reconnect to the jack server, but I ran into some trouble when I started to test it. After restarting the jack server several times, my application ended up stuck in jack_client_close.
Here's an example to reproduce the bug:
Here are the stack traces of the two non-responding threads:
Edit:
I tried to add this before jack_client_close:
Same result.