Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jack_client_close is locking when the jack server is periodically restarts #395

Open
Gravechapa opened this issue Nov 14, 2018 · 19 comments
Labels
dbus DBus related

Comments

@Gravechapa
Copy link

Gravechapa commented Nov 14, 2018

Hi, i'm working on an application which must automatically reconnect to the jack server. But i'm run into some trouble when started to test it. After several times of restarting the jack server, my application ended up stuck at jack_client_close.
Here's example to reproduce the bug:

#include <jack/jack.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

jack_port_t *input_port;
jack_port_t *output_port;
jack_client_t *client;
pthread_spinlock_t spinlock;
int stop = 0;

int process (jack_nframes_t nframes, void *arg) {
    jack_default_audio_sample_t *in, *out;

    in = (jack_default_audio_sample_t *)jack_port_get_buffer (input_port, nframes);
    out = (jack_default_audio_sample_t *)jack_port_get_buffer (output_port, nframes);
    memcpy (out, in,
        sizeof (jack_default_audio_sample_t) * nframes);

    return 0;   
}

void jack_shutdown(void *arg) {
    pthread_spin_lock(&spinlock);
    stop = 1;
    pthread_spin_unlock(&spinlock);
}

void  silent_jack_error_callback(const char *msg) {}

int main() {
    pthread_spin_init(&spinlock, 0);
    jack_set_error_function(silent_jack_error_callback);
    const char *client_name = "simple";
    const char *server_name = NULL;
    jack_options_t options = JackNullOption;
    jack_status_t status;
    while(true){
    client = jack_client_open (client_name, JackNoStartServer, &status, server_name);
    if (client == NULL) {
		fprintf (stderr, "jack_client_open() failed, "
			 "status = 0x%2.0x\n", status);
		if (status & JackServerFailed) {
			fprintf (stderr, "Unable to connect to JACK server\n");
		}
		continue;
	}
    
    jack_on_shutdown(client, jack_shutdown, 0);
    jack_set_process_callback (client, process, 0);

    input_port = jack_port_register (client, "input",
                                    JACK_DEFAULT_AUDIO_TYPE,
                                    JackPortIsInput, 0);
    output_port = jack_port_register (client, "output",
                                    JACK_DEFAULT_AUDIO_TYPE,
                                    JackPortIsOutput, 0);
    
    if (jack_activate (client)) {
            fprintf (stderr, "cannot activate client");
            exit (1);
    }
    while (1) {
        pthread_spin_lock(&spinlock);
        if(stop) {
            pthread_spin_unlock(&spinlock);
            break;
        }
        pthread_spin_unlock(&spinlock);
    }
    fprintf (stderr, "stopping\n");
    jack_client_close(client);
    stop = 0;
    fprintf (stderr, "alive\n");
    }
    return 0;
}
#!/bin/bash 
while :
do 
jack_control start
sleep 1
jack_control stop
sleep 1
done

Here's stack traces of two non responding threads:

1 pthread_cond_wait * *GLIBC_2.3.2                                  0x7ffff73cdafc 
2 Jack::JackPosixProcessSync::Wait     JackPosixProcessSync.cpp 81  0x7ffff7b0350a 
3 Jack::JackMessageBuffer::Execute     JackMessageBuffer.cpp    104 0x7ffff7afc6a9 
4 Jack::JackPosixThread::ThreadHandler JackPosixThread.cpp      59  0x7ffff7b02812 
5 start_thread                                                      0x7ffff73c7a9d 
6 clone                                                             0x7ffff72f7b23 
1 __lll_lock_wait                                   0x7ffff73d0edc 
2 pthread_mutex_lock                                0x7ffff73ca336 
3 Jack::JackPosixMutex::Lock JackPosixMutex.cpp 112 0x7ffff7b03ef2 
4 Jack::JackClient::Close    JackClient.cpp     118 0x7ffff7ae0096 
5 jack_client_close          JackLibAPI.cpp     219 0x7ffff7b0772f 
6 monitor                    jack_input.c       126 0x555555558ede 
7 start_thread                                      0x7ffff73c7a9d 
8 clone                                             0x7ffff72f7b23 

Edit:
I tried to add this before jack_client_close:

 jack_deactivate(client);
 jack_port_unregister(client, input_port);
 jack_port_unregister(client, output_port);

Same result.

@7890
Copy link
Contributor

7890 commented Jan 11, 2019

Hi, I can't reproduce this. I use the test program above plus this non-DBus JACK start/stop script:

#!/bin/bash 
while :
do 
jackd -ddummy & pid=$!
sleep 1
kill -9 $pid
sleep 1
done

It seems to run "forever".

@Gravechapa
Copy link
Author

I still can reproduce this bug.
If it help - I'm using arch linux with jack2-dbus 1.9.12-6
Maybe removing delays will increase chances of catching the bug:

#!/bin/bash 
STOP=0
trap STOP=1 SIGINT
while [ $STOP -eq 0 ]
do 
jack_control start
jack_control stop
done

Occasionally it requires quite some time before this bug is reproduced.

@Gravechapa
Copy link
Author

@7890
Copy link
Contributor

7890 commented Jan 12, 2019

I can't test it with DBus ATM. Do you get the same result when removing every pthread related line? Eg. remove all spin locks. It seems to work in practically the same way. I let it run for a long time, without signs of a halt.

@7890
Copy link
Contributor

7890 commented Jan 12, 2019

Also: do you get the same result when starting/stopping jackd directly? This could reduce the possible sources of the issue.

@Gravechapa
Copy link
Author

Tested it without anything related to pthread - same result.
Tried direct use of the DBus functions:

while :
do 
qdbus org.jackaudio.service /org/jackaudio/Controller StartServer
sleep 1
qdbus org.jackaudio.service /org/jackaudio/Controller StopServer
sleep 1
done

The bug still appears.
But with non-DBus jack using your script I can't catch this bug.
It seems like only DBus version is affected.

@7890
Copy link
Contributor

7890 commented Jan 13, 2019

Interesting find

@7890 7890 added the dbus DBus related label Jan 15, 2019
@fps
Copy link
Contributor

fps commented Sep 19, 2019

Here's a possibly related way to create the deadlock (only tested on dbus-enabled jackd):

while true; do parallel jack_wait ::: -w -w -w -w -w -w -w -w -w || break; done

This requires gnu parallel. Run this while the jack server is running. After a couple of seconds one of the jack_wait processes will hang with precisely the same backtraces (IIRC. Definitly the second one of the traces is precisely the same).

@fps
Copy link
Contributor

fps commented Sep 19, 2019

Happens also in non-dbus-enabled jackd.

@falkTX
Copy link
Member

falkTX commented Sep 19, 2019

can you tell me how you debug this? I am not used with the parallel tool

@fps
Copy link
Contributor

fps commented Jul 6, 2020

Well, you can just install gnu parallel (it's packaged on every major distribution) and then just run the above command. On this raspberry pi 4 it just took about 15 seconds until one of them got stuck, and in turn parallel getting stuck, so no more "server is running" output.

Now just head over to another console and attach gdb to the only remaining jack_wait process.

@fps
Copy link
Contributor

fps commented Jul 6, 2020

ok, here's a more detailed bt:

Attaching to process 17634
[New LWP 17637]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libthread_db.so.1".
0x0000007f9e91a6f0 in __lll_lock_wait () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
(gdb) thread apply all bt

Thread 2 (Thread 0x7f9e48c1c0 (LWP 17637)):
#0  0x0000007f9e917014 in pthread_cond_wait@@GLIBC_2.17 () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#1  0x0000007f9e9ef0fc in Jack::JackPosixProcessSync::Wait (this=this@entry=0x1e288288) at ../posix/JackPosixProcessSync.cpp:81
#2  0x0000007f9e9e6be0 in Jack::JackMessageBuffer::Execute (this=0x1e280050) at ../common/JackMessageBuffer.cpp:104
#3  0x0000007f9e9ee5fc in Jack::JackPosixThread::ThreadHandler (arg=0x1e288268) at ../posix/JackPosixThread.cpp:63
#4  0x0000007f9e910834 in start_thread () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#5  0x0000007f9e795a3c in thread_start () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libc.so.6

Thread 1 (Thread 0x7f9ea42010 (LWP 17634)):
#0  0x0000007f9e91a6f0 in __lll_lock_wait () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#1  0x0000007f9e912df0 in pthread_mutex_lock () from /nix/store/akyy80zkwyiy0n51kc4vx0qpxma77701-glibc-2.30/lib/libpthread.so.0
#2  0x0000007f9e9ef8ec in Jack::JackPosixMutex::Lock (this=) at ../posix/JackPosixMutex.cpp:112
#3  0x0000007f9e9d1240 in Jack::JackClient::Close (this=0x1e288490) at ../common/JackClient.cpp:118
#4  0x0000007f9e9f1600 in jack_client_close (ext_client=0x1e288490) at ../common/JackLibAPI.cpp:211
#5  0x0000000000400d90 in main (argc=, argv=) at ../example-clients/wait.c:114
(gdb) 

@fps
Copy link
Contributor

fps commented Jul 7, 2020

Observation: The deadlock happens more quickly when using a smaller period size: At 128/2 it happens in about 15 seconds. At 1024/2 never at all (in my experiments).It's easier to trigger on my raspberry pi 4 than on my i7 desktop PC.

@fps
Copy link
Contributor

fps commented Jul 7, 2020

Observation: Above report is for version 1.9.14. Current git branch develop behaves differently: This is what happens when i run jack_wait -w often enough (see above gnu parallel usage):

[ogfx@ogfx-dev:~/nixpkgs]$ JACK_NO_AUDIO_RESERVATION=1 jackd -d alsa -d hw:iXR -p 128 -n 2
jackdmp 1.9.14
Copyright 2001-2005 Paul Davis and others.
Copyright 2004-2016 Grame.
Copyright 2016-2019 Filipe Coelho.
jackdmp comes with ABSOLUTELY NO WARRANTY
This is free software, and you are welcome to redistribute it
under certain conditions; see the file COPYING for details
no message buffer overruns
no message buffer overruns
no message buffer overruns
JACK server starting in realtime mode with priority 10
self-connect-mode is "Don't restrict self connect requests"
creating alsa driver ... hw:iXR|hw:iXR|128|2|48000|0|0|nomon|swmeter|-|32bit
configuring for 48000Hz, period = 128 frames (2.7 ms), buffer = 2 periods
ALSA: final selected sample format for capture: 32bit integer little-endian
ALSA: use 2 periods for capture
ALSA: final selected sample format for playback: 32bit integer little-endian
ALSA: use 2 periods for playback
ALSA: poll time out, polled for 3037 usecs, Retrying with a recovery, retry cnt = 1
JackPosixProcessSync::LockedTimedWait error usec = 10664 err = Connection timed out
JackEngine::ClientCloseAux wait error ref = 3
ALSA: poll time out, polled for 3053 usecs, Retrying with a recovery, retry cnt = 2
ALSA: poll time out, polled for 3047 usecs, Retrying with a recovery, retry cnt = 3
ALSA: poll time out, polled for 3052 usecs, Retrying with a recovery, retry cnt = 4
ALSA: poll time out, polled for 3039 usecs, Retrying with a recovery, retry cnt = 5
ALSA: poll time out, polled for 3112 usecs, Reached max retry cnt = 5, Exiting
JackAudioDriver::ProcessAsync: read error, stopping...
JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Driver is not running
Cannot create new client
^CJack main caught signal 2
JackPosixProcessSync::LockedTimedWait error usec = 5000000 err = Connection timed out
Driver is not running
Cannot create new client

[ogfx@ogfx-dev:~/nixpkgs]$ ^C

The server starts up fine and is silent after

ALSA: use 2 periods for playback

Once jack_wait hangs the error messages start coming in.. The console with jack_wait looks like this:

server is available                                                                        
server is available                          
server is available
server is available                          
Cannot open wait client                                                                    
JackShmReadWritePtr1::~JackShmReadWritePtr1 - Init not done for -1, skipping unlock        
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock          
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock                                                                                                     
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock                                                                                             [0/1801]
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock          
jack_client_open() failed, status = 0x21                                                   
Cannot read socket fd = 3 err = Connection reset by peer                                   
CheckRes error                               
Could not read result type = 22                                                            
Client name = wait conflits with another running client                                    
Cannot connect to the server                                                               

@fps
Copy link
Contributor

fps commented Jul 7, 2020

Hmm, if I increase the poll_timeout_ms in linux/alsa/alsa_driver.c to 2.5f * (driver->period_usecs / 1000.0f) then the deadlocking behaviour from 1.9.14 reappears.

@fps
Copy link
Contributor

fps commented Jul 7, 2020

How could a client closing ever cause a poll on an alsa pcm filedescriptor to timeout. I also once tried increasing MAX_RETRY_COUNT in alsa_driver.c to 500. It just took longer for the driver to fail... Very weird.

@fps
Copy link
Contributor

fps commented Jul 9, 2020

This PR #611 and using this shell script:

while true; do for n in {1..20}; do echo "-n $(uuidgen) "; done | parallel jack_wait -w; done

fixes the server fail in the alsa driver, but the deadlock on jack_client_close() still appears after a while..

@jeberger
Copy link

I was able to reproduce the issue with parallel jack_wait on an non-DBUS jackd. Moreover, it appears that the mutex involved is being held by a dead thread:

(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xf767215c in __lll_lock_wait (futex=0x26f34, private=0) at lowlevellock.c:43
#2  0xf766d522 in __GI___pthread_mutex_lock (mutex=0x26f34) at pthread_mutex_lock.c:115
#3  0xf76a8aa8 in Jack::JackPosixMutex::Lock() () from ./staging/usr/lib/libjack.so.0
#4  0xf7695414 in Jack::JackClient::Close() () from ./staging/usr/lib/libjack.so.0
#5  0xf76aaa54 in jack_client_close () from ./staging/usr/lib/libjack.so.0
#6  0x000109a4 in main ()
(gdb) frame 2
#2  0xf766d522 in __GI___pthread_mutex_lock (mutex=0x26f34) at pthread_mutex_lock.c:115
115     pthread_mutex_lock.c: No such file or directory.
(gdb) p *mutex
$1 = {__data = {__lock = 2, __count = 1, __owner = 15877, __kind = 1, __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, __size = "\002\
000\000\000\001\000\000\000\005>\000\000\001\000\000\000\001\000\000\000\000\000\000", __align = 2}

Note that the mutex owner is thread 15877, but:

(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 15872.15872 "jack_wait" 0xf766d522 in __GI___pthread_mutex_lock (mutex=0x26f34) at pthread_mutex_lock.c:115
  2    Thread 15872.15875 "jack_wait" __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46

@jeberger
Copy link

jeberger commented Sep 17, 2020

The issue disappears if you move the JackGlobals::fSynchroMutex->Lock(); line before fChannel->Stop() in JackClient::Close.

The reason being, fChannel->Stop() kills the notifications thread, which may leave the mutex dangling if it was locked at that time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbus DBus related
Projects
None yet
Development

No branches or pull requests

5 participants