Race condition in stats.c ? #22

Closed
Matthias78 opened this Issue Jul 9, 2012 · 8 comments

Projects

None yet

2 participants

@Matthias78

Does any one has seen crashes of 2.3.2 KH31 and 2.3.3 KH1 similar like:

KH31:

Thread 1 (Thread 10096):
#0  signal_segv (signum=11, info=0x7f67e3b0c970, ptr=0x7f67e3b0c840) at ./Common/SIGSEGVHandler.cpp:111
#1  <signal handler called>
#2  0x00007f67e17068fa in pthread_rwlock_wrlock () from /lib/libpthread.so.0
#3  0x000000000042d8cd in thread_rwlock_wlock_c (rwlock=0x40, line=1171, file=0x4390b8 "avl.c") at thread.c:542
#4  0x0000000000415652 in stats_lock (handle=140082310717968, mount=0x1 <Address 0x1 out of bounds>) at stats.c:1304
#5  0x0000000000412d1f in update_source_stats (source=0x7f675d5b1980) at source.c:366
#6  0x00000000004143c8 in source_read (source=0x7f675d5b1980) at source.c:438
#7  0x00000000004181cb in worker (arg=) at client.c:529
#8  0x000000000042dfb9 in _start_routine (arg=0x16bcb30) at thread.c:660
#9  0x00007f67e17028ba in start_thread () from /lib/libpthread.so.0
#10 0x00007f67e19e702d in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

2.3.3 KH1:

Thread 1 (Thread 8295):
#0  signal_segv (signum=11, info=0x7f8dd7fe6930, ptr=0x7f8dd7fe6800) at ./Common/SIGSEGVHandler.cpp:111
#1  <signal handler called>
#2  0x00007f8ddcb278fa in pthread_rwlock_wrlock () from /lib/libpthread.so.0
#3  0x00000000004302fd in thread_rwlock_wlock_c (rwlock=0x2e, line=1171, file=0x43c348 "avl.c") at thread.c:546
#4  0x0000000000416762 in stats_lock (handle=154346384, mount=0x1 <Address 0x1 out of bounds>) at stats.c:1324
#5  0x0000000000413b54 in update_source_stats (source=0x7f8dd003cea0) at source.c:360
#6  0x00000000004153b0 in source_read (source=0x7f8dd003cea0) at source.c:444
#7  0x000000000040ec98 in relay_read (client=0x938d9b0) at slave.c:1306
#8  0x000000000041945b in worker (arg=) at client.c:557
#9  0x00000000004309e9 in _start_routine (arg=0x25f8fa0) at thread.c:664
#10 0x00007f8ddcb238ba in start_thread () from /lib/libpthread.so.0
#11 0x00007f8ddce0802d in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()

AFAICS there is a problem / race condition in function

long stats_lock (long handle, const char *mount)

while calling

avl_tree_wlock.

This is a sporadic error. Hard to reproduce.
But this happend last time, as I tried to restart the server process, while thousands of clients tried to connect to this slave (relay) server. So it was "during initialisation / connecting sources from master server". Maybe this helps?

Matthias

@karlheyes
Owner

It will be the stats handle that is incorrect or more specifically, invalid. I did fix one case pre 33-kh1. So the interesting case will be the kh1 report. It's obviously a relay. If you can confirm that the relay in question was being restarted (without a delay) then I might know the case that leads to the situation.

karl.

@Matthias78

Hi Karl,

yes, the dump is from a relay server.
There is no mount in .xml - only a "master-server" entry.

And yes. The icecast process was immediately restarted after it crashed.

Matthias

@karlheyes
Owner

I meant was the relay in the process of being restarted when the crash occurred, eg in the error log do you see

INFO ... standing by to restart relay on /....

just before the crash occurs.

karl.

@Matthias78

sorry, can't say.
I've to adjust the logging level. In the moment it's set to "WARN".
I'll change this to "INFO"..

Matthias

@Matthias78

Hi Karl,

FYI: I recompiled the server using the latest "stats.c, stats.h, slave.c" and put this binary in production.
Let's see what happens...

Matthias

@Matthias78

Hi Karl,

as it seems there is still a problem.
Even with the latest "stats.c, stats.h, slave.c".
Here are two new crashes which happend short after (re-)starting the server process while it was under load (around3-4k listeners):

Thread 1 (Thread 25281):
#0  signal_segv (signum=11, info=0x0, ptr=<value optimized out>) at ./Common/SIGSEGVHandler.cpp:128
#1  <signal handler called>
#2  0x000000000041619a in _find_node (stats_tree=0x7f46181eea80, name=0x4377a6 "outgoing_kbitrate") at stats.c:378
#3  0x00000000004168eb in process_source_stat (src_stats=0x7f46183f41b0, event=0x7f462f9c67a0) at stats.c:510
#4  0x00000000004171ef in stats_set (handle=139939029117568, name=<value optimized out>, value=0x7f462f9c67f0 "802") at stats.c:1368
#5  0x000000000041733e in stats_set_args (handle=139939031237040, name=0x4377a6 "outgoing_kbitrate", format=<value optimized out>) at stats.c:1392
#6  0x0000000000413c37 in update_source_stats (source=0x7f46181ee740) at source.c:362
#7  0x0000000000415440 in source_read (source=0x7f46181ee740) at source.c:444
#8  0x000000000040ed28 in relay_read (client=0x7f4618008560) at slave.c:1305
#9  0x000000000041959b in worker (arg=<value optimized out>) at client.c:557
#10 0x0000000000430b29 in _start_routine (arg=0xeb2270) at thread.c:664
#11 0x00007f462d5ea8ba in start_thread () from /lib/libpthread.so.0
#12 0x00007f462d8cf02d in clone () from /lib/libc.so.6
#13 0x0000000000000000 in ?? ()


Last Log-Lines:

[2012-07-15  13:06:35] DBUG stats/modify_node_event update "global" stream_kbytes_sent (29523899)
[2012-07-15  13:06:35] DBUG stats/modify_node_event update "global" stream_kbytes_read (2109960)
[2012-07-15  13:06:35] DBUG client/client_send_bytes Client connection died


-------------------------------------


Thread 1 (Thread 26162):
#0  signal_segv (signum=11, info=<value optimized out>, ptr=<value optimized out>) at ./Common/SIGSEGVHandler.cpp:111
#1  <signal handler called>
#2  0x00007f7413ff68fa in pthread_rwlock_wrlock () from /lib/libpthread.so.0
#3  0x000000000043043d in thread_rwlock_wlock_c (rwlock=0xdf, line=1171, file=0x43c4c8 "avl.c") at thread.c:546
#4  0x00000000004168a2 in stats_lock (handle=140136162217120, mount=0x1 <Address 0x1 out of bounds>) at stats.c:1324
#5  0x0000000000413be4 in update_source_stats (source=0x7f73f4038530) at source.c:360
#6  0x0000000000415440 in source_read (source=0x7f73f4038530) at source.c:444
#7  0x000000000040ed28 in relay_read (client=0x7f73fe421ed0) at slave.c:1305
#8  0x000000000041959b in worker (arg=<value optimized out>) at client.c:557
#9  0x0000000000430b29 in _start_routine (arg=0x15a2270) at thread.c:664
#10 0x00007f7413ff28ba in start_thread () from /lib/libpthread.so.0
#11 0x00007f74142d702d in clone () from /lib/libc.so.6
#12 0x0000000000000000 in ?? ()


Last Log-Lines:

[2012-07-15  13:08:04] DBUG stats/modify_node_event update "global" clients (232)
[2012-07-15  13:08:04] DBUG client/worker 0x15a21b0 now has 232 clients
[2012-07-15  13:08:04] DBUG stats/modify_node_event update "global" clients (233)
[2012-07-15  13:08:04] DBUG stats/modify_node_event update "global" connections (356)
@karlheyes
Owner

well both of these would indicate an invalidated stats handle, the last indicates an invalid source name which is really very odd. It would need access to the core checked against the source tree and the error log to see what transpired.

karl.

@karlheyes karlheyes added a commit that referenced this issue Jul 19, 2012
@karlheyes crash fix, avoid invalidating source stats
issue #22 had a file with same name as stream getting used and the last listener
was triggering a free up of the stats, invalidating one used by source of the same
name that had connected afterwards.

because fallback to file can also cause a name clash with a stream reconnecting,
I've decided to put those stats under a name of "file-/mount", that way there is
no clash and that also avoids being invalidated.
78eded3
@karlheyes
Owner

this should now be fixed. A file of the same name as the mountpoint was being used by a listener and when finished was cleaning up stats under that name, however a source had connected and was running and found that the stats handle it was using had suddenly disappeared.

This particular issue was easily worked around as the stats should only occur in the case of fallback files so is now excluded but another fix was also done involving fallback to files and stats.

karl.

@karlheyes karlheyes closed this Aug 11, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment