Kamailio freezes with DMQ and dialog #2535

Closed

pwakano opened this issue Oct 28, 2020 · 19 comments

pwakano commented Oct 28, 2020

Description

When running a load test, Kamailio eventually becomes unresponsive and stops processing calls.
Kamailio is configured to use DMQ replication for dialog and usrloc; the dialog keepalive is also enabled.

Troubleshooting

From the investigation, the problem appears faster and more easily when there is some network degradation causing packet loss and/or retransmissions, but even without any noticeable network issue the freeze eventually happens.

Reproduction

Run a simple load test making calls at a rate of ~5 cps and keep around ~2000 calls connected at all times. A higher cps seems to make the problem easier to reproduce.
Adding network degradation to the environment makes the problem happen, but when running a tool such as SIPp for the load test, retransmissions can be forced by simply killing the SIPp instance receiving the calls, which then forces Kamailio to retransmit.

Debugging Data

Output of kamctl trap:
gdb_kamailio_20201028_213030.txt

Log Messages

Locally generated requests show up in the log, but are not sent on the network.

Possible Solutions

None found so far.

Additional Information

  • Kamailio Version - output of kamailio -v
# kamailio -v
version: kamailio 5.4.2 (x86_64/linux) c3b91f
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: c3b91f 
compiled on 13:50:37 Oct 27 2020 with gcc 4.8.5
  • Operating System:
# cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
# uname -a
Linux hostname 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
henningw (Contributor) commented

Thanks for reproducing it with a recent version and opening the issue.

pwakano (Author) commented Oct 29, 2020

Hello, sorry for being pushy, but I am just hoping someone could acknowledge the problem, or tell me if some configuration on my side is causing the issue. This is a major problem, worse than a crash: a manual intervention with kill -9 is necessary to restore the service.

henningw added the bug label Oct 30, 2020
henningw (Contributor) commented Oct 30, 2020

There is not much configuration needed regarding DMQ, so my guess is it is either something related to the particular environment or a bug in the code. Maybe @charlesrchance can comment on it.

pwakano (Author) commented Nov 1, 2020

Thanks henningw! I was just wondering whether some configuration in DMQ or dialog could be causing the issue, because it looks like others in the community are using DMQ + dialog without problems.

miconda (Member) commented Nov 2, 2020

Can you try a couple of variants to isolate the cause? Try each one independently of the others.

  1. use modparam("dmq", "num_workers", 1)

  2. use modparam("dmq", "worker_usleep", 1000)

  3. use modparam("dialog", "track_cseq_updates", 0) - this may break authentication done with uac module, but I understood you test and can reproduce it in the lab

henningw (Contributor) commented Nov 2, 2020

Just to copy from the mailing list discussion: the use of the vSphere Fault Tolerance feature is linked as a source of network latency and probably CPU allocation latency. This increases the chances of the mentioned issue happening.

pwakano (Author) commented Nov 3, 2020

The investigation done so far shows that two processes entered a deadlock: a UDP worker and the timer process.
The timer process was trying to delete a dialog reference; it got the dialog lock (the d_table->entry lock) and then tried to get the DMQ node list lock.
On the other hand, the UDP worker was trying to broadcast a DMQ message; it got the DMQ node list lock and then tried to get the d_table entry lock for the same dialog as the timer process.
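
For illustration, the following is a minimal, self-contained C sketch (not Kamailio code) of the ABBA lock-ordering pattern described above. It uses plain pthread mutexes within one process instead of Kamailio's shared-memory futex locks across processes; the names dialog_entry_lock and dmq_node_list_lock are stand-ins for the d_table entry lock and dmq_node_list->lock seen in the trap output below.

/* ABBA deadlock sketch - two threads take the same two locks in opposite
 * order, mirroring the timer process vs. UDP worker case described above. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t dialog_entry_lock = PTHREAD_MUTEX_INITIALIZER;  /* stand-in for d_table->entry lock */
static pthread_mutex_t dmq_node_list_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for dmq_node_list->lock */

/* "timer process" path: the dialog destroy holds the dialog lock, then the
 * DMQ broadcast wants the DMQ node list lock */
static void *timer_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&dialog_entry_lock);
	sleep(1);                                /* widen the race window */
	pthread_mutex_lock(&dmq_node_list_lock); /* blocks here forever */
	pthread_mutex_unlock(&dmq_node_list_lock);
	pthread_mutex_unlock(&dialog_entry_lock);
	return NULL;
}

/* "UDP worker" path: the DMQ broadcast holds the DMQ node list lock, then the
 * outgoing-message hook (cseq tracking -> dialog lookup) wants the dialog
 * lock for the same entry */
static void *udp_worker_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&dmq_node_list_lock);
	sleep(1);
	pthread_mutex_lock(&dialog_entry_lock);  /* blocks here forever */
	pthread_mutex_unlock(&dialog_entry_lock);
	pthread_mutex_unlock(&dmq_node_list_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;
	pthread_create(&t1, NULL, timer_path, NULL);
	pthread_create(&t2, NULL, udp_worker_path, NULL);
	pthread_join(t1, NULL); /* never returns: both threads are stuck */
	pthread_join(t2, NULL);
	printf("not reached\n");
	return 0;
}

The raw trap output follows: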

N	21068	main process - attendant
Y	21247	udp receiver child=0 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21248	udp receiver child=1 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21249	udp receiver child=2 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21250	udp receiver child=3 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21251	udp receiver child=4 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21252	udp receiver child=5 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f03675bdc48
Y	21253	udp receiver child=6 sock=124.47.168.242:5060   TRYING TO GET LOCK  0x7f036822441c		HAS LOCK 0x7f03675bdc48
Y	21254	udp receiver child=7 sock=124.47.168.242:5060
Y	21255	udp receiver child=0 sock=10.51.1.72:5060
Y	21257	udp receiver child=1 sock=10.51.1.72:5060
Y	21258	udp receiver child=2 sock=10.51.1.72:5060
Y	21260	udp receiver child=3 sock=10.51.1.72:5060
Y	21261	udp receiver child=4 sock=10.51.1.72:5060
Y	21262	udp receiver child=5 sock=10.51.1.72:5060
Y	21263	udp receiver child=6 sock=10.51.1.72:5060
Y	21264	udp receiver child=7 sock=10.51.1.72:5060
N	21265	slow timer
Y	21266	timer   TRYING TO GET LOCK  0x7f03675bdc48		HAS LOCK 0x7f036822441c
N	21267	secondary timer
N	21268	ctl handler
N	21269	JSONRPCS FIFO
Y	21270	DMQ WORKER  TRYING TO GET LOCK  0x7f03675bdcd4
Y	21271	DMQ WORKER  TRYING TO GET LOCK  0x7f03675bdcec
Y	21273	Dialog KA Timer
Y	21275	Dialog Clean Timer TRYING TO GET LOCK   0x7f036822441c
N	21278	WEBSOCKET KEEPALIVE
N	21279	WEBSOCKET TIMER
N	21281	TIMER NH
N	21282	TIMER NH
Y	21283	TIMER UAC REG
N	21284	tcp receiver (generic) child=0
N	21285	tcp receiver (generic) child=1
N	21286	tcp receiver (generic) child=2
N	21287	tcp receiver (generic) child=3
N	21289	tcp receiver (generic) child=4
N	21291	tcp receiver (generic) child=5
N	21292	tcp receiver (generic) child=6
N	21293	tcp receiver (generic) child=7
N	21294	tcp main process

21266 # timer BT:

#0  0x00007f0386d30bf9 in syscall () from /lib64/libc.so.6
#1  0x00007f035d4effa9 in futex_get (lock=0x7f03675bdc48) at ../../core/futexlock.h:121 <- TRYING TO GET LOCK 0x7f03675bdc48
#2  0x00007f035d4f205b in bcast_dmq_message1 (peer=0x7f0368231ff0, body=0x7ffe95bc0920, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>, incl_inactive=0)
    at dmq_funcs.c:156
#3  0x00007f035d4f28d4 in bcast_dmq_message (peer=0x7f0368231ff0, body=0x7ffe95bc0920, except=0x0, resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#4  0x00007f035c96704d in dlg_dmq_send (body=0x7ffe95bc0920, node=0x0) at dlg_dmq.c:88
#5  0x00007f035c9709d5 in dlg_dmq_replicate_action (action=DLG_DMQ_RM, dlg=0x7f03684b77f8, needlock=0, node=0x0) at dlg_dmq.c:628
#6  0x00007f035c90a39a in destroy_dlg (dlg=0x7f03684b77f8) at dlg_hash.c:367
#7  0x00007f035c914064 in dlg_unref_helper (dlg=0x7f03684b77f8, cnt=2, fname=0x7f035c9b222d "dlg_handlers.c", fline=1212) at dlg_hash.c:1081 << --- GOT LOCK 0x7f036822441c

		(gdb) p *d_entry
		$2 = {first = 0x7f0369205248, last = 0x7f036855c680, next_id = 30442, lock = {val = 2}, locker_pid = {val = 21266}, rec_lock_level = 0}
	
		(gdb) p dlg->h_entry
		$3 = 2340
	
		(gdb) p &d_entry->lock
		$5 = (gen_lock_t *) 0x7f036822441c


#8  0x00007f035c945bd2 in unref_dlg_from_cb (t=0x7f036936b678, type=131072, param=0x7ffe95bc0e20) at dlg_handlers.c:1212
#9  0x00007f03612d3165 in run_trans_callbacks_internal (cb_lst=0x7f036936b6f0, type=131072, trans=0x7f036936b678, params=0x7ffe95bc0e20) at t_hooks.c:258
#10 0x00007f03612d3297 in run_trans_callbacks (type=131072, trans=0x7f036936b678, req=0x0, rpl=0x0, code=0) at t_hooks.c:285
#11 0x00007f03611f9b8f in free_cell_helper (dead_cell=0x7f036936b678, silent=0, fname=0x7f0361313363 "timer.c", fline=643) at h_table.c:165
#12 0x00007f03612bfc4d in wait_handler (ti=1856451739, wait_tl=0x7f036936b700, data=0x7f036936b678) at timer.c:643
#13 0x00000000004ea094 in timer_list_expire (t=1856451739, h=0x7f036587d380, slow_l=0x7f036587f948, slow_mark=5689) at core/timer.c:857
#14 0x00000000004ea53d in timer_handler () at core/timer.c:922
#15 0x00000000004ea9e7 in timer_main () at core/timer.c:961
#16 0x000000000042a838 in main_loop () at main.c:1753
#17 0x0000000000433a96 in main (argc=10, argv=0x7ffe95bc1968) at main.c:2856


21253 # udp receiver child=6 sock=124.47.168.242:5060 BT:

#0  0x00007f0386d30bf9 in syscall () from /lib64/libc.so.6
#1  0x00007f035c9061a7 in futex_get (lock=0x7f036822441c) at ../../core/futexlock.h:121
#2  0x00007f035c90ee00 in dlg_lookup_mode (h_entry=2340, h_id=30422, lmode=0) at dlg_hash.c:781 <- TRYING TO GET LOCK HERE

		(gdb) p *d_entry
		$1 = {first = 0x7f0369205248, last = 0x7f036855c680, next_id = 30442, lock = {val = 2}, locker_pid = {val = 21266}, 
		  rec_lock_level = 0}

		(gdb) p h_entry
		$2 = 2340

#3  0x00007f035c90fa27 in dlg_get_by_iuid (diuid=0x7f035cbcbd40 <_dlg_ctx+64>) at dlg_hash.c:849
#4  0x00007f035c9abfd1 in dlg_get_ctx_dialog () at dlg_var.c:940
#5  0x00007f035c945c03 in dlg_lookup_msg_dialog (msg=0x7ffe95bbc8c0, dir=0x7ffe95bbc8bc) at dlg_handlers.c:1227
#6  0x00007f035c956f77 in dlg_cseq_msg_sent (evp=0x7ffe95bbd2d0) at dlg_cseq.c:393
#7  0x00000000004ec14b in sr_event_exec (type=2, evp=0x7ffe95bbd2d0) at core/events.c:240
#8  0x00007f03612c1e88 in msg_send_buffer (dst=0x7f036a0a4870, 
    buf=0x7f036bfa1af8 "KDMQ sip:dialog@10.51.1.71:5060 SIP/2.0\r\nVia: SIP/2.0/UDP 10.51.1.72;branch=z9hG4bK6492.a407a601", '0' <repeats 24 times>, ".0\r\nTo: <sip:dialog@10.51.1.71:5060>\r\nFrom: <sip:dialog@10.51.1.72:5060>;tag=cf4"..., len=703, flags=0) at ../../core/forward.h:148
#9  0x00007f03612c743e in send_pr_buffer (rb=0x7f036a0a4820, buf=0x7f036bfa1af8, len=703) at t_funcs.c:69
#10 0x00007f036128d315 in send_prepared_request_impl (request=0x7f036a0a4820, retransmit=1, branch=0) at uac.c:669
#11 0x00007f036128e89e in t_uac_with_ids (uac_r=0x7ffe95bbd8f0, ret_index=0x0, ret_label=0x0) at uac.c:753
#12 0x00007f036128e7ca in t_uac (uac_r=0x7ffe95bbd8f0) at uac.c:721
#13 0x00007f0361291a62 in request (uac_r=0x7ffe95bbd8f0, ruri=0x7ffe95bbd8c0, to=0x7ffe95bbd8c0, 
    from=0x7ffe95bbd8d0, next_hop=0x0) at uac.c:1089
#14 0x00007f035d4f4053 in dmq_send_message (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, node=0x7f0368464068, 
    resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, 
    content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:251
#15 0x00007f035d4f24ff in bcast_dmq_message1 (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, except=0x0, 
    resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, 
    content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>, incl_inactive=0) at dmq_funcs.c:170   <----- GOT LOCK 0x7f03675bdc48

		(gdb) p &dmq_node_list->lock
		$1 = (gen_lock_t *) 0x7f03675bdc48

#16 0x00007f035d4f28d4 in bcast_dmq_message (peer=0x7f0368231ff0, body=0x7ffe95bbdca0, except=0x0, 
    resp_cback=0x7f035cbc9af0 <dlg_dmq_resp_callback>, max_forwards=1, 
    content_type=0x7f035cbc9ab0 <dlg_dmq_content_type>) at dmq_funcs.c:188
#17 0x00007f035c96704d in dlg_dmq_send (body=0x7ffe95bbdca0, node=0x0) at dlg_dmq.c:88
#18 0x00007f035c9709d5 in dlg_dmq_replicate_action (action=DLG_DMQ_STATE, dlg=0x7f0369205248, needlock=0, node=0x0)
    at dlg_dmq.c:628
#19 0x00007f035c94ba78 in dlg_onroute (req=0x7f03860006e0, route_params=0x7ffe95bbe030, param=0x0)
    at dlg_handlers.c:1559
#20 0x00007f0360b665c6 in run_rr_callbacks (req=0x7f03860006e0, rr_param=0x7ffe95bbe120) at rr_cb.c:96
#21 0x00007f0360b799b2 in after_loose (_m=0x7f03860006e0, preloaded=0) at loose.c:984
#22 0x00007f0360b7a0dc in loose_route (_m=0x7f03860006e0) at loose.c:1018
#23 0x00007f0360b7e240 in w_loose_route (msg=0x7f03860006e0, p1=0x0, p2=0x0) at rr_mod.c:276
#24 0x00000000005d0bc5 in do_action (h=0x7ffe95bbecd0, a=0x7f0385daa880, msg=0x7f03860006e0) at core/action.c:1076
#25 0x00000000005ddf78 in run_actions (h=0x7ffe95bbecd0, a=0x7f0385daa880, msg=0x7f03860006e0) at core/action.c:1581
#26 0x00000000005de5f4 in run_actions_safe (h=0x7ffe95bc09f0, a=0x7f0385daa880, msg=0x7f03860006e0)
    at core/action.c:1645
#27 0x0000000000446834 in rval_get_int (h=0x7ffe95bc09f0, msg=0x7f03860006e0, i=0x7ffe95bbf5fc, rv=0x7f0385daaac0, 
    cache=0x0) at core/rvalue.c:915
#28 0x000000000044b365 in rval_expr_eval_int (h=0x7ffe95bc09f0, msg=0x7f03860006e0, res=0x7ffe95bbf5fc, 
    rve=0x7f0385daaab8) at core/rvalue.c:1913
#29 0x00000000005d06a6 in do_action (h=0x7ffe95bc09f0, a=0x7f0385db3280, msg=0x7f03860006e0) at core/action.c:1052
#30 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385da9ee8, msg=0x7f03860006e0) at core/action.c:1581
#31 0x00000000005cd315 in do_action (h=0x7ffe95bc09f0, a=0x7f0385d24ea0, msg=0x7f03860006e0) at core/action.c:700
#32 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385d24c68, msg=0x7f03860006e0) at core/action.c:1581
#33 0x00000000005d0b34 in do_action (h=0x7ffe95bc09f0, a=0x7f0385d3e420, msg=0x7f03860006e0) at core/action.c:1067
#34 0x00000000005ddf78 in run_actions (h=0x7ffe95bc09f0, a=0x7f0385d1ae68, msg=0x7f03860006e0) at core/action.c:1581
#35 0x00000000005de6bc in run_top_route (a=0x7f0385d1ae68, msg=0x7f03860006e0, c=0x0) at core/action.c:1666
#36 0x00000000005e5c40 in receive_msg (
    buf=0xabcb80 <buf.7133> "ACK sip:61283214906@10.51.1.50:5060;alias=10.51.1.50~5060~1 SIP/2.0\r\nVia: SIP/2.0/UDP 124.47.168.253:6080;branch=z9hG4bK-18807-894-4\r\nFrom: test_894 <sip:61289993111@124.47.168.253:6080>;tag=18807SIPp"..., len=724, rcv_info=0x7ffe95bc10e0) at core/receive.c:423
#37 0x00000000004c1442 in udp_rcv_loop () at core/udp_server.c:543
#38 0x0000000000429d47 in main_loop () at main.c:1683
#39 0x0000000000433a96 in main (argc=10, argv=0x7ffe95bc1968) at main.c:2856

Full kamctl trap:
gdb_kamailio_20201102_110450.txt

pwakano (Author) commented Nov 3, 2020

Hi Daniel, thanks for your attention!
Regarding the tests you suggested, we ran each change individually; neither the num_workers nor the worker_usleep change helped and the problem still happened.
The track_cseq_updates change is still under test, but it has already been running for 7 hours without hitting the issue. Usually the freeze was happening within the first 2 hours of the test. Also, considering the previous analysis we have done, we can see the function dlg_cseq_msg_sent being called when the deadlock happened, so I guess this is strong evidence that there is something wrong with this feature + DMQ?
Also, I saw commit a309a0e from yesterday; is that potentially a fix for this problem?

pwakano (Author) commented Nov 3, 2020

Actually, I forgot to mention that track_cseq_updates was set to 0, not 1; as per your comment, 0 is the value that breaks the authentication, and it was already set to 1 in our configuration.

miconda (Member) commented Nov 3, 2020

Indeed, track_cseq_updates had to be set to 0; I forgot to update it after copy&paste from the readme (I updated my comment for future reference).

My commit from yesterday was made because I discovered that cseq update processing was done for dmq requests. It is useless there, but I did not get the time to dig further into whether that can expose any issue.

I suggested disabling it because it was being executed and it is not a common deployment, so I wanted to isolate whether that was somehow leading to the deadlock.

miconda (Member) commented Nov 3, 2020

And indeed, after looking further in this direction, helped also by your troubleshooting findings posted in a comment above, the processing of the cseq update for dmq was leading to the deadlock.

The commit I pushed yesterday (which was backported to the 5.4 and 5.3 branches as well) should fix it. Can you try with it and track_cseq_updates set back to 1?

pwakano (Author) commented Nov 3, 2020

Hi Daniel, thanks for your review, I appreciate it!
We will test that change tomorrow, but isn't it then also possible to hit the deadlock when receiving/processing an INVITE?

miconda (Member) commented Nov 3, 2020

I cannot say that there are no other deadlocks, but this case is not going to happen anymore when processing an INVITE or other requests that are part of the dialog.

By trying to update the cseq in the locally generated DMQ request (which is sent to update the dlg state on the other peers), it ended up in a spiral of accessing both the dialog and the dmq-peers lists.

Adding a bit more: the design is to go bottom-up and back, with dialog processing first, then dmq replication finished, then back. This case was dialog processing, then an unfinished dmq notification, and then dialog processing again.

pwakano (Author) commented Nov 4, 2020

Hi Daniel, we've run the test, but unfortunately the problem happened again after about 5 hours of testing.
The UDP receiver had the DMQ node list lock (0x7f28bd321478) and tried to get the dialog lock (0x7f28bdf8b5b4), while the timer process got the dialog lock (0x7f28bdf8b5b4) and tried to get the DMQ node list lock (0x7f28bd321478)...

Check the backtrace of the processes in deadlock:

1331 udp receiver child=3 sock=45.116.131.61:5060

#0  0x00007f28dca971c9 in syscall () from /lib64/libc.so.6
#1  0x00007f28b269e5cd in futex_get (lock=0x7f28bdf8b5b4) at ../../core/mem/../futexlock.h:121
#2  0x00007f28b26a7226 in dlg_lookup_mode (h_entry=2827, h_id=70761, lmode=0) at dlg_hash.c:781
#3  0x00007f28b26a7e4d in dlg_get_by_iuid (diuid=0x7f28b2958d40 <_dlg_ctx+64>) at dlg_hash.c:849
#4  0x00007f28b27393e0 in dlg_get_ctx_dialog () at dlg_var.c:940
#5  0x00007f28b26c44fb in dlg_lookup_msg_dialog (msg=0x7ffe79df6bd0, dir=0x7ffe79df6bcc) at dlg_handlers.c:1227
#6  0x00007f28b26e4386 in dlg_cseq_msg_sent (evp=0x7ffe79df75e0) at dlg_cseq.c:400
(gdb) p ((struct to_body*)(&msg)->to->parsed)->tag_value.len
$1 = 0
(gdb) p msg.first_line.u.request.method_value
$2 = 16384

(gdb) p *(((struct to_body*)(&msg)->to->parsed))
$9 = {error = 1, body = {
    s = 0x7f28dbbb0714 "<sip:dialog@10.61.187.66:5060>\r\nFrom: <sip:dialog@10.61.187.75:5060>;tag=9bf3c67f7017dfd69402f08ee6e7069c-d6bc8b8a\r\nCSeq: 10 KDMQ\r\nCall-ID: 4a4bd13913afec15-1331@45.116.131.61\r\nContent-Length: 310\r\nUs"..., len = 32}, uri = {
    s = 0x7f28dbbb0715 "sip:dialog@10.61.187.66:5060>\r\nFrom: <sip:dialog@10.61.187.75:5060>;tag=9bf3c67f7017dfd69402f08ee6e7069c-d6bc8b8a\r\nCSeq: 10 KDMQ\r\nCall-ID: 4a4bd13913afec15-1331@45.116.131.61\r\nContent-Length: 310\r\nUse"..., len = 28}, display = {s = 0x0, len = 0}, 
  tag_value = {s = 0x0, len = 0}, parsed_uri = {user = {s = 0x0, len = 0}, passwd = {s = 0x0, len = 0}, host = {s = 0x0, len = 0}, port = {
      s = 0x0, len = 0}, params = {s = 0x0, len = 0}, sip_params = {s = 0x0, len = 0}, headers = {s = 0x0, len = 0}, port_no = 0, 
    proto = 0, type = ERROR_URI_T, flags = (unknown: 0), transport = {s = 0x0, len = 0}, ttl = {s = 0x0, len = 0}, user_param = {s = 0x0, 
      len = 0}, maddr = {s = 0x0, len = 0}, method = {s = 0x0, len = 0}, lr = {s = 0x0, len = 0}, r2 = {s = 0x0, len = 0}, gr = {s = 0x0, 
      len = 0}, transport_val = {s = 0x0, len = 0}, ttl_val = {s = 0x0, len = 0}, user_param_val = {s = 0x0, len = 0}, maddr_val = {
      s = 0x0, len = 0}, method_val = {s = 0x0, len = 0}, lr_val = {s = 0x0, len = 0}, r2_val = {s = 0x0, len = 0}, gr_val = {s = 0x0, 
      len = 0}}, param_lst = 0x0, last_param = 0x0}

#7  0x0000000000503088 in sr_event_exec (type=2, evp=0x7ffe79df75e0) at core/events.c:240
#8  0x00007f28b7035a43 in msg_send_buffer (dst=0x7f28bf3a8c28, 
    buf=0x7f28bee53140 "KDMQ sip:dialog@10.61.187.66:5060 SIP/2.0\r\nVia: SIP/2.0/UDP 10.61.187.75;branch=z9hG4bK4646.29f4a7c1", '0' <repeats 24 times>, ".0\r\nTo: <sip:dialog@10.61.187.66:5060>\r\nFrom: <sip:dialog@10.61.187.75:5060>"..., len=716, flags=0)
    at ../../core/forward.h:148
#9  0x00007f28b703aff9 in send_pr_buffer (rb=0x7f28bf3a8bd8, buf=0x7f28bee53140, len=716) at t_funcs.c:69
#10 0x00007f28b700d8eb in send_prepared_request_impl (request=0x7f28bf3a8bd8, retransmit=1, branch=0) at uac.c:669
#11 0x00007f28b700ee74 in t_uac_with_ids (uac_r=0x7ffe79df7c00, ret_index=0x0, ret_label=0x0) at uac.c:753
#12 0x00007f28b700eda0 in t_uac (uac_r=0x7ffe79df7c00) at uac.c:721
#13 0x00007f28b7012038 in request (uac_r=0x7ffe79df7c00, ruri=0x7ffe79df7bd0, to=0x7ffe79df7bd0, from=0x7ffe79df7be0, next_hop=0x0)
    at uac.c:1089
#14 0x00007f28b3281053 in dmq_send_message (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, node=0x7f28be1bf270, 
    resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>)
    at dmq_funcs.c:251
#15 0x00007f28b327f4ff in bcast_dmq_message1 (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, except=0x0, 
    resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>, incl_inactive=0)
    at dmq_funcs.c:170
        (gdb) p &dmq_node_list->lock
        $1 = (gen_lock_t *) 0x7f28bd321478

#16 0x00007f28b327f8d4 in bcast_dmq_message (peer=0x7f28bdf954a8, body=0x7ffe79df7fb0, except=0x0, 
    resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>)
    at dmq_funcs.c:188
#17 0x00007f28b26f445c in dlg_dmq_send (body=0x7ffe79df7fb0, node=0x0) at dlg_dmq.c:88
#18 0x00007f28b26fdde4 in dlg_dmq_replicate_action (action=DLG_DMQ_STATE, dlg=0x7f28be36a058, needlock=0, node=0x0) at dlg_dmq.c:628
#19 0x00007f28b26ca370 in dlg_onroute (req=0x7f28dbee2088, route_params=0x7ffe79df8340, param=0x0) at dlg_handlers.c:1559
#20 0x00007f28b68f35c6 in run_rr_callbacks (req=0x7f28dbee2088, rr_param=0x7ffe79df8430) at rr_cb.c:96
#21 0x00007f28b69069b2 in after_loose (_m=0x7f28dbee2088, preloaded=0) at loose.c:984
#22 0x00007f28b69070dc in loose_route (_m=0x7f28dbee2088) at loose.c:1018
#23 0x00007f28b690b240 in w_loose_route (msg=0x7f28dbee2088, p1=0x0, p2=0x0) at rr_mod.c:276
#24 0x0000000000472e48 in do_action (h=0x7ffe79df8fe0, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1076
#25 0x00000000004801fb in run_actions (h=0x7ffe79df8fe0, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1581
#26 0x0000000000480877 in run_actions_safe (h=0x7ffe79dfad00, a=0x7f28dbb11a38, msg=0x7f28dbee2088) at core/action.c:1645
#27 0x0000000000446834 in rval_get_int (h=0x7ffe79dfad00, msg=0x7f28dbee2088, i=0x7ffe79df990c, rv=0x7f28dbb11c78, cache=0x0)
    at core/rvalue.c:915
#28 0x000000000044b365 in rval_expr_eval_int (h=0x7ffe79dfad00, msg=0x7f28dbee2088, res=0x7ffe79df990c, rve=0x7f28dbb11c70)
    at core/rvalue.c:1913
#29 0x0000000000472929 in do_action (h=0x7ffe79dfad00, a=0x7f28dbb1a438, msg=0x7f28dbee2088) at core/action.c:1052
#30 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dbb110a0, msg=0x7f28dbee2088) at core/action.c:1581
#31 0x000000000046f598 in do_action (h=0x7ffe79dfad00, a=0x7f28dba8c058, msg=0x7f28dbee2088) at core/action.c:700
#32 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dba8be20, msg=0x7f28dbee2088) at core/action.c:1581
#33 0x0000000000472db7 in do_action (h=0x7ffe79dfad00, a=0x7f28dbaa55d8, msg=0x7f28dbee2088) at core/action.c:1067
#34 0x00000000004801fb in run_actions (h=0x7ffe79dfad00, a=0x7f28dba82020, msg=0x7f28dbee2088) at core/action.c:1581
#35 0x000000000048093f in run_top_route (a=0x7f28dba82020, msg=0x7f28dbee2088, c=0x0) at core/action.c:1666
#36 0x00000000005ecc77 in receive_msg (
    buf=0xabca80 <buf.7134> "ACK sip:61283214906@10.61.162.48:5090;alias=10.61.162.48~5090~1 SIP/2.0\r\nVia: SIP/2.0/UDP 45.116.131.37:6060;branch=z9hG4bK-13506-36982-4\r\nFrom: test_36982 <sip:61289993111@45.116.131.37:6060>;tag=135"..., len=740, rcv_info=0x7ffe79dfb3f0)
    at core/receive.c:423
#37 0x00000000004d837f in udp_rcv_loop () at core/udp_server.c:543
#38 0x0000000000429d47 in main_loop () at main.c:1683
#39 0x0000000000433a96 in main (argc=12, argv=0x7ffe79dfbc78) at main.c:2856

1345 timer:

#0  0x00007f28dca971c9 in syscall () from /lib64/libc.so.6
#1  0x00007f28b327cfa9 in futex_get (lock=0x7f28bd321478) at ../../modules/tm/../../core/mem/../futexlock.h:121
#2  0x00007f28b327f05b in bcast_dmq_message1 (peer=0x7f28bdf954a8, body=0x7ffe79dfac30, except=0x0, 
    resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>, incl_inactive=0)
    at dmq_funcs.c:156
#3  0x00007f28b327f8d4 in bcast_dmq_message (peer=0x7f28bdf954a8, body=0x7ffe79dfac30, except=0x0, 
    resp_cback=0x7f28b2956af0 <dlg_dmq_resp_callback>, max_forwards=1, content_type=0x7f28b2956ab0 <dlg_dmq_content_type>)
    at dmq_funcs.c:188
#4  0x00007f28b26f445c in dlg_dmq_send (body=0x7ffe79dfac30, node=0x0) at dlg_dmq.c:88
#5  0x00007f28b26fdde4 in dlg_dmq_replicate_action (action=DLG_DMQ_RM, dlg=0x7f28be825d20, needlock=0, node=0x0) at dlg_dmq.c:628
#6  0x00007f28b26a27c0 in destroy_dlg (dlg=0x7f28be825d20) at dlg_hash.c:367
#7  0x00007f28b26ac867 in dlg_unref_helper (dlg=0x7f28be825d20, cnt=2, fname=0x7f28b273d21d "dlg_handlers.c", fline=431) at dlg_hash.c:1095
        (gdb) p &d_entry->lock
        $1 = (gen_lock_t *) 0x7f28bdf8b5b4

#8  0x00007f28b26b9b29 in dlg_ontdestroy (t=0x7f28bef7d488, type=131072, param=0x7ffe79dfb130) at dlg_handlers.c:431
#9  0x00007f28b6fb8286 in run_trans_callbacks_internal (cb_lst=0x7f28bef7d500, type=131072, trans=0x7f28bef7d488, params=0x7ffe79dfb130)
    at t_hooks.c:258
#10 0x00007f28b6fb83b8 in run_trans_callbacks (type=131072, trans=0x7f28bef7d488, req=0x0, rpl=0x0, code=0) at t_hooks.c:285
#11 0x00007f28b6f8c96c in free_cell_helper (dead_cell=0x7f28bef7d488, silent=0, fname=0x7f28b709c2d3 "timer.c", fline=643) at h_table.c:165
#12 0x00007f28b7033808 in wait_handler (ti=241915320, wait_tl=0x7f28bef7d510, data=0x7f28bef7d488) at timer.c:643
#13 0x0000000000500fd1 in timer_list_expire (t=241915320, h=0x7f28bb5e4378, slow_l=0x7f28bb5e80e0, slow_mark=14259) at core/timer.c:857
#14 0x000000000050147a in timer_handler () at core/timer.c:922
#15 0x0000000000501924 in timer_main () at core/timer.c:961
#16 0x000000000042a838 in main_loop () at main.c:1753
#17 0x0000000000433a96 in main (argc=12, argv=0x7ffe79dfbc78) at main.c:2856

Full kamctl trap:
gdb_kamailio_20201104_035531.txt

pwakano (Author) commented Nov 4, 2020

Also, continuing the investigation, we saw that the function dlg_cseq_msg_sent was processing a KDMQ message and went past the condition added in your commit. However, from the conversation it seemed DMQ messages should not go through the cseq tracking logic. After checking the patch, the condition checking that the TO tag len is > 0 will always be false for the initial KDMQ, so based on the comment we changed the check to == 0. Is that a correct assumption?
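
To make the condition concrete, here is a small standalone C sketch (not the actual Kamailio patch; the types and function are stand-ins) showing why the to-tag length check must be == 0 rather than > 0: the gdb output above shows tag_value.len = 0 for the locally generated, initial KDMQ request, so only the == 0 check triggers the early bail-out before the dialog lookup re-acquires the dialog entry lock.

/* Standalone sketch (not Kamailio code) of the to-tag condition discussed
 * above. The structs below stand in for Kamailio's str and struct to_body,
 * where tag_value.len == 0 for the initial KDMQ request in the gdb output. */
#include <stdio.h>

typedef struct { const char *s; int len; } str_t;   /* stand-in for str */
typedef struct { str_t tag_value; } to_body_t;      /* stand-in for struct to_body */

/* returns 1 when cseq tracking should be skipped for a locally generated
 * KDMQ request, 0 otherwise */
static int skip_cseq_tracking(const to_body_t *to, int is_kdmq)
{
	if (!is_kdmq)
		return 0;
	/* An initial (out-of-dialog) KDMQ has no to-tag, so tag_value.len is 0.
	 * A "> 0" check therefore never matches and the hook falls through into
	 * the dialog lookup (re-acquiring the dialog entry lock while the DMQ
	 * node list lock is still held). The "== 0" check bails out as intended. */
	return to->tag_value.len == 0;
}

int main(void)
{
	to_body_t initial_kdmq = { { NULL, 0 } };     /* tag_value.len = 0, as in the trace */
	to_body_t in_dialog    = { { "abc123", 6 } }; /* a request that carries a to-tag */

	printf("initial KDMQ skipped: %d\n", skip_cseq_tracking(&initial_kdmq, 1)); /* 1 */
	printf("in-dialog request skipped: %d\n", skip_cseq_tracking(&in_dialog, 1)); /* 0 */
	return 0;
}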

miconda (Member) commented Nov 4, 2020

Right, the commit with the supposed fix had the wrong condition for the to-tag length. I pushed a follow-up commit to fix that.

pwakano (Author) commented Nov 5, 2020

After our change to == 0, the load test ran for over 20 hours with no problems!
Just to be aligned, we compiled the latest from the 5.4 branch and started running the test again. So far, 3+ hours in with no problems. I think it is safe to say the deadlock is not happening anymore.
Thank you for your effort, Daniel!
Also, may I ask when the next 5.4 release is expected?

pwakano (Author) commented Nov 5, 2020

It is worth mentioning that the INVITE authentication, which depends on the cseq tracking, still works fine after the change.

miconda (Member) commented Nov 6, 2020

New releases will come in the near future, during the next few weeks. You can use nightly builds from the stable branches; at least Debian/Ubuntu packages are built every night, see the deb.kamailio.org site.

Closing this one. Thanks!

miconda closed this as completed Nov 6, 2020