Kamailio freezes with DMQ and dialog #2535
Comments
Thanks for reproducing it with a recent version and opening the issue. |
Hello, sorry for being pushy, but I am hoping someone can acknowledge the problem or tell me whether some config on my side is causing the issue. This is a major problem: it is worse than a crash, because manual intervention with kill -9 is necessary to restore the service. |
There is not much configuration needed for DMQ, so my guess is that it is either something in the particular environment or a bug in the code. Maybe @charlesrchance can comment on it. |
Thanks henningw! I was just wondering whether some config in DMQ or dialog could be causing the issue, because it looks like others in the community are using DMQ + dialog without issues. |
Can you try a couple of variants to isolate the cause? Try each one independently of the others.
|
Just to copy from the list discussion: the use of the vSphere Fault Tolerance feature has been linked to network latency and probably CPU allocation latency. This increases the chances of the mentioned issue happening. |
The investigation done so far shows that two processes entered a deadlock: a UDP worker and the timer process.
21266 # timer BT:
21253 # udp receiver child=6 sock=124.47.168.242:5060 BT:
Full kamctl trap: |
Hi Daniel, thanks for your attention! |
Actually, I forgot to mention that track_cseq_updates was set to 0, not 1, as per your comment; 0 is what breaks the authentication, and it was already set to 1. |
Indeed. My commit from yesterday was made because I discovered that cseq update processing was being done for DMQ requests. It's useless there, but I didn't have time to dig further into whether it can expose any issue. I suggested disabling it because it was being executed and it is not a common deployment, so I wanted to isolate whether it was somehow leading to the deadlock. |
And indeed, after looking further in this direction, helped also by your troubleshooting findings posted in a comment above, it turned out that the processing of cseq updates for DMQ was leading to the deadlock. The commit I pushed yesterday (which was backported to the 5.4 and 5.3 branches as well) should fix it. Can you try with it and |
Hi Daniel, thanks for your review, appreciate it! |
I cannot say that there are no other deadlocks, but this case is not going to happen anymore when processing INVITE or other requests that are part of the dialog. Trying to update the cseq in the locally generated DMQ request (which propagates the dialog state to other peers) led to a spiral of accessing both the dialogs and dmq-peers lists. Adding a bit more: the design is to go bottom-up and back, with dialog processing first, then DMQ replication finished, then back. This case was dialog processing, DMQ notification unfinished, and then dialog processing again. |
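To illustrate the kind of cycle described above, here is a minimal sketch of two workers that take two locks in opposite order. It is a simplified model, not Kamailio's locking code: the lock names `dialog` and `dmq_peers`, the worker names, and the assumption that the two processes held one list each while waiting for the other are all hypothetical. A timeout stands in for the real hang so the example terminates.

```python
import threading


def inverted_order_attempt(timeout=0.2):
    """Two workers take the same two locks in opposite order.

    Returns {worker_name: bool} saying whether each worker managed to get
    its second lock. With a real blocking acquire both would hang forever.
    """
    locks = {"dialog": threading.Lock(), "dmq_peers": threading.Lock()}
    barrier = threading.Barrier(2)
    got_second = {}

    def worker(name, first_name, second_name):
        first, second = locks[first_name], locks[second_name]
        with first:
            barrier.wait()  # both workers now hold their first lock
            ok = second.acquire(timeout=timeout)
            if ok:
                second.release()
            got_second[name] = ok
            barrier.wait()  # keep holding `first` until both attempts ended

    threads = [
        # Hypothetical assignment of lock order to the two processes.
        threading.Thread(target=worker, args=("udp_worker", "dialog", "dmq_peers")),
        threading.Thread(target=worker, args=("timer", "dmq_peers", "dialog")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return got_second


def consistent_order_completes():
    """Both workers take `dialog` first, then `dmq_peers`: no cycle is possible."""
    dialog, dmq_peers = threading.Lock(), threading.Lock()
    done = []

    def worker(name):
        with dialog:
            with dmq_peers:
                done.append(name)

    threads = [threading.Thread(target=worker, args=(n,)) for n in ("udp_worker", "timer")]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)
    return sorted(done)
```

In this model neither worker gets its second lock when the orders are inverted, while a single agreed order always completes. That matches the intent of keeping DMQ requests out of the dialog cseq path.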
Hi Daniel, we've run the test, but unfortunately the problem happened again after about 5 hours of testing. Check the backtraces of the deadlocked processes:
1331 udp receiver child=3 sock=45.116.131.61:5060
1345 timer:
Full kamctl trap: |
Continuing the investigation, we also saw that the function dlg_cseq_msg_sent was processing a KDMQ message and went past the condition added in your commit. However, from the conversation it seemed that DMQ messages should not go through the cseq tracking logic. After checking the patch, we found that the condition checking for To tag length > 0 will always be false for the initial KDMQ, so based on the comment we changed the check to == 0. Is that a correct assumption? |
Right, the commit with the supposed fix had the wrong condition for the to-tag length. I pushed a follow-up commit to fix that. |
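For readers following along, the difference between the two conditions can be modeled in a few lines. This is only a sketch of the logic discussed in the comments above, not the actual code in the dialog module: the function name, its parameters, and the idea that the check is keyed on the KDMQ method are assumptions made for illustration.

```python
def skips_cseq_tracking(method: str, to_tag: str, *, first_patch: bool = False) -> bool:
    """Model of the early-exit check discussed in this thread (hypothetical).

    Returns True when the request should bypass the cseq tracking logic.
    `first_patch=True` models the first attempt (To tag length > 0);
    the default models the corrected check (To tag length == 0).
    """
    if method != "KDMQ":
        return False  # INVITE and other dialog requests are still tracked
    if first_patch:
        return len(to_tag) > 0
    return len(to_tag) == 0


# An initial KDMQ request carries no To tag yet.
initial_kdmq = ("KDMQ", "")
print(skips_cseq_tracking(*initial_kdmq, first_patch=True))  # first patch
print(skips_cseq_tracking(*initial_kdmq))                    # corrected check
```

With the first condition the initial KDMQ falls through into cseq tracking, which is what the backtraces showed. With the corrected one it exits early, and INVITE handling is unaffected in this model.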
After our change to == 0, the load test ran for over 20 hours with no problems! |
It is worth mentioning that the INVITE authentication, which depends on the cseq tracking, still works fine after the change. |
New releases will come in the near future, within the next few weeks. In the meantime you can use nightly builds from the stable branches; at least the Debian/Ubuntu packages are built every night, see the deb.kamailio.org site. Closing this one. Thanks! |
Description
When running a load test, Kamailio eventually becomes unresponsive and stops processing calls.
Kamailio is configured to use DMQ replication for dialog and usrloc. The dialog keepalive is also enabled.
Troubleshooting
From investigation, the problem happens faster and more easily when there is some network degradation causing packet loss and/or retransmissions, but even without any noticeable network issue the freeze eventually happens.
Reproduction
Run a simple load test making calls at a rate of about 5 cps and keep around 2000 calls connected at all times. A higher cps seems to make it easier to reproduce the problem.
Adding network degradation to the environment makes the problem happen. When running a tool such as SIPp for the load test, retransmissions can also be forced by simply killing the SIPp instance receiving calls, which forces Kamailio to retransmit.
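As a rough sizing aid for the load described above, the call duration needed to hold about 2000 concurrent calls at 5 cps can be estimated with Little's law (concurrent calls = arrival rate × average hold time). This is a back-of-the-envelope sketch assuming a steady state; the function name is made up for illustration and the actual SIPp scenario settings are not given in this report.

```python
def average_hold_time(concurrent_calls: float, calls_per_second: float) -> float:
    """Little's law rearranged: hold time (s) = concurrent calls / arrival rate."""
    return concurrent_calls / calls_per_second


# Figures taken from the reproduction steps: ~2000 calls connected at ~5 cps.
hold_seconds = average_hold_time(2000, 5)
print(hold_seconds)  # 400.0
```

So each call would need to stay up for roughly 400 seconds (a little under 7 minutes) for the test to settle at that level.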
Debugging Data
Output of kamctl trap:
gdb_kamailio_20201028_213030.txt
Log Messages
Locally generated requests show up in the log, but are not sent on the network
Possible Solutions
Not found so far
Additional Information
kamailio -v