dialog DMQ replication worker stops processing msgs and consume all shmmem #2547
henningw changed the title from "DMQ workers stops processing msgs and consume all shmmem" to "dialog DMQ replication worker stops processing msgs and consume all shmmem" on Nov 10, 2020
Hello there, just checking whether there is any feedback on this issue and our analysis?
@pwakano - can you create a pull request for the development/master branch? It is easier to analyze and then just merge. See the contributing guidelines for the commit message format. Thanks!
Created the pull request @miconda. I hope it is all right!
Description
Kamailio is configured to use the DMQ replication for dialog.
When running a load test, the Kamailio DMQ workers eventually become unresponsive and stop processing DMQ messages. As a side effect, shm memory is fully consumed, SIP processing stops working properly, and everything else that depends on it fails as well.
Troubleshooting
Investigation showed that the DMQ workers are stuck inside dmq_send_all_dlgs.
Debugging and studying the function showed that it is prone to data inconsistency because it works on a copy of (d_table->entries)[index] rather than a pointer to it. This leaves a small window in which the d_table entry, including its lock information, can change before the worker acquires the lock through the copy, leaving the process stuck in the lock get.
This conclusion still needs to be validated by an expert.
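To make the suspected failure mode concrete, here is a minimal single-process sketch in C. It is not Kamailio code: `entry_t`, `try_lock`, `unlock`, `lock_via_copy` and `lock_via_pointer` are hypothetical names, and the lock is a plain flag with a non-blocking acquire so that the demo terminates instead of hanging. It assumes the lock state lives inside the table entry, so copying the entry also copies whatever lock state it had at that moment.

```c
#include <assert.h>

/* Hypothetical stand-in for a dialog table entry: lock state plus payload. */
typedef struct entry {
    int locked;    /* 1 while some worker holds the entry */
    int ndialogs;  /* placeholder payload */
} entry_t;

/* Non-blocking acquire so the demo terminates; returns 1 on success. */
static int try_lock(entry_t *e) {
    if (e->locked) return 0;
    e->locked = 1;
    return 1;
}

static void unlock(entry_t *e) { e->locked = 0; }

/* Lock through a struct copy taken while another worker held the slot. */
int lock_via_copy(entry_t *table, int idx) {
    assert(table && idx >= 0);
    table[idx].locked = 1;       /* another worker holds the slot */
    entry_t copy = table[idx];   /* snapshot includes the held lock state */
    unlock(&table[idx]);         /* the other worker releases the real slot */
    return try_lock(&copy);      /* fails: nobody ever unlocks the copy */
}

/* Same interleaving, but locking the shared slot through a pointer. */
int lock_via_pointer(entry_t *table, int idx) {
    assert(table && idx >= 0);
    table[idx].locked = 1;       /* another worker holds the slot */
    entry_t *e = &table[idx];    /* refers to the shared slot itself */
    unlock(&table[idx]);         /* the other worker releases it */
    int ok = try_lock(e);        /* succeeds: the release is visible */
    if (ok) unlock(e);
    return ok;
}
```

With a zero-initialized table, `lock_via_copy(table, 2)` returns 0 and `lock_via_pointer(table, 2)` returns 1. With a blocking lock, the copy case would wait forever, which would match the observed hang. Even when the acquire on a copy succeeds, it gives no mutual exclusion on the shared entry.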
Reproduction
Run a simple load test making calls at a rate of ~5 cps and keep ~2000 calls connected at all times. A higher cps seems to make the problem easier to reproduce.
Adding network degradation to the environment triggers the problem. When running the load test with a tool such as SIPp, retransmissions can also be forced by simply killing the SIPp instance that receives the calls, which makes Kamailio retransmit.
Debugging Data
Possible Solutions
Suggested patch (not yet compiled or tested):
Additional Information
kamailio 5.4.2 (x86_64/linux) 5bd72f