Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault in rpc_mod_print function while shutting down server. #2433

Closed
korizza opened this issue Aug 10, 2020 · 13 comments
Closed

Comments

@korizza
Copy link

korizza commented Aug 10, 2020

Description

We have segfault in Kamailio v5.3.1 installed on Debain 9.x 64 bit occured while kamailio was shutting down while our script tryed to get metric using kamcmd utility at the same time.

Troubleshooting

No troubleshooting was done, since it happened on a production server. We simply restarted the server.

Reproduction

The problem periodically happens on production servers during restart. Kamailio crashes when one of our scripts tried getting statistics about websocket and tls modules using kamcmd during server restart. As I see in core dump, shared memory was already freed when rpc_mod_print called in the child process.

Debugging Data

core.kamailio.17684

/var/lib/ums/sbin/kamailio


GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /var/lib/ums/sbin/kamailio...done.
[New LWP 17684]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/var/lib/ums/sbin/kamailio -m 2048 -M 12 -P /var/run/kamailio/kamailio.pid -f /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __strcmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1197
#0  __strcmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:1197
#1  0x00007ff4d6ddb2f5 in rpc_mod_print (rpc=0x7ff4d66f0540 <binrpc_callbacks>, ctx=0x7fff11cd5658, mname=0x1f140b8 "tls", stats=0x1f2f180, flag=2) at mod_stats.c:117
#2  0x00007ff4d6ddb0eb in rpc_mod_print_one (rpc=0x7ff4d66f0540 <binrpc_callbacks>, ctx=0x7fff11cd5658, mname=0x1f140b8 "tls", pkg_stats=0x1f2d9c0, shm_stats=0x1f2f180, flag=2) at mod_stats.c:159
#3  0x00007ff4d6ddaee1 in rpc_mod_mem_stats_mode (rpc=0x7ff4d66f0540 <binrpc_callbacks>, ctx=0x7fff11cd5658, fmode=0) at mod_stats.c:239
#4  0x00007ff4d6dda84f in rpc_mod_mem_stats (rpc=0x7ff4d66f0540 <binrpc_callbacks>, ctx=0x7fff11cd5658) at mod_stats.c:251
#5  0x00007ff4d64bfc80 in process_rpc_req (buf=0x1f140a4 "\241\003\026t\256\214=\221\nmod.stats", size=29, bytes_needed=0x7fff11cd5aa0, sh=0x7fff11cd5a10, saved_state=0x1f240a8) at binrpc_run.c:678
#6  0x00007ff4d64ad72f in handle_stream_read (s_c=0x1f14070, idx=-1) at io_listener.c:511
#7  0x00007ff4d64a9121 in handle_io (fm=0x7ff55a4e4ae0, events=1, idx=-1) at io_listener.c:706
#8  0x00007ff4d64a793a in io_wait_loop_epoll (h=0x7ff4d66f0348 <io_h>, t=10, repeat=0) at ./../../core/io_wait.h:1062
#9  0x00007ff4d649b62c in io_listen_loop (fd_no=2, cs_lst=0x1e28940) at io_listener.c:281
#10 0x00007ff4d64d172c in mod_child (rank=0) at ctl.c:338
#11 0x0000000000638c14 in init_mod_child (m=0x7ff55a3634b0, rank=0) at core/sr_module.c:780
#12 0x000000000063862d in init_mod_child (m=0x7ff55a363b78, rank=0) at core/sr_module.c:776
#13 0x000000000063862d in init_mod_child (m=0x7ff55a364018, rank=0) at core/sr_module.c:776
#14 0x000000000063862d in init_mod_child (m=0x7ff55a364528, rank=0) at core/sr_module.c:776
#15 0x000000000063862d in init_mod_child (m=0x7ff55a3649c8, rank=0) at core/sr_module.c:776
#16 0x000000000063862d in init_mod_child (m=0x7ff55a365140, rank=0) at core/sr_module.c:776
#17 0x000000000063862d in init_mod_child (m=0x7ff55a3657b0, rank=0) at core/sr_module.c:776
#18 0x000000000063862d in init_mod_child (m=0x7ff55a365c38, rank=0) at core/sr_module.c:776
#19 0x00000000006385b2 in init_child (rank=0) at core/sr_module.c:825
#20 0x000000000043140c in main_loop () at main.c:1753
#21 0x000000000043df6f in main (argc=9, argv=0x7fff11cd9dd8) at main.c:2802

Log Messages

No any useful logs available.

SIP Traffic

No SIP traffic available.

Possible Solutions

Additional Information

  • Kamailio Version - output of kamailio -v
version: kamailio 5.3.1 (x86_64/linux) 283e46
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 283e46
compiled on 14:23:37 Jul 28 2020 with clang 9.0

  • Operating System:
Linux devhpbx005-1.vx 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux
@miconda
Copy link
Member

miconda commented Aug 10, 2020

This is a race that can show up in a multi-process application and is hard to avoid it completely if you do not "cool down" the server (stop/reduce other interactions with kamailio) -- you run RPC commands as the same time with triggering shut down.

I pushed the commit referenced above (which depends on another commit to the core) to narrow down the window of this specific case, but it is not completely avoidable, as the rpc command can be in the middle of iterating some internal lists when kill is performed and some structures can be destroyed by other processes.

If someone wants to propose other solutions, make PRs and can be discussed there.

@miconda miconda closed this as completed Aug 10, 2020
@korizza
Copy link
Author

korizza commented Aug 11, 2020

Thanks Daniel. I'll add this fix as a patch to our kamailio build.

@korizza
Copy link
Author

korizza commented Aug 20, 2020

Updates:

A few days ago we've got the same segfault on production server, but in this case it happened without shut down. I think there is a race when calling process_rpc_req function and it does not relate to server cooldown.

Core dump:

GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /var/lib/ums/sbin/kamailio...done.
[New LWP 6226]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/var/lib/ums/sbin/kamailio -m 2048 -M 12 -P /var/run/kamailio/kamailio.pid -f /'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __strcmp_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:32
#0 __strcmp_sse2_unaligned () at ../sysdeps/x86_64/multiarch/strcmp-sse2-unaligned.S:32
#1 0x00007f327f55c2f5 in rpc_mod_print (rpc=0x7f327ee71540 <binrpc_callbacks>, ctx=0x7ffe1418a9b8, mname=0x1dca0c8 "tls", stats=0x1de5310, flag=2) at mod_stats.c:117
#2 0x00007f327f55c0eb in rpc_mod_print_one (rpc=0x7f327ee71540 <binrpc_callbacks>, ctx=0x7ffe1418a9b8, mname=0x1dca0c8 "tls", pkg_stats=0x1de39d0, shm_stats=0x1de5310, flag=2) at mod_stats.c:159
#3 0x00007f327f55bee1 in rpc_mod_mem_stats_mode (rpc=0x7f327ee71540 <binrpc_callbacks>, ctx=0x7ffe1418a9b8, fmode=0) at mod_stats.c:239
#4 0x00007f327f55b84f in rpc_mod_mem_stats (rpc=0x7f327ee71540 <binrpc_callbacks>, ctx=0x7ffe1418a9b8) at mod_stats.c:251
#5 0x00007f327ec40c80 in process_rpc_req (buf=0x1dca0b4 "\241\003\026F\270\067\r\221\nmod.stats", size=29, bytes_needed=0x7ffe1418ae00, sh=0x7ffe1418ad70, saved_state=0x1dda0b8) at binrpc_run.c:678
#6 0x00007f327ec2e72f in handle_stream_read (s_c=0x1dca080, idx=-1) at io_listener.c:511
#7 0x00007f327ec2a121 in handle_io (fm=0x7f3302c66cc0, events=1, idx=-1) at io_listener.c:706
#8 0x00007f327ec2893a in io_wait_loop_epoll (h=0x7f327ee71348 <io_h>, t=10, repeat=0) at ./../../core/io_wait.h:1062
#9 0x00007f327ec1c62c in io_listen_loop (fd_no=2, cs_lst=0x1cde940) at io_listener.c:281
#10 0x00007f327ec5272c in mod_child (rank=0) at ctl.c:338
#11 0x0000000000638c14 in init_mod_child (m=0x7f3302ae54b0, rank=0) at core/sr_module.c:780
#12 0x000000000063862d in init_mod_child (m=0x7f3302ae5b78, rank=0) at core/sr_module.c:776
#13 0x000000000063862d in init_mod_child (m=0x7f3302ae6018, rank=0) at core/sr_module.c:776
#14 0x000000000063862d in init_mod_child (m=0x7f3302ae6528, rank=0) at core/sr_module.c:776
#15 0x000000000063862d in init_mod_child (m=0x7f3302ae69c8, rank=0) at core/sr_module.c:776
#16 0x000000000063862d in init_mod_child (m=0x7f3302ae7140, rank=0) at core/sr_module.c:776
#17 0x000000000063862d in init_mod_child (m=0x7f3302ae77b0, rank=0) at core/sr_module.c:776
#18 0x000000000063862d in init_mod_child (m=0x7f3302ae7c38, rank=0) at core/sr_module.c:776
#19 0x00000000006385b2 in init_child (rank=0) at core/sr_module.c:825
#20 0x000000000043140c in main_loop () at main.c:1753
#21 0x000000000043df6f in main (argc=9, argv=0x7ffe1418f138) at main.c:2802

Logs:

2020-08-18T12:36:03.755718-07:00 hpbx031-1.va /var/lib/ums/sbin/kamailio[6204]: INFO: <script>: sip_call_id=9bf05e23-bdab-4c7e-93fd-975f1d27d3a9@hpbx031-1.va; Received ACK in a dialog
2020-08-18T12:36:03.755756-07:00 hpbx031-1.va /var/lib/ums/sbin/kamailio[6204]: INFO: <script>: sip_call_id=9bf05e23-bdab-4c7e-93fd-975f1d27d3a9@hpbx031-1.va; Relaying ACK from b2bua to sip:54171324@127.0.0.1:5060
2020-08-18T12:36:03.757855-07:00 hpbx031-1.va /var/lib/ums/sbin/kamailio[6197]: INFO: <script>: [75229096,R,0_1478685494]: Received, method: REGISTER, R-URI: sip:hpbx031.telecomsvc.com:5060, ct: sip:75229096@192.168.10.116:5100, cseq: 14166, s-ip: 74.89.64.126, s-port: 5100
2020-08-18T12:37:29.073600-07:00 hpbx031-1.va /var/lib/ums/sbin/kamailio[6193]: INFO: [core/sctp_core.c:53]: sctp_core_destroy(): SCTP API not initialized
2020-08-18T12:37:39.583593-07:00 hpbx031-1.va kamailio: INFO: [core/sctp_core.c:75]: sctp_core_check_support(): SCTP API not enabled - if you want to use it, load sctp module
2020-08-18T12:37:39.780830-07:00 hpbx031-1.va kamailio: INFO: [core/tcp_main.c:5043]: init_tcp(): using epoll_lt as the io watch method (auto detected)
2020-08-18T12:37:39.781363-07:00 hpbx031-1.va kamailio[23718]: loading modules under config path: /var/lib/ums/lib64/kamailio/modules/
2020-08-18T12:37:39.781546-07:00 hpbx031-1.va kamailio[23718]: Listening on

@miconda
Copy link
Member

miconda commented Aug 20, 2020

Was this a case of running the rpc command many times at the same time?

@korizza
Copy link
Author

korizza commented Aug 20, 2020

Was this a case of running the rpc command many times at the same time?

I'll ask our LSA about it tomorow. Do you think it could be a problem? I thought that RPC handlers have thread safe (interprocess, in this case) access to shared memory, is it?

@korizza
Copy link
Author

korizza commented Aug 21, 2020

This is a sequence of commands that python script runs every 10 seconds:

/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics websocket:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics tcp:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 stats.get_statistics shmem:
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats tls pkg
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats tls shm
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats websocket pkg
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats websocket shm
/var/lib/ums/sbin/kamcmd -s tcp:localhost:2048 mod.stats core shm

This script parses output of every command and gets needed metrics. I see at least one problem here - the script should use udp instead of tcp.

@korizza
Copy link
Author

korizza commented Sep 1, 2020

Bump! Any updates?

@miconda
Copy link
Member

miconda commented Sep 1, 2020

Access to shared memory is unlocked if shutdown is triggered, otherwise some worker processes can be killed while it acquired the lock, leaving kamailio shutdown worker in deadlock. So there are potential races if you run the rpc commands when you trigger the shut down. Probably you can make your stats fetch script to execute only if shutdown is not triggered, by using a state file created on shutdown command.

Shut down multiprocess races with shared memory access are hard to avoid completely, being a trade between complexity and being able to do a fast restart. If anyone wants to work on this, he/she is more than welcome to propose a pull request.

@korizza
Copy link
Author

korizza commented Sep 1, 2020

I understand, but we've got a lot of such crashes during normal kamailio operation, NOT during shutdown. This is the issue I'm concerned about.

@korizza
Copy link
Author

korizza commented Sep 1, 2020

The problem is not reproducible on the hosts where we disabled metrics gathering on kamailio.

@miconda
Copy link
Member

miconda commented Sep 1, 2020

Somehow I failed to notice it happens at runtime as well, I thought it was only during shut down.

Open a new issue adding the output of bt full from a recent core file for a runtime crash, to start fresh on troubleshooting.

@korizza
Copy link
Author

korizza commented Sep 1, 2020

Created issue #2460

miconda added a commit that referenced this issue Sep 9, 2020
@miconda
Copy link
Member

miconda commented Sep 9, 2020

I pushed a few commits in master branch trying to address this issue. I haven't implemented the memory mod stats, but as I could see in the code, it didn't seem to be protected for races on accessing the shm fragments.

You would need to test with master branch or by using patches from the next commits:

henningw pushed a commit that referenced this issue Oct 2, 2020
- related to GH #2433

(cherry picked from commit 761eb06)
miconda added a commit that referenced this issue Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants