Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kamailio-5.4.0 : Crashes from keepalive module #2448

Closed
sagarmalam opened this issue Aug 19, 2020 · 33 comments
Closed

Kamailio-5.4.0 : Crashes from keepalive module #2448

sagarmalam opened this issue Aug 19, 2020 · 33 comments
Labels

Comments

@sagarmalam
Copy link

sagarmalam commented Aug 19, 2020

Description

I am using keepalive module to send keepalives to all the destinations registered with an application server which sits behind kamailio.
When Kamailio recieves 200 Ok. I take URI contact from 200 OK and pass it as a param to ka_add_destination function.Most of the time it works fine but sometimes kamailio crashes when it recieves 200 OK of OPTIONS.

Troubleshooting

Reproduction

I am able to reproduce it easily in below scenario :
All the URIs which we are feeding to KA module has this syntax :
sip:2003@10.50.8.11:9090;alias=10.50.8.1~5060~1;pb-ip=35.185.177.49;pb-pt=5060;tp=udp

  1. So request will be generate by KA module and it will be send to 10.50.8.11:9090 ( which is another port of same kamailio instance ) via UDP protocol.
  2. On recieving OPTION on 9090 port ,
    - We are calling handle_ruri_alias() function which will set $du =sip:10.50.8.1:5060;transport=udp
    - Modify RURI from sip:2003@10.50.8.11:9090;alias=10.50.8.1~5060~1;pb-ip=35.185.177.49;pb-pt=5060;tp=udp to
    sip:2003@35.185.177.49:5060;pb-ip=35.185.177.49;pb-pt=5060;transport=udp
    - and relay to 10.50.8.1
  3. 10.50.8.1 is another instance of Kamailio which will relay OPTION packet to final destination

Debugging Data

[root@SBC-4-2 /]# gdb kamailio core.67771
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7                                                                                                  
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/sbin/kamailio...done.
[New LWP 67771]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/kamailio -m 5000 -M 500 -P /run/kamailio/kamailio.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000abcdefed in ?? ()
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libcurl-7.29.0-51.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-12.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 mariadb-libs-5.5.64-1.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-5.el7_5.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.el7_5.x86_64 openldap-2.4.44-15.el7_5.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt full
#0  0x00000000abcdefed in ?? ()
No symbol table info available.
#1  0x00007f931dc69c7e in ka_options_callback (t=0x7f91e58f05c0, type=1024, ps=0x7ffda0977bd0) at keepalive_core.c:126
        uri = {
          s = 0x7f91e58f63f0 "sip:2003@10.50.8.11:9090;alias=10.50.8.1~5060~1;pb-ip=35.185.177.49;pb-pt=5060;tp=udp>\r\nFrom: <sip:keepalive@fromsbc>;tag=c7bfd876be500fff196414658218fc27-d6989f6d\r\nCSeq: 10 OPTIONS\r\nCall-ID: 3f2c3c7c"..., len = 85}
        msg = 0x0
        state = 2
        state_routes = {0x7f931dc6f844 "", 0x7f931dc6f93d "keepalive:dst-up", 0x7f931dc6f94e "keepalive:dst-down"}
        ka_dest = 0x7f91e5898988
        __FUNCTION__ = "ka_options_callback"
#2  0x00007f9324bfddd8 in run_trans_callbacks_internal (cb_lst=0x7f91e58f0638, type=1024, trans=0x7f91e58f05c0, params=0x7ffda0977bd0)
    at t_hooks.c:258
        cbp = 0x7f91e58f6590
        backup_from = 0xb2a750 <def_list+16>
        backup_to = 0xb2a758 <def_list+24>
        backup_dom_from = 0xb2a760 <def_list+32>
        backup_dom_to = 0xb2a768 <def_list+40>
        backup_uri_from = 0xb2a740 <def_list>
        backup_uri_to = 0xb2a748 <def_list+8>
        backup_xavps = 0xb29fd0 <_xavp_list_head>
        backup_xavus = 0xb29fd8 <_xavu_list_head>
        backup_xavis = 0xb29fe0 <_xavi_list_head>
        __FUNCTION__ = "run_trans_callbacks_internal"
#3  0x00007f9324bfdf0a in run_trans_callbacks (type=1024, trans=0x7f91e58f05c0, req=0x0, rpl=0xffffffffffffffff, code=408) at t_hooks.c:285
        params = {req = 0x0, rpl = 0xffffffffffffffff, param = 0x7f91e58f65a0, code = 408, flags = 0, branch = 0, t_rbuf = 0x0, dst = 0x0, 
          send_buf = {s = 0x0, len = 0}}
#4  0x00007f9324b943f1 in local_reply (t=0x7f91e58f05c0, p_msg=0xffffffffffffffff, branch=0, msg_status=408, cancel_data=0x7ffda0977d50)
    at t_reply.c:2265
        local_store = 0
        local_winner = 0
        reply_status = RPS_COMPLETED
        winning_msg = 0xffffffffffffffff
        winning_code = 408
        totag_retr = 0
---Type <return> to continue, or q <return> to quit---
        __FUNCTION__ = "local_reply"
#5  0x00007f9324bcc9fd in fake_reply (t=0x7f91e58f05c0, branch=0, code=408) at timer.c:290
        cancel_data = {cancel_bitmap = 0, reason = {cause = 0, u = {text = {s = 0x0, len = -443611712}, e2e_cancel = 0x0, packed_hdrs = {s = 0x0, 
                len = -443611712}}}}
        do_cancel_branch = 0
        reply_status = 67771
#6  0x00007f9324bcce95 in final_response_handler (r_buf=0x7f91e58f0860, t=0x7f91e58f05c0) at timer.c:462
        silent = 0
        branch_ret = 0
        prev_branch = 0
        now = 0
#7  0x00007f9324bccf56 in retr_buf_handler (ticks=112894435, tl=0x7f91e58f0880, p=0xfa0) at timer.c:518
        rbuf = 0x7f91e58f0860
        fr_remainder = 1
        retr_remainder = 32657
        retr_interval = 3847088576
        new_retr_interval_ms = 6989250208
        crt_retr_interval_ms = 140264594074048
        t = 0x7f91e58f05c0
        __FUNCTION__ = "retr_buf_handler"
#8  0x00000000004b7bcd in slow_timer_main () at core/timer.c:1105
        n = 12
        ret = 4294967295
        tl = 0x7f91e58f0880
        i = 700
        __FUNCTION__ = "slow_timer_main"
#9  0x000000000042a4c3 in main_loop () at main.c:1737
        i = 16
        pid = 0
        si = 0x0
        si_desc = "udp receiver child=15 sock=10.50.8.11:9090\000:0:0:11]:5060\000\061:5060)\000\000\000\000\000\000\000\000\000\000F\345\221\177\000\000\000\000\000\000\000\000\000\000\bLv\345\221\177\000\000P\201\227\240\375\177\000\000\210\212K\000\000\000\000\000\260\270A\000\000\000\000\000\b{\017(\223\177\000"
        nrprocs = 16
        woneinit = 1
---Type <return> to continue, or q <return> to quit---
        __FUNCTION__ = "main_loop"
#10 0x0000000000433a66 in main (argc=7, argv=0x7ffda09786b8) at main.c:2856
        cfg_stream = 0x27e7010
        c = -1
        r = 0
        tmp = 0x7ffda0979f26 ""
        tmp_len = 0
        port = 0
        proto = 0
        ahost = 0x0
        aport = 0
        options = 0x7d2498 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 791137056
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x7f934718ea00 <intel_02_known>
        p = 0x0
        st = {st_dev = 23, st_ino = 30676, st_nlink = 2, st_mode = 16877, st_uid = 0, st_gid = 5001, __pad0 = 0, st_rdev = 0, st_size = 40, 
          st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1597841333, tv_nsec = 176009291}, st_mtim = {tv_sec = 1597841333, 
            tv_nsec = 166009247}, st_ctim = {tv_sec = 1597841333, tv_nsec = 176009291}, __unused = {0, 0, 0}}
        tbuf = "\377\377\377\377", '\000' <repeats 12 times>, "\340\263\001G\223\177\000\000\310T\320G\223\177", '\000' <repeats 90 times>, "p\342\252\000\000\000\000\000\260\270A\000\000\000\000\000\260\206\227\240\375\177", '\000' <repeats 26 times>, "\036_\260G\223\177\000\000\001", '\000' <repeats 23 times>...
        option_index = 0
        long_options = {{name = 0x7d468f "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x7cfc94 "version", has_arg = 0, flag = 0x0, 
            val = 118}, {name = 0x7d4694 "alias", has_arg = 1, flag = 0x0, val = 1024}, {name = 0x7d469a "subst", has_arg = 1, flag = 0x0, 
            val = 1025}, {name = 0x7d46a0 "substdef", has_arg = 1, flag = 0x0, val = 1026}, {name = 0x7d46a9 "substdefs", has_arg = 1, flag = 0x0, 
            val = 1027}, {name = 0x7d46b3 "server-id", has_arg = 1, flag = 0x0, val = 1028}, {name = 0x7d46bd "loadmodule", has_arg = 1, flag = 0x0, 
            val = 1029}, {name = 0x7d46c8 "modparam", has_arg = 1, flag = 0x0, val = 1030}, {name = 0x7d46d1 "log-engine", has_arg = 1, flag = 0x0, 
            val = 1031}, {name = 0x7d46dc "debug", has_arg = 1, flag = 0x0, val = 1032}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __FUNCTION__ = "main"
(gdb) 

Log Messages

Aug 19 12:58:26 SBC-4-2 /usr/local/sbin/kamailio[67772]: CRITICAL: <core> [core/mem/q_malloc.c:138]: qm_debug_check_frag(): BUG: qm: fragm. 0x7f91e5898950 (address 0x7f91e5898988) end overwritten (c0c0c0c1, abcdefed)! Memory allocator was called from core: core/usr_avp.c:626. Fragment marked by core: core/usr_avp.c:175. Exec from core/mem/q_malloc.c:511.
Aug 19 12:58:38 SBC-4-2 /usr/local/sbin/kamailio[67793]: CRITICAL: <core> [core/pass_fd.c:277]: receive_fd(): EOF on 94

SIP Traffic

(paste your sip traffic here)

Possible Solutions

Additional Information

  • Kamailio Version - output of kamailio -v
[root@SBC-4-2 /]# kamailio -v
version: kamailio 5.4.0 (x86_64/linux) 6c4fce                                                                                                         
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 6c4fce 
compiled on 14:16:47 Aug 14 2020 with gcc 4.8.5
  • Operating System:
[root@SBC-4-2 /]# uname -a
Linux SBC-4-2.netcarrier.net 5.1.0 #1 SMP Mon May 6 13:44:38 GMT 2019 x86_64 x86_64 x86_64 GNU/Linux
@miconda
Copy link
Member

miconda commented Aug 19, 2020

Do you use ka_add_destination() in kamailio configuration file? Or have you developed a module that uses internally this function?

Because looking at the code, when using the function in config, is not setting any state change callback function, which seems to be the one missing at the top of the backtrace (by looking what is at keepalive_core.c:126).

@sagarmalam
Copy link
Author

I have used ka_add_destination in config file.

@sagarmalam
Copy link
Author

Hi, @miconda
So it a quick fix or complex one ? Can i expect it to get fixed in next stable version ?

@miconda
Copy link
Member

miconda commented Aug 20, 2020

I am not the developer of the module, I just looked quickly in the code and it seemed that it should not execute the callback that caused the crash if the function was used in the config. That's why I asked if you use it inside a custom module.

If I get some spare time, I will look again, but I cannot say if it is going to be fixed before the next minor release in 5.4 -- this is expected to happen soon, maximum 1-2 weeks and I expect to be quite busy getting Kamailio World online event organized.

The original developer changed the job and is no longer active, iirc, but there were other devs contributing to this module lately, maybe they can have a look as well -- pinging @luismartingil for it.

@sagarmalam
Copy link
Author

@miconda , Thanks i understand.

@miconda
Copy link
Member

miconda commented Aug 30, 2020

Supposing you still have the core file, can you get the output of next commands executed inside gdb:

frame 1
p *ka_dest

@sagarmalam
Copy link
Author

sagarmalam commented Aug 30, 2020 via email

@miconda
Copy link
Member

miconda commented Aug 31, 2020

In the first backtrace, the frame 1 was different:

#1  0x00007f931dc69c7e in ka_options_callback (t=0x7f91e58f05c0, type=1024, ps=0x7ffda0977bd0) at keepalive_core.c:126

Did you get another core file? I need the output of p *ka_dest in the frame showing execution of ka_options_callback(). Get also the output of list.

@luismartingil
Copy link
Member

@NGSegovia FYI

@sagarmalam
Copy link
Author

I have generated fresh core dump :

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/sbin/kamailio...done.
[New LWP 77082]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/kamailio -m 5000 -M 500 -P /run/kamailio/kamailio.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fee0197dc7c in ka_options_callback (t=0x7fecc97f8450, type=1024, ps=0x7ffc051e10f0) at keepalive_core.c:126
126                             ka_dest->statechanged_clb(&ka_dest->uri, state, ka_dest->user_attr);
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libcurl-7.29.0-51.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-12.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 mariadb-libs-5.5.64-1.el7.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-5.el7_5.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.el7_5.x86_64 openldap-2.4.44-15.el7_5.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt full
#0  0x00007fee0197dc7c in ka_options_callback (t=0x7fecc97f8450, type=1024, ps=0x7ffc051e10f0) at keepalive_core.c:126
        uri = {
          s = 0x7fecc97fa612 "sip:1006@10.50.8.11:9090;alias=10.50.8.1~5060~1;pb-ip=43.228.229.106;pb-pt=39524;tp=tcp>\r\nFrom: <sip:keepalive@fromsbc>;tag=c7bfd876be500fff196414658218fc27-ab6132c4\r\nCSeq: 10 OPTIONS\r\nCall-ID: 7d20b9"..., len = 87}
        msg = 0x0
        state = 1
        state_routes = {0x7fee01983844 "", 0x7fee0198393d "keepalive:dst-up", 0x7fee0198394e "keepalive:dst-down"}
        ka_dest = 0x7fecc98b1350
        __FUNCTION__ = "ka_options_callback"
#1  0x00007fee08911dd8 in run_trans_callbacks_internal (cb_lst=0x7fecc97f84c8, type=1024, trans=0x7fecc97f8450, params=0x7ffc051e10f0)
    at t_hooks.c:258
        cbp = 0x7fecc97fa7b8
        backup_from = 0xb2a750 <def_list+16>
        backup_to = 0xb2a758 <def_list+24>
        backup_dom_from = 0xb2a760 <def_list+32>
        backup_dom_to = 0xb2a768 <def_list+40>
        backup_uri_from = 0xb2a740 <def_list>
        backup_uri_to = 0xb2a748 <def_list+8>
        backup_xavps = 0xb29fd0 <_xavp_list_head>
        backup_xavus = 0xb29fd8 <_xavu_list_head>
        backup_xavis = 0xb29fe0 <_xavi_list_head>
        __FUNCTION__ = "run_trans_callbacks_internal"
#2  0x00007fee08911f0a in run_trans_callbacks (type=1024, trans=0x7fecc97f8450, req=0x0, rpl=0x7fee0be2d230, code=200) at t_hooks.c:285
        params = {req = 0x0, rpl = 0x7fee0be2d230, param = 0x7fecc97fa7c8, code = 200, flags = 0, branch = 0, t_rbuf = 0x0, dst = 0x0, send_buf = {
            s = 0x0, len = 0}}
#3  0x00007fee088a83f1 in local_reply (t=0x7fecc97f8450, p_msg=0x7fee0be2d230, branch=0, msg_status=200, cancel_data=0x7ffc051e14e0)
    at t_reply.c:2265
        local_store = 0
        local_winner = 0
        reply_status = RPS_COMPLETED
        winning_msg = 0x7fee0be2d230
        winning_code = 200
        totag_retr = 0
        __FUNCTION__ = "local_reply"
#4  0x00007fee088ab28d in reply_received (p_msg=0x7fee0be2d230) at t_reply.c:2648
---Type <return> to continue, or q <return> to quit---
        msg_status = 200
        last_uac_status = 0
        ack = 0x5adbb1 <sr_event_exec+415> "\211E\374\213E\374\351\270\002"
        ack_len = 198791384
        branch = 0
        reply_status = 198791440
        onreply_route = 0
        cancel_data = {cancel_bitmap = 0, reason = {cause = 200, u = {text = {s = 0x0, len = 11322688}, e2e_cancel = 0x0, packed_hdrs = {s = 0x0, 
                len = 11322688}}}}
        uac = 0x7fecc97f86e0
        t = 0x7fecc97f8450
        lack_dst = {send_sock = 0xacc673 <buf.7133+723>, to = {s = {sa_family = 50496, 
              sa_data = "\254\000\000\000\000\000@\025\036\005\374\177\000"}, sin = {sin_family = 50496, sin_port = 172, sin_addr = {s_addr = 0}, 
              sin_zero = "@\025\036\005\374\177\000"}, sin6 = {sin6_family = 50496, sin6_port = 172, sin6_flowinfo = 0, sin6_addr = {__in6_u = {
                  __u6_addr8 = "@\025\036\005\374\177\000\000\240\327o\000\000\000\000", __u6_addr16 = {5440, 1310, 32764, 0, 55200, 111, 0, 0}, 
                  __u6_addr32 = {85857600, 32764, 7329696, 0}}}, sin6_scope_id = 85857600}, sas = {ss_family = 50496, 
              __ss_padding = "\254\000\000\000\000\000@\025\036\005\374\177\000\000\240\327o\000\000\000\000\000@\025\036\005\374\177\000\000l=s\000\000\000\000\000\220N\205\000\000\000\000\000\006-\001\000\207\000\000\000dr\205\000\000\000\000\000x\365\332\v\356\177\000\000sƬ\000\000\000\000\000\066Ŭ\000\000\000\000\000\000\025\036\005\374\177\000\000\020Q\331\v\356\177\000\000sƬ\000\000\000\000\000\060Ŭ\000\000\000\000", 
              __ss_align = 140720394343680}}, id = 0, send_flags = {f = 0, blst_imask = 0}, proto = -73 '\267', proto_pad0 = -60 '\304', 
          proto_pad1 = 172}
        backup_user_from = 0x533d48 <get_send_socket2+38>
        backup_user_to = 0x52be18 <init_su+941>
        backup_domain_from = 0x3ce0c0
        backup_domain_to = 0x737024 <get_hdr_field+6416>
        backup_uri_from = 0x16
        backup_uri_to = 0x8
        backup_xavps = 0x7ffc051e16f0
        backup_xavus = 0x0
        backup_xavis = 0x0
        replies_locked = 1
        branch_ret = 524288000
        prev_branch = 0
        blst_503_timeout = 7582578
        hf = 0x854e90
---Type <return> to continue, or q <return> to quit---
        onsend_params = {req = 0x7ffc051e1540, rpl = 0x6c640e <qm_malloc+2479>, param = 0x854e90, code = 7562879, flags = 0, branch = 0, 
          t_rbuf = 0x3cd9f8, dst = 0x12d06, send_buf = {s = 0x41b8b0 <_start> "1\355I\211\321^H\211\342H\203\344\360PTI\307\300p\367|", 
            len = 85862032}}
        ctx = {rec_lev = 0, run_flags = 0, last_retcode = 1, jmp_env = {{__jmpbuf = {77062, -1461319875154594989, 8736400, 7140757, 140720394343312, 
                140660373053456, 140660377843008, 140660373053456}, __mask_was_saved = 85857232, __saved_mask = {__val = {140660210063967, 
                  140660373061248, 0, 524288000, 7312777, 3989744, 5422768, 5455176, 8, 140660377843104, 140660377843088, 140720394343424, 5954481, 
                  0, 21483460608, 656}}}}}
        bctx = 0x7fee0be2d230
        keng = 0x0
        ret = 0
        evname = {s = 0x7fee08944dcc "on_sl_reply", len = 11}
        __FUNCTION__ = "reply_received"
#5  0x000000000053bac3 in do_forward_reply (msg=0x7fee0be2d230, mode=0) at core/forward.c:757
        new_buf = 0x0
        dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, sin = {sin_family = 0, sin_port = 0, sin_addr = {
                s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
                __in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, 
              sin6_scope_id = 0}, sas = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}}, id = 0, send_flags = {f = 0, 
            blst_imask = 0}, proto = 0 '\000', proto_pad0 = 0 '\000', proto_pad1 = 0}
        new_len = 1
        r = 1
        ip = {af = 4307120, len = 0, u = {addrl = {77062, 4259919}, addr32 = {77062, 0, 4259919, 0}, addr16 = {11526, 1, 0, 0, 79, 65, 0, 0}, 
            addr = "\006-\001\000\000\000\000\000O\000A\000\000\000\000"}}
        s = 0x7fee0be2da60 ""
        len = 32764
        __FUNCTION__ = "do_forward_reply"
#6  0x000000000053d795 in forward_reply (msg=0x7fee0be2d230) at core/forward.c:858
No locals.
#7  0x00000000005c70a4 in receive_msg (
    buf=0xacc3a0 <buf.7133> "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.11:5060;received=10.50.8.11;branch=z9hG4bK6e61.3b138063", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nTo: <sip:1006@10.50.8.11:9090"..., len=723, rcv_info=0x7ffc051e1e10)
    at core/receive.c:509
        msg = 0x7fee0be2d230
        ctx = {rec_lev = 0, run_flags = 0, last_retcode = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {
                __val = {0 <repeats 13 times>, 77062, 140660911038800, 126}}}}}
---Type <return> to continue, or q <return> to quit---
        bctx = 0x0
        ret = -1
        tvb = {tv_sec = 0, tv_usec = 0}
        tve = {tv_sec = 0, tv_usec = 0}
        diff = 0
        inb = {
          s = 0xacc3a0 <buf.7133> "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.11:5060;received=10.50.8.11;branch=z9hG4bK6e61.3b138063", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nTo: <sip:1006@10.50.8.11:9090"..., len = 723}
        netinfo = {data = {s = 0x0, len = 0}, rcv = 0x0, dst = 0x0}
        keng = 0x0
        evp = {data = 0x7ffc051e1940, obuf = {s = 0x0, len = 0}, rcv = 0x7ffc051e1e10, dst = 0x0, req = 0x0, rpl = 0x0, rplcode = 0, mode = 0}
        cidlockidx = 0
        cidlockset = 0
        errsipmsg = 0
        exectime = 0
        __FUNCTION__ = "receive_msg"
#8  0x000000000048c09e in udp_rcv_loop () at core/udp_server.c:543
        len = 723
        buf = "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.11:5060;received=10.50.8.11;branch=z9hG4bK6e61.3b138063", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nTo: <sip:1006@10.50.8.11:9090"...
        tmp = 0x7fecc91ae520 ""
        fromaddr = 0x7fee0bdcb8a8
        fromaddrlen = 16
        rcvi = {src_ip = {af = 2, len = 4, u = {addrl = {185086474, 0}, addr32 = {185086474, 0, 0, 0}, addr16 = {12810, 2824, 0, 0, 0, 0, 0, 0}, 
              addr = "\n2\b\v", '\000' <repeats 11 times>}}, dst_ip = {af = 2, len = 4, u = {addrl = {185086474, 0}, addr32 = {185086474, 0, 0, 0}, 
              addr16 = {12810, 2824, 0, 0, 0, 0, 0, 0}, addr = "\n2\b\v", '\000' <repeats 11 times>}}, src_port = 9090, dst_port = 5060, 
          proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2, sa_data = "#\202\n2\b\v\000\000\000\000\000\000\000"}, sin = {
              sin_family = 2, sin_port = 33315, sin_addr = {s_addr = 185086474}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {
              sin6_family = 2, sin6_port = 33315, sin6_flowinfo = 185086474, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, 
                  __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, 
              __ss_padding = "#\202\n2\b\v", '\000' <repeats 111 times>, __ss_align = 0}}, bind_address = 0x7fee0b9b2d88, proto = 1 '\001', 
          proto_pad0 = 0 '\000', proto_pad1 = 0}
        evp = {data = 0x0, obuf = {s = 0x0, len = 0}, rcv = 0x0, dst = 0x0, req = 0x0, rpl = 0x0, rplcode = 0, mode = 0}
        printbuf = "\000\000\000\000\r\000\000\000\220\356\234\v\356\177", '\000' <repeats 90 times>, "L\206G\311\354\177\000\000\260\270A\000\000\000\000\000\220&\036\005\374\177", '\000' <repeats 18 times>, "`\035\036\005\374\177\000\000XxT", '\000' <repeats 93 times>...
---Type <return> to continue, or q <return> to quit---
        i = -1
        j = 112640
        l = 1
        __FUNCTION__ = "udp_rcv_loop"
#9  0x0000000000429d17 in main_loop () at main.c:1683
        i = 12
        pid = 0
        si = 0x7fee0b9b2d88
        si_desc = "udp receiver child=12 sock=10.50.8.11:5060 (198.136.226.11:5060)\000\000\000\000\000\000\000\000\000@\027\311\354\177\000\000\002\000\000\000\000\000\000\000\200\262G\311\354\177\000\000\060!\036\005\374\177\000\000\210\212K\000\000\000\000\000\260\270A\000\000\000\000\000h\372\340\v\356\177\000"
        nrprocs = 16
        woneinit = 1
        __FUNCTION__ = "main_loop"
#10 0x0000000000433a66 in main (argc=7, argv=0x7ffc051e2698) at main.c:2856
        cfg_stream = 0x127d010
        c = -1
        r = 0
        tmp = 0x7ffc051e3f26 ""
        tmp_len = 0
        port = 0
        proto = 0
        ahost = 0x0
        aport = 0
        options = 0x7d2498 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 2273675507
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x0
        p = 0x0
        st = {st_dev = 23, st_ino = 30676, st_nlink = 2, st_mode = 16877, st_uid = 0, st_gid = 5001, __pad0 = 0, st_rdev = 0, st_size = 40, 
          st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1598866491, tv_nsec = 369804494}, st_mtim = {tv_sec = 1598866491, 
---Type <return> to continue, or q <return> to quit---
            tv_nsec = 360804455}, st_ctim = {tv_sec = 1598866491, tv_nsec = 369804494}, __unused = {0, 0, 0}}
        tbuf = '\000' <repeats 88 times>, "p\342\252\000\000\000\000\000\260\270A\000\000\000\000\000\220&\036\005\374\177", '\000' <repeats 26 times>, "\036\237\201+\356\177\000\000\001", '\000' <repeats 23 times>, "\340\363\322*\356\177\000\000`&\036\005\374\177\000\000*\033\202+\356\177\000\000\034\000\000\000\000\000\000\000\000"...
        option_index = 0
        long_options = {{name = 0x7d468f "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x7cfc94 "version", has_arg = 0, flag = 0x0, 
            val = 118}, {name = 0x7d4694 "alias", has_arg = 1, flag = 0x0, val = 1024}, {name = 0x7d469a "subst", has_arg = 1, flag = 0x0, 
            val = 1025}, {name = 0x7d46a0 "substdef", has_arg = 1, flag = 0x0, val = 1026}, {name = 0x7d46a9 "substdefs", has_arg = 1, flag = 0x0, 
            val = 1027}, {name = 0x7d46b3 "server-id", has_arg = 1, flag = 0x0, val = 1028}, {name = 0x7d46bd "loadmodule", has_arg = 1, flag = 0x0, 
            val = 1029}, {name = 0x7d46c8 "modparam", has_arg = 1, flag = 0x0, val = 1030}, {name = 0x7d46d1 "log-engine", has_arg = 1, flag = 0x0, 
            val = 1031}, {name = 0x7d46dc "debug", has_arg = 1, flag = 0x0, val = 1032}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __FUNCTION__ = "main"
(gdb) 
(gdb) 
(gdb) 
(gdb) 
(gdb) 
(gdb) 
(gdb) 
(gdb) 
(gdb) frame 0
#0  0x00007fee0197dc7c in ka_options_callback (t=0x7fecc97f8450, type=1024, ps=0x7ffc051e10f0) at keepalive_core.c:126
126                             ka_dest->statechanged_clb(&ka_dest->uri, state, ka_dest->user_attr);
(gdb) p *ka_dest
$1 = {uri = {s = 0x113c98b000d <Address 0x113c98b000d out of bounds>, len = 0}, owner = {s = 0x7fecc98b1380 "uuid", len = 4}, flags = -913632379, 
  state = 32748, last_checked = 54, last_up = 4779817840673387893, last_down = 1598866510, counter = 0, ping_interval = 1316440388, 
  user_attr = 0x597855444f304544, statechanged_clb = 0x4d695657596b686a, response_clb = 0x4e786b7a4e30516a, sock = 0x4f7a4d324d355132, ip_address = {
    af = 3031364, len = 0, u = {addrl = {3233857728, 2882400237}, addr32 = {3233857728, 0, 2882400237, 0}, addr16 = {49344, 49344, 0, 0, 61421, 
        43981, 0, 0}, addr = "\300\300\300\300\000\000\000\000\355\357ͫ\000\000\000"}}, port = 0, proto = 0, timer = 0x7fecc98b1580, next = 0x70}
(gdb) 

@miconda
Copy link
Member

miconda commented Aug 31, 2020

Can you make a minimal kamailio config to reproduce this issue? The uri field is compromised, maybe due to a free/destroy of the keepalive structure. Are you adding/removing ka destinations on the fly/during runtime?

@NGSegovia
Copy link
Contributor

I'd say the reporter is adding the destination from another module, and not from .cfg. statechanged_clb should be null (and not run) when the destination is added through cfg.

@miconda
Copy link
Member

miconda commented Aug 31, 2020

@NGSegovia - it is what I asked in my first comments, based on what I saw in the code.

But the last backtrace shows also URI field being invalid, so maybe the ka structure is compromised/overwritten -- like already freed/destroyed while the keepalive was ongoing, and another process got the memory zone and wrote there. That's why I asked to see if the destination is eventually removed from the KA list. I am not the developer of the module, but does not look like there is a protection for this case (provided that destinations can be removed at any time)...

@NGSegovia
Copy link
Contributor

In principle, there is a lock used for destination list: https://github.com/kamailio/kamailio/blob/master/src/modules/keepalive/keepalive_api.c#L178
and when freeing the destination the timer is being stopped, https://github.com/kamailio/kamailio/blob/master/src/modules/keepalive/keepalive_api.c#L255

mmm, I wonder why URI seems corrupted, but didn't get to corrupt "owner" in the backtrace.

@miconda
Copy link
Member

miconda commented Aug 31, 2020

The timer for keepalive is stopped, but here it seems to come a pointer which was stored inside sip transaction structure and used when the transaction completes (on 408 timeout or 200ok response). If between the options and final response the ka destination structure is destroyed, then the tm callback will provide a pointer to an address that can be used for something else at that moment.

In such case it is better to store in transaction a unique id for ka destination structure and on tm transaction callback to search and see if it is found, locking the structure while is updated.

@sagarmalam
Copy link
Author

@miconda , Yes i am adding and removing on the fly. I am adding(using ka_add_destination) to KA list on successful registration and removing(using ka_del_destination) when registration expires.


#####################KEEPALIVE#######################################
modparam("keepalive", "ping_interval", 20)
modparam("keepalive", "delete_counter", 5)
modparam("keepalive", "ping_from", "sip:keepalive@fromsbc")

request_route {

##ON RECIEVING OPTIONS FROM KA MODULE ON 9090 PORT, Set $du using handle_ruri_alias , replace tp with transport and relay it to edge proxy from which we recieved register packet
                if ($Rp == "9090" && is_method("OPTIONS")){
                        if (t_newtran()) {
                                if (handle_ruri_alias()){
                                        $avp(request_uri)= $ru ;
                                        $avp(pb-ip)=$(avp(request_uri){param.value,pb-ip});
                                        if (is_ipv6($avp(pb-ip))) {
                                                if (subst_uri('/^sip:(.*)@(.*);pb-ip=(.*);pb-pt=(.*);tp=(.*)$/sip:\1@[\3]:\4;pb-ip=\3;pb-pt=\4;transport=\5/i')){
                                                          xlog("L_INFO","(KA OPTIONS) : ($avp(uuid)) : Destination URI <$du> created from alias.");
                                                         t_relay();
                                              }

                                        } else {
                                                if (subst_uri('/^sip:(.*)@(.*);pb-ip=(.*);pb-pt=(.*);tp=(.*)$/sip:\1@\3:\4;pb-ip=\3;pb-pt=\4;transport=\5/i')){
                                                          xlog("L_INFO","(KA OPTIONS) : ($avp(uuid)) : Destination URI <$du> created from alias.");
                                                         t_relay();
                                                }

                                        }

                                }

                                exit;
                        }

$du=sip:APPLICATION_SERVER_IP:5060; ##IP of application server ( Freeswitch) to which kamailio will relay register packet add_contact_alias();
t_on_reply("reply_register")
if (!t_relay()) {
                sl_reply_error();
        }
exit;
} 
onreply_route[reply_register ]{
 if ((status=~"200") && $rm=="REGISTER" && $(avp(c_ct){param.value,expires}) != 0)
        {
## On receiving fresh registration (i.e with expires !=0) , Kamailio will register a contact address for that registration in KA module ping registry.
        $avp(ka_contact) = 'sip:'+$fU+'@LOCAL_IP:9090;alias='+$(avp(c_ct){param.value,alias})+';pb-ip='+$(avp(c_ct){param.value,pb-ip})+';pb-pt='+$(avp(c_ct){param.value,pb-pt})+';tp='+$(avp(c_ct){param.value,transport}{s.rm,>});
        ka_add_destination("$avp(ka_contact)","config");
} 

if ((status=~"200") && $rm=="REGISTER" && $(avp(c_ct){param.value,expires}) == 0)
 {

        $avp(ka_contact) = 'sip:'+$fU+'@LOCAL_IP:9090;alias='+$(avp(c_ct){param.value,alias})+';pb-ip='+$(avp(c_ct){param.value,pb-ip})+';pb-pt='+$(avp(c_ct){param.value,pb-pt})+';tp='+$(avp(c_ct){param.value,transport}{s.rm,>});
        ka_del_destination("$avp(ka_contact)","config");
}

}

@NGSegovia , I am adding destinations from CFG file.

@miconda
Copy link
Member

miconda commented Sep 1, 2020

@NGSegovia - not related to the issue itself, but is there any specific reason to do this complex logic for keepalives? If it is for registrations, the usrloc module in kamailio v5.4.x can do keepalive (storing also the round trip time).

Otherwise, the issue seems to be related to keeping the pointer to keepalive destinations structure in the transaction and can be destroyed until transaction completes. If no other devel looks at it, I will try to get familiar with the code of the module and come up with a fix after Kamailio World Online event.

@sagarmalam
Copy link
Author

I preferred this module over usrloc module because i dont store registrations on kamailio. It is stored on application server behind Kamailio. So it was easy for me to use keepalive module to send KA instead of usrloc.

@NGSegovia
Copy link
Contributor

@miconda the problem you pointed is the same pointed in this message, right? https://github.com/kamailio/kamailio/blob/master/src/modules/keepalive/keepalive_core.c#L102

BTW I did some code, hopefully fixing this, but I didn't cleaned-up and test properly yet, but you can take a look. https://github.com/NGSegovia/kamailio/tree/keepalive_fix_race_condition_tm_request

@miconda miconda added the bug label Sep 1, 2020
@sagarmalam
Copy link
Author

@NGSegovia , The changes you made has fixed the issue. Kamailio service is running since last 2 hours and I am not able to reproduce issue.
Thanks so much.

@miconda , I understand you are busy with Kamailio world, Is there any chances for this fix to be available in next stable release ?

@miconda
Copy link
Member

miconda commented Sep 1, 2020

@NGSegovia - yes, that kind of problem.

The approach in the fix is correct, just wondering if worth adding dependency on external libuuid library instead of using the internal core/utils/sruid.{c,h}, like it is done for other modules. I planned to extend the internal sruid code to be able to use libuuid via the uuid module, can be something to consider in this development cycle.

By using the internal sruid backporting will be smooth, because there is no extra dependency, so it does not break packaging (debs, rpms, ... stay the same).

@NGSegovia
Copy link
Contributor

Not a problem @miconda , updated.

@sagarmalam
Copy link
Author

@NGSegovia , I tested latest changes under
NGSegovia@aaded16

Now it is crashing again but from different place :

(gdb) bt full
#0  0x00007f3153cd82cc in __strncpy_ssse3 () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f312a7d5132 in ka_str_copy (src=0x7f2ff2441b38, dest=0x7f2ff2416380, prefix=0x0) at keepalive_core.c:203
        lp = 0
        __FUNCTION__ = "ka_str_copy"
#2  0x00007f312a7d2d19 in ka_check_timer (ticks=931009473, tl=0x7f2ff2416618, param=0x7f2ff2416360) at keepalive_core.c:71
        ka_dest = 0x7f2ff2416360
        ka_ping_method = {s = 0x7f312a7d9740 "OPTIONS", len = 7}
        ka_outbound_proxy = {s = 0x0, len = 0}
        uac_r = {method = 0x140b788ad20, headers = 0x7f2ff24162c8, body = 0x0, ssock = 0x0, ssockname = 0x0, dialog = 0x0, cb_flags = 1024, 
          cb = 0x7f312a7d3174 <ka_options_callback>, cbp = 0x7f2ff243f378, callid = 0x0}
        __FUNCTION__ = "ka_check_timer"
        uuid = 0x7f2ff2441b38
#3  0x00000000004c3517 in slow_timer_main () at core/timer.c:1105
        n = 12
        ret = 320
        tl = 0x7f2ff2416618
        i = 1
        __FUNCTION__ = "slow_timer_main"
#4  0x000000000042a5f3 in main_loop () at main.c:1737
        i = 16
        pid = 0
        si = 0x0
        si_desc = "udp receiver child=15 sock=10.50.8.11:9090\000:0:0:11]:5060\000\061:5060)\000\000\000\000\000\000\000\000\000\220\374\361/\177\000\000\002\000\000\000\000\000\000\000\360\004-\362/\177\000\000\060\260\210\267\375\177\000\000\322CL\000\000\000\000\000\340\271A\000\000\000\000\000\250=\306\064\061\177\000"
        nrprocs = 16
        woneinit = 1
        __FUNCTION__ = "main_loop"
#5  0x0000000000433ba5 in main (argc=7, argv=0x7ffdb788b598) at main.c:2861
        cfg_stream = 0x27e1010
        c = -1
        r = 0
        tmp = 0x7ffdb788bf26 ""
        tmp_len = 0
---Type <return> to continue, or q <return> to quit---
        port = 0
        proto = 0
        ahost = 0x0
        aport = 0
        options = 0x7cfaa0 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 862902793
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x0
        p = 0x0
        st = {st_dev = 23, st_ino = 30676, st_nlink = 2, st_mode = 16877, st_uid = 0, st_gid = 5001, __pad0 = 0, st_rdev = 0, st_size = 40, 
          st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1599034463, tv_nsec = 259945261}, st_mtim = {tv_sec = 1599034433, 
            tv_nsec = 160818732}, st_ctim = {tv_sec = 1599034463, tv_nsec = 259945261}, __unused = {0, 0, 0}}
        tbuf = '\000' <repeats 88 times>, "p\262\252\000\000\000\000\000\340\271A\000\000\000\000\000\220\265\210\267\375\177", '\000' <repeats 26 times>, "\036\017gT1\177\000\000\001", '\000' <repeats 23 times>, "\340c\270S1\177\000\000`\265\210\267\375\177\000\000*\213gT1\177\000\000\034\000\000\000\000\000\000\000"...
        option_index = 0
        long_options = {{name = 0x7d1c8f "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x7cd2a1 "version", has_arg = 0, flag = 0x0, 
            val = 118}, {name = 0x7d1c94 "alias", has_arg = 1, flag = 0x0, val = 1024}, {name = 0x7d1c9a "subst", has_arg = 1, flag = 0x0, 
            val = 1025}, {name = 0x7d1ca0 "substdef", has_arg = 1, flag = 0x0, val = 1026}, {name = 0x7d1ca9 "substdefs", has_arg = 1, flag = 0x0, 
            val = 1027}, {name = 0x7d1cb3 "server-id", has_arg = 1, flag = 0x0, val = 1028}, {name = 0x7d1cbd "loadmodule", has_arg = 1, flag = 0x0, 
            val = 1029}, {name = 0x7d1cc8 "modparam", has_arg = 1, flag = 0x0, val = 1030}, {name = 0x7d1cd1 "log-engine", has_arg = 1, flag = 0x0, 
            val = 1031}, {name = 0x7d1cdc "debug", has_arg = 1, flag = 0x0, val = 1032}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __FUNCTION__ = "main"

@NGSegovia
Copy link
Contributor

@sagarmalam I have updated the PR, please check again.

Thanks!

@sagarmalam
Copy link
Author

@NGSegovia , It is working fine now. Thanks a lot.

Question :
As a developer do you think this module can handle load of sending keepalives to 50 K destinations every 20 seconds ? or you suggest to use ursloc module instead for such a load ?

@NGSegovia
Copy link
Contributor

No idea @sagarmalam

@sagarmalam
Copy link
Author

ok. Thanks. I will share my results.

@sagarmalam
Copy link
Author

Hello Again,

I was able to reproduce again. It crashed from the same point and it seems that frequency of this issue has reduced but it is not fixed yet.
As per my observation, it happens when entry of destination from KA list is removed during an ongoing KA transaction. I am removing destination from KA list when endpoint de registers so this case is possible. I have also observed same on pcaps. Below are the screenshots :

OPTIONS :
Screenshot from 2020-09-04 23-02-14

REGISTER :
Screenshot from 2020-09-04 23-01-25

Kamailio has dispatched 200 OK for register at 06:33:58.187520 and it received 200 OK for OPTIONS at 06:33:58.411501. Kamailio crashed exactly at a same time.

Core Dump :

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/local/sbin/kamailio...done.
[New LWP 12701]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/local/sbin/kamailio -m 5000 -M 500 -P /run/kamailio/kamailio.pid'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f09c138dc7c in ka_options_callback (t=0x7f0896cc1530, type=1024, ps=0x7ffdb22e9d20) at keepalive_core.c:126
126                             ka_dest->statechanged_clb(&ka_dest->uri, state, ka_dest->user_attr);
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-222.el7.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-19.el7.x86_64 libcom_err-1.42.9-12.el7_5.x86_64 libcurl-7.29.0-51.el7.x86_64 libevent-2.0.21-4.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libidn-1.28-4.el7.x86_64 libselinux-2.5-12.el7.x86_64 libssh2-1.4.3-10.el7_2.1.x86_64 libstdc++-4.8.5-36.el7_6.2.x86_64 libuuid-2.23.2-52.el7_5.1.x86_64 mariadb-libs-5.5.60-1.el7_5.x86_64 nspr-4.19.0-1.el7_5.x86_64 nss-3.36.0-5.el7_5.x86_64 nss-softokn-freebl-3.36.0-5.el7_5.x86_64 nss-util-3.36.0-1.el7_5.x86_64 openldap-2.4.44-15.el7_5.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt full
#0  0x00007f09c138dc7c in ka_options_callback (t=0x7f0896cc1530, type=1024, ps=0x7ffdb22e9d20) at keepalive_core.c:126
        uri = {
          s = 0x7f0896cc39b7 "sip:7459@10.50.8.6:9090;alias=10.50.8.1~5060~1;pb-ip=174.198.221.83;pb-pt=3939;tp=tcp>\r\nFrom: <sip:keepalive@fromsbc>;tag=ee3b67b160621ee354865fd8162b5b29-1b81a45f\r\nCSeq: 10 OPTIONS\r\nCall-ID: 2557bb53"..., len = 85}
        msg = 0x0
        state = 1
        state_routes = {0x7f09c1393844 "", 0x7f09c139393d "keepalive:dst-up", 0x7f09c139394e "keepalive:dst-down"}
        ka_dest = 0x7f088ac83048
        __FUNCTION__ = "ka_options_callback"
#1  0x00007f09c8321dd8 in run_trans_callbacks_internal (cb_lst=0x7f0896cc15a8, type=1024, trans=0x7f0896cc1530, params=0x7ffdb22e9d20)
    at t_hooks.c:258
        cbp = 0x7f0896cc3b60
        backup_from = 0xb1f170 <def_list+16>
        backup_to = 0xb1f178 <def_list+24>
        backup_dom_from = 0xb1f180 <def_list+32>
        backup_dom_to = 0xb1f188 <def_list+40>
        backup_uri_from = 0xb1f160 <def_list>
        backup_uri_to = 0xb1f168 <def_list+8>
        backup_xavps = 0xb1ea60 <_xavp_list_head>
        backup_xavus = 0xb1ea68 <_xavu_list_head>
        backup_xavis = 0xb1ea70 <_xavi_list_head>
        __FUNCTION__ = "run_trans_callbacks_internal"
#2  0x00007f09c8321f0a in run_trans_callbacks (type=1024, trans=0x7f0896cc1530, req=0x0, rpl=0x7f09cb832030, code=200) at t_hooks.c:285
        params = {req = 0x0, rpl = 0x7f09cb832030, param = 0x7f0896cc3b70, code = 200, flags = 0, branch = 0, t_rbuf = 0x0, dst = 0x0, send_buf = {
            s = 0x0, len = 0}}
#3  0x00007f09c82b2bef in local_reply (t=0x7f0896cc1530, p_msg=0x7f09cb832030, branch=0, msg_status=200, cancel_data=0x7ffdb22ea110)
    at t_reply.c:2265
        local_store = 0
        local_winner = 0
        reply_status = RPS_COMPLETED
        winning_msg = 0x7f09cb832030
        winning_code = 200
        totag_retr = 0
        __FUNCTION__ = "local_reply"
#4  0x00007f09c82b5a8b in reply_received (p_msg=0x7f09cb832030) at t_reply.c:2648
---Type <return> to continue, or q <return> to quit---
        msg_status = 200
        last_uac_status = 0
        ack = 0x5a62bc <sr_event_exec+415> "\211E\374\213E\374\351)\003"
        ack_len = 2989400432
        branch = 0
        reply_status = -881216520
        onreply_route = 0
        cancel_data = {cancel_bitmap = 0, reason = {cause = 200, u = {text = {s = 0x0, len = 11245581}, e2e_cancel = 0x0, packed_hdrs = {s = 0x0, 
                len = 11245581}}}}
        uac = 0x7f0896cc17c0
        t = 0x7f0896cc1530
        lack_dst = {send_sock = 0xab9917 <buf.7136+791>, to = {s = {sa_family = 38925, sa_data = "\253\000\000\000\000\000p\241.\262\375\177\000"}, 
            sin = {sin_family = 38925, sin_port = 171, sin_addr = {s_addr = 0}, sin_zero = "p\241.\262\375\177\000"}, sin6 = {sin6_family = 38925, 
              sin6_port = 171, sin6_flowinfo = 0, sin6_addr = {__in6_u = {__u6_addr8 = "p\241.\262\375\177\000\000\001\344o\000\000\000\000", 
                  __u6_addr16 = {41328, 45614, 32765, 0, 58369, 111, 0, 0}, __u6_addr32 = {2989400432, 32765, 7332865, 0}}}, sin6_scope_id = 0}, 
            sas = {ss_family = 38925, 
              __ss_padding = "\253\000\000\000\000\000p\241.\262\375\177\000\000\001\344o", '\000' <repeats 13 times>, "\370\263y\313\t\177\000\000\000)\205\000\000\000\000\000\375\227\253\000\207\000\000\000\324L\205\000\000\000\000\000(\325w\313\t\177\000\000\027\231\253\000\000\000\000\000\003\230\253\000\000\000\000\000\060\241.\262\375\177\000\000P\262y\313\t\177\000\000\027\231\253\000\000\000\000\000\020\226\253\000\000\000\000", 
              __ss_align = 139680045183032}}, id = 0, send_flags = {f = 0, blst_imask = 0}, proto = -123 '\205', proto_pad0 = -105 '\227', 
          proto_pad1 = 171}
        backup_user_from = 0x53c7a0 <register_module+4320>
        backup_user_to = 0x521ec0 <check_self_port+680>
        backup_domain_from = 0x3c6b40
        backup_domain_to = 0x7365ed <get_hdr_field+6416>
        backup_uri_from = 0x19
        backup_uri_to = 0x8
        backup_xavps = 0x7ffdb22ea320
        backup_xavus = 0x0
        backup_xavis = 0x0
        replies_locked = 1
        branch_ret = -881246328
        prev_branch = 0
        blst_503_timeout = 7579963
        hf = 0x852900
---Type <return> to continue, or q <return> to quit---
        onsend_params = {req = 0x7ffdb22ea170, rpl = 0x6c65c9 <qm_malloc+2479>, param = 0x852900, code = 7560264, flags = 0, branch = 0, 
          t_rbuf = 0x3c6298, dst = 0x3192, send_buf = {s = 0x41b980 <_start> "1\355I\211\321^H\211\342H\203\344\360PTI\307\300`\313|", 
            len = -1305562432}}
        ctx = {rec_lev = 0, run_flags = 0, last_retcode = 1, jmp_env = {{__jmpbuf = {12690, 139680045183568, 8726784, 7141200, 140727592853440, 
                139680040542224, 139680045061360, 139680040542224}, __mask_was_saved = -1305567232, __saved_mask = {__val = {139679877548472, 
                  11245445, 0, 524288000, 7315946, 3959664, 5381872, 5490592, 8, 139680045061456, 139680045061440, 140727592853552, 5923516, 0, 
                  21483449424, 0}}}}}
        bctx = 0x7f09cb832030
        keng = 0x0
        ret = 0
        evname = {s = 0x7f09c8354ffc "on_sl_reply", len = 11}
        __FUNCTION__ = "reply_received"
#5  0x0000000000526903 in do_forward_reply (msg=0x7f09cb832030, mode=0) at core/forward.c:757
        new_buf = 0x0
        dst = {send_sock = 0x0, to = {s = {sa_family = 0, sa_data = '\000' <repeats 13 times>}, sin = {sin_family = 0, sin_port = 0, sin_addr = {
                s_addr = 0}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {sin6_family = 0, sin6_port = 0, sin6_flowinfo = 0, sin6_addr = {
                __in6_u = {__u6_addr8 = '\000' <repeats 15 times>, __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, 
              sin6_scope_id = 0}, sas = {ss_family = 0, __ss_padding = '\000' <repeats 117 times>, __ss_align = 0}}, id = 0, send_flags = {f = 0, 
            blst_imask = 0}, proto = 0 '\000', proto_pad0 = 0 '\000', proto_pad1 = 0}
        new_len = 1
        r = 1
        ip = {af = 4307328, len = 0, u = {addrl = {12690, 4259919}, addr32 = {12690, 0, 4259919, 0}, addr16 = {12690, 0, 0, 0, 79, 65, 0, 0}, 
            addr = "\222\061\000\000\000\000\000\000O\000A\000\000\000\000"}}
        s = 0x7f09cb832860 ""
        len = 0
        __FUNCTION__ = "do_forward_reply"
#6  0x00000000005285d5 in forward_reply (msg=0x7f09cb832030) at core/forward.c:858
No locals.
#7  0x00000000005bc8c2 in receive_msg (
    buf=0xab9600 <buf.7136> "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.6:5060;received=10.50.8.6;branch=z9hG4bK56d5.462830a6", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nContact: <sip:100.103.40.117:38"..., len=791, rcv_info=0x7ffdb22eaa40)
    at core/receive.c:509
        msg = 0x7f09cb832030
        ctx = {rec_lev = 0, run_flags = 0, last_retcode = 0, jmp_env = {{__jmpbuf = {0, 0, 0, 0, 0, 0, 0, 0}, __mask_was_saved = 0, __saved_mask = {
                __val = {0 <repeats 13 times>, 12690, 139680578527568, 126}}}}}
---Type <return> to continue, or q <return> to quit---
        bctx = 0x0
        ret = -1
        tvb = {tv_sec = 0, tv_usec = 0}
        tve = {tv_sec = 0, tv_usec = 0}
        diff = 0
        inb = {
          s = 0xab9600 <buf.7136> "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.6:5060;received=10.50.8.6;branch=z9hG4bK56d5.462830a6", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nContact: <sip:100.103.40.117:38"..., len = 791}
        netinfo = {data = {s = 0x0, len = 0}, rcv = 0x0, dst = 0x0}
        keng = 0x0
        evp = {data = 0x7ffdb22ea570, obuf = {s = 0x0, len = 0}, rcv = 0x7ffdb22eaa40, dst = 0x0, req = 0x0, rpl = 0x0, rplcode = 0, mode = 0}
        cidlockidx = 0
        cidlockset = 0
        errsipmsg = 0
        exectime = 0
        __FUNCTION__ = "receive_msg"
#8  0x00000000004bb55c in udp_rcv_loop () at core/udp_server.c:543
        len = 791
        buf = "SIP/2.0 200 OK\r\nVia: SIP/2.0/UDP 198.136.226.6:5060;received=10.50.8.6;branch=z9hG4bK56d5.462830a6", '0' <repeats 24 times>, ".0\r\nRecord-Route: <sip:198.136.226.1:5060;lr>\r\nContact: <sip:100.103.40.117:38"...
        tmp = 0x7f0888bbe520 ""
        fromaddr = 0x7f09cb79d880
        fromaddrlen = 16
        rcvi = {src_ip = {af = 2, len = 4, u = {addrl = {101200394, 0}, addr32 = {101200394, 0, 0, 0}, addr16 = {12810, 1544, 0, 0, 0, 0, 0, 0}, 
              addr = "\n2\b\006", '\000' <repeats 11 times>}}, dst_ip = {af = 2, len = 4, u = {addrl = {101200394, 0}, addr32 = {101200394, 0, 0, 
                0}, addr16 = {12810, 1544, 0, 0, 0, 0, 0, 0}, addr = "\n2\b\006", '\000' <repeats 11 times>}}, src_port = 9090, dst_port = 5060, 
          proto_reserved1 = 0, proto_reserved2 = 0, src_su = {s = {sa_family = 2, sa_data = "#\202\n2\b\006\000\000\000\000\000\000\000"}, sin = {
              sin_family = 2, sin_port = 33315, sin_addr = {s_addr = 101200394}, sin_zero = "\000\000\000\000\000\000\000"}, sin6 = {
              sin6_family = 2, sin6_port = 33315, sin6_flowinfo = 101200394, sin6_addr = {__in6_u = {__u6_addr8 = '\000' <repeats 15 times>, 
                  __u6_addr16 = {0, 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id = 0}, sas = {ss_family = 2, 
              __ss_padding = "#\202\n2\b\006", '\000' <repeats 111 times>, __ss_align = 0}}, bind_address = 0x7f09cb3c2d98, proto = 1 '\001', 
          proto_pad0 = 0 '\000', proto_pad1 = 0}
        evp = {data = 0x0, obuf = {s = 0x0, len = 0}, rcv = 0x0, dst = 0x0, req = 0x0, rpl = 0x0, rplcode = 0, mode = 0}
        printbuf = "\000\000\000\000\a\000\000\000\240\356=\313\t\177", '\000' <repeats 90 times>, "Tm\032\211\b\177\000\000\200\271A\000\000\000\000\000\300\262.\262\375\177", '\000' <repeats 18 times>, "\220\251.\262\375\177\000\000\006=T", '\000' <repeats 93 times>...
---Type <return> to continue, or q <return> to quit---
        i = -1
        j = 112640
        l = 1
        __FUNCTION__ = "udp_rcv_loop"
#9  0x0000000000429de7 in main_loop () at main.c:1683
        i = 6
        pid = 0
        si = 0x7f09cb3c2d98
        si_desc = "udp receiver child=6 sock=10.50.8.6:5060 (198.136.226.6:5060)\000\000\000\004\000\000\000\000\000\000\000\000@\270\210\b\177", '\000' <repeats 11 times>, "\235\032\211\b\177\000\000`\255.\262\375\177\000\000mCL\000\000\000\000\000\200\271A\000\000\000\000\000PH\201\313\t\177\000"
        nrprocs = 16
        woneinit = 1
        __FUNCTION__ = "main_loop"
#10 0x0000000000433b40 in main (argc=7, argv=0x7ffdb22eb2c8) at main.c:2859
        cfg_stream = 0x1cfb010
        c = -1
        r = 0
        tmp = 0x7ffdb22ebf26 ""
        tmp_len = 0
        port = 0
        proto = 0
        ahost = 0x0
        aport = 0
        options = 0x7cf8a0 ":f:cm:M:dVIhEeb:l:L:n:vKrRDTN:W:w:t:u:g:P:G:SQ:O:a:A:x:X:Y:"
        ret = -1
        seed = 1260081917
        rfd = 4
        debug_save = 0
        debug_flag = 0
        dont_fork_cnt = 0
        n_lst = 0x7f09ea8b2a00 <intel_02_known>
        p = 0x0
        st = {st_dev = 23, st_ino = 36410, st_nlink = 2, st_mode = 16877, st_uid = 0, st_gid = 5001, __pad0 = 0, st_rdev = 0, st_size = 40, 
          st_blksize = 4096, st_blocks = 0, st_atim = {tv_sec = 1599228627, tv_nsec = 605193865}, st_mtim = {tv_sec = 1599228597, 
            tv_nsec = 485081596}, st_ctim = {tv_sec = 1599228627, tv_nsec = 605193865}, __unused = {0, 0, 0}}
---Type <return> to continue, or q <return> to quit---
        tbuf = "\340\363s\352\t\177\000\000ȔB\353\t\177", '\000' <repeats 90 times>, "p\262\252\000\000\000\000\000\200\271A\000\000\000\000\000\300\262.\262\375\177", '\000' <repeats 26 times>, "\036\237\"\353\t\177\000\000\001", '\000' <repeats 23 times>, "\340\363s\352\t\177\000\000\220\262.\262\375\177\000\000*"...
        option_index = 0
        long_options = {{name = 0x7d1a8f "help", has_arg = 0, flag = 0x0, val = 104}, {name = 0x7cd0a1 "version", has_arg = 0, flag = 0x0, 
            val = 118}, {name = 0x7d1a94 "alias", has_arg = 1, flag = 0x0, val = 1024}, {name = 0x7d1a9a "subst", has_arg = 1, flag = 0x0, 
            val = 1025}, {name = 0x7d1aa0 "substdef", has_arg = 1, flag = 0x0, val = 1026}, {name = 0x7d1aa9 "substdefs", has_arg = 1, flag = 0x0, 
            val = 1027}, {name = 0x7d1ab3 "server-id", has_arg = 1, flag = 0x0, val = 1028}, {name = 0x7d1abd "loadmodule", has_arg = 1, flag = 0x0, 
            val = 1029}, {name = 0x7d1ac8 "modparam", has_arg = 1, flag = 0x0, val = 1030}, {name = 0x7d1ad1 "log-engine", has_arg = 1, flag = 0x0, 
            val = 1031}, {name = 0x7d1adc "debug", has_arg = 1, flag = 0x0, val = 1032}, {name = 0x0, has_arg = 0, flag = 0x0, val = 0}}
        __FUNCTION__ = "main"

Thanks

@sagarmalam
Copy link
Author

Hello
Please ignore above comment.Fix is working.I picked up wrong branch while testing it. Sorry for confusion.
Thanks

@sagarmalam
Copy link
Author

@NGSegovia , While testing latest fixes we have found that Keepalive module is leaking memory. We are testing with 7K destinations with time interval of 20 seconds
This log file has output of : kamcmd mod.stats keepalive shm for every sec :
keepalive_memory_consumption.log

At Wed Sep 9 09:02:07 GMT 2020 it is 3 MB
At Wed Sep 9 09:33:47 GMT 2020 it is 47 MB
So 42 MB consumed within 30 mins and growing.

Kamailio core dumped after running for 24 hours with 5GB SHM memory.
core dump :
coredump.txt

Kamailio logs :
kamailio.log

Please get back to me in case you need any more information.
Thanks a lot

sagarmalam referenced this issue in NGSegovia/kamailio Sep 9, 2020
… struct to tm

- This avoids a race condition that may happen processing tm_request callbacl
- Allows to identify uniquely a ka_dest record
@NGSegovia
Copy link
Contributor

@sagarmalam merged PR fixing this problem. Re-open if there are still issues.

@sagarmalam
Copy link
Author

@NGSegovia , Thanks for all the efforts in fix this issue. I think i have found regression issue. KA timeout parameter is not working anymore with latest changes. It keeps on sending KA even after failing for number of times defined in ka_timeout mod param.

@NGSegovia
Copy link
Contributor

Hi @sagarmalam , do you mean delete_counter parameter? I tried and seems to work for me (The truth is that if I set it to 3 I get 4 attempts but then stops).

BTW id you set that parameter to "0" means no limit. Not sure if that is the case.

Anyway, I can't find the relation with changes done here and this parameter. Can you try again a open a new issue with more information if you still have problems?

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants