Infinite loop inside htable module during dmq synch #1863

Closed
fnurglewitz opened this issue Feb 21, 2019 · 10 comments

@fnurglewitz

Description

When restarting kamailio nodes in our infrastructure, we noticed that under traffic some nodes started using 100% of the CPU. With the precious help of @giavac, we were able to track the issue down to an infinite loop inside the htable module when synchronizing fairly large (60K+) htables via dmq.

Troubleshooting

Reproduction

Have one kamailio instance with a 60K+ htable and start a new instance. The first instance will try to send the whole table to the new one and will enter an infinite loop that consumes 100% of the CPU.

This is caused by a double call to ht_dmq_cell_group_flush, which creates a circular list in the json structure hierarchy; the second call happens in this block of code (which is why a 60K+ htable is required to trigger it):
https://github.com/kamailio/kamailio/blob/5.2/src/modules/htable/ht_dmq.c#L509

When this happens, ht_dmq_cell_group_flush tries to add ht_dmq_jdoc_cell_group.jdoc_cells to ht_dmq_jdoc_cell_group.jdoc->root, but the root already has jdoc_cells as its child,
so when srjson_AddItemToObject is called (and, in turn, srjson_AddItemToArray) the node gets appended as a child of itself:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L813

This circular structure then causes an infinite loop when srjson_PrintUnformatted is called, because the print_object function iterates over the circular list:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L679
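
To illustrate the mechanism, here is a minimal, self-contained sketch of the failure mode; the node type and function names below are simplified stand-ins for the srjson structures, srjson_AddItemToArray and print_object, not the actual library code:

/* Minimal sketch, assuming a simplified node type; the names are
 * illustrative stand-ins, not the real srjson API. */
#include <stdio.h>

typedef struct node {
  struct node *next;
  struct node *child;
  const char *name;
} node_t;

/* Append 'item' at the end of array->child, like srjson_AddItemToArray. */
static void add_item_to_array(node_t *array, node_t *item)
{
  node_t *c = array->child;
  if (c == NULL) {
    array->child = item;
    return;
  }
  while (c->next != NULL)
    c = c->next;
  c->next = item; /* appending a node already in the list creates a cycle */
}

/* Walk the children, like print_object iterating over an object's members. */
static void print_children(node_t *array)
{
  node_t *c;
  for (c = array->child; c != NULL; c = c->next)
    printf("%s\n", c->name); /* never terminates once the list is circular */
}

int main(void)
{
  node_t root = { .name = "root" };
  node_t cells = { .name = "cells" };

  add_item_to_array(&root, &cells); /* first flush: cells becomes root's child */
  add_item_to_array(&root, &cells); /* second flush: cells->next = &cells */
  print_children(&root);            /* spins forever, pinning the CPU */
  return 0;
}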

Possible Solutions

One possible solution could be to destroy and re-initialize the ht_dmq_jdoc_cell_group structure after calling the flush:

if (ht_dmq_jdoc_cell_group.size >= dmq_cell_group_max_size) {
  LM_DBG("sending group count[%d]size[%d]\n", ht_dmq_jdoc_cell_group.count, ht_dmq_jdoc_cell_group.size);
  if (ht_dmq_cell_group_flush(node) < 0) {
    ht_slot_unlock(ht, i);
    goto error;
  }
  ht_dmq_cell_group_destroy();
  ht_dmq_cell_group_init();
}

But we are not sure about the performance implications.
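
For illustration, the reset could be wrapped together with the flush in a small helper, so every flush leaves the group empty. This is only a sketch of the proposal above (the helper name is ours, and it assumes ht_dmq_cell_group_init() returns a negative value on error, like the other module functions); it is not the fix that ended up being committed:

/* Sketch only: flush the current cell group and rebuild it, so the next
 * iteration never re-attaches jdoc_cells to a root that already owns it.
 * The helper name is illustrative and not part of the module. */
static int ht_dmq_cell_group_flush_and_reset(dmq_node_t *node)
{
  int ret = ht_dmq_cell_group_flush(node);

  /* Destroy and re-init regardless of the flush result so the group
   * always starts from a fresh json document. */
  ht_dmq_cell_group_destroy();
  if (ht_dmq_cell_group_init() < 0)
    return -1;

  return ret;
}

With such a helper, the block above would call ht_dmq_cell_group_flush_and_reset(node) in place of the explicit flush/destroy/init sequence.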

Additional Information

# kamailio -v
version: kamailio 5.2.1 (x86_64/linux) 44e488
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 44e488
compiled on 11:52:58 Feb 21 2019 with gcc 5.4.0

  • Operating System:
    ubuntu:xenial docker container
    # uname -a
    Linux kama-0 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
@charlesrchance
Member

Thanks for reporting - I will take a look and fix accordingly.

charlesrchance added a commit that referenced this issue Feb 21, 2019
@charlesrchance
Member

Please can you try with the above and confirm whether or not it is fixed?

@fnurglewitz
Author

Thank you @charlesrchance, I did some tests with this setup:
kamailio.cfg (relevant lines):

fork=yes
children=1
tcp_connection_lifetime=3605
pv_buffer_size=8192

# ----- dmq params -----
modparam("dmq", "server_address", DMQ_SERVER_ADDRESS)
modparam("dmq", "notification_address", DMQ_NOTIFICATION_ADDRESS)
modparam("dmq", "multi_notify", 1)
modparam("dmq", "num_workers", 1)
modparam("dmq", "ping_interval", 15)
modparam("dmq", "worker_usleep", 1000)

# ----- htable params -----
modparam("htable", "enable_dmq", 1)
modparam("htable", "dmq_init_sync", 1)
modparam("htable", "htable", "ht=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht1=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht2=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht3=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog

#!define ONEK "really 1 k chars, believe me :)"

event_route[htable:mod-init] {
  $var(name) = POD_NAME + "\n";
  xlog("L_ALERT", "$var(name)");
  if(POD_NAME == "kama-0") {
    $var(count) = 0;
    while($var(count) < 99) {
      $sht(ht=>$var(count)) = ONEK;
      $sht(ht1=>$var(count)) = ONEK;
      $sht(ht2=>$var(count)) = ONEK;
      $sht(ht3=>$var(count)) = ONEK;
      $var(count) = $var(count)+1;
    }
  }
}

request_route {
  if ($rm == "KDMQ"){
    dmq_handle_message();
  }
  exit;
}

Started kama-0, which now has 4 htables of ~99K size each.
Started 10 kubernetes pods and launched kamailio 100 times, with a timeout of 3 seconds, on each pod.
So we have roughly 1000 kamailio instances trying to get these htables from kama-0.
I didn't see any dangerous CPU spike, and the loop doesn't happen anymore.

There's something I'm worried about, though: the memory of the DMQ worker (measured with top), which usually stays around 0.1%, is now stable at 1.4% and is not going down again.

I fear there's a memory leak somewhere, but I'm not sure where. While debugging the loop issue I had some doubts about how the json_t structures are freed, but that could just be me not knowing the code well enough; can you give us any hint so we can help you understand this issue better?

Thanks

@charlesrchance
Member

Thanks for testing so extensively!

My (limited) understanding is that the call to srjson_DeleteItemFromObject() - by virtue of its internal call to srjson_Delete() - takes care of freeing the entire "cells" structure.

I will do some testing here to be sure there is no leak but in the meantime, can you confirm whether you see the memory increase linearly in line with the number of times you launch a new Kamailio instance? For example, if you increase from 100 to 200 times, does the memory usage then settle at 2.7%?
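
For reference, the recursive-free pattern being described looks roughly like the sketch below; it is a simplified, cJSON-style illustration of a delete that walks siblings and recurses into children, not the literal srjson_Delete() implementation:

#include <stdlib.h>

/* Simplified node type, standing in for the srjson node; fields are
 * reduced to what the sketch needs. */
typedef struct jnode {
  struct jnode *next;
  struct jnode *child;
  char *string;   /* member name, if any */
} jnode_t;

/* Delete a node and everything hanging off it: each sibling in the list
 * is freed, and its subtree is freed first via recursion. This is why
 * deleting the "cells" item is expected to release the whole structure. */
static void jnode_delete(jnode_t *item)
{
  jnode_t *next;
  while (item != NULL) {
    next = item->next;
    if (item->child)
      jnode_delete(item->child);
    if (item->string)
      free(item->string);
    free(item);
    item = next;
  }
}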

@fnurglewitz
Author

Hi @charlesrchance, I tried but couldn't reproduce the issue; I'm wondering how it happened.

@fnurglewitz
Author

@charlesrchance do you need more tests before merging this fix?

@charlesrchance
Member

I've been unable to reproduce myself, so if you're also unable to then I think it's good to merge - I'll submit the PR now.

@paolovisintin

Hello,
sorry for bothering, but could you let us know what timeline to expect for this issue? Unfortunately, we have a production deployment with this latent behaviour, which can be triggered simply by restarting one instance of kamailio.

@charlesrchance
Member

@paolovisintin - as soon as the above PR (#1872) is merged (will also be backported).

charlesrchance added a commit that referenced this issue Feb 28, 2019
- reported by Enrico Bandiera (GH #1863)

(cherry picked from commit a176ad4)
@charlesrchance
Member

Fix has been merged/backported. Please reopen if still experiencing issues.
