Infinite loop inside htable module during dmq synch #1863

Closed
fnurglewitz opened this issue Feb 21, 2019 · 10 comments

@fnurglewitz

Description

When restarting kamailio nodes in our infrastructure, we noticed that under traffic some nodes started using 100% of the CPU. With the precious help of @giavac, we were able to track the issue down to an infinite loop inside the htable module when synchronizing fairly large (60K+) htables via dmq.

Troubleshooting

Reproduction

Have one kamailio instance with a 60K+ htable and start a new instance. The first instance will try to send the whole table to the new one and will enter an infinite loop that consumes 100% of the CPU.

This is caused by a double call to ht_dmq_cell_group_flush, which creates a circular list in the json structure hierarchy; the second call happens in this block of code (which is why a 60K+ htable is required to trigger it):
https://github.com/kamailio/kamailio/blob/5.2/src/modules/htable/ht_dmq.c#L509

When this happens, ht_dmq_cell_group_flush tries to add ht_dmq_jdoc_cell_group.jdoc_cells to ht_dmq_jdoc_cell_group.jdoc->root, but the root already has jdoc_cells as its child,
so when srjson_AddItemToObject is called (and, in turn, srjson_AddItemToArray) the node gets appended as a child of itself:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L813

This circular structure then causes an infinite loop when srjson_PrintUnformatted is called, because the print_object function iterates over the circular list:
https://github.com/kamailio/kamailio/blob/master/src/lib/srutils/srjson.c#L679
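
To illustrate the mechanism, here is a minimal, self-contained sketch of the failure mode; the node type and function names below are simplified stand-ins for the srjson structures, srjson_AddItemToArray and print_object, not the actual library code:

/* Minimal sketch, assuming a simplified node type; the names are
 * illustrative stand-ins, not the real srjson API. */
#include <stdio.h>

typedef struct node {
  struct node *next;
  struct node *child;
  const char *name;
} node_t;

/* Append 'item' at the end of array->child, like srjson_AddItemToArray. */
static void add_item_to_array(node_t *array, node_t *item)
{
  node_t *c = array->child;
  if (c == NULL) {
    array->child = item;
    return;
  }
  while (c->next != NULL)
    c = c->next;
  c->next = item; /* appending a node already in the list creates a cycle */
}

/* Walk the children, like print_object iterating over an object's members. */
static void print_children(node_t *array)
{
  node_t *c;
  for (c = array->child; c != NULL; c = c->next)
    printf("%s\n", c->name); /* never terminates once the list is circular */
}

int main(void)
{
  node_t root = { .name = "root" };
  node_t cells = { .name = "cells" };

  add_item_to_array(&root, &cells); /* first flush: cells becomes root's child */
  add_item_to_array(&root, &cells); /* second flush: cells->next = &cells */
  print_children(&root);            /* spins forever, pinning the CPU */
  return 0;
}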

Possible Solutions

One possible solution could be to destroy and re-initialize the ht_dmq_jdoc_cell_group structure after calling the flush:

if (ht_dmq_jdoc_cell_group.size >= dmq_cell_group_max_size) {
  LM_DBG("sending group count[%d]size[%d]\n", ht_dmq_jdoc_cell_group.count, ht_dmq_jdoc_cell_group.size);
  if (ht_dmq_cell_group_flush(node) < 0) {
    ht_slot_unlock(ht, i);
    goto error;
  }
  ht_dmq_cell_group_destroy();
  ht_dmq_cell_group_init();
}

But we are not sure about the performance implications.
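
For illustration, the reset could be wrapped together with the flush in a small helper, so every flush leaves the group empty. This is only a sketch of the proposal above (the helper name is ours, and it assumes ht_dmq_cell_group_init() returns a negative value on error, like the other module functions); it is not the fix that ended up being committed:

/* Sketch only: flush the current cell group and rebuild it, so the next
 * iteration never re-attaches jdoc_cells to a root that already owns it.
 * The helper name is illustrative and not part of the module. */
static int ht_dmq_cell_group_flush_and_reset(dmq_node_t *node)
{
  int ret = ht_dmq_cell_group_flush(node);

  /* Destroy and re-init regardless of the flush result so the group
   * always starts from a fresh json document. */
  ht_dmq_cell_group_destroy();
  if (ht_dmq_cell_group_init() < 0)
    return -1;

  return ret;
}

With such a helper, the block above would call ht_dmq_cell_group_flush_and_reset(node) in place of the explicit flush/destroy/init sequence.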

Additional Information

# kamailio -v
version: kamailio 5.2.1 (x86_64/linux) 44e488
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144 MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: 44e488
compiled on 11:52:58 Feb 21 2019 with gcc 5.4.0

  • Operating System:
    ubuntu:xenial docker container
    # uname -a
    Linux kama-0 4.4.0-135-generic #161-Ubuntu SMP Mon Aug 27 10:45:01 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
@charlesrchance
Member

Thanks for reporting - I will take a look and fix accordingly.

charlesrchance added a commit that referenced this issue Feb 21, 2019
@charlesrchance
Member

Please can you try with the above and confirm whether or not it is fixed?

@fnurglewitz
Author

Thank you @charlesrchance, I did some tests with this setup:
kamailio.cfg (relevant lines):

fork=yes
children=1
tcp_connection_lifetime=3605
pv_buffer_size=8192

# ----- dmq params -----
modparam("dmq", "server_address", DMQ_SERVER_ADDRESS)
modparam("dmq", "notification_address", DMQ_NOTIFICATION_ADDRESS)
modparam("dmq", "multi_notify", 1)
modparam("dmq", "num_workers", 1)
modparam("dmq", "ping_interval", 15)
modparam("dmq", "worker_usleep", 1000)

# ----- htable params -----
modparam("htable", "enable_dmq", 1)
modparam("htable", "dmq_init_sync", 1)
modparam("htable", "htable", "ht=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht1=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht2=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog
modparam("htable", "htable", "ht3=>size=16;dmqreplicate=1;autoexpire=10800;")               # Keep track of concurrent channels for accounts. Should be same as dialog

#!define ONEK "really 1 k chars, believe me :)"

event_route[htable:mod-init] {
  $var(name) = POD_NAME + "\n";
  xlog("L_ALERT", "$var(name)");
  if(POD_NAME == "kama-0") {
    $var(count) = 0;
    while($var(count) < 99) {
      $sht(ht=>$var(count)) = ONEK;
      $sht(ht1=>$var(count)) = ONEK;
      $sht(ht2=>$var(count)) = ONEK;
      $sht(ht3=>$var(count)) = ONEK;
      $var(count) = $var(count)+1;
    }
  }
}

request_route {
  if ($rm == "KDMQ"){
    dmq_handle_message();
  }
  exit;
}

Started kama-0, which now has 4 htables of ~99K size each.
Started 10 kubernetes pods and launched kamailio 100 times, with a timeout of 3 seconds, on each pod.
So we have roughly 1000 kamailio instances trying to get these htables from kama-0.
I didn't see any dangerous CPU spike, and the loop doesn't happen anymore.

There's something I'm worried about, though: the memory of the DMQ worker (measured with top), which usually stays around 0.1%, is now stable at 1.4% and is not going down again.

I fear there's a memory leak somewhere, but I'm not sure where. While debugging the loop issue I had some doubts about how the json_t structures are freed, but that could just be me not knowing the code well enough; can you give us any hint so we can help you understand this issue better?

Thanks

@charlesrchance
Member

Thanks for testing so extensively!

My (limited) understanding is that the call to srjson_DeleteItemFromObject() - by virtue of its internal call to srjson_Delete() - takes care of freeing the entire "cells" structure.

I will do some testing here to be sure there is no leak but in the meantime, can you confirm whether you see the memory increase linearly in line with the number of times you launch a new Kamailio instance? For example, if you increase from 100 to 200 times, does the memory usage then settle at 2.7%?
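
For reference, the recursive-free pattern being described looks roughly like the sketch below; it is a simplified, cJSON-style illustration of a delete that walks siblings and recurses into children, not the literal srjson_Delete() implementation:

#include <stdlib.h>

/* Simplified node type, standing in for the srjson node; fields are
 * reduced to what the sketch needs. */
typedef struct jnode {
  struct jnode *next;
  struct jnode *child;
  char *string;   /* member name, if any */
} jnode_t;

/* Delete a node and everything hanging off it: each sibling in the list
 * is freed, and its subtree is freed first via recursion. This is why
 * deleting the "cells" item is expected to release the whole structure. */
static void jnode_delete(jnode_t *item)
{
  jnode_t *next;
  while (item != NULL) {
    next = item->next;
    if (item->child)
      jnode_delete(item->child);
    if (item->string)
      free(item->string);
    free(item);
    item = next;
  }
}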

@fnurglewitz
Author

Hi @charlesrchance, I tried but couldn't reproduce the issue; I'm wondering how it happened.

@fnurglewitz
Author

@charlesrchance do you need more tests before merging this fix?

@charlesrchance
Member

I've been unable to reproduce myself, so if you're also unable to then I think it's good to merge - I'll submit the PR now.

@paolovisintin

Hello,
sorry for bothering, but could you let us know what timeline to expect for this issue? Unfortunately, we have a production deployment with this latent behaviour, which can be triggered simply by restarting one instance of kamailio.

@charlesrchance
Member

@paolovisintin - as soon as the above PR (#1872) is merged (will also be backported).

charlesrchance added a commit that referenced this issue Feb 28, 2019
- reported by Enrico Bandiera (GH #1863)

(cherry picked from commit a176ad4)
@charlesrchance
Member

Fix has been merged/backported. Please reopen if still experiencing issues.
