
dispatcher algorithm 11, no more than 25 hosts in group for proper distribution #2698

Closed
E1isIvan opened this issue Mar 30, 2021 · 10 comments

@E1isIvan

E1isIvan commented Mar 30, 2021

Description

We tried to load-share 50+ media servers using dispatcher algorithm 11 and ran into unexpected behavior: most calls go to the very first host in the group.

Troubleshooting

I tried to start Kamailio like this:
sudo kamailio -f /etc/kamailio/kamailio.cfg -E -d 5 -u 995
but the errors were the same whether the dispatcher group had more than 25 or fewer than 25 hosts.

Different combinations of the dispatcher list host group configuration were tried:
100 sip:10.60.27.123:7000 10 rweight=50
100 sip:10.60.27.123:7000 10 weight=50;rweight=50
100 sip:10.60.27.123:7000 10 rweight=50,maxload=80
100 sip:10.60.27.123:7000 0 10 rweight=50
100 sip:10.60.27.123:7000 0 10 weight=50;rweight=50
100 sip:10.60.27.123:7000 0 10 rweight=50,maxload=80

When making 100 calls, about a quarter of them go to the first host in the group.

Reproduction

modparam("dispatcher", "list_file", "/etc/kamailio/dispatcher.list")
modparam("dispatcher", "flags", 2)
modparam("dispatcher", "ds_ping_method", "OPTIONS")
modparam("dispatcher", "ds_probing_threshold", 3)
modparam("dispatcher", "ds_inactive_threshold", 10)
modparam("dispatcher", "ds_probing_mode", 3)
modparam("dispatcher", "ds_ping_interval", 10)
modparam("dispatcher", "ds_ping_reply_codes", "501,403,404,400,200")
modparam("dispatcher", "ds_ping_from",DS_PING_FROM_PARAM)
modparam("dispatcher", "use_default", 0)

if (ds_is_from_list("101")) {
    sl_send_reply("100", "My calls");
    ds_select_dst("100", "11");
    return;
}

The dispatcher group must have more than 25 hosts; if it has 25 or fewer, it is OK:
100 sip:10.60.27.123:7000 0 10 rweight=50
100 sip:10.60.27.123:7001 0 10 rweight=50
100 sip:10.60.27.123:7002 0 10 rweight=50
100 sip:10.60.27.123:7003 0 10 rweight=50
100 sip:10.60.27.123:7004 0 10 rweight=50
100 sip:10.60.27.123:7005 0 10 rweight=50
100 sip:10.60.27.123:7006 0 10 rweight=50
100 sip:10.60.27.123:7007 0 10 rweight=50
100 sip:10.60.27.123:7008 0 10 rweight=50
100 sip:10.60.27.123:7009 0 10 rweight=50
100 sip:10.60.27.123:7010 0 10 rweight=50
100 sip:10.60.27.123:7011 0 10 rweight=50
100 sip:10.60.27.123:7012 0 10 rweight=50
100 sip:10.60.27.123:7013 0 10 rweight=50
100 sip:10.60.27.123:7014 0 10 rweight=50
100 sip:10.60.27.123:7015 0 10 rweight=50
100 sip:10.60.27.123:7016 0 10 rweight=50
100 sip:10.60.27.123:7017 0 10 rweight=50
100 sip:10.60.27.123:7018 0 10 rweight=50
100 sip:10.60.27.123:7019 0 10 rweight=50
100 sip:10.60.27.123:7020 0 10 rweight=50
100 sip:10.60.27.123:7021 0 10 rweight=50
100 sip:10.60.27.123:7022 0 10 rweight=50
100 sip:10.60.27.123:7023 0 10 rweight=50
100 sip:10.60.27.123:7024 0 10 rweight=50

To emulate call load and multiple media servers, SIPp scenarios were used.
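
Not part of the original report, but a minimal standalone sketch (plain C, hypothetical helper name) of the documented 100-slot arithmetic that makes 25 equal-rweight hosts the boundary here: each destination gets floor(rweight*100/total_rweight) of the 100 slots, and whatever is left over goes to a single destination, which lines up with the "quarter of the calls" observation above.

#include <stdio.h>

/* Sketch only: emulate the documented slot arithmetic for a group of n
 * destinations that all carry the same rweight. */
static void rweight_slots(int n, int rweight)
{
    int total = n * rweight;
    int per = rweight * 100 / total;   /* integer division truncates */
    int used = per * n;
    int leftover = 100 - used;         /* currently all given to one destination */

    printf("%2d hosts: %d calls each out of 100, one host gets %d\n",
            n, per, per + leftover);
}

int main(void)
{
    rweight_slots(25, 50);   /* 4 each -> perfectly even */
    rweight_slots(26, 50);   /* 3 each, one host gets 25 (a quarter of the calls) */
    rweight_slots(52, 50);   /* 1 each, one host gets 49 */
    return 0;
}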

Debugging Data

Log Messages

SIP Traffic

No specific traffic, just regular INVITE messages with international numbers.

Possible Solutions

Additional Information

  • Kamailio Version - output of kamailio -v
kamailio -v
version: kamailio 5.4.4 (x86_64/linux) e16352
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: e16352 
compiled on 15:56:46 Feb 15 2021 with gcc 4.8.5
  • Operating System:
CentOS Linux release 7.7.1908 (Core)

Linux test-carrier-1.prd.tc5.ams.nl.kwebbl.loc 3.10.0-1062.4.1.el7.x86_64 #1 SMP Fri Oct 18 17:15:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
@miconda
Member

miconda commented Mar 30, 2021

See the docs of the module:

Note the remark about the exact percentage and the redistribution of the rest.

If you want to discuss more, then write to the sr-users@lists.kamailio.org mailing list. If it proves to be a bug in the code, then an issue can be opened here.

@miconda miconda closed this as completed Mar 30, 2021
@henningw
Contributor

Reopened as requested on the mailing list:

Hi Ivan, it must be a limitation/problem in the computation, can you open an issue? Since I have built on top of algorithm 11, I could have a look; we may also notify the original author to see what can be done.

@henningw henningw reopened this Mar 30, 2021
@jchavanton
Member

I remember Daniel already answered similar questions. From what I remember, the algorithm redistributes the weights as ratios stored in 100 slots and fills the remaining slots with the first one.

So there may be a slight offset.

What you are reporting in this issue is that the distribution is not done well when there are more than 25 hosts?

I find this statement a bit confusing:
"but the errors were the same whether the dispatcher group had more than 25 or fewer than 25 hosts"

compared to:
"The dispatcher group must have more than 25 hosts; if it has 25 or fewer, it is OK"

If this is the case, it seems valuable to be able to support more than 25 hosts, or at least to issue a warning.
I will try to have a look, hopefully soon; I find this algorithm an interesting building block and a very good option for many scenarios.
The limitations that may complicate things are sometimes introduced by code architecture that would need too much refactoring; let's see.

@jchavanton
Member

A good explanation from Daniel on the expected limitations:

Hello,

There was an issue on the tracker, but this looks like the redistribution due to the percentage not being an integer number. I referred to the docs (it is noted there), closed the issue and pointed to sr-users for further discussions.

So, for example, if there are 30 routes, each with an rweight of 50, that means a total rweight of 1500. With that, each route has a percentage of 50*100/1500 = 3.33, which is truncated to 3, meaning that each destination gets 3 calls in 100, 90 in total, with the last one (which can be the first in the dispatcher list file, depending on internal sorting and insertion) getting the remaining 10, so overall 13 go to this route (29 routes will get 3 calls, and 1 will get 13 calls). Now, depending on the number of routes, one of them may get a different value, but the fact is that if the percentage isn't an integer number, then the distribution is not equal for one of them. For example, with 6 routes:

total rweight => 6 * 50 = 300; then the percentage per route is 50*100/300 = 16.6, truncated to 16, meaning one route will get 20.

You may want to double check the rweight algorithm code, I haven't implemented it, but for weight the above remarks apply.

That was discussed in the past on the mailing list, and very likely it is not a bug in this case either, but rather how the distribution is spread across 100 calls.

Even more, for the r/weight algorithms it really makes no sense to use them when there are so many routes with the same weight: practically, with more than 50 routes, each one will get 1 call and the last in the memory list will get 51, so one gets more than 50% of the calls. From the other perspective, it is not even feasible to route to all of them in the same step, because by default Kamailio can create 12 branches for a request, with a maximum configurable value of 31 (iirc). I think round robin or call load distribution is better for such a large number of routes.

Cheers,
Daniel

We may be able to minimize the limitation of the 100 slots by at least distributing the remainder.

/* if the array was not completely filled (i.e., the sum of rweights is
 * less than 100 due to truncated), then use last address to fill the rest */
last_insert = t > 0 ? dset->rwlist[t - 1] : (unsigned int)(dset->nr - 1);
for(j = t; j < 100; j++)
      dset->rwlist[j] = last_insert;
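
For illustration only, a rough sketch (not the module's actual code) of that remainder-distribution idea, reusing the dset->rwlist and dset->nr fields from the quoted snippet: instead of assigning every leftover slot to the last inserted address, the leftover slots could be handed out round-robin across all destinations.

/* Sketch of the proposal above: with t slots already filled, spread the
 * remaining 100 - t slots over all dset->nr destinations instead of
 * assigning them all to the last inserted address. */
for(j = t; j < 100; j++)
      dset->rwlist[j] = (unsigned int)((j - t) % dset->nr);

With 26 equal-weight destinations, for example, that would turn the 3/3/.../25 split into 4 slots for 22 of them and 3 slots for the remaining 4, instead of 25 for a single one.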

@jchavanton
Member

Not sure, but it seems that without much refactoring we may be able to fix the loss of precision by setting the unused positions to -1/disabled:

case DS_ALG_RELWEIGHT: /* 11 - relative weight based distribution */
    hash = idx->rwlist[idx->rwlast];      // check/skip here
    idx->rwlast = (idx->rwlast + 1) % 100;
    break;
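
A similarly hedged sketch of that "-1/disabled" idea: if the fill code marked unused positions with a sentinel (assumed here to be (unsigned int)-1), the selection step could skip them so the truncation leftover never collapses onto one destination (a real patch would also need a guard against a list with no valid slots).

case DS_ALG_RELWEIGHT: /* 11 - relative weight based distribution */
    do {
        hash = idx->rwlist[idx->rwlast];
        idx->rwlast = (idx->rwlast + 1) % 100;
    } while(hash == (unsigned int)-1);   /* skip slots marked as disabled */
    break;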

@miconda
Member

miconda commented Mar 30, 2021

@jchavanton: let's not use the issue tracker for discussing the documented behaviour of the stable branches.

I pointed to the docs in my earlier comment closing this issue and provided more details on the sr-users mailing list. If the devs use the issue tracker as a discussion forum, then we cannot ask other people not to do it.

If you want to discuss improvements/changes in behaviour, then use the sr-dev mailing list or eventually open a feature request with appropriate details.

If @henningw reopened it because he believes it is an issue not related to what is explicitly documented, disregarding my comment, then he can probably provide more details here.

@henningw
Contributor

@miconda - I just reopened it because @jchavanton mentioned that it is indeed an issue and should be tracked on the issue tracker.

@E1isIvan
Author

E1isIvan commented Apr 1, 2021

Hello.

I'd like to present the point of view of a user: while the feature description is accurate, it is hard to account for the fractional part when you have 25 or more hosts. Say there are 100 calls. Then, in the case of 8 hosts with equal rweight values, the distribution will be more or less equal, simply because the difference between 100 and 96 (the percentage is 12.5, so 12*8) is not a big deal. But it is a completely different case when there are 25 or 50 hosts: in that case the percentage is 3.8, which gives 78 calls to distribute and 18 that go to a single host.
While, of course, I can achieve the same thing using a different algorithm, I can't use an algorithm whose result depends on something other than what is expected: not only the rweight value controls the result, but the number of hosts as well.

@miconda
Member

miconda commented Apr 1, 2021

@henningw: here I already gave a resolution explaining why this is the behaviour and closed the issue.

On the mailing list there was a belief that it might be a problem, as a response to a message sent there because of my comment here. If you hurried to act on the response on the mailing list, you should have replied that an issue had already been opened, with some comments, and closed, giving the link to it.

Do not reopen without minimal consideration of what was commented here. Do not reopen closed issues if you have no idea what they are about. If you have technical reasons to believe that a closed issue should be reopened, then do it, adding the appropriate details. Now the discussion is split in two places, it is hard to track, and what is discussed here is not a bug in the C code.

@miconda
Member

miconda commented Apr 1, 2021

@E1isIvan: you have to send your remarks about the current documented behaviour to the sr-users mailing list, so the discussion can be followed in a single thread.

The current behaviour is what is expected; the algorithm is intended for specific use cases and matches those needs. One can eventually complain that this algorithm does not give an even distribution for more than 100 destinations, but that is obviously not its design. New algorithms can be contributed if someone needs a new type of routing.

@miconda miconda closed this as completed Apr 1, 2021