
dispatcher: fix weighted sort for n<100 calls #3531

Closed
wants to merge 3 commits

Conversation

@devopsec (Contributor) commented Aug 7, 2023

  • update weighted/relative weighted algos to use probability based sort
  • calls from 0-99 now follow the weighted distributions
  • removed the now unused shuffle_uint100array() function

Pre-Submission Checklist

  • Commit message has the format required by CONTRIBUTING guide
  • Commits are split per component (core, individual modules, libs, utils, ...)
  • Each component has a single commit (if not, squash them into one commit)
  • No commits to README files for modules (changes must be done to docbook files
    in doc/ subfolder, the README file is autogenerated)

Type Of Change

  • Small bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would change existing functionality)

Checklist:

  • PR should be backported to stable branches
  • Tested changes locally
  • Related to issue #XXXX (replace XXXX with an open issue number)

Description

The previous random sort for weights in a dispatcher set distributed correctly only after a minimum of n=100 calls.
This PR fixes the sort algorithm to distribute calls based on those weights even for n<100 calls.
The real benefit is that customers can now see the weighted distribution take effect sooner when testing.

@miconda (Member) commented Aug 7, 2023

Thanks for this PR! However, it brings significant changes to the existing behaviour/design of the weight distribution, which was done on purpose to be randomised across the 100 calls.

I think your PR should be changed to add a new dispatching algorithm, leaving the existing one as it is.

A few more comments on coding style, to keep it as coherent as possible: declare variables at the beginning of the function or block; one-line comments should be on a single line as /* ... */, not spread over three lines.

@devopsec (Contributor, Author) commented Aug 7, 2023

@miconda thanks for reviewing this.
I chose to go with replacement since I believe this better represents the original intention of the weighted / relative weighted algorithms.

Now, to be honest, I could be completely wrong in thinking that.
At n>100 calls, or simply when scaling, the new algorithm distributes the same percentage of calls to each destination.
The only difference is the order in which the destinations are selected.

If you think that is incorrect then I can just move it to a new dispatching algorithm.

And I'll update the formatting on those comments as well, thanks!

@henningw (Contributor) commented Aug 7, 2023

If I understand it correctly, then going for a new algorithm only makes sense if we consider the fact that the algorithm seems to be broken for fewer than 100 calls a feature of this algorithm. ;-) It's probably a better idea to fix the bug instead of offering our users two algorithms, a broken one and a fixed one.

- update weighted/relative weighted algos to use probability based sort
- calls from 0-99 now follow the weighted distributions
- removed the now unused shuffle_uint100array() function
@devopsec (Contributor, Author) commented Aug 7, 2023

@miconda updated formatting per #3531 (comment)

@henningw I agree. I think it is a bug that calls are not distributed per the weights until a minimum of 100 calls are made.

@miconda (Member) commented Aug 7, 2023

@henningw: the weight being expressed as a percentage, you can't ensure proper accuracy for fewer than 100 calls; maybe in some particular cases, but not for most of the possible variants. Take a simple case as an example: 100 destinations, each with weight 1. If you have 2 calls, then two destinations get 50% of the traffic and the rest get 0%. So try to fix that, I would love to see a solution.

@devopsec: I haven't checked the code, but I guess it is more like getting to a round robin with some destinations appearing multiple times in the group. For example, when there are two addresses, the first with weight 75 and the second with weight 25: the first is added 3 times, then the second (... and repeat). Is it randomised within these 4? Or does it just go to the next one for each new call? If my guess is confirmed, then it is a new algorithm and the previous behaviour is lost: it is no longer a randomised distribution across 100 calls, but an ordered distribution (or one randomised in a group smaller than 100).
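A minimal sketch of the grouped round robin miconda is guessing at here; this is illustrative only, not code from the PR, and the group size of 4 and the array names are made up:

#include <stdio.h>

int main(void)
{
    int weights[2] = {75, 25};
    int seq[4]; /* smallest group matching the 75:25 ratio */
    int pos = 0;
    int d, s, i;

    /* repeat each destination in proportion to its weight:
     * 75:25 over a group of 4 gives 3 slots for dst 0, 1 slot for dst 1 */
    for(d = 0; d < 2; d++)
        for(s = 0; s < weights[d] * 4 / 100; s++)
            seq[pos++] = d;

    for(i = 0; i < 4; i++)
        printf("%d ", seq[i]); /* prints: 0 0 0 1 */
    printf("\n");
    return 0;
}

Selection would then be either an ordered walk over seq (one step per new call) or a random pick within it; which of the two the PR does is exactly what miconda asks above.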

@devopsec: looking at the updated commit, some fields of pointers are used before checking that the pointer is NULL: the arrays are declared using a field of a structure pointer. We tried to avoid variable-sized arrays in the past for compatibility with old compilers/C standards. I am not really against them (IIRC they are allowed by C99), but they have to be relocated after the NULL check. The old code does pkg_malloc()/pkg_free() for ds_dests_flags for that reason.

@miconda (Member) commented Aug 7, 2023

For the record, variable-length arrays seem to no longer be mandatory to support from C11 on, though modern compilers likely still support them. Here is a good summary comparing them with malloc:
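A small illustration of the trade-off being discussed, assuming nothing about the module's code; in Kamailio the heap variant would use pkg_malloc()/pkg_free() rather than plain malloc()/free():

#include <stdlib.h>

/* VLA: allowed in C99, made optional in C11 (a compiler that lacks them
 * defines __STDC_NO_VLA__); an allocation failure cannot be detected */
void with_vla(int n)
{
    int weights[n];
    (void)weights;
}

/* heap allocation: portable to any C standard and able to report failure */
int with_malloc(int n)
{
    int *weights = malloc(n * sizeof(int));
    if(weights == NULL)
        return -1;
    /* ... use weights ... */
    free(weights);
    return 0;
}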

@devopsec (Contributor, Author) commented Aug 8, 2023

@miconda yes, the new code is not using randomised selection for the destination. It selects based on probability throughout (recalculating the percentage of calls sent to each destination on every iteration, out of the total calls made). Think of the sorting loop as if each iteration of i were "processing a new call". Maybe an explanation based on your example above, weight1=75 and weight2=25, will clarify.

On the first iteration of i no calls have been processed, so the call goes to dst1. On the second iteration dst1 has 1/1 calls; 100% is greater than the intended weight, so we distribute the call to dst2. For i=3 both dsts have 50% of the calls, but dst1 is intended to have 75% of the calls, so the call goes to dst1. This repeats until 100 "calls" have been processed.

The effect is exactly the same in terms of the distribution of calls as a percentage, i.e. the intention of the algorithm, but the "bug fix" is that the distribution is now equivalent at high call volumes and at low call volumes. So it is true that the new algorithm does not use random sorting under 100 calls like the old one, but I believe that is a bug fix, not a new feature. I think that was a decent explanation, but if not please ask for more clarification.
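For illustration, here is a minimal standalone sketch of the loop devopsec describes. It is a reconstruction, not the PR's code, and the integer deficit arithmetic is just one way to express "recalculate the percentage each iteration":

#include <stdio.h>

#define NDST 2
#define SLOTS 100

int main(void)
{
    int weights[NDST] = {75, 25}; /* target percentages, summing to 100 */
    int served[NDST] = {0};       /* "calls" assigned so far */
    int order[SLOTS];
    int i, d;

    for(i = 0; i < SLOTS; i++) {
        int best = 0;
        for(d = 1; d < NDST; d++) {
            /* deficit: how far below its target share a destination is,
             * kept in integer math by scaling both terms by 100*i */
            int def_best = weights[best] * i - 100 * served[best];
            int def_d = weights[d] * i - 100 * served[d];
            if(def_d > def_best
                    || (def_d == def_best && weights[d] > weights[best]))
                best = d;
        }
        order[i] = best;
        served[best]++;
    }

    for(i = 0; i < 8; i++)
        printf("%d ", order[i]); /* prints: 0 1 0 0 0 1 0 0 */
    printf("\n");
    return 0;
}

At every prefix of order the served counts stay as close to 75/25 as integer counts allow, which is the claimed fix for n<100.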

On the topic of memory allocation, thanks for the reference, I am always happy to learn! In my live tests the compiler happily accepted my code and gdb gave me good output back... I don't claim to be a master of memory management, so I will defer to you on whether pkg_malloc()/pkg_free() is required here to be compliant with all the intended C specs.

@miconda (Member) commented Aug 8, 2023

@devopsec: in the end you get more or less a fixed order of selecting the addresses to route to. For example, with two addresses, one with weight 99 and the other with weight 1, the one with weight 1 is going to be placed 2nd (or 1st, depending on the initial order). If the weights are 90 and 10, it should end up like 1st, 2nd, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 1st, 2nd, 1st, 1st... For calls of the same type (e.g., codecs, audio/video) and duration this could bring some benefits, but for calls of random type and duration the random ordering proved to give a pretty good average load on resources over a long time.
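For what it's worth, the deficit-based sketch after devopsec's walkthrough above reproduces exactly this sequence when run with weights {90, 10}: the 1st destination is picked, then the 2nd, then the 1st nine times in a row before the 2nd comes up again.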

With the above remarks, and considering that the random ordering was done on purpose at the time of implementation, I don't consider this PR a bug fix, but a different distribution algorithm (for example, it could be called percentage-ordered distribution).

@devopsec (Contributor, Author) commented Aug 8, 2023

@miconda Hmmm, I see your point there. No worries, I'll resubmit as a new algorithm.

if(weights_total < 100) {
    LM_INFO("extra weight %d for last destination in group %d\n",
            (100 - weights_total), dset->id);
    weights[last_valid_dst] = weights[last_valid_dst] + (100 - weights_total);

Check failure: Code scanning / CodeQL
Potentially uninitialized local variable (High): the variable last_valid_dst may not be initialized at this access.
if(rw_slices_total < 100) {
    LM_INFO("extra rweight percentage %d for last destination in group %d\n",
            (100 - rw_slices_total), dset->id);
    rw_slices[last_valid_dst] = rw_slices[last_valid_dst] + (100 - rw_slices_total);

Check failure: Code scanning / CodeQL
Potentially uninitialized local variable (High): the variable last_valid_dst may not be initialized at this access.
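A sketch of the kind of fix this warning usually calls for. The field names dset->nr and dset->dlist and the loop shape are assumed from the dispatcher module's structures, and this is not the PR's actual code; the idea is to initialize last_valid_dst and reject the group if no valid destination was seen:

int last_valid_dst = -1; /* initialized so the later access cannot read garbage */
int d;

for(d = 0; d < dset->nr; d++) {
    /* assuming the module's existing ds_skip_dst() active/disabled check */
    if(ds_skip_dst(dset->dlist[d].flags))
        continue;
    last_valid_dst = d;
    /* ... accumulate weights_total for this destination ... */
}

if(last_valid_dst < 0) {
    LM_ERR("no valid destination in group %d\n", dset->id);
    return -1;
}
if(weights_total < 100)
    weights[last_valid_dst] += (100 - weights_total);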
- add support for changing the column delimiter in a hash table
- add support for changing the column null character
@miconda (Member) commented Oct 5, 2023

Is any work still planned on refactoring to a new algorithm? Maybe this can be closed and a new PR created when it is done.

@devopsec (Contributor, Author) commented Oct 5, 2023

@miconda I'll open this as a new PR when I can get back around to it, thanks!

devopsec closed this Oct 5, 2023