Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rtpengine: hash table to keep the selected nodes #390

Merged
merged 13 commits into from Dec 10, 2015

Conversation

smititelu
Copy link
Contributor

Shared memory hash table with global hashtable lock.
Add state maintaining the selected rtp node, for a given callid.
Hashtable entry expiration time configurable using hash_entry_tout modparam.
The actual deletion happens on the fly while insert/remove/lookup are called.
Updated doku.

Shared memory hash table with global hashtable lock.
Add state maintaining the selected rtp node, for a given callid.
Hashtable entry expiration time configurable using hash_entry_tout modparam.
The actual deletion happens on the fly while insert/remove/lookup are called.
Updated doku.
@smititelu
Copy link
Contributor Author

Basically tested it, with focus on entries expiration. Further tests will follow.

What do you think about current expired entries deletion implementation? Do you have any other advices/proposals?

Stefan Mititelu added 4 commits November 6, 2015 17:17
This is my understanding of the current shared memory node list implementation.
Correct me if I'm wrong.
Allow configurable table size.
Updated doku.
Print the total number of hash entries in the hash table, at the given moment.
Updated doku.
When the param is enabled, allow current sessions to finish and deny new
sessions for manually deactivated rtpengine nodes via kamctl i.e.
"disabled(permanent)" nodes.
This is useful when deactivating the nodes for maintenance.
Default value is 0, so the current behaviour is maintained
(e.g. don't send commands to any deactivated proxy).
Updated doku.
@smititelu
Copy link
Contributor Author

Updated pull request. Added some modparams and kamctl for the hashtable. Added a modparam to allow session deletion to manually disabled nodes (useful when performing maintenance). Updated doku.

In my opinion tests were ok. Tested this for 100 calls with media, 10 calls/s rate. Sessions were deleted successfully upon BYE. No memory leaks found at the hash table level (used "kamcmd mod.stats rtpengine shm" for this).

@miconda
Copy link
Member

miconda commented Nov 11, 2015

What would happen when there is a restart?

I find the feature useful overall, any limitations need to be documented when it is enabled.

Hopefully @rfuchs can comment from his point of view and merge when all is ok.

@smititelu
Copy link
Contributor Author

On a restart (i.e. rtpengine's mod_destroy() is called), rtpengine_hash_table_destroy() is called which frees the entire shm hashtable.

Possible limitations:

  • overhead added by the rtpengine hashlock (i.e. decrease in calls/sec rate)
  • more shm used (i.e. one has to increase the shm using kamailio's "-m" parameter)

What is the current known limitation for kamailio (calls/sec) ?

@miconda
Copy link
Member

miconda commented Nov 11, 2015

I was looking more at the perspective of the items in hash table being lost and if the restart was in a middle of the call, then after restart, when the BYE is processed, which rtpengine is updated?

@smititelu
Copy link
Contributor Author

Hmm.. I think I get the point. If the kamailio restarts in the middle of a call, the hashtable is flushed. When the BYE comes, no entry is found for that callid and no rtpengine is updated.

@smititelu
Copy link
Contributor Author

One approach to this would be to fallback to the stateless behaviour when no entry is found in the hashtable and just select a rtpengine based on the hashed callid (with 100% matching the rtpengine machine for that call if none crashed in the meantime).


/* Insert the key<->entry from the hashtable */
if (!rtpengine_hash_table_insert(&callid, entry)) {
LM_ERR("rtpengine hash table fail to insert node=%.*s for calllen=%d callid=%.*s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error case needs some shm_free() I think

@gaaf
Copy link
Contributor

gaaf commented Nov 12, 2015

Why not just require the dialog module and store the selected rtpproxy in the dialog?

What happens on spiraled/hairpin calls? The call-id alone does not seem specific enough.

@miconda
Copy link
Member

miconda commented Nov 12, 2015

I would not like a dependency on dialog module, it is quite common to run rtp relay on edge proxy which is typically more or less a lightweight load balancer -- dialog management on those nodes will add a lot of overhead.

Fallback to hasing callid after a restart seams reasonable for me (giving the current behaviour).

For spiralled calls, iirc, there is an option to provide via branch to rtpproxy/rtpengine, but this is independent of selecting what rtpproxy/rtpengine to use.

@smititelu
Copy link
Contributor Author

We also thought of using dialog module for keeping the node. The main reason against it was to keep the rtpengine module as independent as it could be. Maybe for spiraled calls this could be extended to consider also the from/to-tag when calculating the hash index ?!

@juha-h
Copy link
Contributor

juha-h commented Nov 12, 2015 via email

@gaaf
Copy link
Contributor

gaaf commented Nov 12, 2015

So, the dialog module is a no-go/

Have you considered just putting the selected rttproxy in an RR parameter? For me that has worked perfectly for years now.

@juha-h
Copy link
Contributor

juha-h commented Nov 12, 2015 via email

@miconda
Copy link
Member

miconda commented Nov 12, 2015

That is only the setid not the selected rtp relay within the set. Afaik, there is no way to specify the rtp relay within a set from config file.

@juha-h
Copy link
Contributor

juha-h commented Nov 12, 2015 via email

@miconda
Copy link
Member

miconda commented Nov 12, 2015

My understanding was that the patch was adding support for keeping the relation between the selected rtpengine and the callid, when there are more than one rtpengine per set. If one of the rtpengine in the set is unavailable, then the current hashing selection becomes inconsistent if the rtpengine turns back available.

If there is only one rtpengine per set, then no hashing selection is done.

Stefan Mititelu added 2 commits November 16, 2015 13:26
- shm NULL checks and free already alloc'ed shm
- default entry tout to 3600 sec
- return node only, not the whole entry
- zero shm hashtable parts
- lookup and select new node if lookup fails; this is done for all commands
and assures fallback behaviour
- change void to struct specific
- make set_rtp_inst_pvar() static -> used only in rtpengine.c
- fix typos rtpproxy vs rtpengine
- hash table entry contains callid, viabranch
- hash table lookup based on callid, viabranch (useful for branching scenarios);
keep doing the hash table remove right away
- remove op param when select_rtpp_node(); not needed
@smititelu
Copy link
Contributor Author

Updated pull request. Some of the most important updates:

  • free previously allocated shm in case of reached memory limit and solve possible segfaults; tested this using -m 128 and a large table: kamailio didn't crashed when had no memory for the entries and still accepted calls afterwards.
  • always lookup hashtable before selecting a new node; this fallbacks to the previous behaviour
  • check also for viabranch when matching the call in addition to the callid; tested this for simple calls i.e. viabranch was STR_NULL; didn't tested it for real-life branching scenarios as we are not using it; here I'd need some help from your side, if you use those scenarios and have time for it; IMHO the matching should be fine as I'm checking the viabranch value I get from the struct sip_msg.

@smititelu
Copy link
Contributor Author

Accidentally pushed all the commits when trying to "git push upstream p_usrloc_NULL_checks master"; forgot the ":".

Shall I re-push the old files? (can't push -f a reset on previous commit due to locking).

@linuxmaniac
Copy link
Member

git revert is your friend

@linuxmaniac
Copy link
Member

Done. I reverted all your commits

@smititelu
Copy link
Contributor Author

Thank you; didn't think of it :)

@miconda
Copy link
Member

miconda commented Nov 23, 2015

@linuxmaniac -- you should add the commands you executed in the wiki:

It will be useful for devs getting into git.

@linuxmaniac
Copy link
Member

Stefan Mititelu added 5 commits December 4, 2015 13:16
Don't dup a NULL str->s to avoid warning message.
This happened usually when viabranch is not used(default being NULL).
If allow_op modparam enabled, send commands to the disabled machines for the
existing call. So far this was done only for manually deactivated machines.
This is useful because there might be cases of proxy timeout, cases when you
may want to still allow the operations for the existing calls.
This will further increase rtpengine's hash table access.
For consistency with the per row locks, statistics should be also per row.
- struct rtpengine_hash_table now contains the table size.
- rename the entry_list to row_entry_list
@smititelu
Copy link
Contributor Author

Updated the pull request with focus on the hastable API:

  • done per row hashtable locking instead of global hashtable locking; this should further decrease the possible limitations of using hashtable locks;
  • when enabled, the allow_op modparam will allow sessions to finish for both disabled(permanent) and disabled nodes(they might be disabled due to timeout not to real machine break and one might still want the current sessions to finish);

rtpengine_hash_table->row_entry_list = shm_malloc(rtpengine_hash_table->size * sizeof(struct rtpengine_hash_entry*));
if (!rtpengine_hash_table->row_entry_list) {
LM_ERR("no shm left to create rtpengine_hash_table->row_entry_list\n");
rtpengine_hash_table_destroy();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor nitpick: These calls to _destroy() here won't ever actually do anything, because the _sanity_checks() will always fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The _destroy() should do sanity checking, with memory free, when possible. Will correct this.

@rfuchs
Copy link
Member

rfuchs commented Dec 10, 2015

Other than that, looks good to me.

_destroy() sanity checking, with memory free, when possible:
- alloc the locks first.
- free the locks last.
- consider content already hadled for a NULL lock (or NULL lock vector).
- make _free_row_lock() static.
@rfuchs rfuchs merged commit 95cd106 into kamailio:master Dec 10, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants