Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mod pua (dialoginfo) etag race conditions #2048

Closed
laszlovl opened this issue Aug 27, 2019 · 3 comments
Closed

mod pua (dialoginfo) etag race conditions #2048

laszlovl opened this issue Aug 27, 2019 · 3 comments

Comments

@laszlovl
Copy link

I'm observing the following scenario:

  • mod_dialog callbacks trigger 2 or more times (nearly) simultaneously for the same dialog
  • pua_dialoginfo sends PUBLISH 1, referencing etag A
  • pua_dialoginfo sends PUBLISH 2, referencing etag A
  • presence_dialoginfo processes PUBLISH 1, replies with new etag B
  • presence_dialoginfo processes PUBLISH 2, replies with a 412 (because etag A no longer exists)
  • pua_dialoginfo receives the 412 and re-tries it as PUBLISH 3 ("sent a PUBLISH within a dialog that no longer exists, send again an intial PUBLISH")
  • presence_dialoginfo processes PUBLISH 3, and may or may not accept it

The situation as described is not ideal since it'll fill up your logs with errors, but isn't critical per se. Much more problematic is when there are more than 2 PUBLISHes generated for the same dialog simultaneously, as this can cause a (near) infinite race between the various PUBLISH requests all fighting to update the same etag. For example, 10 PUBLISH are sent out for etag A; all but one are rejected with a 412; then the other 9 keep on bouncing back and forth between pua_dialoginfo and presence_dialoginfo because they do not share the same view on the dialog's latest etag.

Even worse is when presence_dialoginfo is rejecting all incoming PUBLISHes with a 412, for example because of a database/memory/replication problem or a malformed request. A t_reply("412", "Not today") in the presence_dialoginfo server, combined with a single PUBLISH from pua_dialoginfo is enough to reproducibly brick the pua_dialoginfo server because it runs into critical memory fragmentation levels.

I think there are multiple ways to fix or alleviate this problem.

pua generic

  • pua (publ_cback_func) should not retry 412-failed PUBLISHes indefinitely, but e.g. at most once
  • pua should not generate simultaneous PUBLISHes for the same presentity. It should delay PUBLISH 2 until PUBLISH 1 is either (permanently) accepted or rejected; or it should discard PUBLISH 2 immediately when it is generated.
  • Perhaps make handling of 412 replies more fine-grained. Currently every 412 reply is handled like this ("sent a PUBLISH within a dialog that no longer exists"), while that statement doesn't apply to all possible 412 replies.

pua_dialoginfo specific

  • pua_dialoginfo currently subscribes to a lot of mod_dialog callbacks. For example subscribing to both DLGCB_CONFIRMED and DLGCB_CONFIRMED_NA will always get you two (rapidly succeeding) PUBLISHes with exactly the same contents. Subscribing to DLGCB_REQ_WITHIN means you'll get a new PUBLISH for every re-INVITE such as hold/unhold or codec negotiation, which is useless in many usecases. It would be helpful to allow configuring pua_dialoginfo with a list of callbacks to subscribe to.
  • With or without a smaller set of mod_dialog callbacks, pua_dialoginfo can generate multiple PUBLISH requests with exactly the same contents. Since pua is aware of the (last) state that it published for the presentity, it could compare if the newly generated PUBLISH is any different from the last known state, and discard it if it's not.
@henningw
Copy link
Contributor

henningw commented Sep 5, 2019

@laszlovl - thank you for the detailed issue description. Are you planning to work on this further, e.g. by creating pull requests?

It makes probably sense to prioritize the different fixes according to implementation effort. It would be also good to cross-check the implementation with the relevant RFC standards.

This two topics sounds like good next steps to me, let me know what you think.

  • making pua 412 handling more granular (by checking the type of the reply)
  • making pua_dialoginfo callback subscription configurable (if possible with a list of callbacks, otherwise maybe with a mode parameter that adds more and more CBs)

@laszlovl
Copy link
Author

I don't have any immediate/short term plans to work on implementations for these suggestions, though if anyone else does so I'll make sure to test and give feedback.

Copy link

github-actions bot commented Nov 8, 2023

This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.

@github-actions github-actions bot added the Stale label Nov 8, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants