Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase reference to session when handling SIP calls (see #2188) #2216

Merged
merged 5 commits into from
Jun 15, 2020

Conversation

lminiero
Copy link
Member

@lminiero lminiero commented Jun 9, 2020

This is supposed to help with the occasional crashes happening in the SIP plugin when using helpers, described in #2188.

The root cause there is that the janus_sip_session associated to a helper may be freed (helper closed), but it still "lives" in the Sofia SIP loop thread, e.g., because there are still events associated to it. This happens when we a call is originated via nua_invite on a helper, or when we pass an incoming call to a helper and so bind it to the helper session using nua_bind_handle: this causes all events associated to that call to include a pointer to that session, as the hmagic. What seems to be happening is that the call still has some state that can trigger events: since the Sofia loop belongs to the master session, it still runs, and so can still ship those events; trying to dereference the now freed helper session causes the crash.

This patch tries to address that by using reference counters to ensure the session is not destroyed until the calls it handles are gone. Specifically, we add a new reference when starting a new call, and when getting a new one: when helpers are involved, we make sure it's their reference that is updated. The reference is then only decreased when we get the nua_i_terminated callback, which means the call is definitely over as far as the Sofia SIP stack is concerned.

I tested briefly and it seems to be working. That said, I didn't test intensively: no time for that. As such, if you use the SIP plugin, make sure to test this properly. Notice that you should not only make sure it doesn't crash where it did before, but ideally also check whether this introduces leaks: to do that, please build with libasan support, which will print a summary of the leaked memory when you shut Janus down cleanly. You can also uncomment REFCOUNT_DEBUG in refcount.h to more specifically track if there is any reference still dangling when Janus is shut down cleanly.

The sooner you can provide me feedback on that (and why not, fixes if you find anything broken), the sooner we can fix this.

@zaltar
Copy link

zaltar commented Jun 10, 2020

I was able to test out this patch and it did prevent the crashing, however I found two memory leaks.

The first is line 4608 in janus_sip.c. I don't believe this refcount increase is needed on the helper since the call to janus_sip_sofia_callback will also increase it.

The second issue I found is when the browser disconnects from Janus while a call is in progress. A nua_i_terminated event for the call never occurs so the refcount on the session never decreases.

Let me know if you need additional logs for these leaks or any other info. Thanks for the help with this issue.

@lminiero
Copy link
Member Author

lminiero commented Jun 10, 2020

The first is line 4608 in janus_sip.c. I don't believe this refcount increase is needed on the helper since the call to janus_sip_sofia_callback will also increase it.

Good catch! I'll fix that.

The second issue I found is when the browser disconnects from Janus while a call is in progress. A nua_i_terminated event for the call never occurs so the refcount on the session never decreases.

That's what I feared would happen, in my comment on the issue page: the loop being stopped before all events are shipped. Not sure what the proper way to handle this will be, since it will probably mean adding some complex state management. I'll have to think about it.

@lminiero
Copy link
Member Author

@zaltar can you check this update?

I tried to use the nua_r_shutdown event for the purpose, which is notified by the loop when a NUA shutdown we requested has been processed. The documentation says this gets rid of all calls, dialogs, etc. associated with that stack, but unfortunately that doesn't result in the related events to be sent before nua_r_shutdown, or at all actually. As such, I had to add a "manual" tracking of call-related references in the master session: any time we increase a reference for a session due to a call, we keep track of it (if it's a helper, all tracking is done in the master), and similarly when we decrease the reference we remove it from our list as well. When we get the nua_r_shutdown, if we still have call-related references pending (which would never be cleaned up by a nua_i_terminated) we get rid of them ourselves.

I tested this briefly and it seems to be working as expected, but you may want to stress it a bit more, especially with helpers involved. On helpers, I wonder if the same issue that is happening with calls can happen with subscriptions as well (e.g., if the helper sent a subscribe or originated a transfer), but we can worry about that another day.

@zaltar
Copy link

zaltar commented Jun 10, 2020

Unfortunately the crashes are now back.
AddressSanitizer output: https://pastebin.com/mpZ8V8Ei
Tail of the Janus log right before a crash: https://pastebin.com/bkRLseS2

From what I can tell, janus_sip_destroy_session is setting session->master to NULL which then causes the wrong condition to be executed in the nua_i_terminated case (janus_sip.c:4534). This causes the helper session to remain in master's active_calls list but the helper session's refcount is decremented. Later on when sessions remaining in active_calls are released in the nua_r_shutdown case (janus_sip.c:5064) the helper session has already been freed.

@lminiero
Copy link
Member Author

I guess one way to address that is by only unreffing if the session was in the list we're removing it from. That should solve the crash. Not sure if it will introduce a leak, though: maybe we shouldn't set session->master to NULL for helpers, so that the info can still be used until the session is really freed. I'll try and come up with a patch later.

@lminiero
Copy link
Member Author

@zaltar pushed a new tentative fix.

@zaltar
Copy link

zaltar commented Jun 12, 2020

I think we're very close to a fix. Locally I just had to modify one thing that AddressSanitizer caught. In janus_sip_unref_active_call the janus_refcount_decrease call at (janus_sip.c:1680) could cause session itself to be freed and then the access to session->master causes a use after free bug. I was able to fix this by storing session->master in a local variable. With that one change I did not see any more crashes or leaks/bugs in AddressSanitizer in testing.

Thanks for helping with this!

@code7-adrianomartins
Copy link

I think we're very close to a fix. Locally I just had to modify one thing that AddressSanitizer caught. In janus_sip_unref_active_call the janus_refcount_decrease call at (janus_sip.c:1680) could cause session itself to be freed and then the access to session->master causes a use after free bug. I was able to fix this by storing session->master in a local variable. With that one change I did not see any more crashes or leaks/bugs in AddressSanitizer in testing.

Thanks for helping with this!

Something like this?

static void janus_sip_unref_active_call(janus_sip_session *session) {
	if(session == NULL)
		return;
	janus_sip_session *master = session->master;
	if(master) {
		janus_mutex_lock(&master->mutex);
		if(g_list_find(master->active_calls, session) != NULL) {
			master->active_calls = g_list_remove(master->active_calls, session);
			janus_refcount_decrease(&session->ref);
		}
		janus_mutex_unlock(&master->mutex);
	} else {
		janus_mutex_lock(&session->mutex);
		if(g_list_find(session->active_calls, session) != NULL) {
			session->active_calls = g_list_remove(session->active_calls, session);
			janus_refcount_decrease(&session->ref);
		}
		janus_mutex_unlock(&session->mutex);
	}
}

@lminiero
Copy link
Member Author

Ok, done, thanks for the feedback.

@lminiero
Copy link
Member Author

Since @zaltar mentioned that with that fix, he couldn't replicate crashes or leaks anymore, I'll merge and close the original issue. Thanks for the help!

@lminiero lminiero merged commit d93606c into master Jun 15, 2020
@lminiero lminiero deleted the sip-fix-helpers branch June 15, 2020 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants