Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[adv-proxy] potential issue in publishing multiple services may cause crash #1991

Closed
abtink opened this issue Sep 3, 2023 · 6 comments · Fixed by openthread/openthread#9398

Comments

@abtink
Copy link
Member

abtink commented Sep 3, 2023

I have been testing/developing new code on otbr and noticed crash failures on some github action runs.

An example from actions/run/6066185479:

Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.296 [C] Platform------: *** FATAL ERROR: Caught signal: 11 (Segmentation fault)
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 0: /usr/sbin/otbr-agent(+0x424856) [0x56153e824856]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 1: /usr/sbin/otbr-agent(+0x4249b5) [0x56153e8249b5]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 2: /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10) [0x7f0cd73adf10]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 3: /lib/x86_64-linux-gnu/libc.so.6(+0x18e487) [0x7f0cd74fd487]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 4: /usr/lib/x86_64-linux-gnu/libstdc++.so.6 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::operator=(char const*)+0x14 [0xd7a9db64]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 5: /usr/sbin/otbr-agent(+0x52ebb4) [0x56153e92ebb4]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 6: /usr/sbin/otbr-agent otbr::AdvertisingProxy::PublishHostAndItsServices(otSrpServerHost const*, otbr::AdvertisingProxy::OutstandingUpdate*)+0x5ef [0x3e92f7ed]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 7: /usr/sbin/otbr-agent otbr::AdvertisingProxy::AdvertisingHandler(unsigned int, otSrpServerHost const*, unsigned int)+0xa0 [0x3e92e2ec]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 8: /usr/sbin/otbr-agent otbr::AdvertisingProxy::AdvertisingHandler(unsigned int, otSrpServerHost const*, unsigned int, void*)+0x3e [0x3e92e236]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: # 9: /usr/sbin/otbr-agent void ot::Callback<void (*)(unsigned int, otSrpServerHost const*, unsigned int, void*), (ot::CallbackContextPosition)1>::Invoke<unsigned int, ot::Srp::Server::Host*, unsigned int>(unsigned int&&, ot::Srp::Server::Host*&&, unsigned int&&) const+0x3c [0x3e8c84f4]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #10: /usr/sbin/otbr-agent ot::Srp::Server::InformUpdateHandlerOrCommit(otError, ot::Srp::Server::Host&, ot::Srp::Server::MessageMetadata const&)+0x4be [0x3e8c5696]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #11: /usr/sbin/otbr-agent ot::Srp::Server::HandleUpdate(ot::Srp::Server::Host&, ot::Srp::Server::MessageMetadata const&)+0x1dc [0x3e8c51bc]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #12: /usr/sbin/otbr-agent ot::Srp::Server::ProcessDnsUpdate(ot::Message&, ot::Srp::Server::MessageMetadata&)+0x1c8 [0x3e8c3156]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #13: /usr/sbin/otbr-agent ot::Srp::Server::ProcessMessage(ot::Message&, ot::Time, ot::Srp::Server::TtlConfig const&, ot::Srp::Server::LeaseConfig const&, ot::Ip6::MessageInfo const*)+0xfb [0x3e8c5d69]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #14: /usr/sbin/otbr-agent ot::Srp::Server::ProcessMessage(ot::Message&, ot::Ip6::MessageInfo const&)+0x4d [0x3e8c5c65]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #15: /usr/sbin/otbr-agent ot::Srp::Server::HandleUdpReceive(ot::Message&, ot::Ip6::MessageInfo const&)+0x2b [0x3e8c5be7]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #16: /usr/sbin/otbr-agent ot::Srp::Server::HandleUdpReceive(void*, otMessage*, otMessageInfo const*)+0x45 [0x3e8c5bb3]
Sep  3 19:00:55 9d019175891f otbr-agent[138]: 00:01:16.311 [C] Platform------: #17: /usr/sbin/otbr-agent ot::Ip6::Udp::SocketHandle::HandleUdpReceive(ot::Message&, ot::Ip6::MessageInfo const&)+0x31 [0x3e8cab97]

This happens in the SrpRegister500ServicesBR tests, where clients try to register multiple services.

Upon further investigation, I believe I have found the root cause of this issue:

  • The AdvertisingProxy::PublishHostAndItsServices() iterates over all services of the given aHost and calls Mdns::Publisher::PublishService() for each service.
  • We pass a callback to the PublishService() method to report back the result.
  • If there is an immediate failure, the PublishService() method can invoke the callback directly (before returning from PublishService()). This is possible for both MDNSResponder and Avahi implementations.
  • The callback used is the AdvertisingProxy::OnMdnsPublishResult() method, which, in case of an error, will itself call the OT API to report the failure of the "Service Update Handler". It will call otSrpServerHandleServiceUpdateResult() API, which will trigger OT SRP Sever to send a failure response to the SRP client and free the aHost.
  • The issue is that when there are multiple services in aHost, if PublishService() fails immediately for one of the first ones, the failure is reported all the way to the OT stack and aHost is freed while the OTBR AdvertisingProxy is still holding on the aHost pointer and trying to iterate and access the next services on an already freed aHost.
@abtink
Copy link
Member Author

abtink commented Sep 3, 2023

Some suggestions/ideas for fix (though none seem to be a quick/simple fix):

  • We can add a "tasklet" style mechanism so to process callback results after we are done iterating over all services in PublishHostAndItsServices().
  • Or add a new mechanism to track error from immediate callback invocation and abort/stop iterating/accessing the services in PublishHostAndItsServices().

Thoughts?

BTW, the new adv-proxy implementation in OT core (from openthread/openthread#9268) should not face similar issue since we use a Tasklet and we cover this is the test-cases (i.e., the immediate "register" callbacks).

@abtink
Copy link
Member Author

abtink commented Sep 4, 2023

Thinking more on this I feel it may be easier if we fix this in OT core and in SRP server code:

  • From HandleServiceUpdateResult() we find the matching update in mOutstandingUpdates list and remove it and add it to a new "completed updated" list.
  • Then we post a Tasklet which from its handler processes all the entries in the new "completed update" list.

This seems simpler than other approaches to fix this in otbr. @wgtdkp what do you think?

I will try to submit a PR with this later today or tomorrow.

@abtink
Copy link
Member Author

abtink commented Sep 4, 2023

@wgtdkp
Copy link
Member

wgtdkp commented Sep 5, 2023

Thinking more on this I feel it may be easier if we fix this in OT core and in SRP server code:

  • From HandleServiceUpdateResult() we find the matching update in mOutstandingUpdates list and remove it and add it to a new "completed updated" list.
  • Then we post a Tasklet which from its handler processes all the entries in the new "completed update" list.

This seems simpler than other approaches to fix this in otbr. @wgtdkp what do you think?

I will try to submit a PR with this later today or tomorrow.

Yeah, agree the SRP server side should guarantee that aHost should always be available before the service handler callback returns.

Or add a new mechanism to track error from immediate callback invocation and abort/stop iterating/accessing the services in PublishHostAndItsServices().

Though the SRP server fix will make it work even without this change in OTBR, I think is still a good idea to stop trying to publish the rest of the service if the first one already failed (by checking if this update is still in the mOutstandingUpdates). thoughts?

@abtink
Copy link
Member Author

abtink commented Sep 5, 2023

Yeah, agree the SRP server side should guarantee that aHost should always be available before the service handler callback returns.

Let me copy my thoughts/comment from openthread/openthread#9398: 😃

SRP server is already expecting this behavior. The issue is that the otbr advertising proxy is not following this rule. It is still accessing the aHost after itself invoked the handler callback (more details #1991).

The change here is not really fixing anything in SRP server. It makes following change related to lifecycle of aHost to make it easier for the next layer code:

Current expectation:

  • When otSrpServerServiceUpdateHandler() is called SRP server guarantees that aHost is valid util the callback is called or we return from otSrpServerServiceUpdateHandler() (whichever comes first).
  • This rule specifically applies even when the callback is called before returning from otSrpServerServiceUpdateHandler() call itself.

New expectation:(with this PR):

  • SRP server guarantees that aHost remains valid until otSrpServerServiceUpdateHandler() returns even if the callback is called before returning from otSrpServerServiceUpdateHandler.

The use of takelet is not to allow the next layer to keep a copy of the aHost, only that it remains valid util the otSrpServerServiceUpdateHandler() call itself returns.

As mentioned in #1991, we can consider fixing this issue in otbr (which may be a good idea to avoid doing extra work when we already know that advertising aHost has failed). However, such patterns of callbacks from the called handler function changing state are often tricky to get right. To be fair, this issue is not immediately obvious in the otbr code. If we have this issue in our otbr adv-proxy, it is likely that other implementations (RTOS-based adv-proxy) may make similar mistakes. That's why I think it is better to make it easier for next layer code by changing the expectation of SRP server.

@abtink
Copy link
Member Author

abtink commented Sep 6, 2023

Should be resolved by openthread/openthread#9398

@abtink abtink closed this as completed Sep 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants