Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

Merged
merged 4 commits into from Mar 25, 2020

Conversation

technobly
Copy link
Member

@technobly technobly commented Mar 15, 2020

Problem

The loss of signal can cause the modem to draw higher currents and if the power supply is not sufficient, the modem power supply will brown out and the NCP client will become unresponsive to keep alive pings. [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead can be observed in the logs when this happens. This could be immediately after signal is lost, or approx. 5.2 minutes afterwards. This causes the AT command & Data muxer to go through a teardown, which in turn causes the NCP client to turn off the modem, but the PPP client doesn't know about this and keeps trying to retransmit a configure-request for IPCP. The modem will attempt to be turned on, the retransmitted configure-request will fail, and the modem initialization will immediately fail again before it turns on.

Solution

The root cause is that the NCP connection state is changed directly in the muxer callback function, but that does not get propagated to the NCP function that handles state changes, which would send an event to the PPP client to inform it that NCP is disconnected. This would prevent the failure loop discussed above from occurring.

The solution is to call connectionState() directly from the muxer callback to ensure we process events in the NCP client when NCP state is changed from the muxer callback. Note: we should not do this with the CONNECTED state, since that will cause the connectionState() function to call back into the muxer_ object. Fortunately right now this is not a requirement in the muxer callback.

Another problem area was found as well that does not seem to attribute to the fix here, but it should be noted as changed. disable() was being called before connectionState(NcpConnectionState::DISCONNECTED); below. I've swapped them and added a comment. They were also swapped in the muxer callback.

Screen Shot 2020-03-12 at 9 40 37 PM

Steps to Test

There are multiple ways to reproduce this issue, but given that it requires a power failure it can be tricky to reproduce.

One way, is to run the test application, wait for the device to connect initially to the tower and Particle Cloud. Disconnect the battery and run on USB power only. Ensure publishes are occurring every 5 seconds, then disconnect the antenna and wait for the device to show a similar muxer teardown error as below. After the error, the battery should be re-attached. The device should start into the failure loop as described, and reconnecting the antenna will not recover it.

Screen Shot 2020-03-12 at 2 46 29 PM

Another more complicated but more reliable way of reproducing the issue is to run only battery input only from a lab power supply (no USB) set to 3.6 ~ 3.8V and limit the supply current to 500mA.

Example App

See example app referenced in ch48825, but a threaded app that publishes data every 5 seconds should be similar enough to reproduce this issue.

References

ch48825


Completeness

  • User is totes amazing for contributing!
  • Contributor has signed CLA (Info here)
  • Problem and Solution clearly stated
  • Run unit/integration/application tests on device
  • N/A Added documentation
  • Added to CHANGELOG.md after merging (add links to docs and issues)

hal/src/boron/network/sara_ncp_client.h Outdated Show resolved Hide resolved
@@ -1130,12 +1137,14 @@ int SaraNcpClient::muxChannelStateCb(uint8_t channel, decltype(muxer_)::ChannelS
// Muxer stopped
self->disable();
self->connState_ = NcpConnectionState::DISCONNECTED;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is actually the main problem. We've changed the state manually, without generating the event and also preventing the NCP client to do so properly. Calling just disable() should be enough to trigger a call to enable() in the next loop invocation (in the PppNcpNetif), which in turn will call off(), which will call ncpState(), which will call connectionState().

We may potentially do the same thing similarly in https://github.com/particle-iot/device-os/pull/2049/files#diff-b3f74fd1e22ae53aae264584d668ea1cL363

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the other UBLOX_NCP_PPP_CHANNEL case in the muxer callback:

if (self->connState_ != NcpConnectionState::DISCONNECTED) {
    self->connState_ = NcpConnectionState::CONNECTING;
}

How would you propose we change/update this?

Initially I did simply remove self->connState_ = NcpConnectionState::DISCONNECTED; from the callback, and it seemed to resolve the issue... but when I checked what state was being seen by connectionState() on the next call that produced a PPP state change event, it was NcpConnectionState::CONNECTING, not DISCONNECTED. This didn't seem like the right way to resolve the issue. We want the shut down of the muxer to trigger a disable/shut down of the NCP and also the PPP client. Seeing a CONNECTING state didn't seem like the right way to do that.

Also, the ESP32 code calls the connectionState() function directly from the muxer callback. That would be a nice way to fix it as well, but it didn't seem like it was thread safe from the comments in the SARA muxer callback.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the ESP32 code calls the connectionState() function directly from the muxer callback. That would be a nice way to fix it as well, but it didn't seem like it was thread safe from the comments in the SARA muxer callback.

We can actually do that (why don't we do that)? It's safe to call with NcpConnectionState::DISCONNECTED, because that won't call into the muxer. Let's do that.

How would you propose we change/update this?

connectionState(NcpConnectionState::DISCONNECTED);

but when I checked what state was being seen by connectionState() on the next call that produced a PPP state change event, it was NcpConnectionState::CONNECTING, not DISCONNECTED

I'm not really sure where CONNECTING would be coming from, as that state is only generated by processEvents/checkRegistrationState, which would be bypassed in disabled state or registerNet(). And anyway, that state will trigger the teardown of the PPP session.

I'd also vote for calling just disable() in dataChannelWrite() on error conditions.

hal/src/b5som/network/quectel_ncp_client.cpp Outdated Show resolved Hide resolved
hal/src/b5som/network/quectel_ncp_client.cpp Show resolved Hide resolved
hal/src/boron/network/sara_ncp_client.cpp Outdated Show resolved Hide resolved
hal/src/boron/network/sara_ncp_client.cpp Outdated Show resolved Hide resolved
hal/src/b5som/network/quectel_ncp_client.cpp Outdated Show resolved Hide resolved
@technobly technobly force-pushed the fix/ch48825 branch 2 times, most recently from d26427b to 64ac79f Compare March 23, 2020 18:47
@avtolstoy avtolstoy self-requested a review March 24, 2020 12:22
@avtolstoy avtolstoy added ready to merge PR has been reviewed and tested and removed DO NOT MERGE needs review labels Mar 24, 2020
@avtolstoy avtolstoy requested a review from sergeuz March 24, 2020 23:32
@avtolstoy avtolstoy added needs review and removed ready to merge PR has been reviewed and tested labels Mar 24, 2020
@avtolstoy
Copy link
Member

avtolstoy commented Mar 24, 2020

PR updated. Removed connectionState() calls from a couple places to not hit potential race conditions/deadlocks. Test NETWORK_02_network_connection_recovers_after_ncp_failure in wiring/no_fixture_long_running test suite also has been added to simulate the failure that triggers this problem.

The issue this PR resolves is only reproducible on 2G/3G Borons or BSoMs due to the fact that the GSM0710 muxer keeplive timeout is lower than the PPP keepalive timeout: https://github.com/particle-iot/device-os/blob/develop/hal/src/boron/network/sara_ncp_client.cpp#L933

TODO: just in case in the future we could increase the timeouts to be the same as on R4-based Borons.

Copy link
Member

@sergeuz sergeuz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm glad to see we decided not to introduce additional state variables 👍

@avtolstoy avtolstoy deleted the fix/ch48825 branch March 25, 2020 14:42
@avtolstoy avtolstoy added ready to merge PR has been reviewed and tested and removed needs review labels Mar 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug ready to merge PR has been reviewed and tested
Projects
None yet
3 participants