Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

technobly · 2020-03-15T16:56:25Z

Problem

The loss of signal can cause the modem to draw higher currents and if the power supply is not sufficient, the modem power supply will brown out and the NCP client will become unresponsive to keep alive pings. [gsm0710muxer] ERROR: The other end has not replied to keep alives (TESTs) 5 times, considering muxed connection dead can be observed in the logs when this happens. This could be immediately after signal is lost, or approx. 5.2 minutes afterwards. This causes the AT command & Data muxer to go through a teardown, which in turn causes the NCP client to turn off the modem, but the PPP client doesn't know about this and keeps trying to retransmit a configure-request for IPCP. The modem will attempt to be turned on, the retransmitted configure-request will fail, and the modem initialization will immediately fail again before it turns on.

Solution

The root cause is that the NCP connection state is changed directly in the muxer callback function, but that does not get propagated to the NCP function that handles state changes, which would send an event to the PPP client to inform it that NCP is disconnected. This would prevent the failure loop discussed above from occurring.

The solution is to call connectionState() directly from the muxer callback to ensure we process events in the NCP client when NCP state is changed from the muxer callback. Note: we should not do this with the CONNECTED state, since that will cause the connectionState() function to call back into the muxer_ object. Fortunately right now this is not a requirement in the muxer callback.

Another problem area was found as well that does not seem to attribute to the fix here, but it should be noted as changed. disable() was being called before connectionState(NcpConnectionState::DISCONNECTED); below. I've swapped them and added a comment. They were also swapped in the muxer callback.

Steps to Test

There are multiple ways to reproduce this issue, but given that it requires a power failure it can be tricky to reproduce.

One way, is to run the test application, wait for the device to connect initially to the tower and Particle Cloud. Disconnect the battery and run on USB power only. Ensure publishes are occurring every 5 seconds, then disconnect the antenna and wait for the device to show a similar muxer teardown error as below. After the error, the battery should be re-attached. The device should start into the failure loop as described, and reconnecting the antenna will not recover it.

Another more complicated but more reliable way of reproducing the issue is to run only battery input only from a lab power supply (no USB) set to 3.6 ~ 3.8V and limit the supply current to 500mA.

Example App

See example app referenced in ch48825, but a threaded app that publishes data every 5 seconds should be similar enough to reproduce this issue.

References

ch48825

Completeness

User is totes amazing for contributing!
Contributor has signed CLA (Info here)
Problem and Solution clearly stated
Run unit/integration/application tests on device
N/A Added documentation
Added to CHANGELOG.md after merging (add links to docs and issues)

hal/src/boron/network/sara_ncp_client.cpp

hal/src/boron/network/sara_ncp_client.h

avtolstoy · 2020-03-19T12:19:02Z

hal/src/boron/network/sara_ncp_client.cpp

@@ -1130,12 +1137,14 @@ int SaraNcpClient::muxChannelStateCb(uint8_t channel, decltype(muxer_)::ChannelS
                // Muxer stopped
                self->disable();
                self->connState_ = NcpConnectionState::DISCONNECTED;


I think this is actually the main problem. We've changed the state manually, without generating the event and also preventing the NCP client to do so properly. Calling just disable() should be enough to trigger a call to enable() in the next loop invocation (in the PppNcpNetif), which in turn will call off(), which will call ncpState(), which will call connectionState().

We may potentially do the same thing similarly in https://github.com/particle-iot/device-os/pull/2049/files#diff-b3f74fd1e22ae53aae264584d668ea1cL363

What about the other UBLOX_NCP_PPP_CHANNEL case in the muxer callback:

if (self->connState_ != NcpConnectionState::DISCONNECTED) { self->connState_ = NcpConnectionState::CONNECTING; }

How would you propose we change/update this?

Initially I did simply remove self->connState_ = NcpConnectionState::DISCONNECTED; from the callback, and it seemed to resolve the issue... but when I checked what state was being seen by connectionState() on the next call that produced a PPP state change event, it was NcpConnectionState::CONNECTING, not DISCONNECTED. This didn't seem like the right way to resolve the issue. We want the shut down of the muxer to trigger a disable/shut down of the NCP and also the PPP client. Seeing a CONNECTING state didn't seem like the right way to do that.

Also, the ESP32 code calls the connectionState() function directly from the muxer callback. That would be a nice way to fix it as well, but it didn't seem like it was thread safe from the comments in the SARA muxer callback.

Also, the ESP32 code calls the connectionState() function directly from the muxer callback. That would be a nice way to fix it as well, but it didn't seem like it was thread safe from the comments in the SARA muxer callback.

We can actually do that (why don't we do that)? It's safe to call with NcpConnectionState::DISCONNECTED, because that won't call into the muxer. Let's do that.

How would you propose we change/update this?

connectionState(NcpConnectionState::DISCONNECTED);

but when I checked what state was being seen by connectionState() on the next call that produced a PPP state change event, it was NcpConnectionState::CONNECTING, not DISCONNECTED

I'm not really sure where CONNECTING would be coming from, as that state is only generated by processEvents/checkRegistrationState, which would be bypassed in disabled state or registerNet(). And anyway, that state will trigger the teardown of the PPP session.

I'd also vote for calling just disable() in dataChannelWrite() on error conditions.

hal/src/b5som/network/quectel_ncp_client.cpp

hal/src/boron/network/sara_ncp_client.cpp

hal/src/b5som/network/quectel_ncp_client.cpp

avtolstoy · 2020-03-24T23:35:56Z

PR updated. Removed connectionState() calls from a couple places to not hit potential race conditions/deadlocks. Test NETWORK_02_network_connection_recovers_after_ncp_failure in wiring/no_fixture_long_running test suite also has been added to simulate the failure that triggers this problem.

The issue this PR resolves is only reproducible on 2G/3G Borons or BSoMs due to the fact that the GSM0710 muxer keeplive timeout is lower than the PPP keepalive timeout: https://github.com/particle-iot/device-os/blob/develop/hal/src/boron/network/sara_ncp_client.cpp#L933

TODO: just in case in the future we could increase the timeouts to be the same as on R4-based Borons.

sergeuz

I'm glad to see we decided not to introduce additional state variables 👍

…h48825]

…ction_recovers_after_ncp_failure

… cellular platforms

technobly added bug needs review labels Mar 15, 2020

technobly added this to the 1.5.0 milestone Mar 15, 2020

technobly requested a review from avtolstoy March 15, 2020 16:56

technobly commented Mar 15, 2020

View reviewed changes

hal/src/boron/network/sara_ncp_client.cpp Show resolved Hide resolved

avtolstoy requested changes Mar 19, 2020

View reviewed changes

technobly force-pushed the fix/ch48825 branch from 1c7fb85 to 6beddfa Compare March 23, 2020 17:26

technobly requested a review from avtolstoy March 23, 2020 17:32

avtolstoy requested changes Mar 23, 2020

View reviewed changes

technobly force-pushed the fix/ch48825 branch 2 times, most recently from d26427b to 64ac79f Compare March 23, 2020 18:47

avtolstoy approved these changes Mar 23, 2020

View reviewed changes

avtolstoy self-requested a review March 24, 2020 12:22

avtolstoy added the DO NOT MERGE label Mar 24, 2020

avtolstoy force-pushed the fix/ch48825 branch from 64ac79f to 5962309 Compare March 24, 2020 23:32

avtolstoy added ready to merge PR has been reviewed and tested and removed DO NOT MERGE needs review labels Mar 24, 2020

avtolstoy requested a review from sergeuz March 24, 2020 23:32

avtolstoy added needs review and removed ready to merge PR has been reviewed and tested labels Mar 24, 2020

avtolstoy approved these changes Mar 24, 2020

View reviewed changes

sergeuz approved these changes Mar 25, 2020

View reviewed changes

technobly and others added 4 commits March 25, 2020 21:40

Ensure NCP connection state is processed when changed by the muxer [c…

76594af

…h48825]

[ncp] it's not safe to call connectionState() under certain conditions

0ac942d

[tests] wiring/no_fixture_long_running: adds NETWORK_02_network_conne…

2ca2177

…ction_recovers_after_ncp_failure

[argon] ncp: set baudrate in initReady() to behave in the same way as…

40b8b08

… cellular platforms

avtolstoy force-pushed the fix/ch48825 branch from 1db2d0a to 40b8b08 Compare March 25, 2020 14:41

avtolstoy merged commit ccdbe1e into develop Mar 25, 2020

avtolstoy deleted the fix/ch48825 branch March 25, 2020 14:42

avtolstoy added ready to merge PR has been reviewed and tested and removed needs review labels Mar 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

technobly commented Mar 15, 2020 •

edited

avtolstoy Mar 19, 2020

technobly Mar 19, 2020

avtolstoy Mar 19, 2020

avtolstoy commented Mar 24, 2020 •

edited

sergeuz left a comment •

edited

Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

Ensure NCP connection state is processed when changed by the muxer [ch48825] #2049

Conversation

technobly commented Mar 15, 2020 • edited

Problem

Solution

Steps to Test

Example App

References

Completeness

avtolstoy Mar 19, 2020

Choose a reason for hiding this comment

technobly Mar 19, 2020

Choose a reason for hiding this comment

avtolstoy Mar 19, 2020

Choose a reason for hiding this comment

avtolstoy commented Mar 24, 2020 • edited

sergeuz left a comment • edited

Choose a reason for hiding this comment

technobly commented Mar 15, 2020 •

edited

avtolstoy commented Mar 24, 2020 •

edited

sergeuz left a comment •

edited