Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crit-err 3000001 on retransmiting CASE Sigma2 #21671

Closed
Kxuan opened this issue Aug 5, 2022 · 9 comments · Fixed by #22418
Closed

Crit-err 3000001 on retransmiting CASE Sigma2 #21671

Kxuan opened this issue Aug 5, 2022 · 9 comments · Fixed by #22418
Assignees

Comments

@Kxuan
Copy link
Contributor

Kxuan commented Aug 5, 2022

Problem

In CASE process (during device commissioning), the comissionee should reply Sigma2 to comissioner.
But there is a little chance that the comissionee failed sent Sigma2 message.

It happens with a very low chance (<1%), but we have catched the problem twice. Both of them have exactly same phenomenon:

  1. Sent Sigma2 msg first time, no error reported, but the comissioner does not received
  2. Retransmit Sigma2 after 360ms, no error reported, but the comissioner still does not received
  3. Retransmit Sigma2 third times, an error Crit-err 3000001 reported, retransmition stopped, and the comissioner still does not received

After that, device commissioning failed due to timeout in the comissioner side.

The error code 3000001 means ERR_MEM, but some log messages showed that free heap size is enough to handle network message.

I (25176) chip[IN]: Prepared unauthenticated message 0x3ffc6af4 to 0x0000000000000000 (0)  of type 0x31 and protocolId (0, 0) on exchange 5359r with MessageCounter:112355573.
I (25196) chip[IN]: Sending unauthenticated msg 0x3ffc6af4 with MessageCounter:112355573 to 0x0000000000000000 at monotonic time: 0000000000005F7E msec
I (25206) chip[SC]: Sent Sigma2 msg
E (25206) chip[DL]: Long dispatch time: 1400 ms, for event type 3
I (25216) chip[DL]: IP_EVENT_GOT_IP6
I (25226) chip[DL]: IPv6 addr available. Ready on WIFI_STA_DEF interface: fe80:0000:0000:0000:260a:c4ff:fe49:41b4
I (25236) app-devicecallbacks: Current free heap: 96384

I (25236) chip[DIS]: Updating services using commissioning mode 0
I (25246) chip[DIS]: CHIP minimal mDNS started advertising.
I (25256) chip[DIS]: Advertise operational node 2DD462755BFF7F02-061ABCD5F72C7C00
I (25266) chip[DIS]: CHIP minimal mDNS configured as 'Operational device'.
I (25286) chip[DIS]: mDNS service published: _matter._tcp
I (25286) app-devicecallbacks: Current free heap: 94084

I (25386) CHIP[DL]: Write request received for CHIPoBLE RX characteristic con 0 12
I (25556) chip[IN]: Sending unauthenticated msg 0x3ffc6af4 with MessageCounter:112355573 to 0x0000000000000000 at monotonic time: 00000000000060E6 msec
I (25896) chip[IN]: Sending unauthenticated msg 0x3ffc6af4 with MessageCounter:112355573 to 0x0000000000000000 at monotonic time: 000000000000623F msec
E (25916) chip[EM]: Crit-err 3000001 when sending CHIP MessageCounter:112355573 on exchange 5359r, send tries: 2
@bzbarsky-apple
Copy link
Contributor

We've been running into this in our testing as well, with esp32 devices. @shubhamdp @dhrishi

@dhrishi dhrishi self-assigned this Aug 9, 2022
@dhrishi
Copy link
Contributor

dhrishi commented Aug 9, 2022

We will take that up soon. cc @shubhamdp

@shubhamdp
Copy link
Contributor

I tried on ESP32-DevKit-V4 with SHA e556daa (branch:sve) and 608b143(branch:master), I was able to commission device without the above mentioned error.

Can you please share the specific steps to recreate this issue.

@bzbarsky-apple
Copy link
Contributor

@shubhamdp See this part from the issue description:

It happens with a very low chance (<1%)

@ryan-ma
Copy link
Contributor

ryan-ma commented Sep 1, 2022

Hi everyone. The error code 3000001 ERR_MEM issue is likely caused by running out of TX buffer in the ESP32 WiFi driver.

This issue happens more frequently if you set CONFIG_ESP32_WIFI_DYNAMIC_TX_BUFFER_NUM too low and run a fews multi-admin commissioning, which will send a lot of UDP messages in short time.

Try to set CONFIG_ESP32_WIFI_DYNAMIC_TX_BUFFER_NUM to 32 or more in menuconfig.

To fully solve the issue, a better error handling for error code 3000001 is needed.

@bzbarsky-apple
Copy link
Contributor

@ryan-ma Interesting. So this is a transient failure and the right thing to do is to try again (via MRP)?

We have something like that already inExchangeMessageDispatch::SendMessage for the ENOBUFS POSIX error. So maybe we should add this there too?

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 1, 2022
That can be caused by a transient lack of TX buffers.  We should just
try again via MRP.

project-chip#21671
bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 1, 2022
That can be caused by a transient lack of TX buffers.  We should just
try again via MRP.

Fixes project-chip#21671
@bzbarsky-apple
Copy link
Contributor

Could someone who is running into this try bzbarsky-apple@8f52341 and let me know whether that seems to help?

bzbarsky-apple added a commit to bzbarsky-apple/connectedhomeip that referenced this issue Sep 6, 2022
That can be caused by a transient lack of TX buffers.  We should just
try again via MRP.

Fixes project-chip#21671
bzbarsky-apple added a commit that referenced this issue Sep 6, 2022
* Treat an ERR_MEM on send as non-fatal on LwIP.

That can be caused by a transient lack of TX buffers.  We should just
try again via MRP.

Fixes #21671

* Address review comment.
@ryan-ma
Copy link
Contributor

ryan-ma commented Sep 8, 2022

Sorry for the late reply. I checked the fix. It looks good.
I try to run multi-admin with CONFIG_ESP32_WIFI_DYNAMIC_TX_BUFFER_NUM 6.
For most case, the send message can successfully send out after one retry.

I (195906) chip[ZCL]: OpCreds: successfully created fabric index 0x3 via AddNOC
I (195916) chip[IN]: Prepared secure message 0x3ffcb7ac to 0xFFFFFFFB00000000 (3)  of type 0x9 and protocolId (0, 1) on exchange 56655r with MessageCounter:258993561.
I (195926) chip[IN]: Sending encrypted msg 0x3ffcb7ac with MessageCounter:258993561 to 0xFFFFFFFB00000000 (3) at monotonic time: 000000000002FCA3 msec
E (195946) chip[EM]: Ignoring transient send error: 3000001 on exchange 56655r
E (195956) chip[DL]: Long dispatch time: 2616 ms, for event type 3
I (195956) chip[EM]: Received message of type 0x10 with protocolId (0, 0) and MessageCounter:35158570 on exchange 56654r
I (196346) chip[IN]: Sending encrypted msg 0x3ffcb7ac with MessageCounter:258993561 to 0xFFFFFFFB00000000 (3) at monotonic time: 000000000002FE46 msec
I (196426) chip[EM]: Received message of type 0x10 with protocolId (0, 0) and MessageCounter:35158571 on exchange 56655r
I (196666) chip[EM]: Received message of type 0x30 with protocolId (0, 0) and MessageCounter:113912559 on exchange 56656r

@bzbarsky-apple
Copy link
Contributor

@ryan-ma Thank you for confirming that! It's good to know this actually fixes the problem in practice.

isiu-apple pushed a commit to isiu-apple/connectedhomeip that referenced this issue Sep 16, 2022
* Treat an ERR_MEM on send as non-fatal on LwIP.

That can be caused by a transient lack of TX buffers.  We should just
try again via MRP.

Fixes project-chip#21671

* Address review comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants