coap_io.c: Fire KEEPALIVE_FAILURE when ping send fails with unanswered ping #1954
MaikVermeulen wants to merge 1 commit into obgm:develop
…d ping

When coap_session_send_ping_lkd() fails (e.g. sendto() returns ENETUNREACH because the network interface is down), the keepalive code skips the KEEPALIVE_FAILURE check via 'continue' and resets last_rx_tx to 'now'. This creates a zombie session: the session stays ESTABLISHED forever because each failed ping extends its life by another ping_timeout interval, and no failure event is ever reported to the application.

Fix the client-side and server-side keepalive paths to:

1. Check whether a previous ping went unanswered before continuing. If so, fire COAP_EVENT_KEEPALIVE_FAILURE (client) or call coap_session_server_keepalive_failed() (server) so the application or server can act on the dead session.
2. When there is no unanswered prior ping, rate-limit retries by preserving the last_rx_tx = now reset so the next attempt waits a full ping_timeout interval. This avoids spinning sendto() every I/O cycle while the network is down.
3. Add coap_log_debug() on ping send failure for diagnostics.

This was discovered in a production IoT deployment where a Thread RCP crash made sendto() return ENETUNREACH for all CoAP/DTLS sessions. The sessions stayed ESTABLISHED indefinitely with no recovery signal, causing permanent device disconnections.
31515b7 to 23fc667
I'm trying to reproduce this. I can reproduce something with the server continuing to send keep-alives, but not the client with the develop branch, but your changes make no difference. Are you seeing the issue with the client or server?
The underlying challenge here is that
Thanks @mrdeep1. We are using both libcoap clients and servers. The issue seems to be client-side, or at least that's where we made a workaround that seemed to work. We're using DTLS, for CoAP over Thread from POSIX FTD to ESP32 FTD nodes. I don't think we have enabled reconnect support.
Please try #1962 as an alternative: with that code I am able to keep pings retrying for a server, even if there is a network failure. As an FYI, the -Z option has been added to the examples coap-client and coap-server to simulate network failure.
Problem
When `coap_session_send_ping_lkd()` fails (e.g. `sendto()` returns `ENETUNREACH` because the network interface is down), the keepalive code in `coap_io_process_sessions()` skips the `KEEPALIVE_FAILURE` check via `continue` and resets `last_rx_tx` to `now`. This creates a zombie session: the session stays `ESTABLISHED` forever because:

- Each failed ping resets `last_rx_tx`, extending the session's life by another `ping_timeout` interval
- The `KEEPALIVE_FAILURE` check (`last_ping > 0 && last_pong < last_ping`) is never reached
- `coap_session_failed()` is a no-op when `reconnect_time` is not set (the default)

The session appears healthy from the application's perspective but cannot send or receive any data.
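The zombie-session loop described above can be illustrated with a small stand-alone model. This is only a sketch: `toy_session_t`, `toy_keepalive_buggy()` and `toy_run_network_down()` are hypothetical names, time is counted in abstract ticks, and the ordering of the send and the check is assumed from the description rather than copied from `coap_io.c`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the session fields named in the
 * description; this is NOT libcoap's coap_session_t. */
typedef struct {
  uint64_t last_rx_tx;     /* tick of last traffic (or last timer reset) */
  uint64_t last_ping;      /* tick of last ping sent, 0 if none */
  uint64_t last_pong;      /* tick of last pong received, 0 if none */
  bool established;        /* models the ESTABLISHED state */
  unsigned failure_events; /* models KEEPALIVE_FAILURE notifications */
} toy_session_t;

/* One keepalive pass with the pre-fix behaviour as described.
 * send_ok models whether the ping send succeeded. */
static void toy_keepalive_buggy(toy_session_t *s, uint64_t now,
                                uint64_t ping_timeout, bool send_ok) {
  assert(now >= s->last_rx_tx);
  if (!s->established || now - s->last_rx_tx < ping_timeout)
    return;                /* keepalive not due yet */
  if (!send_ok) {
    s->last_rx_tx = now;   /* failed send still extends the session */
    return;                /* models the 'continue' that skips the check */
  }
  if (s->last_ping > 0 && s->last_pong < s->last_ping) {
    s->failure_events++;   /* unanswered ping detected */
    s->established = false;
    return;
  }
  s->last_ping = now;
  s->last_rx_tx = now;
}

/* Run the model for `cycles` ticks with every send failing and
 * return how many failure events were raised. */
static unsigned toy_run_network_down(toy_session_t *s, uint64_t start,
                                     uint64_t cycles, uint64_t ping_timeout) {
  for (uint64_t t = start; t < start + cycles; t++)
    toy_keepalive_buggy(s, t, ping_timeout, false);
  return s->failure_events;
}
```

Starting from a session that already has an unanswered ping (`last_ping = 10`, `last_pong = 5`) and running 1000 ticks with every send failing, the model raises no failure event and the session stays established, which is the behaviour reported in this PR.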
Root cause
This was discovered in a production Thread/IoT deployment where a Radio Co-Processor (RCP) crash made `sendto()` return `ENETUNREACH` for all CoAP/DTLS sessions. The sessions stayed `ESTABLISHED` indefinitely with no recovery signal, causing permanent device disconnections that persisted until manual service restart.

Fix
For both the client-side and server-side keepalive paths:
1. Check for an unanswered previous ping before continuing. If `last_ping > 0 && last_pong < last_ping`, fire `COAP_EVENT_KEEPALIVE_FAILURE` (client) or call `coap_session_server_keepalive_failed()` (server). This gives the application a chance to detect and recover from the dead session.
2. Do not update `last_rx_tx` on the client path when the send failed. The send did not prove liveness, so resetting the timer just delays detection. Leaving `last_rx_tx` unchanged causes the next keepalive cycle to fire immediately on the next `coap_io_process()` call, giving rapid failure detection instead of waiting another full `ping_timeout` interval.

Behavior change
`last_rx_tx` reset | `coap_session_failed()` called (client w/ reconnect) or silent (no reconnect) | `last_rx_tx` reset | `KEEPALIVE_FAILURE` event fired

The `KEEPALIVE_FAILURE` event only fires when there was already an unanswered ping, meaning the session was unhealthy before the current send failure. This avoids false positives from a single transient send error.
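As a rough illustration of the client-side decision described in the Fix section, the following stand-alone model may help reviewers reason about the cases. It is a sketch under stated assumptions, not the patch itself: `ka_state_t`, `ka_result_t`, `ka_ping_unanswered()` and `ka_client_step()` are hypothetical names, time is in abstract ticks, and no libcoap functions are called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the three timestamps used by the keepalive
 * check; this is NOT libcoap's coap_session_t. */
typedef struct {
  uint64_t last_rx_tx;
  uint64_t last_ping;
  uint64_t last_pong;
} ka_state_t;

typedef enum {
  KA_NOT_DUE,     /* ping_timeout has not elapsed yet */
  KA_PING_SENT,   /* ping sent, timers updated */
  KA_RETRY_LATER, /* send failed, no earlier unanswered ping */
  KA_FAILURE      /* models COAP_EVENT_KEEPALIVE_FAILURE */
} ka_result_t;

/* The condition quoted in the PR description. */
static bool ka_ping_unanswered(const ka_state_t *s) {
  return s->last_ping > 0 && s->last_pong < s->last_ping;
}

/* One client-side keepalive pass following the described fix.
 * send_ok models whether the ping send succeeded. */
static ka_result_t ka_client_step(ka_state_t *s, uint64_t now,
                                  uint64_t ping_timeout, bool send_ok) {
  assert(now >= s->last_rx_tx);
  if (now - s->last_rx_tx < ping_timeout)
    return KA_NOT_DUE;
  if (!send_ok) {
    /* Fix step 1: report failure if an earlier ping is still unanswered. */
    if (ka_ping_unanswered(s))
      return KA_FAILURE;
    /* Fix step 2: leave last_rx_tx untouched so the next pass retries. */
    return KA_RETRY_LATER;
  }
  if (ka_ping_unanswered(s))
    return KA_FAILURE; /* the existing check, reached when the send works */
  s->last_ping = now;
  s->last_rx_tx = now;
  return KA_PING_SENT;
}
```

In this model a single failed send with no earlier ping yields `KA_RETRY_LATER` and leaves `last_rx_tx` unchanged, while a failed send that follows an unanswered ping yields `KA_FAILURE`. A session that has never sent a ping successfully keeps returning `KA_RETRY_LATER` for as long as sends fail.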