coap_io.c: Fire KEEPALIVE_FAILURE when ping send fails with unanswered ping #1954

Open
MaikVermeulen wants to merge 1 commit into obgm:develop from MaikVermeulen:fix/keepalive-ping-send-failure

Conversation

@MaikVermeulen

Problem

When coap_session_send_ping_lkd() fails (e.g. sendto() returns ENETUNREACH because the network interface is down), the keepalive code in coap_io_process_sessions() skips the KEEPALIVE_FAILURE check via continue and resets last_rx_tx to now. This creates a zombie session: the session stays ESTABLISHED forever because:

  1. Each failed ping resets last_rx_tx, extending the session's life by another ping_timeout interval
  2. The KEEPALIVE_FAILURE check (last_ping > 0 && last_pong < last_ping) is never reached
  3. coap_session_failed() is a no-op when reconnect_time is not set (the default)
  4. No failure event is reported to the application

The session appears healthy from the application's perspective but cannot send or receive any data.
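
The loop described above can be modelled in a few lines. This is a minimal sketch that assumes the control flow is exactly as stated in this description; `zombie_session_t`, `keepalive_pass_before()` and `simulate_outage_before()` are hypothetical names for illustration and are not libcoap code or API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the keepalive fields of a session; not the
 * real coap_session_t. Times are plain seconds. */
typedef struct {
    uint64_t last_rx_tx;      /* last traffic, or last timer reset */
    uint64_t last_ping;       /* last successfully sent ping, 0 = none */
    uint64_t last_pong;       /* last received pong, 0 = none */
    unsigned failure_events;  /* KEEPALIVE_FAILURE events fired so far */
} zombie_session_t;

/* One pre-fix keepalive pass, as the Problem section describes it.
 * send_ok models whether the ping send succeeded. */
void keepalive_pass_before(zombie_session_t *s, uint64_t now,
                           uint64_t ping_timeout, bool send_ok) {
    if (now < s->last_rx_tx + ping_timeout)
        return;                    /* keepalive not due yet */
    if (!send_ok) {
        s->last_rx_tx = now;       /* timer reset although nothing was sent */
        return;                    /* models the 'continue': check skipped */
    }
    if (s->last_ping > 0 && s->last_pong < s->last_ping)
        s->failure_events++;       /* previous ping was never answered */
    s->last_ping = now;
    s->last_rx_tx = now;
}

/* Run `passes` consecutive passes, one per ping_timeout, with every send
 * failing; returns the number of failure events recorded afterwards. */
unsigned simulate_outage_before(zombie_session_t *s, uint64_t start,
                                uint64_t ping_timeout, unsigned passes) {
    for (unsigned i = 0; i < passes; i++)
        keepalive_pass_before(s, start + (uint64_t)i * ping_timeout,
                              ping_timeout, false);
    return s->failure_events;
}
```

In this model, one successful ping followed by any number of failing sends leaves `failure_events` at zero while `last_rx_tx` keeps advancing, which is the zombie behaviour reported here.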

Background

This was discovered in a production Thread/IoT deployment where a Radio Co-Processor (RCP) crash made sendto() return ENETUNREACH for all CoAP/DTLS sessions. The sessions stayed ESTABLISHED indefinitely with no recovery signal, causing permanent device disconnections that persisted until manual service restart.

Fix

For both the client-side and server-side keepalive paths:

  1. Check for unanswered previous ping before continuing. If last_ping > 0 && last_pong < last_ping, fire COAP_EVENT_KEEPALIVE_FAILURE (client) or call coap_session_server_keepalive_failed() (server). This gives the application a chance to detect and recover from the dead session.

  2. Do not update last_rx_tx on the client path when the send failed. The send did not prove liveness — resetting the timer just delays detection. Leaving last_rx_tx unchanged causes the next keepalive cycle to fire immediately on the next coap_io_process() call, giving rapid failure detection instead of waiting another full ping_timeout interval.
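
The two points above can be sketched as a simplified model of the client path. It is not the actual patch: `fixed_session_t` and `keepalive_pass_after()` are hypothetical names, the event is reduced to a counter, and times are plain seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the keepalive fields of a session. */
typedef struct {
    uint64_t last_rx_tx;      /* last traffic, or last timer reset */
    uint64_t last_ping;       /* last successfully sent ping, 0 = none */
    uint64_t last_pong;       /* last received pong, 0 = none */
    unsigned failure_events;  /* KEEPALIVE_FAILURE events fired so far */
} fixed_session_t;

/* One keepalive pass with the proposed behaviour.
 * Returns true when a keepalive failure was reported on this pass. */
bool keepalive_pass_after(fixed_session_t *s, uint64_t now,
                          uint64_t ping_timeout, bool send_ok) {
    if (now < s->last_rx_tx + ping_timeout)
        return false;              /* keepalive not due yet */

    bool unanswered = s->last_ping > 0 && s->last_pong < s->last_ping;

    if (!send_ok) {
        if (unanswered) {
            s->failure_events++;   /* point 1: report the dead session */
            return true;
        }
        /* point 2: last_rx_tx is left untouched, so the keepalive is
         * still due on the very next pass */
        return false;
    }

    if (unanswered)
        s->failure_events++;       /* check on the successful-send path */
    s->last_ping = now;
    s->last_rx_tx = now;
    return unanswered;
}
```

With a ping sent at t=30 and never answered, a failing send at t=60 reports the failure and leaves `last_rx_tx` at 30. On a session that has never sent a ping successfully, failing sends report nothing and the keepalive stays due on every pass.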

Behavior change

| Scenario | Before | After |
| --- | --- | --- |
| Ping send fails, no prior unanswered ping | Silent, last_rx_tx reset | coap_session_failed() called (client w/ reconnect) or silent (no reconnect) |
| Ping send fails, prior ping unanswered | Silent, last_rx_tx reset | KEEPALIVE_FAILURE event fired |
| Ping send succeeds | Unchanged | Unchanged |

The KEEPALIVE_FAILURE event only fires when there was already an unanswered ping — meaning the session was unhealthy before the current send failure. This avoids false positives from a single transient send error.

…d ping

When coap_session_send_ping_lkd() fails (e.g. sendto() returns
ENETUNREACH because the network interface is down), the keepalive code
skips the KEEPALIVE_FAILURE check via 'continue' and resets last_rx_tx
to 'now'.  This creates a zombie session: the session stays ESTABLISHED
forever because each failed ping extends its life by another
ping_timeout interval, and no failure event is ever reported to the
application.

Fix the client-side and server-side keepalive paths to:

1. Check whether a previous ping went unanswered before continuing.
   If so, fire COAP_EVENT_KEEPALIVE_FAILURE (client) or call
   coap_session_server_keepalive_failed() (server) so the application
   or server can act on the dead session.

2. When there is no unanswered prior ping, rate-limit retries by
   preserving the last_rx_tx = now reset so the next attempt waits
   a full ping_timeout interval.  This avoids spinning sendto()
   every I/O cycle while the network is down.

3. Add coap_log_debug() on ping send failure for diagnostics.

This was discovered in a production IoT deployment where a Thread RCP
crash made sendto() return ENETUNREACH for all CoAP/DTLS sessions.
The sessions stayed ESTABLISHED indefinitely with no recovery signal,
causing permanent device disconnections.
MaikVermeulen force-pushed the fix/keepalive-ping-send-failure branch from 31515b7 to 23fc667 on April 10, 2026 06:18
MaikVermeulen marked this pull request as ready for review on April 10, 2026 06:21
@mrdeep1
Collaborator

mrdeep1 commented Apr 10, 2026

I'm trying to reproduce this. I can reproduce something with the server continuing to send keep-alives, but not the client with the develop branch, but your changes make no difference.

Are you seeing the issue with the client or server?
Are you using DTLS or not?
Are you using the reconnect support?
Is this in a Posix hosting environment or something like LwIP?

@mrdeep1
Collaborator

mrdeep1 commented Apr 13, 2026

> I'm trying to reproduce this. I can reproduce something with the server continuing to send keep-alives,
> but not the client with the develop branch, but your changes make no difference.

The underlying challenge here is that last_ping is only set for a successfully initiated ping, so it cannot be used to detect network failures. Another way of doing this needs to be worked out.
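
A small model makes this limitation concrete. It assumes, as stated above, that last_ping only advances when a ping is successfully initiated; `prior_ping_unanswered()` and `record_ping_attempt()` are hypothetical helper names, not libcoap functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* The check quoted in the PR description. */
bool prior_ping_unanswered(uint64_t last_ping, uint64_t last_pong) {
    return last_ping > 0 && last_pong < last_ping;
}

/* Hypothetical model: last_ping only advances when the send succeeds. */
uint64_t record_ping_attempt(uint64_t last_ping, uint64_t now, bool send_ok) {
    return send_ok ? now : last_ping;
}
```

If the last successful ping (say at t=30) was answered (pong at t=31) before the network went down, every later failed attempt leaves last_ping at 30, so the check stays false for the whole outage. The same holds for a session whose very first ping fails, because last_ping is still 0.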

@MaikVermeulen
Author

Thanks @mrdeep1,

We are using both libcoap clients and servers. The issue seems to be client-side, or at least that's where we made a workaround that seemed to work. We're using DTLS, and it's for CoAP over Thread from a POSIX FTD to ESP32 FTD nodes. I don't think we have enabled reconnect support.

@mrdeep1
Collaborator

mrdeep1 commented Apr 16, 2026

Please try #1962 as an alternative. With that code I am able to keep pings retrying for a server, even if there is a network failure.

As an FYI, the -Z option has been added to the coap-client and coap-server examples to simulate network failure.
