coap_io.c: Fire KEEPALIVE_FAILURE when ping send fails with unanswered ping by MaikVermeulen · Pull Request #1954 · obgm/libcoap

MaikVermeulen · 2026-04-10T05:52:08Z

Problem

When coap_session_send_ping_lkd() fails (e.g. sendto() returns ENETUNREACH because the network interface is down), the keepalive code in coap_io_process_sessions() skips the KEEPALIVE_FAILURE check via continue and resets last_rx_tx to now. This creates a zombie session: the session stays ESTABLISHED forever because:

- Each failed ping resets last_rx_tx, extending the session's life by another ping_timeout interval
- The KEEPALIVE_FAILURE check (last_ping > 0 && last_pong < last_ping) is never reached
- coap_session_failed() is a no-op when reconnect_time is not set (the default)
- No failure event is reported to the application

The session appears healthy from the application's perspective but cannot send or receive any data.
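The loop described above can be sketched as a small stand-alone model. This is not the code from coap_io.c: the struct, the function name and the `send_ok` flag are hypothetical, and only the field names (`last_rx_tx`, `last_ping`, `last_pong`) and the ordering of steps are taken from the description in this PR.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the keepalive fields of a session.
 * Field names follow the PR text; the struct itself is an assumption. */
typedef struct {
  uint64_t last_rx_tx;     /* time of last traffic, in ticks */
  uint64_t last_ping;      /* time the last ping was sent, 0 if none */
  uint64_t last_pong;      /* time the last pong was received */
  unsigned failure_events; /* number of KEEPALIVE_FAILURE reports */
} model_session_t;

/* One keepalive pass as the PR describes the pre-fix behaviour.
 * send_ok models whether the ping send succeeded. */
static void model_keepalive_before(model_session_t *s, uint64_t now,
                                   uint64_t ping_timeout, bool send_ok) {
  if (now - s->last_rx_tx < ping_timeout)
    return; /* keepalive not due yet */

  if (!send_ok) {
    /* Failed send: the timer is reset and the check below is skipped
     * (this models the 'continue' in the real loop). */
    s->last_rx_tx = now;
    return;
  }

  /* Only reached when the send succeeded. */
  if (s->last_ping > 0 && s->last_pong < s->last_ping)
    s->failure_events++;

  s->last_rx_tx = now;
  s->last_ping = now;
}
```

In this model, a session that already has an unanswered ping never reports a failure as long as every send fails: each pass only pushes `last_rx_tx` forward by one interval.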

Root cause

This was discovered in a production Thread/IoT deployment where a Radio Co-Processor (RCP) crash made sendto() return ENETUNREACH for all CoAP/DTLS sessions. The sessions stayed ESTABLISHED indefinitely with no recovery signal, causing permanent device disconnections that persisted until manual service restart.

Fix

For both the client-side and server-side keepalive paths:

1. Check for unanswered previous ping before continuing. If last_ping > 0 && last_pong < last_ping, fire COAP_EVENT_KEEPALIVE_FAILURE (client) or call coap_session_server_keepalive_failed() (server). This gives the application a chance to detect and recover from the dead session.
2. Do not update last_rx_tx on the client path when the send failed. The send did not prove liveness, so resetting the timer just delays detection. Leaving last_rx_tx unchanged causes the next keepalive cycle to fire immediately on the next coap_io_process() call, giving rapid failure detection instead of waiting another full ping_timeout interval.
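The two points above can be sketched for the client path as a stand-alone model. As before, this is an illustration under assumptions rather than the patch itself: the struct, the function name and the `send_ok` flag are hypothetical, and a counter stands in for firing COAP_EVENT_KEEPALIVE_FAILURE.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the keepalive fields of a session. */
typedef struct {
  uint64_t last_rx_tx;     /* time of last traffic, in ticks */
  uint64_t last_ping;      /* time the last ping was sent, 0 if none */
  uint64_t last_pong;      /* time the last pong was received */
  unsigned failure_events; /* stands in for COAP_EVENT_KEEPALIVE_FAILURE */
} model_session_t;

/* One client-side keepalive pass following the two points of the fix. */
static void model_keepalive_after(model_session_t *s, uint64_t now,
                                  uint64_t ping_timeout, bool send_ok) {
  bool unanswered = s->last_ping > 0 && s->last_pong < s->last_ping;

  if (now - s->last_rx_tx < ping_timeout)
    return; /* keepalive not due yet */

  if (!send_ok) {
    /* Point 1: report the failure if an earlier ping is still unanswered. */
    if (unanswered)
      s->failure_events++;
    /* Point 2: last_rx_tx is left unchanged, so the next pass is due
     * immediately instead of after another ping_timeout. */
    return;
  }

  /* Successful send: same check as before, then record the new ping. */
  if (unanswered)
    s->failure_events++;
  s->last_rx_tx = now;
  s->last_ping = now;
}
```

Note that in this model the failure is reported again on every pass until the application acts on the session, because the timer is never advanced while sends keep failing.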

Behavior change

| Scenario | Before | After |
| --- | --- | --- |
| Ping send fails, no prior unanswered ping | Silent, `last_rx_tx` reset | `coap_session_failed()` called (client w/ reconnect) or silent (no reconnect) |
| Ping send fails, prior ping unanswered | Silent, `last_rx_tx` reset | `KEEPALIVE_FAILURE` event fired |
| Ping send succeeds | Unchanged | Unchanged |

The KEEPALIVE_FAILURE event only fires when there was already an unanswered ping — meaning the session was unhealthy before the current send failure. This avoids false positives from a single transient send error.

…d ping

When coap_session_send_ping_lkd() fails (e.g. sendto() returns ENETUNREACH because the network interface is down), the keepalive code skips the KEEPALIVE_FAILURE check via 'continue' and resets last_rx_tx to 'now'. This creates a zombie session: the session stays ESTABLISHED forever because each failed ping extends its life by another ping_timeout interval, and no failure event is ever reported to the application.

Fix the client-side and server-side keepalive paths to:

1. Check whether a previous ping went unanswered before continuing. If so, fire COAP_EVENT_KEEPALIVE_FAILURE (client) or call coap_session_server_keepalive_failed() (server) so the application or server can act on the dead session.
2. When there is no unanswered prior ping, rate-limit retries by preserving the last_rx_tx = now reset so the next attempt waits a full ping_timeout interval. This avoids spinning sendto() every I/O cycle while the network is down.
3. Add coap_log_debug() on ping send failure for diagnostics.

This was discovered in a production IoT deployment where a Thread RCP crash made sendto() return ENETUNREACH for all CoAP/DTLS sessions. The sessions stayed ESTABLISHED indefinitely with no recovery signal, causing permanent device disconnections.

mrdeep1 · 2026-04-10T17:46:02Z

I'm trying to reproduce this. I can reproduce something with the server continuing to send keep-alives, but not the client with the develop branch, but your changes make no difference.

Are you seeing the issue with the client or server?
Are you using DTLS or not?
Are you using the reconnect support?
Is this in a Posix hosting environment or something like LwIP?

mrdeep1 · 2026-04-13T13:56:07Z

> I'm trying to reproduce this. I can reproduce something with the server continuing to send keep-alives,
> but not the client with the develop branch, but your changes make no difference.

The underlying challenge here is that last_ping is only set for a successfully initiated ping, so it cannot be used for handling network failures. Another way of doing this needs to be worked out.
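One reading of this point, sketched as a toy model: if `last_ping` only advances when a send succeeds, then a session whose last ping was answered (or that never pinged) can fail to send indefinitely without the `last_ping > 0 && last_pong < last_ping` test ever becoming true. All names below are hypothetical and the model ignores timers; it only tracks the two timestamps.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ping bookkeeping; field names follow the discussion. */
typedef struct {
  uint64_t last_ping; /* set only when a ping send succeeds, 0 if none */
  uint64_t last_pong; /* time the last pong was received */
} model_ping_state_t;

/* The unanswered-ping test quoted in the PR description. */
static bool model_ping_unanswered(const model_ping_state_t *s) {
  return s->last_ping > 0 && s->last_pong < s->last_ping;
}

/* A send attempt: last_ping moves forward only on success. */
static void model_try_ping(model_ping_state_t *s, uint64_t now, bool send_ok) {
  if (send_ok)
    s->last_ping = now;
}

/* Run 'attempts' failing sends, one per 'interval' ticks, and count how
 * often the unanswered-ping test would have flagged the session. */
static unsigned model_count_detections(model_ping_state_t *s, uint64_t start,
                                       unsigned attempts, uint64_t interval) {
  unsigned detected = 0;
  for (unsigned i = 0; i < attempts; i++) {
    uint64_t now = start + (uint64_t)i * interval;
    if (model_ping_unanswered(s))
      detected++;
    model_try_ping(s, now, false); /* network is down: every send fails */
  }
  return detected;
}
```

With a state where the last ping was answered before the outage, the count stays at zero however many attempts are made; only a state that already had an unanswered ping is ever flagged.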

MaikVermeulen · 2026-04-16T11:15:09Z

Thanks @mrdeep1,

We are using both libcoap clients and servers. The issue seems to be client-side, or at least that's where we made a workaround that seemed to work. We're using DTLS, and it's for CoAP over Thread from Posix FTD to ESP32 FTD nodes. I don't think we have enabled reconnect support.

mrdeep1 · 2026-04-16T14:09:51Z

Please try #1962 as an alternative as I am able to keep pings retrying, even if there is a network failure with this code for a server.

As an FYI, the -Z option has been added to the examples coap-client and coap-server to simulate network failure.

MaikVermeulen force-pushed the fix/keepalive-ping-send-failure branch from 31515b7 to 23fc667 April 10, 2026 06:18

MaikVermeulen marked this pull request as ready for review April 10, 2026 06:21

MaikVermeulen wants to merge 1 commit into obgm:develop from MaikVermeulen:fix/keepalive-ping-send-failure
