Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry timeout do not work as expected #197

Closed
mbaril opened this issue Jun 5, 2018 · 7 comments
Closed

Retry timeout do not work as expected #197

mbaril opened this issue Jun 5, 2018 · 7 comments

Comments

@mbaril
Copy link

mbaril commented Jun 5, 2018

I think the retry timeout do not work correctly when multiple CONF messages are sent in parallel.

Test procedure:

Step 1: I run the CoAP server example (coap-server)
Step 2: I create a CoAP observer on the resource /time.
Step 3: I kill the CoAP observer process

Test #1: OK
I use COAP_RESOURCE_CHECK_TIME of 120 and a COAP_OBS_MAX_FAIL is equal to 0.

Messages sent by server after step 3:
image

After 4 retries the CoAP server remove the observer entry as expected.
In this test the delay between each retries are correct.

Test #2: Delay between retries are not good

I use COAP_RESOURCE_CHECK_TIME of 5 sec and a COAP_OBS_MAX_FAIL is equal to 0.

Messages sent by server after step 3:
image

This time, the delay between retries do not seems to be correct.

MsgID= 29557.
Retry 1 at 2.2 sec (OK)
Retry 2 at 7.3 sec (OK)
Retry 3 at 12.1 sec (Not ok. Should be approx. 15.4 sec)
Retry #4 at 15.4 sec (Not ok.. Should approx. 33 sec)

MsgID= 29558.
Retry 1 at 3.0 sec (OK)
Retry 2 at 7.8 sec (Not ok. Should be approx. 9 sec)
Retry 3 at 9.8 sec (Not ok. Should be approx. 21 sec)
Retry #4 at 12.8 sec sec (Not ok.. Should approx. 45 sec)

@mbaril
Copy link
Author

mbaril commented Jun 5, 2018

Coap server logs
serverLogs.zip

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Jun 6, 2018

With test 1, there is only one outstanding unsolicited Observe response waiting for an ACK from the client, and so the retry counters are as expected.

With test 2, the server continues to send additional unsolicited Observe responses at the repeat frequency (in your case every 5 secs) until the logic times things completely out and drops the observer. So in your test 2 you have 4 outstanding Observe responses waiting for an ACK before observer is dropped.

The issue here is that coap_insert_node() is biasing the timeouts (making it a delta time between 2 entries) of a queue of outstanding requests, making the re-transmit times shorter as this delta bias not re-adjusted when the head of the queue is taken off and processed (because of the nextpdu->t = 0, and so coap_pop_next(ctx) does not update the delta).

Fixing coap_insert_node() by not making it time deltas fixes your issue, but causes the sendqueue tests to blow up.

diff --git a/src/net.c b/src/net.c
index c7cbfa6..526e833 100644
--- a/src/net.c
+++ b/src/net.c
@@ -201,21 +201,16 @@ coap_insert_node(coap_queue_t **queue, coap_queue_t *node) {
   if (node->t < q->t) {
     node->next = q;
     *queue = node;
-    q->t -= node->t;           /* make q->t relative to node->t */
     return 1;
   }
 
   /* search for right place to insert */
   do {
-    node->t -= q->t;           /* make node-> relative to q->t */
     p = q;
     q = q->next;
   } while (q && q->t <= node->t);
 
   /* insert new item */
-  if (q) {
-    q->t -= node->t;           /* make q->t relative to node->t */
-  }
   node->next = q;
   p->next = node;
   return 1;

Fixing coap_write() to keep the bias and cause it to get added to the first entry to the queue also works, but also the LWIP coap_retransmittimer_execute() function needs to get fixed - I suspect in the same way, but I do not have a LWIP version to test against. The CONTIKI version appears to have no issues here from code inspection.

diff --git a/src/coap_io.c b/src/coap_io.c
index 174d239..265bd0b 100644
--- a/src/coap_io.c
+++ b/src/coap_io.c
@@ -1093,7 +1093,6 @@ coap_write(coap_context_t *ctx,
   nextpdu = coap_peek_next(ctx);
 
   while (nextpdu && now >= ctx->sendqueue_basetime && nextpdu->t <= now - ctx->sendqueue_basetime) {
-    nextpdu->t = 0;
     coap_retransmit(ctx, coap_pop_next(ctx));
     nextpdu = coap_peek_next(ctx);
   }
diff --git a/src/net.c b/src/net.c
index c7cbfa6..c035e34 100644
--- a/src/net.c
+++ b/src/net.c
@@ -2304,7 +2304,6 @@ static void coap_retransmittimer_execute(void *arg) {
       break;
     } else {
       elapsed -= nextinqueue->t;
-      nextinqueue->t = 0;
       coap_retransmit(ctx, coap_pop_next(ctx));
       nextinqueue = coap_peek_next(ctx);
     }

@obgm What was the reasoning for maintaining retry times as deltas between each entry in the coap_queue_t list?
Not having them as deltas I think simplifies the code.
I understand why the times should be relative to sendqueue_basetime.

@obgm
Copy link
Owner

obgm commented Jun 6, 2018

The reasoning was that numbers will be smaller and, more important, when you insert a node into the list, you only need to update the t value of the subsequent node (and set the value of the inserted node). I still think that this is a good idea (but have not looked into the issue discussed here).

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Jun 6, 2018

The issue is when the node is pop'd off the list - the t value of the next entry does not get adjusted.

It is quite easy to reproduce using standard coap-client and coap-server

coap-server -v9
coap-client  -s 30 coap://127.0.0.1/time

Then abort coap-client after it has received about 2 time updates. Capture output from coap-server and grep for a tid value to see the re-transmit times.

mrdeep1 added a commit to mrdeep1/libcoap that referenced this issue Jul 10, 2018
The node queue uses time deltas between each entry in the queue for
retransmissions.  The head entry does not have a time delta.

When the head of the head of the queue is taken off for retransmission
the next in the queue needs to have the time delta updated to be an
absolute value.  However, this does not happen if the current head has had
the time offset fore set to 0 and so the retransmit timer fires more frequently
than expected.

There are 2 ways to fix this - not maintain time deltas between queue
entries, or not set the head entries time to 0.

Doing the latter means that code that relies on the deltas (I can only
find the test suite) continue to work before calling coap_retransmit().

See Issue obgm#197 for further information.
@mrdeep1
Copy link
Collaborator

mrdeep1 commented Aug 1, 2018

@mbaril Can you please confirm the issue has now been fixed in the latest develop code.

@obgm
Copy link
Owner

obgm commented Nov 6, 2018

Can anyone confirm this to be fixed?

@obgm
Copy link
Owner

obgm commented Sep 20, 2019

Closing as the issue seems to be solved.

@obgm obgm closed this as completed Sep 20, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants