Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client and server behavior at the same time #845

Closed
fbucafusco opened this issue May 6, 2022 · 29 comments
Closed

Client and server behavior at the same time #845

fbucafusco opened this issue May 6, 2022 · 29 comments

Comments

@fbucafusco
Copy link

Environment

  • libcoap version: v4.3.0
  • Build System: Platformio
  • Operating System: Windows
  • Operating System Version: 10
  • Hosted Environment: ESP-IDF

Problem Description

I need my system to have a coap server and a client working at the same time.
I took code from both examples (from the ESP-IDF client and server ) without any sec layer.

Basically I created 2 tasks where each example is running, so the libcoap lib is working on two different libcoap contexts. So I have 2 coap_io_process calls (on different pbjects of course). The client demo is sending a message each second, while the server is waiting for remote requests. Each server and client uses a different port.

I am having an issue where everything goes ok, but after a random running time (in general after 5 or 10 minutes) the coap_io_process function starts returning EBUSY. I went deep into coap_io_process and it seems that EBUSY comes from lwip_select function and I think is related to some mutual exclusion mechanism, but I don't really understand why it is happening this.

Not so sure if that is a bug or it's me doing things wrong.

If you have some idea of what could be happening to me, or something that you want me to try, I will appreciate.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 7, 2022

For the client and server threads, there can be no overlapping static variable definitions otherwise the 2 threads could get confused here.

lwip_select() returns EBUSY after 256 calls if one of the file descriptors passed in is watched, but not released for some reason. 256 seconds is 4 1/4 minutes which ties up with the run time that you are seeing.

I assume that it is the client thread that is failing, not the server thread.

In the client, are you creating a new (coap_session_t) session for each request, or are you using the same session for each new request. You should be doing the latter (i.e. create the session once and use that session for each new coap_send()).

@fbucafusco
Copy link
Author

Hi mrdeep!
Thanks for you response !
I did 't use any static or global variable definitions. I tried to isolate both examples properly, in that sense, I think.
The failing task is the server one. Sorry I didn't clarify that.
For the client, I am creating one session at the beginning.

I will check what you've told me about the time and see if it matches.

Any other clue on what could be happening will be appreciated.

@fbucafusco
Copy link
Author

Hi again!
I did some extra testing focusing on what you said about the 256 calls to select. I added a counter to see at which call it fails.
It always fails at~460 calls (not always the same #). That is without making any request to the server. Just calling coap_io_process( servercontext , 1000 ) repeatedly in the main task's loop. .

@fbucafusco
Copy link
Author

fbucafusco commented May 9, 2022

Could this be related with this portion of code? (taken from coap_io_process_with_fds function).

#if COAP_CONSTRAINED_STACK
    static coap_mutex_t static_mutex = COAP_MUTEX_INITIALIZER;
# ifndef COAP_EPOLL_SUPPORT
    static fd_set readfds, writefds, exceptfds;
    static coap_socket_t *sockets[64];
    unsigned int num_sockets = 0;
# endif /* ! COAP_EPOLL_SUPPORT */
#else /* ! COAP_CONSTRAINED_STACK */

# ifndef COAP_EPOLL_SUPPORT
    fd_set readfds, writefds, exceptfds;
    coap_socket_t *sockets[64];
    unsigned int num_sockets = 0;
# endif /* ! COAP_EPOLL_SUPPORT */

The esp32 arduino framework uses COAP_CONSTRAINED_STACK = 1 as far I could see.

I've seen that you push new commits on the weekend related to this.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 9, 2022

The new commits pushed over the weekend were rebasing some PRs on the latest develop code which were resolving merge conficts from the latest code updates on hte develop branch.

Yes, the esp-idf code does have COAP_CONSTRAINED_STACK = 1, but I do not think the code you have highlighted could be the issue.

I will try tomorrow setting up a code test for what you are trying to do with the edp-idf code. It is possible though that there is an issue with an esp-idf network driver that this not releasing a socket usage counter.

FYI, the section of lwip_select() from esp-idf is

          SYS_ARCH_PROTECT(lev);
          sock = tryget_socket_unconn_locked(i);
          if (sock != NULL) {
            sock->select_waiting++;
            if (sock->select_waiting == 0) {
              /* overflow - too many threads waiting */
              sock->select_waiting--;
              nready = -1;
              maxfdp2 = i;
              SYS_ARCH_UNPROTECT(lev);
              done_socket(sock);
              set_errno(EBUSY);
              break;
            }

where select_waiting is 8 bits - hence the 256 count wrap.

@fbucafusco
Copy link
Author

fbucafusco commented May 9, 2022

Hi!
Yep, I know the part it is failing is that one you marked. I highlighted the other code, just because it has some static variables there, and then I saw you did some changes. But tbh I was guessing. :)
Thanks!
And yes, it is probably the issue is something related with some other low level module and has nothing to do with libcoap or lwip.
As far I could see the select_waiting is incremented there, and decremented a bit below that code, in the same function (conditionally). So I am not pretty sure why it is not happening.
Thanks again!

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 12, 2022

I am not able to reproduce your issue with a standalone esp-idf emulated server and many requests coming in to that server. Is it possible to get the source code for the 2 examples you are running?

@fbucafusco
Copy link
Author

Hi mrdeep1! Thanks for your time!

Actually I have both, server and client, converted into c++ classes. In the problem description I linked both examples I used as references. Actually I have both, server and client, converted into c++ classes. But, converting to c again, I have this:

server:

coap_context_t* server_ctx;
 
void coap_server_task(void* param)
{ 
    uint32_t cicle_counter =  0;

    ESP_LOGI(TAG, "Coap Server Running", "");

    uint32_t timeoutS = 1; 

    uint32_t wait_ms = timeoutS * 1000; 

    while (1)
    {
        int result = coap_io_process(server_ctx, wait_ms );
        if (result < 0)
        {
            ESP_LOGE(TAG, "coap_io_process() failed: %d with %u cicles: %s %u", errno, cicle_counter , strerror(errno), wait_ms );
            vTaskDelay( 2000 / portTICK_PERIOD_MS);
            // break;
        }
        else if (result && (unsigned)result < wait_ms)
        {
            /* decrement if there is a result wait time returned */
            wait_ms -= result;
        }

        if (result)
        {
            /* result must have been >= wait_ms, so reset wait_ms */
            wait_ms = timeoutS * 1000;
            cicle_counter++;
        }
    }
}

void dummy_hnd(coap_resource_t* resource, coap_session_t* session, const coap_pdu_t* request, const coap_string_t* query, coap_pdu_t* response)
{
    size_t              size;
    const uint8_t*      data;
    coap_opt_iterator_t opt_iter;
    coap_opt_t*         option;
    size_t              offset;
    size_t              total;
    coap_binary_t*      data_so_far;

    if (coap_get_data_large(request, &size, &data, &offset, &total))
    {
        ESP_LOGI(TAG, "dummy handler data:", "");        
    }
    else
    {
        ESP_LOGI(TAG, "dummy handler no data", "");
    }

    coap_pdu_set_code(response, COAP_RESPONSE_CODE_CHANGED);  /* the resource has changed */
}

void register_resource(const char* path, size_t path_len, coap_request_t type, void* hnd)
{
    coap_resource_t* resource = NULL;
    assert(server_ctx != NULL);

    resource = coap_resource_init( coap_make_str_const(path) , 0);
 
    if (!resource)
    {
        ESP_LOGE(TAG, "coap_resource_init() failed", "");
        return;
    }

    /* register a handler for the resource */
    coap_register_handler(resource, type, (coap_method_handler_t)hnd);
 
    coap_add_resource(server_ctx, resource);
}

void server_init()
{
    if (server_ctx == NULL)
    {
        /* create the context for the server */
        server_ctx = coap_new_context(NULL);

        if (server_ctx == NULL)
        {
            ESP_LOGE(TAG, "coap_new_context() failed", "");
            /* critical error */
            return;
        }

        ESP_LOGI(TAG, "%s coap_new_context() -> %p", __FUNCTION__, server_ctx);

        /* event handler just for loging */
        coap_register_event_handler(server_ctx, coap_event_handler);
    }
	
    coap_set_log_level(LOG_DEBUG);  // see coap_debug.h for the different levels
 
    coap_context_set_block_mode(server_ctx, COAP_BLOCK_USE_LIBCOAP | COAP_BLOCK_SINGLE_BODY);

    /* Prepare the CoAP server socket */
    coap_address_t serv_addr;

    coap_address_init(&serv_addr);

    serv_addr.addr.sin.sin_family      = AF_INET;
    serv_addr.addr.sin.sin_addr.s_addr = INADDR_ANY;
    serv_addr.addr.sin.sin_port        = htons(port);

    coap_endpoint_t* ep = coap_new_endpoint(server_ctx, &serv_addr, COAP_PROTO_UDP);

    if (!ep)
    {
        ESP_LOGE(TAG, "udp: coap_new_endpoint() failed", "");
    }
  
    register_resource("dummy", strlen("dummy"), COAP_REQUEST_GET, (void*)dummy_hnd); 

    /* start the server task */    
   xTaskCreatePinnedToCore(coap_server_task, "coap_server_task task", 2*4096 , NULL , 5 , NULL , 1);
}

client:

 
 
 
static coap_session_t* m_session;
static coap_context_t*  client_ctx;
static bool m_resp_wait; 
 
 
 static coap_optlist_t *coap_build_optlist(coap_str_const_t *path, int payload_type)
{
    coap_optlist_t *optlist = NULL;

#define BUFSIZE 40

    unsigned char  _buf[BUFSIZE];
    unsigned char *buf;

    size_t buflen;
    int    res;

    if (path)
    {
        buflen = BUFSIZE;
        buf    = _buf;
        res    = coap_split_path(path->s, path->length, buf, &buflen);

        while (res--)
        {
            coap_insert_optlist(&optlist, coap_new_optlist(COAP_OPTION_URI_PATH, coap_opt_length(buf), coap_opt_value(buf)));

            buf += coap_opt_size(buf);
        }
    }

    /* inserts at the opt list the information of the payload content */
    u_char enc_data[4];                                                                                           // bufffer to store the encoded data
    size_t enc_size = coap_encode_var_safe(enc_data, sizeof(enc_data), COAP_MEDIATYPE_APPLICATION_OCTET_STREAM);  // encodes COAP_MEDIATYPE_APPLICATION_OCTET_STREAM into enc_data

    coap_insert_optlist(&optlist, coap_new_optlist(COAP_OPTION_CONTENT_FORMAT, enc_size, enc_data));

    return optlist;
}

 
int send(void *request_)
{
    if (NULL == request_) { return -1; }

    int result = 0;

    coap_pdu_t *request = (coap_pdu_t *)request_;

    /* sync flag on */
    m_resp_wait = true; 

    ESP_LOGI(TAG, "Sent MID: %d", coap_pdu_get_mid(request));

    /* send: sincronized with the response handler (when leaving the function the request does not have any more validity, the object is deleted. )*/
    coap_send(m_session , request);
 
    int wait_ms = 2000;

    while ( m_resp_wait )
    { 
        const uint32_t time_step = 1000;

        int time = coap_io_process(client_ctx , wait_ms > time_step ? time_step : wait_ms );         
        
        if (time >= 0)
        {
            if (time >= wait_ms)
            {
                ESP_LOGE(TAG, "No response from server %u>=%u", time, wait_ms);
                result = -2;
                break;
            }
            else
            {
                wait_ms -= time;
            }
        }
    }
	
    return result;
}
	
 
/* returns a opaque pointer to the request just created */
void * post_create(uint8_t *data, uint16_t length, char *path)
{
    coap_pdu_t     *request = NULL;
    unsigned char   token[8];
    size_t          tokenlength;
    coap_optlist_t *option_list = NULL;

    /* confirmable message + post */
    request = coap_new_pdu(COAP_MESSAGE_CON, COAP_REQUEST_CODE_POST, m_session);

    if (!request)
    {
        ESP_LOGE(TAG, "coap_new_pdu() failed", "");
        return NULL;
    }

    /* Add in an unique token */
    coap_session_new_token(m_session, &tokenlength, token);
 
    coap_add_token(request, tokenlength, token);

    coap_str_const_t path_ = {strlen(path), (unsigned char *)path};

    option_list = coap_build_optlist( &path_ , COAP_MEDIATYPE_APPLICATION_OCTET_STREAM);

    if (option_list == NULL)
    {
        ESP_LOGE(TAG, "coap_build_optlist() failed", "");
        return NULL;
    }

    /* inserts the optlist into the PDU DO THIS BEFORE ADDING DATA!!! */
    coap_add_optlist_pdu(request, &option_list);

    /* inserts the data into the PDU */
    // coap_add_data_large_request(m_session, request, length, data, NULL, NULL); add coap_register_option( ctx, COAP_OPTION_BLOCK2 ); when creating the context
    coap_add_data(request, length, data);
   
	coap_delete_optlist(option_list);
 
    return request;
}


 void client_task(void* param)
{ 
    /* Set up the CoAP logging */
    coap_set_log_handler(coap_log_handler);
    coap_set_log_level(LOG_DEBUG);  // see coap_debug.h for the different levels

    /* Set up the context */
    client_ctx = coap_new_context(NULL);

    if (!client_ctx)
    {
        ESP_LOGE(TAG, "coap_new_context() failed", "");
        return;
    }
    
    ESP_LOGI(TAG, "%s coap_new_context() -> %p", __FUNCTION__ ,client_ctx);

    coap_context_set_block_mode(client_ctx, COAP_BLOCK_USE_LIBCOAP | COAP_BLOCK_SINGLE_BODY);  // this is use for defining the dynamics of the library

    /* REGISTER REPONSE HANDLER */
    coap_register_response_handler(client_ctx, message_handler);

    /* event handler just for loging */
    coap_register_event_handler(client_ctx, coap_event_handler);
    	
	coap_address_t  server;
    coap_address_t *dst_addr = &server;

    coap_address_init(&server);
   
    inet_pton(AF_INET, "192.168.10.10" , &(server.addr.sin.sin_addr));
    server.addr.sa.sa_family = AF_INET;
    server.addr.sin.sin_port = htons( 8018  );
 
    m_session = coap_new_client_session(client_ctx, NULL, dst_addr, COAP_PROTO_UDP);
    

    if (!m_session)
    {
        ESP_LOGE(TAG, "coap_new_client_session() failed", "");
        return;
    }
    
    coap_session_set_app_data(m_session, NULL );
 
    TickType_t xLastWakeTime = xTaskGetTickCount();

    while (1)
    {
        void* r =  post_create( "SOMETHING", sizeof("SOMETHING")-1 , "something"); 

        int32_t exitCode =  send(r); /* The r object is destroyed (freed from memory) inside this call */

        vTaskDelayUntil(&xLastWakeTime, pdMS_TO_TICKS(1000));  // vTaskDelay -> runs again 1 second
    };
}
   
void client_init()
{
    xTaskCreatePinnedToCore(client_task, "client task", 0x1000, NULL , 5, &m_taskHandle, 1); 	
}	 

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 16, 2022

Using your code - which needed a couple of changes to make it compile - has been running for 1.5 hrs with a client request every second now without failure on a QEMU emulated system (using OpenCores driver).

So it does look like it is an issue with whatever network driver you are using.

@fbucafusco
Copy link
Author

OK. I will look into it.
Thank you so much.
I ll post the solution, if I find it.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 16, 2022

I have managed to reproduce something - digging into it at present.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 17, 2022

The issue is down to coap_mutex_lock() and coap_mutex_unlock() not being properly defined as a mutex (or semaphore) for ESP-IDF, and hence for any COAP_CONSTRAINED_STACK protected code was not actually getting protected.

Can you please try updating managed_components/espressif__coap/port/include/coap_config_posix.h with the last 2 lines of the following:-

define HAVE_SYS_SOCKET_H
#define HAVE_MALLOC
#define HAVE_ARPA_INET_H
#define HAVE_TIME_H
#define HAVE_NETDB_H
#define HAVE_NETINET_IN_H
#define HAVE_STRUCT_CMSGHDR
#define COAP_DISABLE_TCP 0
#define HAVE_PTHREAD_H
#define HAVE_PTHREAD_MUTEX_LOCK

and let me know how it goes.

@fbucafusco
Copy link
Author

mmm. I am fu#$%#$% with that. Actually because I have the Arduino framework with the precompiled espidf (yeah, I know it sucks, I haven't chosen that). Certainly it is compiled without those lines and it is using that crappy stub.
I will figure out what to do next.
Thanks for helping me ! I'll let you know my progress.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented May 17, 2022

esp-idf extra components PR 61 has been raised to fix this. My test code has now looped 17,500 times with no error.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Jun 14, 2022

espressif/idf-extra-components#61 has now been merged into the master branch, and hence generally available.

@fbucafusco Please close this issue if you are happy with the solution.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Jun 23, 2022

@fbucafusco Please free to re-open this issue if there still is a problem.

@mrdeep1 mrdeep1 closed this as completed Jun 23, 2022
@fbucafusco
Copy link
Author

fbucafusco commented Nov 22, 2022

Hi again!
I was able to resume testing my setup recently. It has been a while.
I updated the platformio arduino esp32 framework to the latest which seems to have the patch you created for this issue.
My setup is the same as before. I am having a coap server and a coap client running on different tasks.
Currently I am facing another issue.
While the server is doing nothing but waiting some request from a remote client, my client should be sending periodic confirmable messages to a remote server. Same scenario as before, basically.
The issue is that while the server is waiting (calling coap_io_process each second) the client is calling coap_io_process at least one time for each message I want to send (As a comment, I configured the retransmission count to 1, and my remote server is shutted down none of my confirmable messages are being received, so no ACK are happening. )
The problem is that sometimes coap_io_process returns pretty fast (in relation with my waittime) and sometimes it takes "forever" (more that the specified in the waittime field). 1 or 2 minutes. I was able to see in the call stack that the place where it is being stucked is in the semaphoretake (which call is derived from coap_mutex_lock call). So, I am not so sure why the semaphore is kept locked that much time.

So, I reviewed the mutex implementation, and I realized that the mutex is a unique object (is static within coap_io_process_with_fds) so my best guess is that I have a race condition happing here. I wonder why is the mutex a global unique object instead of being unique within each coap_context_t instanced in the system.

BTW. If I shut down the server, the system's timing is correct.

Thanks in advance for the help.

@mrdeep1 mrdeep1 reopened this Nov 22, 2022
@mrdeep1
Copy link
Collaborator

mrdeep1 commented Nov 22, 2022

The mutex in coap_io_process_with_fds() is protecting

  static fd_set readfds, writefds, exceptfds;
  static coap_socket_t *sockets[64];

which could be on the stack, but may cause stack overrun issues.

I appreciate it is not easy for you to change the build of libcoap, so not sure how to move forward here. A possibility is to reduce the wait time on the coap_io_process() call in both the client and server to see if this helps.

@fbucafusco
Copy link
Author

I will increase the stack just in case. I haven't tried that yesterday.
Is not that easy for me to change the build because I am using the prebuild version that comes in the arduino framework (and also fight against platformio which I hate in regard of dependencies.)
I will let you know if I find something,.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Nov 23, 2022

I don't think that increasing the stack size will fix this issue. Reducing the delay time to 100ms on the coap_io_process() call may help.

Otherwise. changing COAP_CONSTRAINED_STACK in coap/port/include/coap_config_posix.h to 0 and increasing the stack size would work if you could change the build.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Nov 23, 2022

@fbucafusco See PR #981 for a proper fix here.

@fbucafusco
Copy link
Author

Hi!
I almost was going to suggest that fix again, cos it didn't make sense that threads' timings depend on the dynamics on the system and other threads' behaviors. But you did it in advance. So, that is great! Thanks!
Reducing the time, helped. So, for now, that will be my workaround until it is merged and released, and espidf updates the version. :)

@fbucafusco
Copy link
Author

Hi. I did some more tests and reducing the time, as I stated, helped but not solved the issue fully. Sometimes weird things happen.

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Jul 27, 2023

Reducing the time, helped. So, for now, that will be my workaround until it is merged and released, and espidf updates the version.

This PR #981 has now been merged into idf-extra-components - see espressif/idf-extra-components#215

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Aug 13, 2023

@fbucafusco Can this be closed now?

@mkschreder
Copy link

Why is it even necessary to have two coap contexts? Why not reuse the same socket for both client and server?

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Aug 15, 2023

@mkschreder There is no need to have 2 coap contexts if a client and server are running at the same time (which is what is happening when the standard libcoap coap-server is running as a proxy server forwarding on requests and forwarding back responses. BUT, as libcoap is not currently multi-thread-safe, both the client and server logic need to be running on the same thread if they are to share a common coap context.

Then there are coap sessions (acting as a coap endpoint) which actually open a socket and either connects to a remote server, or listens for information coming from a client. It is possible to have multiple coap sessions per coap context.

In theory, a single coap session can act as both a server endpoint and a client endpoint (i.e. use the same socket), but this does not give you port to use flexibility (as both the server listening port and the source port of a client connecting to a remote server have to be the same).

@fbucafusco
Copy link
Author

fbucafusco commented Aug 15, 2023

Hi!
We changed the project build system so we can build and link the latest version of libcoap. This way we can skip prebuilded versions of the lib.
Currently we are running both client and server concurrently.
We don t have so many issues because I kept the hacked implementation that I had before with olded libcoap versions.
I should try removing the hacks to check if the issue is gone with your change regarding the mutexes. I will do it eventually, but I cant promess to do it asap.
I would say that you can close the issue.
Now, if something weird happens to us again I can change the code myself and propose changes. :)

@mrdeep1
Copy link
Collaborator

mrdeep1 commented Aug 16, 2023

Thanks for the feedback and update. I will close this off now.

@mrdeep1 mrdeep1 closed this as completed Aug 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants