
xQueueGenericReceive deadlock #877

Closed
kiik opened this issue Feb 17, 2016 · 6 comments

Comments

@kiik commented Feb 17, 2016

I am experiencing cases on the Photon where a deadlock occurs: JTAG shows that the device never exits the 'xQueueGenericReceive' method. While this happens, both the user application and the Particle firmware are blocked.

gdb backtrace:

#0  xQueueGenericReceive (xQueue=0x856f78f8, pvBuffer=pvBuffer@entry=0x0, xTicksToWait=xTicksToWait@entry=1, xJustPeeking=xJustPeeking@entry=0) at WICED/RTOS/FreeRTOS/ver7.5.2/Source/queue.c:1165
#1  0x0803138e in sys_sem_new (sem=sem@entry=0x20005bdc <memp_memory_NETCONN_base+828>, count=count@entry=0 '\000') at WICED/network/LwIP/WWD/FreeRTOS/sys_arch.c:239
#2  0x080267e2 in netconn_alloc (t=<optimized out>, callback=callback@entry=0x0) at WICED/network/LwIP/ver1.4.0.rc1/src/api/api_msg.c:598
#3  0x08025eac in netconn_new_with_proto_and_callback (t=t@entry=NETCONN_TCP, proto=proto@entry=0 '\000', callback=callback@entry=0x0) at WICED/network/LwIP/ver1.4.0.rc1/src/api/api_lib.c:73
#4  0x0802d822 in wiced_tcp_create_socket (socket=0x0, interface=<optimized out>) at WICED/network/LwIP/WICED/tcpip.c:295
#5  0x080401ac in socket_create (family=<optimized out>, type=<optimized out>, protocol=<optimized out>, port=<optimized out>, nif=0) at src/photon/socket_hal.cpp:960
#6  0x0804597c in Spark_Connect () at src/system_cloud_internal.cpp:856
#7  0x08044f14 in establish_cloud_connection () at src/system_task.cpp:215
#8  0x08045174 in manage_cloud_connection (force_events=<optimized out>) at src/system_task.cpp:302
#9  0x080451ae in Spark_Idle_Events (force_events=<optimized out>) at src/system_task.cpp:335
#10 0x08044000 in ActiveObjectBase::run (this=0x200096b8 <SystemThread>) at src/active_object.cpp:49
#11 0x08044020 in ActiveObjectBase::run_active_object (object=<optimized out>) at src/active_object.cpp:92
#12 0x08043968 in std::invoke_thread (ptr=0x2000c498) at src/system_threading.cpp:85
#13 0x00000000 in ?? ()
(gdb) fin 
Run till exit from #0  xQueueGenericReceive (xQueue=0x856f78f8, pvBuffer=pvBuffer@entry=0x0, xTicksToWait=xTicksToWait@entry=1, xJustPeeking=xJustPeeking@entry=0) at WICED/RTOS/FreeRTOS/ver7.5.2/Source/queue.c:1165

@kiik (Author) commented Feb 17, 2016

I am not sure how I should go about locating the exact cause of this. What should I suspect the most?

@m-mcgowan (Contributor) commented Feb 17, 2016

Looks like a deadlock in the WICED SDK, but it's not possible to say why. Knowing what the other threads are doing would be useful information. Could you please make a short app to reproduce the problem?

@kiik (Author) commented Feb 17, 2016

I upgraded my arm-none-eabi toolchain from 4.9 to 5.2 and switched from Particle firmware 0.4.8-rc.6 to the latest development branch (the 0.4.9 tagged states were not working for me at all), and the device has now been running flawlessly for an hour. Previously, a hard fault often occurred, originating from a FreeRTOS task context switch method; perhaps this sometimes led to the deadlock instead of a hard fault. I'll let it run for the next 24 hours and see whether this issue still occurs.

@kiik (Author) commented Feb 20, 2016

This is no longer occurring in the latest development head.

@kiik closed this Feb 20, 2016

@m-mcgowan reopened this Apr 25, 2016

@m-mcgowan (Contributor) commented Apr 25, 2016

We are seeing this issue with devices suddenly halting - LED activity stops and interrupts are not processed.

It stems from the sys_sem_new() call in WICED and its interaction with the mutex that guards heap allocation, which is also provided by WICED.

  • Some thread A, e.g. the application thread, performs a heap allocation.
    • As part of the allocation, it acquires the heap mutex.
  • Before thread A releases the mutex, the sys_sem_new() call starts on the system thread S. Before attempting to allocate the new semaphore, thread S enters a critical section, stopping all interrupts and task switching. Thread A still owns the heap mutex, but can make no further progress.
  • Thread S then attempts to allocate heap for the new semaphore. It blocks waiting for the heap mutex to become available, but it never becomes available, since thread A cannot make any progress until the critical section is exited.
  • Thread S only exits the critical section after it has acquired the semaphore.

This is the typical cyclic graph of resource acquisition between threads that causes a deadlock.
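
To illustrate the cycle outside the firmware, here is a small desktop analogy (my own sketch, not WICED or Particle code): the heap mutex is a pthread mutex, and the critical section is modelled as a second lock that thread A also needs in order to keep running. The two threads acquire the locks in opposite orders, which is exactly the cycle listed above.

    /* Desktop analogy of the deadlock: compile with `cc -pthread deadlock.c`.
     * "heap_mutex" stands for the heap lock; "critical_section" models the RTOS
     * critical section as a lock that thread A needs in order to keep running. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t heap_mutex       = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t critical_section = PTHREAD_MUTEX_INITIALIZER;

    static void *thread_A(void *arg)            /* e.g. the application thread */
    {
        (void)arg;
        pthread_mutex_lock(&heap_mutex);        /* 1. starts a heap allocation */
        sleep(1);                               /* window in which thread S runs */
        pthread_mutex_lock(&critical_section);  /* 3. blocked: S holds the "critical section" */
        pthread_mutex_unlock(&critical_section);
        pthread_mutex_unlock(&heap_mutex);
        return NULL;
    }

    static void *thread_S(void *arg)            /* the system thread in sys_sem_new() */
    {
        (void)arg;
        usleep(100 * 1000);                     /* let thread A grab the heap lock first */
        pthread_mutex_lock(&critical_section);  /* 2. enters the critical section */
        pthread_mutex_lock(&heap_mutex);        /* 4. blocked forever: A holds the heap lock */
        pthread_mutex_unlock(&heap_mutex);
        pthread_mutex_unlock(&critical_section);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, s;
        pthread_create(&a, NULL, thread_A, NULL);
        pthread_create(&s, NULL, thread_S, NULL);
        pthread_join(a, NULL);                  /* never returns: the classic cycle */
        pthread_join(s, NULL);
        puts("not reached");
        return 0;
    }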

A trivial fix is to call malloc_lock()/malloc_unlock() around the critical section. This ensures that the thread holds the heap mutex before entering the critical section, so it is guaranteed not to block inside the critical section.
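
To make the suggested ordering concrete, here is a minimal sketch of the pattern (my own illustration, not the actual WICED sys_arch.c change). Only malloc_lock()/malloc_unlock() are taken from the description above; the function name and the allocation are placeholders, and the sketch assumes the heap lock is recursive so the nested lock taken inside malloc() by the same thread returns immediately.

    /* Sketch only: heap lock acquired BEFORE the critical section. */
    #include <stdlib.h>
    #include "FreeRTOS.h"
    #include "task.h"

    extern void malloc_lock(void);    /* acquires the (recursive) heap mutex */
    extern void malloc_unlock(void);  /* releases the heap mutex */

    void *allocate_inside_critical_section(size_t size)
    {
        void *p;

        /* Take the heap mutex first. If another thread (thread A) holds it,
         * block here, where blocking is still allowed. */
        malloc_lock();

        /* Inside the critical section the allocation cannot block: this thread
         * already owns the heap mutex, so malloc()'s own lock succeeds at once. */
        taskENTER_CRITICAL();
        p = malloc(size);
        taskEXIT_CRITICAL();

        /* Release the heap mutex only after leaving the critical section. */
        malloc_unlock();

        return p;
    }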

I will look into the latest version of WICED to see if/how they have resolved the problem.

@m-mcgowan (Contributor) commented Apr 25, 2016

As a safeguard, I will add a new PANIC code that is raised from within malloc_lock if the semaphore is not owned by the calling thread and task switching is disabled. This will remove the timing dependencies and cause the programming fault to be apparent immediately. I've raised the issue on the WICED community forums.
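
Roughly, that check could look like the sketch below (illustrative only: the panic() call, the PANIC code name, and the heap_mutex/heap_owner bookkeeping are placeholders, not the actual firmware). It assumes FreeRTOS 7.x naming with INCLUDE_xTaskGetSchedulerState and INCLUDE_xTaskGetCurrentTaskHandle enabled; a real version would also track the recursion depth and might additionally check whether interrupts are masked.

    /* Sketch of the proposed safeguard, not the shipped code. */
    #include "FreeRTOS.h"
    #include "task.h"
    #include "semphr.h"

    #define PANIC_HEAP_WITH_TASKS_DISABLED  42       /* hypothetical new PANIC code */
    extern void panic(int code);                     /* placeholder for the firmware PANIC mechanism */

    static xSemaphoreHandle heap_mutex;              /* created elsewhere as a recursive mutex */
    static xTaskHandle heap_owner;                   /* task currently holding the heap mutex */

    void malloc_lock(void)
    {
        /* If task switching is disabled and this task does not already own the
         * heap mutex, taking it can never succeed: the owner can never run again
         * to release it. Fail loudly instead of deadlocking silently. */
        if (xTaskGetSchedulerState() != taskSCHEDULER_RUNNING &&
            heap_owner != xTaskGetCurrentTaskHandle())
        {
            panic(PANIC_HEAP_WITH_TASKS_DISABLED);
        }

        xSemaphoreTakeRecursive(heap_mutex, portMAX_DELAY);
        heap_owner = xTaskGetCurrentTaskHandle();
    }

    void malloc_unlock(void)
    {
        heap_owner = NULL;                           /* simplified: ignores recursion depth */
        xSemaphoreGiveRecursive(heap_mutex);
    }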
