Mutex deadlock can cause cFE to hang indefinitely during shutdown (Linux/POSIX) #2433

LornDMiller · 2023-08-22T01:36:18Z

Describe the bug
This may be more widespread than simply Software Bus, but Software Bus is where I've seen it the most.

The Software Bus function CFE_SB_BroadcastBufferToRoute locks global data via CFE_SB_LockSharedData, then may call OS_QueuePut, then eventually unlocks the global data via CFE_SB_UnlockSharedData. In OSAL, the OS_QueuePut call finds its way into mq_timedsend, and the locking and unlocking of global data is via pthread_mutex_lock and pthread_mutex_unlock.

During shutdown, the various apps are all cancelled. Again in OSAL, this resolves to pthread_cancel calls. This leads to apps terminating when they reach "cancellation points."

mq_timedsend is a cancellation point. pthread_mutex_lock is not a cancellation point.

When a task is in the process of sending a message and is cancelled, it may be terminated while it holds the SB Shared Data mutex. Occasionally this coincides with another task that is pending or is about to pend on that mutex. Any remaining tasks that pend on that mutex are then deadlocked. I have not yet identified why the abort at the end of CFE_PSP_Restart is not called, but the system hangs indefinitely.
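The hazard described above can be reproduced in isolation. The following is a minimal sketch, not cFE or OSAL code: the names shared_lock, sender_task, and demo_cancel_while_locked are hypothetical, and sleep() stands in for mq_timedsend as the cancellation point. It assumes a default (non-robust) mutex, which is what the deadlock depends on.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the SB shared data mutex (default, non-robust). */
static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;

/* Hypothetical task: takes the lock, then blocks at a cancellation point. */
static void *sender_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&shared_lock);
    sem_post(&lock_taken);              /* tell the caller the lock is held */
    sleep(60);                          /* cancellation point (stand-in for mq_timedsend) */
    pthread_mutex_unlock(&shared_lock); /* never reached if cancelled in sleep() */
    return NULL;
}

/* Cancels the task while it holds the lock, then probes the mutex.
 * Returns the pthread_mutex_trylock result, or -1 if setup failed. */
int demo_cancel_while_locked(void)
{
    pthread_t tid;

    if (sem_init(&lock_taken, 0, 0) != 0)
        return -1;
    if (pthread_create(&tid, NULL, sender_task, NULL) != 0)
        return -1;

    sem_wait(&lock_taken);   /* wait until the task owns the mutex */
    pthread_cancel(tid);     /* acted upon inside sleep() */
    pthread_join(tid, NULL); /* the task is now gone */
    sem_destroy(&lock_taken);

    /* The owner no longer exists, but the mutex is still locked.
     * A blocking pthread_mutex_lock here would never return. */
    return pthread_mutex_trylock(&shared_lock);
}
```

In this sketch the trylock fails with EBUSY even though no live thread owns the mutex, which is the state the remaining tasks pend on during shutdown.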

My current work-around is to modify OSAL's os-impl-mutex.c. In OS_MutSemCreate_Impl I call pthread_mutexattr_setrobust just before the call to pthread_mutex_init, so that all mutexes are robust. In OS_MutSemTake_Impl I check the return code of pthread_mutex_lock for EOWNERDEAD and, if that is returned, call pthread_mutex_consistent to restore the mutex.
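For reference, the work-around could look roughly like the sketch below. This is not the actual OSAL patch: robust_mutex_init, robust_mutex_lock, and demo_owner_death are hypothetical helper names, and the real change would live inside OS_MutSemCreate_Impl and OS_MutSemTake_Impl. Only the pthread calls are taken from the description above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: initialize a mutex with the robust attribute set. */
int robust_mutex_init(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
        return rc;

    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: lock, recovering if the previous owner died. */
int robust_mutex_lock(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);

    if (rc == EOWNERDEAD)
    {
        /* The lock is held by the caller at this point, but the data it
         * protects may be inconsistent. Mark the mutex usable again. */
        rc = pthread_mutex_consistent(m);
    }
    return rc;
}

/* Thread body that terminates while still holding the mutex. */
static void *owner_that_dies(void *arg)
{
    pthread_mutex_lock((pthread_mutex_t *)arg);
    return NULL; /* exits without unlocking */
}

/* Returns 0 if the lock is recovered after its owner terminated. */
int demo_owner_death(void)
{
    pthread_mutex_t m;
    pthread_t tid;
    int rc;

    if (robust_mutex_init(&m) != 0)
        return -1;
    if (pthread_create(&tid, NULL, owner_that_dies, &m) != 0)
        return -1;
    pthread_join(tid, NULL);

    rc = robust_mutex_lock(&m); /* sees EOWNERDEAD internally, then recovers */
    if (rc == 0)
        pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

Note that pthread_mutex_consistent only makes the mutex usable again; it does not repair whatever shared state the terminated owner was in the middle of updating. During shutdown that may be acceptable, but that is a judgment call for the maintainers.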

To Reproduce
Steps to reproduce the behavior:
This may be difficult to reproduce without a lot of Software Bus traffic. With enough Software Bus traffic, simply restarting the system repeatedly should eventually trigger it. Unfortunately, this is a race condition and is not easily triggered on demand.

Expected behavior
On shutdown, all tasks terminate.

Code snips
cFE/modules/sb/fsw/src/cfe_sb_api.c line 1548 is the call to CFE_SB_LockSharedData that could trigger the deadlock
cFE/modules/sb/fsw/src/cfe_sb_api.c line 1605 (the call to OS_QueuePut) is a cancellation point reached while that same function holds the global data mutex.
osal/src/os/posix/src/os-impl-queues.c line 305 is the actual call to mq_timedsend
osal/src/os/posix/src/os-impl-mutex.c line 179 is the pthread_mutex_lock that ultimately blocks and deadlocks a subsequent task.

System observed on:
Ubuntu 22.04. Analysis indicates any POSIX system would be vulnerable. I have not evaluated vulnerability for other platforms.

Additional context
This may require coordination with the OSAL project. This ticket may be more appropriate for the OSAL team.

Reporter Info
Lorn Miller
Red Canyon Engineering & Software

irowebbn · 2023-10-12T20:00:27Z

I have encountered a similar problem where the software bus appears to have a race condition. Here are the system log messages it emits:

CFE_SB_UnlockSharedData: SharedData Mutex Give Err Stat=-6,App=1114127,Func=CFE_SB_ReceiveBuffer,Line=1892
CFE_SB_UnlockSharedData: SharedData Mutex Give Err Stat=-6,App=1114127,Func=CFE_SB_ReceiveBuffer,Line=2005

chillfig added the bug label Aug 30, 2023

jphickey self-assigned this Sep 12, 2023
