Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock with updates in epics base 3.14.12.6-rc1 and 3.15.5-rc1 #29

Closed
xiaoqiangwang opened this issue Nov 10, 2016 · 6 comments
Closed

Comments

@xiaoqiangwang
Copy link
Collaborator

xiaoqiangwang commented Nov 10, 2016

Currenly all pcaspy applications assume that they can call Driver::updatePVs() at a separate thread. And the main thread will pick this request and send monitor events to clients.

In the release of 3.14.12.6-rc1 and 3.15.5-rc1, the pcas server receives an update to support dynamic length array. The affecting change is the additional call to [chan.getPVI().nativeCount()] (epics-base/epics-base@0743417#diff-4eafaaeaa6480eaed1cfd866d63cd0adR886) inside casStrmClient::monitorResponse .

When the deadlock happens, here is the backtrace from both main thread 1 and python thread 2,

Thread 2 (Thread 0x7fbd87537700 (LWP 15445)):
#0  0x00000035ebe0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035ebe095f3 in _L_lock_892 () from /lib64/libpthread.so.0    
#2  0x00000035ebe094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0                                                                                       
#3  0x00007fbd879d1026 in mutexLock (pmutex=0x1c7c230) at ../../../src/libCom/osi/os/posix/osdMutex.c:44                                                          
#4  epicsMutexOsdLock (pmutex=0x1c7c230) at ../../../src/libCom/osi/os/posix/osdMutex.c:116                                                                       
#5  0x00007fbd879c9d9c in epicsMutex::lock (this=<value optimized out>) at ../../../src/libCom/osi/epicsMutex.cpp:236                                             
#6  0x00007fbd880b1b9f in epicsGuard (this=<value optimized out>, monitorList=..., select=..., event=...) at ../../../../include/epicsGuard.h:68                  
#7  casEventSys::postEvent (this=<value optimized out>, monitorList=..., select=..., event=...) at ../../../../src/cas/generic/casEventSys.cc:334                 
#8  0x00007fbd880aee1c in postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/casCoreClient.h:167                                    
#9  postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/chanIntfForPV.h:71                                                           
#10 casPVI::postEvent (this=0x20c7db0, select=..., event=...) at ../../../../src/cas/generic/casPVI.cc:282                                                        
#11 0x00007fbd88502199 in PV::postEvent (this=0x1e95270, mask=<value optimized out>, value=...) at pcaspy/pv.cpp:96                                               
#12 0x00007fbd884f65d9 in _wrap_PV_postEvent (args=<value optimized out>) at pcaspy/casdef_wrap.cpp:9740                                                          
#13 0x00000036006d55b6 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#14 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#15 0x00000036006d5a94 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#16 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#17 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#18 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#19 0x000000360066ad9d in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#20 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#21 0x00000036006d4460 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#22 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#23 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0                                                                               
#24 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0                                                                                
#25 0x000000360066aca0 in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#26 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#27 0x00000036006566af in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#28 0x0000003600643c63 in PyObject_Call () from /usr/lib64/libpython2.6.so.1.0                                                                                    
#29 0x00000036006cfc83 in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.6.so.1.0                                                                    
#30 0x00000036007017aa in ?? () from /usr/lib64/libpython2.6.so.1.0                                                                                               
#31 0x00000035ebe07aa1 in start_thread () from /lib64/libpthread.so.0                                                                                             
#32 0x00000035eb2e893d in clone () from /lib64/libc.so.6                                                                                                          

Thread 1 (Thread 0x7fbd92e08700 (LWP 15443)):
#0  0x00000035ebe0e334 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000035ebe095f3 in _L_lock_892 () from /lib64/libpthread.so.0    
#2  0x00000035ebe094d7 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fbd879d1026 in mutexLock (pmutex=0x20783e0) at ../../../src/libCom/osi/os/posix/osdMutex.c:44
#4  epicsMutexOsdLock (pmutex=0x20783e0) at ../../../src/libCom/osi/os/posix/osdMutex.c:116             
#5  0x00007fbd879c9d9c in epicsMutex::lock (this=<value optimized out>) at ../../../src/libCom/osi/epicsMutex.cpp:236
#6  0x00007fbd880adbc8 in epicsGuard (this=0x20c7db0) at ../../../../include/epicsGuard.h:68                         
#7  casPVI::nativeCount (this=0x20c7db0) at ../../../../src/cas/generic/casPVI.cc:521                                
#8  0x00007fbd880a9ad4 in casStrmClient::monitorResponse (this=0x2112900, guard=..., chan=..., msg=..., desc=..., completionStatus=0)
    at ../../../../src/cas/generic/casStrmClient.cc:886                                                                              
#9  0x00007fbd880b23a3 in casMonitor::response (this=<value optimized out>, guard=<value optimized out>, client=<value optimized out>, 
    value=<value optimized out>) at ../../../../src/cas/generic/casMonitor.cc:61                                                       
#10 0x00007fbd880b24e7 in casMonitor::executeEvent (this=0x21152d0, client=..., ev=..., value=<value optimized out>, clientGuard=..., evGuard=...)
    at ../../../../src/cas/generic/casMonitor.cc:110                                                                                              
#11 0x00007fbd880b1fd2 in casEventSys::process (this=0x2112928, casClientGuard=...) at ../../../../src/cas/generic/casEventSys.cc:133             
#12 0x00007fbd880b75d9 in eventSysProcess (this=0x2112b48) at ../../../../src/cas/generic/casCoreClient.h:176
#13 casStreamEvWakeup::expire (this=0x2112b48) at ../../../../src/cas/generic/st/casStreamOS.cc:150
#14 0x00007fbd879d6943 in timerQueue::process (this=0x1e126f8, currentTime=...) at ../../../src/libCom/timer/timerQueue.cpp:139
#15 0x00007fbd879be388 in fdManager::process (this=0x7fbd87beea40, delay=<value optimized out>) at ../../../src/libCom/fdmgr/fdManager.cpp:127
#16 0x00007fbd884f1e17 in _wrap_process (args=<value optimized out>) at pcaspy/casdef_wrap.cpp:11076
#17 0x00000036006d59d4 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#18 0x00000036006d6b7f in PyEval_EvalFrameEx () from /usr/lib64/libpython2.6.so.1.0
#19 0x00000036006d7647 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.6.so.1.0
#20 0x00000036006d7722 in PyEval_EvalCode () from /usr/lib64/libpython2.6.so.1.0
#21 0x00000036006f1b9c in ?? () from /usr/lib64/libpython2.6.so.1.0
#22 0x00000036006f1c70 in PyRun_FileExFlags () from /usr/lib64/libpython2.6.so.1.0
#23 0x00000036006f315c in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.6.so.1.0
#24 0x00000036006ff892 in Py_Main () from /usr/lib64/libpython2.6.so.1.0
#25 0x00000035eb21ed5d in __libc_start_main () from /lib64/libc.so.6
#26 0x0000000000400649 in _start ()
@xiaoqiangwang
Copy link
Collaborator Author

xiaoqiangwang commented Nov 10, 2016

Here is how the deadlock happens,

main thread               python update thread
casEventSys::process      casPVI::postEvent
casPVI::nativeCount       casEventSys::postEvent

@bhill-slac
Copy link

Hi Xiaoqiang,
Do you have a pcaspy example that I can run to reproduce this?
Thanks,

  • Bruce

@xiaoqiangwang
Copy link
Collaborator Author

xiaoqiangwang commented Dec 6, 2016

Hi Bruce,

I used the simscope.py as the test program.

After building pcaspy 0.6.2 or the current HEAD in the repo,

$ cd example
# make a soft link to the current build
$ ln -s ../build/lib.linux-x86_64-2.6/pcaspy .
$ python simscope.py

At another terminal, launch the MEDM panel, change the Update time to 0.01 second, and click Run.

medm -x -macro P=MTEST simscope.adl

After a few seconds (in my case), the sinuous wave stops updating. And now if you attach gdb to the PCASpy program, and exec command (gdb) thread apply all bt. The traceback shows two threads are stuck in pthread_mutex_lock, as show in the first post.

@xiaoqiangwang
Copy link
Collaborator Author

The official release of EPICS base 3.14.12.6 has fixed this deadlock.

@till-s
Copy link

till-s commented May 29, 2019

Note that the fix mentioned above only remedies one particular race condition but there are more...

I filed a bug report here

https://bugs.launchpad.net/epics-base/+bug/1830957

@xiaoqiangwang
Copy link
Collaborator Author

The deadlock you discovered is indeed serious.

https://bugs.launchpad.net/epics-base/+bug/1830957

In theory the deadlock should happen soon if I do "caput -c " on a fast changing PV. I used simscopy.py as the test program (updating in 100Hz) and ran the "caput -c " in 50Hz.

count=1;while [ 1 ]; do caput -c MTEST:MinValue $((count++));sleep 0.02; done

However it cannot be reproduced. In what usage do you see this deadlock, in a cagateway?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants