[BUG] StreamMessageParser::parseMessageToADatatype avoids exception and crashes script #779
Hi @dhartness, thanks for the report. I have some suspicions as to what might cause this, but nothing stands out as "the obvious" thing. The issue stems from the native library and is therefore "uncatchable" from the Python side. I think it'd be best if you enable the creation of core dump files and share the dump after the next crash.
If you see the above crash you should get a
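As a minimal sketch of the same idea from inside the Python script itself (this is an assumption, not the command referred to above): the standard-library `resource` module can raise the core dump limit for the current process, and `faulthandler` prints the Python-level stack of every thread when a fatal signal such as SIGSEGV arrives. The helper name `enable_crash_diagnostics` is hypothetical.

```python
import faulthandler
import resource
import sys


def enable_crash_diagnostics(log_file=None):
    """Raise the soft core-dump limit to the hard limit and enable faulthandler.

    log_file must be an open file with a real file descriptor and must stay
    open for as long as faulthandler is enabled. Defaults to sys.stderr.
    Returns the (soft, hard) RLIMIT_CORE pair now in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    # On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL, dump the Python traceback
    # of all threads before the process dies.
    target = log_file if log_file is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Calling `enable_crash_diagnostics()` once at startup, before any camera threads are created, should be enough. Whether a core file actually appears still depends on the system's hard limit and core pattern.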
Will do. Thanks for the directive.
I'm going to add that command to the crontab. It finally occurred again but apparently someone had undone my changes.
@themarpe, I'm looking around in the Queue code area due to another issue.
I think validating a Google search suggests that the specific segfault of Here's example of a
I agree with the analysis: if a packet was sent through but is not validly constructed (or was somehow corrupted in flight), then this could occur. I'll dive into it, make it more robust, and add some tests to go along. I was checking this a little when it was first reported, but I assume it must come down to a packet being garbled but still successfully coming through the stack.
Loop me in if you have ideas/code on which you want me to provide feedback. I think you said to me that XLink for PoE devices uses TCP. If that's true, then corruption in the OS/hardware networking section of the stack isn't realistically possible. And I don't think a cosmic ray bit-flip would happen twice to the same customer. I'm suspicious of device-side code for both XLink and the layers on top of it. Perhaps it is something rare: an unexpected camera/data value which the firmware's json/pack code doesn't notice or transform correctly, which then leads to an invalid packet. If packet validation tests can be put in depthai-core code, I see at least two approaches:
A fuzz test would be good here. Send random values within a
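To illustrate the validation and fuzzing idea, here is a small sketch in Python. The packet layout is a hypothetical stand-in (payload, then metadata, then an 8-byte trailer holding an object type and a metadata length as little-endian 32-bit integers); it is not confirmed to match what depthai-core parses. The point is only that every length read from the packet is checked against the buffer size, so a garbled packet raises an exception instead of causing an out-of-bounds read.

```python
import random
import struct

# Hypothetical trailer: (object_type, metadata_length), both little-endian int32.
TRAILER = struct.Struct("<ii")


class PacketError(ValueError):
    """Raised when a packet fails validation."""


def parse_packet(buf: bytes):
    """Split buf into (object_type, payload, metadata), validating every length."""
    if len(buf) < TRAILER.size:
        raise PacketError(f"packet too short: {len(buf)} bytes")
    body_len = len(buf) - TRAILER.size
    object_type, meta_len = TRAILER.unpack_from(buf, body_len)
    # The metadata must fit entirely inside the bytes that precede the trailer.
    if meta_len < 0 or meta_len > body_len:
        raise PacketError(
            f"metadata length {meta_len} outside packet body of {body_len} bytes"
        )
    meta_start = body_len - meta_len
    return object_type, buf[:meta_start], buf[meta_start:body_len]


def fuzz(iterations=10_000, seed=0):
    """Feed random byte strings to parse_packet; return (accepted, rejected)."""
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(iterations):
        size = rng.randint(0, 64)
        buf = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            _, payload, metadata = parse_packet(buf)
        except PacketError:
            rejected += 1
            continue
        # Invariant: every byte of an accepted packet is accounted for.
        assert len(payload) + len(metadata) + TRAILER.size == len(buf)
        accepted += 1
    return accepted, rejected
```

Any exception other than `PacketError` escaping from `fuzz()` would indicate a validation gap. A native-code version of the same test would typically be run under a sanitizer so that out-of-bounds reads are reported even when they do not crash.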
I think it's highly likely this event is tied to a device-side crash or some other device-side error. Because of this, I think the sanest way would be to throw, as other functions do when there is a comms exception (much better than crashing on the host side).
I agree - was thinking of adding something with Catch. Will open a PR if/when time permits.
Hi, I've had this issue occur again but I don't see a core.[id] file having been created. I have the following two lines in my active crontab file:
Is there something I'm missing that should cause this to capture the data from the crash? The system was last rebooted on the 5th before I left for the day.
I tried the following 'find' command.
Is there another log this information may be contained in? I can see from my application's log that it stopped just after Jun 10th at 01:11. I'm going through the var/log files currently but nothing stands out so far.
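One thing that may be worth checking here (a guess at the cause, not a confirmed diagnosis): a resource limit set by a cron job applies to the shell that cron spawns and its children, not to a script started from another session. In addition, on many Linux distributions the kernel pipes core dumps to a handler program instead of writing `core.*` files into the working directory. The sketch below reports both settings when run from inside the affected script; the function name `core_dump_settings` is hypothetical.

```python
import resource
from pathlib import Path


def core_dump_settings(pattern_path="/proc/sys/kernel/core_pattern"):
    """Report the core-dump limit of this process and the kernel core pattern."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        pattern = Path(pattern_path).read_text().strip()
    except OSError:
        # File missing or unreadable (for example on a non-Linux system).
        pattern = None
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        # A soft limit of 0 means this process will not write a core file.
        "dumps_enabled": soft != 0,
        "core_pattern": pattern,
        # A pattern starting with "|" sends the dump to a handler program.
        "piped_to_handler": bool(pattern) and pattern.startswith("|"),
    }
```

If `dumps_enabled` is false, the limit never reached the script's process. If `piped_to_handler` is true, the dump is managed by whichever program the pattern names and will not show up as a `core.[id]` file next to the script.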
@dhartness I suggest you check latest
After running for several days in a script that utilizes six OAK POE devices, five OAK-1 and one OAK-D, one of them threw the following error and crashed the entire script:
Stack trace (most recent call last) in thread 8936:
#6 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in
#5 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f92638c1b, in
#4 Object "/lib/aarch64-linux-gnu/libpthread.so.0", at 0x7f92817647, in
#3 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887e104b, in
#2 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f886c587b, in
#1 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887282c7, in dai::StreamMessageParser::parseMessageToADatatype(streamPacketDesc_t*)
#0 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f925eb64c, in
Segmentation fault (Invalid permissions for mapped object [0x7f7489e000])
Segmentation fault
hviz@hvizpi18:~/HVision $ sudo screendump 1 > term001.log
At this point it has happened twice. We've normally observed the system, with a mix of up to ten cameras, running for over a week. Because the crash is so infrequent, we are unable to reproduce it, and because the trace doesn't indicate a line number, we're unable to verify it's not happening outside our try/except block.
I don't see any thrown exceptions on screen from the time the script was running and none were caught in the exception block.
Even if a camera has crashed, normally we catch it in an exception block and then reload the camera, but this bug has killed the entire script, dumping us back to the CLI, and I'm not sure how to determine which camera has caused it.
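Because a segmentation fault in a native library cannot be caught by `try`/`except`, one possible workaround is to run each camera in its own child process, so that a crash kills only that child and the parent can tell which camera it belonged to. The sketch below is only an illustration of that idea under stated assumptions: the worker code strings are stand-ins for a real per-camera script, and `run_camera_worker` is a hypothetical helper, not part of depthai.

```python
import signal
import subprocess
import sys

# Stand-in for a per-camera worker: it kills itself with SIGSEGV to imitate a
# crash inside a native library.
CRASHING_WORKER = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"
# Stand-in for a worker that finishes normally.
HEALTHY_WORKER = "pass"


def run_camera_worker(cam_index, code):
    """Run one worker in its own interpreter and describe how it ended."""
    proc = subprocess.run([sys.executable, "-c", code])
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal that
        # terminated the child process.
        name = signal.Signals(-proc.returncode).name
        return cam_index, f"killed by {name}"
    return cam_index, f"exited with code {proc.returncode}"
```

Here `run_camera_worker(3, CRASHING_WORKER)` reports that camera 3's process was killed by SIGSEGV while the parent keeps running and could restart it. The trade-off is that frames then have to cross a process boundary (for example through a queue or shared memory) instead of being shared between threads.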
dthompson_mre_03202023.zip
log_system_information.zip
This is the run() loop the camera uses:
`while not self.stopped:
    singletimecall = time.perf_counter()
    # Every 14 seconds, check whether this camera has been asked to reload.
    if((singletimecall - checkforvaluechange) > 14):
        if self.rm.getmyneedtoreload(self.camindex):
            checkforvaluechange = singletimecall
            self.reloadcam()
    #############################
    ######################
    self.trackletsData = None
    try:
        while self.needtoreinitialize:
            self.reloadcam()
        ###################################################
        imgFrame = self.preview.get()
        inDet = self.qDet.get()
        if imgFrame is not None:
            frame = imgFrame.getCvFrame()
            if(len(frametimestamps) > 1):
                dasfps = round((len(frametimestamps)/(frametimestamps[-1]-frametimestamps[0])),1)
            if (self.icountitems == 2) and (track is not None):
                self.trackletsData = track.tracklets
            if inDet is not None:
                detections = inDet.detections
                self.crop_queue.append((detections, frame, self.trackletsData, dasfps))
                frametimestamps.append(time.perf_counter())
        ###################################################
        self.recoveryattempts = 0
        self.myhealth[self.camindex-1] = True
        self.myhealth[self.camindex-1] = self.healthychild[0]
    except Exception as exception:
        # Mark the camera unhealthy, log the error, and flag it for reload.
        self.myhealth[self.camindex-1] = False
        exc_type, exc_obj, exc_tb = sys.exc_info()
        self.oaklogging.error(""+self.camlogident+"An error has occurred around the Oak Device while active: "+str(exc_type)+" "+str(exc_tb.tb_lineno)+"")
        self.oaklogging.error(""+self.camlogident+"Exception: {}".format(type(exception).__name__)+"")
        self.oaklogging.error(""+self.camlogident+"Exception message: {}".format(exception)+"")
        self.device.close()
        self.recoveryattempts += 1
        self.needtoreinitialize = True`