Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] StreamMessageParser::parseMessageToADatatype avoids exception and crashes script #779

Open
dhartness opened this issue Mar 21, 2023 · 9 comments
Labels
bug Something isn't working

Comments

@dhartness
Copy link

After running for several days in a script that utilizes six OAK POE devices, five OAK-1 and one OAK-D, one of them threw the following error and crashed the entire script:

Stack trace (most recent call last) in thread 8936: #6 Object "[0xffffffffffffffff]", at 0xffffffffffffffff, in #5 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f92638c1b, in #4 Object "/lib/aarch64-linux-gnu/libpthread.so.0", at 0x7f92817647, in #3 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887e104b, in #2 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f886c587b, in #1 Object "/usr/local/lib/python3.9/dist-packages/depthai.cpython-39-aarch64-linux-gnu.so", at 0x7f887282c7, in dai::StreamMessageParser::parseMessageToADatatype(streamPacketDesc_t*) #0 Object "/lib/aarch64-linux-gnu/libc.so.6", at 0x7f925eb64c, in Segmentation fault (Invalid permissions for mapped object [0x7f7489e000]) Segmentation fault hviz@hvizpi18:~/HVision $ sudo screendump 1 > term001.log

At this point it has happened twice. We've normally observed the system, with a mix of up to ten cameras, running for over a week. Because of the infrequency we are unable to reproduce and it doesn't indicate a line number so we're unable to verify its not happening outside our try/except block.

I don't see any thrown exceptions on screen from the time the script was running and none were caught in the exception block.

Even if a camera has crashed, normally we catch it in an exception block and then reload the camera, but this bug has killed the entire script, dumping us back to the CLI, and I'm not sure how to determine which camera has caused it.

dthompson_mre_03202023.zip
log_system_information.zip

This is the run() loop the camera uses:
`while not self.stopped:
singletimecall = time.perf_counter()
if((singletimecall - checkforvaluechange) > 14):
if self.rm.getmyneedtoreload(self.camindex):
checkforvaluechange = singletimecall
self.reloadcam()
#############################

######################
self.trackletsData = None
try:
while self.needtoreinitialize:
self.reloadcam()
###################################################
imgFrame = self.preview.get()
inDet = self.qDet.get()
if imgFrame is not None:
frame = imgFrame.getCvFrame()
if(len(frametimestamps) > 1):
dasfps = round((len(frametimestamps)/(frametimestamps[-1]-frametimestamps[0])),1)
if (self.icountitems == 2) and (track is not None):
self.trackletsData = track.tracklets
if inDet is not None:
detections = inDet.detections
self.crop_queue.append((detections, frame, self.trackletsData, dasfps))
frametimestamps.append(time.perf_counter())
###################################################
self.recoveryattempts = 0
self.myhealth[self.camindex-1] = True
self.myhealth[self.camindex-1] = self.healthychild[0]
except Exception as exception:
self.myhealth[self.camindex-1] = False
exc_type, exc_obj, exc_tb = sys.exc_info()
self.oaklogging.error(""+self.camlogident+"An error has occurred around the Oak Device while active: "+str(exc_type)+" "+str(exc_tb.tb_lineno)+"")
self.oaklogging.error(""+self.camlogident+"Exception: {}".format(type(exception).name)+"")
self.oaklogging.error(""+self.camlogident+"Exception message: {}".format(exception)+"")
self.device.close()
self.recoveryattempts += 1
self.needtoreinitialize = True`

@dhartness dhartness added the bug Something isn't working label Mar 21, 2023
@themarpe
Copy link
Collaborator

Hi @dhartness

Thanks for the report - I have some suspicions as to what might cause this, but nothing stands out as "the obvious" thing.

The issue stems from the native library and is therefore "uncatchable" from Python side.

I think it'd be best if you set the creating of coredump files & share that after this crash.

echo 'core' | sudo tee /proc/sys/kernel/core_pattern
ulimit -c unlimited
# Then execute the script

If you see the above crash you should get a core.[id] file - we'll be able to analyze that, to see into what state the system got for it to crash in such manner.

@dhartness
Copy link
Author

Will do. Thanks for the directive.

@dhartness
Copy link
Author

I'm going to add that command to the crontab. It finally occurred again but apparently someone had undone my changes.

@diablodale
Copy link
Contributor

@themarpe, I'm looking around in the Queue code area due to another issue.

parseMessageToADatatype() does do some pointer things without validating them. It assumes the earlier readMove() in the Queue loop throws when XLink has an internal error. However, if XLink is "X_LINK_SUCCESS" yet XLink didn't populate a valid StreamPacketDesc...then parseMessageToADatatype() blindly uses them and could segfault. I could imagine a scenario of an OAK device creating a StreamPacketDesc_t with unexpected/bad values and then that invalid struct is sent to the host and then passed to parseMessageToADatatype() that blindly uses it.

I think validating a StreamPacketDesc is limited. Code could check for data != null and length > 8. But its not possible to validate that the memory block to which data points is actually length size--that needs to be done/checked within the XLink code itself. I don't (yet) see in code any checks that the serializedObjectSize is actually the size needed for each of the parseDatatype<T> in the switch/case. They usually call utility::deserialize() and then nlohmann::json::parse(raw pointers). So if those pointers are invalid/out-of-bounds, it could segfault.

Google search suggests that the specific segfault of Invalid permissions for mapped object is often associated with trying to manipulate const values (i.e. trying to write to read-only memory). But perhaps that segfault message is misleading. 🤷‍♀️

Here's example of a StreamPacketDesc with ::data=null and ::length=0. It causes pointer wraparound and then a segfault.
https://godbolt.org/z/eYhncd4Gq

@themarpe
Copy link
Collaborator

I agree with the analysis - if packet was sent through but is not validly constructed (or somehow corrupted in flight), the this could occur.

I'll dive into it and make it more robust + add some tests to go along.

Was checking this a little when it was first reported, but i assume it must fall down to a packet being garbled but still successfully coming through stack.

@diablodale
Copy link
Contributor

Loop me in if you have ideas/code on which you want me to provide feedback.

I think you said to me that the XLink for PoE devices uses TCP. If that's true, then it isn't realistically possible to corrupt in the OS/hardware networking section of the stack. And I don't think a cosmic ray bit-flip would happen twice to the same customer. I'm suspicious of device-side code for both xlink and the layers on top of it. Perhaps, something rare, an unexpected camera/data value which the firmware's json/pack code then doesn't notice or transform correctly which leads to an invalid packet.

If packet validation tests can be put in depthai-core code, I see at least two approaches:

  • throw: Leads to shut down of the Queue loop and eventually disconnect. Aggressive yet safe. Customers are responsible for handling; which the OP writes they do.
  • drop: Is this valid? Can any/all packets of an ADatatype derived type be dropped? Looking through the list of the derived in the switch/case...it seems like they can be dropped. If yes, tests could be put in parseMessageToADatatype(), test fails return an empty shared_ptr, and in the Queue loop if packet.empty() { log; continue; }.

A fuzz test would be good here. Send random values within a StreamPacketDesc from a device. Let device-side xlink package it and send to host and wait for the crashes. And then keep adding as many checks as possible.

@themarpe
Copy link
Collaborator

@diablodale

I think its highly likely this event is tied with a device side crash or some sort of error.

Due to this, I think sanest way would be to throw as do other functions when there is a comms exception. (much better than crashing host side)

A fuzz test would be good here. Send random values within a StreamPacketDesc from a device. Let device-side xlink package it and send to host and wait for the crashes. And then keep adding as many checks as possible.

I agree - was thinking of adding something with Catch. Will open a PR if/when time permits

@dhartness
Copy link
Author

dhartness commented Jun 12, 2023

Hi,

I've had this issue occur again but I don't see a core.[id] file having been created.

I have the following two lines in my active crontab file:

# For more information see the manual pages of crontab(5) and cron(8)
#
# m h  dom mon dow   command
@reboot echo 'core' | sudo tee /proc/sys/kernel/core_pattern
@reboot ulimit -c unlimited

Is there something I'm missing that should cause this to capture the data from the crash?

The system was last rebooted on the 5th before I left for the day.

hviz@hvizpi18:~ $ cd HVision/
hviz@hvizpi18:~/HVision $ sudo crontab -e
No modification made
hviz@hvizpi18:~/HVision $ ls -l /proc/sys/kernel/c*
-rw------- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/cad_pid
-r--r--r-- 1 root root 0 Jun  5 16:56 /proc/sys/kernel/cap_last_cap
-rw-r--r-- 1 root root 0 Jun  5 16:56 /proc/sys/kernel/core_pattern
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_pipe_limit
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/core_uses_pid
-rw-r--r-- 1 root root 0 Jun 12 08:16 /proc/sys/kernel/ctrl-alt-del
hviz@hvizpi18:~/HVision $

I tried the following 'find' command.

hviz@hvizpi18:~/HVision $ sudo find / -name core.*
/usr/share/man/man5/core.5.gz
/usr/lib/python3/dist-packages/numpy/ma/core.py
/usr/lib/python3/dist-packages/numpy/ma/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/numpy/distutils/core.py
/usr/lib/python3/dist-packages/numpy/distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/setuptools/_distutils/core.py
/usr/lib/python3/dist-packages/setuptools/_distutils/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/pidng/core.py
/usr/lib/python3/dist-packages/pidng/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/certifi/core.py
/usr/lib/python3/dist-packages/certifi/__pycache__/core.cpython-39.pyc
/usr/lib/python3/dist-packages/idna/core.py
/usr/lib/python3/dist-packages/idna/__pycache__/core.cpython-39.pyc
/usr/lib/python3.9/distutils/core.py
/usr/lib/python3.9/distutils/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/pyparsing/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/certifi/__pycache__/core.cpython-39.pyc
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/core.py
/home/hviz/.local/lib/python3.9/site-packages/pip/_vendor/idna/__pycache__/core.cpython-39.pyc
hviz@hvizpi18:~/HVision $

Is there another log this information may be contained in? I can see from my application's log that it stopped just after Jun 10th a 01:11. I'm going through the var/log files currently but nothing currently stands out.

@themarpe
Copy link
Collaborator

@dhartness I suggest you check latest develop or switch to v2.22.0 release once its out. From there we can check if it resolves the issue

jakgra pushed a commit that referenced this issue Mar 20, 2024
Updated YoloDetectionNetwork docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants