-
Notifications
You must be signed in to change notification settings - Fork 120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
start+stop is thread unsafe -- leads to crashes (e.g. bootloader_version test failing 50% of time) #257
Comments
Here's the stack
|
Technically, it is from a debug build as follows:
|
I suspect a relation between the Here is one of today's many fails, this time with...
if I cd to If I run
|
I've seen a pattern that is visible when Tests that usually successful have two Looking at the code and behavior, I smell a race condition bug related to Usually successful
Fail
|
I've got a good repo that so far always fails. As in the OP it is src\xlink\XLinkStream.cpp line 33
My question is... the first part of that prepare watchdog thread which is constructing an XLinkStream and needs a I put a spdlog::debug on line 25 to output
I think this is one source of the crashes. I see depthai-core/src/device/DeviceBootloader.cpp Line 390 in 2657938
First issue...you dereference and loose the nature of a shared_pointer. The I see 14 places where this unsafe behavior of passing a raw reference from a shared_ptr is happening. It is very likely all of them are potential places for race conditions and crashes. Second issue...There is also that the I grabbed ownership using
|
I think i've found a bug in XLink usage. After hack above of holding ownership, I kept having the app exit with exitcode 1. I couldn't find anywhere an exception or windows SEH. So I looked into XLink itself. XLink has several places where it After enabling XLink logging(warn) I can get a 90% reliable app exit 1 and it appears like this
The logs above are explaining that the watchdog lambda finished (doesn't always), the join() in |
Found the XLink issue. I've logged it separately in your xlink repo luxonis/XLink#18 |
Also found use of XLinkConnection without ownership two times in DataInputQueue::DataInputQueue(const std::shared_ptr<XLinkConnection> conn, const std::string& streamName, unsigned int maxSize, bool blocking)
: connection(std::move(conn)), queue(maxSize, blocking), name(streamName) {
// open stream with default XLINK_USB_BUFFER_MAX_SIZE write size
XLinkStream stream(*connection, name, device::XLINK_USB_BUFFER_MAX_SIZE); |
Thanks for digging into this issue. |
The majority of these thread/ownership code issues are in depthai (not xlink). These are not xlink bugs. These bugs point to core design/architectural issues in depthai. Depthai classes need to have new members to hold ownership of things on which they operate, a cascade of ownerships related to them (definitely connection, maybe device, etc) to populate those members, destructors that will cleanup, etc. Your team should have conversations and whiteboard on how you want to manage ownership using the examples and repro cases and code I've pointed to above. This is your code, you have the corpus of knowledge. I don't have the corpus, don't have architectural ownership, and don't have the role to edict how you architect. This is work your team needs to do. My hacks are private as I'm not overhauling your classes...just quick grabbing ownership so I can start actually coding my app. Its messy, still not entirely thread safe, and definitely not for merging for wide usage. The xlink issue is separate from all the thread/ownership issues and the xlink "bug" is separate at luxonis/XLink#18. Those xlink changes are isolated and trivial. I can make a PR for those xlink issues. |
I agree, we will look into these once we finish current efforts. |
When you are ready, I can push my hacks somewhere so you can see how ownership is missing and how I've got temporary hacks in place for some of the most problematic. There is definitely a better approach by shuffling classes/members as my changes are temporary hacks, repetitive, and still have some race conditions. |
Main reason for this "design" was because of current XLink limitations, Data queues cannot unblock the XLink R/W calls, etc... This is also the main pain point which needs work on - both from XLink and core perspective - to make more resilient to device connect/disconnects. That said, the causes where this shows are mostly related to trying to connect to a device which isn't in a "ready" state or going out of a connection in a not clean way. In your testing, do you repeatedly call eg. bootloader_version? How does it behave if you run the example with some time in between or a device power cycle? Does the error rate remain the same? Or does it work in that case? Will read through the rest of the observations and dig into XLink more soon |
- this is not for production. It is not quality code and only for debugging and investigation of ownership/threading issues
...but since depthai is multithreaded there is no safe "duration of the call". One must always get ownership using the intention of shared_ptr....one must pass it by value so that shared_ptr will do the refcount++ and therefore the connection will live during the duration of the call. My repro is consistantly to run your ctest as per https://github.com/luxonis/depthai-core#running-tests while I have ensured that all are ON -> DEPTHAI_BUILD_TESTS, DEPTHAI_BUILD_EXAMPLES, and DEPTHAI_TEST_EXAMPLES. The pain points tend to be tests 7-12. When they run in their natural sequence, depthai fails the majority of the time. 😟 Even after all my hack fixes, I still get occassional failures (though I have some ideas of race conditions that might be causing them). To date, I have never seen a 100% pass rate on your full test suite. From what I can see in your CI https://github.com/luxonis/depthai-core/runs/3735947677?check_suite_focus=true you only run two tests. You never run your full test suite. I suspect you are missing lots of crashes. I know some of your full suite needs a gui...and the test timeout dance. But tests 1-9 need no UI you should be able to CI them easily. Overall, I encourage you to run (at minimum) tests 1-11 (rgb_preview) in your CI. One of my views...is if a dependency's (e.g. depthai) doesn't pass its own test suite with a clean build, then there is a low chance of my project using said dependency to be successful. I'll always be chasing issues/bugs due to underlying dependency issues. "Can't build on quicksand." That's why I started digging into depthai in Spring and returned as planned now in Fall and am hardcore trying to get depthai's test suite to pass. My particular usage of depthai will be a DLL that "plugs" into a 3rd party EXE. I can't crash that EXE nor can I expect a user to powercycle OAK on every config change because the user rapidly makes changes...probably hundreds of times in a day. I will integrate with a UI that allows the user to dynamically at runtime choosing one or more OAK devices, choose the config for that particulate OAK (resolution, rgb, mono, depth, etc) all runtime without app restarts or crashes. Its not tenable for depthai to crash when a user stops depthai, chooses a different rgb resolution, and then restarts depthai. I won't be able to recommend OAK hardware nor sell my solution supporting it. At the last commit of https://github.com/diablodale/depthai-core/commits/hack_257_ownership are all the hacks for threads/ownership. You'll also see some commits before that which I'll push as formal PRs later this week or next to fix other issues I've already opened. And a PR is still forthcoming on XLink itself. That one is easy to fix. |
It does, the caller doesn't decrement its own refcount for the duration till it returns from that function, so as long as the caller has a valid ref, the called function will have it as well.
We've yet to setup local runners with HW to test against and be tied to CI. We do however test these locally currently (unscalable, working on creating runners - security is a bit of an issue with GH self-hosted runners)
I understand - I agree this not being a solution, was just a question to see what could be under the crash, as some of this behavior we've seen before. Makes it easier to start exploring the root cause. The reason I'm asking is that there are some XLink issues which are the root of how to resolve this "completely correctly", and it gives a quick glance on the severity of jumping to them. And thanks for the snippets and the XLink PR - will take into consideration when circling back to core. |
Please don't take my pointing of errant things in a bad way. My intention is to point to the bug, not people. We're human, make mistakes, fix them, and move on. 👍 I already debugged and know the source of the first wave of crashes. Build it Windows x64 Debug. Repeatedly run the EXE in a batch loop. Over and over. You'll be lucky if you can do 2 runs. After you fix the ownership of the connection then the code will run longer to then suddenly terminate due to the exit() issue in XLink. Once I found it, its an obvious bug and easy/isolated to fix. I have one question on it I'll put in the PR notes. I'll open that PR within an hour. A class diagram would be useful so I can see the hierarchy and cross-usage. Does such a diagram exist? That can help me independently -or- if we have a interactive convo assists on which classes need to hold ownership to others. Given time, I'll grok it myself as I debug/fix. I would rather code my app, but I need to stablize depthai-core enough for my use cases. |
Sorry for the disappointment @diablodale . And thanks for the help. |
good news 🎉 I've got a 100% test suite passing tests 1->50. This is a dramatic improvement! test 51 I'll remove all my debugging logging and push the changes I made. It's a much better approach than my earlier hacks. Can still be improved but its a great start. |
AWESOME thanks @diablodale ! |
OK. I've pushed my fixes/workarounds to https://github.com/diablodale/depthai-core/tree/hack_257_ownership All the above issues are true. I won't repeat them again here. I was able two days ago to get 100% pass rate on the full 61 item test suite. My approach didn't call The depthai codebase uses the Thread 1 - the watchdog 🐕🦺 Thread 2 - the DataOutputQueue 🏭 Thread 3 - the destroyer ☠ All three threads are running in parallel. None of the threads are synchronized between themselves. Thread 1🐕🦺 and thread 2🏭 need a valid connection to run correctly. Therefore they throw errors and exceptions which all need to be managed. See my branch for fixes. But... threads 1 and 2 are not at some beautiful/easy location in their loop. They are in random locations of code in their loop and therefore the XLink connection is closed at random times. Thread 2🏭 is problematic. First, it is the thread that often blocks leading the current codebase to do the harsh Thread 4 - XLink 🔗 Now we have 3 threads in a death scenario.
💥Hard crash of SEGFAULT and memory violations. These hard crashes due to memory violations are VERY difficult to manage. These must be fixed. Since these are "hardware" errors, they are outside standard C++ exception handling mechanisms. Linux has SIG handlers but the C++ object unwinding breaks down. Windows has proprietary SEH Windows+MSVC has a compiler option that is a workaround. |
More analysis. All above issues in depthai and XLink continue. Focusing on the XLink code...
I'm attempting to fix the XLink issues without introducing C++ into XLink. 🙄 Yet, it is so much easier with C++ RAII and unique/shared_ptr...but does change the interface and cascade some work. If you are working on this in parallel, let's share ideas. If you are not yet, then I'll continue to push forward as I need to get this fixed so my app meets my reliability standards. |
Thanks you! To my knowledge we are not working on this in parallel but @themarpe can confirm. |
I finished the thread+ownership fixes I've planned for this wave. Overnight I'll run If all is good after the run, I'll first make a PR for the XLink changes. I want to get some more eyes on it for style+approach. |
Thank you! |
Great news @diablodale |
- fix many thread/ownership issues for start/stop scenarios - XLinkStream::readMove() for moving packet ownership - fix XLinkStream move semantics - removed all use of XLinkStream::readRaw as often leads to memory violations and/or memory leaks - deprecate all XLinkStream::readRaw...() APIs - fixes luxonis#257
A fix for this overall issue is in diablodale@3ebfd7d |
- fix many thread/ownership issues for start/stop scenarios - XLinkStream::readMove() for moving packet ownership - fix XLinkStream move semantics - removed all use of XLinkStream::readRaw as often leads to memory violations and/or memory leaks - deprecate all XLinkStream::readRaw...() APIs - fixes luxonis#257
Mistakenly closed this issue... @diablodale the move semantic is now merged. |
@themarpe, yes. Will do before end of year. I'll rebase it on to current |
- fix many thread/ownership issues for start/stop scenarios - XLinkStream::readMove() for moving packet ownership - fix XLinkStream move semantics - removed all use of XLinkStream::readRaw as often leads to memory violations and/or memory leaks - deprecate all XLinkStream::readRaw...() APIs - fixes luxonis#257
Using the
ctest
instructions in this repo, or manually running/build/examples/bootloader_version.exe
...leads to a hard crash maybe 50% of the time. If I run that EXE in vscode to debug it, I can seeXLinkStream::XLinkStream(const XLinkConnection& conn, const std::string& name, std::size_t maxWriteSize) : streamName(name)
constructor throws and nothing is catching so it hard crashes.Setup
Repro
build
thenctest
Result
Expected
No errors or crashes. All tests to pass.
Notes
verbose ctest
Debugger tracing
Run and debug
/build/examples/bootloader_version.exe
directly in vscode so I can debug the crashthe crash is in
src\xlink\XLinkStream.cpp
line 33depthai-core/src/xlink/XLinkStream.cpp
Lines 32 to 34 in c1b697a
streamID does equal INVALID_STREAM_ID. It is
0xDEADDEAD
.and this->streamname = __watchdog
The text was updated successfully, but these errors were encountered: