
limiter_node fails to decrement/open with continue_msg yet works with int #419

Closed
diablodale opened this issue May 27, 2021 · 12 comments

@diablodale

Using the default limiter_node, sending continue_msg() via an edge to the limiter's decrementer() does not decrement and reopen the limiter node. It stays closed and no messages pass.

I made a single code change, switching to limiter_node<T, int>, and sent (int)1 via the edges. Now I have no issues; the limiter_node works as I expect. 🤔

Related to discussion in #342 (comment)

Setup

  • Windows 10 64-bit, Debug and Release compiles
  • Microsoft VS2019 v16.8.4 through v16.10.0 Community
  • oneAPI TBB v2021.1.1 and v2021.2.0

Repro

I acknowledge a full compilable code example is best. My bandwidth is limited at the moment. Until something else exists...

  1. Setup as above
  2. Create graph with various node types including multifunction nodes, and put a limiter node near the entrance of the flowgraph.
  3. create edges connecting the nodes and on the multifunction nodes some edges going back to the limiter's decrementer with associated try_put() in their lambda.
  4. Set limiter_node<myClass>(flowgraph, 2)
  5. Set some nodes' concurrency to tbb::flow::serial and others to 3
  6. Run your graph

Result

About 2 or 3 messages will traverse the graph. Then no more.

Expected

Never ending flow of messages.

Workaround

The following workaround has been used for 3 months with 100% success, across multiple VS versions and both tbb v2021.1.1 and v2021.2.0.

  1. Change to limiter_node<myClass, int>(flowgraph, 2)
  2. Update the multifunction_node output tuples, and change all try_put(continue_msg()) calls to try_put(1)

Notes

I continue to be suspicious about the TBB threshold_regulator code in _flow_graph_body_impl. The specialization for continue_msg has execute() but doesn't have try_put_task() with a decrement like its sibling specialization. Missing code? Race condition?

@aleksei-fedotov
Contributor

aleksei-fedotov commented May 31, 2021

Hi @diablodale. Thank you for opening up a separate issue to track this!

The specialization for continue_msg inherits continue_receiver, which has try_put_task() that calls the execute() method, which, in turn, calls limiter_node<>::decrement_counter() so that it adjusts the gate accordingly.

<..> on the multifunction nodes some edges going back to the limiter's decrementer with associated try_put() in their lambda.

Please make sure multifunction nodes indeed send messages to the output ports that are connected to the limiter_node's decrementer.

I agree that this looks suspicious, since with the integral specialization of the limiter_node everything works on your side.
Could you please provide a reproducer then?

@diablodale
Author

I have a reproducer with bulk private code removed. The framework/flow is still there which I would prefer not be public.
Do you have a process by which I can provide the reproducer privately to a verified Intel employee and Intel to keep it confidential ala NDA?

@aleksei-fedotov
Contributor

I have an idea of what might be going wrong: have you considered the buffering policy of the limiter_node's predecessor(s) in your graph? Perhaps some of the messages rejected by the limiter_node are simply lost?

Regarding the reproducer, you can use the official inteltbbdevelopers@intel.com address or send it directly to me; my email address should be listed on my GitHub account page. Please let me know if it is not. Whichever you choose, I am sure your IP will be treated appropriately, since we have strict guidelines and trainings about working with third-party code so as not to leak its IP, be it customer- or company-owned.

@diablodale
Author

Intel asks if I considered buffering policy?

Yes. I want to drop messages. My scenario is a graph that processes sensor frame data. If n frames are still within the graph, then the limiter should drop incoming frames. It is more important to have current frames, dropping the excess until the graph completes the frames already in flight.

While further reducing, I reached a point where I could generate a trace for Flow Graph Analyzer. The graph chart looks correct. But the graph rule-check is reporting errors that I do not see in the code, and the node inspector is reporting incorrect types on outlets even though they are clearly correct in the template params. These seem to be bugs in the Flow Graph collector and analyzer; I won't discuss them further in this bug thread.

I'm sending an email to inteltbbdevelopers@intel.com with a ZIP that contains this setup, cmake, and source files.

Setup

  • Intel i7-10875h (8 cores/16 threads)
  • Windows 10 64-bit
  • Microsoft VS2019 v16.10.0 Community
  • oneAPI TBB v2021.2.0, installed from the oneAPI web installer. Please note: your installer, the folders the installer creates, GitHub, and the documentation all call this 2021.2.0. That is inconsistent with the PACKAGE_VERSION 2021.2.2.0 inside C:\Program Files (x86)\Intel\oneAPI\tbb\2021.2.0\lib\cmake\tbb\TBBConfigVersion.cmake that the installer writes to the drive. I recommend future versions of TBB correct this inconsistency so that find_package(TBB x.y.z) works correctly. For example, given this existing version, TBBConfigVersion.cmake should contain PACKAGE_VERSION 2021.2.0 or PACKAGE_VERSION 2021.2.0.0.

Repro steps

  1. Setup as above. I recommend you do it on an 8-core/16-thread CPU so that we have the same thread pools
  2. cmake config
  3. copy tbb12.* and tbb12_debug.* to your build location
  4. At the top of pipeline_oneapi_tbb.h, set LIMIT_WITH_INT to your desired value. Start with the default.
  5. cmake build
  6. run onetbb-419.exe
  7. Wait 10 seconds for the exe to stop itself.

Good Result 🙂

Output showing a cascade of capture -> put -> get. These cascades should continue for the full 10 seconds.

Bad Result 😢

Output showing a cascade of capture -> put, but fewer than 20 get.

Notes

Run the above repro steps with both the default LIMIT_WITH_INT 1 and LIMIT_WITH_INT 0. This controls whether the limiter_node uses an int or a continue_msg. The behaviors are demonstrably different! 🤔

LIMIT_WITH_INT 1 often produces the good result. However, with this simplified code, I now occasionally experience the bad result; in that bad scenario, only one get occurs.

LIMIT_WITH_INT 0 produces the bad result. In this bad result, there are usually 1-25 gets and then no more until the program ends.

These behaviors make me suspect a race condition somewhere. And the simplified repro code exposes more bad scenarios than my real code, with its heavier/longer computation in the nodes, does.

@diablodale
Author

Hi. Checking back here to confirm that you received my email with the repro code. I sent it on the same day as above, 2 June 2021.

@aleksei-fedotov
Contributor

Yes, we received it and did the analysis. We found two issues:

  1. The documentation of the limiter_node is vague about the behavior of its specialization with continue_msg as the decrementer's template parameter. We are working on improving that part in this PR.
  2. The calls to the graph::reset() method are made in a concurrent environment, which clears all the buffers and hence leaves the graph in an uncertain condition with respect to which messages have been processed by which node and which have not and will not be. Therefore, graph::reset() should be called only in a serial part of the algorithm.

@diablodale
Author

diablodale commented Jun 18, 2021

In the sample code I provided, there is only one call to graph::reset(). It is in the Pipeline() constructor, which is naturally serial itself. The constructor is triggered at the start of main(), which is naturally also serial.

And am I right to assume that your (2) above is alluding to the fact that within that constructor I call taskArena.enqueue( []{ flowgraph.reset(); } )? That happens before any other tasks or messages are created. Could there be a race condition if the first line of that enqueued task is somehow run out of sequence, due to new tasks being created by code that runs after this constructor?

The intention of the reset() is to (re)attach the graph to a specific task arena (for independent threads) and a specific task group (for graph shutdown). I'm not aware of any other constructor arguments or APIs that will associate a flow graph with a specific arena. This is why I call taskArena.enqueue( []{ flowgraph.reset(); } ).

A long time ago, I looked at task_arena::execute(). But the documentation states that it first tries to execute directly, and if that fails it "wraps the functor into a task, enqueues it into the arena...". This behavior was indeterminate, and therefore I chose task_arena::enqueue() so that I always get the same behavior.

If I can guarantee execute() never returns until after the reset(), then I could split the lambda and run just the reset() in execute() and run the rest of the lambda in an enqueue(). But if execute() uses its internal enqueue() code, then we are back to the same bug scenario.

Is there some other way to achieve the same goal? To have a unique arena, and to create a flowgraph associated with it?

I understand what you are describing in (1). That approach would add unneeded complexity in my solution. So I will continue to use an integer decrementer.

@aleksei-fedotov
Contributor

The enqueuing happens only if all available slots are occupied by other threads. Since task_arena::execute() is called in a serial environment here, you can be sure it won't fall back to enqueuing.

@diablodale
Author

diablodale commented Jun 18, 2021

ok. I read the (1) doc update in the PR. That looks good.

I think there is value in adding a sentence or two, somewhere, describing the one and only way to assign a flow graph to a specific arena: the reset() approach must be done within task_arena::execute(). Since this is the only reliable way, I believe clear documentation will help those like me in the future. Since it involves both the arena and the flow graph, it might be best placed on the page for one or the other of those classes.

@aleksei-fedotov
Contributor

Yeah, this information should definitely be written somewhere. I'm not sure about the specification, but in the docs for sure.

BTW, what do you think about letting the graph constructor accept an arbitrary task arena to attach its execution to? Would that be useful and less error-prone for you?

@diablodale
Author

Perhaps "Flow Graph Tips and Tricks" section?

I also was thinking this one-and-only-way seemed an API itself rather than a documented strict set of API calls. 😏

I (re)read some articles, now having somewhat more TBB experience, and I can imagine three approaches I would explore to achieve my goal. I can also imagine the generic desire of "I want to run my flow graph in an explicit grouping of arena, task group, or isolation".

  1. Explicit task_arena. This is the brute-force approach I currently use. The docs state this is the low-level scheduler API. Yes, it would be helpful to have a task_arena param on a graph constructor. I would use graph(task_group_context, task_arena)
  2. Explicit task_group. This is documented as the high-level scheduler API. Makes me wonder if I might be able to use this.
  3. this_task_arena::isolate(). This is documented as an isolation approach to use for correctness vs. performance (via task_arena).

I do not have enough experience to know if 2. and 3. make sense with a flow graph. But it is something I would explore given time and an API.

My specific scenario is to have some TLS isolation; preventing accidents, not malicious attacks. I use OpenCL via the OpenCV SDK. The OpenCV SDK has an API to choose the "current OpenCL gpu"; the implementation stores it in TLS. Now imagine my code and some 3rd-party code that both use OpenCV, and perhaps also TBB. My code calls the OpenCV API in the task_scheduler_observer to set the "current OpenCL gpu" to my Intel gpu. Sometime later the 3rd party calls the same OpenCV API to set the "current OpenCL gpu" to my Nvidia gpu. We live in a heterogeneous world. 🙂

The OpenCV API to select the gpu is expensive... and it carries a tremendous amount of state. It is important that the tasks in my flow graph don't have their gpu changed. When threads run tasks in my flow graph, it must be guaranteed that the OpenCV gpu choice stored in TLS was not accidentally changed by a 3rd party calling the same OpenCV APIs. I used the brute-force approach of a dedicated task_arena so that I am guaranteed separate threads, and therefore no TLS data can possibly be shared between my code and any 3rd party.

@isaevil
Contributor

isaevil commented Oct 6, 2022

@diablodale please take a look at this documentation page: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/attach_flow_graph_to_arena.html. This page was added as a result of #785.

We have also created #932 to track the request for a usability improvement of flow_graph-to-task_arena attachment.
I am closing this issue. Feel free to reopen if any questions remain.

@isaevil isaevil closed this as completed Oct 6, 2022