
Efficient queuing of tasks that start waiting? #165

Closed · benvanik opened this issue Jul 7, 2020 · 12 comments

benvanik commented Jul 7, 2020

In my usage pattern I'm creating a possibly large number of tasks, and a decent percentage of them may start out waiting. For example, think of the pathological Ticket::Queue use case where the first (or one of the first) operations would be a wait:

void runTasksConcurrentThenSerially(int N) {
    marl::Ticket::Queue queue;
    auto tickets = queue.take(N);
    for (int i = 0; i < N; i++) {
        marl::schedule([=] {
            tickets[N - 1 - i].wait();  // or some other ordering
            doSerialWork();
            tickets[i].done();
        });
    }
}

Right now each task will get enqueued, switched to, and then immediately switched out when ticket.wait() blocks, and for certain workloads this can be the norm rather than a pathological case.

Maybe the better approach for this is to use the tasks-in-tasks technique and only create the tasks when they are runnable? If so, are there other performance gotchas from doing that? For example, in this diagram:
[image: task graph fanning out from A to B0…Bn, then back in to C]
Would it be more efficient to queue up all of these ahead of time using tickets (so Bn waits for A, C waits for Bn via a waitgroup, etc), or instead to have each task queue up the ones that follow it (so enqueue A only, then A runs and enqueues Bn, etc)?

I'm mostly interested in the fan-out case where A->Bn may be going 1 to 100 and Bn->C would be going 100 to 1. I wouldn't want C to wake 100 times only to go back to waiting 99 of those times, for example. But I also wouldn't want to pessimize the fan-out across threads at Bn by waiting until A was executing to enqueue the tasks.

Thoughts?

benvanik commented Jul 7, 2020

I guess the other thing would be the cost of the fibers involved; for example, if the tasks are enqueued and each allocates a fiber (that holds 1MB of stack) and then they all sit waiting, keeping those fibers pinned, it'll cause large memory growth. I'd want these tasks to not take a fiber from the pool until they were able to execute.

ben-clayton commented Jul 7, 2020

Yes - you are right to be concerned about scheduling a whole load of tasks that each start with an immediate wait: you're forcing every task to begin with a fiber yield. It's going to be slow, and you're likely going to allocate a whole fiber stack per task.

If you are using marl::Tickets, you can use marl::Ticket::onCall() to do work or schedule new tasks when that ticket is called. SwiftShader uses this approach for kicking the pixel tasks.
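
For concreteness, a minimal sketch of that shape (doPrerequisite() and doFollowup() are just placeholder helpers, not marl API):

marl::Ticket::Queue queue;
auto first = queue.take();
auto second = queue.take();

// Runs once `second` is called, i.e. once `first` is done. Nothing sits
// blocked in a wait() in the meantime, so no fiber is pinned.
second.onCall([] { doFollowup(); });

marl::schedule([=] {
    doPrerequisite();
    first.done();  // calls `second`, which invokes the onCall function
});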

I wouldn't want C to wake 100 times only to wait 99 of them, for example.

The internals of marl::Ticket are actually reference counted and sharable. You can pass the ticket into the lambda for each of [B1, B2 ... BN], and when the last task of B has returned, you'll implicitly drop the last reference to the ticket, and it'll automatically call the next ticket in the queue for you (no need to call Ticket::done() for ticket B). For your diagram, you could grab 5 tickets (ticketA, ticketB, ticketC, etc):

  • Create taskA by either calling marl::schedule or ticketA.onCall().
  • At the end of taskA, it's time to schedule the B tasks. Pass ticketB to each of these by lambda capture.
  • Use ticketC.onCall() to do C's work, and so on... (rough sketch below).
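
Putting those steps together, a rough sketch (doA()/doB(i)/doC() and kNumBTasks are placeholders; this is just one way to wire it up):

marl::Ticket::Queue queue;
auto ticketA = queue.take();
auto ticketB = queue.take();
auto ticketC = queue.take();

// taskA is first in the queue, so it can simply be scheduled directly.
marl::schedule([=] {
    doA();
    // Fan out the B tasks, passing each one a copy of ticketB. When the last
    // copy is dropped, the next ticket in the queue (ticketC) gets called.
    for (int i = 0; i < kNumBTasks; i++) {
        marl::schedule([=, ticketB = ticketB] { doB(i); });
    }
    ticketA.done();  // A is finished; ticketB becomes callable.
});

// C's work runs only once ticketC is called, i.e. after every B has finished.
ticketC.onCall([] { doC(); });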

Does this make sense? Does this work for you?

@ben-clayton

Also: For these sorts of chained fan-out/fan-in patterns, I think it's conceptually simpler to consider the B's and D's here as a task with sub-tasks (example):

So instead of thinking:

           [A]
     ┌────┬─┴──────┐
     ↓    ↓        ↓
   [B0] [B1] ... [Bn]
     ↓    ↓        ↓
     └────┴─┬──────┘
            ↓
           [C]
     ┌────┬─┴──────┐
     ↓    ↓        ↓
   [D0] [D1] ... [Dn]
     └────┴─┬──────┘
            ↓
           [E]

We have:

           [A]
            ↓
           [B]
            ↓
           [C]
            ↓
           [D]
            ↓
           [E]

Where [B] and [D] are implemented something like:

void taskB(int n) {
    marl::WaitGroup wg(n);
    for (int i = 0; i < n; i++) {
        marl::schedule([=] {
            defer(wg.done());
            // Do [Bi] work 
        });
    }
    wg.wait();
}

Or if there's a known, fixed number of sub-tasks for [B], you can use marl::parallelize():

marl::parallelize([&] { /* B0 */ }, [&] { /* B1 */ }, [&] { /* B2 */ });

ben-clayton commented Jul 7, 2020

But to reiterate - if you go the path of using marl::Ticket, you don't actually have to use the marl::WaitGroup to fan-in. Something like this should work:

void taskB(int n, marl::Ticket ticketB) {
    for (int i = 0; i < n; i++) {
        // Once all these sub-tasks are done, the last reference to ticketB will
        // be dropped, automatically calling the next ticket in the queue.
        // (The init-capture keeps a copy of ticketB alive even though the
        // body never names it.)
        marl::schedule([=, ticketB = ticketB] {
            // Do [Bi] work
        });
    }
}

This avoids a fiber yield at wg.wait(), and removes a bit of latency, as ticketC will be called immediately after the last B sub-task has finished.

benvanik commented Jul 7, 2020

Interesting - I hadn't played with Ticket::onCall yet. So it looks like if I scheduled work in there I'd avoid the potential fiber over-allocation issue (as the tasks wouldn't exist until they were able to be started), though it would still introduce one additional fiber swap (ticket.done() -> schedule(onCall) -> schedule(new work)) - which could be optimized away if I were doing things myself, since in the done/callAndUnlock handler I could just directly schedule the follow-up work. Digging deeper, it seems that, at least if the caller of ticket.done() returns immediately, the onCall may use the same fiber, and if the onCall returns immediately after scheduling more work the same applies. Would still be nice to avoid the work when not required, though :)

But to reiterate - if you go the path of using marl::Ticket, you don't actually have to use the marl::WaitGroup to fan-in.

Ohh I like that - I'd been worried about WaitGroup for fan-in because of that extra wake after wait (when that was the end of the task).

These examples are very helpful in helping me find parts of the code to dig into. Thanks for taking the time to share them :)

Maybe one thing that falls out of this discussion is some notes on onCall usage in ticket.h? I had skipped over it initially thinking it was a function for internal use and missed its awesome superpowers :)

@ben-clayton

though it would still introduce one additional fiber swap (ticket.done() -> schedule(onCall) -> schedule(new work))

You only get a fiber swap whenever a task cannot progress further. So in this example, you'd only get a fiber swap if the caller of ticket.done() went on to do a blocking call. If the caller task simply returned, the schedule(onCall) should be executed like any other scheduled task. However, there's clearly overhead in the scheduling and invocation of tasks, even without fiber swapping, which could be avoided entirely.

Digging deeper it seems like at least if the ticket.done() returns immediately the onCall may use the same fiber, and if the onCall returns immediately after scheduling more work it'll allow the same.

Correct - it may. It may end up on any worker thread, and it may be stolen by another idle worker.

Maybe one thing that falls out of this discussion is some notes on onCall usage in ticket.h? I had skipped over it initially thinking it was a function for internal use and missed its awesome superpowers :)

Good idea. I'll add some more documentation.

benvanik commented Jul 7, 2020

Noodling more on what spawned my original question (trying to map an existing task graph into marl); this may be a situation where I'm trying to hold it wrong. I basically have a semi-declarative graph I want to traverse and enqueue work from for marl to chew on, and I want to be able to efficiently transfer the data flow modeled in that graph to marl with minimum overhead and maximum freedom for marl to schedule things aggressively. If it helps ground things, I'm basically implementing Vulkan's execution model :)

In one model (I think what I was trying at first) I'd walk the entire graph and create all the tasks then hand them to marl and have it run those tasks as they were able. For that it'd need to be possible to create tasks that were initially suspended (like creating threads initially suspended in pthreads/win32) for the reasons we talked about above. The nice thing is that the code to go from the representation I have to tasks is straightforward and well-scoped (walk graph, enqueue tasks, return). The bad thing is that (as things are today) it may oversubscribe resources like fibers unless this was handled by the scheduler.

An alternative model (the one using onCall) would prevent the oversubscription but would fragment the code that processes the graph. Each onCall would have to keep track of where it was in the graph and which nodes should be traversed next, and each onCall would be performing that traversal work (less ideal for cache locality, the original representation must stay live or live-ish for the duration of all work, etc).
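
For example, something roughly like this (Node, work, successors, and pending are made-up names, not marl API, and it uses marl::schedule rather than onCall, but the bookkeeping would be the same): each node carries a count of unfinished predecessors, and whichever predecessor finishes last schedules it.

struct Node {
    std::function<void()> work;
    std::vector<Node*> successors;
    std::atomic<int> pending{0};  // unfinished predecessors
};

void runNode(Node* node) {
    marl::schedule([node] {
        node->work();
        // The original graph representation has to stay live until every
        // node has run, since this lambda walks it.
        for (Node* next : node->successors) {
            if (--next->pending == 0) {
                runNode(next);
            }
        }
    });
}

// Roots (pending == 0) get kicked off directly with runNode(root).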

I think both can implement the same algorithm but the former feels more natural if starting with a representation that already has some tasks and dependencies defined. Maybe there's a data structure similar to Ticket/Queue that would allow this to be written like the former but implemented effectively as the latter. Fun stuff to think about :)

@ben-clayton

So, based on what you've described - does this sort of fit the bill? #167

benvanik commented Jul 8, 2020

Wow, that may be precisely what's needed. That allows nice localized DAG construction (can be done from arbitrary threads/fibers - possibly with multiple DAGs being produced in parallel) and execution (run a sequence of DAGs when ready from a serialized ticket queue, for example), which makes modeling things like Vulkan/Metal queues and CUDA streams really natural. Since it's just built from marl primitives it allows a nice mix of DAG and non-DAG work (such as parallel fan-out from within DAG work) and DAGs to be chained (to mimic the half-static/half-dynamic behavior of queuing systems that may be producing DAG fragments instead of a global DAG of all work). Running each DAG as a fiber helps limit the explosion of pinned fibers to tasks. It'll also make attributing time (via tracing/profiling) to a DAG much easier as there's a place to put begin/end markers.

Very cool - and a good demonstration of how to layer things together with marl primitives/fibers. If anything I think for people coming into marl it's a useful higher-level data structure that shows how the system can be used. I love it :)

For my particular uses (emulating command queues/command buffers) this fits nicely. I can even construct the DAG while recording the command buffer (instead of needing to record the command buffer and then run over it all again to produce the schedule), which is a nice elision of work. Submitting/queue management becomes just ordering the dag->run() calls, and if I wanted to be able to reuse command buffers I could just call dag->run() multiple times.

@ben-clayton

Great to hear. I'll get the PR ready for review. Please feel free to nit-pick if there's stuff you'd like changed.

@ben-clayton

#167 has now landed. Is there anything else you need here, Ben?

@ben-clayton

Assuming not. Closing for now.
