mr-j0nes edited this page Oct 9, 2017 · 8 revisions

How to link compute kernels

There are several ways to link compute kernels. We'll start with the simple in-line connection. First we'll need a map object:

raft::map m;

We'll also need some compute kernels:

kernel source;
kernel destination;

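Before going further, it's worth sketching what such a kernel class might look like. The following is a minimal, hedged sketch based on the classic RaftLib kernel interface; the class name, port name, and payload type are illustrative assumptions, not part of the example above:

```cpp
#include <raft>     /** RaftLib umbrella header **/
#include <cstdint>

/**
 * hypothetical single-output source kernel; a kernel
 * derives from raft::kernel, declares its ports in the
 * constructor, and does its work in run()
 */
class my_source : public raft::kernel
{
public:
    my_source() : raft::kernel()
    {
        /** one output port carrying 32-bit ints **/
        output.addPort< std::int32_t >( "output" );
    }

    virtual raft::kstatus run()
    {
        /** push a value downstream **/
        output[ "output" ].push( 42 );
        return( raft::stop ); /** emit once, then stop **/
    }
};
```

Destination kernels look the same, except they declare ports on `input` and `pop` values inside `run()`.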
Before we go over linking, note that within the current implementation both ends of a link are assumed to have the same type. In past versions (and again in future ones) type conversion could take place; it just needs to be made a bit more efficient. Getting to the link connection itself, here are a few variations on a simple sequential link:

/**
 * Assumes source only has a single output
 * port and destination has only a single
 * input port, otherwise an exception will
 * be thrown. 
 */
m += source >> destination;

/**
 * Assumes that source has an output port 
 * named "output" and that destination has
 * a single input port
 */
m += source[ "output" ] >> destination;

/** 
 * Assumes that source has a single output
 * port and that destination has an input
 * port named "input"
 */
m += source >> destination[ "input" ];

/**
 * Of course, both kernels can also have 
 * named ports
 */
m += source[ "output" ] >> destination[ "input" ];

You can also chain these links further. Just remember that the return type of the add-assign operator is a struct with references to the source and the last destination of the chain. It's not always useful, but sometimes this return struct is invaluable.

kernel middle;
m += source >> middle >> destination;

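If you want to hold onto that return struct, you can capture it at the call site. A hedged sketch follows; since the struct's exact type and member names vary by RaftLib version, `auto` is used and nothing version-specific is assumed:

```cpp
kernel source;
kernel middle;
kernel destination;
raft::map m;

/**
 * capture the struct returned by the add-assign
 * operator; it holds references to the head (source)
 * and the tail (destination) of this chain
 */
auto chain( m += source >> middle >> destination );
```

This is mostly useful when you want to keep extending or inspecting a chain you built earlier without re-naming its endpoints.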
Here, source connects to middle, which connects to destination. There are also more complicated ways to connect compute kernels that enable explicit task parallelization (each kernel shown so far is linked and executed in a pipelined parallel fashion). There is a difference between explicit parallelization and auto-parallelization, which I'll note as they come up.

Explicit parallelization means linking a number of kernels together as a single "pipeline" stage within a streaming graph. As an example (pulled from the rbzip2 application in examples/general/rbzip2):

    /** declare kernels **/
    fr_t reader( inputfile,
                 num_threads /** manually set threads for b-marking **/ );
    fw_t writer( outputfile, 
                 num_threads /** manually set threads for b-marking **/ );
    comp c( blocksize, 
            verbosity, 
            workfactor );
    
    /** set-up map **/
    raft::map m;
    
    /** 
     * detect # output ports from reader,
     * duplicate c that #, assign each 
     * output port to the input ports in 
     * writer
     */
    m += reader <= c >= writer;

Let's walk through this example. We start off by instantiating our kernels and our map. The reader object has multiple output ports and the writer multiple input ports (both also extend the raft::parallel_k class, which itself extends raft::kernel). The parallel_k class tells the run-time that ports can safely be added to any kernel that extends it. It also provides a manual kernel-count facility, which is what we've used in this instance since I wanted to benchmark specific numbers of threads.

Moving to the actual linking operation, you should notice that there's a reader, c, and writer, with only a single user-instantiated c kernel. The "<=" operator tells the run-time that we intend to split the stream as wide as we can, and that the user has provided either enough ports (we'll get to this in a second) or a kernel that can safely be duplicated. In this case we've provided c, which can be cloned by the run-time at will. A join is specified by the ">=" operator, which tells the run-time that we intend to take all the kernels we've duplicated and send their output to a single kernel (with multiple input ports). The topology this produces is a simple split/join in which the run-time duplicates the c kernel enough times to match the port counts of the reader and writer kernels.
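For a kernel like c to be duplicated by the run-time, it must be safely clonable. Here's a hedged sketch of what such a kernel might look like; the CLONE() convenience macro and the port/payload choices are assumptions modeled on the RaftLib examples, not the actual rbzip2 source:

```cpp
#include <raft>
#include <cstdint>

/**
 * hypothetical pass-through worker kernel that the
 * run-time is allowed to duplicate at will
 */
class comp : public raft::kernel
{
public:
    comp() : raft::kernel()
    {
        input.addPort<  std::int32_t >( "input"  );
        output.addPort< std::int32_t >( "output" );
    }

    /** RaftLib macro marking this kernel clonable **/
    CLONE();

    virtual raft::kstatus run()
    {
        std::int32_t v;
        input[ "input" ].pop( v );
        /** placeholder work: double the value **/
        output[ "output" ].push( v * 2 );
        return( raft::proceed );
    }
};
```

The key point is that the kernel keeps no shared mutable state, so each clone can run independently on its own stream slice.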

This construct can be taken a step further to chain duplication between the "<=" and ">=" operators such as:

kernel reader, writer, a, b, c;
m += reader <= ( a >> b >> c ) >= writer;

I've added parentheses around the chain for clarity, though you don't have to. The a->b->c chain can be duplicated at will by the run-time and linked to match the number of output and input ports available on the reader and writer kernels. Of course, with all these methods, an exception will be thrown if the port counts on (using the example as a reference) the reader and writer don't match, or if any types are incompatible.

You can also do orderings like this:

/** fake names used for brevity **/
randgen gaussian;
kernel  stoch_function /** 8 input ports **/;
kernel  print;
raft::map m;
/** make 8 copies of gaussian, link them to stoch_function, stream to print **/
m += gaussian >= stoch_function >> print;

You can even go the other direction with 1-N. I've not yet implemented nested split/joins (I still might). I'm debating whether people could easily read such constructs: they might make debugging a bit more difficult, but then again, they could make some codes far shorter.

This process is also rather manual (at least compared to the run-time parallelization that is available). There is a raft::order::out ordering keyword which tells the run-time that ordering doesn't matter for a particular link.

NOTE: raft::order::out currently doesn't work for the stream-operator syntax, but will shortly; in the interim the old long-form syntax does work. I'll update the docs when this is fixed.

Here's a more concrete example:

kernel a,b;
raft::map m;
m += a >> raft::order::out >> b;

In this case, the run-time still assumes it can't safely do anything. I'm working on a nice way to check for open file handles per thread, but currently this wouldn't be safe unless the programmer were very careful, so we need either some more user input or more out-of-order designators:

kernel a,b,c;
raft::map m;
m += a >> raft::order::out >> b;
m += b >> raft::order::out >> c;

This means that the run-time can duplicate b as much as it wants. If kernels a and c don't extend the parallel_k sub-kernel (which marks kernels that can handle dynamically adding more ports), the run-time will insert a split kernel after a and a join kernel before c, so that b sits between two kernels that can handle dynamic port addition and removal.