+/ adverb performance #86

Closed
bakul opened this Issue Apr 14, 2011 · 11 comments

Projects

None yet

3 participants

@bakul
Contributor
bakul commented Apr 14, 2011

\t +/!10000000

is about 200+ times slower than k3.

@kevinlawler
Owner

One optimization would be to detect if the item behind the adverb (p-1) is a verb, and then apply the verb directly to the elements without recursive calls to dv_ex, perhaps using temporary K object wrappers local to the function.

It makes me wonder if type checking shouldn't have been abstracted from the verb implementations.

Another possibility would be to special case some of the verbs. For plus over there is a clear detection method and optimal reimplementation. Identify the pairing, then run a tight C loop. I would be interested to see a performance comparison with the first method, since I am not yet convinced this is the right way to do things.

The special case implementation is easily vectorized or multithreaded. I believe that may also be the case for the first more general solution, at least for the subset of verbs that cannot modify the K-Tree, and hence are thread safe.

Profiler output would be helpful.

It may take several attempts to get this right.

@bakul
Contributor
bakul commented Apr 15, 2011

I think you want to preprocess everything so that by the time you do the tight C loop, no checking or verb selection is left. Special casing is fine IMHO. Must make use of whatever information you have available when you parse/analyze the code. You have to arrange things so that that following will execute!

I *ip, *ep, i;
for (r = 0, ip = kI(x), ep = ip + xn; ip < ep; ip++) r += *ip;

Next best might be something like

f = op->code; r = op->zero;
for (ip = KI(x); ep = ip + xn; ip < ep; ip++) r = f(r, *ip);

if you can't decide what operator is being executed at compile time. [I don't know your code so this is just to sketch the idea]

Basically you want to map all iterative ops to the tightest possible loops. Sometime it may be possible to figure out the right thing to do on the first call (after which you switch to a TCL (tight C loop)!). "Just in time" optimization?:-)

It might be worth building a prototype with 3 verbs, 3 datatypes and 2 adverbs to experiment with all this. If you get scan done, each, eachprior, over are trivial.

@silentbicycle
Collaborator

Yes, some special-casing is necessary. Also, many dyads run under / are trivially data-parallel.

Making a prototype with just a few verbs, datatypes, and adverbs is exactly what I plan on doing when I have time - I have a couple ideas for cross-cutting optimizations that would be too much work to try w/ the existing codebase.

When I profile that specific example, it seems to spend 90+% of the time in system calls related to mmap. Obviously, not unnecessarily allocating the gigantic temporary vector would bring a great speed improvement.

@kevinlawler
Owner

Good discussion going. This and function reuse are the two most important unimplemented optimizations at this stage.

I tried summing a smaller vector. On my system +/!1e6 executes in 600ms. The K4 demo sums it in 6ms. So there is a lot of dead wood in the way.

This test suggests the time is dominated by function overhead in the adverb call:

  \t a:!1000000
10
  \t +/a
591

Again, it may take a few attempts to get this right. I expect the process will consist of generating a few solutions, thinking about them for a while, and then committing what eventually comes from that.

@silentbicycle
Collaborator

Indeed. I'm working on separating k.c now (and untangling some inter-dependencies), which will make reworking parts of the core system a bit easier down the line.

@bakul
Contributor
bakul commented Apr 15, 2011

To see what you are up against: On my machine k3 takes about 3.2 nanoseconds per iteration for +/v, where v:!1000000. C code to do the same takes about 6.2ns without optimization and 3.2ns with! k3 takes about 100 times longer when {x+y} is used in placed of +.

v:!1000000
\t do[1000;+/v]
\t do[10;{x+y}/v]
@silentbicycle
Collaborator

I have no illusions about it being easy. :)

@kevinlawler
Owner

I tried a few things. These timings are for summing a million 64-bit integers.

The current timing is about

0.600s

Changing the existing code to apply verbs (p-1) directly won't get the time down past

0.100s

+/a in q takes

0.010s

A tight loop in C sums a million integers in

0.002s
@kevinlawler
Owner

We've been talking about building a dispatch table for verbs. The dispatch table would help us select the relevant part of the verb and apply only that part.

@bakul
Contributor
bakul commented Apr 20, 2011

I did a quick experiment using a similar idea and was able to get close to C speeds (see below). A switch statement may be quicker (avoids table lookup). On the other hand a dispatch table can default to the general case and special cases can be plugged in easily. of course, this is just some toy code! But I think a simple analyses phase will allow partial evaluation, compile time selection of function to call, etc.

V over(V f, V x) // V is a ptr to a K like structure.
{
    switch (x->t) {
    case -ti: /* int vector */
        switch (opcode(f)) {
        case o_add: { I ir = 0; DO(x->n, ir += vI(x)[i]); R gi(ir); }
        case o_sub: { I ir = 0; DO(x->n, ir -= vI(x)[i]); R gi(ir); }
        case o_mul: { I ir = 1; DO(x->n, ir *= vI(x)[i]); R gi(ir); }
        ...
        }
    }
}
@silentbicycle
Collaborator

Right, the main point is to move cases for over and scan to their own loops, rather than +/ doing several calls for each value pair, then join them after the fact. Whether it's done via a switch statement, linear scan over an array literal, or something else entirely is an implementation detail. (Adding data parallelism later would also be good.)

I will build up a tiny interpreter (probably just +, *, and a few adverbs) to try various approaches. Once that's set, I will add functions and proper closures, to work on making function calls much cheaper.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment