Florian Biermann edited this page Nov 2, 2016 · 6 revisions

Quad Rope Development Journal

This is a central place to keep track of things that I come across while implementing quad ropes.

2016-11-02

Sparseness and flattening seem to go hand in hand. Sparse quad ropes must not allocate an array for the entire size, so we need one variant of each function that preserves structure when the quad rope contains sparse elements, so as to maintain the memory benefits; and one variant that does not preserve structure (or allocates a new underlying array), so as to maintain fast performance.

It therefore seems impossible to implement a sparse quad rope variant and a flattening variant completely independently of each other. However, it would be possible to first implement a quad rope variant with two versions of each function: one that allocates a new underlying array for the entire size and one that does not.
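
As a sketch, a structure-preserving map over a hypothetical sparse variant might look like the following. The constructors Leaf, Sparse and Node here are illustrative assumptions, not the repository's actual definition:

```fsharp
/// A minimal sketch of a quad rope type with a sparse constructor.
type 'a QuadRope =
    | Empty
    | Leaf of 'a [,]
    | Sparse of int * int * 'a  // height, width and a single repeated value
    | Node of 'a QuadRope * 'a QuadRope * 'a QuadRope * 'a QuadRope

/// Structure-preserving map: applies f only once per sparse node and
/// never allocates an array for the entire size.
let rec mapSparse f qr =
    match qr with
    | Empty -> Empty
    | Leaf vs -> Leaf (Array2D.map f vs)
    | Sparse (h, w, v) -> Sparse (h, w, f v)
    | Node (ne, nw, sw, se) ->
        Node (mapSparse f ne, mapSparse f nw, mapSparse f sw, mapSparse f se)
```

A flattening variant would instead materialize the rope into one backing array first, trading the memory benefits of sparseness for iteration speed.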

2016-10-21

Flattening quad ropes seems to have surprisingly little benefit. Maybe the implementation is too high-level. In any case, there seems to be no immediate performance improvement from flattening quad ropes into single arrays during, e.g., map. Index-based tiling seems ineffective.

One possible reason is that tiling is currently implemented by slicing, which allocates new memory on the heap at each recursive step. Solutions that do not allocate new memory, however, do not perform well either; in the current implementation, the loop is not very tight and involves some extra arithmetic.

Other functions on ArraySlice also involve additional offset arithmetic, though, so I doubt that this is the root cause. The function ArraySlice.fastGet is inlined and performs offset arithmetic on integers, which should be fast enough.
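
For reference, the rough shape of such an inlined accessor; the record and its field names are assumptions for illustration, not the repository's actual definition:

```fsharp
// Assumed view type; only the shape matters here.
type 'a ArraySlice =
    { offX : int; offY : int; h : int; w : int; arr : 'a [,] }

module ArraySlice =
    /// Inlined, unchecked access: after inlining, only two integer
    /// additions and an array load remain.
    let inline fastGet (slc : 'a ArraySlice) i j =
        slc.arr.[slc.offX + i, slc.offY + j]
```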

2016-08-23

Update: After increasing s_max to 32, quad ropes perform convincingly:

C:\Users\fbie\src\rad-trees>benchmark -t 16
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.6.2 or later
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date        2016-08-23T08:00:24
# Threads     16
QuadRope.init                   4450000,0 ns   39528,47         64
QuadRope.map                    4578125,0 ns   67507,72         64
QuadRope.reduce                 2258593,8 ns   99262,83        128
QuadRope.zip                    5629687,5 ns   99382,32         64
QuadRope.hfold                  3516406,3 ns   46504,61        128
QuadRope.vfold                  3571875,0 ns   45673,25        128
QuadRope.hreduce                3134375,0 ns   71583,80        128
QuadRope.vreduce                3153906,3 ns   71227,65        128
Array2D.init                    3205468,8 ns   26054,68        128
Array2D.map                     4062500,0 ns   41666,67         64
Array2D.reduce                  3885937,5 ns   19557,27         64
Array2D.zip                     4901562,5 ns   89319,50         64
Array2D.hfold                   4076562,5 ns  399479,51         64
Array2D.vfold                   4639062,5 ns  667416,98         64
Array2D.hreduce                 1975000,0 ns   10925,09        128
Array2D.vreduce                 2143750,0 ns   11170,63        128

C:\Users\fbie\src\rad-trees>benchmark -t 16 -m mmult -s 20
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.6.2 or later
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date        2016-08-23T08:02:20
# Threads     16
mmult Array2D                    912695,3 ns    2069,04        512
mmult Array2D.Parallel         39325000,0 ns 6040718,32          8
mmult QuadRope                   766406,3 ns    1647,02        512
mmult QuadRope.Parallel          208007,8 ns    2604,17       2048

Performance on the p3 server with 16 cores is not convincing. It seems that instantiating objects is much more expensive there, because parallel reduce on quad ropes, where we do not re-build the quad rope tree structure, is even slightly faster than on arrays.

C:\Users\fbie\src\rad-trees>benchmark -t 1
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.6.2 or later
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date        2016-08-23T06:33:36
# Threads     1
QuadRope.init                  22112500,0 ns   64549,72         16
QuadRope.map                   40987500,0 ns  231615,70          8
QuadRope.reduce                35550000,0 ns   64549,72          8
QuadRope.zip                   57262500,0 ns  216105,04          8
QuadRope.hfold                 42200000,0 ns   64549,72          8
QuadRope.vfold                 41362500,0 ns   92233,10          8
QuadRope.hreduce               55675000,0 ns   64549,72          8
QuadRope.vreduce               57350000,0 ns  184466,20          8
Array2D.init                   21718750,0 ns  370634,31         16
Array2D.map                    38387500,0 ns   92233,10          8
Array2D.reduce                 33050000,0 ns   64549,72          8
Array2D.zip                    53725000,0 ns   79056,94          8
Array2D.hfold                  33250000,0 ns       0,00          8
Array2D.vfold                  34487500,0 ns  319776,40          8
Array2D.hreduce                33312500,0 ns  135015,43          8
Array2D.vreduce                36575000,0 ns   64549,72          8

C:\Users\fbie\src\rad-trees>benchmark -t 16
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.6.2 or later
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 63 Stepping 2, GenuineIntel; 32 "cores"
# Date        2016-08-23T06:35:18
# Threads     16
QuadRope.init                   7393750,0 ns   69409,71         64
QuadRope.map                    7796875,0 ns  433576,15         32
QuadRope.reduce                 3323437,5 ns   36680,87        128
QuadRope.zip                    8209375,0 ns  160057,77         32
QuadRope.hfold                  4376562,5 ns   68880,09         64
QuadRope.vfold                  4439062,5 ns   70821,84         64
QuadRope.hreduce                7226562,5 ns   92950,88         64
QuadRope.vreduce                7140625,0 ns  109746,39         64
Array2D.init                    3428125,0 ns   46116,55        128
Array2D.map                     4240625,0 ns   56192,10         64
Array2D.reduce                  3836718,8 ns   80303,82        128
Array2D.zip                     4953125,0 ns   27559,91         64
Array2D.hfold                   5259375,0 ns  360783,77         64
Array2D.vfold                   4757812,5 ns  293266,63         64
Array2D.hreduce                 1925390,6 ns    8528,40        256
Array2D.vreduce                 2110156,3 ns   19295,45        128

2016-08-08

Improving balancing performance?

Computing a van der Corput sequence on quad ropes is much slower than on arrays, for unknown reasons. One possible culprit is balancing, so this is an attempt at improving the balancing algorithm:

/// Insert a quad rope in a list of quad ropes. Adjacent quad ropes
/// are merged using <code>merge</code> if their depths are equal. If
/// a merge is triggered, the function recurses on the list until
/// either the merge condition is not met anymore or there are no quad
/// ropes left to merge with.
let rec insert merge x xs =
    match xs with
        | [] -> [x]
        | y :: xs when depth x = depth y -> insert merge (merge y x) xs
        | _ -> x :: xs

/// Merge all quad ropes in the list using <code>merge</code>,
/// assuming that logical order is reversed, i.e. the west-most node
/// is at head.
let mergeAll merge xs =
    match xs with
        | [] -> invalidArg "xs" "List of nodes cannot be empty."
        | [x] -> x
        | _ -> List.reduce (fun r l -> merge l r) xs

/// Balance rope horizontally.
let hbalance rope =
    let rec hbalance rope =
        if isBalancedH rope then
            rope
        else
            mergeAll flatNode (collect rope [])
    and collect rope rs =
        match rope with
            | _ when isBalancedH rope -> insert flatNode rope rs
            | Empty -> rs
            | Node (_, _, _, ne, nw, Empty, Empty) -> collect ne (collect nw rs)
            | Node (_, _, _, ne, nw, sw, se) ->
                insert flatNode (node (hbalance ne) (hbalance nw) (hbalance sw) (hbalance se)) rs
            | _ -> insert flatNode rope rs
    hbalance rope

/// Balance rope vertically.
let vbalance rope =
    let rec vbalance rope =
        if isBalancedV rope then
            rope
        else
            mergeAll thinNode (collect rope [])
    and collect rope rs  =
        match rope with
            | _ when isBalancedV rope -> insert thinNode rope rs
            | Empty -> rs
            | Node (_, _, _, Empty, nw, sw, Empty) -> collect sw (collect nw rs)
            | Node (_, _, _, ne, nw, sw, se) ->
                insert thinNode (node (vbalance ne) (vbalance nw) (vbalance sw) (vbalance se)) rs
            | _ -> insert thinNode rope rs
    vbalance rope

This algorithm is much closer to Boehm et al.'s variant, which means that the resulting ropes are less compact. Even less desirable, the performance is slightly worse. We therefore return to the balancing variant already implemented and conclude that its performance is sufficient.

>benchmark -m vdc -s 15
# ...
vdc Array2D                     1644140,6 ns   26329,79        256
vdc Array2D.Parallel            2775000,0 ns   58787,30        128
vdc QuadRope                    2033593,8 ns   39330,68        128
vdc QuadRope.Parallel           2659375,0 ns   83268,20        128

Hence, the root cause for the slow performance of the vdc benchmark must be located elsewhere.

2016-08-03

Array slices to the rescue

It is straightforward to avoid the copying overhead of array slicing notation by (re-)introducing views on arrays. We call those 'a ArraySlice. Reallocation happens only if we absolutely must create a new array, because of, for instance, a call to map.
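
A sketch of such a view, with assumed field names; slicing merely adjusts offsets and extents instead of copying:

```fsharp
// Field names are assumptions for illustration.
type 'a ArraySlice =
    { offX : int; offY : int; h : int; w : int; arr : 'a [,] }

module ArraySlice =
    /// O(1) slicing: re-aim the view instead of allocating a new array.
    let slice i j h w slc =
        { slc with
            offX = slc.offX + i
            offY = slc.offY + j
            h = max 0 (min h (slc.h - i))
            w = max 0 (min w (slc.w - j)) }

    /// Only operations like map must allocate a fresh backing array.
    let map f slc =
        { offX = 0; offY = 0; h = slc.h; w = slc.w
          arr = Array2D.init slc.h slc.w
                    (fun i j -> f slc.arr.[slc.offX + i, slc.offY + j]) }
```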

Most operations are still slower than on arrays. However, of the four simple algorithms we use as examples, i.e. van der Corput sequences, Fibonacci sequences, matrix multiplication and prime factorization, quad ropes now outperform arrays in three, albeit only by a small margin for matrix multiplication.

In particular, the task-based approach of quad ropes deals much better with nested parallelism than default arrays do.

Current state of performance

>benchmark
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-08-03T09:25:52
QuadRope.init                  26400000,0 ns  356243,91         16
QuadRope.map                   47187500,0 ns  575090,57          8
QuadRope.reduce                41000000,0 ns  527046,28          8
QuadRope.zip                   66675000,0 ns 1598827,70          4
QuadRope.hfold                 47162500,0 ns  565347,73          8
QuadRope.vfold                 49800000,0 ns  687689,37          8
QuadRope.hreduce               61537500,0 ns 1078595,41          8
QuadRope.vreduce               64875000,0 ns 1029090,75          4
Array2D.init                   25462500,0 ns  218898,76         16
Array2D.map                    41862500,0 ns  418537,47          8
Array2D.reduce                 38762500,0 ns  748725,77          8
Array2D.zip                    61462500,0 ns  944740,56          8
Array2D.hfold                  38687500,0 ns 1227930,17          8
Array2D.vfold                  48937500,0 ns 1532347,96          8
Array2D.hreduce                39237500,0 ns  365386,02          8
Array2D.vreduce                49762500,0 ns 1168822,41          8

>benchmark -t 4
QuadRope.init                  14912500,0 ns  795876,96         32
QuadRope.map                   18950000,0 ns  925994,21         16
QuadRope.reduce                13771875,0 ns  907033,90         32
QuadRope.zip                   25037500,0 ns 1362697,49         16
QuadRope.hfold                 20081250,0 ns 1769742,31         16
QuadRope.vfold                 18631250,0 ns 1586468,93         16
QuadRope.hreduce               18168750,0 ns 1508094,48         16
QuadRope.vreduce               17637500,0 ns 1415685,94         16
Array2D.init                   12218750,0 ns  323420,31         32
Array2D.map                    16375000,0 ns  485912,66         16
Array2D.reduce                 10546875,0 ns  166177,67         32
Array2D.zip                    21325000,0 ns  263193,53         16
Array2D.hfold                  11625000,0 ns  117851,13         32
Array2D.vfold                  14056250,0 ns  232364,06         32
Array2D.hreduce                10487500,0 ns   95697,37         32
Array2D.vreduce                13262500,0 ns  233946,37         32

>benchmark -m mmult -s 20
mmult Array2D                   1048046,9 ns   23441,12        256
mmult Array2D.Parallel        111525000,0 ns  861603,80          4
mmult QuadRope                  1291015,6 ns   17876,99        256
mmult QuadRope.Parallel          875976,6 ns   13818,35        512

>benchmark -m primes -s 100
primes Array2D.Parallel       332950000,0 ns 19254797,38         2
primes QuadRope.Parallel       64350000,0 ns  2285825,89         4

>benchmark -m fibseq -s 1000
fibseq Array2D                  5031250,0 ns   79330,97         64
fibseq QuadRope                 1242968,8 ns   16654,46        256

>benchmark -m vdc -s 15
vdc Array2D                     1612500,0 ns   15494,24        256
vdc Array2D.Parallel            2855468,8 ns  127164,95        128
vdc QuadRope                    1997656,3 ns   35905,41        128
vdc QuadRope.Parallel           2632812,5 ns   77602,42        128

2016-08-01

Speeding up init

Allocating many small arrays is much slower than allocating a single large array. Chopping this array into smaller pieces, however, seems to be acceptably fast. I guess the compiler performs some optimizations in the background, which I will need to investigate further to be able to explain this speedup.

Anyway, the improvement is clearly visible:

small arrays seq   63200000,0 ns 1116791,04          4
large array  seq   38337500,0 ns 1044246,80          8

small arrays par   40587500,0 ns 2575963,95          8
large array  par   29287500,0 ns 4054875,32         16

The relative speedup from parallelizing the operations, however, is much smaller.

Speeding up reallocate

As a consequence, we can drastically improve the performance of reallocate. We can implement init in terms of fromArray2D, and then we can implement reallocate as toArray2D >> fromArray2D, which yields a more than four-fold improvement.
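
In code, the composition described above is simply the following (assuming the existing toArray2D and fromArray2D functions):

```fsharp
/// Write the rope into one large array, then chop it up again.
let reallocate qr = (toArray2D >> fromArray2D) qr

/// init can reuse the same fast bulk allocation.
let init h w f = fromArray2D (Array2D.init h w f)
```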

Old:

QuadRope.reallocate           263800000,0 ns 4151305,02          2

New:

QuadRope.reallocate            62100000,0 ns 2125245,08          4

2016-07-29

(Note: shame on me for not keeping this updated.)

Current Performance Results

E:\rad-trees>benchmark
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:04:51
QuadRope.init                  61237500,0 ns  693346,35          8
QuadRope.map                   44725000,0 ns  901002,53          8
QuadRope.reduce                40725000,0 ns  299304,75          8
QuadRope.zip                   65200000,0 ns 1046156,99          4
QuadRope.hfold                 50687500,0 ns  497389,02          8
QuadRope.vfold                 48237500,0 ns  602223,89          8
QuadRope.hreduce               63425000,0 ns  667187,30          4
QuadRope.vreduce               64800000,0 ns 1005540,21          4
Array2D.init                   23868750,0 ns  295289,99         16
Array2D.map                    40362500,0 ns  557306,27          8
Array2D.reduce                 39375000,0 ns  403973,32          8
Array2D.zip                    62462500,0 ns  759317,13          8
Array2D.hfold                  38750000,0 ns  549621,08          8
Array2D.vfold                  50587500,0 ns  704770,45          8
Array2D.hreduce                39362500,0 ns  480342,76          8
Array2D.vreduce                49037500,0 ns  870125,31          8

E:\rad-trees>benchmark -t 4
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:06:43
# Threads     4
QuadRope.init                  39712500,0 ns 3765269,84          8
QuadRope.map                   20193750,0 ns 1672762,29         16
QuadRope.reduce                12987500,0 ns  689737,52         32
QuadRope.zip                   27518750,0 ns 1924136,26         16
QuadRope.hfold                 19450000,0 ns 1595838,77         16
QuadRope.vfold                 19587500,0 ns 1941237,43         16
QuadRope.hreduce               17243750,0 ns 1484424,34         16
QuadRope.vreduce               17493750,0 ns 1381125,87         16
Array2D.init                   17356250,0 ns 1173732,32         16
Array2D.map                    20156250,0 ns  402088,73         16
Array2D.reduce                 10746875,0 ns   51979,06         32
Array2D.zip                    27625000,0 ns 1497973,17         16
Array2D.hfold                  12075000,0 ns  262367,69         32
Array2D.vfold                  14993750,0 ns  902686,96         32
Array2D.hreduce                11659375,0 ns  748472,11         32
Array2D.vreduce                13896875,0 ns  310427,15         32

Looking at the benchmark results, quad ropes are now quite competitive with 2D arrays. However, init is a weak spot: it is roughly three times slower than on 2D arrays. It seems that allocating many small arrays, in addition to the tree structure, is much more costly than allocating a single large array. If we remove the actual allocation from the leaves and only execute the initialization function appropriately many times, performance improves by about a third.

What is interesting about this is that map, which in principle performs the same operation, is just as fast as on 2D arrays, so there might be something going on under the hood that I am not aware of.

Unfortunately, in most "real-life" benchmarks, quad ropes still can't keep up:

E:\rad-trees>benchmark -m vdc -s 20
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:10:10
vdc Array2D                    64075000,0 ns 1986237,37          4
vdc Array2D.Parallel         1916850000,0 ns 31358190,49          2
vdc QuadRope                  156550000,0 ns 13573769,64          2
vdc QuadRope.Parallel         183350000,0 ns 13725584,38          2

E:\rad-trees>benchmark -m mmult -s 100
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:12:42
mmult Array2D                  95125000,0 ns 1008643,20          4
mmult QuadRope                295350000,0 ns 5467327,20          2
mmult QuadRope.Parallel       139000000,0 ns 11628509,03          2

E:\rad-trees>benchmark -m fibseq -s 1000
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:18:02
fibseq Array2D                  5154687,5 ns   54848,47         64
fibseq QuadRope                 1440625,0 ns   11617,04        256

E:\rad-trees>benchmark -m primes -s 100
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:23:48
primes Array2D.Parallel       300450000,0 ns 23147414,06          2
primes QuadRope.Parallel       66300000,0 ns 2213594,36          4

Laziness Experiments

Since init seems to be so slow, it may be an interesting idea to defer leaf initialization and to apply fusion to the quad rope leaves, so that we save as many intermediate materializations as possible. Such an experiment is implemented on fbie/lazy-leaves.

Every leaf is a record of two starting indices (for efficient slicing), two size values, an initialization function and a lazy array, which is initialized using the initialization function when forced:

type 'a LazyLeaf =
    { i : int
      j : int
      h : int
      w : int
      f : int -> int -> 'a
      arr : 'a [,] Lazy }

The idea is that function calls on a quad rope produce a new quad rope whose leaves are lazily initialized. This immediately gives us some parallelism for evaluating all leaves and also avoids materializing all leaves if we only read from part of the quad rope.

The overhead of this seems rather significant, however. These benchmarks run on a quad rope with already materialized arrays:

E:\rad-trees>benchmark
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T09:49:37
QuadRope.init                  64200000,0 ns 1432751,82          4
QuadRope.map                   62900000,0 ns  561496,02          8
QuadRope.reduce                41537500,0 ns  404188,14          8
QuadRope.zip                  105100000,0 ns  603232,04          4
QuadRope.hfold                 73075000,0 ns 1236313,97          4
QuadRope.vfold                 74150000,0 ns 2986543,90          4
QuadRope.hreduce               86175000,0 ns 1080444,87          4
QuadRope.vreduce               85325000,0 ns 1086853,26          4
Array2D.init                   25612500,0 ns  261539,25         16
Array2D.map                    42437500,0 ns  598754,49          8
Array2D.reduce                 39875000,0 ns  475073,09          8
Array2D.zip                    63375000,0 ns  944648,67          4
Array2D.hfold                  37987500,0 ns  578701,85          8
Array2D.vfold                  48600000,0 ns  772352,11          8
Array2D.hreduce                39600000,0 ns  411636,30          8
Array2D.vreduce                48937500,0 ns  453573,77          8

E:\rad-trees>benchmark -t 4
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T09:51:12
# Threads     4
QuadRope.init                  26087500,0 ns 1210558,30         16
QuadRope.map                   28243750,0 ns 2604325,00         16
QuadRope.reduce                13278125,0 ns  885655,61         32
QuadRope.zip                   40850000,0 ns 2799925,59          8
QuadRope.hfold                 28418750,0 ns 1999576,78         16
QuadRope.vfold                 27668750,0 ns 1166685,27         16
QuadRope.hreduce               27325000,0 ns 1415992,49         16
QuadRope.vreduce               26468750,0 ns 1734116,91         16
Array2D.init                   16059375,0 ns  637964,50         32
Array2D.map                    20387500,0 ns  798109,75         16
Array2D.reduce                 11171875,0 ns  710054,59         32
Array2D.zip                    25412500,0 ns  537806,76         16
Array2D.hfold                  12100000,0 ns  790130,09         32
Array2D.vfold                  15415625,0 ns  500444,68         32
Array2D.hreduce                10756250,0 ns   50603,99         32
Array2D.vreduce                14109375,0 ns  551640,94         32

Also, the more complex benchmarks, which sparked this idea in the first place, do not perform faster:

E:\rad-trees>benchmark -m vdc -s 20
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T09:32:05
vdc Array2D                    63125000,0 ns 2144922,95          4
vdc Array2D.Parallel         1919200000,0 ns 40053575,23          2
vdc QuadRope                  748350000,0 ns 11671546,60          2
vdc QuadRope.Parallel         534650000,0 ns 17496110,68          2

E:\rad-trees>benchmark -m mmult -s 100
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T09:15:35
mmult Array2D                  98650000,0 ns 1297433,36          4
mmult QuadRope                336200000,0 ns 5638163,61          2
mmult QuadRope.Parallel       143650000,0 ns 12563284,25          2

E:\rad-trees>benchmark -m fibseq -s 1000
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T09:16:15
fibseq Array2D                  5275000,0 ns  195183,99         64
fibseq QuadRope                 1659375,0 ns   17056,79        256

E:\rad-trees>benchmark -m primes -s 100
# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 4 "cores"
# Date        2016-07-29T10:36:01
primes Array2D.Parallel       305250000,0 ns 25173012,44          2
primes QuadRope.Parallel      289950000,0 ns 31296476,55          2

Therefore, I won't continue developing lazy leaves.

Nevertheless, we need a faster initialization function. Parallel initialization does not quite cut it either; parallel array initialization is still faster.

2016-04-14

Scan

Scan was broken because we used inclusive scan for substructures, which led to value duplication. We can implement inclusive scan by simply concatenating the result of the exclusive scan to the initial state, but that is slightly inefficient.
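
The relationship between the two scan flavors is easiest to see in one dimension. F#'s built-in List.scan already prepends the initial state, so both fall out of it; this is a 1D illustration only, not the quad rope code:

```fsharp
// List.scan prepends the initial state to all prefix results.
let full = List.scan (+) 0 [1; 2; 3]   // [0; 1; 3; 6]
// Exclusive scan: drop the grand total at the end.
let exclusive = List.take 3 full       // [0; 1; 3]
// Inclusive scan: drop the initial state at the front.
let inclusive = List.skip 1 full       // [1; 3; 6]
```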

More surprisingly, vscan seems to be quite a bit faster on quad ropes than on 2D arrays:

# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date        2016-04-14T15:24:17
...
QuadRope.hscan                124700000,0 ns 2931628,29          4
Array2D.hscan                  97975000,0 ns 4301889,12          4
QuadRope.vscan                102325000,0 ns 1258581,65          4
Array2D.vscan                 151950000,0 ns 4591356,61          2
...

2016-04-13

Inlining

Based on some benchmarking, it seems that we should go forward with inlining all utility functions. Inlining functions on quad ropes themselves has a slightly negative effect and also forces us to make public some functions that had better remain private.

Some inlining:

# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date        2016-04-13T11:37:14
QuadRope.init                  61475000,0 ns  758745,31          8
Array2D.init                   24381250,0 ns  280206,40         16
QuadRope.map                   46237500,0 ns  505009,63          8
Array2D.map                    43312500,0 ns  970484,56          8
QuadRope.hfold                147400000,0 ns 3016620,63          2
Array2D.hfold                 136600000,0 ns 3373096,17          2
QuadRope.vfold                141200000,0 ns 1766981,10          2
Array2D.vfold                 126400000,0 ns 1663329,99          2
QuadRope.hreduce              161650000,0 ns 3223610,81          2
Array2D.hreduce               139600000,0 ns 2401388,49          2
QuadRope.vreduce              151800000,0 ns 3544949,46          2
Array2D.vreduce               155900000,0 ns 1197219,00          2
QuadRope.hcat                       131,2 ns       2,01    2097152
QuadRope.hcat + hbalance         496093,8 ns    7322,41       1024
QuadRope.hcat + reallocate     955150000,0 ns 10729114,08          2
Array2D.hcat                   20362500,0 ns 1483649,31         16
QuadRope.vcat                       128,7 ns       1,56    2097152
QuadRope.vcat + vbalance         452441,4 ns    6890,75       1024
QuadRope.vcat + reallocate     964350000,0 ns 13922503,77          2
Array2D.vcat                   19362500,0 ns  619279,38         16
QuadRope.index                      452,7 ns      10,21    1048576
Array2D.index                         6,1 ns       0,05   67108864
QuadRope.set                       4306,0 ns     102,43      65536
Array2D.set                     9293750,0 ns  223023,73         32

All utility functions inlined:

# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date        2016-04-13T11:42:11
QuadRope.init                  58125000,0 ns  677003,20          8
Array2D.init                   22337500,0 ns  222829,03         16
QuadRope.map                   43712500,0 ns 1012165,58          8
Array2D.map                    41112500,0 ns  757944,04          8
QuadRope.hfold                148250000,0 ns 4547587,88          2
Array2D.hfold                 136800000,0 ns 4263540,52          2
QuadRope.vfold                138950000,0 ns 1964264,07          2
Array2D.vfold                 125225000,0 ns  982132,03          4
QuadRope.hreduce              157800000,0 ns 3029484,74          2
Array2D.hreduce               124650000,0 ns 1550985,35          4
QuadRope.vreduce              148200000,0 ns 1960725,49          2
Array2D.vreduce               134300000,0 ns 2679137,51          2
QuadRope.hcat                       156,1 ns       2,27    2097152
QuadRope.hcat + hbalance         488183,6 ns    3780,81       1024
QuadRope.hcat + reallocate     912700000,0 ns 5452828,01          2
Array2D.hcat                   18850000,0 ns  508777,13         16
QuadRope.vcat                       160,7 ns       2,44    2097152
QuadRope.vcat + vbalance         443066,4 ns    2394,29       1024
QuadRope.vcat + reallocate     928050000,0 ns 14808874,97          2
Array2D.vcat                   18775000,0 ns  313470,71         16
QuadRope.index                      453,1 ns       4,06    1048576
Array2D.index                         6,1 ns       0,07   67108864
QuadRope.set                       4225,2 ns      57,30      65536
Array2D.set                     9240625,0 ns  150987,78         32

Aggressively inlined:

# OS          Microsoft Windows NT 6.2.9200.0
# .NET vers.  4.0.30319.42000
# 64-bit OS   True
# 64-bit proc False
# CPU         Intel64 Family 6 Model 58 Stepping 9, GenuineIntel; 2 "cores"
# Date        2016-04-13T11:31:20
QuadRope.init                  60875000,0 ns 1611589,97          8
Array2D.init                   24068750,0 ns 1654762,17         16
QuadRope.map                   46500000,0 ns 1699673,17          8
Array2D.map                    42175000,0 ns  957789,70          8
QuadRope.hfold                152850000,0 ns 6936417,58          2
Array2D.hfold                 133300000,0 ns 2699794,23          2
QuadRope.vfold                141800000,0 ns 2382808,80          2
Array2D.vfold                 126000000,0 ns 2527625,15          2
QuadRope.hreduce              164550000,0 ns 2948163,27          2
Array2D.hreduce               131350000,0 ns 1453921,90          2
QuadRope.vreduce              156600000,0 ns 1822696,42          2
Array2D.vreduce               135300000,0 ns 3001851,28          2
QuadRope.hcat                       161,7 ns       2,57    2097152
QuadRope.hcat + hbalance         605078,1 ns    5026,11        512
QuadRope.hcat + reallocate    1037400000,0 ns 80268923,00          2
Array2D.hcat                   19975000,0 ns 1108364,66         16
QuadRope.vcat                       170,3 ns       2,26    2097152
QuadRope.vcat + vbalance         875195,3 ns   11180,11        512
QuadRope.vcat + reallocate    1055400000,0 ns 31931175,99          2
Array2D.vcat                   20212500,0 ns  951606,83         16
QuadRope.index                      520,7 ns      21,41     524288
Array2D.index                         6,7 ns       0,34   67108864
QuadRope.set                       4753,1 ns     181,87      65536
Array2D.set                     8881250,0 ns  461588,82         32

Most functions on quad ropes are recursive and can therefore not be inlined.

2016-04-12

Remembering where splitting occurred

During lazy tree splitting, if threads become available, we want to pause traversal, split the remaining work, that is, the part of the quad rope that has not yet been processed, and continue in parallel on the sub-trees of the split.

To do this efficiently by splitting trees at their child-nodes, intermediate trees must be allowed to be non-regular, such that e.g. the NW sub-tree may be an empty node.

NB: In Bergstrom et al., splitting must be guaranteed to produce ropes that differ in length by at most one. If we want to split efficiently (i.e. in constant time instead of O(n log n)), we have to sacrifice this guarantee. Moreover, simpler splitting also results in simpler subsequent merging.

We also need a way to remember where we originally split the "context" of the sequential execution, because we must later re-insert the already processed part of the rope into its old location. This smells like we want another zipper.

Performance of Re-Allocation

$ ./benchmark # on mono
# OS          Unix 14.5.0.0
# .NET vers.  4.0.30319.17020
# 64-bit OS   True
# 64-bit proc True
# CPU         ; 8 "cores"
# Date        2016-04-12T09:47:06
...
QuadRope.hcat                        99.7 ns       0.70    4194304
QuadRope.hcat + hbalance         561328.1 ns    4801.85        512
QuadRope.hcat + reallocate     647650000.0 ns 3742028.56          2
Array2D.hcat                   16025000.0 ns  224768.40         16
QuadRope.vcat                       100.1 ns       1.14    4194304
QuadRope.vcat + vbalance         501562.5 ns    5109.74        512
QuadRope.vcat + reallocate     647500000.0 ns 2943920.29          2
...

While re-balancing large quad ropes seems to be no problem, re-allocation is a big deal (in the literal sense of the word). Re-allocation of the data structure might nevertheless be necessary if a user decides to concatenate small arrays in an "adversarial" manner.

A useful re-balancing/re-allocation strategy might be to allow re-allocation only on sufficiently small arrays and to later fall back to simple balancing.

2016-04-06

Filtering

Filtering is a problem because the resulting 2D array might not be regular. If we cannot guarantee regularity, we need to prohibit the use of functions that produce possibly irregular structures.

We can still allow filtering on arrays of width or height = 1. This is implemented now.
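
A sketch of what such a restricted, row-wise filter could look like; rows, toSeq and fromArray2D are assumed helper names, not necessarily the repository's actual API:

```fsharp
/// Filtering a 1 x n rope cannot break regularity: the result is
/// again a single row, just shorter.
let hfilter p qr =
    if rows qr <> 1 then
        invalidArg "qr" "hfilter requires a rope of height 1."
    let xs = toSeq qr |> Seq.filter p |> Seq.toArray
    fromArray2D (Array2D.init 1 xs.Length (fun _ j -> xs.[j]))
```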

Reduce

Reducing empty lists should return nothing.

Fold

There are still problems with fold but they are related to splitting.