setindex!() performance #10

Closed · rdeits opened this issue Jun 21, 2018 · 11 comments

rdeits commented Jun 21, 2018

Hi! We're considering using the BlockBandedMatrix type in RigidBodyDynamics.jl to store a block-diagonal Jacobian matrix, but it looks like the current performance of setindex!() is a little too slow for our needs. More importantly, we're trying to write our in-place algorithms in a way that doesn't allocate at all, and the current implementation of setindex!() for BlockBandedMatrix allocates memory for a new view() at each call.

For example:

julia> using BlockBandedMatrices

julia> l,u = 0, 0          # block bandwidths
(0, 0)

julia> N = M = 4          # number of row/column blocks
4

julia> cols = rows = 1:N  # block sizes
1:4

julia> m = BlockBandedMatrix(eye(10, 10), (rows,cols), (l,u));

julia> using BenchmarkTools

julia> @btime $m[2, 2] = 5.0
  70.054 ns (2 allocations: 128 bytes)
5.0

as compared to 1.604 ns (0 allocations: 0 bytes) for a plain Matrix{Float64}.

Do you have any thoughts or plans on how to improve the performance? My current solution involves a different AbstractMatrix type which stores all of its block views at construction time, so looking up a block doesn't require allocating a new view. Could we do the same thing in BlockBandedMatrix?
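Roughly, the wrapper I have in mind looks like this (a minimal sketch with made-up names, not our actual code):

struct BlockDiagonalView{T, V <: SubArray{T,2}} <: AbstractMatrix{T}
    parent::Matrix{T}    # dense storage for the full matrix
    blocks::Vector{V}    # views into parent, one per diagonal block
end

function BlockDiagonalView(parent::Matrix{T}, rows::Vector{Int}, cols::Vector{Int}) where T
    rstops, cstops = cumsum(rows), cumsum(cols)
    # build every block view once, up front, so later lookups allocate nothing
    blocks = [view(parent, (rstops[i] - rows[i] + 1):rstops[i],
                           (cstops[i] - cols[i] + 1):cstops[i])
              for i in eachindex(rows)]
    BlockDiagonalView{T, eltype(blocks)}(parent, blocks)
end

Base.size(A::BlockDiagonalView) = size(A.parent)
Base.getindex(A::BlockDiagonalView, i::Int, j::Int) = A.parent[i, j]
blocks(A::BlockDiagonalView) = A.blocks   # amortize one lookup over many writes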

dlfivefifty (Member) commented

Are your blocks constant size?

The use of views in setindex! was just a quick hack to get it working, and it would definitely be better to redesign.

rdeits commented Jun 21, 2018

No, and they're not even square in general. The blocks in our case represent the mapping from configuration derivative to velocity for a single joint, so the blocks can be of varying sizes and can be non-square for joints with a different number of configurations than velocities.

dlfivefifty (Member) commented

Are they like in your example cols = rows = 1:N? (This is actually my biggest use case.)

If not, that is, if they are arbitrarily sized, I don't believe there is any way around the fact that setindex! will be slow, as it has to look up the right block in memory. In my use case the bands are rational, so the eventual fast code would not call setindex! but would instead look something like:

view(A, Block(N,M))[band(1)] .= (1:N).^(-1)

in this case, the code could be designed to do the lookups once. (And could even be done on a GPU with no CPU memory access.)

Are you sure that your code needs to call setindex!?

Also, look in BlockArrays.jl at BlockArray vs PseudoBlockArray. Here I'm following the PseudoBlockArray style, where the non-zero entries of the matrix are stored consecutively in memory by column. (This will allow the use of LAPACK to do block-wise QR.) An alternative would be to follow BlockArray and store each block as a separate Matrix{T}, that is, wrap a BandedMatrix{Matrix{T}, Matrix{Matrix{T}}}. Which style fits your application better?
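To illustrate the distinction (assuming the BlockArrays.jl constructors that take a parent array plus block-size vectors):

using BlockArrays

rows = cols = [1, 2, 3, 4]

# BlockArray: each block lives in its own Matrix{Float64}, so grabbing a
# block is cheap, but the full matrix is not contiguous in memory.
B = BlockArray(rand(10, 10), rows, cols)

# PseudoBlockArray: one contiguous parent matrix with block structure layered
# on top -- the style BlockBandedMatrix follows, which keeps the data
# LAPACK-friendly but makes block access a computed view.
P = PseudoBlockArray(rand(10, 10), rows, cols)

B[Block(2, 2)]        # the stored 2×2 block
view(P, Block(2, 2))  # a view computed into the contiguous parent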

rdeits commented Jun 21, 2018

No, that's just pulled from the readme. We don't know anything about the sizes of the blocks ahead of time, but we know that the resulting matrix will have block lower and upper bandwidths of 0 (i.e. there is only a single, contiguous diagonal band of blocks), but each block can be of varying size.

The matrices we're dealing with are also small enough (less than 50x50) that sparsity isn't a huge concern. My current solution is a struct which stores the parent as a dense matrix and then stores a vector of views into that parent. We don't have to do any fancy lookup to find the right block because there's just the one diagonal band of blocks, so I can just store the views in diagonal order and look them up linearly.

The only two operations that I currently perform on the block-diagonal matrix are:

  1. Updating each block in place with something like update!(blocks(J)[i], data[i]) where i is the index of the block along the diagonal
  2. Matmul using the full J matrix

It looks like BandedMatrix{Matrix{T}, Matrix{Matrix{T}}} might be a good alternative to what I'm using now.
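For concreteness, here are those two operations written against the BlockDiagonalView sketch from my earlier comment (still made-up names):

J = BlockDiagonalView(zeros(10, 10), [1, 2, 3, 4], [1, 2, 3, 4])
data = [rand(n, n) for n in 1:4]

# 1. update each diagonal block in place through its stored view;
#    no view is constructed here, so nothing allocates
for i in eachindex(data)
    copy!(blocks(J)[i], data[i])   # copyto! on Julia 0.7+
end

# 2. matmul against the full matrix, hitting the dense parent directly
x = rand(10)
y = zeros(10)
A_mul_B!(y, J.parent, x)           # mul!(y, J.parent, x) on Julia 0.7+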

dlfivefifty (Member) commented

There’s still a lookup as you have to access the view, whose location is stored in memory. So I don’t think there’s anything about your setup that makes it inherently faster, though storing the views will indeed avoid allocation.

I think the fastest option is to lookup the starting index and stride and access the data directly, without creating any views. This is actually a really easy change.

rdeits commented Jun 21, 2018

Sorry, yes, I totally agree. My approach will still be slower than directly accessing a plain Matrix because of the cost of looking up the existing view. In my particular case that's OK, as I can amortize a single lookup over all of the indices which are being set within the given view.

And yes, I think looking up the index and stride directly would be faster (and non-allocating) compared to constructing a new view each time.

dlfivefifty (Member) commented

Try this:

@inline function foosetindex!(A::BlockBandedMatrix{T}, v, i::Int, j::Int) where T
    @boundscheck checkbounds(A, i, j)
    # map the global (i, j) index to a block index plus intra-block offsets
    bi = BlockBandedMatrices.global2blockindex(A.block_sizes, (i, j))
    ind = A.block_sizes.block_starts[bi.I...]   # linear start of the block's data
    st = A.block_sizes.block_strides[bi.I[2]]   # column stride within that block-column
    @inbounds V.data[ind + bi.α[1]-1 + (bi.α[2]-1)*st] = convert(T, v)::T
    return v
end

I get

julia> @btime foosetindex!(m,2, 5,6)
  2.755 μs (4 allocations: 160 bytes)
2

rdeits commented Jun 21, 2018

That doesn't look quite right. You're accessing a non-existent variable V (should be A), and the time is 2 microseconds, which is about 1000x too slow. Changing that V to A, I get 53.211 ns (2 allocations: 128 bytes), which is much better but still allocating.

It looks like there's something strange going on with global2blockindex which is where the allocations (and most of the time) are coming from:

julia> s = m.block_sizes;

julia> @btime BlockBandedMatrices.global2blockindex($s, $(5, 6))
  42.521 ns (2 allocations: 128 bytes)

rdeits commented Jun 21, 2018

Ah, I got it. There's a subtle bug that occurs in several places in BlockArrays.jl. In many of the @generated functions, you have lines like this: https://github.com/JuliaArrays/BlockArrays.jl/blob/d5710bb15357b26c942df8c983a8f14992a8d922/src/blockindices.jl#L73

where you do $Expr(:meta, :inline). But that interpolation isn't quite right: the $ binds only to Expr itself, not to the whole expression. The result is that the generated output, rather than containing the inlining hint, constructs an Expr at run time. That is, the resulting generated function's body looks like:

function foo(x)
  (BlockArrays.Expr)(:meta, :inline)
  _do_stuff(x)
end

Changing that to $(Expr(:meta, :inline)) everywhere, I get global2blockindex down from ~60 ns with 2 allocations to ~10 ns with 0 allocations.
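A minimal reproduction of the difference (function names made up):

@generated function bad_inline(x)
    quote
        $Expr(:meta, :inline)       # splices the value `Expr` (the type), so the
        x + 1                       # generated body *calls* Expr(...) at run time
    end
end

@generated function good_inline(x)
    quote
        $(Expr(:meta, :inline))     # splices the already-built expression, so the
        x + 1                       # generated body carries the actual inline hint
    end
end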

dlfivefifty (Member) commented

Awesome! Though by "you" you mean @kristofferc, who wrote BlockArrays.jl 😛 I've since inherited it and now maintain it.

Can you make a PR?

rdeits commented Jun 21, 2018

Haha, sorry, didn't mean to accuse 😉 . And yup, working on a PR now.
