setindex!() performance #10

Closed · rdeits opened this issue Jun 21, 2018 · 11 comments

rdeits commented Jun 21, 2018

Hi! We're considering using the BlockBandedMatrix type in RigidBodyDynamics.jl to store a block-diagonal Jacobian matrix, but it looks like the current performance of setindex!() is a little too slow for our needs. More importantly, we're trying to write our in-place algorithms in a way that doesn't allocate at all, and the current implementation of setindex!() for BlockBandedMatrix allocates memory for a new view() at each call.

For example:

julia> using BlockBandedMatrices

julia> l,u = 0, 0          # block bandwidths
(0, 0)

julia> N = M = 4          # number of row/column blocks
4

julia> cols = rows = 1:N  # block sizes
1:4

julia> m = BlockBandedMatrix(eye(10, 10), (rows,cols), (l,u));

julia> using BenchmarkTools

julia> @btime $m[2, 2] = 5.0
  70.054 ns (2 allocations: 128 bytes)
5.0

as compared to 1.604 ns (0 allocations: 0 bytes) for a plain Matrix{Float64}.

Do you have any thoughts or plans on how to improve the performance? My current solution involves a different AbstractMatrix type which stores all of its block views at construction time, so looking up a block doesn't require allocating a new view. Could we do the same thing in BlockBandedMatrix?
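Roughly, the wrapper I have in mind looks like this (a minimal sketch with made-up names, not our actual code):

struct BlockDiagonalView{T, V <: SubArray{T,2}} <: AbstractMatrix{T}
    parent::Matrix{T}    # dense storage for the full matrix
    blocks::Vector{V}    # views into parent, one per diagonal block
end

function BlockDiagonalView(parent::Matrix{T}, rows::Vector{Int}, cols::Vector{Int}) where T
    rstops, cstops = cumsum(rows), cumsum(cols)
    # build every block view once, up front, so later lookups allocate nothing
    blocks = [view(parent, (rstops[i] - rows[i] + 1):rstops[i],
                           (cstops[i] - cols[i] + 1):cstops[i])
              for i in eachindex(rows)]
    BlockDiagonalView{T, eltype(blocks)}(parent, blocks)
end

Base.size(A::BlockDiagonalView) = size(A.parent)
Base.getindex(A::BlockDiagonalView, i::Int, j::Int) = A.parent[i, j]
blocks(A::BlockDiagonalView) = A.blocks   # amortize one lookup over many writes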

dlfivefifty (Member) commented

Are your blocks constant size?

The use of views in setindex! was just a quick hack to get it working, and it would definitely be better to redesign.

rdeits commented Jun 21, 2018

No, and they're not even square in general. The blocks in our case represent the mapping from configuration derivative to velocity for a single joint, so the blocks can be of varying sizes and can be non-square for joints with a different number of configurations than velocities.

dlfivefifty (Member) commented

Are they like in your example cols = rows = 1:N? (This is actually my biggest use case.)

If not, that is, if they are arbitrarily sized, I don't believe there is any way around the fact that setindex! will be slow, as it has to look up the right block in memory. In my use case the bands are rational, so the eventual fast code would not call setindex! but would instead look something like:

view(A, Block(N,M))[band(1)] .= (1:N).^(-1)

in this case, the code could be designed to do the lookups once. (And could even be done on a GPU with no CPU memory access.)

Are you sure that your code needs to call setindex!?

Also, look in BlockArrays.jl at BlockArray vs PseudoBlockArray. Here I'm following the PseudoBlockArray style, where the non-zero entries of the matrix are stored consecutively in memory by column. (This will allow the use of LAPACK to do block-wise QR.) An alternative would be to follow BlockArray and store each block as a separate Matrix{T}, that is, wrap a BandedMatrix{Matrix{T}, Matrix{Matrix{T}}}. Which style fits your application better?
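To illustrate the distinction (assuming the BlockArrays.jl constructors that take a parent array plus block-size vectors):

using BlockArrays

rows = cols = [1, 2, 3, 4]

# BlockArray: each block lives in its own Matrix{Float64}, so grabbing a
# block is cheap, but the full matrix is not contiguous in memory.
B = BlockArray(rand(10, 10), rows, cols)

# PseudoBlockArray: one contiguous parent matrix with block structure layered
# on top -- the style BlockBandedMatrix follows, which keeps the data
# LAPACK-friendly but makes block access a computed view.
P = PseudoBlockArray(rand(10, 10), rows, cols)

B[Block(2, 2)]        # the stored 2×2 block
view(P, Block(2, 2))  # a view computed into the contiguous parent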

rdeits commented Jun 21, 2018

No, that's just pulled from the readme. We don't know anything about the sizes of the blocks ahead of time, but we know that the resulting matrix will have block lower and upper bandwidths of 0 (i.e. there is only a single, contiguous diagonal band of blocks), but each block can be of varying size.

The matrices we're dealing with are also small enough (less than 50x50) that sparsity isn't a huge concern. My current solution is a struct which stores the parent as a dense matrix and then stores a vector of views into that parent. We don't have to do any fancy lookup to find the right block because there's just the one diagonal band of blocks, so I can just store the views in diagonal order and look them up linearly.

The only two operations that I currently perform on the block-diagonal matrix are:

  1. Updating each block in place with something like update!(blocks(J)[i], data[i]) where i is the index of the block along the diagonal
  2. Matmul using the full J matrix

It looks like BandedMatrix{Matrix{T}, Matrix{Matrix{T}}} might be a good alternative to what I'm using now.
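For concreteness, here are those two operations written against the BlockDiagonalView sketch from my earlier comment (still made-up names):

J = BlockDiagonalView(zeros(10, 10), [1, 2, 3, 4], [1, 2, 3, 4])
data = [rand(n, n) for n in 1:4]

# 1. update each diagonal block in place through its stored view;
#    no view is constructed here, so nothing allocates
for i in eachindex(data)
    copy!(blocks(J)[i], data[i])   # copyto! on Julia 0.7+
end

# 2. matmul against the full matrix, hitting the dense parent directly
x = rand(10)
y = zeros(10)
A_mul_B!(y, J.parent, x)           # mul!(y, J.parent, x) on Julia 0.7+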

dlfivefifty (Member) commented

There’s still a lookup as you have to access the view, whose location is stored in memory. So I don’t think there’s anything about your setup that makes it inherently faster, though storing the views will indeed avoid allocation.

I think the fastest option is to lookup the starting index and stride and access the data directly, without creating any views. This is actually a really easy change.

rdeits commented Jun 21, 2018

Sorry, yes, I totally agree. My approach will still be slower than directly accessing a plain Matrix because of the cost of looking up the existing view. In my particular case that's OK, as I can amortize a single lookup over all of the indices which are being set within the given view.

And yes, I think looking up the index and stride directly would be faster (and non-allocating) compared to constructing a new view each time.

dlfivefifty (Member) commented

Try this:

@inline function foosetindex!(A::BlockBandedMatrix{T}, v, i::Int, j::Int) where T
    @boundscheck checkbounds(A, i, j)
    # map the global (i, j) index to a block index plus intra-block offsets
    bi = BlockBandedMatrices.global2blockindex(A.block_sizes, (i, j))
    ind = A.block_sizes.block_starts[bi.I...]   # linear start of the block's data
    st = A.block_sizes.block_strides[bi.I[2]]   # column stride within that block-column
    @inbounds V.data[ind + bi.α[1]-1 + (bi.α[2]-1)*st] = convert(T, v)::T
    return v
end

I get

julia> @btime foosetindex!(m,2, 5,6)
  2.755 μs (4 allocations: 160 bytes)
2

rdeits commented Jun 21, 2018

That doesn't look quite right. You're accessing a non-existent variable V (should be A), and the time is 2 microseconds, which is about 1000x too slow. Changing that V to A, I get 53.211 ns (2 allocations: 128 bytes), which is much better but still allocating.

It looks like there's something strange going on with global2blockindex which is where the allocations (and most of the time) are coming from:

julia> s = m.block_sizes;

julia> @btime BlockBandedMatrices.global2blockindex($s, $(5, 6))
  42.521 ns (2 allocations: 128 bytes)

rdeits commented Jun 21, 2018

Ah, I got it. There's a subtle bug that occurs in several places in BlockArrays.jl. In many of the @generated functions, you have lines like this: https://github.com/JuliaArrays/BlockArrays.jl/blob/d5710bb15357b26c942df8c983a8f14992a8d922/src/blockindices.jl#L73

where you do $Expr(:meta, :inline). But that interpolation isn't quite right: the $ binds only to Expr itself, not to the whole expression. The result is that the generated output, rather than containing the inlining hint, constructs an Expr at run time. That is, the resulting generated function's body looks like:

function foo(x)
  (BlockArrays.Expr)(:meta, :inline)
  _do_stuff(x)
end

Changing that to $(Expr(:meta, :inline)) everywhere, I get global2blockindex down from ~60 ns with 2 allocations to ~10 ns with 0 allocations.
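A minimal reproduction of the difference (function names made up):

@generated function bad_inline(x)
    quote
        $Expr(:meta, :inline)       # splices the value `Expr` (the type), so the
        x + 1                       # generated body *calls* Expr(...) at run time
    end
end

@generated function good_inline(x)
    quote
        $(Expr(:meta, :inline))     # splices the already-built expression, so the
        x + 1                       # generated body carries the actual inline hint
    end
end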

dlfivefifty (Member) commented

Awesome! Though by "you" you mean @kristofferc, who wrote BlockArrays.jl 😛 I've since inherited it and now maintain it.

Can you make a PR?

rdeits commented Jun 21, 2018

Haha, sorry, didn't mean to accuse 😉 . And yup, working on a PR now.
