
Support view of blocks #22

Closed
dlfivefifty opened this issue Jul 14, 2017 · 21 comments

Comments

@dlfivefifty
Member

This issue is the first step in a larger issue that could be called "Unify with ApproxFun.BlockBandedMatrix / ApproxFun.BandedBlockBandedMatrix".

In ApproxFun I need to work with views of blocks, and in particular exploit the fact that these views are equivalent to StridedMatrix, that is, they are compatible with LAPACK routines. I also need to work with slices of blocks.

Below I've attached some of the behaviour of ApproxFun.BlockBandedMatrix. Note that it is closer in nature to PseudoBlockArray in that memory is stored contiguously (though the current ordering of the memory is likely to change).

julia> using ApproxFun; import ApproxFun: Block

julia> A = ApproxFun.bbrand(Float64,1,1,[2,3],[4,1]) # random block banded matrix, the "1"s indicate the bandwidth, while the [2,3] indicates the row block sizes, and the [4,1] indicates the column block sizes
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   0.677622  0.229719   0.947088   0.970988 
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 

julia> A[Block(1),Block(1)]  # Analogous to your A[Block(1,1)]
2×4 Array{Float64,2}:
 0.266809  0.677622  0.229719  0.947088
 0.128803  0.125624  0.107945  0.226405

julia> S = view(A,Block(1),Block(2)) # create a view without copying. The implementation is a bit of a hack, but mimics the behaviour of Base.view with Colon
2×1 SubArray{Float64,2,ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}},Tuple{ApproxFun.Block,ApproxFun.Block},false}:
 0.970988 
 0.0620217

julia> S[1] = 6
6

julia> A
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   0.677622  0.229719   0.947088   6.0      
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 

julia> S_sub = view(A,Block(1)[1:2],Block(1)[2:3]) # create a view of a sub-block of a block, that is, take the 1:2 x 2:3 entries of Block(1,1).
2×2 SubArray{Float64,2,ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}},Tuple{ApproxFun.SubBlock{UnitRange{Int64}},ApproxFun.SubBlock{UnitRange{Int64}}},false}:
 0.677622  0.229719
 0.125624  0.107945

julia> S_sub[1,1] = 2
2

julia> A
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   2.0       0.229719   0.947088   6.0      
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 
@KristofferC
Collaborator

Yes, slicing and views in general are still lacking. I hope it would not be too hard to implement. Question: why is the view of a sub-block a SubArray of a blocked matrix? It feels like this should be a SubArray of the same type that each block is stored as.

@dlfivefifty
Member Author

There is no type for storing blocks, as the data is stored as a Vector{Float64}, (ordered at the moment by block, but soon to be re-written to be ordered by column). Or rather, the type for each block is a SubArray{Float64,2,BlockBandedMatrix{…},Tuple{Block,Block},false}, which has all the features of a StridedMatrix: it is equivalent to a BLAS matrix, as it has a pointer, stride, etc.

For fast algorithms I think this is much better than the BlockArray approach of storing the blocks. For example, if the memory is stored by column you can use LAPACK's QR code to calculate a QR decomposition in a block-wise fashion, as one can get not only just a block, but a contiguous sequence of blocks in a StridedMatrix-like format. Having the blocks stored separately does not allow this.
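To illustrate the point (a minimal sketch with a plain Matrix, not ApproxFun code): in column-major storage a contiguous run of columns is itself a StridedMatrix, so LAPACK's QR can factorize that panel in place with no copy, which is exactly the panel access a block-wise QR needs.

```julia
using LinearAlgebra

# Sketch: a contiguous run of columns of a column-major Matrix is itself a
# StridedMatrix, so LAPACK's QR can run on it in place, without copying.
A = rand(6, 4)
panel = view(A, :, 1:2)   # first two columns: contiguous, LAPACK-compatible
F = qr!(panel)            # in-place LAPACK QR via the view
```

The same would not work if each block lived in its own separate array, since a run of blocks would then have no common stride.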

@KristofferC
Collaborator

Having the blocks stored separately allows for other things though. Say you have a block matrix like:

[A     B]
[B^T   0]

and want to act with the Schur complement S = B^T M^(-1) B on a vector v (as part of preconditioning in an iterative linear solver). If B and M are stored separately this can be done much faster than actually having to extract the blocks in each iteration.
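This matrix-free application can be sketched as follows (a toy illustration: `apply_schur` and the small blocks here are hypothetical names, and I use M for the matrix whose inverse appears in S, as in the formula above):

```julia
using SparseArrays, LinearAlgebra

# Apply S = Bᵀ M⁻¹ B to a vector v without ever assembling S: when B and M
# are stored as separate sparse blocks this is two sparse products and one
# solve per iteration, instead of extracting blocks each time.
apply_schur(M, B, v) = B' * (M \ (B * v))

M = sparse(2.0I, 3, 3)                     # toy SPD "M" block
B = sparse([1.0 0.0; 0.0 1.0; 1.0 1.0])    # toy coupling block
v = [1.0, 2.0]
apply_schur(M, B, v)                       # → [2.0, 2.5]
```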

@dlfivefifty
Member Author

dlfivefifty commented Jul 15, 2017 via email

@KristofferC
Collaborator

In my application these matrices are sparse so I don't think it is as easy then.

@dlfivefifty
Member Author

For sparse arrays, one could do it with a filter of the indices, which I think is what happens when you call A[1:n,1:n] on a sparse matrix. But I agree there's a cost attached.
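For instance (a toy sketch using only the standard-library SparseArrays), range-indexing a sparse matrix builds a brand-new SparseMatrixCSC by filtering the stored entries, so every such extraction allocates, unlike a zero-copy view of a dense block:

```julia
using SparseArrays

# Stored entries: (1,1)=10, (2,3)=20, (4,4)=30
A = sparse([1, 2, 4], [1, 3, 4], [10.0, 20.0, 30.0], 4, 4)

# Indexing with ranges filters the stored (row, col, val) triples into a
# new sparse matrix; only the entries inside the 3×3 corner survive.
B = A[1:3, 1:3]
```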

In any case, I think this is off topic, as both BlockArray and PseudoBlockArray have their use cases. And each one should have a SubArray{T,…,Tuple{Block}} implemented.

@maximerischard

I agree with @KristofferC, many operations occur block-by-block, and so having the blocks be contiguous in memory feels more natural, and seems like it would be more efficient. It also allows for the possibility of having blocks with different array types.

@dlfivefifty
Member Author

I'm confused: I'm advocating for contiguous memory, while the current BlockArray is a wrapper around an array of arrays, so there's no guarantee of memory locality.

Ordering by columns gives the best of both worlds: you can do things block by block and still use LAPACK, but at the same time get LAPACK compatible views of contiguous blocks together.

In any case, this is not really a central issue as there can exist multiple subtypes of AbstractBlockArray.

@dlfivefifty
Member Author

PS Using arrays of arrays will be very inefficient when there are many small blocks. This is why I decided to use a single Vector with LAPACK compatible blocks.

Note that my current implementation of BlockBandedMatrix actually does have the blocks themselves contiguous. I'm actually proposing to reimplement it to order by columns, so that I can use LAPACK's QR to calculate the QR decomposition.

@maximerischard

Right, that makes a lot of sense. I misunderstood how ApproxFun worked; it's actually quite different from PseudoBlockArray (I got misled by that statement). It's really neat that you were able to get views to work, and it does seem like there's a strong argument for storing the data that way.

Could you explain what you mean by “ordered by column?” Are blocks still contiguous in memory in that case?

@dlfivefifty
Member Author

Sorry, I guess I misunderstood how PseudoBlockArray worked.

Ordering by column just means listing the entries by column, just like in Matrix. In the case of BlockBandedMatrix, we omit the zero block entries. In the case of BlockMatrix, it becomes essentially just a wrapper around a Matrix.

While the blocks are not contiguous in memory, they still have the structure of a strided matrix: it is enough to know the first entry, size and stride to use LAPACK and BLAS Level 3 routines.
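As a plain-Matrix illustration of that strided structure (just Base and LinearAlgebra, no block types involved):

```julia
using LinearAlgebra

# A block view into a column-major Matrix is not contiguous, but it is
# strided: a pointer to the first entry, the size, and the column stride
# are enough for BLAS/LAPACK to operate on it in place.
A = reshape(collect(1.0:20.0), 4, 5)
V = view(A, 2:3, 2:4)    # interior 2×3 block
strides(V)               # (1, 4): unit stride down columns, 4 across rows
y = V * ones(3)          # multiplication dispatches to BLAS on the view
```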

I think at this point the best thing to do would be to implement the ordering by column, so we can see exactly what the speeds of the two implementations are for matrix multiplication.

@maximerischard

That's interesting. It seems to me that the answer might depend on the size of the blocks. The application that got me interested in this package will likely have a small number of large blocks (say, 1000x1000). So I wasn't concerned about the block-level logic being fast. I don't mind slowly iterating over the blocks as long as within-block operations are as fast as possible. I'm getting the impression that your use-case is different, involving lots of little blocks, so you're more motivated to make the block-level logic efficient. Is that accurate?

@KristofferC
Collaborator

I think I misunderstood as well. So, you store the whole matrix contiguously, with the caveat that you have to use views to extract a block without copying? Is that correct?

@dlfivefifty
Member Author

There's almost no benefit in having blocks contiguous in memory:

n = 1000; A = rand(n,n); B = rand(n,n); M = rand(2n,2n); M2 = rand(10n,2n);

@btime A*B # 31.949 ms (2 allocations: 7.63 MiB)
@btime view(M,1:n,1:n)*view(M,n+1:2n,n+1:2n)  # 37.348 ms (74 allocations: 7.63 MiB)
@btime view(M2,1:n,1:n)*view(M2,n+1:2n,n+1:2n)  # 39.615 ms (74 allocations: 7.63 MiB)

@dlfivefifty
Member Author

So, you store the whole matrix contiguously with the caveat that you have to use views to extract a block without copying?

Yes that's right. In the non-banded case this would mean the data is stored exactly like a Matrix, but with the block structure attached on top.

@KristofferC
Collaborator

Oh but that is exactly "PseudoBlockArray" then? (crappy name I know)

@dlfivefifty
Member Author

dlfivefifty commented Jul 17, 2017

That's what I thought. Let me clarify: there are three things under discussion:

  1. The current ApproxFun.BlockBandedMatrix, which is not like PseudoBlockArray since blocks are contiguous in memory
  2. The proposed ApproxFun.BlockBandedMatrix, which is like PseudoBlockArray but without the zero blocks stored in memory. The reason to rewrite is that there is almost no penalty for not having contiguous blocks, and it allows LAPACK-compatible views like view(A,Block.(1:2),Block(1)), which allow for a block-wise QR
  3. The current ApproxFun.BandedBlockBandedMatrix, which hasn't been discussed but stores the blocks in a format where they can be retrieved in a way compatible with BandedMatrices.jl

But this discussion is off topic; my only point is that I am fairly confident that the PseudoBlockArray approach is faster computationally overall, especially when you have many blocks.

(Provided the blocks are dense.)

@dlfivefifty
Member Author

OK, I did the following quick experiment showing that using views really is fast:

using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end


b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.572 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks #  95.898 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) #   348.103 μs (2100 allocations: 131.25 KiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 13.259 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks #  324.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) #     707.913 μs (20 allocations: 1.25 KiB)




# Convert A to Block Array
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 6.474 ms (120016 allocations: 7.40 MiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 47.627 ms (960016 allocations: 58.90 MiB)

@maximerischard

maximerischard commented Jul 18, 2017

Those are some really interesting results. It's pretty clear that multiplication hasn't been properly implemented yet for BlockArrays and PseudoBlockArrays. To make the comparison complete, I've taken your A_mul_B_block! and written a version of it for BlockArrays, and rerun the benchmarks (I've also rearranged things slightly to make it easier for me to follow). It does end up being ever so slightly faster using a matrix of blocks than views into a single matrix, and there's no memory allocation at all anymore. So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

Also, what do you think accounts for the big difference between A.blocks*A.blocks and A_mul_B_block!(Y, A, A)? I don't have a very good intuition for that.

using BenchmarkTools

using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end

b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.446 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks # 37.028 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) # 386.679 μs (2100 allocations: 131.25 KiB)

# Convert A to Block Array
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 5.557 ms (120016 allocations: 7.40 MiB)

# replace views with array of blocks
function A_mul_B_blockarray!(Y, A, B)
    fill!(Y,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = Y.blocks[K,J]
        for P = 1:N
            BLAS.gemm!('N','N',1.0,A.blocks[K,P]
                       , B.blocks[P,J]
                       , 1.0,cur_block)
        end
    end
    Y
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 337.071 μs (0 allocations: 0 bytes)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 11.767 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks # 143.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) # 226.180 μs (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 38.895 ms (960016 allocations: 58.90 MiB)

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 224.157 μs (0 allocations: 0 bytes)

# Even bigger blocks, removing the slow methods.
b = 1000; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Z = similar(A.blocks)
@btime BLAS.gemm!('N', 'N', 1.0, A.blocks, A.blocks, 0.0, Z) # 109.154 ms (0 allocations: 0 bytes)
Y = similar(A)
@btime A_mul_B_block!(Y, A, A) # 143.229 ms (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 131.430 ms (0 allocations: 0 bytes)

@dlfivefifty
Member Author

So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

The benefit is that you can combine neighbouring blocks and still get a strided matrix. For example:

function A_mul_B_col_block!(Y, A, B)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        A_mul_B!(cur_block,
                 view(A.blocks,(K-1)b+1:K*b,:),
                 view(B.blocks,:,(J-1)b+1:J*b))
    end
    Y
end

@btime A_mul_B_col_block!(Y, A, A) #   275.924 μs (300 allocations: 18.75 KiB)

In particular, I want to calculate a QR factorization and this is very complicated with BlockArrays, and will be slow.

Note that the allocations should disappear eventually: https://github.com/JuliaLang/julia/issues/14955

@dlfivefifty
Member Author

This is now merged.
