
Support view of blocks #22

Closed
dlfivefifty opened this issue Jul 14, 2017 · 21 comments

Comments

@dlfivefifty
Member

This issue is the first step in a larger issue that could be called "Unify with ApproxFun.BlockBandedMatrix / ApproxFun.BandedBlockBandedMatrix".

In ApproxFun I need to work with views of blocks, and in particular exploit the fact that these views are equivalent to StridedMatrix, that is, they are compatible with LAPACK routines. I also need to work with slices of blocks.

Below I've attached some of the behaviour of ApproxFun.BlockBandedMatrix. Note that it is closer in nature to PseudoBlockArray in that memory is stored contiguously (though the current ordering of the memory is likely to change).

julia> using ApproxFun; import ApproxFun: Block

julia> A = ApproxFun.bbrand(Float64,1,1,[2,3],[4,1]) # random block banded matrix, the "1"s indicate the bandwidth, while the [2,3] indicates the row block sizes, and the [4,1] indicates the column block sizes
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   0.677622  0.229719   0.947088   0.970988 
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 

julia> A[Block(1),Block(1)]  # Analogous to your A[Block(1,1)]
2×4 Array{Float64,2}:
 0.266809  0.677622  0.229719  0.947088
 0.128803  0.125624  0.107945  0.226405

julia> S = view(A,Block(1),Block(2)) # create a view without copying. The implementation is a bit of a hack, but mimics the behaviour of Base.view with Colon
2×1 SubArray{Float64,2,ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}},Tuple{ApproxFun.Block,ApproxFun.Block},false}:
 0.970988 
 0.0620217

julia> S[1] = 6
6

julia> A
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   0.677622  0.229719   0.947088   6.0      
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 

julia> S_sub = view(A,Block(1)[1:2],Block(1)[2:3]) # create a view of a sub-block of a block, that is, take the 1:2 x 2:3 entries of Block(1,1).
2×2 SubArray{Float64,2,ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}},Tuple{ApproxFun.SubBlock{UnitRange{Int64}},ApproxFun.SubBlock{UnitRange{Int64}}},false}:
 0.677622  0.229719
 0.125624  0.107945

julia> S_sub[1,1] = 2
2

julia> A
5×5 ApproxFun.BlockBandedMatrix{Float64,Array{Int64,1},Array{Int64,1}}:
 0.266809   2.0       0.229719   0.947088   6.0      
 0.128803   0.125624  0.107945   0.226405   0.0620217
 0.384811   0.246323  0.410468   0.577572   0.147354 
 0.0765648  0.474993  0.199023   0.486716   0.14736  
 0.949332   0.37327   0.0650809  0.0692731  0.223807 
@KristofferC
Collaborator

Yes, slicing and views in general are still lacking. I hope it would not be too hard to implement. Question: why is the view of a sub-block a SubArray of a blocked matrix? It feels like this should be a SubArray of the same type that each block is stored as.

@dlfivefifty
Member Author

There is no type for storing blocks, as the data is stored as a Vector{Float64}, (ordered at the moment by block, but soon to be re-written to be ordered by column). Or rather, the type for each block is a SubArray{Float64,2,BlockBandedMatrix{…},Tuple{Block,Block},false}, which has all the features of a StridedMatrix: it is equivalent to a BLAS matrix, as it has a pointer, stride, etc.

For fast algorithms I think this is much better than the BlockArray approach of storing the blocks. For example, if the memory is stored by column you can use LAPACK's QR code to calculate a QR decomposition in a block-wise fashion, as one can get not only just a block, but a contiguous sequence of blocks in a StridedMatrix-like format. Having the blocks stored separately does not allow this.
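To illustrate the point (a minimal sketch with a plain Matrix, not ApproxFun code): in column-major storage a contiguous run of columns is itself a StridedMatrix, so LAPACK's QR can factorize that panel in place with no copy, which is exactly the panel access a block-wise QR needs.

```julia
using LinearAlgebra

# Sketch: a contiguous run of columns of a column-major Matrix is itself a
# StridedMatrix, so LAPACK's QR can run on it in place, without copying.
A = rand(6, 4)
panel = view(A, :, 1:2)   # first two columns: contiguous, LAPACK-compatible
F = qr!(panel)            # in-place LAPACK QR via the view
```

The same would not work if each block lived in its own separate array, since a run of blocks would then have no common stride.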

@KristofferC
Collaborator

Having the blocks stored separately allows for other things though. Say you have a block matrix like:

[A     B]
[B^T   0]

and want to act with the Schur complement S = B^T M^(-1) B on a vector v (as part of preconditioning in an iterative linear solver). If B and M are stored separately this can be done much faster than actually having to extract the blocks in each iteration.
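This matrix-free application can be sketched as follows (a toy illustration: `apply_schur` and the small blocks here are hypothetical names, and I use M for the matrix whose inverse appears in S, as in the formula above):

```julia
using SparseArrays, LinearAlgebra

# Apply S = Bᵀ M⁻¹ B to a vector v without ever assembling S: when B and M
# are stored as separate sparse blocks this is two sparse products and one
# solve per iteration, instead of extracting blocks each time.
apply_schur(M, B, v) = B' * (M \ (B * v))

M = sparse(2.0I, 3, 3)                     # toy SPD "M" block
B = sparse([1.0 0.0; 0.0 1.0; 1.0 1.0])    # toy coupling block
v = [1.0, 2.0]
apply_schur(M, B, v)                       # → [2.0, 2.5]
```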

@dlfivefifty
Member Author

dlfivefifty commented Jul 15, 2017 via email

@KristofferC
Collaborator

In my application these matrices are sparse so I don't think it is as easy then.

@dlfivefifty
Member Author

For sparse arrays, one could do it with a filter of the indices, which I think is what happens when you call A[1:n,1:n] on a sparse matrix. But I agree there's a cost attached.
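For instance (a toy sketch using only the standard-library SparseArrays), range-indexing a sparse matrix builds a brand-new SparseMatrixCSC by filtering the stored entries, so every such extraction allocates, unlike a zero-copy view of a dense block:

```julia
using SparseArrays

# Stored entries: (1,1)=10, (2,3)=20, (4,4)=30
A = sparse([1, 2, 4], [1, 3, 4], [10.0, 20.0, 30.0], 4, 4)

# Indexing with ranges filters the stored (row, col, val) triples into a
# new sparse matrix; only the entries inside the 3×3 corner survive.
B = A[1:3, 1:3]
```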

In any case, I think this is off topic, as both BlockArray and PseudoBlockArray have their use cases. And each one should have a SubArray{T,…,Tuple{Block}} implemented.

@maximerischard

I agree with @KristofferC, many operations occur block-by-block, and so having the blocks be contiguous in memory feels more natural, and seems like it would be more efficient. It also allows for the possibility of having blocks with different array types.

@dlfivefifty
Member Author

I'm confused: I'm advocating for contiguous memory, while the current BlockArray is a wrapper around an array of arrays, so there's no guarantee of memory locality.

Ordering by columns gives the best of both worlds: you can do things block by block and still use LAPACK, but at the same time get LAPACK compatible views of contiguous blocks together.

In any case, this is not really a central issue as there can exist multiple subtypes of AbstractBlockArray.

@dlfivefifty
Member Author

PS Using arrays of arrays will be very inefficient when there are many small blocks. This is why I decided to use a single Vector with LAPACK compatible blocks.

Note that my current implementation of BlockBandedMatrix actually does have the blocks themselves contiguous. I'm actually proposing to reimplement it to order by columns, so that I can use LAPACK's QR to calculate the QR decomposition.

@maximerischard

Right, that makes a lot of sense. I misunderstood how ApproxFun worked; it's actually quite different from PseudoBlockArray (I got misled by that statement). It's really neat that you were able to get views to work, and it does seem like there's a strong argument for storing the data that way.

Could you explain what you mean by “ordered by column?” Are blocks still contiguous in memory in that case?

@dlfivefifty
Member Author

Sorry, I guess I misunderstood how PseudoBlockArray worked.

Ordering by column just means listing the entries by column, just like in Matrix. In the case of BlockBandedMatrix, we omit the zero block entries. In the case of BlockMatrix, it becomes essentially just a wrapper around a Matrix.

While the blocks are not contiguous in memory, they still have the structure of a strided matrix: it is enough to know the first entry, size and stride to use LAPACK and BLAS Level 3 routines.
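As a plain-Matrix illustration of that strided structure (just Base and LinearAlgebra, no block types involved):

```julia
using LinearAlgebra

# A block view into a column-major Matrix is not contiguous, but it is
# strided: a pointer to the first entry, the size, and the column stride
# are enough for BLAS/LAPACK to operate on it in place.
A = reshape(collect(1.0:20.0), 4, 5)
V = view(A, 2:3, 2:4)    # interior 2×3 block
strides(V)               # (1, 4): unit stride down columns, 4 across rows
y = V * ones(3)          # multiplication dispatches to BLAS on the view
```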

I think at this point the best thing to do would be to implement the ordering by column, so we can see exactly what the speeds of the two implementations are for matrix multiplication.

@maximerischard

That's interesting. It seems to me that the answer might depend on the size of the blocks. The application that got me interested in this package will likely have a small number of large blocks (say, 1000x1000). So I wasn't concerned about the block-level logic being fast. I don't mind slowly iterating over the blocks as long as within-block operations are as fast as possible. I'm getting the impression that your use-case is different, involving lots of little blocks, so you're more motivated to make the block-level logic efficient. Is that accurate?

@KristofferC
Collaborator

I think I misunderstood as well. So, you store the whole matrix contiguously, with the caveat that you have to use views to extract a block without copying? Is that correct?

@dlfivefifty
Member Author

There's almost no benefit in having blocks contiguous in memory:

n = 1000; A = rand(n,n); B = rand(n,n); M = rand(2n,2n); M2 = rand(10n,2n);

@btime A*B # 31.949 ms (2 allocations: 7.63 MiB)
@btime view(M,1:n,1:n)*view(M,n+1:2n,n+1:2n)  # 37.348 ms (74 allocations: 7.63 MiB)
@btime view(M2,1:n,1:n)*view(M2,n+1:2n,n+1:2n)  # 39.615 ms (74 allocations: 7.63 MiB)

@dlfivefifty
Member Author

So, you store the whole matrix contiguously with the caveat that you have to use views to extract a block without copying?

Yes that's right. In the non-banded case this would mean the data is stored exactly like a Matrix, but with the block structure attached on top.

@KristofferC
Collaborator

Oh but that is exactly "PseudoBlockArray" then? (crappy name I know)

@dlfivefifty
Member Author

dlfivefifty commented Jul 17, 2017

That's what I thought. Let me clarify: there are three things under discussion:

  1. The current ApproxFun.BlockBandedMatrix, which is not like PseudoBlockArray since blocks are contiguous in memory
  2. The proposed ApproxFun.BlockBandedMatrix, which is like PseudoBlockArray but without the zero blocks stored in memory. The reason to rewrite is that there is almost no penalty for not having contiguous blocks, and it allows LAPACK-compatible views like view(A,Block.(1:2),Block(1)), which allow for a block-wise QR
  3. The current ApproxFun.BandedBlockBandedMatrix, which hasn't been discussed but stores the blocks in a format where they can be retrieved in a way compatible with BandedMatrices.jl

But this discussion is off topic; my only point is that I am fairly confident that the PseudoBlockArray approach is faster computationally overall, especially when you have many blocks.

(Provided the blocks are dense.)

@dlfivefifty
Member Author

OK, I did the following quick experiment showing that using views really is fast:

using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end


b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.572 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks #  95.898 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) #   348.103 μs (2100 allocations: 131.25 KiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 13.259 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks #  324.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) #     707.913 μs (20 allocations: 1.25 KiB)




# Convert A to Block Array
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 6.474 ms (120016 allocations: 7.40 MiB)


b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 47.627 ms (960016 allocations: 58.90 MiB)

@maximerischard

maximerischard commented Jul 18, 2017

Those are some really interesting results. It's pretty clear that multiplication hasn't been properly implemented yet for BlockArrays and PseudoBlockArrays. To make the comparison complete, I've taken your A_mul_B_block! and written a version of it for BlockArrays, and rerun the benchmarks (I've also rearranged things slightly to make it easier for me to follow). It does end up being ever so slightly faster using a matrix of blocks than views into a single matrix, and there's no memory allocation at all anymore. So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

Also, what do you think accounts for the big difference between A.blocks*A.blocks and A_mul_B_block!(Y, A, A)? I don't have a very good intuition for that.

using BenchmarkTools

using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                        1.0,cur_block)
        end
    end
    Y
end

b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 1.446 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks # 37.028 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) # 386.679 μs (2100 allocations: 131.25 KiB)

# Convert A to Block Array
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
    for K = 1:N, J = 1:N
        B[Block(K,J)] = A[Block(K,J)]
    end

@btime B*B # 5.557 ms (120016 allocations: 7.40 MiB)

# replace views with array of blocks
function A_mul_B_blockarray!(Y, A, B)
    fill!(Y,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = Y.blocks[K,J]
        for P = 1:N
            BLAS.gemm!('N','N',1.0,A.blocks[K,P]
                       , B.blocks[P,J]
                       , 1.0,cur_block)
        end
    end
    Y
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 337.071 μs (0 allocations: 0 bytes)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)

@btime A*A # 11.767 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks # 143.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) # 226.180 μs (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

@btime B*B # 38.895 ms (960016 allocations: 58.90 MiB)

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 224.157 μs (0 allocations: 0 bytes)

# Even bigger blocks, removing the slow methods.
b = 1000; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Z = similar(A.blocks)
@btime BLAS.gemm!('N', 'N', 1.0, A.blocks, A.blocks, 0.0, Z) # 109.154 ms (0 allocations: 0 bytes)
Y = similar(A)
@btime A_mul_B_block!(Y, A, A) # 143.229 ms (20 allocations: 1.25 KiB)

B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 131.430 ms (0 allocations: 0 bytes)

@dlfivefifty
Member Author

So you're entirely correct that views are almost as fast as contiguous blocks, but I'm not understanding the benefit yet. Could you expand on that a little bit?

The benefit is that you can combine neighbouring blocks and still get a strided matrix. For example:

function A_mul_B_col_block!(Y, A, B)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        A_mul_B!(cur_block,
                 view(A.blocks,(K-1)b+1:K*b,:),
                 view(B.blocks,:,(J-1)b+1:J*b))
    end
    Y
end

@btime A_mul_B_col_block!(Y, A, A) #   275.924 μs (300 allocations: 18.75 KiB)

In particular, I want to calculate a QR factorization and this is very complicated with BlockArrays, and will be slow.

Note that the allocations should disappear eventually: https://github.com/JuliaLang/julia/issues/14955

@dlfivefifty
Member Author

This is now merged.
