Support view of blocks #22

Comments
Yes, slicing and views in general are still lacking. I hope it would not be too hard to implement. Question: why is the view of a subblock a subarray of a blocked matrix? It feels like this should be a subarray of the same type that each block is stored as.
There is no type for storing blocks, as the data is stored as a single contiguous array. For fast algorithms I think this is much better than storing each block separately.
Having the blocks stored separately allows for other things though. Say you have a block matrix like:

```
[A   B]
[B^T 0]
```

and want to act with the Schur complement S = B^T M^(-1) B on a vector v (as part of preconditioning in an iterative linear solver). If B and M are stored separately this can be done much faster than actually having to extract the blocks in each iteration.
I don't see how that's faster: views of matrices are essentially as fast as `Matrix`es, and what you want to do can be accomplished with views.
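(A minimal sketch of that view-based approach, not from the thread: `M`, `n`, the random data, and the dense `\` solve are illustrative assumptions, with the Schur complement taken here as S = Bᵀ A⁻¹ B for the block matrix above.)

```julia
using LinearAlgebra   # Julia ≥ 0.7

n = 100
M = rand(2n, 2n)            # contiguous storage of the 2×2-block matrix [A B; Bᵀ 0]
A = view(M, 1:n, 1:n)       # the (1,1) block, no copy
B = view(M, 1:n, n+1:2n)    # the (1,2) block, no copy
v = rand(n)

# apply the Schur complement without extracting the blocks;
# in a real preconditioner one would factorize A once and reuse it
Sv = B' * (A \ (B * v))
```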
In my application these matrices are sparse, so I don't think it is as easy then.
For sparse arrays, one could do it with a view as well. In any case, I think this is off topic, as both approaches can coexist.
I agree with @KristofferC: many operations occur block-by-block, and so having the blocks be contiguous in memory feels more natural and seems like it would be more efficient. It also allows for the possibility of having blocks with different array types.
I'm confused: I'm advocating for contiguous storage in memory, while the current `BlockArray` design stores each block separately. Ordering by columns gives the best of both worlds: you can do things block by block and still use LAPACK, but at the same time get LAPACK-compatible views of contiguous blocks together. In any case, this is not really a central issue, as there can exist multiple subtypes of the abstract block array type.
PS: Using arrays of arrays will be very inefficient when there are many small blocks, which is why I decided to use a single contiguous array for the data in my current `BlockBandedMatrix` implementation.
Right, that makes a lot of sense; I misunderstood how the data is laid out. Could you explain what you mean by "ordered by column"? Are blocks still contiguous in memory in that case?
Sorry, I guess I misunderstood how it is currently stored. Ordering by column just means listing the entries by columns, just like in an ordinary column-major `Matrix`. While the blocks are not contiguous in memory, they still have the structure of a strided matrix: it is enough to know the first entry, the size and the stride to use LAPACK and BLAS Level 3 routines.

I think at this point the best thing to do would be to implement the ordering by column, so we can see exactly what the speeds of the two implementations are for matrix multiplication.
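(A rough illustration of the strided-view point, not from the thread; the matrix `M` and the sizes are arbitrary.)

```julia
using LinearAlgebra: BLAS   # Julia ≥ 0.7

M = rand(6, 6)              # column-major ("ordered by column") storage
V = view(M, 1:3, 1:3)       # a 3×3 block: not contiguous in memory, but strided
strides(V)                  # (1, 6): first entry, size and stride describe the block
C = zeros(3, 3)
BLAS.gemm!('N', 'N', 1.0, V, V, 0.0, C)   # BLAS Level 3 accepts the strided view directly
```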
That's interesting. It seems to me that the answer might depend on the size of the blocks. The application that got me interested in this package will likely have a small number of large blocks (say, 1000x1000). So I wasn't concerned about the block-level logic being fast. I don't mind slowly iterating over the blocks as long as within-block operations are as fast as possible. I'm getting the impression that your use-case is different, involving lots of little blocks, so you're more motivated to make the block-level logic efficient. Is that accurate?
I think I misunderstood as well. So, you store the whole matrix contiguously, with the caveat that you have to use views to extract a block without copying? Is that correct?
There's almost no benefit in having blocks contiguous in memory:

```julia
using BenchmarkTools

n = 1000; A = rand(n,n); B = rand(n,n); M = rand(2n,2n); M2 = rand(10n,2n);
@btime A*B # 31.949 ms (2 allocations: 7.63 MiB)
@btime view(M,1:n,1:n)*view(M,n+1:2n,n+1:2n) # 37.348 ms (74 allocations: 7.63 MiB)
@btime view(M2,1:n,1:n)*view(M2,n+1:2n,n+1:2n) # 39.615 ms (74 allocations: 7.63 MiB)
```
Yes, that's right. In the non-banded case this would mean the data is stored exactly like a `Matrix`.
Oh, but that is exactly `PseudoBlockArray` then? (crappy name, I know)
That's what I thought. Let me clarify that there are three things under discussion here.

But this discussion is off topic; my only point is that I am fairly confident nothing is lost by storing the data contiguously and working with views (provided the blocks are dense).
OK, I did the following quick experiment:

```julia
using BenchmarkTools
using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,
                       view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),
                       view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                       1.0,cur_block)
        end
    end
    Y
end
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)
@btime A*A # 1.572 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks # 95.898 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) # 348.103 μs (2100 allocations: 131.25 KiB)
b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)
@btime A*A # 13.259 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks # 324.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) # 707.913 μs (20 allocations: 1.25 KiB)
# Convert A to Block Array
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B # 6.474 ms (120016 allocations: 7.40 MiB)

b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B # 47.627 ms (960016 allocations: 58.90 MiB)
```
Those are some really interesting results. It's pretty clear that multiplication hasn't been properly implemented yet for these block types. Also, what do you think accounts for the big difference between `A.blocks*A.blocks` and `A_mul_B_block!(Y, A, A)`? I don't have a very good intuition for that.

```julia
using BenchmarkTools
using BlockArrays

# fast blockwise multiplication
function A_mul_B_block!(Y, A, B)
    fill!(Y.blocks,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        for P = 1:N
            BLAS.gemm!('N','N',1.0,
                       view(A.blocks,(K-1)b+1:K*b,(P-1)b+1:P*b),
                       view(B.blocks,(P-1)b+1:P*b,(J-1)b+1:J*b),
                       1.0,cur_block)
        end
    end
    Y
end
b = 10; N = 10; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)
@btime A*A # 1.446 ms (8 allocations: 78.53 KiB)
@btime A.blocks*A.blocks # 37.028 μs (2 allocations: 78.20 KiB)
@btime A_mul_B_block!(Y, A, A) # 386.679 μs (2100 allocations: 131.25 KiB)
# Convert A to Block Array
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B # 5.557 ms (120016 allocations: 7.40 MiB)
# replace views with array of blocks
function A_mul_B_blockarray!(Y, A, B)
    fill!(Y,0)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = Y.blocks[K,J]
        for P = 1:N
            BLAS.gemm!('N','N',1.0, A.blocks[K,P], B.blocks[P,J], 1.0, cur_block)
        end
    end
    Y
end
W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 337.071 μs (0 allocations: 0 bytes)
b = 100; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Y = similar(A)
@btime A*A # 11.767 ms (8 allocations: 312.91 KiB)
@btime A.blocks*A.blocks # 143.398 μs (2 allocations: 312.58 KiB)
@btime A_mul_B_block!(Y, A, A) # 226.180 μs (20 allocations: 1.25 KiB)
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
@btime B*B # 38.895 ms (960016 allocations: 58.90 MiB)

W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 224.157 μs (0 allocations: 0 bytes)
# Even bigger blocks, removing the slow methods.
b = 1000; N = 2; n = N*b; A = PseudoBlockArray(rand(n,n),fill(b,N),fill(b,N))
Z = similar(A.blocks)
@btime BLAS.gemm!('N', 'N', 1.0, A.blocks, A.blocks, 0.0, Z) # 109.154 ms (0 allocations: 0 bytes)
Y = similar(A)
@btime A_mul_B_block!(Y, A, A) # 143.229 ms (20 allocations: 1.25 KiB)
B = BlockArray(Matrix{Float64}, fill(b,N),fill(b,N))
for K = 1:N, J = 1:N
    B[Block(K,J)] = A[Block(K,J)]
end
W = similar(B)
for K = 1:N, J = 1:N
    setblock!(W, zeros(b,b), Block(K,J))
end
@btime A_mul_B_blockarray!(W, B, B) # 131.430 ms (0 allocations: 0 bytes)
```
The benefit is that you can combine neighbouring blocks and still get a strided matrix. For example:

```julia
function A_mul_B_col_block!(Y, A, B)
    b = blocksize(A,1,1)[1] # not general
    N = nblocks(A)[1]
    for J = 1:N, K = 1:N
        cur_block = view(Y.blocks,(K-1)b+1:K*b,(J-1)b+1:J*b)
        A_mul_B!(cur_block,
                 view(A.blocks,(K-1)b+1:K*b,:),
                 view(B.blocks,:,(J-1)b+1:J*b))
    end
    Y
end

@btime A_mul_B_col_block!(Y, A, A) # 275.924 μs (300 allocations: 18.75 KiB)
```

In particular, I want to calculate a QR factorization, and this is very complicated when the blocks are stored as separate arrays. Note that the allocations should disappear eventually: https://github.com/JuliaLang/julia/issues/14955
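(A small sketch of the "combine neighbouring blocks" point, with hypothetical sizes rather than the ApproxFun code: a view spanning a whole block column of the contiguous storage is still one strided matrix, so it can be factorized directly.)

```julia
using LinearAlgebra   # Julia ≥ 0.7

b, N = 10, 4
data = rand(b*N, b*N)            # contiguous storage of an N×N block matrix
blockcol = view(data, :, 1:b)    # all row-blocks of the first block column, as one strided matrix
F = qr!(blockcol)                # QR of the combined blocks, computed in place without copying
```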
This is now merged.
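(A hedged usage sketch of what block views enable; the `view(A, Block(i, j))` form and the `PseudoBlockArray` constructor are assumed from BlockArrays and may differ from the exact merged API.)

```julia
using BlockArrays

A = PseudoBlockArray(rand(6, 6), [2, 4], [2, 4])   # contiguous data with a 2×2 block structure
V = view(A, Block(1, 2))                           # a view of block (1,2), no copy (assumed form)
V .= 0                                             # mutates the underlying data in place
A[Block(1, 2)]                                     # the block now reads back as zeros
```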
This issue is the first step in a larger issue that could be called "Unify with `ApproxFun.BlockBandedMatrix`/`ApproxFun.BandedBlockBandedMatrix`". In ApproxFun I need to work with views of blocks, and in particular exploit the fact that these views are equivalent to `StridedMatrix`, that is, they are compatible with LAPACK routines. I also need to work with slices of blocks. Below I've attached some of the behaviour of `ApproxFun.BlockBandedMatrix`. Note that it is closer in nature to `PseudoBlockArray` in that memory is stored contiguously (though the current ordering of the memory is likely to change).
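(A rough illustration of the `StridedMatrix` point above, using generic `Matrix` data rather than the actual `ApproxFun.BlockBandedMatrix` storage.)

```julia
using LinearAlgebra   # Julia ≥ 0.7

data = rand(9, 9)           # stand-in for contiguously stored block data
V = view(data, 4:6, 4:6)    # a block view taken with range indices
V isa StridedMatrix         # true: LAPACK-backed routines accept it without copying
F = lu!(V)                  # e.g. an in-place LU factorization of the block
```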