@mcabbott mcabbott commented Jul 8, 2020

This pre-computes the number of spawns to perform (and the number of recursive blocks), then inlines everything. It improves both run times and allocations quite a bit.
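For readers unfamiliar with the pattern, here is a rough sketch of the general idea of fixing the recursion depth once, up front, instead of re-deriving it at every level. It is not Tullio's actual code: `threaded_halves` and `levels` are hypothetical names, and the choice of `log2(nthreads)` halving levels is an assumption made only for illustration.

```julia
# Hypothetical sketch, not Tullio's implementation.
# `levels` is the number of halving steps still to perform; it is computed
# once by the caller, so the recursion itself does no further bookkeeping.
function threaded_halves(f, r::UnitRange{Int}, levels::Int)
    if levels < 1 || length(r) < 2
        f(r)                                  # leaf: run the kernel serially on this block
        return nothing
    end
    mid = first(r) + (length(r) - 1) ÷ 2      # split the range into two halves
    task = Threads.@spawn threaded_halves(f, mid+1:last(r), levels - 1)
    threaded_halves(f, first(r):mid, levels - 1)   # this task handles the left half
    wait(task)
    return nothing
end

x = rand(10_000)
y = similar(x)
levels = ceil(Int, log2(Threads.nthreads()))  # assumed rule: about one leaf per thread

# Each leaf writes to a disjoint block of `y`, so no locking is needed.
threaded_halves(1:length(x), levels) do block
    for i in block
        y[i] = 2 * x[i]
    end
end
```

With a single thread, `levels` is 0 and the kernel simply runs serially over the whole range.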

Using the example from #15, but now with threads=true, timings on my laptop (Julia 1.4.2) were:

julia> tulliomul_multi!(C, A, B) = @tullio C[m,n] = A[m,j,k] * B[k,n,j];

julia> @btime tmul!($C₁, $A, $B);
  227.691 μs (0 allocations: 0 bytes)

julia> @btime einmul!($C₂, $A, $B);
  4.321 ms (0 allocations: 0 bytes)

julia> @btime tulliomul_multi!($C₃, $A, $B);
  137.678 μs (187 allocations: 10.00 KiB)

julia> C₁ ≈ C₂ ≈ C₃
true

and after this change the last one becomes:

julia> @btime tulliomul_multi!($C₃, $A, $B);
  120.737 μs (42 allocations: 3.14 KiB)

Simple matrix multiplication, before this change:

julia> using Tullio, LoopVectorization, LinearAlgebra

julia> N = 100; X = rand(N,N); Y = rand(N,N);

julia> mul(X, Y) = @tullio Z[i,k] := X[i,j] * Y[j,k];

julia> @btime mul($X, $Y);
  48.914 μs (49 allocations: 80.98 KiB)

julia> @btime *($X, $Y);
  44.157 μs (2 allocations: 78.20 KiB) # -> 24.138 μs (2 allocations: 78.20 KiB) with MKL

julia> BLAS.vendor()
:openblas64

julia> mul(X, Y) ≈ X * Y
true

julia> N = 1000; X = rand(N,N); Y = rand(N,N);

julia> @btime mul($X, $Y);
  27.212 ms (12386 allocations: 8.01 MiB)

julia> @btime *($X, $Y); 
  22.355 ms (2 allocations: 7.63 MiB) # -> 20.801 ms (2 allocations: 7.63 MiB) with MKL

After this change, first with N = 100 again:

julia> @btime mul($X, $Y);
  25.699 μs (48 allocations: 81.44 KiB)

julia> N = 1000; X = rand(N,N); Y = rand(N,N);

julia> @btime mul($X, $Y);
  23.812 ms (1572 allocations: 7.68 MiB)

Permuted-batched-mul example from the readme, before:

julia> using Tullio, LoopVectorization, NNlib, OMEinsum

julia> A′ = randn(20,30,500); B′ = randn(500,40,30);

julia> bmm_rev(A′, B′) = @tullio C[i,k,b] := A′[i,j,b] * B′[b,k,j] 

julia> bmm_rev(A′, B′) ≈ NNlib.batched_mul(A′, permutedims(B′, (3,2,1)))
true

julia> @btime bmm_rev($A′, $B′);
  663.462 μs (357 allocations: 3.07 MiB)

julia> @btime NNlib.batched_mul($A′, permutedims($B′, (3,2,1)));
  2.066 ms (4 allocations: 7.63 MiB)

julia> using OMEinsum # uses better permutedims() from TensorOperations

julia> bmm_ein(A′, B′) = @ein C[i,k,b] := A′[i,j,b] * B′[b,k,j]

julia> bmm_rev(A′, B′) ≈ bmm_ein(A′, B′)
true

julia> @btime bmm_ein($A′, $B′);
  1.723 ms (72 allocations: 7.64 MiB)

After:

julia> @btime bmm_rev($A′, $B′);
  658.432 μs (60 allocations: 3.06 MiB)

@mcabbott mcabbott closed this Jul 11, 2020
@mcabbott mcabbott reopened this Jul 17, 2020
@mcabbott mcabbott merged commit 97c1ea9 into master Aug 15, 2020