
How finalizers |> work #165

Open
RainerHeintzmann opened this issue Mar 27, 2023 · 5 comments

@RainerHeintzmann

I am having some trouble grasping how to limit the action of the einsum:

julia> a = ones(3,3); b= ones(3,3);

julia> @tullio y[j,i] := a[i,k] * b[j,k]
3×3 Matrix{Float64}:
 3.0  3.0  3.0
 3.0  3.0  3.0
 3.0  3.0  3.0

julia> @tullio y[j,i] += a[m,j] * b[m,i]
3×3 Matrix{Float64}:
 6.0  6.0  6.0
 6.0  6.0  6.0
 6.0  6.0  6.0

julia> @tullio y[j,i] := a[i,k] * b[j,k] + a[m,j] * b[m,i]
3×3 Matrix{Float64}:
 18.0  18.0  18.0
 18.0  18.0  18.0
 18.0  18.0  18.0

julia> @tullio y[j,i] := (a[i,k] * b[j,k] |> identity) + (a[m,j] * b[m,i] |> identity)
3×3 Matrix{Float64}:
 18.0  18.0  18.0
 18.0  18.0  18.0
 18.0  18.0  18.0

My current implementation uses @tullio by splitting the expression into multiple @tullio calls, as in the first two examples. 6*ones(3,3) is the correct result, which I want to achieve. Yet it seems like a single expression (applied over a large array) should be faster. But the 3rd result is not what I expected. Rethinking, I can understand that it is probably interpreted by moving both sums (over k and over m) to the very outside. To limit this effect, I thought the |> operation would limit ("finalize") the action, but this does not seem to be the case. What did I get wrong here? I guess in this example one could use the same index for m and k, but in my case this is not possible, since the ranges differ.

@mcabbott
Owner

Rethinking, I can understand that it is probably interpreted by moving both sums (over k and over m) to the very outside.

Yes, this is exactly right. Unlike Einstein notation, it does not know that + is special; it sees y[j,i] := f(g(a[i,k], b[j,k]), g(a[m,j], b[m,i])) as something to sum over k and m.
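
A minimal check of this reading in plain Julia, no Tullio needed (the variable names just follow the examples above): summing the whole bracket over both k and m reproduces the 18s from the third example.

# Plain-Julia check: the single @tullio call above behaves like summing the
# entire right-hand side over both k and m.
a = ones(3,3); b = ones(3,3)
[sum(a[i,k]*b[j,k] + a[m,j]*b[m,i] for k in 1:3, m in 1:3) for j in 1:3, i in 1:3]
# == fill(18.0, 3, 3), matching the third example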

The macro always generates just one loop nest. All that you can do with |> is apply an operation later in the nest. The canonical example is @tullio y[i] := x[i,k]^2 |> sqrt which makes

for i in axes(y,1)  # outer loops for LHS
  tmp = 0
  for k in axes(x,2)  # inner loops are the sum
    tmp += x[i,k]^2
  end
  y[i] = sqrt(tmp)  # sqrt moved later
end
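
For reference, a small runnable sketch of that same example (assuming `using Tullio`; the names x, y1, y2 are only for the check), compared against the equivalent Base expression:

using Tullio
x = rand(4, 5)
y1 = @tullio y[i] := x[i,k]^2 |> sqrt   # sqrt applied once per i, after the sum over k
y2 = vec(sqrt.(sum(abs2, x; dims=2)))   # same reduction written with Base functions
y1 ≈ y2                                 # true: the finaliser acts outside the k loop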

Maybe "finaliser" is the wrong word, but that's all it does. I see what you're hoping for, but that requires a more complicated set of loops which Tullio doesn't understand. I think the macro has not noticed the |> at all (since it's not at top level) and hence calls Base's version, which does nothing here. Probably it should throw an error instead.

@RainerHeintzmann
Author

Thanks for the explanation. This makes sense. I found a way to write it, but I guess this is still doing pretty much the same as the first example above.

julia> @btime $c .= (@tullio $y[j,i] := $a[i,k] * $b[j,k]) .+ (@tullio $y[j,i] := $a[m,j] * $b[m,i]);
  673.596 ms (4 allocations: 15.26 MiB)

The timings and memory are for 1k × 1k arrays. The memory consumption got me a little worried, which is why I also tried a simple matrix multiplication:

julia> @btime c .= $a * transpose($b);
  9.269 ms (3 allocations: 7.63 MiB)

julia> @btime @tullio c[j,i] = $a[i,k] * $b[j,k];
  549.171 ms (9 allocations: 176 bytes)

Memory is no problem here, but speed is (probably cache usage?).
Is this (Tullio being roughly 60x slower here) a known issue?

@RainerHeintzmann
Author

The non-Tullio way of writing this also uses 15 MiB, but is faster:

julia> @btime $c .= $a*transpose($b) .+ transpose($a)*$b;
  19.053 ms (4 allocations: 15.26 MiB)

I guess this may have to do with the magic of efficient (<O(N²)) implementation of matrix multiplications.

@mcabbott
Owner

If you have c then this will avoid the allocations:

mul!(c, a, transpose(b))              # c = a * b'
mul!(c, transpose(a), b, true, true)  # 5-arg mul!: c = a'*b*true + c*true, i.e. c .+= a'*b

So will things like @tullio y[j,i] += a[m,j] * b[m,i], with = or += but not :=.
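
Spelled out as a sketch (assuming Tullio is loaded and a, b, c are already-allocated square matrices), the fully in-place version of the first two examples above is:

using Tullio
@tullio c[j,i] = a[i,k] * b[j,k]    # `=` overwrites c in place, no new array
@tullio c[j,i] += a[m,j] * b[m,i]   # `+=` accumulates the second contraction into c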

For straight matrix multiplication, Tullio will usually lose to more specialised routines. See e.g. the graph at https://github.com/JuliaLinearAlgebra/Octavian.jl. Around size 100, it suffers from the overhead of using Base's threads. Around size 3000, it suffers from not knowing about some optimisations. (I don't think <N^3 algorithms like Strassen are actually used in BLAS, but I'm not very sure.) Tullio's main purpose in life is handling weird contractions which aren't served at all by such libraries, or which would require expensive permutedims operations before/after. These are where it can be 5x faster sometimes.
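
As an illustration of the kind of contraction meant here (the specific example is my own, not one from the thread): the first two indices of x are used in swapped order, so a BLAS route needs an explicit permutedims and reshapes first.

using Tullio
x = rand(32, 32, 10); w = rand(10, 5)
@tullio y[i,j,k] := x[j,i,l] * w[l,k]                # permute-and-contract in one pass
# The BLAS route needs a copy first:
xp = permutedims(x, (2, 1, 3))
y ≈ reshape(reshape(xp, 32*32, 10) * w, 32, 32, 5)   # true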

@RainerHeintzmann
Author

RainerHeintzmann commented Mar 31, 2023

Thanks. Of course <O(N^3) is what I meant. Interesting to know that Strassen or the like are not actually used in BLAS, as discussed here.
I was using Tullio for exactly this use case, a weird contraction. See this code.
Yet it seems to be hard to avoid allocations or storing intermediate results in large arrays.

mcabbott changed the title from "finalizers" to "How finalizers |> work" on May 6, 2023