Poor performance when weaving many branches together #64

Closed
Moelf opened this issue Jul 24, 2021 · 30 comments · Fixed by #65

@Moelf
Member

Moelf commented Jul 24, 2021

Sometimes in HEP the memory access pattern is inherently bad. For example, we have jagged branches for electron- and muon-related quantities: Lorentz vectors, isolation flags, etc.

Let's say e_mask is a quality cut, and people want to extract the isolation flags for the electrons that passed e_mask:

function get_Isos(e_mask, m_mask, evt)
    v_l_passIso = Vector{Bool}[]

    v_e_passIso_HighPtCaloOnly = evt.v_e_passIso_HighPtCaloOnly
    v_e_passIso_TightTrackOnly_VarRad = evt.v_e_passIso_TightTrackOnly_VarRad
    v_e_passIso_TightTrackOnly_FixedRad = evt.v_e_passIso_TightTrackOnly_FixedRad
    v_e_passIso_Tight_VarRad = evt.v_e_passIso_Tight_VarRad
    # .... 12 more

    @inbounds for (i, f) in enumerate(e_mask)
        f || continue
        push!(
            v_l_passIso,
            Bool[
                v_e_passIso_HighPtCaloOnly[i],
                v_e_passIso_TightTrackOnly_VarRad[i],
                v_e_passIso_TightTrackOnly_FixedRad[i],
                v_e_passIso_Tight_VarRad[i],
                v_e_passIso_Loose_VarRad[i],
            ],
        )
    end
    # one more for loop for muons

    return v_l_passIso
end

which results in
[profiler screenshot]

Each of those piles is actually just the v_e_passIso_HighPtCaloOnly = evt.v_e_passIso_HighPtCaloOnly line.

This is probably horrible no matter what language you use, but I'm wondering if we should explore https://github.com/JuliaArrays/ArraysOfArrays.jl/ to see if we can cut down the allocations from making concrete vectors (or vectors of vectors), since in many cases they just get torn apart manually in the looper.
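For illustration, a minimal sketch of what that would look like (the flag values and lepton counts are made up; only the storage layout matters):

using ArraysOfArrays

# Store all the per-lepton flag vectors in one flat buffer plus an offset table,
# instead of one small heap-allocated Vector{Bool} per selected lepton.
v_l_passIso = VectorOfVectors{Bool}()

push!(v_l_passIso, Bool[true, false, true, true])   # flags for one selected lepton
push!(v_l_passIso, Bool[false, true, true, true])   # flags for the next one

v_l_passIso[1]          # a view into the shared flat buffer
flatview(v_l_passIso)   # the underlying flat Vector{Bool}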

@oschulz
Member

oschulz commented Jul 25, 2021

Using ArraysOfArrays.jl should definitely help reduce the number of memory allocations a lot.

@oschulz
Member

oschulz commented Jul 25, 2021

Why is v_l_passIso a vector of boolean vectors instead of a vector of NamedTuples?
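For context, that alternative would look roughly like this (a sketch only; the field names mirror the branches in the issue):

# One NamedTuple of Bools per selected lepton. NamedTuples of Bools are isbits,
# so the outer Vector stores them inline, without a per-lepton heap allocation.
v_l_passIso = NamedTuple{(:HighPtCaloOnly, :TightTrackOnly_VarRad,
                          :TightTrackOnly_FixedRad, :Tight_VarRad), NTuple{4,Bool}}[]

push!(v_l_passIso, (HighPtCaloOnly = true, TightTrackOnly_VarRad = false,
                    TightTrackOnly_FixedRad = true, Tight_VarRad = true))

v_l_passIso[1].Tight_VarRad   # access by name instead of by position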

@Moelf
Member Author

Moelf commented Jul 25, 2021

because it's used like this later:

    #isolation cut, require all 4 of them to be true
    if all((
        v_l_passIso[pr1[1]][1],
        v_l_passIso[pr1[2]][1],
        v_l_passIso[W_id[1]][1],
        v_l_passIso[W_id[2]][1],
    ))
    else
        return false, wgt
    end

and

    if (abs(v_l_pid[fifth_l]) == 11)
        !v_l_passIso[fifth_l][4] && return false, wgt
    end
    if (abs(v_l_pid[fifth_l]) == 13)
        !v_l_passIso[fifth_l][3] && return false, wgt
    end

In short, it needs to have the same order as all the other "lepton-related" branches, so you can index all of them with the same number that identifies a unique lepton in the event.

@oschulz
Member

oschulz commented Jul 25, 2021

I guess one option would be to use a static array or tuple, since the length of the elements of v_l_passIso is fixed. An ArraysOfArrays.VectorOfVectors would work too, of course.
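A rough sketch of that option, inside the loop from the original snippet (StaticArrays assumed as a dependency; 4 flags used just for illustration):

using StaticArrays

# Fixed-length, isbits element type: the outer Vector stores the flags inline,
# so there is no per-lepton heap-allocated Vector{Bool}.
v_l_passIso = SVector{4,Bool}[]

push!(v_l_passIso, SVector(
    v_e_passIso_HighPtCaloOnly[i],
    v_e_passIso_TightTrackOnly_VarRad[i],
    v_e_passIso_TightTrackOnly_FixedRad[i],
    v_e_passIso_Tight_VarRad[i],
))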

@Moelf
Member Author

Moelf commented Jul 25, 2021

length of the elements of v_l_passIso is

It is not: the outer vector only contains entries for leptons that passed e_mask and m_mask, and the innermost vectors are of length 16, so probably not worth a StaticArray.

@Moelf
Member Author

Moelf commented Jul 25, 2021

index 86d8909..e26ff74 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -215,11 +215,16 @@ function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:J
-        @views [
-                ntoh.(reinterpret(
-                                  T, rawdata[ (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1] ]
-                                 )) for i in 1:(length(rawoffsets) - 1)
-               ]
+        _size = sizeof(eltype(T))
+        data = UInt8[]
+        offset = Int64[0] # god damn 0-based index
+        @views @inbounds for i in 1:(length(rawoffsets) - 1)
+            rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
+            append!(data, rawdata[rg])
+            push!(offset, last(offset) + length(rg))
+        end
+        real_data = ntoh.(reinterpret(T, data))
+        VectorOfVectors(real_data, offset .÷ _size .+ 1)
     end

here are the numbers AFTER applying this diff:

# 1st run
 17.403734 seconds (72.69 M allocations: 6.877 GiB, 9.20% gc time, 3.31% compilation time)
# 2nd run
3.135352 seconds (34.34 M allocations: 2.249 GiB, 12.11% gc time)
# 3rd run
2.990273 seconds (34.34 M allocations: 2.249 GiB, 9.52% gc time)

BEFORE this diff:

# 1st run
32.069656 seconds (213.86 M allocations: 11.560 GiB, 21.16% gc time, 1.26% compilation time)

# 2nd run
3.976044 seconds (27.73 M allocations: 1.779 GiB, 21.24% gc time)

# 3rd run
  3.725007 seconds (27.73 M allocations: 1.779 GiB, 13.36% gc time)

@oschulz definitely helpful in the asymptotic limit! I'm a little confused about why the subsequent runs allocate more; maybe there's some weird hidden conversion somewhere...

@Moelf
Member Author

Moelf commented Jul 25, 2021

[screenshot]

Hmmm, I tried a Ref{}() hack and it seems to be working; I will make a PR now and we can discuss the technical details there.

@oschulz
Member

oschulz commented Jul 25, 2021

Using view(rawdata, rg) instead of rawdata[rg] could save allocations, since views are stack-allocated now.

@oschulz
Member

oschulz commented Jul 25, 2021

Ah, sorry, I overlooked the @views.

@oschulz
Member

oschulz commented Jul 25, 2021

Maybe try this version:

function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
    # ...
        _size = Base.SignedMultiplicativeInverse{Int}(sizeof(eltype(T)))
        data = UInt8[]
        elem_ptr = sizehint!(Vector{Int}(), length(eachindex(rawoffsets)))
        push!(elem_ptr,  0)
        # Can we sizehint! data too?
        @inbounds for i in eachindex(rawoffsets)[1:end-1] 
            rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
            append!(data, view(rawdata, rg))
            push!(elem_ptr, last(elem_ptr) + div(length(rg), _size))
        end
        real_data = ntoh.(reinterpret(T, data))
        elem_ptr .+= firstindex(real_data)
        VectorOfVectors(real_data, elem_ptr)
    #...
end

@Moelf
Member Author

Moelf commented Jul 25, 2021

Can we sizehint! data too?

maybe something smart like last(offset) ÷ sizeof(eltype(T))
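One rough way to pre-size data before the loop (a sketch only; rawoffsets, jagg_offset, and data are the variables from the diff above, and the exact header accounting is an assumption):

# Upper bound on the payload bytes that will be appended to `data`:
# the whole raw span minus one jagg_offset header per element.
n_elems = length(rawoffsets) - 1
sizehint!(data, max(0, rawoffsets[end] - rawoffsets[1] - n_elems * jagg_offset))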

@Moelf
Member Author

Moelf commented Jul 25, 2021

before:
[screenshot]

your version:
[screenshot]

no visible difference

@oschulz
Member

oschulz commented Jul 26, 2021

Hm, I'm not sure where these allocs are coming from, but it shouldn't be from within interped_data, since it only does a small, fixed number of allocations. What does main_looper do? And can you benchmark just the data reading phase?

@Moelf
Member Author

Moelf commented Jul 26, 2021

See #65 (comment) for a minimal reproduction of the mysterious allocations.

@oschulz
Member

oschulz commented Jul 26, 2021

I'll try to have a go at it ...

@Moelf
Member Author

Moelf commented Jul 26, 2021

I have reduced it to a minimal reproduction:

julia> using ArraysOfArrays

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> @benchmark begin
           for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  50.451 μs  512.027 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     56.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.019 μs ±  14.257 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁▂▂ ▇█▂▆▃▂▂▂▁     ▁▁                                         ▂
  ███▁██████████████████▇▇▇▆▆▇█▇▆▆▇▇▆▆▆▆▇▆▅▅▇▅▆▅▅▆▅▆▇▆▆▆▆▆▆▇▇▇ █
  50.5 μs       Histogram: log(frequency) by time       110 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 9314 samples with 1 evaluation.
 Range (min  max):  472.014 μs   1.195 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     514.337 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   533.550 μs ± 70.951 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▂▆▇▇█▆▅▅▄▃▃▂▁▂                                             ▂
  ███████████████████▇▇▆▆▆▆▅▅▄▆▄▄▄▅▄▅▃▄▄▄▅▆▆▆▆▅▅▆▅▆▇███▇█▇▇▆▄▄ █
  472 μs        Histogram: log(frequency) by time       870 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Vov == V
true

maybe this can be an ArraysOfArrays issue, not sure.

I tested the newly added Int32-offset VoV; it's 30% faster than the Int64 VoV. I really hope this is not some false-sharing shenanigans.

@oschulz
Member

oschulz commented Jul 26, 2021

I don't think this is the issue: VectorOfVectors has to do a little more work to check bounds and create the views than getindex does for a vector of separate vectors. But it's still very little time, and without bounds checking the difference is actually just about a factor of two:

julia> using ArraysOfArrays, BenchmarkTools

julia> V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> Vov = VectorOfVectors(V)

julia> @benchmark begin
           @inbounds for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  34.919 μs  132.067 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     38.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.404 μs ±   7.572 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄▁▅▆█▄▃▄▃ ▄  ▂ ▂▂▃                                           ▁
  █████████▄█▄▄█▄███▅▅▆▇▇█▇▆▆▄▅▅▆▅▄▅▄▅▅▄▄▅▅▅▆▄▅▆▄▄▆▅▄▄▆▅▂▄▄▄▃▄ █
  34.9 μs       Histogram: log(frequency) by time      79.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           @inbounds for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  77.848 μs  226.026 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     87.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.920 μs ±  16.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▁ ▄▅ ▇█ ▅▄ ▂  ▃▂▄▁ ▁   ▂  ▁   ▁                              ▂
  █▅██▆█████▇█▆▇████████▇██▆█▇▇████▆▇▆▆▆▆▇▇▆▅▆▆▆▆▅▅▅▅▆▅▅▅▆▆▃▅▆ █
  77.8 μs       Histogram: log(frequency) by time       165 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

So that's about 1 ns for each element access (view creation plus length()) for VectorOfVectors (with @inbounds), and there are no memory allocations.

@Moelf
Member Author

Moelf commented Jul 26, 2021

Some new observations:

julia> const t1 = LazyTree(ROOTFile("/home/akako/Downloads/doublemu.root"), "t", [r"^Muon_(pt|eta|phi|mass)$","MET_pt"]);

julia> length(t1.Muon_pt) ÷ 10^4
247

julia> @time begin
           for idx in 1:10^4:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.000530 seconds (3.56 k allocations: 406.859 KiB)

julia> @time begin
           for idx in 1:10^3:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.001023 seconds (8.01 k allocations: 546.172 KiB)

julia> @time begin
           for idx in 1:10^2:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.005169 seconds (51.95 k allocations: 1.884 MiB)


julia> @time begin
           for idx in 1:10:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.014718 seconds (494.29 k allocations: 15.714 MiB)

julia> @time begin
           for idx in 1:1:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.194961 seconds (4.89 M allocations: 150.909 MiB, 44.43% gc time)

Notice how the allocation count is not linear: initially the cache seems to be working properly, because accessing 10x more elements didn't cause 10x the allocations, but later you can see it scales linearly.

@Moelf
Member Author

Moelf commented Jul 26, 2021

@oschulz I think I found a "real" minimal reproducing example

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> struct A
           v::VectorOfVectors{Float32}
       end

julia> struct B
           v::Vector{Vector{Float32}}
       end

julia> const _a = A(Vov);

julia> const _b = B(V);

julia> function f(stru, idx)
           @inbounds stru.v[idx]
       end
f (generic function with 1 method)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_a, i)
           end
       end
  0.005362 seconds (199.49 k allocations: 6.096 MiB)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_b, i)
           end
       end
  0.000216 seconds

In fact, for some reason, just getting an index allocates in @benchmark even with @inbounds:

julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min  max):  22.168 ns   1.808 μs  ┊ GC (min  max): 0.00%  96.08%
 Time  (median):     23.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.942 ns ± 48.507 ns  ┊ GC (mean ± σ):  5.52% ±  3.06%

  ▇██▆▅▄▃▁        ▁                ▁▁▁▁                       ▂
  ████████▇▆▆▄▆▆███▇▆▅▅▄▃▃▂▂▂▄▄▇█████████████▇███▇█▇▅▅▆▅▄▆▅▆▆ █
  22.2 ns      Histogram: log(frequency) by time      50.5 ns <

 Memory estimate: 48 bytes, allocs estimate: 1.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.344 ns  15.178 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.514 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.521 ns ±  0.242 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        █▁                                                    
  ▂▃▂▂▁▂███▄▂▂▄▃▂▂▁▁▄▃▂▂▁▁▂▄▂▂▁▁▁▅▇▅▄▂▁▁▂█▄▄▂▂▁▁▃▇▅▄▃▂▁▁▁▂▂▂ ▂
  1.34 ns        Histogram: frequency by time        1.73 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@oschulz
Member

oschulz commented Jul 26, 2021

That one's easy to explain: your struct A doesn't allow for type stability, since VectorOfVectors{Float32} isa UnionAll (because VectorOfVectors has several type parameters). Try this version:

using ArraysOfArrays, BenchmarkTools

const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

const Vov = VectorOfVectors(V)

struct Foo{VV<:AbstractVector{<:AbstractVector{<:Real}}}
    v::VV
end

const _a = Foo(Vov);

const _b = Foo(V);

function f(stru, idx)
    @inbounds stru.v[idx]
end
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.977 ns  42.563 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     2.307 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.337 ns ±  0.692 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▇    ▅▁                 █▃    █▆     ▁  
  ▆▅▂▂▁▂▁▁▁▃▄▂▂▁▄▇▂▂▁▆█▄▅█▃██▃▃▁▃▇▄▅▃▂▂▂▆▇▂▂▁▃██▃▃▁▃██▃▃▁▁▆█ ▄
  1.98 ns        Histogram: frequency by time        2.54 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.108 ns  29.334 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.219 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.215 ns ±  0.492 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                         ▅█▅▁▇▄               
  ▂▅▇▆▆▄▅▃▂▂▁▁▁▁▁▁▁▁▁▃▄▃▄▄▄▃▂▂▃▆▅▃▅▅▄▂▂▂▅███████▄▃▂▂▃▂▂▂▂▂▁▁ ▃
  1.11 ns        Histogram: frequency by time        1.27 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

@tamasgal
Member

Well spotted @oschulz. I think JET.jl should reveal such things quite easily.
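For example, something along these lines (assuming a recent JET.jl; the struct mirrors the snippet above):

using JET, ArraysOfArrays

struct A
    v::VectorOfVectors{Float32}   # non-concrete (UnionAll) field type
end

f(stru, idx) = @inbounds stru.v[idx]

_a = A(VectorOfVectors([rand(Float32, 3) for _ in 1:10]))

# reports the runtime dispatch caused by the non-concrete field
JET.@report_opt f(_a, 1)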

@Moelf
Member Author

Moelf commented Jul 26, 2021

But it doesn't solve the whole problem:

julia> struct C
           v::VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
       end

julia> const _c = C(Vov);

julia> @benchmark begin
           for i = 1:10^5
               f(_c, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  89.424 μs  547.878 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     89.486 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.538 μs ±  14.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▂ ▅▃       ▁  ▁  ▁                                       ▁
  █▆█▅█▃██▅█▇▆█▇██▇▇█▆▆█▆▆█▆▆▅▅▅▅▅▇▇▆▆▇▇▅▅▆▅▂▃▅▄▄▄▅▃▄▅▅▅▄▃▃▃▃▂ █
  89.4 μs       Histogram: log(frequency) by time       153 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i = 1:10^5
               f(_b, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min  max):  39.804 μs  117.688 μs  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     39.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.774 μs ±   6.077 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █  ▂ ▄▂          ▁  ▁                                        ▁
  ██▇█▇███▇▅▇▅█▄█▇▅█▅▅█▃▄▅▇▇▇▇▇▆▅▃▆▆▆▆▆▆▅▅▆▆▅▅▅▄▅▅▃▄▄▁▄▃▁▃▅▅▅▇ █
  39.8 μs       Histogram: log(frequency) by time      74.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Tested to have the same speed as Foo(), but when I tested this for LazyBranch, it still allocates a lot (although 2x faster than when the type is not stable).

@oschulz
Member

oschulz commented Jul 26, 2021

But this is just 1 ns vs. 0.5 ns for the access to the element vectors. Compared to an actual operation that does something real-life on those element vectors, that time difference should be negligible.

I don't think ArraysOfArrays causes the allocations, as long as things are type-stable around it.

@tamasgal
Member

I am wondering if this is somehow related to runtime dispatch deeper down the rabbit hole, but the difference is still quite "small".

@Moelf
Member Author

Moelf commented Jul 26, 2021

I tried the ultimate solution, which is just to add another type parameter to track the buffer type, and it seems to be working (a rough sketch of the idea is shown after the timings below):

now:

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.154825 seconds (26.07 k allocations: 177.184 MiB, 15.19% gc time, 3.46% compilation time)


julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.026082 seconds (12.45 k allocations: 1.470 MiB)

master:

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  3.676705 seconds (26.81 M allocations: 1.196 GiB, 15.58% gc time, 25.18% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.021609 seconds (5.90 k allocations: 767.688 KiB)

I still can't believe how well current master works with cache....
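Roughly, the idea is the following (a sketch only, with made-up names; UnROOT's actual LazyBranch has more fields):

using ArraysOfArrays

# Carry the concrete buffer type B as a type parameter, so field access and
# getindex stay type-stable, instead of storing an abstract VectorOfVectors{T}.
struct MyLazyBranch{B<:AbstractVector}
    buffer::B
end

Base.getindex(b::MyLazyBranch, i::Int) = @inbounds b.buffer[i]
Base.length(b::MyLazyBranch) = length(b.buffer)

vov = VectorOfVectors([rand(Float32, rand(0:4)) for _ in 1:10^5])
mb = MyLazyBranch(vov)   # fully concrete type, so indexing is type-stable
mb[1]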

@tamasgal
Member

Ha! Indeed 😆 That looks promising.

@tamasgal
Member

@all-contributors please add @oschulz ideas

@allcontributors
Contributor

@tamasgal

I've put up a pull request to add @oschulz! 🎉

Moelf closed this as completed in #65 on Jul 27, 2021
@oschulz
Member

oschulz commented Jul 27, 2021

Very nice @Moelf !

I hope I can find some time in the future to contribute more directly. I'm happy with UpROOT.jl as a stopgap, but being able to deal with ROOT files without installing a whole Python stack is very attractive. :-) Also, we should be able to outperform uproot a bit, esp. on ROOT files with small buffer sizes.

@Moelf
Member Author

Moelf commented Jul 27, 2021

We already do ;) I believe our lazy iteration (when cached) is within 2x of a concrete Numba loop over allocated arrays, which means we're probably a lot faster than chunked uproot code.
