Poor performance when weaving many branches together #64
Using ArraysOfArrays.jl should definitely help to reduce the number of memory allocations a lot.
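For illustration, a minimal sketch of why that helps (my example, not from the thread): a `VectorOfVectors` keeps all elements in one flat backing vector plus an element-pointer vector, so the jagged structure costs O(1) allocations instead of one per inner vector.

```julia
using ArraysOfArrays

jagged = [rand(Float32, rand(0:4)) for _ in 1:10]  # plain Vector{Vector{Float32}}
vov = VectorOfVectors(jagged)                      # flat storage + element pointers

vov[3] == jagged[3]   # true; indexing returns a view into the flat data, no copy
flatview(vov)         # the single underlying Vector{Float32}
```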
Why is …
because it's used like this later:

```julia
# isolation cut, require all 4 of them to be true
if all((
    v_l_passIso[pr1[1]][1],
    v_l_passIso[pr1[2]][1],
    v_l_passIso[W_id[1]][1],
    v_l_passIso[W_id[2]][1],
))
else
    return false, wgt
end
```

and

```julia
if (abs(v_l_pid[fifth_l]) == 11)
    !v_l_passIso[fifth_l][4] && return false, wgt
end
if (abs(v_l_pid[fifth_l]) == 13)
    !v_l_passIso[fifth_l][3] && return false, wgt
end
```

In short, it needs to have the same order as all the other "lepton-related" branches, so you can index all of them with the same number that identifies a unique lepton in the event.
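A hypothetical illustration of that layout (toy values, not from the actual ntuple): all per-lepton branches are parallel jagged arrays, so one index selects the same lepton everywhere.

```julia
# Toy example: the same index i refers to the same lepton in every branch.
v_l_pid     = [11, 13, 11]                   # PDG id per lepton
v_l_passIso = [[true, true, false, true],    # 4 isolation flags per lepton
               [true, false, true, true],
               [false, true, true, true]]

i = 2                                        # pick one lepton
abs(v_l_pid[i]) == 13 && v_l_passIso[i][3]   # a muon: check its 3rd iso flag
```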
I guess one option would be to use a static array or tuple, since the length of the elements of …
it is not; it only contains entries for leptons that passed …
```diff
index 86d8909..e26ff74 100644
--- a/src/root.jl
+++ b/src/root.jl
@@ -215,11 +215,16 @@ function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:J
-    @views [
-        ntoh.(reinterpret(
-            T, rawdata[ (rawoffsets[i]+jagg_offset+1):rawoffsets[i+1] ]
-        )) for i in 1:(length(rawoffsets) - 1)
-    ]
+    _size = sizeof(eltype(T))
+    data = UInt8[]
+    offset = Int64[0] # ROOT offsets are 0-based
+    @views @inbounds for i in 1:(length(rawoffsets) - 1)
+        rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
+        append!(data, rawdata[rg])
+        push!(offset, last(offset) + length(rg))
+    end
+    real_data = ntoh.(reinterpret(T, data))
+    VectorOfVectors(real_data, offset .÷ _size .+ 1)
 end
```

Here are the numbers AFTER applying this diff:

```
# 1st run
17.403734 seconds (72.69 M allocations: 6.877 GiB, 9.20% gc time, 3.31% compilation time)
# 2nd run
 3.135352 seconds (34.34 M allocations: 2.249 GiB, 12.11% gc time)
# 3rd run
 2.990273 seconds (34.34 M allocations: 2.249 GiB, 9.52% gc time)
```

BEFORE this diff:

```
# 1st run
32.069656 seconds (213.86 M allocations: 11.560 GiB, 21.16% gc time, 1.26% compilation time)
# 2nd run
 3.976044 seconds (27.73 M allocations: 1.779 GiB, 21.24% gc time)
# 3rd run
 3.725007 seconds (27.73 M allocations: 1.779 GiB, 13.36% gc time)
```

@oschulz definitely helpful at the asymptotic limit! A little bit confused by why subsequent runs allocate more; maybe because of some weird hidden conversion somewhere…
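As a worked check of the offset arithmetic in that diff (my numbers, chosen for illustration): dividing the cumulative byte offsets by the element size and adding 1 yields the 1-based element pointers that `VectorOfVectors` expects.

```julia
# Assumed toy values: three Float32 element vectors of lengths 3, 0, and 2.
_size  = 4                   # sizeof(Float32)
offset = [0, 12, 12, 20]     # cumulative byte offsets built up in the loop
offset .÷ _size .+ 1         # [1, 4, 4, 6] → element vectors of lengths 3, 0, 2
```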
Using …
Ah, sorry, I overlooked the …
Maybe try this version:

```julia
function interped_data(rawdata, rawoffsets, ::Type{T}, ::Type{J}) where {T, J<:JaggType}
    # ...
    _size = Base.SignedMultiplicativeInverse{Int}(sizeof(eltype(T)))
    data = UInt8[]
    elem_ptr = sizehint!(Vector{Int}(), length(eachindex(rawoffsets)))
    push!(elem_ptr, 0)
    # Can we sizehint! data too?
    @inbounds for i in eachindex(rawoffsets)[1:end-1]
        rg = (rawoffsets[i]+jagg_offset+1) : rawoffsets[i+1]
        append!(data, view(rawdata, rg))
        push!(elem_ptr, last(elem_ptr) + div(length(rg), _size))
    end
    real_data = ntoh.(reinterpret(T, data))
    elem_ptr .+= firstindex(real_data)
    VectorOfVectors(real_data, elem_ptr)
    # ...
end
```
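As background (my note, not part of the thread): `Base.SignedMultiplicativeInverse` precomputes a magic multiplier so that repeated `div` by a fixed element size compiles to a multiply-and-shift instead of a hardware integer division on every iteration.

```julia
# div by a precomputed multiplicative inverse gives the same result as a
# plain div, but avoids an integer-division instruction in hot loops.
inv_size = Base.SignedMultiplicativeInverse{Int}(4)  # e.g. sizeof(Float32)
div(20, inv_size) == div(20, 4)                      # true
```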
maybe something smart like …
Hm, I'm not sure where these allocs are coming from, but it shouldn't be from within …
See #65 (comment) for a minimal reproduction of the mysterious allocation.
I'll try to have a go at it…
I have reduced the minimal reproduction:

```julia
julia> using ArraysOfArrays, BenchmarkTools

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> @benchmark begin
           for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  50.451 μs … 512.027 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.019 μs ±  14.257 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 9314 samples with 1 evaluation.
 Range (min … max):  472.014 μs …   1.195 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     514.337 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   533.550 μs ±  70.951 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> Vov == V
true
```

Maybe this can be an ArraysOfArrays issue, not sure. I tested the newly added …
I don't think this is the issue: VectorOfVectors has to do a little more work to check bounds and create the views than a plain `Vector` of `Vector`s does:

```julia
julia> using ArraysOfArrays, BenchmarkTools

julia> V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> Vov = VectorOfVectors(V);

julia> @benchmark begin
           @inbounds for i in eachindex($V)
               length($V[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  34.919 μs … 132.067 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.391 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.404 μs ±   7.572 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           @inbounds for i in eachindex($Vov)
               length($Vov[i])
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  77.848 μs … 226.026 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     87.407 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.920 μs ±  16.003 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```

So that's about 1 ns for each element access (view creation plus …
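When a whole-branch reduction is all that's needed, one way to skip the per-element view creation entirely (my sketch, using only the public `flatview` accessor) is to reduce over the flat backing vector:

```julia
using ArraysOfArrays

vov = VectorOfVectors([rand(Float32, rand(0:4)) for _ in 1:10^5])

s1 = sum(sum, vov)       # one view per inner vector (~1 ns each, as measured)
s2 = sum(flatview(vov))  # single pass over the flat backing vector, no views
s1 ≈ s2                  # true
```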
Some new observations:

```julia
julia> using UnROOT

julia> const t1 = LazyTree(ROOTFile("/home/akako/Downloads/doublemu.root"), "t", [r"^Muon_(pt|eta|phi|mass)$", "MET_pt"]);

julia> length(t1.Muon_pt) ÷ 10^4
247

julia> @time begin
           for idx in 1:10^4:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.000530 seconds (3.56 k allocations: 406.859 KiB)

julia> @time begin
           for idx in 1:10^3:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.001023 seconds (8.01 k allocations: 546.172 KiB)

julia> @time begin
           for idx in 1:10^2:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.005169 seconds (51.95 k allocations: 1.884 MiB)

julia> @time begin
           for idx in 1:10:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.014718 seconds (494.29 k allocations: 15.714 MiB)

julia> @time begin
           for idx in 1:1:length(t1.Muon_pt)
               t1.Muon_pt[idx]
           end
       end
  0.194961 seconds (4.89 M allocations: 150.909 MiB, 44.43% gc time)
```

Notice how the allocation count is not linear in the number of accesses at first. It seems the basket cache is working properly, because accessing 10x more elements initially didn't cause 10x the allocations, but at small strides it scales linearly.
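A back-of-envelope model of that behavior (my assumption about the caching, with made-up weights): if each basket is decoded at most once per pass, allocations look like a per-touched-basket term plus a small per-access term, flat at large strides and linear at small ones.

```julia
# Hypothetical model: not UnROOT's actual accounting, just the shape.
nevents  = 2_470_000                       # ≈ length(t1.Muon_pt) above
nbaskets = 250                             # assumed basket count for the branch
naccess(stride) = cld(nevents, stride)
touched(stride) = min(naccess(stride), nbaskets)
model(stride)   = 10 * touched(stride) + 2 * naccess(stride)  # arbitrary weights

[model(s) for s in (10^4, 10^3, 10^2, 10, 1)]  # flattens early, then linear
```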
@oschulz I think I found a "real" minimal reproducing example:

```julia
julia> using ArraysOfArrays, BenchmarkTools

julia> const Vov = VectorOfVectors{Float32}()
0-element VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}

julia> const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];

julia> foreach(V) do i
           push!(Vov, i)
       end

julia> struct A
           v::VectorOfVectors{Float32}
       end

julia> struct B
           v::Vector{Vector{Float32}}
       end

julia> const _a = A(Vov);

julia> const _b = B(V);

julia> function f(stru, idx)
           @inbounds stru.v[idx]
       end
f (generic function with 1 method)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_a, i)
           end
       end
  0.005362 seconds (199.49 k allocations: 6.096 MiB)

# warm run
julia> @time begin
           for i = 1:10^5
               f(_b, i)
           end
       end
  0.000216 seconds
```

In fact, for some reason, indexing only allocates in the `VectorOfVectors` case:

```julia
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  22.168 ns …  1.808 μs  ┊ GC (min … max): 0.00% … 96.08%
 Time  (median):     23.265 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.942 ns ± 48.507 ns  ┊ GC (mean ± σ):  5.52% ±  3.06%
 Memory estimate: 48 bytes, allocs estimate: 1.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.344 ns … 15.178 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.514 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.521 ns ±  0.242 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```
That one's easy to explain: in your struct `A`, the field type `VectorOfVectors{Float32}` is not concrete (it leaves the other type parameters free), so access to the field is type-unstable. Parametrize the struct instead:

```julia
using ArraysOfArrays, BenchmarkTools

const V = [rand(Float32, rand(0:4)) for _ = 1:10^5];
const Vov = VectorOfVectors(V)

struct Foo{VV<:AbstractVector{<:AbstractVector{<:Real}}}
    v::VV
end

const _a = Foo(Vov);
const _b = Foo(V);

function f(stru, idx)
    @inbounds stru.v[idx]
end
```

```julia
julia> @benchmark f($_a, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.977 ns … 42.563 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.307 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.337 ns ±  0.692 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark f($_b, 123)
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  1.108 ns … 29.334 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.219 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.215 ns ±  0.492 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```
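A quick way to see the difference (my suggestion, standard tooling): compare inference on the two layouts with `@code_warntype`.

```julia
using InteractiveUtils

# Like the earlier A: VectorOfVectors{Float32} is a UnionAll, so the
# field is abstractly typed and getindex cannot be inferred concretely.
struct A2
    v::VectorOfVectors{Float32}
end
a2 = A2(Vov)

@code_warntype f(a2, 123)  # abstract return type → one allocation per call
@code_warntype f(_a, 123)  # parametric Foo: concrete view type, no boxing
```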
Well spotted, @oschulz. I think …
but it doesn't solve the whole problem:

```julia
julia> struct C
           v::VectorOfVectors{Float32, Vector{Float32}, Vector{Int64}, Vector{Tuple{}}}
       end

julia> const _c = C(Vov);

julia> @benchmark begin
           for i = 1:10^5
               f(_c, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  89.424 μs … 547.878 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     89.486 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.538 μs ±  14.213 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark begin
           for i = 1:10^5
               f(_b, i)
           end
       end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  39.804 μs … 117.688 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     39.957 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   41.774 μs ±   6.077 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 Memory estimate: 0 bytes, allocs estimate: 0.
```

(`C` is tested to have the same speed as …)
But this is just 1 ns vs. 0.5 ns for the access to the element vectors. Compared to an actual operation that does something real-life on those element vectors, that time difference should be negligible. I don't think ArraysOfArrays causes the allocations, as long as things are type-stable around it.
I am wondering if this is somehow related to runtime dispatch deeper down the rabbit hole, but the difference is still quite "small".
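One way to hunt for such hidden runtime dispatch (my suggestion; JET.jl is a third-party tool not used in this thread):

```julia
# JET's optimization analysis reports runtime-dispatch and boxing sites
# that inference could not resolve statically.
using JET

@report_opt f(_c, 1)
```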
I tried the ultimate solution, which is just to add another type parameter to track the buffer type, and it seems to be working. Now:

```julia
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.154825 seconds (26.07 k allocations: 177.184 MiB, 15.19% gc time, 3.46% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.026082 seconds (12.45 k allocations: 1.470 MiB)
```

master:

```julia
julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  3.676705 seconds (26.81 M allocations: 1.196 GiB, 15.58% gc time, 25.18% compilation time)

julia> @time begin
           for evt in t1
               length(evt.Muon_pt)
           end
       end
  0.021609 seconds (5.90 k allocations: 767.688 KiB)
```

I still can't believe how well the current …
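The idea, as a hypothetical sketch (illustrative names, not UnROOT's actual definitions): carry the concrete buffer type as a type parameter so reads from the cache stay inferable.

```julia
using ArraysOfArrays

# Hypothetical sketch: B is the concrete cached-buffer type, so buffer[i]
# is type-stable; an abstractly-typed field would box the result instead.
struct LazyBranchSketch{T, B<:AbstractVector}
    buffer::B
end

Base.getindex(b::LazyBranchSketch, i::Integer) = b.buffer[i]

buf = VectorOfVectors([rand(Float32, rand(0:4)) for _ in 1:10])
lb  = LazyBranchSketch{Float32, typeof(buf)}(buf)
lb[2]   # inferable: typeof(buf) is concrete
```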
Ha! Indeed 😆 That looks promising.
@all-contributors please add @oschulz ideas
I've put up a pull request to add @oschulz! 🎉
Very nice, @Moelf! I hope I can find some time in the future to contribute more directly. I'm happy with UpROOT.jl as a stopgap, but being able to deal with ROOT files without installing a whole Python stack is very attractive. :-) Also, we should be able to outperform uproot a bit, esp. on ROOT files with small buffer sizes.
We already do ;) I believe our lazy iteration (when cached) is within 2x of a concrete Numba loop over pre-allocated arrays, which means we're probably a lot faster than chunked uproot code.
Sometimes in HEP the memory access pattern is inherently bad. For example, we have jagged branches for electron- and muon-related quantities: Lorentz vectors, isolation flags, etc.

Let's say `e_mask` is a quality cut, and people want to extract isolation flags for electrons that passed the `e_mask`, which results in:

![image](https://user-images.githubusercontent.com/5306213/126881332-55be53dc-2f2c-4e67-b49e-a25bbbba4b4c.png)

Each of those piles is actually just the `v_e_passIso_HighPtCaloOnly = evt.v_e_passIso_HighPtCaloOnly` line.

This is probably horrible no matter what language you use, but I'm wondering if we should explore https://github.com/JuliaArrays/ArraysOfArrays.jl/ to see if we can cut down the allocations from making concrete vectors (or vectors of vectors), since in many cases they just get torn apart manually in the event loop.