Performance regression since v0.8.0 #1075

Open
schlurp opened this issue Feb 28, 2023 · 1 comment

schlurp commented Feb 28, 2023

Hi @quinnj,

I think a performance regression similar to the one in #752 has happened again. Testing with v0.8.0, the version that contains the fix for that issue, I get

(blu) pkg> st
      Status `/tmp/blu/Project.toml`
   [6e4b80f9] BenchmarkTools v1.3.2
   [336ed68f] CSV v0.8.0 `~/repos/CSV.jl`
   [9a3f8284] Random
julia> @benchmark read($rows)

BenchmarkTools.Trial: 72 samples with 1 evaluation.
 Range (min … max):  55.617 ms … 82.286 ms  ┊ GC (min … max): 0.00% … 1.79%
 Time  (median):     70.487 ms              ┊ GC (median):    2.00%
 Time  (mean ± σ):   69.935 ms ±  5.109 ms  ┊ GC (mean ± σ):  1.19% ± 1.05%

                      ▁▁  ▁ ▁       █ ▁▁▁▁                     
  ▄▁▁▁▁▁▁▄▁▁▁▁▄▁▁▄▇▄▄▄██▇▇█▄█▇▇▇▄▄▇▄█▇████▇▇▁▇▁▇▁▄▇▄▁▁▁▄▁▁▁▁▄ ▁
  55.6 ms         Histogram: frequency by time        82.1 ms <

 Memory estimate: 24.41 MiB, allocs estimate: 800000.

With the latest version, v0.10.9, I get

(blu) pkg> st
      Status `/tmp/blu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [9a3f8284] Random
julia> @benchmark read($rows)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  519.511 ms … 575.751 ms  ┊ GC (min … max): 2.39% … 1.39%
 Time  (median):     537.934 ms               ┊ GC (median):    1.90%
 Time  (mean ± σ):   537.647 ms ±  16.968 ms  ┊ GC (mean ± σ):  1.91% ± 0.42%

  ██  █   █         █  █ ██         █                         █  
  ██▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁██▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  520 ms           Histogram: frequency by time          576 ms <

 Memory estimate: 109.86 MiB, allocs estimate: 2400000.
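
To put the two runs side by side, here is a small sketch that only does arithmetic on the figures printed above. The row and field counts are taken from the reproduction script later in this thread (100,000 rows, fields `a` through `h`); the variable names are made up for illustration.

```julia
# Figures copied from the two @benchmark outputs above.
old = (median_ms = 70.487,  mem_mib = 24.41,  allocs = 800_000)    # CSV v0.8.0
new = (median_ms = 537.934, mem_mib = 109.86, allocs = 2_400_000)  # CSV v0.10.9

nrows   = 100_000  # rows written to test.csv by the reproduction script
nfields = 8        # fields a..h, each read once per row

# Ratios of new to old for time, memory, and allocation count.
slowdown     = new.median_ms / old.median_ms
memory_ratio = new.mem_mib / old.mem_mib
alloc_ratio  = new.allocs / old.allocs

# Allocations per field access in each version.
allocs_per_field_old = old.allocs / (nrows * nfields)
allocs_per_field_new = new.allocs / (nrows * nfields)

println("median slowdown:       ", round(slowdown; digits = 2), "x")
println("memory ratio:          ", round(memory_ratio; digits = 2), "x")
println("allocation ratio:      ", alloc_ratio, "x")
println("allocs per field, old: ", allocs_per_field_old)
println("allocs per field, new: ", allocs_per_field_new)
```

The median time grows by roughly 7.6x, and the allocation estimate goes from one allocation per field access to three.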

Profiling shows that most of the samples in getcolumn (258 of 351, about 73%) come from calls to coltype:

                                               351   100% |   getcolumn /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:319
         7  0.38% 79.55%        351 18.99%                | getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Int64) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:320
                                               258 73.50% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl:23
                                                45 12.82% |   jl_apply_generic /home/domi/software/julia/julia/src/gf.c:2425
                                                16  4.56% |   getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Type{Union{Missing, PosLenString}}, ::Int64, ::Symbol) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:-1
                                                 3  0.85% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl
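
For anyone who wants to collect a comparable profile, here is a minimal sketch using Julia's `Profile` standard library. The listing above may well have been produced with a different tool, so the output format will not match. To keep the sketch runnable without CSV.jl, `touch_fields` and the `rows` vector are hypothetical stand-ins; with CSV.jl, `rows` would be the `CSV.Rows` object and `touch_fields` the `read` function from the reproduction script.

```julia
using Profile

# Hypothetical stand-in for the `read` function from the reproduction script:
# it touches fields of every row so that field access dominates the samples.
function touch_fields(rows)
    acc = UInt(0)
    for r in rows
        acc += hash(r.a)
        acc += hash(r.b)
    end
    return acc
end

# Hypothetical stand-in data (a vector of NamedTuples), not a CSV.Rows object.
rows = [(a = string(i), b = string(i + 1)) for i in 1:200_000]

touch_fields(rows)            # run once first so compilation is not profiled
Profile.clear()               # drop samples left over from earlier runs
@profile touch_fields(rows)   # collect stack samples while the loop runs

# Flat per-frame listing, sorted by number of collected samples.
Profile.print(format = :flat, sortedby = :count)
```

Warming up before `@profile` matters here: otherwise the samples are dominated by compilation rather than by the field-access path under investigation.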

schlurp changed the title from "Performance regression since v0.5" to "Performance regression since v0.8.0" on Feb 28, 2023

schlurp commented Feb 28, 2023

git bisect identifies commit bfd415d ("CSV parsing internals refactoring", #837) as the first bad commit:

git bisect start
# status: waiting for both good and bad commits
# bad: [cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8] `typemap`: switch to IdDict (#1069)
git bisect bad cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8
# status: waiting for good commit(s), bad commit known
# good: [c94256adbe6c2ae017be90ae91976f3b5bb74aa4] bump version
git bisect good c94256adbe6c2ae017be90ae91976f3b5bb74aa4
# bad: [549b1ab03155c8c96485406881addcc65ef914e8] Take dependency on InlineStrings package (#923)
git bisect bad 549b1ab03155c8c96485406881addcc65ef914e8
# good: [f405361298ac09692024c0afdf546df880899223] Bump version
git bisect good f405361298ac09692024c0afdf546df880899223
# bad: [ea9eca2de56470a8e585ae4ee92495a88a632fbb] bump version
git bisect bad ea9eca2de56470a8e585ae4ee92495a88a632fbb
# bad: [17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4] Overhaul CSV.jl docs (#869)
git bisect bad 17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4
# bad: [0eaacb4c4d5787a5186dac8edf523ee9052db27f] Keyword argument cleanup in preparation for 1.0 release (#846)
git bisect bad 0eaacb4c4d5787a5186dac8edf523ee9052db27f
# bad: [ffda8d35793eea4f254f51de213527e5ed55359a] Fix nightly
git bisect bad ffda8d35793eea4f254f51de213527e5ed55359a
# bad: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
git bisect bad bfd415d0af4c7c1842ecc5b54f1e4a18b125a264
# good: [fc209672d0a894954c5dc1e0835e0426b9d2925c] make "Edit on Github" points to main branch (#835)
git bisect good fc209672d0a894954c5dc1e0835e0426b9d2925c
# first bad commit: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)

Tested with this script:

using CSV
using BenchmarkTools
using Random
using Pkg

Pkg.resolve()
Pkg.status()

# Generate a reproducible 100,000-row, 8-column CSV of random lowercase strings.
Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        write(f, join([randstring('a':'z') for _ in 1:8], ","))
        write(f, "\n")
    end
end

# Touch every field of every row so that field access dominates the timing.
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

# Iterate rows lazily with a reused buffer; the columns are named a..h.
rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
bench = @benchmarkable read($rows)
tune!(bench)
results = run(bench)
show(stdout::IO, MIME"text/plain"(), results)
