Performance regression since v0.8.0 #1075

schlurp · 2023-02-28T15:30:31Z

I think a performance regression similar to the one in #752 has happened again. If I test with v0.8.0, the version that contained the fix for that issue, I get

(blu) pkg> st
      Status `/tmp/blu/Project.toml`
   [6e4b80f9] BenchmarkTools v1.3.2
   [336ed68f] CSV v0.8.0 `~/repos/CSV.jl`
   [9a3f8284] Random

julia> @benchmark read($rows)

BenchmarkTools.Trial: 72 samples with 1 evaluation.
 Range (min … max):  55.617 ms … 82.286 ms  ┊ GC (min … max): 0.00% … 1.79%
 Time  (median):     70.487 ms              ┊ GC (median):    2.00%
 Time  (mean ± σ):   69.935 ms ±  5.109 ms  ┊ GC (mean ± σ):  1.19% ± 1.05%

                      ▁▁  ▁ ▁       █ ▁▁▁▁                     
  ▄▁▁▁▁▁▁▄▁▁▁▁▄▁▁▄▇▄▄▄██▇▇█▄█▇▇▇▄▄▇▄█▇████▇▇▁▇▁▇▁▄▇▄▁▁▁▄▁▁▁▁▄ ▁
  55.6 ms         Histogram: frequency by time        82.1 ms <

 Memory estimate: 24.41 MiB, allocs estimate: 800000.

With the latest version, v0.10.9, I get

(blu) pkg> st
      Status `/tmp/blu/Project.toml`
  [6e4b80f9] BenchmarkTools v1.3.2
  [336ed68f] CSV v0.10.9
  [9a3f8284] Random

julia> @benchmark read($rows)
BenchmarkTools.Trial: 10 samples with 1 evaluation.
 Range (min … max):  519.511 ms … 575.751 ms  ┊ GC (min … max): 2.39% … 1.39%
 Time  (median):     537.934 ms               ┊ GC (median):    1.90%
 Time  (mean ± σ):   537.647 ms ±  16.968 ms  ┊ GC (mean ± σ):  1.91% ± 0.42%

  ██  █   █         █  █ ██         █                         █  
  ██▁▁█▁▁▁█▁▁▁▁▁▁▁▁▁█▁▁█▁██▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  520 ms           Histogram: frequency by time          576 ms <

 Memory estimate: 109.86 MiB, allocs estimate: 2400000.

Profiling shows a lot of calls to coltype, which accounts for about 73% of the samples under getcolumn:

                                               351   100% |   getcolumn /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:319
         7  0.38% 79.55%        351 18.99%                | getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Int64) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:320
                                               258 73.50% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl:23
                                                45 12.82% |   jl_apply_generic /home/domi/software/julia/julia/src/gf.c:2425
                                                16  4.56% |   getcolumn(::CSV.Row2{Parsers.PosLen, PosLenString}, ::Type{Union{Missing, PosLenString}}, ::Int64, ::Symbol) /home/domi/.julia/packages/CSV/b8ebJ/src/rows.jl:-1
                                                 3  0.85% |   coltype(::CSV.Column) /home/domi/.julia/packages/CSV/b8ebJ/src/utils.jl
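The report does not say which tool produced the listing above. As a minimal sketch, a comparable call-tree view can be collected with the Profile standard library. The `workload` function here is a hypothetical stand-in; for this issue it would be replaced by the `read(rows)` call from the benchmark script.

```julia
using Profile

# Hypothetical stand-in workload (not from the issue); swap in `read(rows)`
# from the benchmark script to profile the real case.
function workload(n)
    acc = UInt(0)
    for i in 1:n
        acc += hash(string(i))   # UInt arithmetic wraps on overflow
    end
    return acc
end

workload(10)                     # run once so compilation is not profiled
Profile.clear()                  # drop samples left over from earlier runs
@profile workload(2_000_000)

# Tree view: nested callers and callees with sample counts.
# `mincount` hides frames with fewer than the given number of samples.
Profile.print(format = :tree, mincount = 10)
```

A frame that dominates its parent's sample count, as coltype does under getcolumn above, is usually the first place to look.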


schlurp · 2023-02-28T16:18:23Z

git bisect tells me that commit bfd415d (CSV parsing internals refactoring, #837) is the first bad commit:

git bisect start
# status: waiting for both good and bad commits
# bad: [cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8] `typemap`: switch to IdDict (#1069)
git bisect bad cfb4ffb5d9847df4f8d5efa5129b27026a83a4a8
# status: waiting for good commit(s), bad commit known
# good: [c94256adbe6c2ae017be90ae91976f3b5bb74aa4] bump version
git bisect good c94256adbe6c2ae017be90ae91976f3b5bb74aa4
# bad: [549b1ab03155c8c96485406881addcc65ef914e8] Take dependency on InlineStrings package (#923)
git bisect bad 549b1ab03155c8c96485406881addcc65ef914e8
# good: [f405361298ac09692024c0afdf546df880899223] Bump version
git bisect good f405361298ac09692024c0afdf546df880899223
# bad: [ea9eca2de56470a8e585ae4ee92495a88a632fbb] bump version
git bisect bad ea9eca2de56470a8e585ae4ee92495a88a632fbb
# bad: [17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4] Overhaul CSV.jl docs (#869)
git bisect bad 17bffa1a5e2bc570e913d1f0bac98e65e3aeb1e4
# bad: [0eaacb4c4d5787a5186dac8edf523ee9052db27f] Keyword argument cleanup in preparation for 1.0 release (#846)
git bisect bad 0eaacb4c4d5787a5186dac8edf523ee9052db27f
# bad: [ffda8d35793eea4f254f51de213527e5ed55359a] Fix nightly
git bisect bad ffda8d35793eea4f254f51de213527e5ed55359a
# bad: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)
git bisect bad bfd415d0af4c7c1842ecc5b54f1e4a18b125a264
# good: [fc209672d0a894954c5dc1e0835e0426b9d2925c] make "Edit on Github" points to main branch (#835)
git bisect good fc209672d0a894954c5dc1e0835e0426b9d2925c
# first bad commit: [bfd415d0af4c7c1842ecc5b54f1e4a18b125a264] CSV parsing internals refactoring (#837)

I tested with the following script:

using CSV
using BenchmarkTools
using Random
using Pkg

Pkg.resolve()
Pkg.status()

Random.seed!(0)
open("test.csv", "w") do f
    for _ in 1:100_000
        write(f, join([randstring('a':'z') for _ in 1:8], ","))
        write(f, "\n")
    end
end
function read(rows)
    bla = 0
    for r in rows
        bla += hash(r.a)
        bla += hash(r.b)
        bla += hash(r.c)
        bla += hash(r.d)
        bla += hash(r.e)
        bla += hash(r.f)
        bla += hash(r.g)
        bla += hash(r.h)
    end
    bla
end

rows = CSV.Rows("test.csv", reusebuffer=true, header=Symbol.('a':'h'))
bench = @benchmarkable read($rows)
tune!(bench)
results = run(bench)
show(stdout::IO, MIME"text/plain"(), results)
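The bisect log above marks each commit by hand. One way to automate it, sketched here under assumptions, is a small wrapper for `git bisect run`, which treats exit code 0 as good and 1 as bad. The names `bisect_exit_code`, `time_workload`, `stand_in_workload`, and the 0.2 s cut-off are hypothetical; the cut-off is simply picked between the roughly 0.07 s and 0.54 s medians reported above.

```julia
# Assumed cut-off in seconds, chosen between the two timings in the issue.
const THRESHOLD_S = 0.2

# Map a measured time to the exit code that `git bisect run` expects:
# 0 marks the commit good, 1 marks it bad.
bisect_exit_code(seconds; threshold = THRESHOLD_S) = seconds > threshold ? 1 : 0

# Time `f` several times after a warm-up call and keep the fastest run,
# which is less sensitive to noise than a single measurement.
function time_workload(f; repeats = 5)
    f()  # warm-up, so compilation is not measured
    return minimum([@elapsed(f()) for _ in 1:repeats])
end

# Hypothetical stand-in; for this issue it would call `read(rows)`.
stand_in_workload() = sum(hash, 1:1_000_000)

# Only exit when run as a script, so the helpers can also be loaded elsewhere.
if abspath(PROGRAM_FILE) == @__FILE__
    exit(bisect_exit_code(time_workload(stand_in_workload)))
end
```

Saved as, say, `bisect_check.jl` (a hypothetical file name), it could be invoked as `git bisect run julia --project bisect_check.jl` after marking one good and one bad commit, assuming the project uses the CSV.jl checkout being bisected.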

schlurp mentioned this issue Feb 28, 2023

Performance regressions CSV.Rows since 0.5? #752

Closed

schlurp changed the title ~~Performance regression since v0.5~~ Performance regression since v0.8.0 Feb 28, 2023

nickrobinson251 added the performance label May 31, 2023
