Review benchmarks #3

Open
jeremiedb opened this issue Jan 20, 2023 · 6 comments

@jeremiedb

Following a conversation on Julia's Slack (https://julialang.slack.com/archives/C674VR0HH/p1674245762657489), it was raised that there might be caveats in how the benchmark against DataFrames.jl was conducted.

By wrapping operations into functions, it can be seen that DataFrames.jl is actually significantly outperforming tidytable.

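# packages needed for f0 and f1 (f2 additionally needs the tidytable package under review)
using DataFrames, Chain, Statistics, BenchmarkTools

# base DataFrames.jl
function f0(df)
    _df = subset(df, :Year => (x -> x .>= 2000))
    _df = groupby(_df, :Year)
    _df = combine(_df, :Budget => (x -> mean(skipmissing(x))) => :Budget)
    _df = transform!(_df, :Budget => (x -> x / 1e6) => :Budget)
    return _df
end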
# 1.257 ms (903 allocations: 1.78 MiB)
@btime f0($movies)

# chained DataFrames.jl
function f1(df)
  @chain df begin
      subset(:Year => (x -> x .>= 2000))
      groupby(:Year)
      combine(:Budget => (x -> mean(skipmissing(x))) => :Budget)
      transform(:Budget => (x -> x / 1e6) => :Budget)
  end
end
# 1.279 ms (905 allocations: 1.78 MiB)
@btime f1($movies)

# tidytable
function f2(df)
    @chain tidytable(df) begin
        @filter(Year >= 2000)
        @group_by(Year)
        @summarize(Budget = mean(Budget, na.rm = TRUE))
        @mutate(Budget = Budget / 1e6)
        collect()
    end
end
# 26.660 ms (118073 allocations: 5.40 MiB)
@btime f2($movies)
@pdeffebach

pdeffebach commented Jan 20, 2023

Here are some other functions to benchmark, using DataFramesMeta.jl, which compiles to the same operations as DataFrames.jl but has nicer syntax (more in line with the tidyverse).

julia> using DataFramesMeta 

julia> function clean_tt(movies)
           @chain tidytable(movies) begin
               @filter(Year >= 2000)
               @group_by(Year)
               @summarize(Budget = mean(Budget, na.rm = TRUE))
               @mutate(Budget = Budget/1e6)
               collect()
           end
       end;

julia> function clean_df(movies)
           @chain movies begin
               subset(:Year => (x -> x .>= 2000))
               groupby(:Year)
               combine(:Budget => (x -> mean(skipmissing(x))) => :Budget)
               transform(:Budget => (x -> x/1e6) => :Budget)
           end
       end;

julia> function clean_dfm(movies)
           @chain movies begin
               @rsubset :Year >= 2000
               groupby(:Year)
               @combine :Budget = mean(skipmissing(:Budget))
               @rtransform :Budget = :Budget / 1e6
           end
       end;

@kdpsingh
Owner

Thanks! I'll review and will fix. As I said on Twitter, this package shouldn't be taken too seriously because it's more of a learning project for me. That said, appreciate the advice on wrapping inside a function for benchmarking purposes.

Is the first run of the function any faster when it is wrapped in a function? Or only subsequent runs, because of precompilation?

I have read the Julia docs on optimization, so I would think the first run is still the same speed.
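
To make the first-run question concrete, here is a minimal sketch in plain Julia. `sum_of_squares` and `data` are hypothetical stand-ins, not code from this thread. The first call to a function with a given argument type includes JIT compilation, and later calls reuse the compiled code. Wrapping a pipeline in a function does not remove that first-call cost. What it likely helps with is that anonymous functions written inside the function body are defined once, rather than being created (and compiled against) afresh every time the pipeline is re-evaluated at top level.

```julia
# Hypothetical stand-in for a data pipeline; not taken from the thread.
sum_of_squares(xs) = sum(x -> x^2, xs)

data = rand(1_000)

# First call: includes compiling sum_of_squares for Vector{Float64}.
first_call = @elapsed sum_of_squares(data)

# Second call: reuses the already compiled method.
second_call = @elapsed sum_of_squares(data)

println("first call:  ", first_call, " s")
println("second call: ", second_call, " s")
```

Exact timings depend on the machine and Julia version, so none are shown here.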

I like both the DataFramesMeta and DataFrameMacros syntax. I wish there were a way to try to run a devectorized version of code first and then vectorize when required. While most times I want @rtransform, if I'm standardizing a variable, as in x - mean(x), then I need to remember to use @transform. That's more of a Julia style thing (borrowed from Matlab) than anything specific to do with DataFrames.jl. Loving playing with Julia.

Thanks again.

@pdeffebach

Julia has no way to detect if a transformation should be done col-wise or row-wise, unfortunately. You can't vectorize a function the way you do in R. But it might make sense to make row-wise the default and introduce @ctransform.
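
The col-wise versus row-wise distinction can be sketched in plain Julia with only the Statistics standard library. The vector `x` is made-up illustration data. Applied to the whole column, `x .- mean(x)` centers the values. Applied one element at a time, each element is its own mean, so the result is all zeros, which is why standardizing needs the column-wise form.

```julia
using Statistics

x = [1.0, 2.0, 3.0]  # made-up column values

# Column-wise: the expression sees the whole vector, so mean(x) is the column mean.
centered = x .- mean(x)

# Row-wise: the function sees one scalar at a time, so mean(v) == v.
rowwise = map(v -> v - mean(v), x)

println(centered)
println(rowwise)
```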

@bkamins

bkamins commented Jan 21, 2023

Let me add one more comment. This is the way to write this transformation with pure DataFrames.jl that is compiler-friendly (which was the major time cost with the example in README.md):

julia> @time @chain movies begin
         subset(:Year => ByRow(>=(2000)), view=true)
         groupby(:Year)
         combine(:Budget => mean∘skipmissing => :Budget)
         transform!(:Budget => Base.Fix2(/, 1e6) => :Budget)
       end
  0.000462 seconds (670 allocations: 275.789 KiB)
6×2 DataFrame
 Row │ Year   Budget
     │ Int32  Float64
─────┼────────────────
   1 │  2000  23.9477
   2 │  2001  19.2356
   3 │  2002  19.3971
   4 │  2003  15.8683
   5 │  2004  13.9057
   6 │  2005  16.4682
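
A small sketch of why the forms above are friendlier to the compiler, using only Base and Statistics. The variable names are illustrative. `>=(2000)`, `Base.Fix2(/, 1e6)` and `mean∘skipmissing` all have stable, reusable types, whereas every anonymous function written in source gets its own new type, so code specialized on one lambda cannot be reused for a textually identical one.

```julia
using Statistics

# Partially applied comparison: a Base.Fix2 value, not a fresh anonymous function.
pred = >=(2000)

# Same idea for division by a constant; works on a whole vector.
scale = Base.Fix2(/, 1e6)

# Function composition yields a ComposedFunction.
avg = mean ∘ skipmissing

# Two textually identical anonymous functions still have different types.
a = x -> x .>= 2000
b = x -> x .>= 2000

println(pred isa Base.Fix2)
println(typeof(a) == typeof(b))
```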

@kdpsingh
Owner

Thank you so much! This is helpful for my learning, and it's very kind of you to take the time to share!

@bkamins

bkamins commented Jan 21, 2023

Also, as was noted on Slack, using filter is faster than subset (but I assume you wanted to use subset).
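
For reference, the two spellings of the row selection look like this in DataFrames.jl. The toy `df` is hypothetical and only stands in for the movies table. This sketch shows that both select the same rows and makes no timing claim of its own.

```julia
using DataFrames

# Hypothetical toy data standing in for the movies table.
df = DataFrame(Year = [1999, 2000, 2003], Budget = [1.0e6, 2.0e6, 3.0e6])

# subset takes the data frame first and needs ByRow for an element-wise predicate.
via_subset = subset(df, :Year => ByRow(>=(2000)))

# filter takes the predicate first and applies it row by row.
via_filter = filter(:Year => >=(2000), df)

println(via_subset.Year == via_filter.Year)
```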
