Add Datafusion solution [updated] #240

matthewmturner · 2021-12-26T00:54:54Z

Updated PR to get Datafusion added to benchmarks.

Right now it is missing group by queries 6, 8, and 9. I am going to look into those missing queries and then start looking into the flow / required output.

Let me know if anything in particular would make your life easier to add this :)

One question - can someone just confirm that this will be able to be run with cargo? Similar to the work @Dandandan did (I picked up from there) I am running the queries with the below commands:

# Group By
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin groupby --release

# Join
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin join --release

matthewmturner · 2021-12-26T07:47:17Z

@Dandandan fyi took a first stab at group by q8.

q1 took 62 ms
q2 took 322 ms
q3 took 1230 ms
q4 took 61 ms
q5 took 1242 ms
q7 took 1262 ms
q8 took 2733 ms
q10 took 24071 ms

results currently similar to spark

Dandandan · 2021-12-26T08:20:58Z

> @Dandandan fyi took a first stab at group by q8.
> q1 took 62 ms
> q2 took 322 ms
> q3 took 1230 ms
> q4 took 61 ms
> q5 took 1242 ms
> q7 took 1262 ms
> q8 took 2733 ms
> q10 took 24071 ms
> results currently similar to spark

Nice! The spark solution has DESC ordering btw, I guess that's what we should use.

matthewmturner · 2021-12-27T19:17:20Z

@Dandandan FYI I migrated to the Python bindings, which should make integrating with their flow easier as I'm using the existing Python helpers.

I still have to migrate the join suite.

let me know if you have any thoughts.

results below - something odd going on with Q10 maybe?

0.11225258399999993 # q1
0.695109333 # q2
2.932470125 # q3
0.07341450000000016 # q4
3.3075385419999996 # q5
2.9051008750000005 # q7
4.573697916 # q8
68.875322208 # q10
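
A quick way to sanity-check the Q10 question is to divide each Python-bindings timing by the earlier Rust timing for the same query. This is only a rough sketch: it assumes both sets of numbers came from the same dataset and machine, which the thread does not state, and the function name `slowdown` is made up for illustration.

```python
# Timings copied from the two comments in this thread.
# Rust binary results are in milliseconds, Python-binding results in seconds.
rust_ms = {
    "q1": 62, "q2": 322, "q3": 1230, "q4": 61,
    "q5": 1242, "q7": 1262, "q8": 2733, "q10": 24071,
}
python_s = {
    "q1": 0.11225258399999993, "q2": 0.695109333, "q3": 2.932470125,
    "q4": 0.07341450000000016, "q5": 3.3075385419999996,
    "q7": 2.9051008750000005, "q8": 4.573697916, "q10": 68.875322208,
}

def slowdown(rust_ms, python_s):
    """Return python_time / rust_time per query, with both in seconds."""
    return {q: python_s[q] / (rust_ms[q] / 1000.0) for q in rust_ms}

ratios = slowdown(rust_ms, python_s)
for q, r in ratios.items():
    print(f"{q}: {r:.2f}x")
```

If the ratio for q10 is in the same range as the other queries, the slowdown is probably general to the run rather than specific to Q10.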

jangorecki · 2021-12-28T16:02:36Z

datafusion/groupby-datafusion.py

+ans = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1").collect()
+t = timeit.default_timer() - t_start
+print(t)
+shape = ans_shape(ans)


For every solution in this benchmark, checking the shape is part of the timing, to ensure no laziness happens. I can imagine DataFusion is not lazy, yet it seems unfair to skip this step in the timing.

matthewmturner · 2021-12-28

Makes sense. I'll update!
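
The review point above can be sketched as follows: the clock stops only after the shape has been read, so an engine that defers work cannot look faster than it is. This is a minimal illustration, not the PR's actual code. `timed_with_shape` and `LazyResult` are hypothetical names, and `LazyResult` is a stand-in for a lazy engine so the example runs without DataFusion installed. In the script under review, `run_query` would be the `ctx.sql(...).collect()` call and `get_shape` would be the existing `ans_shape` helper.

```python
import timeit

def timed_with_shape(run_query, get_shape):
    """Time a query with the shape check inside the timed region."""
    t_start = timeit.default_timer()
    ans = run_query()
    shape = get_shape(ans)  # forces any deferred work before the clock stops
    elapsed = timeit.default_timer() - t_start
    return ans, shape, elapsed

class LazyResult:
    """Hypothetical lazy result: rows are only materialized when shape() is called."""

    def __init__(self, rows):
        self._rows = rows
        self.materialized = False

    def shape(self):
        self.materialized = True
        data = list(self._rows)
        return (len(data), len(data[0]) if data else 0)

ans, shape, elapsed = timed_with_shape(
    lambda: LazyResult([(1, 10), (2, 20), (3, 30)]),
    lambda result: result.shape(),
)
print(shape, elapsed)
```

Passing the query and the shape check as callables keeps the helper independent of any one engine, so the same timing rule applies to every solution.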

matthewmturner · 2022-01-04T19:56:12Z

@jangorecki I've made a number of updates, including adding datafusion to some of your utilities / runners, which will hopefully make your life easier.

would you be able to see how close this is?

one thing I haven't been able to test locally is running against the larger datasets, so I'm not sure if / what errors we may get on those. do you have a recommendation for how to handle that?

thanks for your help!

matthewmturner · 2022-01-20T03:51:05Z

hi @jangorecki - just checking in on this and whether there is anything I can do to help.

as some additional context, datafusion has / will soon have several new features that will improve our query coverage and likely performance. from your perspective, would you rather we submit once those are all completed, or can we get the current submission merged as is and iterate from there?

thanks!

jangorecki · 2022-01-20T14:18:35Z

I am no longer a maintainer of this project as I don't work for H2O anymore. I would start by contacting the maintainer of the project to ensure that the effort you are going to undertake will be merged in. H2O support is very helpful, so you should have no trouble finding out who now takes care of the project. Aside from the support channel, you should also easily reach H2O on Twitter, etc.
You can of course always make a fork and publish results of your fork as this is an open source project and there are no restrictions like this.

matthewmturner · 2022-01-20T14:38:57Z

@jangorecki thank you for your work on this and for letting us know :) i will reach out to H2O for support.

Dandandan and others added 18 commits January 17, 2021 14:41

Datafusion solution

3a983fd

Datafusion solution

b1f613e

Query fix

51ce127

Undo change

3343428

Increase batch size

d1e7ff3

Rename to ans

58be012

Fix

d87c92d

Add q7/q10

2b67e2a

Use multiple threads better

d217e37

Add exec script

5a3e5ec

Some cleanup

f839050

Rename

6cb14f5

Fix disabled snmalloc

63fe38b

Use arrow master again

cbecfbc

Update benchmark code

88ba391

Make queries work again

fbb50dc

Add join queries

20978b7

group by q8

4042b3c

Dandandan mentioned this pull request Dec 26, 2021

DataFusion solution [WIP] #182

Closed

Add python bindings

2cca309

jangorecki reviewed Dec 28, 2021

matthewmturner added 2 commits January 4, 2022 13:05

Fix join and utils

c345446

Remove rust impl and update utilities

82f34fb

matthewmturner mentioned this pull request Jan 4, 2022

Add DataFusion to h2oai/db-benchmark apache/datafusion#147

Closed

torsstei pushed a commit to IBM-Cloud/db-benchmark that referenced this pull request Mar 23, 2022

Rebased datafusion impl from h2oai#240

5e76dfd
