Add Datafusion solution [updated] #240

Open
wants to merge 21 commits into
base: master
matthewmturner

Updated PR to get Datafusion added to benchmarks.

Group by queries 6, 8, and 9 are still missing. I am going to look into them and then start looking into the flow and required output.

Let me know if there is anything in particular that would make it easier for you to add this :)

One question: can someone confirm that this can be run with cargo? Similar to the work @Dandandan did (I picked up from there), I am running the queries with the commands below:

# Group By
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin groupby --release

# Join
RUSTFLAGS='-C target-cpu=native' cargo +nightly run --bin join --release

@matthewmturner
Author

@Dandandan FYI, I took a first stab at group by q8.

q1 took 62 ms
q2 took 322 ms
q3 took 1230 ms
q4 took 61 ms
q5 took 1242 ms
q7 took 1262 ms
q8 took 2733 ms
q10 took 24071 ms

Results are currently similar to Spark.

@Dandandan

> @Dandandan fyi took a first stab at group by q8.
>
> q1 took 62 ms
> q2 took 322 ms
> q3 took 1230 ms
> q4 took 61 ms
> q5 took 1242 ms
> q7 took 1262 ms
> q8 took 2733 ms
> q10 took 24071 ms
>
> results currently similar to spark

Nice! The Spark solution uses DESC ordering, by the way. I guess that's what we should use.
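For readers unfamiliar with the query, the effect of DESC ordering can be sketched as follows. This is an illustration only, not the PR's code: it assumes q8 is the benchmark's "largest two `v3` per `id6`" query, the table and column names (`x`, `id6`, `v3`, `largest2_v3`, `order_v3`) are assumptions, the sample rows are made up, and `sqlite3` from the Python standard library stands in for DataFusion or Spark.

```python
import sqlite3

# In-memory stand-in table; names and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (id6 INTEGER, v3 REAL)")
conn.executemany(
    "INSERT INTO x VALUES (?, ?)",
    [(1, 5.0), (1, 9.0), (1, 7.0), (2, 4.0), (2, None), (2, 8.0)],
)

# Rank rows within each id6 group from largest to smallest v3 (DESC),
# then keep the first two ranks. NULL values are excluded before ranking.
query = """
SELECT id6, largest2_v3
FROM (
    SELECT id6,
           v3 AS largest2_v3,
           ROW_NUMBER() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3
    FROM x
    WHERE v3 IS NOT NULL
) AS sub_query
WHERE order_v3 <= 2
ORDER BY id6, largest2_v3 DESC
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 9.0), (1, 7.0), (2, 8.0), (2, 4.0)]
```

With ascending order in the window, the same filter would return the two smallest values per group instead, which is why the ordering direction matters for matching the other solutions.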

@matthewmturner
Author

matthewmturner commented Dec 27, 2021

@Dandandan FYI, I migrated to the Python bindings, which should make integrating with their flow easier since I'm using the existing Python helpers.

I still have to migrate the join suite.

Let me know if you have any thoughts.

Results below (in seconds). Maybe something odd is going on with q10?

0.11225258399999993 # q1
0.695109333 # q2
2.932470125 # q3
0.07341450000000016 # q4
3.3075385419999996 # q5
2.9051008750000005 # q7
4.573697916 # q8
68.875322208 # q10

ans = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1").collect()
t = timeit.default_timer() - t_start
print(t)
shape = ans_shape(ans)
Contributor

Every solution in this benchmark includes the shape check in the timed section, to ensure that no lazy evaluation is deferred past the measurement. I imagine DataFusion is not lazy, yet it seems unfair to skip this step in the timing.

Author

Makes sense. I'll update!
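The change requested in this review thread can be sketched roughly as below: the shape is computed before the clock is stopped. This is a minimal sketch, not the PR's final code. `timed_query` and `fake_query` are hypothetical names, `fake_query` is a placeholder for a call such as `ctx.sql("...").collect()`, and the `ans_shape` body here is an assumed stand-in for the helper of the same name in the diff.

```python
import timeit


def ans_shape(batches):
    # Assumed stand-in for the helper named in the diff: total row count
    # and column count across a list of row batches.
    rows = sum(len(batch) for batch in batches)
    first = next((batch for batch in batches if batch), None)
    cols = len(first[0]) if first else 0
    return rows, cols


def timed_query(run_query):
    # Run the query and compute the shape BEFORE stopping the clock, so a
    # lazy engine cannot defer work past the measurement.
    t_start = timeit.default_timer()
    ans = run_query()
    shape = ans_shape(ans)
    elapsed = timeit.default_timer() - t_start
    return ans, shape, elapsed


def fake_query():
    # Placeholder for the real query call; returns two batches of
    # (id1, v1) rows with made-up values.
    return [[("id001", 10), ("id002", 7)], [("id003", 3)]]


ans, shape, elapsed = timed_query(fake_query)
print(shape, elapsed)
```

For an eager engine the extra cost of the shape check should be small, so including it mainly keeps the measurement comparable across solutions.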

@matthewmturner
Author

@jangorecki I've made a number of updates, including adding DataFusion to some of your utilities and runners, which will hopefully make your life easier.

Would you be able to see how close this is?

One thing I haven't been able to test locally is running against the larger datasets, so I'm not sure whether we will get errors on those, or which ones. Do you have a recommendation for how to handle this?

Thanks for your help!

@matthewmturner
Author

Hi @jangorecki, just checking in on this and whether there is anything I can do to help.

As some additional context, DataFusion has, or will soon have, several new features that will improve our query coverage and likely performance. From your perspective, would you rather we submit once those are all completed, or can we get the current submission merged as is and iterate from there?

Thanks!

@jangorecki
Contributor

I am no longer a maintainer of this project, as I don't work for H2O anymore. I would start by contacting the current maintainer to make sure that the effort you are going to undertake will be merged. H2O support is very helpful, so you should have no problem finding out who now takes care of the project. Aside from the support channel, you should also be able to reach H2O easily on Twitter and similar channels.
You can of course always make a fork and publish the results from your fork, as this is an open source project and nothing restricts you from doing so.

@matthewmturner
Author

@jangorecki thank you for your work on this and for letting us know :) I will reach out to H2O for support.

torsstei pushed a commit to IBM-Cloud/db-benchmark that referenced this pull request Mar 23, 2022