sorted_merge_join #3505

Merged
merged 4 commits into from
Jun 4, 2022

Conversation

ritchie46
Member

@ritchie46 ritchie46 commented May 26, 2022

Sorted merge join optimization.

closes #3474

Tested on:

import os

import numpy as np
import polars as pl

os.environ["POLARS_VERBOSE"] = "1"
for i in range(0, 300):
    np.random.seed(i)
    n = 10_000
    df_a = pl.DataFrame({
        'a': np.sort(np.random.randint(0, n // 2, n))
    }).with_row_count("row_a")


    df_b = pl.DataFrame({
        'a': np.sort(np.random.randint(0, n // 2, n // 2))
    }).with_row_count("row_b")

    for how in ["left", "inner"]:
        
        # hash join
        out_hash_join = df_a.join(df_b, on="a", how=how)

        # sorted merge join
        out_sorted_merge_join = df_a.with_column(
            pl.col("a").set_sorted()
        ).join(df_b.with_column(
            pl.col("a").set_sorted()
        ), on="a", how=how)

        assert out_hash_join.frame_equal(out_sorted_merge_join)
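
For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of how an inner merge join over two ascending-sorted key columns can produce matching row indices in a single pass. It is an illustration only, not the Rust kernel added in this PR; the function name `sorted_merge_inner_join` and its return shape are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def sorted_merge_inner_join(
    left: Sequence[int], right: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Return matching (left, right) row indices for two ascending-sorted key columns.

    Hypothetical helper for illustration; not the Polars implementation.
    """
    left_idx: List[int] = []
    right_idx: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            # Left key is too small to match anything further right; advance left.
            i += 1
        elif left[i] > right[j]:
            # Right key is too small; advance right.
            j += 1
        else:
            key = left[i]
            # Find the end of the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end] == key:
                j_end += 1
            # Pair every left row holding this key with every right row in the run.
            while i < len(left) and left[i] == key:
                for k in range(j, j_end):
                    left_idx.append(i)
                    right_idx.append(k)
                i += 1
            j = j_end
    return left_idx, right_idx
```

Because both cursors only move forward, the scan itself is linear in the input sizes (plus the number of output pairs), and no hash table is built. A left join would additionally emit each unmatched left row paired with a null right index.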

Performance

[image: performance benchmark, not preserved]

@github-actions github-actions bot added the rust Related to Rust Polars label May 26, 2022
@ritchie46 ritchie46 changed the title from "WIP" to "Sorted merge join optimization" Jun 4, 2022
@codecov-commenter

Codecov Report

Merging #3505 (5d0babc) into master (e5f3173) will decrease coverage by 0.27%.
The diff coverage is 3.17%.

@@            Coverage Diff             @@
##           master    #3505      +/-   ##
==========================================
- Coverage   61.56%   61.29%   -0.28%     
==========================================
  Files         424      427       +3     
  Lines       71057    71429     +372     
==========================================
+ Hits        43747    43782      +35     
- Misses      27310    27647     +337     
Impacted Files                                          Coverage Δ
polars/polars-arrow/src/data_types.rs                   60.00% <ø> (ø)
polars/polars-arrow/src/kernels/mod.rs                  63.69% <ø> (ø)
...lars/polars-arrow/src/kernels/sorted_join/inner.rs   0.00% <0.00%> (ø)
...olars/polars-arrow/src/kernels/sorted_join/left.rs   0.00% <0.00%> (ø)
polars/polars-arrow/src/kernels/take.rs                 68.14% <ø> (-0.25%) ⬇️
polars/polars-arrow/src/lib.rs                          0.00% <ø> (ø)
...lars/polars-core/src/frame/hash_join/sort_merge.rs   7.44% <7.44%> (ø)
polars/polars-core/src/frame/hash_join/mod.rs           41.51% <28.57%> (-0.03%) ⬇️
py-polars/polars/io.py                                  68.51% <0.00%> (-6.97%) ⬇️
polars/polars-core/src/series/ops/round.rs              41.89% <0.00%> (-6.76%) ⬇️
... and 20 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ritchie46 ritchie46 changed the title from "Sorted merge join optimization" to "sorted_merge_join" Jun 4, 2022
@ritchie46 ritchie46 merged commit 9460e46 into master Jun 4, 2022
@ritchie46 ritchie46 deleted the sorted_join branch June 4, 2022 09:22