semi/anti join #5508

Closed
marsupialtail opened this issue Nov 14, 2022 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@marsupialtail

Problem description

Currently if you want to do BOTH a semi join and an anti join, you have to, well, do them both. Most of the work is duplicated.

It would be amazing if Polars supported something like a "semi/anti" join that produces two results, one for the semi join and one for the anti join, in a single pass.
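
To make the request concrete, here is a minimal pure-Python sketch of the idea, not Polars code. The function name `semi_anti_split` and its parameters are hypothetical. It assumes the right-hand keys fit in a hash set, which is built once and probed once per left row, so each row lands in exactly one of the two outputs.

```python
from typing import Callable, Hashable, Iterable, List, Tuple, TypeVar

Row = TypeVar("Row")


def semi_anti_split(
    left: Iterable[Row],
    right_keys: Iterable[Hashable],
    key: Callable[[Row], Hashable],
) -> Tuple[List[Row], List[Row]]:
    """Split `left` into (semi, anti) with one lookup per row."""
    lookup = set(right_keys)  # built once, shared by both outputs
    semi: List[Row] = []
    anti: List[Row] = []
    for row in left:
        # A row goes to exactly one side, so a single probe decides both joins.
        (semi if key(row) in lookup else anti).append(row)
    return semi, anti


left_rows = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
semi_rows, anti_rows = semi_anti_split(left_rows, [2, 4, 5], key=lambda r: r["id"])
print([r["id"] for r in semi_rows])  # [2, 4]
print([r["id"] for r in anti_rows])  # [1, 3]
```

Running the two joins separately would build the lookup and scan the left side twice, which is the duplicated work described above.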

marsupialtail added the enhancement label on Nov 14, 2022
@MarcoGorelli
Collaborator

MarcoGorelli commented Dec 15, 2023

thanks @marsupialtail for the request

I think @orlp had a suggestion to materialise multiple expressions together

I think this is a good idea, because otherwise, splitting lazyframes means you risk "shooting yourself in the foot" with repeated computations

For example

# df and other are LazyFrames: collect each one once, then join eagerly
df_collected = df.collect()
other_collected = other.collect()
semi, anti = (
    df_collected.join(other_collected, on='acoustic_data', how='semi'),
    df_collected.join(other_collected, on='acoustic_data', how='anti')
)

is twice as fast as

# timed in the notebook with the %%time cell magic
semi, anti = pl.collect_all(
    [
        df.join(other, on='acoustic_data', how='semi'),
        df.join(other, on='acoustic_data', how='anti')
    ]
)

notebook where I time this: https://www.kaggle.com/code/marcogorelli/semi-anti-join-in-polars?scriptVersionId=155140639

@orlp
Collaborator

orlp commented Dec 15, 2023

In the future we'll support smart join sharing like this under the hood using the optimizing engine.


3 participants