perf: Improve unique performance by adding RangedUniqueKernel for primitive arrays #17166

Merged on Jun 28, 2024 (1 commit)

@coastalwhite (Collaborator) commented Jun 24, 2024

This PR adds a unique-value kernel for `PrimitiveArray` that is selected based on the array's metadata. When the difference between the metadata min and max values is small enough, a different kernel is used that does not require sorting the data first.

This is mostly to show how the new metadata can be used to select a different kernel.
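To make the idea concrete, here is a minimal Python sketch of metadata-based kernel selection. It is not the code from this PR (the actual kernel is written in Rust). The names `ranged_unique`, `unique`, and `MAX_RANGE` are hypothetical, and the cutoff value is an assumption, since the description only says the range must be "small enough".

```python
from typing import Iterable, List, Optional

# Hypothetical cutoff (assumption): the PR only says the range must be "small enough".
MAX_RANGE = 128


def ranged_unique(values: Iterable[int], lo: int, hi: int) -> List[int]:
    """Return the unique values, assuming every value lies in [lo, hi]."""
    seen = [False] * (hi - lo + 1)  # one flag per possible value in the range
    for v in values:
        seen[v - lo] = True  # single pass over the data, no sorting
    return [lo + i for i, flag in enumerate(seen) if flag]


def unique(values: List[int], lo: Optional[int] = None, hi: Optional[int] = None) -> List[int]:
    """Select a kernel based on known min/max metadata, if there is any."""
    if lo is not None and hi is not None and hi - lo < MAX_RANGE:
        return ranged_unique(values, lo, hi)
    # Fallback: sort, then drop adjacent duplicates (stands in for the sort-based path).
    out: List[int] = []
    for v in sorted(values):
        if not out or out[-1] != v:
            out.append(v)
    return out


print(unique([7, 5, 7, 99, 5], lo=5, hi=99))  # [5, 7, 99]
```

The ranged path does one pass over the data plus one pass over the flags, so its cost depends on the input length and the width of the range rather than on a sort.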

In a microbenchmark on a release build, we see the following results:

```python
import polars as pl
import numpy as np
from timeit import timeit

# 500,000 random integers in [5, 100), stored as Int32
xs = list(np.random.randint(5, 100, size = 500000))
df = pl.DataFrame({ "x": xs, }, schema = { "x": pl.Int32 })

def rand_unique():
    df.select(pl.col.x.unique())

# No min/max metadata yet
t = timeit(rand_unique, number = 10000)
print(f'Before Time = {t}')
# Computing min and max records them in the column metadata
df.select(xmin = pl.col.x.min(), xmax = pl.col.x.max())
t = timeit(rand_unique, number = 10000)
print(f'After Time = {t}')
```

```
Before Time = 22.795969667999998
After Time = 4.657802337999783
```

This is a ~4.9x improvement. I feel like this can also be further improved if needed.
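As a quick check of that figure, the two timings above can be turned into a speedup ratio and a per-call cost. The numbers are copied from the output above and the variable names are only illustrative.

```python
number = 10_000                 # timeit repetitions used in the benchmark
before_s = 22.795969667999998   # total seconds before min/max were computed
after_s = 4.657802337999783     # total seconds after min/max were computed

speedup = before_s / after_s
before_ms = before_s / number * 1e3  # milliseconds per unique() call
after_ms = after_s / number * 1e3

print(f"speedup: {speedup:.2f}x")                              # speedup: 4.89x
print(f"per call: {before_ms:.2f} ms -> {after_ms:.2f} ms")    # per call: 2.28 ms -> 0.47 ms
```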

@github-actions bot added the performance, python, and rust labels Jun 24, 2024
@coastalwhite coastalwhite marked this pull request as ready for review June 25, 2024 10:01
@coastalwhite coastalwhite marked this pull request as draft June 25, 2024 10:04

codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 34.52915% with 146 lines in your changes missing coverage. Please review.

Project coverage is 80.78%. Comparing base (4731834) to head (54e83dc).
Report is 5 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| crates/polars-compute/src/unique/primitive.rs | 0.00% | 112 Missing ⚠️ |
| ...es/polars-core/src/chunked_array/ops/unique/mod.rs | 40.00% | 21 Missing ⚠️ |
| crates/polars-compute/src/unique/boolean.rs | 82.66% | 13 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17166      +/-   ##
==========================================
- Coverage   80.94%   80.78%   -0.16%     
==========================================
  Files        1464     1465       +1     
  Lines      191928   192093     +165     
  Branches     2742     2743       +1     
==========================================
- Hits       155349   155183     -166     
- Misses      36070    36399     +329     
- Partials      509      511       +2     

@coastalwhite coastalwhite marked this pull request as ready for review June 25, 2024 10:58
@ritchie46 ritchie46 merged commit bcc8a92 into pola-rs:main Jun 28, 2024
21 checks passed
@stinodego changed the title from "perf: add RangedUniqueKernel for primitive array" to "perf: Add RangedUniqueKernel for primitive array" Jun 30, 2024
@stinodego changed the title from "perf: Add RangedUniqueKernel for primitive array" to "perf: Improve unique performance by adding RangedUniqueKernel for primitive arrays" Jun 30, 2024
@c-peters added the accepted label Jul 1, 2024