Skip to content

fix: Descending search in search_sorted(), and nested dtypes from Python in search_sorted()#21266

Closed
itamarst wants to merge 51 commits intopola-rs:mainfrom
itamarst:21100-index_of-array-series-length-1
Closed

fix: Descending search in search_sorted(), and nested dtypes from Python in search_sorted()#21266
itamarst wants to merge 51 commits intopola-rs:mainfrom
itamarst:21100-index_of-array-series-length-1

Conversation

@itamarst
Copy link
Contributor

@itamarst itamarst commented Feb 14, 2025

Fixes #21100

In addition:

  • Make search_sorted() from Python work with nested dtypes, keeping in mind some ambiguities in the API.
  • Ensure search_sorted() give correct answers for descending sort; previously it silently gave the wrong answer. For nested dtypes this required a fix to row encoding.
  • Start setting the is_sorted flags on sorted pl.Array Series, which will probably result in performance improvements in some cases.

@itamarst itamarst changed the title Fix: Edge case where length 1 Series of array couldn't use index_of() fix: Edge case where length 1 Series of array couldn't use index_of() Feb 14, 2025
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars and removed title needs formatting labels Feb 14, 2025
@codecov
Copy link

codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.09%. Comparing base (322827c) to head (41e4414).
Report is 132 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #21266      +/-   ##
==========================================
+ Coverage   80.73%   81.09%   +0.35%     
==========================================
  Files        1640     1651      +11     
  Lines      235830   232630    -3200     
  Branches     2719     2749      +30     
==========================================
- Hits       190403   188650    -1753     
+ Misses      44784    43324    -1460     
- Partials      643      656      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@itamarst itamarst marked this pull request as ready for review February 14, 2025 14:22
@itamarst itamarst marked this pull request as draft February 14, 2025 14:23
@itamarst itamarst closed this Feb 14, 2025
@itamarst itamarst reopened this Feb 14, 2025
@itamarst itamarst marked this pull request as ready for review February 14, 2025 14:46
@itamarst
Copy link
Contributor Author

itamarst commented Mar 3, 2025

Going to start on fixing the underlying issue, will resubmit when that's done.

@itamarst itamarst closed this Mar 3, 2025
@itamarst itamarst reopened this Mar 3, 2025
@coastalwhite
Copy link
Collaborator

Sorry @itamarst for not reviewing this earlier. It would take me quite a bit of time to properly review this now, and to also solve the problem in this PR with nested datatypes in the row encoding. There are some other items that need more urgent attention.

@itamarst
Copy link
Contributor Author

This will be a lot nicer once #21946 is fixed (I am working on it) so closing for now.

@itamarst itamarst reopened this Apr 30, 2025
@itamarst
Copy link
Contributor Author

I have fixed the issues with nested dtypes, so it's a lot nicer now.

Copy link
Collaborator

@coastalwhite coastalwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice that you were able to resolve the row encoding issue.

I am not sure about the signature you are giving search_sorted now. The usage pl.Series([1, 2, 3]).search_sorted([2, 3]) implies that it is (T, List(T)) -> IdxSize instead of what it is which is (T, T) -> IdxSize. I just solved a bunch of other functions with the same problem. If you want to search multiple items you should probably do pl.Series([1, 2, 3]).search_sorted(pl.Series([2, 3])). This would also resolve your comment in the python docstring because the function would be consistent across nested and non-nested types.

Maybe have a quick look at #22149 to see the problem. This one is essentially the reverse of that.

@itamarst
Copy link
Contributor Author

Thank you! Will look into this more tomorrow or later and respond/change. But as context (and it's been a while, so this is from memory) I was trying to be backwards compatible with this change.

@coastalwhite
Copy link
Collaborator

Thank you! Will look into this more tomorrow or later and respond/change. But as context (and it's been a while, so this is from memory) I was trying to be backwards compatible with this change.

Those other functions required an implicit .implode() to stay backwards compatible. Search sorted probably requires an implicit scalar .explode().

@coastalwhite
Copy link
Collaborator

Could you extract the row encoding fix to a separate PR?

@itamarst
Copy link
Contributor Author

itamarst commented May 1, 2025

I can, yeah. I tried to do that originally in separate PR but failed to find a test that would reproduce the issue in isolation.

@itamarst
Copy link
Contributor Author

itamarst commented May 1, 2025

Done, #22557. Once that is merged, will close this and probably open a new PR from new branch for the rest, the git history on this one is too messy, and the merge is not going to be fun.

@itamarst
Copy link
Contributor Author

itamarst commented May 6, 2025

Opened new PR for this (#22633), now that part has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

fix Bug fix python Related to Python Polars rust Related to Rust Polars

Projects

None yet

Development

Successfully merging this pull request may close these issues.

index_of() with Series of length 1 and dtype Array doesn't work

4 participants