Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: incorrect negative offset in multi-byte string slicing #15140

Merged
merged 2 commits into from Mar 22, 2024

Conversation

orlp
Copy link
Collaborator

@orlp orlp commented Mar 18, 2024

Fixes #15136.

Issue identified by @mcrumiller here: #14425 (comment).

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Mar 18, 2024
@mcrumiller
Copy link
Contributor

Ahh I have a fix for this too, but I'm assuming yours is better. Also the primary issue is now #15136.

@orlp
Copy link
Collaborator Author

orlp commented Mar 18, 2024

@mcrumiller It was rather tricky, we also were doing things wrong with 'very negative' indices. For example pl.Series(["abcd"]).str.slice(-100, 2) should return "", not "ab".

@mcrumiller
Copy link
Contributor

pl.Series(["abcd"]).str.slice(-100, 2) should return "", not "ab".

I wasn't sure what the correct usage here was--is it a sliding window that gets "stuck" when it hits the end (in which case it would be "ab") or do we let the window go anywhere and see what's in it? I'm guessing the latter per your comment.

@orlp
Copy link
Collaborator Author

orlp commented Mar 18, 2024

@mcrumiller This is the current behavior of pl.Series(["abcd"]).str.slice(i, 2):

       -7 ab 
       -6 ab 
       -5 ab 
0 ab   -4 ab 
1 bc   -3 bc 
2 cd   -2 cd 
3 d    -1 d  
4
5

This is the bugfix behavior:

        -7    
        -6    
        -5 a  
0 ab    -4 ab 
1 bc    -3 bc 
2 cd    -2 cd 
3 d     -1 d  
4 
5 

To me the latter is much more logical. It's also consistent with Python:

>>> "abcd"[-5:-3]
"a"

This is "a", not "ab".

@orlp
Copy link
Collaborator Author

orlp commented Mar 18, 2024

Ugh but this is inconsistent with .slice... Will put this on draft for now until we've discussed it.

@orlp orlp marked this pull request as draft March 18, 2024 21:46
@mcrumiller
Copy link
Contributor

Most other languages implement slice with start/stop/step instead of start/length. Is this worth considering? It would then allow us to use slice syntax i.e. pl.col("s").str[0:5] as pandas does, which is a nice syntax that I've always wished we had in polars. We could also then deprecate gather_every as it could be consumed by the slice syntax.

Copy link

codecov bot commented Mar 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 81.22%. Comparing base (066c262) to head (00ea716).
Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15140      +/-   ##
==========================================
+ Coverage   81.08%   81.22%   +0.13%     
==========================================
  Files        1342     1346       +4     
  Lines      174150   175228    +1078     
  Branches     2459     2506      +47     
==========================================
+ Hits       141210   142324    +1114     
+ Misses      32473    32424      -49     
- Partials      467      480      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ritchie46
Copy link
Member

Any news on this on?

@mcrumiller
Copy link
Contributor

mcrumiller commented Mar 22, 2024

@ritchie46 I think the question up for discussion is how str.slice should work with negative offsets that extend past the start:

s = "abcdefghij"  # length 10
s2 = s.str.slice(-12, 5)

Which of the two options should this return?

# option 1: window sticks to left edge

        [        s2         ]
          0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        ----|---|---|---|---|---|---|---|---|---
        | a | b | c | d | e | f | g | h | i | j

= "abcde"

# option 2: window positioned absolutely

[        s2         ]
-1 | -2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        ----|---|---|---|---|---|---|---|---|---
        | a | b | c | d | e | f | g | h | i | j

= "abc"

My other comment is that in both Rust and python, slice is defined by start and stop args (which would remove this issue entirely) instead of a start and length. Why do we do it this way (with length), and should we change the slice behavior?

@orlp
Copy link
Collaborator Author

orlp commented Mar 22, 2024

@ritchie46 This PR will be merged later, as part of this issue: #15232. Unless you temporarily want to make str.slice inconsistent with .slice until we fix .slice as well.

@mcrumiller We've discussed this internally and we want to go forward with the solution described in that issue. We want to keep Polars' offset-length style, as it would literally break everyone's code out there to switch to Python's style if we even wanted to.

@orlp orlp marked this pull request as ready for review March 22, 2024 16:29
@ritchie46 ritchie46 merged commit 911abc3 into pola-rs:main Mar 22, 2024
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

str.slice() does not respect codepoints for negative offsets
3 participants