Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(rust, python): expressify offset and length parameters for str.slice #12071

Closed

Conversation

cmdlineluser
Copy link
Contributor

Attempts to resolve #10890

I see it had been previously self-assigned @orlp - hopefully I'm not intruding here.

(I imagine you had more important things to work on.)

df = pl.DataFrame({
   "foo": ["abcde", "fghijk"],
   "offset": [1, 2],
   "length": [3, 4],
})

df.with_columns(slice = pl.col("foo").str.slice("offset", "length"))

# shape: (2, 4)
# ┌────────┬────────┬────────┬───────┐
# │ foo    ┆ offset ┆ length ┆ slice │
# │ ---    ┆ ---    ┆ ---    ┆ ---   │
# │ str    ┆ i64    ┆ i64    ┆ str   │
# ╞════════╪════════╪════════╪═══════╡
# │ abcde  ┆ 1      ┆ 3      ┆ bcd   │
# │ fghijk ┆ 2      ┆ 4      ┆ hijk  │
# └────────┴────────┴────────┴───────┘

I used the strip_prefix impl as a guide/template https://github.com/pola-rs/polars/blob/main/crates/polars-ops/src/chunked_array/strings/strip.rs

Some things I'm not sure about if anybody wants to give feedback:

  • Instead of all the length checks, I just broadcasted offset and length using .clear().extend_constant() - is this an acceptable approach or is it a bad idea + performance concern?
  • Is there a way to "typecheck" the params, or is it okay to just .cast()?
  • Is UInt64 the correct type for length?
  • The param name is offset but internally start is used, I changed everything to offset, is this the right approach?
  • I had to modify polars-sql as it used .slice() (incidentally, that uses start as the param name, should it be start everywhere instead of offset?)
  • Is it okay to use !matches!?
  • As length is a param name, I had to re-jig the error format a bit to avoid ending up with length value length.

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Oct 27, 2023
let offset = offset.cast(&DataType::Int64)?;
let offset = offset.i64()?;

let length = match s2.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to add broadcasting by extending memory. See discussion here; #11900

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! Excellent, thanks @ritchie46 - that explains the need for the added length checks / fallback logic.

I'll switch it around.

@cmdlineluser cmdlineluser marked this pull request as draft October 28, 2023 16:14
@cmdlineluser
Copy link
Contributor Author

cmdlineluser commented Oct 28, 2023

binary_elementwise(ca, length, |opt_str_val, opt_length| utf8_substring_ternary(opt_str_val, offset.get(0), opt_length))

Does anybody know what I'm doing wrong here?

error: implementation of `FnOnce` is not general enough
  --> /Users/user/git/polars/crates/polars-ops/src/chunked_array/strings/substring.rs:63:5
   |
63 |     binary_elementwise(ca, length, |opt_str_val, opt_length| utf8_substring_ternary(opt_str_val, offset.get(0), opt_length))
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ implementation of `FnOnce` is not general enough
   |
   = note: closure with signature `fn(std::option::Option<&'2 str>, std::option::Option<u64>) -> std::option::Option<&str>` must implement `FnOnce<(std::option::Option<&'1 str>, std::option::Option<u64>)>`, for any lifetime `'1`...
   = note: ...but it actually implements `FnOnce<(std::option::Option<&'2 str>, std::option::Option<u64>)>`, for some specific lifetime `'2`

@orlp
Copy link
Collaborator

orlp commented Oct 30, 2023

@cmdlineluser You're not doing anything wrong, the compiler is just inferring stuff wrong here. Let me check if I can rework this...

@orlp
Copy link
Collaborator

orlp commented Oct 30, 2023

@cmdlineluser Unfortunately this is one place where we must help the compiler a bit. This works:

match (offset.len(), length.len()) {
    (1, 1) => ca.apply_generic(|opt_str_val| {
        utf8_substring_ternary(opt_str_val, offset.get(0), length.get(0))
    }),
    (1, _) => {
        fn infer<F: for<'a> FnMut(Option<&'a str>, Option<u64>) -> Option<&'a str>>(f: F) -> F where
        {
            f
        }

        let off = offset.get(0);
        binary_elementwise(
            ca,
            length,
            infer(|val, len| utf8_substring_ternary(val, off, len)),
        )
    },
    (_, 1) => {
        fn infer<F: for<'a> FnMut(Option<&'a str>, Option<i64>) -> Option<&'a str>>(f: F) -> F where
        {
            f
        }

        let len = length.get(0);
        binary_elementwise(
            ca,
            offset,
            infer(|val, off| utf8_substring_ternary(val, off, len)),
        )
    },
    _ => ternary_elementwise(ca, offset, length, utf8_substring_ternary),
}

@cmdlineluser
Copy link
Contributor Author

Ah, thank you @orlp

So that's what the infer_re_match() is doing in some of the other examples:


// We need this to infer the right lifetimes for the match closure.
#[inline(always)]
fn infer_re_match<F>(f: F) -> F
where
F: for<'a, 'b> FnMut(Option<&'a str>, Option<&'b str>) -> Option<bool>,
{
f
}

@orlp
Copy link
Collaborator

orlp commented Oct 31, 2023

@cmdlineluser Yes, that's why. Perhaps we should be using the https://crates.io/crates/higher-order-closure crates for this.

@reswqa
Copy link
Collaborator

reswqa commented Nov 8, 2023

Perhaps we should be using the https://crates.io/crates/higher-order-closure crates for this.

It's a real pain fighting the compiler here, it would be awesome if this crates could save us. 😞

@reswqa
Copy link
Collaborator

reswqa commented Jan 12, 2024

Hi @cmdlineluser. Thank you for this nice work! If you don't have free time, can I take over and keep pushing it as we have some future work that depends on this one. Of course, I will preserve your authorship. :D)

@cmdlineluser
Copy link
Contributor Author

@reswqa Please feel free to take over - thank you.

@cmdlineluser
Copy link
Contributor Author

Superseded by #13747

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow str.slice to take Exprs instead of a fixed offset/length
4 participants