df.filter( pl.lit("some_str").str.contains( pl.col(...) )
does not filter properly
#12632
Comments
Thanks for the report - if you run
In [59]: df.select(
...: pl.lit('COMPANY A PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
...: )
Out[59]:
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ bool    │
╞═════════╡
│ true    │
└─────────┘
then the output is a single row.
What you probably want to do is add an extra column first, so it gets broadcast to the right length, before you do the filtering:
In [64]: df.with_columns(name_with_extra_text=pl.lit('COMPANY A PLUS EXTRA TEXT')).filter(
...: pl.col('name_with_extra_text').str.contains(pl.col('name'), literal=True)
...: )
Out[64]:
shape: (1, 3)
┌───────────┬─────┬───────────────────────────┐
│ name      ┆ id  ┆ name_with_extra_text      │
│ ---       ┆ --- ┆ ---                       │
│ str       ┆ i64 ┆ str                       │
╞═══════════╪═════╪═══════════════════════════╡
│ COMPANY A ┆ 1   ┆ COMPANY A PLUS EXTRA TEXT │
└───────────┴─────┴───────────────────────────┘
Having said that, maybe |
I think that it should broadcast to the length. My intuition suggests that what I wrote should function properly. This feels like the sort of "bug" (design decision) that bothered me to no end about pandas, which polars seemed to solve with its API. This is the first time I've encountered this sort of funky behaviour in polars, where series in expressions don't work out to be the correct length. |
I brought this one up a while back #8120 |
yeah just tested some variations myself and there is definitely some unexpected behaviour
(
    pl.DataFrame({
        'text': 'hello world!!',
        'word': ['hello', 'world', 'hi']
    })
    .with_columns(
        contains_col=pl.col('text').str.contains(pl.col('word'), literal=True),
        contains_lit=pl.lit('hello world!!').str.contains(pl.col('word'), literal=True),
        contains_lit2=pl.lit('XXXXhelloXXXXX').str.contains(pl.col('word'), literal=True),
        contains_lit3=pl.lit('world 123').str.contains(pl.col('word'), literal=True),
    )
)
┌───────────────┬───────┬──────────────┬──────────────┬───────────────┬───────────────┐
│ text          ┆ word  ┆ contains_col ┆ contains_lit ┆ contains_lit2 ┆ contains_lit3 │
│ ---           ┆ ---   ┆ ---          ┆ ---          ┆ ---           ┆ ---           │
│ str           ┆ str   ┆ bool         ┆ bool         ┆ bool          ┆ bool          │
╞═══════════════╪═══════╪══════════════╪══════════════╪═══════════════╪═══════════════╡
│ hello world!! ┆ hello ┆ true         ┆ true         ┆ true          ┆ false         │
│ hello world!! ┆ world ┆ true         ┆ true         ┆ true          ┆ false         │
│ hello world!! ┆ hi    ┆ false        ┆ true         ┆ true          ┆ false         │
└───────────────┴───────┴──────────────┴──────────────┴───────────────┴───────────────┘
|
I genuinely cannot understand why this should be expected behaviour. Can anyone explain?
df = pl.DataFrame({
    'val': [1, 2, 3],
    'text': ['a', 'b', 'c']
})
df.with_columns(
    val1=1 + pl.col("val"),
    val2=pl.lit(1) + pl.col("val"),
    val3=pl.lit(1).add(pl.col("val")),
    text1="hi " + pl.col("text"),
    text2=pl.lit("hi ") + pl.col("text"),
    text3=pl.lit("hi ").add(pl.col("text")),
    text4=pl.lit("aaa").str.contains(pl.col("text")),  # Only checking first column value "a"
    text5=pl.lit("bbb").str.contains(pl.col("text")),  # Only checking first column value "a"
)
shape: (3, 10)
┌─────┬──────┬──────┬──────┬──────┬───────┬───────┬───────┬───────┬───────┐
│ val ┆ text ┆ val1 ┆ val2 ┆ val3 ┆ text1 ┆ text2 ┆ text3 ┆ text4 ┆ text5 │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ str  ┆ i64  ┆ i64  ┆ i64  ┆ str   ┆ str   ┆ str   ┆ bool  ┆ bool  │
╞═════╪══════╪══════╪══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ a    ┆ 2    ┆ 2    ┆ 2    ┆ hi a  ┆ hi a  ┆ hi a  ┆ true  ┆ false │
│ 2   ┆ b    ┆ 3    ┆ 3    ┆ 3    ┆ hi b  ┆ hi b  ┆ hi b  ┆ true  ┆ false │
│ 3   ┆ c    ┆ 4    ┆ 4    ┆ 4    ┆ hi c  ┆ hi c  ┆ hi c  ┆ true  ┆ false │
└─────┴──────┴──────┴──────┴──────┴───────┴───────┴───────┴───────┴───────┘
some kind of broadcasting seems to be happening for
I think checking if some literal TEXT contains values of a column should definitely work for all values in that column. |
It seems to be an order of operations thing. Where |
@deanm0000 yeah you are probably right.
df.with_columns(
    word_col=pl.lit('bbb'),
).with_columns(
    contain_buggy=pl.lit("bbb").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
    contain_working=pl.col("word_col").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
)
shape: (3, 5)
┌─────┬──────┬──────────┬───────────────┬─────────────────┐
│ val ┆ text ┆ word_col ┆ contain_buggy ┆ contain_working │
│ --- ┆ ---  ┆ ---      ┆ ---           ┆ ---             │
│ i64 ┆ str  ┆ str      ┆ bool          ┆ bool            │
╞═════╪══════╪══════════╪═══════════════╪═════════════════╡
│ 1   ┆ a    ┆ bbb      ┆ false         ┆ false           │
│ 2   ┆ b    ┆ bbb      ┆ false         ┆ true            │
│ 3   ┆ c    ┆ bbb      ┆ false         ┆ false           │
└─────┴──────┴──────────┴───────────────┴─────────────────┘
I think in general, adding a constant literal value to a column and then using that column in a calculation should give the same result as using the literal directly. |
I think anything that accepts expressions behaves like this and needs broadcasting or a length check to be implemented "manually".
Just trying a few:
pl.select(pl.lit("bbb").str.starts_with(pl.Series(["a", "b", "c"])))
# shape: (1, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ bool    │
# ╞═════════╡
# │ false   │
# └─────────┘
Some also have length checks in place:
pl.select(pl.lit("bbb").str.extract_all(pl.Series(["a", "b", "c"])))
# ComputeError: pattern's length: 3 does not match that of the argument series: 1
Probably a good idea to test every expression function to see which ones behave this way. |
I'd like to revise my earlier guess about order of operations. I pulled up the fix for the temporal and saw that it's explicitly checking the length of each input. I think this is what needs to be fixed for from: polars/crates/polars-ops/src/chunked_array/strings
I think it needs to check the length of ca which is essentially the |
That does look like the spot @deanm0000. You can highlight the lines and copy the permalink to have GitHub inline the code.
polars/crates/polars-ops/src/chunked_array/strings/namespace.rs
Lines 116 to 117 in ab83c2a
Because
What the offset_by example does, is it switches around and uses
let src = ca.get(0).unwrap(); // this should be pattern matched, but unwrapped here for example purposes
pat.apply_values_generic(|pattern| {
    reg = ...;
    reg.is_match(src)
}) |
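The rule being discussed (check the length of both inputs, and when one side has length 1, apply it against every element of the other) can be modelled in plain Python. This is only an illustration of the broadcasting rule under that assumption, not Polars' Rust implementation; `contains_broadcast` is a hypothetical name, and it does literal substring matching rather than regex matching.

```python
def contains_broadcast(sources, patterns):
    """Element-wise literal `pattern in source`, broadcasting a length-1 side."""
    n_src, n_pat = len(sources), len(patterns)
    if n_src == n_pat:
        pairs = zip(sources, patterns)
    elif n_src == 1:
        # A single source (e.g. a literal) is checked against every pattern.
        pairs = ((sources[0], p) for p in patterns)
    elif n_pat == 1:
        # A single pattern is checked against every source.
        pairs = ((s, patterns[0]) for s in sources)
    else:
        raise ValueError(
            f"pattern's length: {n_pat} does not match that of the source: {n_src}"
        )
    # Propagate missing values instead of failing on them.
    return [None if s is None or p is None else p in s for s, p in pairs]


result = contains_broadcast(["bbb"], ["a", "b", "c"])
print(result)  # [False, True, False]
```

With this rule the literal `"bbb"` gives the same answer as the `contain_working` column earlier in the thread, and mismatched lengths raise instead of silently using only the first pattern.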
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
No response
Issue description
When filtering a DataFrame on whether a string literal contains the value in a column, the result behaves unpredictably.
Note that I would love to see unit test cases added for this, as it's odd that this wasn't noticed earlier (or it only just recently broke).
Expected behavior
See expected values in comments in the Reproducible example section
Installed versions