`df.filter( pl.lit("some_str").str.contains( pl.col(...) )` does not filter properly #12632

DeflateAwning · 2023-11-22T16:08:30Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame([
	{'name': 'COMPANY A', 'id': 1},
	{'name': 'COMPANY B', 'id': 2},
	{'name': 'COMPANY C', 'id': 3},
])

# this works as expected (returns all rows)
df.filter(
    pl.col('name').str.contains(pl.lit('COMPANY'), literal=True)
)

# this works as expected (returns row with id=2 only)
df.filter(
    pl.col('name').str.contains(pl.lit('B'), literal=True)
)

# this does not work!
# you would expect it to filter the df to row with id=1, but instead it returns all rows
df.filter(
    pl.lit('COMPANY A PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
)

# this does not work!
# you would expect it to filter the df to row with id=2, but instead it returns no rows
df.filter(
    pl.lit('COMPANY B PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
)

Log output

No response

Issue description

When filtering a dataframe to a string literal containing the value within a column, the result behaves unpredictably.

Note that I would love to see unit test cases added for this, as it's quirky it wasn't noticed earlier (or only just recently broke)

Expected behavior

See expected values in comments in the Reproducable Example section

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            Linux-6.1.0-1026-oem-x86_64-with-glibc2.35
Python:              3.9.18 (main, Aug 25 2023, 13:20:14) 
[GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         3.0.0
connectorx:          0.3.2
deltalake:           <not installed>
fsspec:              2023.10.0
gevent:              <not installed>
matplotlib:          3.8.1
numpy:               1.26.2
openpyxl:            3.0.10
pandas:              1.5.3
pyarrow:             11.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              1.0.10
sqlalchemy:          1.4.50
xlsx2csv:            0.8.1
xlsxwriter:          3.0.9

The text was updated successfully, but these errors were encountered:

MarcoGorelli · 2023-11-22T16:20:16Z

Thanks for the report - if you run

In [59]: df.select(
    ...:     pl.lit('COMPANY A PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
    ...: )
Out[59]:
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ bool    │
╞═════════╡
│ true    │
└─────────┘

then the output is a single row

What you probably want to do is add an extra column first, so it gets broadcast to the right length, before you do the filtering:

In [64]: df.with_columns(name_with_extra_text=pl.lit('COMPANY A PLUS EXTRA TEXT')).filter(
    ...:     pl.col('name_with_extra_text').str.contains(pl.col('name'), literal=True)
    ...: )
Out[64]:
shape: (1, 3)
┌───────────┬─────┬───────────────────────────┐
│ name      ┆ id  ┆ name_with_extra_text      │
│ ---       ┆ --- ┆ ---                       │
│ str       ┆ i64 ┆ str                       │
╞═══════════╪═════╪═══════════════════════════╡
│ COMPANY A ┆ 1   ┆ COMPANY A PLUS EXTRA TEXT │
└───────────┴─────┴───────────────────────────┘

Having said that, maybe Expr.str.contains should raise if pattern is longer than self, or broadcast to the length of pattern? @reswqa

DeflateAwning · 2023-11-22T16:45:02Z

I think that it should broadcast to the length. My intuition suggests that what I wrote should function properly.

This feel like the sort of "bug" (design decision) that bothered me to no end about pandas, which polars seemed to solve with their API. This is the first time I've encountered this sort of funky behaviour in polars, where serieses in expressions don't work out to be the correct length.

deanm0000 · 2023-11-22T17:34:47Z

I brought this one up a while back #8120

Julian-J-S · 2023-11-22T17:42:57Z

yeah just tested some variations myself and there is definately some unexpected behaviour

(
    pl.DataFrame({
        'text': 'hello world!!',
        'word': ['hello', 'world', 'hi']
    })
    .with_columns(
        contains_col=pl.col('text').str.contains(pl.col('word'), literal=True),
        contains_lit=pl.lit('hello world!!').str.contains(pl.col('word'), literal=True),
        contains_lit2=pl.lit('XXXXhelloXXXXX').str.contains(pl.col('word'), literal=True),
        contains_lit3=pl.lit('world 123').str.contains(pl.col('word'), literal=True),
    )
)
┌───────────────┬───────┬──────────────┬──────────────┬───────────────┬───────────────┐
│ text          ┆ word  ┆ contains_col ┆ contains_lit ┆ contains_lit2 ┆ contains_lit3 │
│ ---           ┆ ---   ┆ ---          ┆ ---          ┆ ---           ┆ ---           │
│ str           ┆ str   ┆ bool         ┆ bool         ┆ bool          ┆ bool          │
╞═══════════════╪═══════╪══════════════╪══════════════╪═══════════════╪═══════════════╡
│ hello world!! ┆ hello ┆ true         ┆ true         ┆ true          ┆ false         │
│ hello world!! ┆ world ┆ true         ┆ true         ┆ true          ┆ false         │
│ hello world!! ┆ hi    ┆ false        ┆ true         ┆ true          ┆ false         │
└───────────────┴───────┴──────────────┴──────────────┴───────────────┴───────────────┘

pl.lit(<TEXT>).str.contains(pl.col(<COL>), literal=True) only seems to check if TEXT contains the first value of COL
It will apparently ignore all other values in the specified column. This is hardy ever what you want/expect 😄

deanm0000 · 2023-11-22T18:05:35Z

Here are some possibly relevant/related issues:

These 2 seem to suggest it's intended
#9713
#9712

These 2 suggest a fix is warranted
#5971
#10530

Julian-J-S · 2023-11-22T18:39:02Z

I genuinely cannot understand why this should be expected behaviour. Can anyone explain?

df = pl.DataFrame({
    'val': [1, 2, 3],
    'text': ['a', 'b', 'c']
})
df.with_columns(
    val1=1 + pl.col("val"),
    val2=pl.lit(1) + pl.col("val"),
    val3=pl.lit(1).add(pl.col("val")),
    text1="hi " + pl.col("text"),
    text2=pl.lit("hi ") + pl.col("text"),
    text3=pl.lit("hi ").add(pl.col("text")),
    text4=pl.lit("aaa").str.contains(pl.col("text")),  # Only checking first column value "a"
    text5=pl.lit("bbb").str.contains(pl.col("text")),  # Only checking first column value "a"
)
shape: (3, 10)
┌─────┬──────┬──────┬──────┬──────┬───────┬───────┬───────┬───────┬───────┐
│ val ┆ text ┆ val1 ┆ val2 ┆ val3 ┆ text1 ┆ text2 ┆ text3 ┆ text4 ┆ text5 │
│ --- ┆ ---  ┆ ---  ┆ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---   ┆ ---   │
│ i64 ┆ str  ┆ i64  ┆ i64  ┆ i64  ┆ str   ┆ str   ┆ str   ┆ bool  ┆ bool  │
╞═════╪══════╪══════╪══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1   ┆ a    ┆ 2    ┆ 2    ┆ 2    ┆ hi a  ┆ hi a  ┆ hi a  ┆ true  ┆ false │
│ 2   ┆ b    ┆ 3    ┆ 3    ┆ 3    ┆ hi b  ┆ hi b  ┆ hi b  ┆ true  ┆ false │
│ 3   ┆ c    ┆ 4    ┆ 4    ┆ 4    ┆ hi c  ┆ hi c  ┆ hi c  ┆ true  ┆ false │
└─────┴──────┴──────┴──────┴──────┴───────┴───────┴───────┴───────┴───────┘

some kind of broadcasting seems to be happening for lit(1)... and lit("hi")...
why does it "work" there but other logic is applied with lit().contains?

I think checking if some literal TEXT contains values of a column should definitely work for all values in that column.
Just checking the first value in the column and broadcasting that result is like intentionally seeing a problem/bug but hiding it from the user. 🤔
Please feel free to convince me from the opposite!

deanm0000 · 2023-11-22T18:55:55Z

It seems to be an order of operations thing. Where .str.contains is happening before broadcast but +, .add, etc are after broadcast.

Julian-J-S · 2023-11-22T19:07:23Z

@deanm0000 yeah you are probably right.
I think this is a bug/unexpected and should be fixed.
Here another example of why this is "problematic"

df.with_columns(
    word_col=pl.lit('bbb'),
).with_columns(
    contain_buggy=pl.lit("bbb").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
    contain_working=pl.col("word_col").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
)
shape: (3, 5)
┌─────┬──────┬──────────┬───────────────┬─────────────────┐
│ val ┆ text ┆ word_col ┆ contain_buggy ┆ contain_working │
│ --- ┆ ---  ┆ ---      ┆ ---           ┆ ---             │
│ i64 ┆ str  ┆ str      ┆ bool          ┆ bool            │
╞═════╪══════╪══════════╪═══════════════╪═════════════════╡
│ 1   ┆ a    ┆ bbb      ┆ false         ┆ false           │
│ 2   ┆ b    ┆ bbb      ┆ false         ┆ true            │
│ 3   ┆ c    ┆ bbb      ┆ false         ┆ false           │
└─────┴──────┴──────────┴───────────────┴─────────────────┘

I think in general adding a constant literal value to a column and afterwards using that in a calculation should have the same result than using that literal directly.

cmdlineluser · 2023-11-22T19:43:56Z

I think anything that accept expressions behaves like this and needs broadcasting or a length check to be implemented "manually".

.dt.offset_by is the most recent example that I recall being fixed: #11088

Just trying a few, .str.starts_with, .str.ends_with, .str.strip_chars all behave this way:

pl.select(pl.lit("bbb").str.starts_with(pl.Series(["a", "b", "c"])))
# shape: (1, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ bool    │
# ╞═════════╡
# │ false   │
# └─────────┘

Some also have length checks in place:

pl.select(pl.lit("bbb").str.extract_all(pl.Series(["a", "b", "c"])))
# ComputeError: pattern's length: 3 does not match that of the argument series: 1

Probably a good idea to test every expression function to see which ones behave this way.

deanm0000 · 2023-11-22T20:36:34Z

I'd like to revise my earlier guess about order of operations. I pulled up the fix for the temporal and saw that it's explicitly checking the length of each input.

I think this is what needs to be fixed for str.contains

from: polars/crates/polars-ops/src/chunked_array/strings
/namespace.rs

    fn contains_chunked(
        &self,
        pat: &Utf8Chunked,
        literal: bool,
        strict: bool,
    ) -> PolarsResult<BooleanChunked> {
        let ca = self.as_utf8();
        match pat.len() {
            1 => match pat.get(0) {
                Some(pat) => {
                    if literal {
                        ca.contains_literal(pat)
                    } else {
                        ca.contains(pat, strict)
                    }
                },
                None => Ok(BooleanChunked::full_null(ca.name(), ca.len())),
            },
            _ => {
                if literal {
                    Ok(binary_elementwise_values(ca, pat, |src, pat| {
                        src.contains(pat)
                    }))
                } else if strict {
                    // A sqrt(n) regex cache is not too small, not too large.
                    let mut reg_cache = FastFixedCache::new((ca.len() as f64).sqrt() as usize);
                    try_binary_elementwise(ca, pat, |opt_src, opt_pat| match (opt_src, opt_pat) {
                        (Some(src), Some(pat)) => {
                            let reg = reg_cache.try_get_or_insert_with(pat, |p| Regex::new(p))?;
                            Ok(Some(reg.is_match(src)))
                        },
                        _ => Ok(None),
                    })
                } else {
                    // A sqrt(n) regex cache is not too small, not too large.
                    let mut reg_cache = FastFixedCache::new((ca.len() as f64).sqrt() as usize);
                    Ok(binary_elementwise(
                        ca,
                        pat,
                        infer_re_match(|src, pat| {
                            let reg = reg_cache.try_get_or_insert_with(pat?, |p| Regex::new(p));
                            Some(reg.ok()?.is_match(src?))
                        }),
                    ))
                }
            },
        }
    }

I think it needs to check the length of ca which is essentially the pl.lit() of the above examples. I don't really know rust I'm just pattern matching so I don't know how to actually fix it unfortunately.

cmdlineluser · 2023-11-22T22:45:40Z

That does look like the spot @deanm0000

You can highlight the lines and copy the permalink to have Github inline the code.

polars/crates/polars-ops/src/chunked_array/strings/namespace.rs

Lines 116 to 117 in ab83c2a

    
           if literal { 
        
               Ok(binary_elementwise_values(ca, pat, |src, pat| {

binary_elementwise_values() behaves like zip() in Python.

Because ca has a length of 1 it only processes 1 item from pat

What the offset_by example does, is it switches around and uses rhs.apply_values_generic instead when lhs.len() == 1

let src = ca.get(0).unwrap() # this should be pattern matched, but for example purposes
pat.apply_values_generic(|pattern| {
    reg = ...;
    reg.is_match(src)
})

DeflateAwning added bug Something isn't working python Related to Python Polars labels Nov 22, 2023

nameexhaustion mentioned this issue Nov 29, 2023

fix: broadcasting of unit LHS in string operations #12737

Merged

orlp closed this as completed in #12737 Dec 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`df.filter( pl.lit("some_str").str.contains( pl.col(...) )` does not filter properly #12632

`df.filter( pl.lit("some_str").str.contains( pl.col(...) )` does not filter properly #12632

DeflateAwning commented Nov 22, 2023

MarcoGorelli commented Nov 22, 2023

DeflateAwning commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

Julian-J-S commented Nov 22, 2023 •

edited

Loading

deanm0000 commented Nov 22, 2023 •

edited

Loading

Julian-J-S commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

Julian-J-S commented Nov 22, 2023

cmdlineluser commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

cmdlineluser commented Nov 22, 2023

df.filter( pl.lit("some_str").str.contains( pl.col(...) ) does not filter properly #12632

df.filter( pl.lit("some_str").str.contains( pl.col(...) ) does not filter properly #12632

Comments

DeflateAwning commented Nov 22, 2023

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

MarcoGorelli commented Nov 22, 2023

DeflateAwning commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

Julian-J-S commented Nov 22, 2023 • edited Loading

deanm0000 commented Nov 22, 2023 • edited Loading

Julian-J-S commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

Julian-J-S commented Nov 22, 2023

cmdlineluser commented Nov 22, 2023

deanm0000 commented Nov 22, 2023

cmdlineluser commented Nov 22, 2023

`df.filter( pl.lit("some_str").str.contains( pl.col(...) )` does not filter properly #12632

`df.filter( pl.lit("some_str").str.contains( pl.col(...) )` does not filter properly #12632

Julian-J-S commented Nov 22, 2023 •

edited

Loading

deanm0000 commented Nov 22, 2023 •

edited

Loading