Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

df.filter( pl.lit("some_str").str.contains( pl.col(...) ) does not filter properly #12632

Closed
2 tasks done
DeflateAwning opened this issue Nov 22, 2023 · 11 comments · Fixed by #12737
Closed
2 tasks done
Labels
bug Something isn't working python Related to Python Polars

Comments

@DeflateAwning
Copy link
Contributor

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame([
	{'name': 'COMPANY A', 'id': 1},
	{'name': 'COMPANY B', 'id': 2},
	{'name': 'COMPANY C', 'id': 3},
])

# this works as expected (returns all rows)
df.filter(
    pl.col('name').str.contains(pl.lit('COMPANY'), literal=True)
)

# this works as expected (returns row with id=2 only)
df.filter(
    pl.col('name').str.contains(pl.lit('B'), literal=True)
)

# this does not work!
# you would expect it to filter the df to row with id=1, but instead it returns all rows
df.filter(
    pl.lit('COMPANY A PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
)

# this does not work!
# you would expect it to filter the df to row with id=2, but instead it returns no rows
df.filter(
    pl.lit('COMPANY B PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
)

Log output

No response

Issue description

When filtering a dataframe to a string literal containing the value within a column, the result behaves unpredictably.

Note that I would love to see unit test cases added for this, as it's quirky it wasn't noticed earlier (or only just recently broke)

Expected behavior

See expected values in comments in the Reproducable Example section

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              0.19.13
Index type:          UInt32
Platform:            Linux-6.1.0-1026-oem-x86_64-with-glibc2.35
Python:              3.9.18 (main, Aug 25 2023, 13:20:14) 
[GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         3.0.0
connectorx:          0.3.2
deltalake:           <not installed>
fsspec:              2023.10.0
gevent:              <not installed>
matplotlib:          3.8.1
numpy:               1.26.2
openpyxl:            3.0.10
pandas:              1.5.3
pyarrow:             11.0.0
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              1.0.10
sqlalchemy:          1.4.50
xlsx2csv:            0.8.1
xlsxwriter:          3.0.9
@DeflateAwning DeflateAwning added bug Something isn't working python Related to Python Polars labels Nov 22, 2023
@MarcoGorelli
Copy link
Collaborator

Thanks for the report - if you run

In [59]: df.select(
    ...:     pl.lit('COMPANY A PLUS EXTRA TEXT').str.contains(pl.col('name'), literal=True)
    ...: )
Out[59]:
shape: (1, 1)
┌─────────┐
│ literal │
│ ---     │
│ bool    │
╞═════════╡
│ true    │
└─────────┘

then the output is a single row

What you probably want to do is add an extra column first, so it gets broadcast to the right length, before you do the filtering:

In [64]: df.with_columns(name_with_extra_text=pl.lit('COMPANY A PLUS EXTRA TEXT')).filter(
    ...:     pl.col('name_with_extra_text').str.contains(pl.col('name'), literal=True)
    ...: )
Out[64]:
shape: (1, 3)
┌───────────┬─────┬───────────────────────────┐
│ nameidname_with_extra_text      │
│ ---------                       │
│ stri64str                       │
╞═══════════╪═════╪═══════════════════════════╡
│ COMPANY A1COMPANY A PLUS EXTRA TEXT │
└───────────┴─────┴───────────────────────────┘

Having said that, maybe Expr.str.contains should raise if pattern is longer than self, or broadcast to the length of pattern? @reswqa

@DeflateAwning
Copy link
Contributor Author

I think that it should broadcast to the length. My intuition suggests that what I wrote should function properly.

This feel like the sort of "bug" (design decision) that bothered me to no end about pandas, which polars seemed to solve with their API. This is the first time I've encountered this sort of funky behaviour in polars, where serieses in expressions don't work out to be the correct length.

@deanm0000
Copy link
Collaborator

I brought this one up a while back #8120

@Julian-J-S
Copy link
Contributor

Julian-J-S commented Nov 22, 2023

yeah just tested some variations myself and there is definately some unexpected behaviour

(
    pl.DataFrame({
        'text': 'hello world!!',
        'word': ['hello', 'world', 'hi']
    })
    .with_columns(
        contains_col=pl.col('text').str.contains(pl.col('word'), literal=True),
        contains_lit=pl.lit('hello world!!').str.contains(pl.col('word'), literal=True),
        contains_lit2=pl.lit('XXXXhelloXXXXX').str.contains(pl.col('word'), literal=True),
        contains_lit3=pl.lit('world 123').str.contains(pl.col('word'), literal=True),
    )
)
┌───────────────┬───────┬──────────────┬──────────────┬───────────────┬───────────────┐
│ textwordcontains_colcontains_litcontains_lit2contains_lit3 │
│ ------------------           │
│ strstrboolboolboolbool          │
╞═══════════════╪═══════╪══════════════╪══════════════╪═══════════════╪═══════════════╡
│ hello world!! ┆ hellotruetruetruefalse         │
│ hello world!! ┆ worldtruetruetruefalse         │
│ hello world!! ┆ hifalsetruetruefalse         │
└───────────────┴───────┴──────────────┴──────────────┴───────────────┴───────────────┘

pl.lit(<TEXT>).str.contains(pl.col(<COL>), literal=True) only seems to check if TEXT contains the first value of COL
It will apparently ignore all other values in the specified column. This is hardy ever what you want/expect 😄

@deanm0000
Copy link
Collaborator

deanm0000 commented Nov 22, 2023

Here are some possibly relevant/related issues:

These 2 seem to suggest it's intended
#9713
#9712

These 2 suggest a fix is warranted
#5971
#10530

@Julian-J-S
Copy link
Contributor

I genuinely cannot understand why this should be expected behaviour. Can anyone explain?

df = pl.DataFrame({
    'val': [1, 2, 3],
    'text': ['a', 'b', 'c']
})
df.with_columns(
    val1=1 + pl.col("val"),
    val2=pl.lit(1) + pl.col("val"),
    val3=pl.lit(1).add(pl.col("val")),
    text1="hi " + pl.col("text"),
    text2=pl.lit("hi ") + pl.col("text"),
    text3=pl.lit("hi ").add(pl.col("text")),
    text4=pl.lit("aaa").str.contains(pl.col("text")),  # Only checking first column value "a"
    text5=pl.lit("bbb").str.contains(pl.col("text")),  # Only checking first column value "a"
)
shape: (3, 10)
┌─────┬──────┬──────┬──────┬──────┬───────┬───────┬───────┬───────┬───────┐
│ valtextval1val2val3text1text2text3text4text5 │
│ ------------------------------   │
│ i64stri64i64i64strstrstrboolbool  │
╞═════╪══════╪══════╪══════╪══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ 1a222hi ahi ahi atruefalse │
│ 2b333hi bhi bhi btruefalse │
│ 3c444hi chi chi ctruefalse │
└─────┴──────┴──────┴──────┴──────┴───────┴───────┴───────┴───────┴───────┘

some kind of broadcasting seems to be happening for lit(1)... and lit("hi")...
why does it "work" there but other logic is applied with lit().contains?

I think checking if some literal TEXT contains values of a column should definitely work for all values in that column.
Just checking the first value in the column and broadcasting that result is like intentionally seeing a problem/bug but hiding it from the user. 🤔
Please feel free to convince me from the opposite!

@deanm0000
Copy link
Collaborator

It seems to be an order of operations thing. Where .str.contains is happening before broadcast but +, .add, etc are after broadcast.

@Julian-J-S
Copy link
Contributor

@deanm0000 yeah you are probably right.
I think this is a bug/unexpected and should be fixed.
Here another example of why this is "problematic"

df.with_columns(
    word_col=pl.lit('bbb'),
).with_columns(
    contain_buggy=pl.lit("bbb").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
    contain_working=pl.col("word_col").str.contains(pl.col("text")),  # these 2 rows "mean" the same and should be identical
)
shape: (3, 5)
┌─────┬──────┬──────────┬───────────────┬─────────────────┐
│ valtextword_colcontain_buggycontain_working │
│ ---------------             │
│ i64strstrboolbool            │
╞═════╪══════╪══════════╪═══════════════╪═════════════════╡
│ 1abbbfalsefalse           │
│ 2bbbbfalsetrue            │
│ 3cbbbfalsefalse           │
└─────┴──────┴──────────┴───────────────┴─────────────────┘

I think in general adding a constant literal value to a column and afterwards using that in a calculation should have the same result than using that literal directly.

@cmdlineluser
Copy link
Contributor

I think anything that accept expressions behaves like this and needs broadcasting or a length check to be implemented "manually".

.dt.offset_by is the most recent example that I recall being fixed: #11088

Just trying a few, .str.starts_with, .str.ends_with, .str.strip_chars all behave this way:

pl.select(pl.lit("bbb").str.starts_with(pl.Series(["a", "b", "c"])))
# shape: (1, 1)
# ┌─────────┐
# │ literal │
# │ ---     │
# │ bool    │
# ╞═════════╡
# │ false   │
# └─────────┘

Some also have length checks in place:

pl.select(pl.lit("bbb").str.extract_all(pl.Series(["a", "b", "c"])))
# ComputeError: pattern's length: 3 does not match that of the argument series: 1

Probably a good idea to test every expression function to see which ones behave this way.

@deanm0000
Copy link
Collaborator

I'd like to revise my earlier guess about order of operations. I pulled up the fix for the temporal and saw that it's explicitly checking the length of each input.

I think this is what needs to be fixed for str.contains

from: polars/crates/polars-ops/src/chunked_array/strings
/namespace.rs

    fn contains_chunked(
        &self,
        pat: &Utf8Chunked,
        literal: bool,
        strict: bool,
    ) -> PolarsResult<BooleanChunked> {
        let ca = self.as_utf8();
        match pat.len() {
            1 => match pat.get(0) {
                Some(pat) => {
                    if literal {
                        ca.contains_literal(pat)
                    } else {
                        ca.contains(pat, strict)
                    }
                },
                None => Ok(BooleanChunked::full_null(ca.name(), ca.len())),
            },
            _ => {
                if literal {
                    Ok(binary_elementwise_values(ca, pat, |src, pat| {
                        src.contains(pat)
                    }))
                } else if strict {
                    // A sqrt(n) regex cache is not too small, not too large.
                    let mut reg_cache = FastFixedCache::new((ca.len() as f64).sqrt() as usize);
                    try_binary_elementwise(ca, pat, |opt_src, opt_pat| match (opt_src, opt_pat) {
                        (Some(src), Some(pat)) => {
                            let reg = reg_cache.try_get_or_insert_with(pat, |p| Regex::new(p))?;
                            Ok(Some(reg.is_match(src)))
                        },
                        _ => Ok(None),
                    })
                } else {
                    // A sqrt(n) regex cache is not too small, not too large.
                    let mut reg_cache = FastFixedCache::new((ca.len() as f64).sqrt() as usize);
                    Ok(binary_elementwise(
                        ca,
                        pat,
                        infer_re_match(|src, pat| {
                            let reg = reg_cache.try_get_or_insert_with(pat?, |p| Regex::new(p));
                            Some(reg.ok()?.is_match(src?))
                        }),
                    ))
                }
            },
        }
    }

I think it needs to check the length of ca which is essentially the pl.lit() of the above examples. I don't really know rust I'm just pattern matching so I don't know how to actually fix it unfortunately.

@cmdlineluser
Copy link
Contributor

That does look like the spot @deanm0000

You can highlight the lines and copy the permalink to have Github inline the code.

image

if literal {
Ok(binary_elementwise_values(ca, pat, |src, pat| {

binary_elementwise_values() behaves like zip() in Python.

Because ca has a length of 1 it only processes 1 item from pat

What the offset_by example does, is it switches around and uses rhs.apply_values_generic instead when lhs.len() == 1

let src = ca.get(0).unwrap() # this should be pattern matched, but for example purposes
pat.apply_values_generic(|pattern| {
    reg = ...;
    reg.is_match(src)
})

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants