bug: re_extract not working as expected for big query #6167

andresbonatto · 2023-05-07T17:39:10Z

What happened?

For duckdb backend, the code works as expected:

import ibis
ibis.set_backend("duckdb")
t = ibis.memtable({"s": ["a|b|c", "b|a|c", "b|b|b|c|a"]})
expr = t.s.re_extract(get_regex("|", 1), 0).execute()
0    b|c
1    a|c
2    c|a
Name: RegexExtract(s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$', 0), dtype: object

But when I run in BQ, I get a different result:

0 | a|b|c|  
1 | b|a|c|  
2 | b|b|b|c|a

The compiled SQL statements are very different:
Duckdb:

CASE
    WHEN REGEXP_MATCHES(t0.s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$')
    THEN REGEXP_EXTRACT(t0.s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$')
    ELSE ''
  END

BQ:

CASE
    WHEN REGEXP_CONTAINS(t0.`s`, '([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$')
    THEN CASE
      WHEN COALESCE(0, 0) = 0
      THEN t0.`s`
      ELSE REGEXP_EXTRACT_ALL(t0.`s`, '([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$')[SAFE_ORDINAL(0)]
    END
    ELSE NULL
  END

What version of ibis are you using?

5.1.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

mesejo · 2023-05-08T11:13:51Z

Thanks for the issue @andresbonatto! I did some digging on this issue; here is what I found.

According to the documentation on re_extract, it is BigQuery the one working as expected (emphasis mine):

The behavior of this function follows the behavior of Python's re.match: when index is zero and there's a match, return the entire string, otherwise return the content of the index-th match group.

The problem is that Python's re.match doesn't work as described. It returns a match object if:

If zero or more characters at the beginning of string match this regular expression

In the end, both are wrong 😵‍💫. The backend working as re.match is pandas, and it returns all NA:

ibis.set_backend("pandas")
df = pd.DataFrame({"s": ["a|b|c", "b|a|c", "b|b|b|c|a"]})
t = ibis.memtable(df)
expr = t.s.re_extract(r"([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$", 0).execute()
print(expr)

Output

0    NaN
1    NaN
2    NaN
Name: RegexExtract(s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$', 0), dtype: object

andresbonatto · 2023-05-08T12:59:14Z

Hi @mesejo, thanks for answering!
Is this really the intended behavior?
when index is zero and there's a match, should it return

the entire string or
the entire match
?

In my understanding it should be number 2 because it's the only interface in Ibis to extract from a pattern. BQ already has the function REGEXP_EXTRACT to do this but we lose this simple function when we use Ibis.

Background ========== The current definition of `re_extract` needs to be clarified. The documentation claims that it works as `re.match`, but it works differently: > when index is zero and there's a match, return the entire string, otherwise > return the content of the index-th match group. A Python's Match Object returns the entire match, not the entire string. Implementation ============== This PR updates the documentation and the implementation of some of the backends to actually work as a Match Object, note that some backends were already doing this: pandas, sqlite, duckdb, postgresql. In the case of BigQuery the index parameter makes no sense because it returns either the whole match of the first capturing group that matches. fixes ibis-project#6167

tswast · 2023-05-11T15:21:23Z

An example from Python for my own reference:

In [17]: import re
    ...: my_string = "123.456.789.notmatched"
    ...: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_string)

In [18]: match.group(0)
Out[18]: '123.456.789'

In [19]: match.group(1)
Out[19]: '123'

In [20]: match.group(2)
Out[20]: '456'

In [21]: match.group(3)
Out[21]: '789'

In [22]: my_unmatched = "notmatched.123.456.789.notmatched"

In [23]: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_unmatched)

In [24]: match

In [25]: match is None
Out[25]: True

tswast · 2023-05-11T15:25:04Z

It is true that https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract does the right thing when there is no capturing group in the regex. I would prefer we only use REGEXP_EXTRACT when index = 0.

Background ========== The current definition of `re_extract` needs to be clarified. The documentation claims that it works as `re.match`, but it works differently: > when index is zero and there's a match, return the entire string, otherwise > return the content of the index-th match group. A Python's Match Object returns the entire match, not the entire string. Implementation ============== This PR updates the documentation and the implementation of some of the backends to actually work as a Match Object, note that some backends were already doing this: pandas, sqlite, duckdb, postgresql. In the case of BigQuery the index parameter makes no sense because it returns either the whole match of the first capturing group that matches. fixes ibis-project#6167

andresbonatto added the bug Incorrect behavior inside of ibis label May 7, 2023

mesejo mentioned this issue May 9, 2023

fix(common): fix bug in re_extract #6172

Closed

kszucs closed this as completed in 6ebaeab May 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: re_extract not working as expected for big query #6167

bug: re_extract not working as expected for big query #6167

andresbonatto commented May 7, 2023

mesejo commented May 8, 2023

andresbonatto commented May 8, 2023 •

edited

tswast commented May 11, 2023

tswast commented May 11, 2023

bug: re_extract not working as expected for big query #6167

bug: re_extract not working as expected for big query #6167

Comments

andresbonatto commented May 7, 2023

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

mesejo commented May 8, 2023

andresbonatto commented May 8, 2023 • edited

tswast commented May 11, 2023

tswast commented May 11, 2023

andresbonatto commented May 8, 2023 •

edited