Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: re_extract not working as expected for big query #6167

Closed
1 task done
andresbonatto opened this issue May 7, 2023 · 4 comments
Closed
1 task done

bug: re_extract not working as expected for big query #6167

andresbonatto opened this issue May 7, 2023 · 4 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@andresbonatto
Copy link

What happened?

For duckdb backend, the code works as expected:

import ibis
ibis.set_backend("duckdb")
t = ibis.memtable({"s": ["a|b|c", "b|a|c", "b|b|b|c|a"]})
expr = t.s.re_extract(get_regex("|", 1), 0).execute()
0    b|c
1    a|c
2    c|a
Name: RegexExtract(s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$', 0), dtype: object

But when I run in BQ, I get a different result:

0 | a|b|c|  
1 | b|a|c|  
2 | b|b|b|c|a

The compiled SQL statements are very different:
Duckdb:

CASE
    WHEN REGEXP_MATCHES(t0.s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$')
    THEN REGEXP_EXTRACT(t0.s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$')
    ELSE ''
  END

BQ:

CASE
    WHEN REGEXP_CONTAINS(t0.`s`, '([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$')
    THEN CASE
      WHEN COALESCE(0, 0) = 0
      THEN t0.`s`
      ELSE REGEXP_EXTRACT_ALL(t0.`s`, '([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$')[SAFE_ORDINAL(0)]
    END
    ELSE NULL
  END

What version of ibis are you using?

5.1.0

What backend(s) are you using, if any?

BigQuery

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@andresbonatto andresbonatto added the bug Incorrect behavior inside of ibis label May 7, 2023
@mesejo
Copy link
Contributor

mesejo commented May 8, 2023

Thanks for the issue @andresbonatto! I did some digging on this issue; here is what I found.

According to the documentation on re_extract, it is BigQuery the one working as expected (emphasis mine):

The behavior of this function follows the behavior of Python's re.match: when index is zero and there's a match, return the entire string, otherwise return the content of the index-th match group.

The problem is that Python's re.match doesn't work as described. It returns a match object if:

If zero or more characters at the beginning of string match this regular expression

In the end, both are wrong 😵‍💫. The backend working as re.match is pandas, and it returns all NA:

ibis.set_backend("pandas")
df = pd.DataFrame({"s": ["a|b|c", "b|a|c", "b|b|b|c|a"]})
t = ibis.memtable(df)
expr = t.s.re_extract(r"([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$", 0).execute()
print(expr)

Output

0    NaN
1    NaN
2    NaN
Name: RegexExtract(s, '([a-zA-Z0-9]+(?:\\|[a-zA-Z0-9]+){0,1})$', 0), dtype: object

@andresbonatto
Copy link
Author

andresbonatto commented May 8, 2023

Hi @mesejo, thanks for answering!
Is this really the intended behavior?
when index is zero and there's a match, should it return

  1. the entire string or
  2. the entire match
    ?

In my understanding it should be number 2 because it's the only interface in Ibis to extract from a pattern. BQ already has the function REGEXP_EXTRACT to do this but we lose this simple function when we use Ibis.

mesejo added a commit to mesejo/ibis that referenced this issue May 9, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 9, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 9, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 10, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 10, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
@tswast
Copy link
Contributor

tswast commented May 11, 2023

An example from Python for my own reference:

In [17]: import re
    ...: my_string = "123.456.789.notmatched"
    ...: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_string)

In [18]: match.group(0)
Out[18]: '123.456.789'

In [19]: match.group(1)
Out[19]: '123'

In [20]: match.group(2)
Out[20]: '456'

In [21]: match.group(3)
Out[21]: '789'

In [22]: my_unmatched = "notmatched.123.456.789.notmatched"

In [23]: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_unmatched)

In [24]: match

In [25]: match is None
Out[25]: True

@tswast
Copy link
Contributor

tswast commented May 11, 2023

It is true that https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract does the right thing when there is no capturing group in the regex. I would prefer we only use REGEXP_EXTRACT when index = 0.

mesejo added a commit to mesejo/ibis that referenced this issue May 11, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 19, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
mesejo added a commit to mesejo/ibis that referenced this issue May 19, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
cpcloud pushed a commit to cpcloud/ibis that referenced this issue May 22, 2023
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation
claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python's Match Object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends
to actually work as a Match Object, note that some backends were already doing
this: pandas, sqlite, duckdb, postgresql.

In the case of BigQuery the index parameter makes no sense because it returns either
the whole match of the first capturing group that matches.

fixes ibis-project#6167
@kszucs kszucs closed this as completed in 6ebaeab May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants