bug: re_extract not working as expected for big query #6167
Comments
Thanks for the issue @andresbonatto! I did some digging on this issue; here is what I found. According to the documentation on
The problem is that Python's
In the end, both are wrong 😵💫. The backend working as
import ibis
import pandas as pd

ibis.set_backend("pandas")
df = pd.DataFrame({"s": ["a|b|c", "b|a|c", "b|b|b|c|a"]})
t = ibis.memtable(df)
# index 0 asks for the whole match rather than a specific capture group
expr = t.s.re_extract(r"([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$", 0).execute()
print(expr)
Output
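The output of the snippet is not reproduced above. Purely as a point of comparison, here is a minimal sketch of what Python's own `re` module returns as the whole match (group 0) for the same pattern and strings. It assumes search semantics (the match may start anywhere in the string); the helper name `whole_match` is made up for illustration and is not part of Ibis.

```python
import re

# Pattern and sample strings copied from the snippet above.
PATTERN = r"([a-zA-Z0-9]+(?:\|[a-zA-Z0-9]+){0,1})$"
SAMPLES = ["a|b|c", "b|a|c", "b|b|b|c|a"]


def whole_match(pattern, text):
    """Return the entire matched substring (group 0), or None if nothing matches.

    Assumption: search semantics, so the match may start anywhere in `text`.
    """
    m = re.search(pattern, text)
    return m.group(0) if m else None


results = [whole_match(PATTERN, s) for s in SAMPLES]
print(results)  # ['b|c', 'a|c', 'c|a']
```

Each result is the last one or two `|`-separated tokens, i.e. the matched substring rather than the full input string.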
Hi @mesejo, thanks for answering!
As I understand it, it should be number 2, because this is the only interface in Ibis for extracting from a pattern. BQ already has the REGEXP_EXTRACT function for this, but we lose that simple function when we use Ibis.
Background
==========
The current definition of `re_extract` needs to be clarified. The documentation claims that it works as `re.match`, but it works differently:

> when index is zero and there's a match, return the entire string, otherwise
> return the content of the index-th match group.

A Python Match object returns the entire match, not the entire string.

Implementation
==============
This PR updates the documentation and the implementation of some of the backends to actually work as a Match object. Note that some backends were already doing this: pandas, sqlite, duckdb, postgresql. In the case of BigQuery the index parameter makes no sense, because it returns either the whole match or the first capturing group that matches.

fixes ibis-project#6167
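To make the difference between "entire string" and "entire match" concrete, here is a small sketch in plain Python. Both helper names are hypothetical and are not Ibis functions; `re.match` is used only because the description refers to it, and whether a given backend anchors at the start of the string is not settled here.

```python
import re


def extract_documented(text, pattern, index):
    """Behaviour as the old documentation words it:
    index 0 returns the entire *string* when there is a match."""
    m = re.match(pattern, text)
    if m is None:
        return None
    return text if index == 0 else m.group(index)


def extract_match_object(text, pattern, index):
    """Behaviour of a Python Match object:
    index 0 returns the entire *match*."""
    m = re.match(pattern, text)
    return None if m is None else m.group(index)


s = "123.456.789.notmatched"
p = r"([0-9]+)\.([0-9]+)\.([0-9]+)"
print(extract_documented(s, p, 0))    # 123.456.789.notmatched
print(extract_match_object(s, p, 0))  # 123.456.789
print(extract_match_object(s, p, 2))  # 456
```

The two helpers only disagree for index 0; for any positive index both return the corresponding capture group.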
An example from Python for my own reference:
In [17]: import re
...: my_string = "123.456.789.notmatched"
...: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_string)
In [18]: match.group(0)
Out[18]: '123.456.789'
In [19]: match.group(1)
Out[19]: '123'
In [20]: match.group(2)
Out[20]: '456'
In [21]: match.group(3)
Out[21]: '789'
In [22]: my_unmatched = "notmatched.123.456.789.notmatched"
In [23]: match = re.match(r"([0-9]+)\.([0-9]+)\.([0-9]+)", my_unmatched)
In [24]: match
In [25]: match is None
Out[25]: True
It is true that https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_extract does the right thing when there is no capturing group in the regex. I would prefer we only use
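For readers unfamiliar with the rule being referred to: as the linked BigQuery documentation describes it, REGEXP_EXTRACT returns the entire matching substring when the regex has no capturing group, returns the substring matched by the group when there is exactly one, and errors when there is more than one. The following is only a rough Python emulation of that rule. The function name is invented for illustration, and Python's `re` engine stands in for RE2, which BigQuery actually uses, so pattern syntax support differs.

```python
import re


def regexp_extract_like(text, pattern):
    """Rough emulation of the REGEXP_EXTRACT capture-group rule.

    - no capturing group: return the entire matching substring
    - one capturing group: return the substring matched by that group
    - more than one capturing group: raise an error
    Returns None when nothing matches (standing in for SQL NULL).
    """
    compiled = re.compile(pattern)
    if compiled.groups > 1:
        raise ValueError("at most one capturing group is allowed")
    m = compiled.search(text)
    if m is None:
        return None
    return m.group(1) if compiled.groups == 1 else m.group(0)


print(regexp_extract_like("a|b|c", r"[a-z]+\|[a-z]+$"))    # b|c
print(regexp_extract_like("a|b|c", r"([a-z]+)\|[a-z]+$"))  # b
```

Under this rule there is no way to select an arbitrary group by index, which is why an `index` argument does not map cleanly onto it.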
What happened?
For the DuckDB backend, the code works as expected:
But when I run it on BigQuery (BQ), I get a different result:
The compiled SQL statements are very different:
Duckdb:
BQ:
What version of ibis are you using?
5.1.0
What backend(s) are you using, if any?
BigQuery
Relevant log output
No response