test(strings): improve selected string tests #10020
A hand-crafted table with some common problematic properties for testing string methods.
Ok, I've fixed
I'm going to open issues to track the other proposed work in this PR description and handle them separately, since some of those might be breaking changes, while everything here is strictly a bugfix.
🚢 it!
Doing a rebase merge to pick up all the individual fixes.
BigQuery:
- `rstrip` and `lstrip` are broken because of a SQLGlot bug, already fixed upstream.

Snowflake:
- `rstrip` and `lstrip` are broken because of a SQLGlot bug, already fixed upstream.

MSSQL:
- `contains` might work, but requires “fulltext index” and I can’t be bothered
- `rpad` not defined
- `lpad` not defined
- `find_in_set` not defined

Oracle:
- `len(🐍) == 2`, should be able to use `LENGTHC` instead
- `strip` maps to `trim`, but `TRIM` doesn’t accept custom characters, unlike `RTRIM` and `LTRIM`

Flink:
- `len(🐍) == 2`
- `strip` maps to `trim`, but `TRIM` doesn’t accept custom characters, unlike `RTRIM` and `LTRIM`
- There is a function in the docs called `BTRIM` but maybe it’s only in dev? But that should work?

Impala (woof):
- `len(🐍) == 4`, `len(Éé) == 4`
- `TRIM` for anything but whitespace. This appears to be a limitation of our Impala compiler.

Risingwave

MySQL:
- `len(🐍) == 4`, should use `CHAR_LENGTH`

PySpark:
- `TRIM` only trims spaces

SQLite

Clickhouse
BUT: There are UTF8 aware versions for all of these
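The differing `len` results above are consistent with backends counting different units. A minimal Python sketch, assuming the backends that report 2 for the snake are counting UTF-16 code units and those that report 4 are counting UTF-8 bytes (an inference from the numbers, not something verified per backend here). The helper name `length_in_units` is made up for illustration:

```python
def length_in_units(s: str) -> dict[str, int]:
    """Measure the length of `s` in three different units."""
    return {
        # What Python's len() counts: Unicode code points.
        "code_points": len(s),
        # Each UTF-16 code unit is 2 bytes; astral characters need two units.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # Number of bytes in the UTF-8 encoding.
        "utf8_bytes": len(s.encode("utf-8")),
    }


# Escapes are used so the example does not depend on the file's encoding
# or on Unicode normalization of the accented letters.
snake = length_in_units("\U0001F40D")   # the snake emoji
accents = length_in_units("\u00c9\u00e9")  # precomposed "Éé"

print(snake)
print(accents)
```

Under that assumption, a code-point-aware function (such as the `CHAR_LENGTH` mentioned for MySQL) is what would line up with Python's `len`.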
What should the string APIs do?
find_in_set
We state in the docs for `find_in_set` that if a value contains a comma, we return `-1` (the same behavior as if there is no match). Currently no backend does this, the 5 backends that support the operation all return the index of the comma-containing field (`duckdb`, `datafusion`, `mysql`, `postgres`, `risingwave`).
We can just update the docstring?
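To make the mismatch concrete, here is a pure-Python sketch of the two behaviors. Both function names are hypothetical and neither is the actual implementation of any backend. The "documented" variant assumes the docstring means that a searched-for value containing a comma never matches, and both assume 0-based positions:

```python
def find_in_set_documented(needle: str, haystack: list[str]) -> int:
    """Behavior as the docs describe it: a comma-containing value yields -1."""
    if "," in needle:
        return -1
    return haystack.index(needle) if needle in haystack else -1


def find_in_set_observed(needle: str, haystack: list[str]) -> int:
    """Behavior reported for the supporting backends: plain positional lookup."""
    return haystack.index(needle) if needle in haystack else -1


fields = ["a", "b,c", "d"]

print(find_in_set_documented("b,c", fields))  # comma-containing value
print(find_in_set_observed("b,c", fields))
```

The two only disagree when the searched-for value contains a comma, which is why updating the docstring would be enough to bring docs and behavior back in line.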
Padding
The SQL convention with padding is that if the original string is longer than
the pad length, the string is truncated to that length.
The Python convention is to leave strings whose length is >= the pad length alone.
Which would we like to adopt?
I think the SQL convention here is terrible and we should follow Python.
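The two conventions can be compared side by side with a small sketch. `sql_lpad` is a hypothetical helper that mimics the SQL behavior as PostgreSQL's `lpad` documents it (truncation keeps the leftmost characters); other engines may differ in details. `python_lpad` simply wraps `str.rjust`, which only accepts a single fill character. Both assume a non-negative target length:

```python
def sql_lpad(s: str, length: int, pad: str = " ") -> str:
    """SQL-style LPAD: pad on the left, truncate when `s` is too long."""
    if len(s) >= length:
        # SQL convention: cut the string down to the requested length.
        return s[:length]
    missing = length - len(s)
    # Repeat the pad string and cut it to exactly the missing width.
    fill = (pad * missing)[:missing]
    return fill + s


def python_lpad(s: str, length: int, pad: str = " ") -> str:
    """Python-style left padding: long strings are returned unchanged."""
    return s.rjust(length, pad)


print(sql_lpad("hello", 3, "-"))     # truncated
print(python_lpad("hello", 3, "-"))  # left alone
print(sql_lpad("hi", 5, "-"), python_lpad("hi", 5, "-"))
```

The two agree whenever the input is shorter than the target length; they differ only in whether a longer input loses data.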
UTF-8
Some backends will support this, some won’t, but we should be consistent.
I think that `length`, `lpad`, and `rpad` should all be UTF-8 variants.
Propose adding `len_bytes` as an additional string method for the byte length of a string.