Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test(strings): improve selected string tests #10020

Merged
merged 8 commits into from
Sep 6, 2024

Conversation

gforsyth
Copy link
Member

@gforsyth gforsyth commented Sep 4, 2024

BigQuery:

  • rstrip and lstrip are broken because of a SQLGlot bug, already fixed upstream.

Snowflake:

  • rstrip and lstrip are broken because of a SQLGlot bug, already fixed upstream.

MSSQL:

  • contains might work, but requires “fulltext index” and I can’t be bothered
  • rpad not defined
  • lpad not defined
  • find_in_set not defined

Oracle:

  • Treats len(🐍) == 2, should be able to use LENGTHC instead
  • strip maps to trim, but TRIM doesn’t accept custom characters, unlike RTRIM and LTRIM

Flink:

  • Treats len(🐍) == 2
  • strip maps to trim, but TRIM doesn’t accept custom characters, unlike RTRIM and LTRIM
    There is a function in the docs called BTRIM but maybe it’s only in dev? But that should work?

Impala (woof):

  • Treats len(🐍) == 4, len(Éé) == 4
  • No TRIM for anything but whitespace. This appears to be a limitation of our Impala compiler.
  • Accented characters do not get upper-cased or lower-cased, just left as-is.

Risingwave

  • Accented characters do not get upper-cased or lower-cased, just left as-is.

MySQL

  • Treats len(🐍) == 4, should use CHAR_LENGTH

PySpark

  • TRIM only trims spaces

SQLite

  • Accented characters do not get upper-cased or lower-cased, just left as-is.

Clickhouse

  • Lengths of accented characters and emoji are of byte width, not character length
  • This also impacts padding
    BUT:
    There are UTF8 aware versionf for all of these

What should the string APIs do?

find_in_set

We state in the docs for find_in_set that if a value contains a comma, we
return -1 (the same behavior as if there is no match). Currently no backend
does this, the 5 backends that support the operation all return the index of the
comma-containing field (duckdb, datafusion, mysql, postgres,
risingwave)

We can just update the docstring?

Padding

The SQL convention with padding is that if the original string is longer than
the pad length, the string is trimmed to that length.

The Python convention is to leave strings >= padlength alone.

Which would we like to adopt?

I think the SQL convention here is terrible and we should follow Python.

UTF-8

Some backends will support this, some won’t, but, we should be consistent.

I think that length, lpad, and rpad should all be UTF-8 variants.

Propose adding len_bytes as an additional string method for the byte length of a string.

@gforsyth gforsyth force-pushed the better_string_tests branch 2 times, most recently from 7dfde84 to 22f9164 Compare September 5, 2024 13:34
@gforsyth
Copy link
Member Author

gforsyth commented Sep 5, 2024

Ok, I've fixed strip, lstrip and rstrip for all backends except:

  • pyspark: because they don't have upstream rtrim and ltrim functions that take characters to trim
  • snowflake: upstream fixed sqlglot bug, just waiting on it to release
  • bigquery: upstream fixed sqlglot bug, just waiting on it to release

I'm going to open issues to track the other proposed work in this PR description and handle them separately, since some of those might be breaking changes, while everything here is strictly a bugfix.

@gforsyth gforsyth added impala The Apache Impala backend clickhouse The ClickHouse backend mssql The Microsoft SQL Server backend flink Issues or PRs related to Flink oracle The Oracle backend labels Sep 5, 2024
Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢 it!

@cpcloud
Copy link
Member

cpcloud commented Sep 6, 2024

Doing a rebase merge to pick up all the individual fixes.

@cpcloud cpcloud merged commit 01117a5 into ibis-project:main Sep 6, 2024
81 checks passed
@cpcloud cpcloud added this to the 9.5 milestone Sep 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse The ClickHouse backend flink Issues or PRs related to Flink impala The Apache Impala backend mssql The Microsoft SQL Server backend oracle The Oracle backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants