Do not allow special characters in base table locations #8524

dimas-b · 2024-05-16T16:20:51Z

Issue description

Coming from discussions on #8516.

Related to:

adutra · 2024-05-16T16:25:34Z

Do we have a definition of "special characters"? Would it be enough to apply e.g. percent encoding to all base locations?

snazy · 2024-05-16T16:28:45Z

I suspect this can become pretty nasty later - some locations (provided "externally") are "properly escaped" and some are not.
Ideally, we/Nessie should store "properly escaped" locations - the big question is probably: how do we know when a location that's provided "to us/Nessie" is already escaped (so we don't escape it twice or more often).

dimas-b · 2024-05-16T16:29:41Z

I do not think URL encoding is interoperable with URI, Iceberg's S3URI and AWS S3Utilities... sadly... cf. apache/iceberg#10329 (comment)

dimas-b · 2024-05-16T16:33:36Z

I think we should permit only unreserved (per RFC 3986) chars in base locations.
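For illustration only, a minimal sketch of what such a check could look like. The class and method names are hypothetical (nothing here is existing Nessie code), and it assumes the check is applied to each path element separately, with the `/` separator handled by the caller. Read strictly, the RFC 3986 unreserved set is ASCII-only, so this sketch rejects Unicode names too.

```java
import java.util.List;

/** Hypothetical validator, not part of Nessie. */
final class UnreservedLocationCheck {

  /** True if c is in the RFC 3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~". */
  static boolean isUnreserved(char c) {
    return (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
  }

  /** Assumption: one path element at a time; empty elements are treated as unsafe. */
  static boolean isSafeElement(String element) {
    if (element.isEmpty()) {
      return false;
    }
    for (int i = 0; i < element.length(); i++) {
      if (!isUnreserved(element.charAt(i))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    for (String e : List.of("my_table", "tab#le", "100%", "\u00e4bc")) {
      System.out.println(e + " -> " + isSafeElement(e));
    }
  }
}
```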

snazy · 2024-05-16T16:36:12Z

Ah, you're right. Then we can only implement "our own" "safe escaping" for namespace/content-key elements. Forbidding would then mean that you cannot create tables or views (just because of such a character).
I suspect, we have to work on some rules for this - something like:

If it's a quote char, skip it
If it's a # or % or ?, ignore/skip it - or replace with _

Something like that?

dimas-b · 2024-05-16T16:39:42Z

Unicode is tricky too... but something that works transparently with java.net.URI is probably ok.

dimas-b · 2024-05-16T16:40:56Z

I think the transformation does not have to be reversible.

adutra · 2024-05-16T16:41:02Z

Are we ok with a destructive encoding function? I.e. if both foo# and foo? become foo (or foo_), the encoded result becomes ambiguous.
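To make the ambiguity concrete, here is a rough sketch of the kind of rule set floated above: quote characters are dropped, and `#`, `%` and `?` are replaced with `_`. The class name and the exact set of quote characters are assumptions made for this example only.

```java
/** Hypothetical sanitizer for content-key elements, not part of Nessie. */
final class DestructiveSanitizer {

  static String sanitize(String element) {
    StringBuilder sb = new StringBuilder(element.length());
    for (int i = 0; i < element.length(); i++) {
      char c = element.charAt(i);
      if (c == '"' || c == '\'' || c == '`') {
        // Assumption: these three count as "quote chars" and are skipped.
        continue;
      }
      if (c == '#' || c == '%' || c == '?') {
        // Destructive step: three different inputs map to the same output.
        sb.append('_');
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    String a = sanitize("foo#");
    String b = sanitize("foo?");
    // Both inputs collapse to the same encoded name.
    System.out.println(a + " / " + b + " / equal=" + a.equals(b));
  }
}
```

Both `foo#` and `foo?` come out as `foo_`, so the original name cannot be recovered from the encoded one.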

adutra · 2024-05-16T16:44:26Z

https://en.wikipedia.org/wiki/Punycode ?

dimas-b · 2024-05-16T16:45:21Z

Exactly what I was thinking :) Is there a good OSS impl.?

snazy · 2024-05-16T16:46:17Z

> destructive encoding function / encoded result becomes ambiguous

Ugh - true. However, entities have their Iceberg-UUID in the name - so it should™️ not be ambiguous?
But your point's still valid.

I suspect we have to rigorously forbid special-chars in object-store locations (as in org.projectnessie.catalog.service.config.WarehouseConfig.location()) but map/escape/destructive-encode content-key elements?

snazy · 2024-05-16T16:47:32Z

But no matter which encoding we use - we have to think about existing locations (which we must/should not change) and new locations.

snazy · 2024-05-16T16:47:44Z

Legacy system issues - not nice

dimas-b · 2024-05-16T16:48:55Z

Existing locations are covered by StorageUri (hopefully). I think if it used to work in Iceberg/Spark, it will keep working with Nessie.

adutra · 2024-05-16T16:51:05Z

> Exactly what I was thinking :) Is there a good OSS impl.?

It seems the jdk has one: https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/IDN.html

dimas-b · 2024-05-16T16:52:04Z

From my POC the main concern with new locations is that stuff derived from Nessie ContentKey for new tables may still have # and %, which will then break something on the Iceberg side.
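As a small illustration of why `#` and `%` are troublesome, the sketch below shows how plain `java.net.URI` treats them when a location is built by naive string concatenation. It only demonstrates JDK behaviour; the bucket and key names are made up, and it makes no claim about how Iceberg's own parsers react.

```java
import java.net.URI;

/** Hypothetical demo class; only java.net.URI behaviour is shown. */
final class ContentKeyUriDemo {

  static String describe(String location) {
    try {
      URI uri = URI.create(location);
      return "path=" + uri.getPath() + " fragment=" + uri.getFragment();
    } catch (IllegalArgumentException e) {
      // URI.create wraps the URISyntaxException raised by the parser.
      return "rejected";
    }
  }

  public static void main(String[] args) {
    // '#' starts the fragment, so part of the table name silently leaves the path.
    System.out.println(describe("s3://bucket/ns/tab#le"));
    // A '%' that is not followed by two hex digits is a malformed escape.
    System.out.println(describe("s3://bucket/ns/100%"));
  }
}
```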

dimas-b · 2024-05-16T16:57:19Z

With Unicode, URI.toString() does not percent-encode non-reserved Unicode chars (and is able to parse them back), but URI.toASCIIString() does encode them. The latter will then hit S3 interop problems, I'm afraid.
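The difference can be reproduced with a short sketch. The scheme, bucket and path are placeholders, and the class name is hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical demo class; only java.net.URI behaviour is shown. */
final class UnicodeUriDemo {

  /** Builds a URI from components; the multi-argument constructor quotes illegal characters. */
  static URI build(String path) throws URISyntaxException {
    return new URI("s3", "bucket", path, null);
  }

  public static void main(String[] args) throws URISyntaxException {
    String path = "/warehouse/\u00e4bc"; // "äbc", written as an escape to avoid source-encoding issues
    URI uri = build(path);

    System.out.println(uri.toString());      // keeps the Unicode character as-is
    System.out.println(uri.toASCIIString()); // percent-encodes its UTF-8 bytes

    // The non-ASCII form parses back to the same path.
    System.out.println(URI.create(uri.toString()).getPath().equals(path));
  }
}
```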

dimas-b · 2024-05-16T17:00:23Z

On the other hand, Punycode will make Unicode path elements unreadable to humans in storage paths, which defeats the whole idea of using ContentKey for base locations, WDYT?

adutra · 2024-05-16T17:29:23Z

Yes, and I'm also concerned by the fact that it encodes all the ASCII characters first, then all the rest after, thus altering the natural sort order of original names. E.g.

äbc -> bc-uia
žbc -> bc-1va
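The two examples above can be checked with the JDK's `java.net.IDN`, as in this sketch. Note that `IDN.toASCII` also prepends the `xn--` ACE prefix, so its output is the Punycode shown above with that prefix added. The class name is hypothetical.

```java
import java.net.IDN;

/** Hypothetical demo class; only java.net.IDN behaviour is shown. */
final class PunycodeOrderDemo {

  static String encode(String label) {
    return IDN.toASCII(label);
  }

  public static void main(String[] args) {
    String a = "\u00e4bc"; // äbc
    String z = "\u017ebc"; // žbc

    System.out.println(a + " -> " + encode(a));
    System.out.println(z + " -> " + encode(z));

    // Compare the code-unit order of the original names with that of the encoded ones.
    System.out.println("original order a<z: " + (a.compareTo(z) < 0));
    System.out.println("encoded order a<z:  " + (encode(a).compareTo(encode(z)) < 0));
  }
}
```

With these two names the relative order flips after encoding, because the appended Punycode suffix (`uia` vs `1va`) decides the comparison instead of the leading character.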

snazy · 2024-05-16T18:37:05Z

Maybe collations et al?

dimas-b mentioned this issue May 16, 2024

Force table / view location #8516

Merged

snazy added this to the 1.0.0 milestone Jun 5, 2024
