[Catalog] write.data.path & write.metadata.path - forbid or respect? #8859

Open
snazy opened this issue Jun 18, 2024 · 10 comments
Labels
catalog Nessie Catalog / Iceberg REST

Comments

@snazy
Member

snazy commented Jun 18, 2024

The write.data.path and write.metadata.path table properties instruct clients to use that location for write operations.

write.data.path overrides the default data location, which is the table location (from the table metadata) + "/data" - see org.apache.iceberg.LocationProviders.DefaultLocationProvider#dataLocation.
write.metadata.path overrides the default metadata location, which is the table location + "/metadata".
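
As a rough illustration of the override behaviour described above, the following Python sketch resolves both locations: an explicit property wins, otherwise the default is derived from the table location. The function names are hypothetical, the trailing-slash handling is an assumption, and this is a simplified model rather than Iceberg's actual code.

```python
from typing import Mapping


def data_location(table_location: str, properties: Mapping[str, str]) -> str:
    """Return the data location: write.data.path if set, else <table>/data."""
    explicit = properties.get("write.data.path")
    if explicit:
        # Assumption: trailing slashes are stripped for consistency.
        return explicit.rstrip("/")
    return table_location.rstrip("/") + "/data"


def metadata_location(table_location: str, properties: Mapping[str, str]) -> str:
    """Return the metadata location: write.metadata.path if set, else <table>/metadata."""
    explicit = properties.get("write.metadata.path")
    if explicit:
        return explicit.rstrip("/")
    return table_location.rstrip("/") + "/metadata"
```

With a property set, the data or metadata files can end up outside the table's base location, which is why the catalog has to decide whether to forbid or respect it.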

We should either forbid setting these properties or treat them as a base location.

Other already deprecated properties: write.folder-storage.path and write.object-storage.path.
See org.apache.iceberg.TableProperties.

I suspect we have to respect existing values (for "imported" table-metadata) and prevent setting these properties, but allow removing them.
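
One possible shape for such a rule, as a minimal Python sketch: existing values are kept, removals are allowed, and setting or changing a restricted property is rejected. The function name, the `RESTRICTED` set and the choice to tolerate re-submitting an unchanged value are all assumptions for illustration, not Nessie's implementation.

```python
from typing import Dict, Mapping, Set

# Properties the catalog would refuse to set or change (assumed list).
RESTRICTED = frozenset({"write.data.path", "write.metadata.path"})


def apply_property_update(
    current: Mapping[str, str],
    updates: Mapping[str, str],
    removals: Set[str],
) -> Dict[str, str]:
    """Apply a property update, rejecting new or changed restricted properties."""
    for key in RESTRICTED:
        # Re-submitting the existing (e.g. imported) value is tolerated.
        if key in updates and updates[key] != current.get(key):
            raise ValueError(f"setting '{key}' is not allowed")
    # Removals are always allowed, including for restricted properties.
    result = {k: v for k, v in current.items() if k not in removals}
    result.update(updates)
    return result
```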

@adutra adutra added the catalog Nessie Catalog / Iceberg REST label Jul 1, 2024
@marvin-roesch
Contributor

As per our discussion on Zulip and our call, we would like to see specifying paths supported, at both the table and the namespace/schema level.

Our particular use case is through Trino, where a location property can be set on both a schema as well as a table level (see the Trino docs). For schemas, the native Nessie catalog implementation in Trino will use the specified location as base for any table in that schema, unless the table overrides the property itself. We use this to store different schemas in different S3 buckets.

With the Nessie REST catalog, tables get stored in the warehouse's configured bucket in a <schema>/<table> sub-directory, completely ignoring the location property on the namespace.

We do not need support for changing the data and metadata paths specifically, only the overall base location of a table.
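
The difference between the two behaviours described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function names and bucket names are hypothetical, and the native Trino logic is reduced to "table location, else schema location, else warehouse default".

```python
from typing import Optional


def join(base: str, *parts: str) -> str:
    """Join path segments onto a base location without doubling slashes."""
    return "/".join([base.rstrip("/"), *parts])


def native_catalog_location(
    warehouse: str,
    schema: str,
    table: str,
    schema_location: Optional[str] = None,
    table_location: Optional[str] = None,
) -> str:
    """Model of the native behaviour: table override, then schema base, then default."""
    if table_location:
        return table_location
    if schema_location:
        return join(schema_location, table)
    return join(warehouse, schema, table)


def rest_catalog_location(
    warehouse: str,
    schema: str,
    table: str,
    schema_location: Optional[str] = None,
    table_location: Optional[str] = None,
) -> str:
    """Model of the reported REST behaviour: location hints are ignored."""
    return join(warehouse, schema, table)
```

For a schema whose location points at a separate bucket, the first function places new tables in that bucket, while the second always places them under the warehouse.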

@snazy
Member Author

snazy commented Jul 23, 2024

#9170 adds some better support for Trino that eliminates the need to manually specify the location via Trino.

@snazy
Member Author

snazy commented Aug 3, 2024

@marvin-roesch with #9170 Nessie still controls the location of new tables. However, that PR does also support write.object-storage.enable=true for both S3 request signing and credentials vending.

@snazy
Member Author

snazy commented Aug 10, 2024

@marvin-roesch is Nessie 0.95.0 working for you?

@marvin-roesch
Contributor

@snazy Unfortunately we haven't had a chance yet to try out the newest version as other projects took precedence 😅 We're planning for a Trino and Nessie upgrade by end of this week, will report after that!

@marvin-roesch
Contributor

@snazy Just looking at the changes from that MR, I still don't think we'll be able to use Nessie with the Iceberg REST catalog just yet while maintaining our current S3 setup, but please correct me if I'm wrong.

We explicitly set the location property for the schema/namespace to a bucket that neither matches the default warehouse config nor follows a simple ${warehouse-location}/${namespace-name} scheme. The problem is with creating new tables, where Nessie completely ignores this namespace-level property.

Having new tables created in the warehouse's default bucket isn't bad, but it's a little confusing for us and our users to have to go looking for a table's data in an unexpected location.

Are there plans to support this use case? Otherwise we would make an effort to migrate everything to a single bucket for our setup.

@snazy
Member Author

snazy commented Aug 13, 2024

Hm - yea - a different bucket wouldn't work yet. But migrating Iceberg tables is a huge effort (literally rewrite everything).

Can you check whether everything else works for you?
I've opened #9331 as a follow-up.

@marvin-roesch
Contributor

I'll run a few tests later this week to verify everything else is working for us and report back 👍 Thanks for the follow-up, that'd be ideal for us!

@marvin-roesch
Contributor

Sorry for the delay on this, @snazy, things were quite busy with unrelated stuff. I've tested out the most important features we use with Nessie 0.95.0 and using the REST API. Everything is working fine, so with #9331, it'd be perfect for us!

@snazy
Member Author

snazy commented Aug 28, 2024

NP, we're preparing a "bigger" release at the moment.
I hope to get this solved in the release after the next one.

3 participants