Feature request: Add support for using Parquet as external storage #20

Open
aaronsteers opened this issue Mar 8, 2023 · 11 comments
Labels: help wanted

Comments

aaronsteers commented Mar 8, 2023

Updated issue description (2023-03-30):

There are some great use cases where we'd love to use target-duckdb as an interop layer to write Parquet files.

Today, users sometimes create data flows where they first use target-parquet and then transform with dbt-duckdb, whereas a more streamlined approach would let target-duckdb and dbt-duckdb both operate on the same Parquet-based datastore.

From the comment thread below in #20 (comment):

> ...As we think towards where to invest future efforts, and where to direct community members who want to interop with Spark and/or data lakes, I think target-duckdb might ultimately be a better layer for "table-like" operations in data lake paradigms.
>
> I'm not sure how target-parquet would handle a merge upsert operation, for instance, whereas DuckDB's support for SQL transformations could likely be a better interface for data lake management operations.

Original question

Details

We have some users interested in storing data within Parquet. Can this target be used in combination with DuckDB's support for Parquet datasets?

aaronsteers added the "help wanted" label Mar 8, 2023
Owner

jwills commented Mar 9, 2023

Hey AJ-- I think @matsonj just used target-parquet for that in his MDS in a box project b/c of the (current) instability of the DuckDB file format, and then used the support for external sources in dbt-duckdb to do transformations on the resulting data.

I was just going to take a pass over this repo to do some updates for DuckDB 0.7.x-- is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?

Author

aaronsteers commented Mar 10, 2023

Hi, @jwills. Re:

> is there some reason target-parquet wouldn't work for your user, or something that I could improve on it using DuckDB?

No reason I know of. I'm totally happy to recommend that model - target-parquet, with dbt-duckdb then consuming from the landed parquet files.

As we think towards where to invest future efforts, and where to direct community members who want to interop with Spark and/or data lakes, I think target-duckdb might ultimately be a better layer for "table-like" operations in data lake paradigms.

I'm not sure how target-parquet would handle a merge upsert operation, for instance, whereas DuckDB's support for SQL transformations could likely be a better interface for data lake management operations.

There's no rush on this, by the way. I just wanted to start this thread to see if what I'm thinking of would make sense.
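
For anyone following along, here is a rough sketch of how the merge upsert mentioned above might look when DuckDB does the work over Parquet. Parquet files are not updated in place, so the sketch rewrites the dataset to a new file: existing rows whose key is absent from the incoming batch, plus every incoming row. The function names, the file paths, and the `id` key are hypothetical, and this is not how target-duckdb works today. It also assumes both files share the same column order.

```python
def build_parquet_upsert_sql(existing_path, batch_path, output_path, key):
    """Build a DuckDB COPY statement that merges a batch into a Parquet dataset.

    Hypothetical helper. Paths and the key column are interpolated directly,
    so they must come from trusted configuration, not user input.
    """
    return (
        "COPY (\n"
        # Keep existing rows that the incoming batch does not replace.
        f"  SELECT * FROM read_parquet('{existing_path}') AS cur\n"
        "  WHERE NOT EXISTS (\n"
        f"    SELECT 1 FROM read_parquet('{batch_path}') AS inc\n"
        f"    WHERE inc.{key} = cur.{key}\n"
        "  )\n"
        "  UNION ALL\n"
        # Then append every incoming row (new keys and updated keys alike).
        f"  SELECT * FROM read_parquet('{batch_path}')\n"
        f") TO '{output_path}' (FORMAT PARQUET)"
    )


def upsert_parquet(existing_path, batch_path, output_path, key):
    """Run the merge with an in-memory DuckDB connection (needs `pip install duckdb`)."""
    import duckdb  # imported lazily so the SQL builder works without it

    con = duckdb.connect()  # in-memory database; nothing is stored in a .duckdb file
    try:
        con.execute(build_parquet_upsert_sql(existing_path, batch_path, output_path, key))
    finally:
        con.close()


if __name__ == "__main__":
    # Hypothetical file names, printed only to show the generated statement.
    print(build_parquet_upsert_sql("events.parquet", "batch.parquet", "merged.parquet", "id"))
```

Writing to a separate output path avoids reading and overwriting the same file in one statement; swapping the merged file into place would be a separate step.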

Owner

jwills commented Mar 10, 2023

Yeah, your reasoning there re: upsert operations makes sense and is valid IMO. I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.

@aaronsteers
Author

> Yeah, your reasoning there re: upsert operations makes sense and is valid IMO.

Thanks for this validation!

> I'm going to turn my attention back to this project next week once I get some dbt-duckdb stuff I've been working on out the door and I will look hard at making parquet support for this target into a first-class concept.

Sounds great. Again, no rush from our side. Nothing per se is broken as of now, and this is more of a long-term strategic investment, I think.

I'll close this issue since the question is answered. Thanks again, and let us know if we can help in any way.

aaronsteers changed the title from "Question: Is Parquet storage supported?" to "Question: Add support info for using Parquet as storage" Mar 30, 2023
Author

aaronsteers commented Mar 30, 2023

Reopening (with an updated title) because I've been hearing a lot of interest in this.

I've updated the description to be more direct in terms of what I think the next steps may be.

cc @kgpayne

aaronsteers reopened this Mar 30, 2023
aaronsteers changed the title from "Question: Add support info for using Parquet as storage" to "Question: Add support for using Parquet as storage" Mar 30, 2023
aaronsteers changed the title from "Question: Add support for using Parquet as storage" to "Question: Add support for using Parquet as external storage" Mar 30, 2023
aaronsteers changed the title from "Question: Add support for using Parquet as external storage" to "Feature request: Add support for using Parquet as external storage" Mar 30, 2023
Owner

jwills commented Mar 30, 2023

okay, cool-- as you can tell I've done ~ nothing to move this forward; do you want to chat about it somewhere? Meltano Slack?

Author

aaronsteers commented Mar 30, 2023

@jwills - great idea! I created a new channel for this: #-duckdb-warehousing-dev

(Join link for anyone not already in our slack: https://meltano.com/slack)

@aaronsteers
Author

Looks like @kgpayne has an implementation POC using external storage here:

ReneTC commented Jul 20, 2023

Any updates on this?
I'm running into the versioning error, with target-duckdb being older than the dbt-duckdb version.
Can the merged feature solve the issue? I'm not sure how to use it; I looked for documentation without luck.

Owner

jwills commented Jul 20, 2023

@ReneTC I think the move is to use a virtualenv-type solution to align your duckdb, dbt-duckdb, and target-duckdb versions together; I'd recommend:

duckdb==0.8.1
dbt-duckdb==1.5.2
target-duckdb==0.6.0

...but I'm on vacation for a couple of weeks and haven't tried them in combination yet.
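
A small check along these lines could confirm that a given virtualenv matches those pins. It uses only the standard library's `importlib.metadata`; the function name is illustrative, and the pins are the untested recommendation above rather than a verified combination.

```python
from importlib import metadata

# Pins copied from the recommendation above; not verified to work together.
PINS = {
    "duckdb": "0.8.1",
    "dbt-duckdb": "1.5.2",
    "target-duckdb": "0.6.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that is not met."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is missing from this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINS).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

If the target and dbt are installed into separate virtualenvs, the check would need to run inside each one, since each environment can carry its own duckdb version.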


ReneTC commented Jul 20, 2023 via email
