feat(parquet): figure out convention for multi-file parquet writing #8584
I agree this seems reasonable. However, taking the PySpark example, it would be weird if user behavior ends up being driven by the backend. It sounds like a PySpark user would often want to specify …
During triage earlier today, we collectively decided that option 5 makes the most sense. By having two separate APIs, we surface the directory option more obviously to users. The directory option only applies to the write path; for reading, we will still maintain a single API. Furthermore, we will support directory vs. single-file output, but not get into more backend-specific behavior (like glob handling); that will still be delegated to the backend. Finally, it came up that at some point we may need to better support cloud I/O (e.g. via …).
On the topic of unifying behaviors across backends: is it weird if some backends require that, e.g., …
I'm looking into this one for duckdb, which is what's missing an implementation, since the pyspark case was covered in #9272. I noticed that, for the case of writing hive partitions, if you do something like

```python
>>> penguins = ibis.examples.penguins.fetch()
>>> con = ibis.get_backend(penguins)
>>> con.to_parquet(penguins, "my_dir", partition_by="year")
```

that will create a directory called `my_dir`. So for the case of hive partitions we kind of are supporting writing to a directory already. Then the question is: do we want a …
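To make the hive-partition behavior concrete, here is a minimal sketch of the `key=value` directory layout that `partition_by="year"` produces. The helper below is hypothetical and not part of ibis or duckdb; it only illustrates the naming convention, assuming a few example penguin rows.

```python
import os

# Hypothetical helper, NOT part of ibis or duckdb: it just illustrates the
# hive-style layout (key=value subdirectories) that partition_by produces.
def hive_partition_paths(rows, root, key):
    """Group rows by `key` and return the hive-style directory each maps to."""
    layout = {}
    for row in rows:
        part_dir = os.path.join(root, f"{key}={row[key]}")
        layout.setdefault(part_dir, []).append(row)
    return layout

rows = [
    {"species": "Adelie", "year": 2007},
    {"species": "Gentoo", "year": 2008},
    {"species": "Adelie", "year": 2007},
]
layout = hive_partition_paths(rows, "my_dir", "year")
# my_dir/year=2007 holds two rows and my_dir/year=2008 holds one;
# a real writer would place one or more parquet files under each directory.
```

A real hive-partitioned write produces one such subdirectory per distinct partition value, which is why the output is necessarily a directory rather than a single file.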
Since we have …
I just want to make sure we want to implement it just for this case.
If a user wants to pass in kwargs to …
We may use `pyarrow.parquet`. Here is another point to consider: do we want to ensure consistency between …
Hmmm, I remember us talking about this before (sorry, this issue had slipped my mind when we talked about this last week).
Currently all our `to_csv`/`to_parquet` writers produce a single file (rather than a directory of files). There was an open question as to what the expected behavior was for backends where a single-file output is tricky/inefficient/impossible (backends like spark or dask).

Enumerating all the options I can think of:

1. `to_csv`/`to_parquet` always output a single file. We could fall back to pyarrow or error for backends where this is tricky. This is the current behavior.
2. `to_csv`/`to_parquet` always output a directory. Backends that only write a single file would convert to writing a directory with a single file. This would be a breaking change, but should work with any backend's native csv/parquet writer. This would make the common case of writing a small csv of results trickier, though.
3. `to_csv`/`to_parquet` write in the backend-native style. The output style (single file or directory) depends on the backend. This is easy to implement, but would mean different backends would result in different behaviors.
4. `to_csv`/`to_parquet` have an option to write a directory. These methods would write to a single file by default, but would have an option to instead write to a directory. 🤷 whether this is a `directory=True` option, or inferred somehow from the input path. This would mean that users of backends that can only efficiently write to directories may need to opt into this support, but it would make the output type explicit.
5. New methods for writing directories. `to_csv`/`to_parquet` keep their existing behavior, and we add new `to_csv_dir`/`to_parquet_dir` (or much better named) methods for writing directories. Same caveats/questions as 4.

Right now I'm leaning towards 4. An extra flag seems fine to me, and I like it better than adding another top-level method per file format just to support partitioned writing. cc'ing @gforsyth for a 2nd opinion though, since he may remember what conclusions we came to last time this came up.
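The options above can be sketched at the call site. Everything here is hypothetical, for illustration only: the `directory` flag, the `infer_output_kind` helper, and the method names are assumptions, not the ibis API.

```python
import os

def to_parquet(path, directory=False):
    # Option 4: one method; an explicit (hypothetical) flag picks the output
    # style, single file by default.
    return "directory" if directory else "single-file"

def infer_output_kind(path):
    # Option 4 variant: "inferred somehow from the input path" could mean
    # treating a path with a file extension as a single file, and an
    # extension-less path as a directory.
    return "single-file" if os.path.splitext(path)[1] else "directory"

def to_parquet_dir(path):
    # Option 5: a separate, explicitly named method whose output is always
    # a directory, leaving to_parquet's single-file behavior untouched.
    return "directory"
```

The trade-off the sketch surfaces: a flag keeps one method but hides the directory case behind a keyword, path inference is implicit and easy to misread, and a second method (option 5) makes the directory output discoverable at the cost of a wider API surface.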
Originally posted by @jcrist in #6615 (comment)
Additionally, I've documented (although this is now at least partly out of date) how a few systems handle various parquet partitioning schemes: https://gist.github.com/gforsyth/8dd4ca981b2beed6ef4db80f5e8afbfd
Opening this issue so we have something to track that isn't a comment in a closed PR.
Probably need to write out a taxonomy of what parquet functionality each backend supports natively.