
feat(parquet): figure out convention for multi-file parquet writing #8584

Closed
gforsyth opened this issue Mar 7, 2024 · 8 comments · Fixed by #9781
gforsyth (Member) commented Mar 7, 2024

Hmmm, I remember us talking about this before (sorry, this issue had slipped my mind when we talked about this last week).

Currently all our to_csv/to_parquet writers produce a single file (rather than a directory of files). There was an open question as to what the expected behavior was for backends where a single-file output is tricky/inefficient/impossible (backends like spark or dask).

Enumerating all the options I can think of:

1. to_csv/to_parquet always outputs a single file

We could fall back to pyarrow, or error, for backends where this is tricky. This is the current behavior.

2. to_csv/to_parquet always output a directory

Backends that only write a single file would convert to writing a directory with a single file. This would be a breaking change, but should work with any backend's native csv/parquet writer. This would make a common case of writing a small csv of results trickier though.

3. to_csv/to_parquet write the backend-native style

The output style (single file or directory) depends on the backend. This is easy to implement, but would mean different backends would result in different behaviors.

4. to_csv/to_parquet have an option to write a directory

These methods would write to a single file by default, but would have an option to instead write to a directory. 🤷 whether this is a directory=True option, or inferred somehow from the input path. This would mean that users using backends that can only efficiently write to directories may need to opt into this support, but would make things explicit about the output type.

t.to_csv("test.csv")  # would error for the spark backend? Or fall back to pyarrow?
t.to_csv("test/", directory=True)  # outputs a directory; can use pyspark-native behavior

5. New methods for writing directories

to_csv/to_parquet keep their existing behavior, and we add new to_csv_dir/to_parquet_dir (or much better named) methods for writing directories. Same caveats/questions as 4.


Right now I'm leaning towards 4. An extra flag seems fine to me, and I like it better than adding another top-level-method per file format just to support partitioned writing. cc'ing @gforsyth for a 2nd opinion though, since he may remember what conclusions we came to last time this came up.

Originally posted by @jcrist in #6615 (comment)

Additionally, I've documented (although this is now at least partly out of date) how a few systems handle various parquet partitioning schemes: https://gist.github.com/gforsyth/8dd4ca981b2beed6ef4db80f5e8afbfd

Opening this issue so we have something to track that isn't a comment in a closed PR.

Probably need to write out a taxonomy of what parquet functionality is supported by each backend natively

@gforsyth gforsyth changed the title feat(parquet): figure out convention for multi-file parque writing feat(parquet): figure out convention for multi-file parquet writing Mar 7, 2024
@jcrist jcrist removed their assignment Mar 21, 2024
deepyaman (Contributor) commented:
Right now I'm leaning towards 4.

I agree this seems reasonable. However, taking the PySpark example, it would be weird if user behavior ends up being driven by the backend. It sounds like a PySpark user would often want to specify directory=True to leverage the native path, while locally (e.g. using DuckDB) they may want to leave directory=False. It feels smoother that directory=False on PySpark would at least work, but the impression a user would have is that they're choosing the output format, when in reality they're also choosing between a native and a PyArrow-based path.

During triage earlier today, we collectively decided that option 5 makes the most sense. By having two separate APIs, we surface the directory option more obviously to users.

The directory option only applies for the write path; for reading, we will still maintain a single API. Furthermore, we will support directory vs. single file options, but not get into more backend-specific behavior (like glob handling); that will still be delegated to the backend.

Finally, it came up that at some point we may need to better support cloud I/O (e.g. via fsspec), but this will be put off until we get more users asking for it.

chloeh13q (Contributor) commented:

On the topic of unifying behaviors across backends, is it weird if some backends require that, e.g., read_csv() point to a directory path while other backends can read individual csv files?

ncclementi (Contributor) commented Jul 25, 2024

I'm looking into this one for duckdb, which is the piece still missing an implementation, since the pyspark case was covered in #9272.

I noticed that, for the case of writing hive partitions, if you do something like

>>> penguins = ibis.examples.penguins.fetch()
>>> con = ibis.get_backend(penguins)
>>> con.to_parquet(penguins, "my_dir", partition_by="year")

That will create a directory called my_dir at the current location, with the partition directories underneath, in this case year=2007, year=2008, and year=2009, each containing its respective parquet file.

So for the case of hive partitions we already kind of support writing to a directory.

Then the question is: do we want a to_parquet_dir method (based on this comment: #8584 (comment)) just to cover the case where you want to do

con.to_parquet(penguins, "some_dir/myfile.parquet")

(currently throwing an IO Error), instead of modifying the existing to_parquet() functionality?

cpcloud (Member) commented Jul 30, 2024

Since we have to_parquet_dir, can we use that? Let's avoid stuffing directory support into duckdb's to_parquet if we can.

ncclementi (Contributor) commented:

Since we have to_parquet_dir, can we use that?

We have it only for pyspark (see def to_parquet_dir( in the pyspark backend), but we need to implement it for duckdb.

I just want to make sure we want to implement it just for this case: con.to_parquet(penguins, "some_dir/myfile.parquet"). My concern is that hive partitioning is already covered by the regular to_parquet, which would mean writing to directories is supported via to_parquet for hive partitions, but via to_parquet_dir for single files in a directory.

gforsyth (Member, Author) commented:

If a user wants to pass kwargs to to_parquet and get the files output with hive partitioning, that's fine; we don't need to prevent that behavior. But we could expose those kwargs explicitly in to_parquet_dir to make it more obvious that it's an option.

jitingxu1 (Contributor) commented:

We could use pyarrow.parquet's read_table (it can read a single file or a directory) for ibis backends that lack native read_parquet support.

Another point to consider: do we want to ensure consistency between to_parquet and read_parquet?

7 participants