
feat(parquet): figure out convention for multi-file parquet writing #8584

Closed
gforsyth opened this issue Mar 7, 2024 · 8 comments · Fixed by #9781
gforsyth (Member) commented Mar 7, 2024

Hmmm, I remember us talking about this before (sorry, this issue had slipped my mind when we talked about this last week).

Currently all our to_csv/to_parquet writers produce a single file (rather than a directory of files). There was an open question as to what the expected behavior was for backends where a single-file output is tricky/inefficient/impossible (backends like spark or dask).

Enumerating all the options I can think of:

1. to_csv/to_parquet always outputs a single file

We could fall back to pyarrow, or error, for backends where this is tricky. This is the current behavior.

2. to_csv/to_parquet always output a directory

Backends that only write a single file would convert to writing a directory with a single file. This would be a breaking change, but should work with any backend's native csv/parquet writer. This would make a common case of writing a small csv of results trickier though.

3. to_csv/to_parquet write the backend-native style

The output style (single file or directory) depends on the backend. This is easy to implement, but would mean different backends would result in different behaviors.

4. to_csv/to_parquet have an option to write a directory

These methods would write to a single file by default, but would have an option to instead write to a directory. 🤷 whether this is a directory=True option, or inferred somehow from the input path. This would mean that users using backends that can only efficiently write to directories may need to opt into this support, but would make things explicit about the output type.

t.to_csv("test.csv")  # would error for the spark backend? Or fall back to pyarrow?
t.to_csv("test/", directory=True)  # outputs a directory; can use pyspark-native behavior

5. New methods for writing directories

to_csv/to_parquet keep their existing behavior, and we add new to_csv_dir/to_parquet_dir (or much better named) methods for writing directories. Same caveats/questions as 4.


Right now I'm leaning towards 4. An extra flag seems fine to me, and I like it better than adding another top-level-method per file format just to support partitioned writing. cc'ing @gforsyth for a 2nd opinion though, since he may remember what conclusions we came to last time this came up.

Originally posted by @jcrist in #6615 (comment)

Additionally, I've documented (although this is now at least partly out of date) how a few systems handle various parquet partitioning schemes: https://gist.github.com/gforsyth/8dd4ca981b2beed6ef4db80f5e8afbfd

Opening this issue so we have something to track that isn't a comment in a closed PR.

Probably need to write out a taxonomy of what parquet functionality is supported by each backend natively

@gforsyth gforsyth changed the title feat(parquet): figure out convention for multi-file parque writing feat(parquet): figure out convention for multi-file parquet writing Mar 7, 2024
@jcrist jcrist removed their assignment Mar 21, 2024
deepyaman (Contributor) commented:
Right now I'm leaning towards 4.

I agree this seems reasonable. However, taking the PySpark example, it would be weird if user behavior ends up being driven by the backend. It sounds like a PySpark user would often want to specify directory=True to leverage the native path, while locally (e.g. using DuckDB) they may want to leave directory=False. It feels smoother that directory=False on PySpark would at least work, but the impression a user would have is that they're choosing the output format, when in reality they're also choosing between a native and a PyArrow-based path.

During triage earlier today, we collectively decided that option 5 makes the most sense. By having two separate APIs, we surface the directory option more obviously to users.

The directory option only applies for the write path; for reading, we will still maintain a single API. Furthermore, we will support directory vs. single file options, but not get into more backend-specific behavior (like glob handling); that will still be delegated to the backend.

Finally, it came up that at some point we may need to better support cloud I/O (e.g. via fsspec), but this will be put off until we get more users asking for it.

chloeh13q (Contributor) commented:

On the topic of unifying behaviors across backends, is it weird if some backends require that, e.g., read_csv() point to a directory path while other backends can read individual csv files?

ncclementi (Contributor) commented Jul 25, 2024

I'm looking into this one for duckdb, which is the piece still missing an implementation, since the pyspark case was covered in #9272.

I noticed that, for the case of writing hive partitions, if you do something like

>>> penguins = ibis.examples.penguins.fetch()
>>> con = ibis.get_backend(penguins)
>>> con.to_parquet(penguins, "my_dir", partition_by="year")

That will create a directory called my_dir at the current location, with the partition directories underneath, in this case year=2007, year=2008, and year=2009, each containing its respective parquet file.

So for the case of hive partitions we already kind of support writing to a directory.

Then the question is: do we want a to_parquet_dir method (based on this comment: #8584 (comment)) just to cover the case where you want to do

con.to_parquet(penguins, "some_dir/myfile.parquet")

(currently throwing an IO Error), instead of modifying the existing to_parquet() functionality?

cpcloud (Member) commented Jul 30, 2024

Since we have to_parquet_dir, can we use that? Let's avoid stuffing directory support into duckdb's to_parquet if we can.

ncclementi (Contributor) commented:

Since we have to_parquet_dir, can we use that?

We have it only for pyspark (see def to_parquet_dir( in the pyspark backend), but we need to implement it for duckdb.

I just want to make sure we want to implement it just for this case: con.to_parquet(penguins, "some_dir/myfile.parquet"). My concern is that hive partitioning is already covered by the regular to_parquet, which would mean writing to directories is supported via to_parquet for hive partitions, but via to_parquet_dir for single files in a directory.

gforsyth (Member, Author) commented:

If a user wants to pass kwargs to to_parquet and get the files output with hive partitioning, that's fine; we don't need to prevent that behavior. But we could expose those kwargs explicitly in to_parquet_dir to make it more obvious that it's an option.

jitingxu1 (Contributor) commented:

We could use pyarrow.parquet's read_table (it can read a single file or a directory) for ibis backends that lack native read_parquet support.

Another point to consider: do we want to ensure consistency between to_parquet and read_parquet?

7 participants