Modin doesn't partition DataFrames read from a parquet file if the file itself isn't partitioned #5296

Closed
dchigarev opened this issue Nov 30, 2022 · 2 comments · Fixed by #7016

@dchigarev (Collaborator)

The current implementation of .read_parquet() relies entirely on the partitioning provided by the parquet file's own layout (row groups, column chunking), but that partitioning may not always be good.

Apache's recommended configuration considers an optimal row group size to be 1GB [1], which works terribly for Modin: there is a real risk of the whole dataframe landing in only one or two partitions, and such partitioning makes otherwise well-parallelized implementations perform far worse than they would with proper partitioning.
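As a write-side mitigation (a sketch, assuming you control how the file is produced; the path and row_group_size value are illustrative), capping the row group size makes the file arrive with enough chunks for Modin to map onto partitions:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.arange(1_000_000)})
# pandas forwards row_group_size to pyarrow.parquet.write_table,
# producing 10 row groups of 100_000 rows each instead of one huge one.
df.to_parquet("data.parquet", row_group_size=100_000)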

Consider this example, where I read a single-row-group parquet file with only 100_000 rows and then apply a simple function to the dataframe. The dataframe read with the parquet file's partitioning scheme performs about 4x slower than the properly partitioned one, and the gap grows with the size of the dataset.

data_shape: (100_000, 2)
parquet num_row_groups: 1
default partitioning: (1, 2)
        .apply() with default partitioning: 13.06s.
proper partitioning: (112, 1)
        .apply() with proper partitioning: 3.35s.
Reproducer
import pandas
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import numpy as np
import tempfile
import pyarrow as pa
import pyarrow.parquet  # required for pa.parquet to resolve below
from timeit import default_timer as timer

NROWS = 100_000
NCOLS = 2

with tempfile.NamedTemporaryFile() as file:
    pandas.DataFrame({f"col{i}": np.arange(NROWS) for i in range(NCOLS)}).to_parquet(file.name)

    df = pd.read_parquet(file.name)

    parquet = pa.parquet.ParquetFile(file.name)
    print(f"parquet num_row_groups: {parquet.num_row_groups}")
    print(f"default partitioning: {df._query_compiler._modin_frame._partitions.shape}")

    t1 = timer()
    df.apply(lambda row: row + row, axis=1) # no parallelism at all due to poor partitioning
    print(f"\t.apply() with default partitioning: {timer() - t1}")

    df = pd.DataFrame(try_cast_to_pandas(df)) # force repartitioning
    print(f"proper partitioning: {df._query_compiler._modin_frame._partitions.shape}")

    t1 = timer()
    df.apply(lambda row: row + row, axis=1) # decent parallelism
    print(f"\t.apply() with default partitioning: {timer() - t1}")

I wonder if we could avoid relying fully on the partitioning provided by a parquet file and instead use our own partitioning scheme when the provided one is not good enough.
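One possible shape for such a fallback (a sketch only; the heuristic and the helper name are mine, not Modin's actual logic):

import pandas
import pyarrow.parquet as pq
import modin.pandas as pd
from modin.config import NPartitions

def read_parquet_with_fallback(path):
    # If the file already has enough row groups to feed every Modin
    # partition, trust the file's own chunking.
    if pq.ParquetFile(path).metadata.num_row_groups >= NPartitions.get():
        return pd.read_parquet(path)
    # Otherwise read through pandas so the Modin constructor applies
    # its default even row partitioning.
    return pd.DataFrame(pandas.read_parquet(path))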

@dchigarev added the Performance 🚀 and P1 labels on Nov 30, 2022
@mvashishtha (Collaborator)

@dchigarev I don't know how to seek to a given byte offset (say, a quarter of the way into the file) and end up at a valid row boundary, but I think it would be useful to try something like that.

I remember looking into pyarrow batches when I wrote the original partition reading, but I don't remember why I chose not to use those.
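For reference, pyarrow's ParquetFile.iter_batches() can yield fixed-size record batches that cross row group boundaries, which might be one way to carve even chunks out of a coarsely chunked file (a sketch; the path and batch_size are arbitrary):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
# iter_batches yields batches of ~batch_size rows even when the file
# has a single huge row group; each batch could seed one partition.
for batch in pf.iter_batches(batch_size=10_000):
    chunk = batch.to_pandas()
    ...  # hand off `chunk` as a partition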

From some quick browsing, looking at posts like this, I don't see an easy solution.

@dchigarev (Collaborator, Author)

If we can't partition properly during the read, I think we could do it afterward. It would of course be expensive, but considering the performance improvement it could give us for subsequent operations, it may be worth it.
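A minimal sketch of what I mean, reusing the round-trip trick from the reproducer above (the helper name is mine, not an existing Modin API):

import modin.pandas as pd
from modin.utils import try_cast_to_pandas

def repartition_after_read(df):
    # Materialize to pandas and rebuild; the Modin constructor then
    # applies its default even row partitioning. This copies the data
    # once, so it only pays off if later operations are heavy enough.
    return pd.DataFrame(try_cast_to_pandas(df))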

@dchigarev self-assigned this on Mar 4, 2024
dchigarev added a commit to dchigarev/modin that referenced this issue on Mar 6, 2024: "… groups" (Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>)
YarShev pushed a commit that referenced this issue on Mar 8, 2024 (Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>)