Modin doesn't partition DataFrames read from a parquet file if the file itself isn't partitioned #5296

Closed
dchigarev opened this issue Nov 30, 2022 · 2 comments · Fixed by #7016

@dchigarev (Collaborator)

The current implementation of .read_parquet() relies entirely on the partitioning provided by the parquet file's own layout (row groups, column chunking), but that partitioning may not always be good.

Apache's recommended configuration considers an optimal row group size to be 1GB [1], which works terribly for Modin: there is a real risk of the whole dataframe landing in only one or two partitions, and such partitioning makes otherwise well-parallelized implementations perform far worse than they would with proper partitioning.
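As a write-side mitigation (a sketch, assuming you control how the file is produced; the path and row_group_size value are illustrative), capping the row group size makes the file arrive with enough chunks for Modin to map onto partitions:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.arange(1_000_000)})
# pandas forwards row_group_size to pyarrow.parquet.write_table,
# producing 10 row groups of 100_000 rows each instead of one huge one.
df.to_parquet("data.parquet", row_group_size=100_000)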

Consider this example, where I read a single-row-group parquet file with only 100_000 rows and then apply a simple function to the dataframe. The dataframe read with the parquet file's partitioning scheme performs about 4x slower than the properly partitioned one, and the gap grows with the size of the dataset.

data_shape: (100_000, 2)
parquet num_row_groups: 1
default partitioning: (1, 2)
        .apply() with default partitioning: 13.06s.
proper partitioning: (112, 1)
        .apply() with proper partitioning: 3.35s.
Reproducer
import pandas
import modin.pandas as pd
from modin.utils import try_cast_to_pandas
import numpy as np
import tempfile
import pyarrow as pa
import pyarrow.parquet  # required for pa.parquet to resolve below
from timeit import default_timer as timer

NROWS = 100_000
NCOLS = 2

with tempfile.NamedTemporaryFile() as file:
    pandas.DataFrame({f"col{i}": np.arange(NROWS) for i in range(NCOLS)}).to_parquet(file.name)

    df = pd.read_parquet(file.name)

    parquet = pa.parquet.ParquetFile(file.name)
    print(f"parquet num_row_groups: {parquet.num_row_groups}")
    print(f"default partitioning: {df._query_compiler._modin_frame._partitions.shape}")

    t1 = timer()
    df.apply(lambda row: row + row, axis=1) # no parallelism at all due to poor partitioning
    print(f"\t.apply() with default partitioning: {timer() - t1}")

    df = pd.DataFrame(try_cast_to_pandas(df)) # force repartitioning
    print(f"proper partitioning: {df._query_compiler._modin_frame._partitions.shape}")

    t1 = timer()
    df.apply(lambda row: row + row, axis=1) # decent parallelism
    print(f"\t.apply() with default partitioning: {timer() - t1}")

I wonder if we could avoid relying fully on the partitioning provided by a parquet file and instead use our own partitioning scheme when the provided one is not good enough.
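One possible shape for such a fallback (a sketch only; the heuristic and the helper name are mine, not Modin's actual logic):

import pandas
import pyarrow.parquet as pq
import modin.pandas as pd
from modin.config import NPartitions

def read_parquet_with_fallback(path):
    # If the file already has enough row groups to feed every Modin
    # partition, trust the file's own chunking.
    if pq.ParquetFile(path).metadata.num_row_groups >= NPartitions.get():
        return pd.read_parquet(path)
    # Otherwise read through pandas so the Modin constructor applies
    # its default even row partitioning.
    return pd.DataFrame(pandas.read_parquet(path))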

@dchigarev added the Performance 🚀 and P1 labels on Nov 30, 2022
@mvashishtha (Collaborator)

@dchigarev I don't know how to seek to a given byte offset (say, a quarter of the way into the file) and end up at a valid row boundary, but I think it would be useful to try something like that.

I remember looking into pyarrow batches when I wrote the original partition reading, but I don't remember why I chose not to use those.
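For reference, pyarrow's ParquetFile.iter_batches() can yield fixed-size record batches that cross row group boundaries, which might be one way to carve even chunks out of a coarsely chunked file (a sketch; the path and batch_size are arbitrary):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")
# iter_batches yields batches of ~batch_size rows even when the file
# has a single huge row group; each batch could seed one partition.
for batch in pf.iter_batches(batch_size=10_000):
    chunk = batch.to_pandas()
    ...  # hand off `chunk` as a partition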

From some quick browsing, looking at posts like this, I don't see an easy solution.

@dchigarev (Collaborator, Author)

If we can't partition properly during the read, I think we could do it afterward. It would of course be expensive, but considering the performance improvement it could give us for subsequent operations, it may be worth it.
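A minimal sketch of what I mean, reusing the round-trip trick from the reproducer above (the helper name is mine, not an existing Modin API):

import modin.pandas as pd
from modin.utils import try_cast_to_pandas

def repartition_after_read(df):
    # Materialize to pandas and rebuild; the Modin constructor then
    # applies its default even row partitioning. This copies the data
    # once, so it only pays off if later operations are heavy enough.
    return pd.DataFrame(try_cast_to_pandas(df))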

@dchigarev self-assigned this on Mar 4, 2024
dchigarev added a commit to dchigarev/modin that referenced this issue on Mar 6, 2024: "… groups" (Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>)
YarShev pushed a commit that referenced this issue on Mar 8, 2024 (Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>)