
FIX-#6936: Fix read_parquet when dataset is created with to_parquet and index=False #6937

Merged: 1 commit merged into modin-project:master on Feb 16, 2024

Conversation


@anmyachev (Collaborator) commented on Feb 16, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #6936: read_parquet failed to read a dataset created by to_parquet with index=False (a minimal reproduction sketch follows this list)
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
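
A minimal reproduction of the linked issue, sketched with an illustrative file name and toy data (neither is taken from the PR itself):

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.to_parquet("no_index.parquet", index=False)

# Before this fix, the read below failed because the empty "column_indexes"
# list in the file's pandas metadata was not handled.
read_df = pd.read_parquet("no_index.parquet")
print(read_df)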

FIX-#6936: Fix 'read_parquet' when dataset is created with 'to_parquet' and 'index=False'

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev marked this pull request as ready for review on February 16, 2024, 14:16
@@ -690,6 +690,7 @@ def build_query_compiler(cls, dataset, columns, index_columns, **kwargs):
         if (
             dataset.pandas_metadata
             and "column_indexes" in dataset.pandas_metadata
+            and len(dataset.pandas_metadata["column_indexes"]) == 1
anmyachev (Collaborator, Author):
In the case of index=False, the "column_indexes" list is empty.
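
A minimal sketch in plain pandas + pyarrow (file names are illustrative) showing why the new len(...) == 1 guard is needed: with index=False the "column_indexes" entry of the pandas metadata is an empty list.

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_parquet("with_index.parquet", index=True)
df.to_parquet("no_index.parquet", index=False)

for path in ("with_index.parquet", "no_index.parquet"):
    pandas_metadata = pq.read_table(path).schema.pandas_metadata
    # Prints 1 for the file written with index=True and 0 for index=False.
    print(path, len(pandas_metadata["column_indexes"]))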

From the added test:

read_df = pd.read_parquet(path / file_name, engine=engine)
if not index:
    # In that case pyarrow cannot preserve index dtype
    read_df.columns = pandas.Index(read_df.columns).astype("int64").to_list()
Collaborator (reviewer):

We do the same in parquet_dispatcher.py. Am I right that we will have correct columns for the Modin DataFrame but wrong columns for partitions?

anmyachev (Collaborator, Author):

The assigned columns will be propagated to the partitions by the synchronize_labels function.
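
A conceptual sketch of that propagation (not Modin's actual implementation; the partition layout and names below are made up for illustration):

import pandas

# Two column-wise block partitions of a frame whose labels were read back as strings.
partitions = [pandas.DataFrame({"0": [1, 2]}), pandas.DataFrame({"1": [3, 4]})]
new_columns = pandas.Index([0, 1])  # corrected frame-level labels

# Each partition receives the slice of the new labels covering its columns.
start = 0
for part in partitions:
    width = len(part.columns)
    part.columns = new_columns[start : start + width]
    start += width

print([list(p.columns) for p in partitions])  # [[0], [1]]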

Collaborator (reviewer):

Should we do this in parquet_dispatcher.py, like with the sync_index flag?

anmyachev (Collaborator, Author):

IIRC pyarrow handled this on its end by calling .to_pandas and using pandas metadata:

return (
    ParquetFile(f)
    .read_row_groups(
        range(
            row_group_start,
            row_group_end,
        ),
        columns=columns,
        use_pandas_metadata=True,
    )
    .to_pandas(**to_pandas_kwargs)
)

Collaborator (reviewer):

Why do we then need this line in the test if the internal and external column labels match?

anmyachev (Collaborator, Author):

In this case the index metadata is not saved, so pyarrow cannot restore the same label dtype the original dataframe had; since I need to compare the original and the read dataframes, I cast the dtype explicitly.

The fact that Modin reads the dataframe with a different label dtype in this case is not a Modin problem; the same behavior occurs in pandas. It is simply a consequence of the missing metadata: the label dtype is lost.

> Why do we then need this line in the test if the internal and external column labels match?

Both the internal and the external column labels have the same dtype, but it does not match the dtype in the original dataframe.
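
A small illustration of that dtype loss in plain pandas/pyarrow (file name and data are made up): without the column index metadata, integer column labels come back as strings.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([[1, 2], [3, 4]])  # column labels are 0 and 1 (int64)
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "no_index.parquet")

read_df = pd.read_parquet("no_index.parquet")
print(list(read_df.columns))  # ['0', '1']: the integer dtype of the labels is lost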

Collaborator (reviewer):

I see, thanks for the explanation!

@YarShev merged commit 3f9a733 into modin-project:master on Feb 16, 2024
37 checks passed
@anmyachev deleted the issue6936 branch on February 16, 2024, 18:10