
FIX-#6936: Fix read_parquet when dataset is created with to_parquet and index=False #6937

Merged: 1 commit merged into modin-project:master on Feb 16, 2024

Conversation


@anmyachev (Collaborator) commented on Feb 16, 2024

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves #6936: read_parquet failed to read a dataset created by to_parquet with index=False (a minimal reproduction sketch follows this list)
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date
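
A minimal reproduction of the linked issue, sketched with an illustrative file name and toy data (neither is taken from the PR itself):

import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df.to_parquet("no_index.parquet", index=False)

# Before this fix, the read below failed because the empty "column_indexes"
# list in the file's pandas metadata was not handled.
read_df = pd.read_parquet("no_index.parquet")
print(read_df)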

FIX-#6936: Fix 'read_parquet' when dataset is created with 'to_parquet' and 'index=False'

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@anmyachev marked this pull request as ready for review on February 16, 2024, 14:16
@@ -690,6 +690,7 @@ def build_query_compiler(cls, dataset, columns, index_columns, **kwargs):
         if (
             dataset.pandas_metadata
             and "column_indexes" in dataset.pandas_metadata
+            and len(dataset.pandas_metadata["column_indexes"]) == 1
anmyachev (Collaborator, Author):
In the case of index=False, the "column_indexes" list is empty.
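
A minimal sketch in plain pandas + pyarrow (file names are illustrative) showing why the new len(...) == 1 guard is needed: with index=False the "column_indexes" entry of the pandas metadata is an empty list.

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_parquet("with_index.parquet", index=True)
df.to_parquet("no_index.parquet", index=False)

for path in ("with_index.parquet", "no_index.parquet"):
    pandas_metadata = pq.read_table(path).schema.pandas_metadata
    # Prints 1 for the file written with index=True and 0 for index=False.
    print(path, len(pandas_metadata["column_indexes"]))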

From the added test:

read_df = pd.read_parquet(path / file_name, engine=engine)
if not index:
    # In that case pyarrow cannot preserve index dtype
    read_df.columns = pandas.Index(read_df.columns).astype("int64").to_list()
Collaborator (reviewer):

We do the same in parquet_dispatcher.py. Am I right that we will have correct columns for the Modin DataFrame but wrong columns for partitions?

anmyachev (Collaborator, Author):

The assigned columns will be propagated to the partitions by the synchronize_labels function.
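
A conceptual sketch of that propagation (not Modin's actual implementation; the partition layout and names below are made up for illustration):

import pandas

# Two column-wise block partitions of a frame whose labels were read back as strings.
partitions = [pandas.DataFrame({"0": [1, 2]}), pandas.DataFrame({"1": [3, 4]})]
new_columns = pandas.Index([0, 1])  # corrected frame-level labels

# Each partition receives the slice of the new labels covering its columns.
start = 0
for part in partitions:
    width = len(part.columns)
    part.columns = new_columns[start : start + width]
    start += width

print([list(p.columns) for p in partitions])  # [[0], [1]]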

Collaborator (reviewer):

Should we do this in parquet_dispatcher.py, like with the sync_index flag?

anmyachev (Collaborator, Author):

IIRC pyarrow handled this on its end by calling .to_pandas and using pandas metadata:

return (
    ParquetFile(f)
    .read_row_groups(
        range(
            row_group_start,
            row_group_end,
        ),
        columns=columns,
        use_pandas_metadata=True,
    )
    .to_pandas(**to_pandas_kwargs)
)

Collaborator (reviewer):

Why do we then need this line in the test if the internal and external column labels match?

anmyachev (Collaborator, Author):

In this case the index metadata is not saved, so pyarrow cannot restore the same label dtype the original dataframe had; since I need to compare the original and the read dataframes, I cast the dtype explicitly.

The fact that Modin reads the dataframe with a different label dtype in this case is not a Modin problem; the same behavior occurs in pandas. It is simply a consequence of the missing metadata: the label dtype is lost.

> Why do we then need this line in the test if the internal and external column labels match?

Both the internal and the external column labels have the same dtype, but it does not match the dtype in the original dataframe.
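
A small illustration of that dtype loss in plain pandas/pyarrow (file name and data are made up): without the column index metadata, integer column labels come back as strings.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([[1, 2], [3, 4]])  # column labels are 0 and 1 (int64)
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "no_index.parquet")

read_df = pd.read_parquet("no_index.parquet")
print(list(read_df.columns))  # ['0', '1']: the integer dtype of the labels is lost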

Collaborator (reviewer):

I see, thanks for the explanation!

@YarShev merged commit 3f9a733 into modin-project:master on Feb 16, 2024
37 checks passed
@anmyachev deleted the issue6936 branch on February 16, 2024, 18:10