FIX-#6936: Fix read_parquet when dataset is created with to_parquet and index=False #6937
Conversation
…th 'to_parquet' and 'index=False' Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@@ -690,6 +690,7 @@ def build_query_compiler(cls, dataset, columns, index_columns, **kwargs):
        if (
            dataset.pandas_metadata
            and "column_indexes" in dataset.pandas_metadata
            and len(dataset.pandas_metadata["column_indexes"]) == 1
In the case of index=False, the list is empty.
read_df = pd.read_parquet(path / file_name, engine=engine)
if not index:
    # In that case pyarrow cannot preserve index dtype
    read_df.columns = pandas.Index(read_df.columns).astype("int64").to_list()
We do the same in parquet_dispatcher.py. Am I right that we will have correct columns for the Modin DataFrame but wrong columns for partitions?
The assigned columns will be propagated to the partitions by the synchronize_labels function.
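As a conceptual sketch only (the class and method here are hypothetical and merely mirror the idea of label synchronization; this is not Modin's actual code): the frame-level labels are treated as authoritative and pushed down to partitions before partition-level work runs.

```python
# Conceptual sketch -- hypothetical names, not Modin internals.
import pandas as pd

class Frame:
    def __init__(self, partitions, columns):
        self.partitions = partitions  # per-partition pandas DataFrames
        self.columns = list(columns)  # frame-level labels are authoritative

    def synchronize_labels(self):
        # Push the frame-level column labels down to every partition, so
        # partitions materialized with stale labels get corrected.
        for part in self.partitions:
            part.columns = self.columns

frame = Frame([pd.DataFrame([[1, 2]], columns=["0", "1"])], [0, 1])
frame.synchronize_labels()
print(list(frame.partitions[0].columns))  # [0, 1]
```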
Should we do this in parquet_dispatcher.py like with sync_index flag?
IIRC pyarrow handled this on its end by calling .to_pandas and using the pandas metadata:
modin/modin/core/storage_formats/pandas/parsers.py, lines 747 to 758 in 56fb47e
return (
    ParquetFile(f)
    .read_row_groups(
        range(
            row_group_start,
            row_group_end,
        ),
        columns=columns,
        use_pandas_metadata=True,
    )
    .to_pandas(**to_pandas_kwargs)
)
Why do we need then this line in the test if internal and external column labels match?
In this case the index is not saved, so pyarrow cannot recover the same index dtype that the original dataframe had; since I need to compare the original and the read dataframe, I explicitly cast the type.
The fact that Modin reads the dataframe with a different index dtype in this case is not Modin's problem: the same behavior occurs in pandas. It is simply a consequence of the index dtype being lost along with the missing metadata.

> Why do we need then this line in the test if internal and external column labels match?

Both the internal and the external column dtypes are the same, but they do not match what was in the original dataframe.
I see, thanks for the explanation!
What do these changes do?

- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- resolves "read_parquet failed to read the dataset created by to_parquet with index=False" #6936
- docs/development/architecture.rst is up-to-date