Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#6935: Fix Merge failed when right operand is an empty dataframe #6941

Merged
merged 5 commits into from Feb 22, 2024

Conversation

arunjose696
Copy link
Collaborator

What do these changes do?

Fixing corner case when partitions are empty for merge.

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Merge failed when right operand is an empty dataframe #6935 ?
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@@ -1120,6 +1120,9 @@ def to_pandas_remote(df, partition_shape, *dfs):
(df,) + dfs, partition_shape, called_from_remote=True
)

if partitions.size == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arunjose696 please add also a test

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote the test, the fix is slightly complicated as the broadcasted right dataframes being empty will not be reconstructed as the partitions would be empty (Creating a new empty datatframe does not work as the column information is not available).

I assume I would need to modify broadcast_apply_full_axis logic as other modin operations eg. df.compare(df2) which use broadcast_apply_full_axis would also fail if the second dataframe is empty.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of empty partitions, could we also send data about the indexes that we store at the dataframe level?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried an approach with creating a new partition (with indexes and columns from dataframe level)in broadcast_apply_full axis if the dataframe does not have partitions, this seems to work for the test case.

@arunjose696 arunjose696 force-pushed the mergefix branch 6 times, most recently from 3fbce22 to f2198db Compare February 19, 2024 15:50
@@ -427,6 +427,23 @@ def empty(cls):
"""
return cls.put(pandas.DataFrame(), 0, 0)

@classmethod
def partition_from_data(cls, df):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty clear to use the old put method I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this for verbosity , will remove.

@arunjose696 arunjose696 force-pushed the mergefix branch 2 times, most recently from 98a3d7c to 1202f53 Compare February 19, 2024 16:58
@anmyachev
Copy link
Collaborator

@arunjose696 please rebase

Comment on lines 3357 to 3359
empty_partition = self._partition_mgr_cls.create_partition_from_data(
pandas.DataFrame(index=df.index, columns=df.columns)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should make this as a part of combine function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can leave out create_partition_from_data. What do you think?

cc @arunjose696, @anmyachev

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean leave it as is? Then this will only work for broadcast_apply_full function; if we want to use this approach with other functions, we will have to duplicate this logic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to leave the create_partition_from_data function as is. The issue is indeed in broadcast_apply_full_axis as we dont send any data to remote function for empty partition dataframe, so I assume the issue would be better addressed in this function itself.

The combine function is called only in merge()/join() so even if we use the logic there to create a empty partition(with index and columns), the broadcast_apply_full_axis function would fail for other operations eg df.compare(df2),

To use this approach in other functions, the alternative I think would be to make get_partitions() a method of dataframe(and let it handle the case if partitions are empty) class would this be fine ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To use this approach in other functions, the alternative I think would be to make get_partitions() a method of dataframe(and let it handle the case if partitions are empty) class would this be fine ?

This sounds good to me. We could do get_partitions (I would name it as _extract_partitions) as a method of the PandasDataframe class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

modin/pandas/test/dataframe/test_join_sort.py Outdated Show resolved Hide resolved
Comment on lines 3357 to 3359
empty_partition = self._partition_mgr_cls.create_partition_from_data(
pandas.DataFrame(index=df.index, columns=df.columns)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
empty_partition = self._partition_mgr_cls.create_partition_from_data(
pandas.DataFrame(index=df.index, columns=df.columns)
)
return np.array([[self._partition_mgr_cls._partition_class.put(
pandas.DataFrame(index=df.index, columns=df.columns)
]])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

committed this, for now but I think the readability would be better with using empty_partition or empty_data_partition as a variable and returning the variable. Is it a better practice to not create a variable and return directly as this would be slightly confusing?

Also wouldnt using self._partition_mgr_cls._partition_class be using the private attributes in a different class, wouldnt this be a bad practice?

modin/core/dataframe/pandas/dataframe/dataframe.py Outdated Show resolved Hide resolved
arunjose696 and others added 2 commits February 22, 2024 09:52
Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
@@ -1120,6 +1137,9 @@ def to_pandas_remote(df, partition_shape, *dfs):
(df,) + dfs, partition_shape, called_from_remote=True
)

if partitions.size <= 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put this at the beggining of the function.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -175,6 +175,23 @@ def preprocess_func(cls, map_func):

# END Abstract Methods

@classmethod
def create_partition_from_data(cls, data):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I keep this function as is or else we would have to use self._partition_mgr_cls._partition_class.put which would be calling a private function from a private attribute?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename this method to create_partition_from_metadata?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have renamed and changed arguments to take in a metadata dict as the name would suggest it accepts metadata.

anmyachev
anmyachev previously approved these changes Feb 22, 2024
Copy link
Collaborator

@anmyachev anmyachev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @arunjose696

@YarShev YarShev merged commit 156ea84 into modin-project:master Feb 22, 2024
37 checks passed
tochigiv pushed a commit to tochigiv/modin that referenced this pull request Feb 22, 2024
…ty dataframe (modin-project#6941)

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
tochigiv pushed a commit to tochigiv/modin that referenced this pull request Feb 22, 2024
…ty dataframe (modin-project#6941)

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: arunjose696 <arunjose696@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Merge failed when right operand is an empty dataframe
3 participants