FIX-#6935: Fix Merge failed when right operand is an empty dataframe #6941

arunjose696 · 2024-02-16T15:45:50Z

What do these changes do?

Fixing corner case when partitions are empty for merge.

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves #6935 (Merge failed when right operand is an empty dataframe)
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

anmyachev · 2024-02-16T16:00:59Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

@@ -1120,6 +1120,9 @@ def to_pandas_remote(df, partition_shape, *dfs):
                (df,) + dfs, partition_shape, called_from_remote=True
            )

+        if partitions.size == 0:


@arunjose696 please also add a test

I wrote the test. The fix is slightly complicated: when the broadcasted right dataframe is empty, it cannot be reconstructed because its partitions are empty. (Creating a new empty dataframe does not work, as the column information is not available.)

I assume I would need to modify the broadcast_apply_full_axis logic, since other Modin operations that use broadcast_apply_full_axis, e.g. df.compare(df2), would also fail if the second dataframe is empty.

In the case of empty partitions, could we also send data about the indexes that we store at the dataframe level?

I have tried an approach that creates a new partition (with the index and columns from the dataframe level) in broadcast_apply_full_axis if the dataframe does not have partitions. This seems to work for the test case.
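To illustrate the point about missing column information, here is a minimal sketch in plain pandas (it does not use Modin internals). The frames, the column names "a", "b", "c", and the variable names are all hypothetical; the sketch only shows that a placeholder built from the stored index and columns keeps the labels that a bare empty frame loses.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Frame-level metadata of an empty right operand (hypothetical values).
right_index = pd.RangeIndex(0)
right_columns = pd.Index(["a", "c"])

# A bare empty frame, as one would get from zero partitions, has no column labels.
bare = pd.DataFrame()

# A placeholder built from the stored index and columns keeps the labels.
placeholder = pd.DataFrame(index=right_index, columns=right_columns)

# The merge can now resolve the key column "a" on the right side.
merged = left.merge(placeholder, on="a")
```

With the placeholder, the inner merge has a key column to look up on the right side and yields an empty result that still carries the columns of both operands.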

anmyachev · 2024-02-19T15:52:37Z

modin/core/dataframe/pandas/partitioning/partition.py

@@ -427,6 +427,23 @@ def empty(cls):
        """
        return cls.put(pandas.DataFrame(), 0, 0)

+    @classmethod
+    def partition_from_data(cls, df):


I think it's clear enough to use the existing put method.

Added this for verbosity, will remove.

anmyachev · 2024-02-19T17:22:34Z

@arunjose696 please rebase

…ty dataframe Signed-off-by: arunjose696 <arunjose696@gmail.com>

anmyachev · 2024-02-19T20:21:57Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+                empty_partition = self._partition_mgr_cls.create_partition_from_data(
+                    pandas.DataFrame(index=df.index, columns=df.columns)
+                )


Maybe we should make this as a part of combine function?

I think we can leave out create_partition_from_data. What do you think?

cc @arunjose696, @anmyachev

Do you mean leave it as is? Then this will only work for the broadcast_apply_full_axis function; if we want to use this approach with other functions, we will have to duplicate this logic.

I think it would be better to leave the create_partition_from_data function as is. The issue is indeed in broadcast_apply_full_axis, as we don't send any data to the remote function for a dataframe with empty partitions, so I assume the issue would be better addressed in this function itself.

The combine function is called only in merge()/join(), so even if we use the logic there to create an empty partition (with index and columns), the broadcast_apply_full_axis function would still fail for other operations, e.g. df.compare(df2).

To use this approach in other functions, the alternative I think would be to make get_partitions() a method of the dataframe class (and let it handle the case where partitions are empty). Would this be fine?

To use this approach in other functions, the alternative I think would be to make get_partitions() a method of the dataframe class (and let it handle the case where partitions are empty). Would this be fine?

This sounds good to me. We could make get_partitions (I would name it _extract_partitions) a method of the PandasDataframe class.
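The idea agreed on above could look roughly like the following sketch. ToyFrame is a hypothetical stand-in, not Modin's PandasDataframe, and plain pandas frames play the role of partitions; only the name _extract_partitions comes from the discussion.

```python
import numpy as np
import pandas as pd


class ToyFrame:
    """Toy stand-in for a partitioned dataframe (hypothetical, not Modin's class)."""

    def __init__(self, partitions, index, columns):
        self._partitions = partitions  # 2D object array of pandas frames
        self.index = index
        self.columns = columns

    def _extract_partitions(self):
        """Return the partition grid, substituting a placeholder if it is empty."""
        if self._partitions.size == 0:
            # No partitions: build one placeholder block from frame-level metadata.
            grid = np.empty((1, 1), dtype=object)
            grid[0, 0] = pd.DataFrame(index=self.index, columns=self.columns)
            return grid
        return self._partitions


empty = ToyFrame(
    np.empty((0, 0), dtype=object), pd.RangeIndex(0), pd.Index(["a", "c"])
)
grid = empty._extract_partitions()
```

Keeping the empty-partition handling in one method means every caller that broadcasts partitions gets the placeholder, rather than only merge()/join().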

modin/pandas/test/dataframe/test_join_sort.py

YarShev · 2024-02-20T08:52:00Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+                empty_partition = self._partition_mgr_cls.create_partition_from_data(
+                    pandas.DataFrame(index=df.index, columns=df.columns)
+                )


Suggested change

-                empty_partition = self._partition_mgr_cls.create_partition_from_data(
-                    pandas.DataFrame(index=df.index, columns=df.columns)
-                )
+                return np.array([[self._partition_mgr_cls._partition_class.put(
+                    pandas.DataFrame(index=df.index, columns=df.columns)
+                )]])

Committed this for now, but I think readability would be better with a variable named empty_partition or empty_data_partition that is then returned. Is it considered better practice to return directly without creating a variable, even though that is slightly confusing?

Also, wouldn't using self._partition_mgr_cls._partition_class mean accessing a private attribute of a different class? Wouldn't that be bad practice?

modin/core/dataframe/pandas/dataframe/dataframe.py

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>

YarShev · 2024-02-22T12:49:32Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

@@ -1120,6 +1137,9 @@ def to_pandas_remote(df, partition_shape, *dfs):
                (df,) + dfs, partition_shape, called_from_remote=True
            )

+        if partitions.size <= 1:


Let's put this at the beginning of the function.

YarShev · 2024-02-22T12:50:03Z

modin/core/dataframe/pandas/partitioning/partition_manager.py

@@ -175,6 +175,23 @@ def preprocess_func(cls, map_func):

    # END Abstract Methods

+    @classmethod
+    def create_partition_from_data(cls, data):


Can I keep this function as is? Otherwise we would have to use self._partition_mgr_cls._partition_class.put, which would mean calling a method through a private attribute of another class.

Maybe rename this method to create_partition_from_metadata?

I have renamed it and changed the arguments to take a metadata dict, since the name suggests it accepts metadata.
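A rough sketch of the encapsulation being discussed: the manager exposes a public classmethod, so callers do not reach through its private _partition_class attribute. ToyPartition and ToyPartitionManager are hypothetical, and the **metadata signature is an assumption about how the metadata dict is passed, not the merged code.

```python
import pandas as pd


class ToyPartition:
    """Toy partition that just holds a pandas frame (hypothetical)."""

    def __init__(self, data):
        self.data = data

    @classmethod
    def put(cls, data):
        return cls(data)


class ToyPartitionManager:
    """Toy manager (hypothetical); only the method name comes from the thread."""

    _partition_class = ToyPartition

    @classmethod
    def create_partition_from_metadata(cls, **metadata):
        # Assumption: metadata is forwarded to the pandas.DataFrame constructor,
        # e.g. index=... and columns=...
        return cls._partition_class.put(pd.DataFrame(**metadata))


part = ToyPartitionManager.create_partition_from_metadata(
    index=pd.RangeIndex(0), columns=["a", "c"]
)
```

The caller only touches the manager's public method; the private partition class stays an implementation detail of the manager.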

anmyachev

LGTM! Thanks @arunjose696

modin/core/dataframe/pandas/dataframe/dataframe.py

…ty dataframe (modin-project#6941) Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru> Signed-off-by: arunjose696 <arunjose696@gmail.com>

arunjose696 requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev, dchigarev and a team as code owners February 16, 2024 15:45

anmyachev reviewed Feb 16, 2024

arunjose696 force-pushed the mergefix branch 6 times, most recently from 3fbce22 to f2198db Compare February 19, 2024 15:50

anmyachev reviewed Feb 19, 2024

arunjose696 force-pushed the mergefix branch 2 times, most recently from 98a3d7c to 1202f53 Compare February 19, 2024 16:58

arunjose696 added 2 commits February 19, 2024 11:40

FIX-modin-project#6935: Fix Merge failed when right operand is an emp…

105fcb7

…ty dataframe Signed-off-by: arunjose696 <arunjose696@gmail.com>

dealing with empty partitions in broadcast_apply_full_axis

408032d

arunjose696 force-pushed the mergefix branch from 1202f53 to 408032d Compare February 19, 2024 17:40

anmyachev reviewed Feb 19, 2024

YarShev reviewed Feb 20, 2024

arunjose696 and others added 2 commits February 22, 2024 09:52

Apply suggestions from code review

864ab69

Co-authored-by: Iaroslav Igoshev <Poolliver868@mail.ru>

using eval_general

b1d145f

YarShev reviewed Feb 22, 2024

arunjose696 force-pushed the mergefix branch from 5238961 to 0cc6963 Compare February 22, 2024 13:58

anmyachev previously approved these changes Feb 22, 2024

YarShev reviewed Feb 22, 2024

arunjose696 dismissed anmyachev’s stale review via 11eb8ab February 22, 2024 14:47

arunjose696 force-pushed the mergefix branch from 0cc6963 to 11eb8ab Compare February 22, 2024 14:47

adding _extract_partitions

82607a5

arunjose696 force-pushed the mergefix branch from 11eb8ab to 82607a5 Compare February 22, 2024 14:56

YarShev approved these changes Feb 22, 2024

YarShev merged commit 156ea84 into modin-project:master Feb 22, 2024
37 checks passed
