FIX-#6935: Fix Merge failed when right operand is an empty dataframe #6941
Changes from 2 commits
105fcb7
408032d
864ab69
b1d145f
82607a5
@@ -3348,10 +3348,21 @@ def broadcast_apply_full_axis(
         PandasDataframe
             New Modin DataFrame.
         """
+
+        def get_partitions(df):
+            """Deal with the corner case if the "other" dataframe has no partitions."""
+            if df._partitions.size > 0:
+                return df._partitions
+            else:
+                empty_partition = self._partition_mgr_cls.create_partition_from_data(
+                    pandas.DataFrame(index=df.index, columns=df.columns)
+                )
Suggested change
Committed this for now, but I think the readability would be better with using [...]. Also, wouldn't using `self._partition_mgr_cls._partition_class` mean accessing a private attribute of a different class? Wouldn't that be bad practice?
+                return empty_partition
arunjose696 marked this conversation as resolved.
+
         if other is not None:
             if not isinstance(other, list):
                 other = [other]
-            other = [o._partitions for o in other] if len(other) else None
+            other = [get_partitions(o) for o in other] if len(other) else None

         if apply_indices is not None:
             numeric_indices = self.get_axis(axis ^ 1).get_indexer_for(apply_indices)
@@ -175,6 +175,23 @@ def preprocess_func(cls, map_func):

     # END Abstract Methods

+    @classmethod
+    def create_partition_from_data(cls, data):
Remove [...]
Can I keep this function as is? Otherwise we would have to use [...]
Maybe rename this method to `create_partition_from_metadata`?
I have renamed it and changed the arguments to take a metadata dict, since the new name suggests it accepts metadata.
+        """
+        Create NumPy array of partitions that wrapps the given data.
+
+        Parameters
+        ----------
+        data : pandas.DataFrame or pandas.Series
+            Data that has to be wrapped in partition.
+
+        Returns
+        -------
+        np.ndarray
+            A NumPy 2D array of a single partition which contains the data.
+        """
+        return np.array([[cls._partition_class.put(data)]])
+
     @classmethod
     def column_partitions(cls, partitions, full_axis=True):
         """
@@ -1120,6 +1137,9 @@ def to_pandas_remote(df, partition_shape, *dfs):
                 (df,) + dfs, partition_shape, called_from_remote=True
             )

+        if partitions.size <= 1:
Let's put this at the beginning of the function.
Done.
+            return partitions
+
         preprocessed_func = cls.preprocess_func(to_pandas_remote)
         partition_shape = partitions.shape
         partitions_flattened = partitions.flatten()
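The early return in the hunk above can be illustrated outside Modin. The following is a minimal sketch, assuming a "partition" is simply a `pandas.DataFrame` stored in a 2D NumPy object array; the `combine` function here is a hypothetical toy stand-in and does not reproduce Modin's partition manager or its remote calls.

```python
import numpy as np
import pandas as pd


def combine(partitions):
    """Collapse a 2D grid of frames into a 1x1 grid (toy stand-in, not Modin's code)."""
    # Early return: an empty grid or a single partition needs no work.
    if partitions.size <= 1:
        return partitions
    # Glue each row of blocks side by side, then stack the rows.
    rows = [pd.concat(list(row), axis=1) for row in partitions]
    combined = np.empty((1, 1), dtype=object)
    combined[0, 0] = pd.concat(rows, axis=0)
    return combined


# A 2x2 grid of 1x1 blocks, and a grid with no partitions at all.
grid = np.empty((2, 2), dtype=object)
for i in range(2):
    for j in range(2):
        grid[i, j] = pd.DataFrame({f"c{j}": [10 * i + j]}, index=[i])
empty_grid = np.empty((0, 0), dtype=object)

combined = combine(grid)
untouched = combine(empty_grid)
```

Placing the size check first means the empty grid is handed back unchanged before any flattening or concatenation is attempted.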
Maybe we should make this a part of the `combine` function?
I think we can leave out create_partition_from_data. What do you think?
cc @arunjose696, @anmyachev
Do you mean leave it as is? Then this will only work for the `broadcast_apply_full` function; if we want to use this approach with other functions, we will have to duplicate this logic.
I think it would be better to leave the `create_partition_from_data` function as is. The issue is indeed in `broadcast_apply_full_axis`, as we don't send any data to the remote function for a dataframe with no partitions, so I assume the issue would be better addressed in this function itself.
The `combine` function is called only in merge()/join(), so even if we use the logic there to create an empty partition (with index and columns), the `broadcast_apply_full_axis` function would still fail for other operations, e.g. `df.compare(df2)`.
To use this approach in other functions, the alternative, I think, would be to make get_partitions() a method of the dataframe class (and let it handle the case where the partitions are empty). Would this be fine?
This sounds good to me. We could make `get_partitions` (I would name it `_extract_partitions`) a method of the PandasDataframe class.
Done.
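A rough illustration of the design agreed on in this thread: the fallback lives in one method on the frame class, so every operation that needs the partition grid can call it instead of repeating the check. This is a sketch only. `ToyFrame` is a hypothetical stand-in, partitions are plain `pandas.DataFrame` objects in a NumPy object array, and none of this is Modin's actual `PandasDataframe` code.

```python
import numpy as np
import pandas as pd


class ToyFrame:
    """Hypothetical stand-in for a partitioned frame; not Modin's PandasDataframe."""

    def __init__(self, partitions, index, columns):
        self._partitions = partitions  # 2D object array of pandas.DataFrame
        self.index = index
        self.columns = columns

    def _extract_partitions(self):
        """Return the partition grid, or a 1x1 grid with an empty frame if there is none."""
        if self._partitions.size > 0:
            return self._partitions
        # No partitions: wrap an empty frame that still carries the index and columns.
        grid = np.empty((1, 1), dtype=object)
        grid[0, 0] = pd.DataFrame(index=self.index, columns=self.columns)
        return grid


empty = ToyFrame(
    np.empty((0, 0), dtype=object),
    pd.RangeIndex(0),
    pd.Index(["key", "value"]),
)
fallback = empty._extract_partitions()
```

Because the empty frame keeps its column labels, code that receives `fallback[0, 0]` can still look up columns such as `"key"` even though there are no rows.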