FEAT-#7090: Add range-partitioning implementation for '.unique()' and '.drop_duplicates()' #7091
Docs change (range-partitioning usage page):

```diff
@@ -79,6 +79,12 @@ Range-partitioning Merge
 It is recommended to use this implementation if the right dataframe in merge is as big as
 the left dataframe. In this case, range-partitioning implementation works faster and consumes less RAM.

+'.unique()' and '.drop_duplicates()'
+""""""""""""""""""""""""""""""""""""
+
+Range-partitioning implementation of '.unique()'/'.drop_duplicates()' works best when the input data size is big (more than
+5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates).
+
 '.nunique()'
 """"""""""""""""""""""""""""""""""""
```

Review comment (on the new section): Should we refactor this doc page for 0.29.0?
Reply: Yes, and there is an issue for that: #6987

Review comment (on lines +85 to +86): This is not very descriptive, so as part of #6987 I'm also planning to include perf measurements that I've made for range-partitioning PRs in the docs.
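The rule of thumb from the added doc paragraph can be written down as a tiny predicate. This is a hypothetical helper for illustration only, not part of Modin's API; the two thresholds come straight from the documentation text above.

```python
def should_use_range_partitioning_unique(n_rows: int, duplicate_ratio: float) -> bool:
    """Docs' rule of thumb: range-partitioning '.unique()'/'.drop_duplicates()'
    pays off when the input is big (more than 5_000_000 rows) and the output
    is also expected to be big (no more than 80% of values are duplicates).
    """
    return n_rows > 5_000_000 and duplicate_ratio <= 0.80


print(should_use_range_partitioning_unique(10_000_000, 0.50))  # big input, mostly unique values
print(should_use_range_partitioning_unique(10_000_000, 0.95))  # output would be tiny
print(should_use_range_partitioning_unique(1_000, 0.10))       # input too small
```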
Query compiler change:

```diff
@@ -1933,13 +1933,36 @@ def str_split(self, pat=None, n=-1, expand=False, regex=None):

     # END String map partitions operations

-    def unique(self):
-        new_modin_frame = self._modin_frame.apply_full_axis(
-            0,
-            lambda x: x.squeeze(axis=1).unique(),
-            new_columns=self.columns,
-        )
-        return self.__constructor__(new_modin_frame)
+    def unique(self, keep="first", ignore_index=True, subset=None):
+        # kernels with 'pandas.Series.unique()' work faster
+        can_use_unique_kernel = (
+            subset is None and ignore_index and len(self.columns) == 1 and keep
+        )
+
+        if not can_use_unique_kernel and not RangePartitioning.get():
+            return super().unique(keep=keep, ignore_index=ignore_index, subset=subset)
+
+        if RangePartitioning.get():
+            new_modin_frame = self._modin_frame._apply_func_to_range_partitioning(
+                key_columns=self.columns.tolist() if subset is None else subset,
+                func=(
+                    (lambda df: pandas.DataFrame(df.squeeze(axis=1).unique()))
+                    if can_use_unique_kernel
+                    else (
+                        lambda df: df.drop_duplicates(
+                            keep=keep, ignore_index=ignore_index, subset=subset
+                        )
+                    )
+                ),
+                preserve_columns=True,
+            )
+        else:
+            new_modin_frame = self._modin_frame.apply_full_axis(
+                0,
+                lambda x: x.squeeze(axis=1).unique(),
+                new_columns=self.columns,
+            )
+        return self.__constructor__(new_modin_frame, shape_hint=self._shape_hint)

     def searchsorted(self, **kwargs):
         def searchsorted(df):
```

Review comment (on the `super().unique(...)` branch): Is this branch for d2p?
Reply: No, this branch is for general.

(dchigarev marked another conversation on this hunk as resolved.)
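To see why the range-partitioning branch above can deduplicate each partition independently, here is a toy stdlib-only sketch (illustrative only, not Modin's actual implementation): values are bucketed by where they fall in the key range, so equal values always land in the same partition, and a local `keep="first"` dedup per partition is globally correct.

```python
def range_partition_unique(values, n_partitions=4):
    # Route each value to a bucket by its position in [min, max]; equal
    # values always map to the same bucket, so deduplicating inside each
    # bucket is globally correct with no extra cross-partition pass.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_partitions or 1.0  # guard against all-equal input
    partitions = [dict() for _ in range(n_partitions)]
    for v in values:
        idx = min(int((v - lo) / width), n_partitions - 1)
        partitions[idx].setdefault(v, None)  # per-partition dedup, keep="first"
    # Concatenating the partition results is enough; note that rows come
    # back ordered by partition, not by their original position.
    return [v for part in partitions for v in part]


print(range_partition_unique([3, 1, 2, 3, 1, 2, 5]))  # → [1, 2, 3, 5]
```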
API layer change (`drop_duplicates` now delegates to the query compiler's `unique`):

```diff
@@ -1511,13 +1511,12 @@ def drop_duplicates(
                 subset = list(subset)
             else:
                 subset = [subset]
-            df = self[subset]
-        else:
-            df = self
-        duplicated = df.duplicated(keep=keep)
-        result = self[~duplicated]
-        if ignore_index:
-            result.index = pandas.RangeIndex(stop=len(result))
+        if len(diff := pandas.Index(subset).difference(self.columns)) > 0:
+            raise KeyError(diff)
+        result_qc = self._query_compiler.unique(
+            keep=keep, ignore_index=ignore_index, subset=subset
+        )
+        result = self.__constructor__(query_compiler=result_qc)
         if inplace:
             self._update_inplace(result._query_compiler)
         else:
```

Review comment (on the `raise KeyError(diff)` line): Does pandas raise the same error with this message?
Reply: Yes, it raises exactly the same error:

```python
>>> pd_df
   a  b
0  1  2
>>> pd_df.drop_duplicates(subset=["b", "c"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.9/site-packages/pandas/core/frame.py", line 6805, in drop_duplicates
    result = self[-self.duplicated(subset, keep=keep)]
  File "python3.9/site-packages/pandas/core/frame.py", line 6937, in duplicated
    raise KeyError(Index(diff))
KeyError: Index(['c'], dtype='object')
```
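The subset validation added in the hunk above follows a simple contract: normalize `subset` to a list, then raise `KeyError` naming every requested column missing from the frame. A stdlib-only sketch of that contract (a hypothetical helper; a plain list comprehension stands in for `pandas.Index.difference`):

```python
def validate_subset(subset, columns):
    # Normalize 'subset' to a list, then raise KeyError listing every
    # requested column that is missing from the frame -- the same
    # contract pandas.DataFrame.drop_duplicates follows.
    if subset is None:
        return list(columns)
    if not isinstance(subset, (list, tuple)):
        subset = [subset]
    if (diff := [c for c in subset if c not in columns]):
        raise KeyError(diff)
    return list(subset)


print(validate_subset(["b"], ["a", "b"]))  # → ['b']
try:
    validate_subset(["b", "c"], ["a", "b"])
except KeyError as exc:
    print(exc)  # the missing column 'c' is reported
```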
Review comment: Testing of range-partitioning implementations is now performed in a separate action.