# FEAT-#7090: Add range-partitioning implementation for '.unique()' and '.drop_duplicates()' (PR #7091)
Changes from 9 commits: `211b0f0`, `a8e3065`, `db093da`, `fe490a0`, `cb1d673`, `ae759de`, `039d9c5`, `f66af0a`, `df12976`, `c7ca617`, `4b36782`, `84a57a7`
```diff
@@ -18,7 +18,11 @@ runs:
       echo "::endgroup::"
     shell: bash -l {0}
   - run: |
-      echo "::group::Running experimental groupby tests (group 3)..."
+      echo "::group::Running range-partitioning tests (group 3)..."
       MODIN_RANGE_PARTITIONING_GROUPBY=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_groupby.py
+      MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_series.py -k "test_unique or drop_duplicates"
+      MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_general.py -k "test_unique"
+      MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/dataframe/test_map_metadata.py -k "drop_duplicates"
+      MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/dataframe/test_join_sort.py -k "merge"
       echo "::endgroup::"
     shell: bash -l {0}
```

**Review thread:**

> Can we already deprecate `MODIN_RANGE_PARTITIONING_GROUPBY`?

> This could be a non-trivial process, so #7105

> If someone uses the old config, a deprecation message should be enough to indicate usage of the new config.

> We also need to replace the old variable everywhere in our code and make sure that the config values are always synced; the last part is non-trivial, so I would make the deprecation in a separate PR.
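The CI job toggles the new code path purely through environment variables (`MODIN_RANGE_PARTITIONING=1`), which Modin's config layer later reads at runtime (e.g. via `RangePartitioning.get()`). A minimal sketch of that env-var-backed flag pattern — the `BoolEnvFlag` class here is a hypothetical illustration, not Modin's actual config implementation:

```python
import os

class BoolEnvFlag:
    """Boolean flag backed by an environment variable, similar in spirit
    to how configs such as MODIN_RANGE_PARTITIONING are consumed."""

    def __init__(self, varname, default=False):
        self.varname = varname
        self.default = default

    def get(self):
        raw = os.environ.get(self.varname)
        if raw is None:
            return self.default
        # treat "1"/"true"/"yes" (any case) as enabled
        return raw.strip().lower() in ("1", "true", "yes")

RANGE_PARTITIONING = BoolEnvFlag("MODIN_RANGE_PARTITIONING")

os.environ["MODIN_RANGE_PARTITIONING"] = "1"
print(RANGE_PARTITIONING.get())  # → True
```

This is why the CI can select implementations per test invocation: each `MODIN_...=1 pytest ...` line flips the flag only for that process.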
```diff
@@ -78,3 +78,9 @@ Range-partitioning Merge
 It is recommended to use this implementation if the right dataframe in merge is as big as
 the left dataframe. In this case, range-partitioning implementation works faster and consumes less RAM.
+
+'.unique()' and '.drop_duplicates()'
+""""""""""""""""""""""""""""""""""""
+
+Range-partitioning implementation of '.unique()'/'.drop_duplicates()' works best when the input data size is big (more than
+5_000_000 rows) and when the output size is also expected to be big (no more than 80% of the values are duplicates).
```

**Review thread:**

> Should we refactor this doc page for 0.29.0?

> Yes, and there is an issue for that: #6987

On lines +85 to +86:

> This is not very descriptive, so as part of #6987 I'm also planning to include the perf measurements I've made for range-partitioning PRs in the docs.
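The reason the range-partitioning implementation scales on large, mostly-unique data is that after partitioning rows by key range, all copies of a given value land in the same partition, so each partition can be deduplicated independently and the results concatenated without a second global pass. A toy single-process sketch of that idea — function names are illustrative, not Modin's API:

```python
def range_partition(values, n_parts):
    """Split numeric values into contiguous key ranges so that equal
    values always fall into the same partition."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_parts or 1  # avoid zero width when all values are equal
    parts = [[] for _ in range(n_parts)]
    for v in values:
        idx = min(int((v - lo) / width), n_parts - 1)
        parts[idx].append(v)
    return parts

def parallel_unique(values, n_parts=4):
    # dedup each partition independently; no cross-partition check is needed
    # because duplicates are guaranteed to be co-located
    out = []
    for part in range_partition(values, n_parts):
        seen = set()
        for v in part:
            if v not in seen:
                seen.add(v)
                out.append(v)
    return out

print(sorted(parallel_unique([5, 1, 5, 9, 1, 3, 9, 9])))  # → [1, 3, 5, 9]
```

In the real implementation each partition would be processed by a separate worker, which is why the approach pays off only when the data (and output) are large enough to amortize the shuffle.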
```diff
@@ -1768,24 +1768,33 @@ def to_timedelta(self, unit="ns", errors="raise"):  # noqa: PR02
             self, unit=unit, errors=errors
         )

-    # FIXME: get rid of `**kwargs` parameter (Modin issue #3108).
     @doc_utils.add_one_column_warning
-    @doc_utils.add_refer_to("Series.unique")
-    def unique(self, **kwargs):
+    @doc_utils.add_refer_to("Series.drop_duplicates")
+    def unique(self, keep="first", ignore_index=True, subset=None):
         """
-        Get unique values of `self`.
+        Get unique rows of `self`.

         Parameters
         ----------
-        **kwargs : dict
-            Serves compatibility purpose. Does not affect the result.
+        keep : {"first", "last", False}, default: "first"
+            Which duplicates to keep.
+        ignore_index : bool, default: True
+            If ``True``, the resulting axis will be labeled ``0, 1, …, n - 1``.
+        subset : list, optional
+            Only consider certain columns for identifying duplicates; if `None`, use all of the columns.

         Returns
         -------
         BaseQueryCompiler
             New QueryCompiler with unique values.
         """
-        return SeriesDefault.register(pandas.Series.unique)(self, **kwargs)
+        if subset is not None:
+            mask = self.getitem_column_array(subset, ignore_order=True)
+        else:
+            mask = self
+        without_duplicates = self.getitem_array(mask.duplicated(keep=keep).invert())
+        if ignore_index:
+            without_duplicates = without_duplicates.reset_index(drop=True)
+        return without_duplicates

     @doc_utils.add_one_column_warning
     @doc_utils.add_refer_to("Series.searchsorted")
```

**Review thread** (on `@doc_utils.add_refer_to("Series.drop_duplicates")`):

> Suggested change: …

> this was done on purpose, the …

> Since we want to have a stable QC API, it seems that having this kind of mismatch is not a good idea. I would also like to have different QCs: one responsible for d2p, the second as a base class for descendants. BaseQC does both now, which I don't think is great. What do you think?

> So I can rename …
>
> Looks like unnecessary overcomplication to me. I don't see a problem in making the base class implement its methods as d2p. I believe all the backends would inherit from the d2p QC and ignore the abstract one, so what's the point in having and supporting both of them?

> It seems we already have `drop_duplicates` in the base QC. Maybe just put a comment for now on why we refer to the `drop_duplicates` docs?
>
> I am talking about d2p and the base QC, which is not abstract but rather common functionality for all storage formats. Would that make sense? It looks like the base QC does double work at the moment.

> No, there's no …

> I don't know; I don't see why that would be a problem, or why we should support two QCs for that matter.
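The new base-class body expresses `unique` through a `duplicated` mask: mark duplicate rows, invert the mask, then select the survivors. The semantics of the `keep` parameter can be sketched in plain Python — this is an illustrative helper mirroring pandas' behavior, not Modin code:

```python
from collections import Counter

def duplicated(rows, keep="first"):
    """Return a boolean mask that is True for rows considered duplicates,
    mimicking pandas' keep={"first", "last", False} semantics."""
    if keep == "first":
        # every occurrence after the first is a duplicate
        seen, mask = set(), []
        for r in rows:
            mask.append(r in seen)
            seen.add(r)
    elif keep == "last":
        # scan backwards so the last occurrence is the one kept
        seen, mask = set(), []
        for r in reversed(rows):
            mask.append(r in seen)
            seen.add(r)
        mask.reverse()
    elif keep is False:
        # every row whose value occurs more than once is a duplicate
        counts = Counter(rows)
        mask = [counts[r] > 1 for r in rows]
    else:
        raise ValueError(f"invalid keep: {keep!r}")
    return mask

rows = ["a", "b", "a", "c", "a"]
# selecting rows where the mask is False is the "invert then getitem" step
print([r for r, dup in zip(rows, duplicated(rows, "first")) if not dup])  # → ['a', 'b', 'c']
```

The base implementation then simply resets the index when `ignore_index=True`, matching the `reset_index(drop=True)` call in the diff.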
```diff
@@ -1900,13 +1900,36 @@ def str_split(self, pat=None, n=-1, expand=False, regex=None):

     # END String map partitions operations

-    def unique(self):
-        new_modin_frame = self._modin_frame.apply_full_axis(
-            0,
-            lambda x: x.squeeze(axis=1).unique(),
-            new_columns=self.columns,
-        )
-        return self.__constructor__(new_modin_frame)
+    def unique(self, keep="first", ignore_index=True, subset=None):
+        # kernels with 'pandas.Series.unique()' work faster
+        can_use_unique_kernel = (
+            subset is None and ignore_index and len(self.columns) == 1 and keep
+        )
+
+        if not can_use_unique_kernel and not RangePartitioning.get():
+            return super().unique(keep=keep, ignore_index=ignore_index, subset=subset)
+
+        if RangePartitioning.get():
+            new_modin_frame = self._modin_frame._apply_func_to_range_partitioning(
+                key_columns=self.columns.tolist() if subset is None else subset,
+                func=(
+                    (lambda df: pandas.DataFrame(df.squeeze(axis=1).unique()))
+                    if can_use_unique_kernel
+                    else (
+                        lambda df: df.drop_duplicates(
+                            keep=keep, ignore_index=ignore_index, subset=subset
+                        )
+                    )
+                ),
+                preserve_columns=True,
+            )
+        else:
+            new_modin_frame = self._modin_frame.apply_full_axis(
+                0,
+                lambda x: x.squeeze(axis=1).unique(),
+                new_columns=self.columns,
+            )
+        return self.__constructor__(new_modin_frame, shape_hint=self._shape_hint)

     def searchsorted(self, **kwargs):
         def searchsorted(df):
```

**Review thread** (on the `super().unique(...)` branch):

> Is this branch for d2p?

> No, this branch is for general …

*(dchigarev marked the conversation on the `drop_duplicates` lambda as resolved.)*
```diff
@@ -1511,13 +1511,12 @@ def drop_duplicates(
                 subset = list(subset)
             else:
                 subset = [subset]
-            df = self[subset]
-        else:
-            df = self
-        duplicated = df.duplicated(keep=keep)
-        result = self[~duplicated]
-        if ignore_index:
-            result.index = pandas.RangeIndex(stop=len(result))
+            if len(diff := pandas.Index(subset).difference(self.columns)) > 0:
+                raise KeyError(diff)
+        result_qc = self._query_compiler.unique(
+            keep=keep, ignore_index=ignore_index, subset=subset
+        )
+        result = self.__constructor__(query_compiler=result_qc)
         if inplace:
             self._update_inplace(result._query_compiler)
         else:
```

**Review thread** (on the `raise KeyError(diff)` line):

> Does pandas raise the same error with this message?

> Yes, it raises exactly the same error:

```python
>>> pd_df
   a  b
0  1  2
>>> pd_df.drop_duplicates(subset=["b", "c"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python3.9/site-packages/pandas/core/frame.py", line 6805, in drop_duplicates
    result = self[-self.duplicated(subset, keep=keep)]
  File "python3.9/site-packages/pandas/core/frame.py", line 6937, in duplicated
    raise KeyError(Index(diff))
KeyError: Index(['c'], dtype='object')
```

**Review comment** (on the removed test code):

> Testing of range-partitioning implementations is now performed in a separate action.
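The frontend change validates `subset` up front with an `Index.difference` check, raising a `KeyError` that names only the missing labels instead of failing later inside the kernel. The same pattern in plain Python, using the walrus operator as the diff does — `validate_subset` is a hypothetical helper sketching the idea, not the PR's code:

```python
def validate_subset(subset, columns):
    """Raise KeyError listing the unknown labels, mirroring pandas'
    behavior of reporting only the missing part of `subset`."""
    if len(diff := [c for c in subset if c not in columns]) > 0:
        raise KeyError(diff)
    return subset

print(validate_subset(["b"], ["a", "b"]))  # → ['b']
try:
    validate_subset(["b", "c"], ["a", "b"])
except KeyError as e:
    print(e.args[0])  # → ['c'] — only the missing label is reported
```

Failing fast here keeps the error at the API boundary, so the query-compiler `unique` can assume a valid `subset`.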