FEAT-#7090: Add range-partitioning implementation for '.unique()' and '.drop_duplicates()' #7091
Conversation
- run: MODIN_RANGE_PARTITIONING=1 python -m pytest modin/pandas/test/dataframe/test_join_sort.py -k "merge"
  shell: bash -l {0}
testing of range-partitioning implementations is now performed in a separate action
)
modin_res = modin_series.drop_duplicates(keep=keep, inplace=inplace)
pandas_res = pandas_series.drop_duplicates(keep=keep, inplace=inplace)
if inplace:
Otherwise it was comparing `None`s in the `inplace=True` case.
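For illustration, a minimal self-contained sketch of the fixed comparison; the series values and the `keep`/`inplace` parameters are illustrative, while `df_equals` is Modin's test comparator:

```python
import pandas

import modin.pandas as pd
from modin.pandas.test.utils import df_equals

keep, inplace = "first", True  # illustrative parameters
modin_series = pd.Series([1, 1, 2, 3])
pandas_series = pandas.Series([1, 1, 2, 3])

modin_res = modin_series.drop_duplicates(keep=keep, inplace=inplace)
pandas_res = pandas_series.drop_duplicates(keep=keep, inplace=inplace)
# With inplace=True both calls return None, so the mutated series
# themselves must be compared instead of the return values.
if inplace:
    modin_res, pandas_res = modin_series, pandas_series
df_equals(modin_res, pandas_res)
```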
Range-partitioning implementation of '.unique()'/'.drop_duplicates()' works best when the input data size is big (more than
5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates).
This is not very descriptive, so as a part of #6987 I'm also planning to include the perf measurements that I've made for range-partitioning PRs in the docs.
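For context, a hedged usage sketch of opting into the implementation the docs hunk describes; `cfg.RangePartitioning` and `MODIN_RANGE_PARTITIONING` appear elsewhere in this PR, while the data here is purely illustrative:

```python
import modin.config as cfg
import modin.pandas as pd

# Opting in programmatically; equivalent to setting
# MODIN_RANGE_PARTITIONING=1 in the environment before the workload starts.
cfg.RangePartitioning.put(True)

# Illustrative data: a large input with relatively few duplicates is the
# case the docs hunk above says this implementation handles best.
df = pd.DataFrame({"a": range(6_000_000), "b": [1, 2, 3] * 2_000_000})
result = df.drop_duplicates()
```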
mask = self
without_duplicates = self.getitem_array(mask.duplicated(keep=keep).invert())
if ignore_index:
    without_duplicates.index = pandas.RangeIndex(len(without_duplicates.index))
- without_duplicates.index = pandas.RangeIndex(len(without_duplicates.index))
+ without_duplicates.index = pandas.RangeIndex(len(without_duplicates))
Good note, but here we're operating on a query compiler, which doesn't have the `.__len__()` method, so the only way to get its length is via its index.
So is there a point in this line, given that the index would have to be computed anyway in order to set the new one? Can this be removed?
Indeed, we can use `.reset_index()` here instead; it resets the index without computing the old one.
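A hedged sketch of the agreed change (the exact query-compiler call shape is an assumption):

```python
# reset_index(drop=True) replaces the index with a fresh RangeIndex
# without materializing the old index first, unlike the RangeIndex
# assignment above, which has to compute len(qc.index).
if ignore_index:
    without_duplicates = without_duplicates.reset_index(drop=True)
```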
MODIN_RANGE_PARTITIONING_GROUPBY=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_groupby.py
MODIN_RANGE_PARTITIONING=1 ${{ inputs.runner }} ${{ inputs.parallel }} modin/pandas/test/test_series.py -k "test_unique or drop_duplicates"
Can we already deprecate MODIN_RANGE_PARTITIONING_GROUPBY?
This could be a non-trivial process, so it's tracked in #7105.
If someone uses the old config, a deprecation message should be enough to point them to the new config.
We also need to replace the old variable everywhere in our code and make sure that the config values are always synced:
cfg.ExperimentalGroupby == cfg.RangePartitioningGroupby == cfg.RangePartitioning
The last part is non-trivial, so I would do the deprecation in a separate PR.
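A hypothetical sketch of keeping the deprecated flag in sync with its replacement so the invariant above holds; `RangePartitioning.get()`/`.put()` exist in `modin.config`, but the subscribe-based wiring and the warning text are assumptions, not the actual deprecation plan from #7105:

```python
import warnings

from modin.config import RangePartitioning, RangePartitioningGroupby

def _sync_old_flag(param):
    # Called whenever the deprecated config changes value: warn and
    # mirror the value into the new config so both always agree.
    warnings.warn(
        "RangePartitioningGroupby is deprecated; use RangePartitioning instead.",
        FutureWarning,
    )
    RangePartitioning.put(param.get())

RangePartitioningGroupby.subscribe(_sync_old_flag)
```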
@@ -78,3 +78,9 @@ Range-partitioning Merge

It is recommended to use this implementation if the right dataframe in merge is as big as
the left dataframe. In this case, range-partitioning implementation works faster and consumes less RAM.

'.unique()' and '.drop_duplicates()'
Should we refactor this doc page for 0.29.0?
Yes, and there is an issue for that: #6987.
return np.sort(data)


def sort_if_range_partitioning(df1, df2, comparator=None):
Do we have a note somewhere that range partitioning gives a row-order mismatch?
Not yet; it should be done in #6987.
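A hedged reconstruction of the helper from the visible fragments: when the range-partitioning path is active, output row order may differ from pandas, so both results get sorted before comparison. `df_equals`, `RangePartitioning`, and the `np.sort`-based sorting appear in this PR; the exact body is an assumption:

```python
def sort_if_range_partitioning(df1, df2, comparator=None):
    # Fall back to Modin's standard comparator unless a custom one is given.
    if comparator is None:
        comparator = df_equals
    # Range partitioning doesn't guarantee pandas' row order, so normalize
    # both results by sorting (sort_data is the np.sort-based helper above).
    if RangePartitioning.get():
        df1, df2 = sort_data(df1), sort_data(df2)
    comparator(df1, df2)
```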
if ignore_index:
    result.index = pandas.RangeIndex(stop=len(result))
if len(diff := pandas.Index(subset).difference(self.columns)) > 0:
    raise KeyError(diff)
Does pandas raise the same error with this message?
Yes, it raises exactly the same error:
>>> pd_df
a b
0 1 2
>>> pd_df.drop_duplicates(subset=["b", "c"])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "python3.9/site-packages/pandas/core/frame.py", line 6805, in drop_duplicates
result = self[-self.duplicated(subset, keep=keep)]
File "python3.9/site-packages/pandas/core/frame.py", line 6937, in duplicated
raise KeyError(Index(diff))
KeyError: Index(['c'], dtype='object')
@doc_utils.add_one_column_warning
@doc_utils.add_refer_to("Series.unique")
def unique(self, **kwargs):
@doc_utils.add_refer_to("Series.drop_duplicates")
@doc_utils.add_refer_to("Series.drop_duplicates") | |
@doc_utils.add_refer_to("Series.unique") |
This was done on purpose: the `qc.unique()` method now uses most of the arguments from `Series.drop_duplicates()`, so it makes sense to refer to that method instead.
Since we want to have a stable QC API, it seems that having this kind of mismatch is not a good idea. I would also like to have two different QCs: one responsible for d2p (default-to-pandas), the second serving as a base class for descendants. BaseQC does both things now, which is not great, I think. What do you think?
> Since we want to have a stable QC API, it seems that having this kind of mismatch is not a good idea.

So I can rename `qc.unique` -> `qc.drop_duplicates`, would that be better?

> I would also like to have two different QCs: one responsible for d2p, the second serving as a base class for descendants.

Looks like unnecessary overcomplication to me. I don't see a problem in making the base class implement its methods as d2p. I believe all the backends would inherit from the d2p QC and ignore the abstract one, so what's the point then in having and supporting both of them?
> So I can rename `qc.unique` -> `qc.drop_duplicates`, would that be better?

It seems we already have `drop_duplicates` in the base QC. Maybe just put a comment for now on why we refer to the `drop_duplicates` docs?

> Looks like unnecessary overcomplication to me. I don't see a problem in making the base class implement its methods as d2p. I believe all the backends would inherit from the d2p QC and ignore the abstract one, so what's the point then in having and supporting both of them?

I am talking about the d2p QC and the base QC, which is not abstract but rather common functionality for all storage formats. Would it make sense? It looks like the base QC does double work at the moment.
> It seems we already have `drop_duplicates` in the base QC. Maybe just put a comment for now on why we refer to the `drop_duplicates` docs?

No, there's no `drop_duplicates()` method at the QC level, only `.duplicated()`. I will put a comment, though, on why we're referring to `drop_duplicates`.
> I am talking about the d2p QC and the base QC, which is not abstract but rather common functionality for all storage formats. Would it make sense? It looks like the base QC does double work at the moment.

I don't know; I don't see why that would be a problem, nor why we should support two QCs for that matter.
return self.__constructor__(new_modin_frame)

if not can_use_unique_kernel and not RangePartitioning.get():
    return super().unique(keep=keep, ignore_index=ignore_index, subset=subset)
Is this branch for d2p?
No, this branch is for the general `unique()` implementation that uses the QC's public API.
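A hedged sketch of that general path, matching the hunk shown earlier in the thread: `unique()`/`drop_duplicates()` expressed purely through QC public API calls, so any backend without a dedicated kernel can still execute it (`qc` and `keep` are illustrative names):

```python
# Mark duplicated rows, invert the boolean mask, and keep the rest.
mask = qc.duplicated(keep=keep).invert()
result = qc.getitem_array(mask)
```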
Please resolve conflicts.
What do these changes do?
This PR merges the implementations of `.unique()`/`.drop_duplicates()` into a single query compiler method called `unique()`. Some parts of the `DataFrame.drop_duplicates()` implementation were moved to `BaseQueryCompiler.unique()`.

The performance highly depends on the number of duplicated values in the input data, so it's almost impossible to figure out a solid working heuristic:
[collapsed in the original: "perf for '.unique()'" details with measurements for MODIN_CPUS=16 and MODIN_CPUS=44, plus the script to measure]
[collapsed in the original: "perf for '.drop_duplicates()'" details with measurements for MODIN_CPUS=16 and MODIN_CPUS=44, plus the script to measure]
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- Resolves "Add range-partitioning implementation for '.unique()'/'.drop_duplicates()'" #7090
- tests added and are passing
- docs/development/architecture.rst is up-to-date