-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOCS-#6987: Rework range-partitioning docs #7169
Merged
Merged
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
e70cf9b
DOCS-#6987: Rework range-partitioning docs
dchigarev 6f1ed03
Merge remote-tracking branch 'origin/master' into issue_6987
dchigarev e72cfcb
update docs
dchigarev d2520f8
update groupby section & move docs to opt notes
dchigarev 8ccca0c
apply review suggestions
dchigarev File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
111 changes: 6 additions & 105 deletions
111
docs/flow/modin/experimental/range_partitioning_groupby.rst
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,107 +1,8 @@ | ||
Range-partitioning GroupBy | ||
"""""""""""""""""""""""""" | ||
:orphan: | ||
|
||
The range-partitioning GroupBy implementation utilizes Modin's reshuffling mechanism that gives an | ||
ability to build range partitioning over a Modin DataFrame. | ||
.. redirect to the new page | ||
.. raw:: html | ||
|
||
In order to enable/disable the range-partitiong implementation you have to specify ``cfg.RangePartitioning`` | ||
:doc:`configuration variable: </flow/modin/config>` | ||
|
||
.. code-block:: ipython | ||
|
||
In [4]: import modin.config as cfg; cfg.RangePartitioning.put(True) | ||
|
||
In [5]: # past this point, Modin will always use the range-partitiong groupby implementation | ||
|
||
In [6]: cfg.RangePartitioning.put(False) | ||
|
||
In [7]: # past this point, Modin won't use range-partitiong groupby implementation anymore | ||
|
||
The range-partitiong implementation appears to be quite efficient when compared to TreeReduce and FullAxis implementations: | ||
|
||
.. note:: | ||
|
||
All of the examples below were run on Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz (112 cores), 192gb RAM | ||
|
||
.. code-block:: ipython | ||
|
||
In [4]: import modin.pandas as pd; import numpy as np | ||
|
||
In [5]: df = pd.DataFrame(np.random.randint(0, 1_000_000, size=(1_000_000, 10)), columns=[f"col{i}" for i in range(10)]) | ||
|
||
In [6]: %timeit df.groupby("col0").nunique() # full-axis implementation | ||
Out[6]: # 2.73 s ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | ||
|
||
In [7]: import modin.config as cfg; cfg.RangePartitioning.put(True) | ||
|
||
In [8]: %timeit df.groupby("col0").nunique() # range-partitiong implementation | ||
Out[8]: # 595 ms ± 61.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | ||
|
||
Although it may look like the range-partitioning implementation always outperforms the other ones, it's not actually true. | ||
There's a decent overhead on building the range partitioning itself, meaning that the other implementations | ||
may act better on smaller data sizes or when the grouping columns (a key column to build range partitioning) | ||
have too few unique values (and thus fewer units of parallelization): | ||
|
||
.. code-block:: ipython | ||
|
||
In [4]: import modin.pandas as pd; import numpy as np | ||
|
||
In [5]: df = pd.DataFrame({"col0": np.tile(list("abcde"), 50_000), "col1": np.arange(250_000)}) | ||
|
||
In [6]: %timeit df.groupby("col0").sum() # TreeReduce implementation | ||
Out[6]: # 155 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) | ||
|
||
In [7]: import modin.config as cfg; cfg.RangePartitioning.put(True) | ||
|
||
In [8]: %timeit df.groupby("col0").sum() # range-partitiong implementation | ||
Out[8]: # 230 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) | ||
|
||
We're still looking for a heuristic that would be able to automatically switch to the best implementation | ||
for each groupby case, but for now, we're offering to play with this switch on your own to see which | ||
implementation works best for your particular case. | ||
|
||
The range-partitioning groupby does not yet support all of the pandas API and falls back to an other | ||
implementation with the respective warning if it meets an unsupported case: | ||
|
||
.. code-block:: python | ||
|
||
In [14]: import modin.config as cfg; cfg.RangePartitioning.put(True) | ||
|
||
In [15]: df.groupby(level=0).sum() | ||
Out[15]: # UserWarning: Can't use range-partitiong groupby implementation because of: | ||
... # Range-partitioning groupby is only supported when grouping on a column(s) of the same frame. | ||
... # https://github.com/modin-project/modin/issues/5926 | ||
... # Falling back to a TreeReduce implementation. | ||
|
||
Range-partitioning Merge | ||
"""""""""""""""""""""""" | ||
|
||
It is recommended to use this implementation if the right dataframe in merge is as big as | ||
the left dataframe. In this case, range-partitioning implementation works faster and consumes less RAM. | ||
|
||
'.unique()' and '.drop_duplicates()' | ||
"""""""""""""""""""""""""""""""""""" | ||
|
||
Range-partitioning implementation of '.unique()'/'.drop_duplicates()' works best when the input data size is big (more than | ||
5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates). | ||
|
||
'.nunique()' | ||
"""""""""""""""""""""""""""""""""""" | ||
|
||
.. note:: | ||
|
||
Range-partitioning approach is implemented only for 'pd.Series.nunique()' and 1-column dataframes. | ||
For multi-column dataframes '.nunique()' can only use full-axis reduce implementation. | ||
|
||
Range-partitioning implementation of '.nunique()'' works best when the input data size is big (more than | ||
5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates). | ||
|
||
Resample | ||
"""""""" | ||
|
||
.. note:: | ||
|
||
Range-partitioning approach doesn't support transform-like functions (like `.interpolate()`, `.ffill()`, `.bfill()`, ...) | ||
|
||
It is recommended to use range-partitioning for resampling if you're dealing with a dataframe that has more than | ||
5_000_000 rows and the expected output is also expected to be big (more than 500_000 rows). | ||
<script type="text/javascript"> | ||
window.location.href = '../../../usage_guide/optimization_notes/index.html#range-partitioning-in-modin'; | ||
</script> |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,5 +4,5 @@ | |
.. raw:: html | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same |
||
|
||
<script type="text/javascript"> | ||
window.location.href = 'range_partitioning_groupby.html'; | ||
window.location.href = '../../../usage_guide/optimization_notes/index.html#range-partitioning-in-modin'; | ||
</script> |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should remove this page at all as we do not reference it anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we still may have a reference to this page somewhere outside our docs (messages, gh issues/comments, blog posts, in-code comments). So I would prefer an automatic redirect to a new page instead of 404