Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCS-#6987: Rework range-partitioning docs #7169

Merged
merged 5 commits into from
Apr 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions docs/flow/modin/experimental/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,6 @@ and provides a limited set of functionality:
* :doc:`xgboost <xgboost>`
* :doc:`sklearn <sklearn>`
* :doc:`batch <batch>`
* :doc:`Range-partitioning implementations <range_partitioning_groupby>`


.. toctree::
Expand All @@ -24,4 +23,3 @@ and provides a limited set of functionality:
sklearn
xgboost
batch
range_partitioning_groupby
111 changes: 6 additions & 105 deletions docs/flow/modin/experimental/range_partitioning_groupby.rst
Original file line number Diff line number Diff line change
@@ -1,107 +1,8 @@
Range-partitioning GroupBy
""""""""""""""""""""""""""
:orphan:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should remove this page at all as we do not reference it anywhere?

Copy link
Collaborator Author

@dchigarev dchigarev Apr 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still may have a reference to this page somewhere outside our docs (messages, gh issues/comments, blog posts, in-code comments). So I would prefer an automatic redirect to a new page instead of 404

The range-partitioning GroupBy implementation utilizes Modin's reshuffling mechanism that gives an
ability to build range partitioning over a Modin DataFrame.
.. redirect to the new page
.. raw:: html

In order to enable/disable the range-partitiong implementation you have to specify ``cfg.RangePartitioning``
:doc:`configuration variable: </flow/modin/config>`

.. code-block:: ipython

In [4]: import modin.config as cfg; cfg.RangePartitioning.put(True)

In [5]: # past this point, Modin will always use the range-partitiong groupby implementation

In [6]: cfg.RangePartitioning.put(False)

In [7]: # past this point, Modin won't use range-partitiong groupby implementation anymore

The range-partitiong implementation appears to be quite efficient when compared to TreeReduce and FullAxis implementations:

.. note::

All of the examples below were run on Intel(R) Xeon(R) Gold 6238R CPU @ 2.20GHz (112 cores), 192gb RAM

.. code-block:: ipython

In [4]: import modin.pandas as pd; import numpy as np

In [5]: df = pd.DataFrame(np.random.randint(0, 1_000_000, size=(1_000_000, 10)), columns=[f"col{i}" for i in range(10)])

In [6]: %timeit df.groupby("col0").nunique() # full-axis implementation
Out[6]: # 2.73 s ± 28.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: import modin.config as cfg; cfg.RangePartitioning.put(True)

In [8]: %timeit df.groupby("col0").nunique() # range-partitiong implementation
Out[8]: # 595 ms ± 61.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Although it may look like the range-partitioning implementation always outperforms the other ones, it's not actually true.
There's a decent overhead on building the range partitioning itself, meaning that the other implementations
may act better on smaller data sizes or when the grouping columns (a key column to build range partitioning)
have too few unique values (and thus fewer units of parallelization):

.. code-block:: ipython

In [4]: import modin.pandas as pd; import numpy as np

In [5]: df = pd.DataFrame({"col0": np.tile(list("abcde"), 50_000), "col1": np.arange(250_000)})

In [6]: %timeit df.groupby("col0").sum() # TreeReduce implementation
Out[6]: # 155 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: import modin.config as cfg; cfg.RangePartitioning.put(True)

In [8]: %timeit df.groupby("col0").sum() # range-partitiong implementation
Out[8]: # 230 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We're still looking for a heuristic that would be able to automatically switch to the best implementation
for each groupby case, but for now, we're offering to play with this switch on your own to see which
implementation works best for your particular case.

The range-partitioning groupby does not yet support all of the pandas API and falls back to an other
implementation with the respective warning if it meets an unsupported case:

.. code-block:: python

In [14]: import modin.config as cfg; cfg.RangePartitioning.put(True)

In [15]: df.groupby(level=0).sum()
Out[15]: # UserWarning: Can't use range-partitiong groupby implementation because of:
... # Range-partitioning groupby is only supported when grouping on a column(s) of the same frame.
... # https://github.com/modin-project/modin/issues/5926
... # Falling back to a TreeReduce implementation.

Range-partitioning Merge
""""""""""""""""""""""""

It is recommended to use this implementation if the right dataframe in merge is as big as
the left dataframe. In this case, range-partitioning implementation works faster and consumes less RAM.

'.unique()' and '.drop_duplicates()'
""""""""""""""""""""""""""""""""""""

Range-partitioning implementation of '.unique()'/'.drop_duplicates()' works best when the input data size is big (more than
5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates).

'.nunique()'
""""""""""""""""""""""""""""""""""""

.. note::

Range-partitioning approach is implemented only for 'pd.Series.nunique()' and 1-column dataframes.
For multi-column dataframes '.nunique()' can only use full-axis reduce implementation.

Range-partitioning implementation of '.nunique()'' works best when the input data size is big (more than
5_000_000 rows) and when the output size is also expected to be big (no more than 80% values are duplicates).

Resample
""""""""

.. note::

Range-partitioning approach doesn't support transform-like functions (like `.interpolate()`, `.ffill()`, `.bfill()`, ...)

It is recommended to use range-partitioning for resampling if you're dealing with a dataframe that has more than
5_000_000 rows and the expected output is also expected to be big (more than 500_000 rows).
<script type="text/javascript">
window.location.href = '../../../usage_guide/optimization_notes/index.html#range-partitioning-in-modin';
</script>
2 changes: 1 addition & 1 deletion docs/flow/modin/experimental/reshuffling_groupby.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,5 +4,5 @@
.. raw:: html
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same


<script type="text/javascript">
window.location.href = 'range_partitioning_groupby.html';
window.location.href = '../../../usage_guide/optimization_notes/index.html#range-partitioning-in-modin';
</script>
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
32 changes: 32 additions & 0 deletions docs/usage_guide/optimization_notes/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,36 @@ cases. This page is for those who love to optimize their code and those who are
within Modin. Here you can find more information about Modin's optimizations both for a pipeline of operations as
well as for specific operations.

Range-partitioning in Modin
"""""""""""""""""""""""""""

Modin utilizes a range-partitioning approach for specific operations, significantly enhancing
parallelism and reducing memory consumption in certain scenarios. Range-partitioning is typically
engaged for operations that has key columns (to group on, to merge on, etc).

You can enable `range-partitioning`_ by specifying ``cfg.RangePartitioning`` :doc:`configuration variable: </flow/modin/config>`

.. code-block:: python

import modin.pandas as pd
import modin.config as cfg

cfg.RangePartitioning.put(True) # past this point methods that support range-partitioning
# will use it

pd.DataFrame(...).groupby(...).mean() # use range-partitioning for groupby.mean()

cfg.Range-partitioning.put(False)

pd.DataFrame(...).groupby(...).mean() # use MapReduce implementation for groupby.mean()

Building range-partitioning assumes data reshuffling, which may result into breaking the original
order of rows, for some operation, it will mean that the result will be different from Pandas.

Range-partitioning is not a silver bullet, meaning that enabling it is not always beneficial. Below you find
a link to the list of operations that have support for range-partitioning and practical advices on when one should
enable it: :doc:`operations that support range-partitioning </usage_guide/optimization_notes/range_partitioning_ops>`.

Understanding Modin's partitioning mechanism
""""""""""""""""""""""""""""""""""""""""""""

Expand Down Expand Up @@ -275,3 +305,5 @@ an inner join you may want to swap left and right DataFrames.
1.22 s 40.1 ms per loop (mean std. dev. of 7 runs, 1 loop each)

Note that result columns order may differ for first and second ``merge``.

.. _range-partitioning: https://www.techopedia.com/definition/31994/range-partitioning
Loading
Loading