Skip to content

Commit

Permalink
DOCS-#7217: Update docs as to when Modin operators work best (#7218)
Browse files Browse the repository at this point in the history
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
  • Loading branch information
YarShev committed Apr 26, 2024
1 parent 199e3fd commit 74cf1cf
Showing 1 changed file with 26 additions and 0 deletions.
26 changes: 26 additions & 0 deletions docs/flow/modin/core/dataframe/algebra.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,11 @@ Uniformly apply a function argument to each partition in parallel.
.. figure:: /img/map_evaluation.svg
:align: center

This operator performs best when the number of partitions equals to the number of CPUs
so that each single partition gets processed in parallel. When the number of partitions is 1.5x greater than
the number of CPUs, Modin applies a heuristic to join some partitions to get "ideal" partitioning so that
each new partition gets processed in parallel.

Reduce operator
---------------
Applies an argument function that reduces each column or row on the specified axis into a scalar, but requires knowledge about the whole axis.
Expand All @@ -43,13 +48,19 @@ that the reduce function returns a one dimensional frame.
.. figure:: /img/reduce_evaluation.svg
:align: center

This operator performs best when the number of partitions (row or column partitions in depend on the specified axis)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

TreeReduce operator
-------------------
Applies an argument function that reduces specified axis into a scalar. First applies map function to each partition
in parallel, then concatenates resulted partitions along the specified axis and applies reduce
function. In contrast with `Map function` template, here you're allowed to change partition shape
in the map phase. Note that the execution engine expects that the reduce function returns a one dimensional frame.

This operator performs best when the number of partitions (including the initial and intermediate stages)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

Binary operator
---------------
Applies an argument function, that takes exactly two operands (first is always `QueryCompiler`).
Expand All @@ -65,20 +76,35 @@ the right operand to the left.
it automatically but note that this requires repartitioning, which is a much
more expensive operation than the binary function itself.

This operator performs best when both operands have identical partitioning and the number of partitions of an operand
equals to the number of CPUs so that each single partition gets processed in parallel.

Fold operator
-------------
Applies an argument function that requires knowledge of the whole axis. Be aware that providing this knowledge may be
expensive because the execution engine has to concatenate partitions along the specified axis.

This operator performs best when the number of partitions (row or column partitions in depend on the specified axis)
equals to the number of CPUs so that each single axis partition gets processed in parallel.

GroupBy operator
----------------
Evaluates GroupBy aggregation for that type of functions that can be executed via TreeReduce approach.
To be able to form groups engine broadcasts ``by`` partitions to each partition of the source frame.

This operator performs best when the cardinality of ``by`` columns is low (small number of output groups).
At the ``Map`` stage, the operator computes the aggregation for each row partition individually, meaning,
that the ``Reduce`` stage takes a dataframe with the following number of rows:
``num_groups * n_row_parts``. If the number of groups is too high, there's a risk of getting a dataframe
with even bigger than the initial shape at the ``Reduce`` stage.

Default-to-pandas operator
--------------------------
Do :doc:`fallback to pandas </supported_apis/defaulting_to_pandas>` for passed function.

This operator has a performance penalty for going from a partitioned Modin DataFrame to pandas because of
the communication cost and single-threaded nature of pandas.


How to register your own function
'''''''''''''''''''''''''''''''''
Expand Down

0 comments on commit 74cf1cf

Please sign in to comment.