
PERF-#6876: Skip the masking stage on 'iloc' where beneficial #6878

Merged
merged 2 commits into modin-project:master on Jan 24, 2024

Conversation


@dchigarev dchigarev commented Jan 24, 2024

What do these changes do?

This PR extends the idea originally introduced in #6423.

What's the idea

The masking pipeline consists of two stages:

  1. Compute individual indexers for each partition and mask them.
  2. If the indexing forces reordering, apply a full-column reordering function to the result of the first stage.

In theory, the second stage can function without the first one; the first stage exists only to reduce the size of the input for the second stage. However, the first stage can sometimes be quite expensive, and in such cases it is reasonable to skip it and jump to the second stage right away.
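To make the tradeoff concrete, here is a toy sketch of the two stages in plain NumPy (illustrative only, not Modin's internals; the partition layout and variable names are made up):

import numpy as np

# Two row partitions of a 10-row frame (rows 0..4 and 5..9), 10 columns each.
partitions = [np.arange(0, 50).reshape(5, 10), np.arange(50, 100).reshape(5, 10)]
row_positions = np.array([7, 2, 9, 0])  # unsorted -> the reordering stage is required

# Stage 1 (masking): each partition keeps only the rows it owns, shrinking the
# input of the next stage from 10 rows down to 4.
starts = np.cumsum([0] + [len(p) for p in partitions[:-1]])
masked = [
    p[np.sort(row_positions[(row_positions >= s) & (row_positions < s + len(p))]) - s]
    for p, s in zip(partitions, starts)
]

# Stage 2 (reordering): a full pass that puts the rows into the requested order.
# Note that it could just as well be applied to the original, unmasked data:
reordered = np.concatenate(partitions)[row_positions]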

This PR extends the set of cases in which we skip the first stage, adding those where we know for sure that the reordering stage will be required. Based on the measurements below, I came up with the following condition to control this:

# 'base_num_cols' specifies the number of columns that the dataframe should have
# in order to jump to 'reorder_labels' in case of len(row_positions) / len(self) >= base_ratio;
# these variables may be subject to change in order to tune performance more accurately
base_num_cols = 10
base_ratio = 0.2
# Example:
#   len(self.columns): 10 == base_num_cols -> min ratio to jump to reorder_labels: 0.2 == base_ratio
#   len(self.columns): 15 -> min ratio to jump to reorder_labels: 0.3
#   len(self.columns): 20 -> min ratio to jump to reorder_labels: 0.4
#   ...
#   len(self.columns): 49 -> min ratio to jump to reorder_labels: 0.98
#   len(self.columns): 50 -> min ratio to jump to reorder_labels: 1.0
#   len(self.columns): 55 -> min ratio to jump to reorder_labels: 1.0
#   ...
if rows_reordering_required and len(row_positions) * base_num_cols >= min(
    all_rows * len(self.columns) * base_ratio, all_rows * base_num_cols
):
    # jump to reorder_labels
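Rewritten as a standalone predicate (a sketch derived from the example ratios in the comment above; the function name and the numbers in the asserts are made up for illustration), the rule looks like this:

def should_skip_masking(num_taken, num_rows, num_cols,
                        base_num_cols=10, base_ratio=0.2):
    """Illustrative form of the condition above: return True when jumping
    straight to the reordering stage is expected to pay off."""
    # The required fraction of taken rows grows linearly with the column count
    # and is capped at 1.0 (i.e. 'take at least as many rows as the frame has').
    required_ratio = min(num_cols * base_ratio / base_num_cols, 1.0)
    return num_taken / num_rows >= required_ratio

# With 10 columns, taking >= 20% of the rows is enough to skip masking;
# with 50 columns the taken positions must cover at least the full row count.
assert should_skip_masking(num_taken=20_000, num_rows=100_000, num_cols=10)
assert not should_skip_masking(num_taken=20_000, num_rows=100_000, num_cols=50)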
Measurements with MODIN_CPUS=44:

[image]

The reproducer from the issue (#6876) now performs like this:

[image]

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves df.take is much slower against pandas #6876
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

PERF-#6876: Skip the masking stage on 'iloc' where beneficial

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
            indexers.append(indexer)
        row_positions, col_positions = indexers

        if col_positions is None and row_positions is None:
            return self.copy()

        # quite fast check that allows skipping the sorting
        must_sort_row_pos = row_positions is not None and not np.all(
            row_positions[1:] >= row_positions[:-1]
        )
dchigarev (Collaborator, Author):
this check was copied from _maybe_reorder_labels and it indeed works pretty fast:

import numpy as np
from timeit import default_timer as timer

arr = np.random.randint(0, 100_000_000, size=100_000_000)

t1 = timer()
must_sort_col_pos = np.all(arr[1:] >= arr[:-1])
print(timer() - t1) # 0.09s

@@ -1175,18 +1185,40 @@ def _take_2d_positional(
        all_rows = None
        if self.has_materialized_index:
            all_rows = len(self.index)
-       elif self._row_lengths_cache:
+       elif self._row_lengths_cache or must_sort_row_pos:
dchigarev (Collaborator, Author):
we will have to trigger row_lengths anyway in that case

@dchigarev dchigarev marked this pull request as ready for review January 24, 2024 13:59
@dchigarev
Collaborator Author

@anmyachev I believe you're the most suitable person to review this

@anmyachev
Collaborator

anmyachev commented Jan 24, 2024

@dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

@YarShev
Collaborator

YarShev commented Jan 24, 2024

> @dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

works for me

@anmyachev
Collaborator

> @dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

> works for me

Apparently I had a bug, now it also works for me.

Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
            must_sort_row_pos
            and len(row_positions) * base_num_cols
            >= min(
                all_rows * len(self.columns) * base_ratio,
Collaborator:

Is it performance-wise safe to materialize columns here? (Are they already materialized somewhere nearby, or not?)

dchigarev (Collaborator, Author):
no, they're not materialized explicitly anywhere nearby, but this value is required to properly branch here, so I guess I have no good choices here

dchigarev (Collaborator, Author):
my assumption is that columns are more likely to be pre-computed than indices, so accessing them shouldn't always trigger computations

Collaborator:
Ok, we can try to use self.column_widths in the future if needed
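For reference, a rough sketch of that alternative (the helper name and the '_column_widths_cache' attribute check are assumptions, not Modin's confirmed API):

def num_cols_without_materializing(df):
    """Illustrative helper: derive the column count from cached partition
    widths when possible instead of materializing the column labels."""
    widths = getattr(df, "_column_widths_cache", None)
    if widths is not None:
        return sum(widths)
    return len(df.columns)  # falls back to materializing the labels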

@anmyachev anmyachev left a comment (Collaborator):
LGTM!

@anmyachev anmyachev merged commit 72de8c0 into modin-project:master Jan 24, 2024
37 checks passed
Labels: None yet
Projects: None yet

Successfully merging this pull request may close these issues: df.take is much slower against pandas (#6876)

3 participants