
PERF-#6876: Skip the masking stage on 'iloc' where beneficial #6878

Merged
merged 2 commits into modin-project:master on Jan 24, 2024

Conversation


@dchigarev dchigarev commented Jan 24, 2024

What do these changes do?

This PR extends the idea originally introduced in #6423.

What's the idea

The masking pipeline consists of two stages:

  1. Compute individual indexers for each partition and mask them.
  2. If the indexing forces reordering, apply a full-column reordering function to the result of the first stage.

In theory, the second stage can function without the first one; the first stage exists only to reduce the size of the input for the second stage. However, the first stage can sometimes be quite expensive, and in such cases it is reasonable to skip it and jump to the second stage right away.
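To make the tradeoff concrete, here is a toy sketch of the two stages in plain NumPy (illustrative only, not Modin's internals; the partition layout and variable names are made up):

import numpy as np

# Two row partitions of a 10-row frame (rows 0..4 and 5..9), 10 columns each.
partitions = [np.arange(0, 50).reshape(5, 10), np.arange(50, 100).reshape(5, 10)]
row_positions = np.array([7, 2, 9, 0])  # unsorted -> the reordering stage is required

# Stage 1 (masking): each partition keeps only the rows it owns, shrinking the
# input of the next stage from 10 rows down to 4.
starts = np.cumsum([0] + [len(p) for p in partitions[:-1]])
masked = [
    p[np.sort(row_positions[(row_positions >= s) & (row_positions < s + len(p))]) - s]
    for p, s in zip(partitions, starts)
]

# Stage 2 (reordering): a full pass that puts the rows into the requested order.
# Note that it could just as well be applied to the original, unmasked data:
reordered = np.concatenate(partitions)[row_positions]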

This PR extends the set of cases in which we skip the first stage, adding those where we know for sure that the reordering stage will be required. Based on the measurements below, I came up with the following condition to control this:

# 'base_num_cols' specifies the number of columns that the dataframe should have
# in order to jump to 'reorder_labels' in case of len(row_positions) / len(self) >= base_ratio;
# these variables may be subject to change in order to tune performance more accurately
base_num_cols = 10
base_ratio = 0.2
# Example:
#   len(self.columns): 10 == base_num_cols -> min ratio to jump to reorder_labels: 0.2 == base_ratio
#   len(self.columns): 15 -> min ratio to jump to reorder_labels: 0.3
#   len(self.columns): 20 -> min ratio to jump to reorder_labels: 0.4
#   ...
#   len(self.columns): 49 -> min ratio to jump to reorder_labels: 0.98
#   len(self.columns): 50 -> min ratio to jump to reorder_labels: 1.0
#   len(self.columns): 55 -> min ratio to jump to reorder_labels: 1.0
#   ...
if rows_reordering_required and len(row_positions) * base_num_cols >= min(
    all_rows * len(self.columns) * base_ratio, all_rows * base_num_cols
):
    # jump to reorder_labels
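Rewritten as a standalone predicate (a sketch derived from the example ratios in the comment above; the function name and the numbers in the asserts are made up for illustration), the rule looks like this:

def should_skip_masking(num_taken, num_rows, num_cols,
                        base_num_cols=10, base_ratio=0.2):
    """Illustrative form of the condition above: return True when jumping
    straight to the reordering stage is expected to pay off."""
    # The required fraction of taken rows grows linearly with the column count
    # and is capped at 1.0 (i.e. 'take at least as many rows as the frame has').
    required_ratio = min(num_cols * base_ratio / base_num_cols, 1.0)
    return num_taken / num_rows >= required_ratio

# With 10 columns, taking >= 20% of the rows is enough to skip masking;
# with 50 columns the taken positions must cover at least the full row count.
assert should_skip_masking(num_taken=20_000, num_rows=100_000, num_cols=10)
assert not should_skip_masking(num_taken=20_000, num_rows=100_000, num_cols=50)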
Measurements with MODIN_CPUS=44:

[image]

The reproducer from the issue (#6876) now performs like this:

[image]

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves df.take is much slower against pandas #6876
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

PERF-#6876: Skip the masking stage on 'iloc' where beneficial

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
            indexers.append(indexer)
        row_positions, col_positions = indexers

        if col_positions is None and row_positions is None:
            return self.copy()

        # quite fast check that allows skipping the sorting
        must_sort_row_pos = row_positions is not None and not np.all(
            row_positions[1:] >= row_positions[:-1]
        )
dchigarev (Collaborator, Author):
this check was copied from _maybe_reorder_labels and it indeed works pretty fast:

import numpy as np
from timeit import default_timer as timer

arr = np.random.randint(0, 100_000_000, size=100_000_000)

t1 = timer()
must_sort_col_pos = np.all(arr[1:] >= arr[:-1])
print(timer() - t1) # 0.09s

@@ -1175,18 +1185,40 @@ def _take_2d_positional(
        all_rows = None
        if self.has_materialized_index:
            all_rows = len(self.index)
-       elif self._row_lengths_cache:
+       elif self._row_lengths_cache or must_sort_row_pos:
dchigarev (Collaborator, Author):
we will have to trigger row_lengths anyway in that case

@dchigarev dchigarev marked this pull request as ready for review January 24, 2024 13:59
@dchigarev
Collaborator Author

@anmyachev I believe you're the most suitable person to review this

@anmyachev
Collaborator

anmyachev commented Jan 24, 2024

@dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

@YarShev
Collaborator

YarShev commented Jan 24, 2024

> @dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

works for me

@anmyachev
Collaborator

> @dchigarev The picture won't open (to full size) with "measurements with MODIN_CPUS=44".

> works for me

Apparently I had a bug, now it also works for me.

Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
            must_sort_row_pos
            and len(row_positions) * base_num_cols
            >= min(
                all_rows * len(self.columns) * base_ratio,
Collaborator:

Is it performance-wise safe to materialize columns here? (Are they already materialized somewhere nearby, or not?)

dchigarev (Collaborator, Author):
no, they're not materialized explicitly anywhere nearby, but this value is required to properly branch here, so I guess I have no good choices here

dchigarev (Collaborator, Author):
my assumption is that columns are more likely to be pre-computed than indices, so accessing them shouldn't always trigger computations

Collaborator:
Ok, we can try to use self.column_widths in the future if needed
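For reference, a rough sketch of that alternative (the helper name and the '_column_widths_cache' attribute check are assumptions, not Modin's confirmed API):

def num_cols_without_materializing(df):
    """Illustrative helper: derive the column count from cached partition
    widths when possible instead of materializing the column labels."""
    widths = getattr(df, "_column_widths_cache", None)
    if widths is not None:
        return sum(widths)
    return len(df.columns)  # falls back to materializing the labels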

@anmyachev anmyachev left a comment (Collaborator):
LGTM!

@anmyachev anmyachev merged commit 72de8c0 into modin-project:master Jan 24, 2024
37 checks passed
Labels: None yet
Projects: None yet

Successfully merging this pull request may close these issues: df.take is much slower against pandas (#6876)

3 participants