PERF-#6876: Skip the masking stage on 'iloc' where beneficial #6878
…icial Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
indexers.append(indexer)
row_positions, col_positions = indexers

if col_positions is None and row_positions is None:
    return self.copy()

# quite fast check that allows skip sorting
must_sort_row_pos = row_positions is not None and not np.all(
    row_positions[1:] >= row_positions[:-1]
this check was copied from _maybe_reorder_labels and it indeed works pretty fast:
import numpy as np
from timeit import default_timer as timer
arr = np.random.randint(0, 100_000_000, size=100_000_000)
t1 = timer()
must_sort_col_pos = np.all(arr[1:] >= arr[:-1])
print(timer() - t1) # 0.09s
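For readers who want to see what the check decides, here is a minimal sketch in plain Python (lists instead of NumPy arrays). The helper name `must_sort` is hypothetical and is not part of Modin; it only mirrors the shape of the `must_sort_row_pos` expression from the diff, including the `None` case.

```python
from typing import Optional, Sequence


def must_sort(positions: Optional[Sequence[int]]) -> bool:
    """Hypothetical helper: True when positions are given and not non-decreasing."""
    if positions is None:
        # None stands for "take everything", so there is nothing to sort
        return False
    # Compare each element with its predecessor; this plays the role of
    # np.all(arr[1:] >= arr[:-1]) for a plain Python sequence
    is_sorted = all(b >= a for a, b in zip(positions, positions[1:]))
    return not is_sorted


print(must_sort(None))          # False
print(must_sort([0, 2, 2, 5]))  # False
print(must_sort([3, 1, 2]))     # True
```

The check is a single linear pass with no allocation beyond the comparison itself, which is consistent with it being cheap relative to an actual sort.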
@@ -1175,18 +1185,40 @@ def _take_2d_positional(
         all_rows = None
         if self.has_materialized_index:
             all_rows = len(self.index)
-        elif self._row_lengths_cache:
+        elif self._row_lengths_cache or must_sort_row_pos:
we will have to trigger row_lengths anyway in that case
@anmyachev I believe you're the most suitable person to review this
@dchigarev The picture with "measurements with MODIN_CPUS=44" won't open at full size.
Works for me
Apparently I had a bug; now it works for me too.
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
must_sort_row_pos
and len(row_positions) * base_num_cols
>= min(
    all_rows * len(self.columns) * base_ratio,
Is it safe performance-wise to materialize columns here? (Are they already materialized somewhere nearby or not?)
no, they're not materialized explicitly anywhere nearby, but this value is required to properly branch here, so I guess I have no good choices here
my assumption is that columns are more likely to be pre-computed than indices, so accessing them shouldn't always trigger computations
Ok, we can try to use self.column_widths in the future if needed
LGTM!
What do these changes do?
This PR extends the idea originally introduced in #6423.
What's the idea
The masking pipeline consists of two stages: masking, followed by reordering.
In theory, the second stage can function without the first one; the first stage is needed only to reduce the size of the input for the second stage. However, the first stage can sometimes be quite expensive, and it may be reasonable to skip it and go straight to the second one.
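To make the two-stage idea concrete, here is a minimal sketch on plain Python lists. It is not Modin's implementation: the function names `take_two_stage` and `take_reorder_only` are hypothetical, the data is a flat list rather than a partitioned frame, and positions are assumed to be non-negative. It only illustrates that the second stage alone yields the same result as both stages together.

```python
from typing import List, Sequence


def take_two_stage(data: Sequence, positions: Sequence[int]) -> List:
    """Hypothetical sketch: mask first, then reorder."""
    # Stage 1 (masking): keep only the requested rows, sorted and de-duplicated,
    # which shrinks the input that stage 2 has to work on
    sorted_unique = sorted(set(positions))
    masked = [data[p] for p in sorted_unique]
    # Stage 2 (reordering): map each requested position to its slot in `masked`
    slot = {p: i for i, p in enumerate(sorted_unique)}
    return [masked[slot[p]] for p in positions]


def take_reorder_only(data: Sequence, positions: Sequence[int]) -> List:
    """Hypothetical sketch: skip masking and reorder the full input directly."""
    return [data[p] for p in positions]


data = list("abcdef")
positions = [4, 1, 4, 0]  # unsorted and with a repeat, so reordering is needed
print(take_two_stage(data, positions))     # ['e', 'b', 'e', 'a']
print(take_reorder_only(data, positions))  # ['e', 'b', 'e', 'a']
```

When the positions are unsorted, the reordering pass has to run either way, so the only question is whether the size reduction from masking pays for its own cost.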
This PR extends the set of cases where we skip the first stage to include those where we know for sure that the reordering stage will be required. Based on the measurements below, I derived the following condition that controls this:
measurements with MODIN_CPUS=44
The reproducer from the issue (#6876) now performs like this:
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst is up-to-date