-
Notifications
You must be signed in to change notification settings - Fork 644
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF-#6224: Use 'Map' operator to retrieve categorical codes #6230
Conversation
…odes Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
if ser.dtype != "category": | ||
ser = ser.astype("category", copy=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since now, we shouldn't do this cast as disappearance of the categorical dtype will mean that we lost valid codes and should report an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! @YarShev?
@@ -1920,7 +1920,12 @@ def tree_reduce( | |||
return self._compute_tree_reduce_metadata(axis.value, reduce_parts) | |||
|
|||
@lazy_metadata_decorator(apply_axis=None) | |||
def map(self, func: Callable, dtypes: Optional[str] = None) -> "PandasDataframe": | |||
def map( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably update the map's signature in modin/core/dataframe/base/dataframe/dataframe.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that, however, it seems that there's already a mismatch with the base map as it additionally takes an axis
parameter that is not defined in the pandas version of the dataframe
modin/modin/core/dataframe/base/dataframe/dataframe.py
Lines 90 to 96 in 930b8fb
@abstractmethod | |
def map( | |
self, | |
function: Callable, | |
axis: Optional[Union[int, Axis]] = None, | |
dtypes: Optional[str] = None, | |
) -> "ModinDataframe": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's align the signatures in a separate issue, please create one for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, #6231
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
What do these changes do?
After #6222 we guarantee that each partition has knowledge about the whole categorical table, meaning that we don't need to build the whole axis no more in order to retrieve valid categorical codes. So this PR basically replaces
.fold(axis=0, ...)
to the.map(...)
in the.cat_codes
implementation.flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
cat.codes
using map rather than full-axis approach #6224docs/development/architecture.rst
is up-to-date