FEAT-#7118: Add range-partitioning impl for 'df.resample()' #7140
Conversation
```
@@ -1049,11 +1058,23 @@
        PandasQueryCompiler
            New QueryCompiler containing the result of resample aggregation.
        """
        from modin.core.dataframe.pandas.dataframe.utils import ShuffleResample
```
Check notice — Code scanning / CodeQL: Cyclic import (Note): modin.core.dataframe.pandas.dataframe.utils
Why is this import not at the top of the file?
The answer is right above in the CodeQL warning :) It triggers a circular import otherwise.
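The deferred-import pattern used here can be demonstrated with a self-contained sketch (the `demo_utils`/`demo_core` module names are made up for illustration; Modin's real modules are the ones in the diff above):

```python
import sys
import types

# Two in-memory modules that depend on each other. A top-level import in
# both directions would break at load time, but deferring one side's
# import into a function body resolves the cycle: by the time the
# function is called, both modules are fully initialized.
utils = types.ModuleType("demo_utils")
core = types.ModuleType("demo_core")
sys.modules["demo_utils"] = utils
sys.modules["demo_core"] = core

exec(
    """
def resample():
    # Deferred import: demo_utils imports demo_core at module level, so
    # importing demo_utils here (at call time) is safe.
    from demo_utils import ShuffleResample
    return ShuffleResample().tag()
""",
    core.__dict__,
)

exec(
    """
import demo_core  # top-level import is fine: demo_core defers its side

class ShuffleResample:
    def tag(self):
        return "shuffle"
""",
    utils.__dict__,
)

print(sys.modules["demo_core"].resample())  # -> shuffle
```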
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
```
            df, columns_info, ascending, **kwargs
        )
        for i, pivot in enumerate(columns_info[0].pivots):
            add_attr(result[i], pivot - pandas.Timedelta(1, unit="ns"))
```
An example of why it's required. Imagine we have a time series with an hour resolution:
```
>>> sh
                       a
2018-01-01 00:00:00  0.0
2018-01-01 01:00:00  1.0
2018-01-01 02:00:00  2.0
2018-01-01 03:00:00  3.0
```
Resampling this into 30-min intervals gives this:
```
>>> expected_res = sh.resample("30min").sum()
>>> expected_res
                       a
2018-01-01 00:00:00  0.0
2018-01-01 00:30:00  0.0  <---- interpolated value
2018-01-01 01:00:00  1.0
2018-01-01 01:30:00  0.0  <---- interpolated value
2018-01-01 02:00:00  2.0
2018-01-01 02:30:00  0.0  <---- interpolated value
2018-01-01 03:00:00  3.0
```
Let's now emulate parallel execution of resample and split `sh` into two partitions:
```
>>> pd.concat([sh.iloc[:2].resample("30min").sum(), sh.iloc[2:].resample("30min").sum()])
                       a
2018-01-01 00:00:00  0.0
2018-01-01 00:30:00  0.0  <---- interpolated value
2018-01-01 01:00:00  1.0
*should be an interpolated value here, but it's missing*
2018-01-01 02:00:00  2.0
2018-01-01 02:30:00  0.0  <---- interpolated value
2018-01-01 03:00:00  3.0
```
The reason for the missing value is that the first partition only sees the interval [00:00:00, 01:00:00] and the second only sees [02:00:00, 03:00:00], so neither of them knows that the gap between 1h and 2h should be filled with a 1:30 bin.
One of the solutions is to insert a dummy timestamp value with a slight offset into each partition, so it knows the real bounds of the partition.
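The fix can be sketched end-to-end in plain pandas. The helper below injects a zero-valued dummy row just inside a partition's right bound (mirroring the `pivot - pandas.Timedelta(1, unit="ns")` line from the diff above); the partition bounds and helper name are illustrative, not Modin's actual code:

```python
import pandas as pd

# The same series as in the thread's example.
idx = pd.date_range("2018-01-01", periods=4, freq="h")
sh = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0]}, index=idx)

expected = sh.resample("30min").sum()

def resample_with_bound(part, right_bound):
    # The dummy row sits 1ns before the next partition's pivot, so it
    # falls into the last bin this partition owns and forces resample to
    # generate every bin up to the bound; its value is 0.0, so no sum
    # changes.
    dummy = pd.DataFrame(
        {"a": [0.0]}, index=[right_bound - pd.Timedelta(1, unit="ns")]
    )
    return pd.concat([part, dummy]).resample("30min").sum()

left = resample_with_bound(sh.iloc[:2], idx[2])  # knows it spans up to 02:00
right = sh.iloc[2:].resample("30min").sum()      # last partition needs no dummy
combined = pd.concat([left, right])

# The previously missing 01:30 bin is now present.
assert combined.equals(expected)
```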
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
"data": {"A": range(12), "B": range(12)}, | ||
"index": pandas.date_range("31/12/2000", periods=12, freq="h"), | ||
"data": { | ||
f"col{i}": random_state.randint(RAND_LOW, RAND_HIGH, size=NROWS) |
More data, so the test actually exercises partitioning.
```
@@ -2438,6 +2438,7 @@ def combine_and_apply(
            dtypes=new_dtypes,
        )

    @lazy_metadata_decorator(apply_axis="both")
```
It was missing before, but it is needed to function properly.
```
            resample_kwargs,
            "transform",
            arg=arg,
            allow_range_impl=False,
```
This approach doesn't work well with transform operations, so all of them are disabled.
```
@@ -122,6 +124,10 @@ class ShuffleSortFunctions(ShuffleFunctions):
        The ideal number of new partitions.
    level : list of strings or ints, or None
        Index level(s) to use as a key. Can't be specified along with `columns`.
    closed_on_right : bool, default: False
```
```diff
-    closed_on_right : bool, default: False
+    close_to_right : bool, default: False
```
?
Here I refer to the term "closed interval": `closed_on_right` means that we have to include the right bound in it.
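For intuition, here is a tiny sketch (a hypothetical helper, not Modin's actual code) of how such a flag flips which bin a value equal to a pivot falls into:

```python
import numpy as np

def pick_bin(values, pivots, closed_on_right):
    # searchsorted with side="left" counts pivots strictly less than the
    # value, so a value equal to a pivot stays in the bin whose right
    # edge is that pivot (the interval is closed on the right). With
    # side="right" the same value spills into the next bin instead.
    side = "left" if closed_on_right else "right"
    return np.searchsorted(pivots, values, side=side)

pivots = np.array([10, 20])
print(pick_bin(np.array([10]), pivots, closed_on_right=True))   # -> [0]
print(pick_bin(np.array([10]), pivots, closed_on_right=False))  # -> [1]
```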
```
@@ -1039,6 +1046,8 @@ def _resample_func(
        Modin frame. If not specified will be computed automatically.
    df_op : callable(pandas.DataFrame) -> [pandas.DataFrame, pandas.Series], optional
        Preprocessor function to apply to the passed frame before resampling.
    allow_range_impl : bool, default: True
```
```diff
-    allow_range_impl : bool, default: True
+    range_impl : bool, default: True
```
or `use_range_impl`?
`allow_range_impl=True` doesn't necessarily mean that the range-partitioning implementation will be used; it also depends on the `cfg.RangePartitioning` value and the `axis` argument. So this parameter indeed only 'allows' range-partitioning to be used, it doesn't dictate that.
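The naming argument can be summed up in a small sketch of the decision logic (a hypothetical helper, not Modin's actual code): the parameter is one of several gates, so it permits rather than selects the range implementation.

```python
def use_range_partitioning(allow_range_impl: bool,
                           range_partitioning_cfg: bool,
                           axis: int) -> bool:
    # All conditions must hold: the operation supports the range impl,
    # the cfg.RangePartitioning-style flag is enabled, and we resample
    # along the rows (axis 0).
    return allow_range_impl and range_partitioning_cfg and axis == 0

# Allowing the impl is not enough on its own: the config flag is off here.
print(use_range_partitioning(True, False, 0))  # -> False
print(use_range_partitioning(True, True, 0))   # -> True
```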
What do these changes do?

Adds a range-partitioning implementation for `df.resample()`. The new implementation doesn't always work better, so it is only enabled when the flag is specified (script to measure).

- [x] passes `flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] passes `black --check modin/ asv_bench/benchmarks scripts/doc_checker.py`
- [x] signed commit with `git commit -s`
- [x] Resolves #7118
- [x] `docs/development/architecture.rst` is up-to-date