-
Notifications
You must be signed in to change notification settings - Fork 651
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#7283: Introduce MinRowPartitionSize and MinColumnPartitionSize #7284
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @YarShev
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks good, only one design question and a couple of small ones left.
@@ -654,6 +654,14 @@ def get(cls) -> int: | |||
------- | |||
int | |||
""" | |||
from modin.error_message import ErrorMessage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One potential solution could be to change the class type to tuple, this would localize the partition data in one place and save us from an extra class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By doing this we force the user to specify both sizes at a time - per row and col, while the user may want to specify/adjust a single axis only. I would leave as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By changing this value, the user probably already knows the default values. How else will he start from something to select new values? There should be no problem specifying two values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user is not forced to know the default values - this is mainly for us to get ultimate performance in general. The user may know the number of rows/cols in advance and may specify fine-grained or coarse-grained partitioning. To me, specifying an additional number looks like extra burden if I don't care about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me, specifying an additional number looks like extra burden if I don't care about it.
In each particular place, depending on the context, you thought about which of the two classes to use (this is the burden of choice). Although you can transfer the selection of the desired dimension to a lower-level function, which will make the selection automatically based on the axis parameter.
Changing one of the dimensions might look like this (a bit longer, but still simple):
sizes = MinPartitionSizes.get()
MinPartitionSizes.put((new_value, sizes[1]))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep the things exposed to users as much as simple. I think very few people are going to dig into the internals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without knowing this information, it is not clear how to use these configurations, so I still believe that this is no more complicated than the current approach.
Even with a single environment variable, it's hard to imagine the user specifying a value without looking at the one we use by default. It turns out that user need to look in both situations and in this sense they are equivalent in complexity.
Also default values can be found in https://modin.readthedocs.io/en/latest/flow/modin/config.html#modin-configs-list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason by which I think we should keep both MinRow/ColumnPartitionSize is that the user doesn't have to know how to distribute data along a single axis and leave this to Modin itself. Consider a case with a narrow or wide dataframe. If the user has a narrow df, say 10 columns, he/she can set MinColumnPartitionSize to 1 and leave distribution along rows to Modin, which automatically finds a better MinRowPartitionSize if the default one doesn't suit. I stick to keep the current approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I stick to keep the current approach.
@YarShev I still think another solution would be better, but since the majority here is for your solution, please update the PR and we'll merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
min_block_size=( | ||
MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get() | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks more natural, doesn't it?
min_block_size=( | |
MinRowPartitionSize.get() if axis == 0 else MinColumnPartitionSize.get() | |
), | |
min_block_sizes=MinPartitionSizes.get(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We split a df along a single axis. I don't understand why we have to pass a tuple of row and col axis sizes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function has the value of the axis parameter; the function itself will select the desired value from the tuple using it.
…artitionSize Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Signed-off-by: Igoshev, Iaroslav <iaroslav.igoshev@intel.com>
Co-authored-by: Anatoly Myachev <anatoliimyachev@mail.com>
30f2b62
to
6767481
Compare
What do these changes do?
flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
docs/development/architecture.rst
is up-to-date