Reduce amount of remote calls for square-like dataframes #5394
Comments
I think the first approach ("complex") is clear and reasonable. I wonder why the "easy" approach is faster than the "complex" one in your measurements?
@dchigarev could you please elaborate on what exactly "rebalance" is in the measurements above?
I guess it's because of the "rebalancing" implementation. The whole point of both approaches is to reduce the number of partitions the map function is applied to down to NCORES. The "complex" approach does so by making each axis have sqrt(NCORES) splits, so the frame ends up with NCORES partitions in total. On the other hand, the "easy" approach doesn't even require repartitioning, it just builds column partitions and applies the function to them.

It appears that in a competition of "repartition along rows + repartition along cols + apply a function to NCORES partitions" vs "build column partitions + apply a function to NCORES partitions" the second approach wins. Again, we could probably advance the repartitioning logic for the first approach somehow, but I really doubt we'd make it faster than the second one, simply because the first one requires manipulations along both axes while the second one touches only a single axis.
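To make the call-count difference concrete, here is an illustrative sketch in plain Ray, with NumPy blocks standing in for partitions (this is not Modin's actual partition-manager code): the per-partition scheme submits NCORES^2 remote tasks, while grouping the grid into column partitions first submits only NCORES.

```python
# Illustrative only: a toy NCORES x NCORES grid of NumPy blocks in place of
# Modin's partition grid; the point is the number of remote calls, not the API.
import numpy as np
import ray

ray.init(ignore_reinit_error=True)

NCORES = 4  # hypothetical core count for the example
grid = [[np.random.rand(100, 100) for _ in range(NCORES)] for _ in range(NCORES)]

@ray.remote
def map_block(block):
    # the Map kernel applied to a single partition
    return np.abs(block)

@ray.remote
def map_column(*blocks):
    # the same kernel applied to a whole column of partitions in one call
    return [np.abs(b) for b in blocks]

# per-partition ("square") scheme: NCORES**2 remote calls
square_refs = [map_block.remote(block) for row in grid for block in row]

# "easy" scheme: group blocks into column partitions first, then NCORES calls
columns = [[grid[i][j] for i in range(NCORES)] for j in range(NCORES)]
easy_refs = [map_column.remote(*col) for col in columns]

print(len(square_refs), "vs", len(easy_refs), "remote calls")
ray.get(square_refs), ray.get(easy_refs)
```

Both variants compute the same result; only the task granularity differs.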
"Rebalance" means repartitioning so the frame would have fewer total amount of partitions. An exact way of doing so is implementation details I would say. In the measurements above I used the |
I'm guessing "easy" is faster because it runs the map directly over column partitions without a separate repartitioning step. The "complex" solution could be running in sqrt(NCORES) jobs and avoid putting intermediate partitions into the object store.
That said, I propose to measure distributed workloads (on a cluster) as well, as they may behave very differently due to worker-driver exchange being much more expensive.
Ah, I see why the "easy" approach is faster. I originally thought that you proposed to first do the repartitioning in one remote call and then perform the "map" in another remote call. The "complex" approach would complicate a virtual partition because it would consist of both a row part and a col part.
Yes, I believe the "complex" solution could be running in sqrt(NCORES) jobs, and we could avoid the need to put intermediate partitions in the object store. However, this would complicate the virtual partition implementation. Also, I am not sure which is better in terms of performance. I think we could move forward with the "easy" approach and run the "map" either row-wise or col-wise depending on the number of partitions along the axes.
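A hypothetical sketch of that axis choice (not an existing Modin API): build virtual partitions along whichever direction yields a remote-call count closest to the number of workers.

```python
def pick_virtual_partition_kind(num_row_parts: int, num_col_parts: int, ncores: int) -> str:
    """Pick the direction whose virtual-partition count (and hence the
    number of remote calls) is closest to the number of workers.

    Full-column partitions: one per column of the grid -> num_col_parts calls.
    Full-row partitions: one per row of the grid -> num_row_parts calls.
    """
    if abs(num_col_parts - ncores) <= abs(num_row_parts - ncores):
        return "columns"
    return "rows"


# a square 112 x 112 grid on 112 workers: either direction gives 112 calls
print(pick_virtual_partition_kind(112, 112, ncores=112))  # -> "columns"
# a tall 112 x 4 grid: prefer row partitions to keep all 112 workers busy
print(pick_virtual_partition_kind(112, 4, ncores=112))    # -> "rows"
```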
From my point of view, if the "easy" approach does not exhibit a slowdown on a real cluster (i.e. in a non-single-node scenario), then let's implement it, as it should bring only benefits. We can always build a more advanced solution later on.
Reopening since #7136 covered the Map operator only.
What's the problem?
There are certain scenarios where some patterns of distributed execution work better than others for different types of datasets, even when they implement the same function. Consider this example from Modin's optimization notes:
When we apply a Map/MapReduce operator to a frame with a square (MxM) or close-to-square shape, we end up with NUM_CPUS^2 partitions for this frame and so NUM_CPUS^2 remote calls for this Map function. It's known that such a number of remote calls per core results in really poor performance with Ray (and probably Dask):
(collapsed "How to run this" snippet with timing results; the measured operation included `.abs()` and `.sum()`)

P.S. All of the measurements were conducted on a 2x 28-core (56-thread) Xeon Platinum 8276L @ 2.20 GHz with `MODIN_NPARTITIONS=MODIN_NCPUS=112` being set.
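Since the collapsed snippet is not recoverable, here is a minimal reproduction sketch under the assumption that the measured operation was `abs().sum()` on the square frame used in the issue; the 5000x5000 shape and the NPARTITIONS value come from the issue text, the rest is hypothetical.

```python
# Hypothetical reproduction script; the exact benchmarked code from the
# collapsed snippet is unknown, abs().sum() is assumed from the fragments above.
import os

os.environ["MODIN_NPARTITIONS"] = "112"  # must be set before importing Modin

from timeit import default_timer as timer

import numpy as np
import modin.pandas as pd

df = pd.DataFrame(np.random.rand(5000, 5000))

start = timer()
result = df.abs().sum()
result.to_numpy()  # pull the result to make sure the computation has finished
print(f"abs().sum() took {timer() - start:.2f}s")
```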
What to do?
There are several options for how we can resolve this:

- The "complex" approach: rebalance (repartition) the frame so it ends up with NCORES partitions in total before applying the map.
- The "easy" approach: keep the partitioning as is and apply the map to column (virtual) partitions, so the number of remote calls equals the number of partitions along a single axis.
I've made a draft implementation for each of these approaches and made measurements with the same 5000x5000 dataset ('Rebalanced partitioning' also includes the time for the actual rebalancing):

(the measurement results and the collapsed "How to run this" snippet are not shown here)
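Back-of-the-envelope partition math for that 5000x5000 frame, assuming the default scheme splits both axes into NPARTITIONS chunks as described above (illustrative numbers, not the actual measurements):

```python
nparts = 112                       # MODIN_NPARTITIONS from the setup above
nrows = ncols = 5000               # benchmark frame shape

# default square partitioning: an NPARTITIONS x NPARTITIONS grid
total_parts = nparts * nparts      # 12544 partitions -> 12544 remote calls per Map
block_shape = (nrows // nparts, ncols // nparts)  # roughly 44 x 44 cells per partition

# "easy" approach: one remote call per column partition
easy_calls = nparts                # 112 remote calls per Map

print(total_parts, block_shape, easy_calls)
```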
Why can't we just change the default partitioning scheme so square-like frames would not have NCORES^2 partitions by default?
Well, Map and MapReduce are not the only operators in Modin's DataFrame algebra. Full-axis functions are still heavily used, and they would be hurt by decreasing the total number of partitions.
I would like to hear more opinions/suggestions from @modin-project/modin-core on what we should do with this.