New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REFACTOR-#2642: Refactor partition API #2643
REFACTOR-#2642: Refactor partition API #2643
Conversation
5616cdf
to
af9af8a
Compare
Codecov Report
@@ Coverage Diff @@
## master #2643 +/- ##
==========================================
- Coverage 82.85% 80.91% -1.95%
==========================================
Files 127 127
Lines 14212 14275 +63
==========================================
- Hits 11776 11550 -226
- Misses 2436 2725 +289
Continue to review full report at Codecov.
|
af9af8a
to
7f29166
Compare
730ea02
to
8282dd1
Compare
I would merge #2606 firstly, and then this PR. @devin-petersohn , @vnlitvinov , what do you think? |
They seem to overlap, so we should merge in most logical order. I personally am fine with any one you as author thinks reasonable :) |
So, let's consider #2606 to merge firstly. |
Are you planning to run CI tests in two modes now, one with disabled API and one with enabled? I see that currently blocks of code with API enabled are not tested at all. |
Yes, I am going to do it after we merge #2606 . |
@devin-petersohn I've merged the PR closing #2606 |
@devin-petersohn , @vnlitvinov , I will ping you after I make necessary changes for this PR. |
38fe1a5
to
9008ed3
Compare
I see, perhaps it would be better to provide an |
That doesn't make sense because several partitions can fit to one ip but have different places in a dataframe. |
Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>
9bcfe55
to
a2149be
Compare
…issue2642 Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>
d74ce89
to
6f7b216
Compare
@devin-petersohn , the changes were reverted. Please, look at them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add changes to Modin XGBoost related using unwrap_partiitons
function. Current implementation uses [(ip_ObjectRef, part_ObjectRef),..]
output style of this function. Also, could you please give performance numbers of getting IP this implementation VS previous implementation with materialization of IP's on user-side.
The requested changes are no longer valid.
This pr looks really good! It was both a good catch on finding that subtle ray issue that originally led to this pr and everyone has wonderful ideas on clear, user-friendly, performant API for partitions! However, I think this PR has slowly grown beyond the scope of its initial and intended purpose of fixing the bug described in #2642. For the sake of being able to make progress on the PR and make it easier to review, it would be helpful if we could split this into multiple PRs? It looks like that there is a bug fix for #2642, refactoring of the existing partitions API, and addition of a new partitions API. I'd be happy to clear out time in my schedule to help push the PR in, but we might be able to get through with these issues faster and move on to the bigger problems like the ones discussed here. |
@williamma12 , thanks for your attention to this PR! We would like to merge it as soon as possible too. I agree the PR has grown beyond the scope of its initial and intended purpose described in #2642. The goal was to add a doc page containing a note regarding ray in-process store issue (it is that what was added in #2721) and partition API declaring and also make some refactor. I suggest removing a new API (
from time import time
import ray
import modin.pandas as pd
from modin.distributed.dataframe.pandas import unwrap_partitions
def time_unwrap_partitions(df):
partitions = df._query_compiler._modin_frame._partitions
partitions = [part.oid for row in partitions for part in row]
while partitions:
_, partitions = ray.wait(partitions)
common_time = 0.0
n = 10
for i in range(n):
if i == 0:
partitions = unwrap_partitions(df, axis=0, get_ip=True)
else:
t0 = time()
partitions = unwrap_partitions(df, axis=0, get_ip=True)
common_time += (time() - t0)
common_time /= (n - 1)
print('time_unwrap_partitions =', common_time)
_partitions = [part[1] for part in partitions]
while _partitions:
_, _partitions = ray.wait(_partitions)
return partitions
def time_getting_ip_master(partitions):
common_time = 0.0
n = 10
for i in range(n):
if i == 0:
ips = ray.get([part[0] for part in partitions])
# ips = [ray.get(part[0]) for part in partitions]
else:
t0 = time()
ips = ray.get([part[0] for part in partitions])
# ips = [ray.get(part[0]) for part in partitions]
common_time += (time() - t0)
common_time /= (n - 1)
print('time_getting_ip_master =', common_time)
return ips
def time_getting_ip_pr2643(partitions):
common_time = 0.0
n = 10
for i in range(n):
if i == 0:
ips = [part[0] for part in partitions]
else:
t0 = time()
ips = [part[0] for part in partitions]
common_time += (time() - t0)
common_time /= (n - 1)
print('time_getting_ip_pr2643 =', common_time)
return ips
df = pd.read_csv("/my/file")
print('df.shape =', df.shape)
df.shape = (2260668, 145)
partitions = time_unwrap_partitions(df)
# ips = time_getting_ip_master(partitions)
ips = time_getting_ip_pr2643(partitions) As we can see, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@YarShev That sounds like a great plan! I can see arguments for having us or the user to materialize since this is a developer API. If we decide on only having one, we can always add in the other later if people want it.
This is a great refactor! It is a lot more readable! I just have some questions about a few things (I think they are staying in this PR)
@williamma12 , I fully agree with you. So, we continue using a list of tuple of ObjRefs as a return value from |
Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>
5f362ea
to
a17dded
Compare
@williamma12 , just a friendly reminder to review. |
Signed-off-by: Igoshev, Yaroslav <yaroslav.igoshev@intel.com>
@devin-petersohn , @williamma12 , let's converge to merge the PR because it has been hanging for over a month. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good!
However, I was wrong and I think for future PRs it might have been better if we did the refactor and docs separately. That way, the PRs would be smaller and more closely follow the one PR, one change rule of thumb, which will make things easier to review (aka faster to get PRs in). But this is already basically done so I don’t think its worth the extra effort of separating it into PRs after merging them into one.
Regardless, this is a great pr that improves performance, actually document the partitions API, and reduces technical debt!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @YarShev, glad we can converge!
Signed-off-by: Igoshev, Yaroslav yaroslav.igoshev@intel.com
What do these changes do?
flake8 modin
black --check modin
git commit -s