Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX-#7072: Replace MaterializationHook with the materialized object on serialization. #7075

Merged
merged 5 commits into from
Mar 14, 2024

Conversation

AndreyPavlenko
Copy link
Collaborator

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves BUG: The H&M benchmark fails on Ray #7072
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

-------
tuple
"""
return _hook_deserializer, (RayWrapper.materialize(self),)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a test for that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

return _hook_deserializer, (RayWrapper.materialize(self),)


def _hook_deserializer(obj):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this function a staticmethod or a classmethod. This way we can move it closer to the place of use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When moved inside the class, the serialized data requires 20 bytes more, because the class name is also serialized.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved, renamed to _get() and added check - if the object is int, this type is used instead because it has a 3x smaller serialized form.

@@ -307,6 +307,31 @@ def post_materialize(self, materialized):
"""
raise NotImplementedError()

def __reduce__(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this class intended to work in remote kernels?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's used for lazy materialization on the host process only.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this class is used in remote functions because we ues it in MetaList, which we use in remote functions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which context do we use MetaList in a remote function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems some functions pass the non-materialized partition lengths to remote functions.

YarShev
YarShev previously approved these changes Mar 13, 2024
@YarShev
Copy link
Collaborator

YarShev commented Mar 13, 2024

@AndreyPavlenko, did you verify that hm and plasticc passed?

@AndreyPavlenko
Copy link
Collaborator Author

@AndreyPavlenko, did you verify that hm and plasticc passed?

I've verified hm.

@YarShev
Copy link
Collaborator

YarShev commented Mar 13, 2024

I checked plasticc, it also works.

@anmyachev , @dchigarev, any comments?

tuple
"""
data = RayWrapper.materialize(self)
return int if isinstance(data, int) else self._get, (data,)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of int, we are losing data. A bit strange.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? In case of int, tuple(int, tuple(data)) is returned.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood the comma

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment why you made a separate branch for int.

-------
object
"""
return obj
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, all the implementations (MetaListHook and SlicerHook) return int, thus, this method is never used.
Probably, it makes sense to remove this method at all and add assert isinstance(data, int) or raise NotImplementedError() to __reduce__(). If a new implementation, that returns non-int, appears, the assertion will fail. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for me

tuple
"""
data = RayWrapper.materialize(self)
return int if isinstance(data, int) else self._get, (data,)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused by this line, let's make it more obvious

Suggested change
return int if isinstance(data, int) else self._get, (data,)
reconstructor = int if isinstance(data, int) else self._get
return reconstructor, (data,)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, can it simply be?

def __reduce__(self):
    return (lambda x: x), (data,)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use len(pickle.dumps(hook)) to calculate the length of the serialized data. In case of int this is 37, self._get - 96, lambda x: x - Can't pickle local object. Anyway, creating and serializing a new object each time when we need to serialize a single int - this is a redundant overhead. I'm not familiar with the ray serializer implementation, but, if it's smart enough, it should not serialize the same object multiple times. For example, when serializing 100 hook objects, in case of a static function, this function + 100 tuples will be serialized, but not 100 lambdas + 100 tuples.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this method. Only ints are currently supported.

@dchigarev dchigarev merged commit 528fd7b into modin-project:master Mar 14, 2024
45 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: The H&M benchmark fails on Ray
4 participants