Skip to content

Conversation

mattdangerw
Copy link
Member

This drafts the dtensor API we want to add for the OPT backbone and task models.

Alternately, we could start adding these to the base Backbone and Task methods, just keeping it simple with just the OPT classes for now.

@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw mattdangerw requested review from qlzh727 and jbischof April 24, 2023 16:37
@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw
Copy link
Member Author

These failures look like some upstream issues with dtensor (repro). Checking it out with dtensor folks now!

@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw
Copy link
Member Author

/gcbrun

Copy link
Contributor

@jbischof jbischof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!


@classmethod
def create_layout_map(cls, mesh):
"""Create a DTensor layout map for an OPTCasualLM.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could consider parameterizing the docstrings if there's a lot of copypasta going forward

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I wasn't taking on the "base version" of this quite yet, but it would be very valid to. Probably there is very little changes that need to made here per backbone/task.

@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw
Copy link
Member Author

/gcbrun

This drafts the dtensor API we want to add for the OPT backbone
and task models.

Alternately, we could start adding these to the base Backbone and
Task methods, just keeping it simple with just the OPT classes for now.
@mattdangerw
Copy link
Member Author

/gcbrun

@mattdangerw mattdangerw marked this pull request as ready for review May 2, 2023 19:15
@mattdangerw
Copy link
Member Author

/gcbrun

Copy link
Contributor

@jbischof jbischof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Do we have any test runs on GCP using this code?

@mattdangerw
Copy link
Member Author

Thank you! Do we have any test runs on GCP using this code?

No great automated test coverage yet. The best way to test this would be an actual multi-worker setup, which would take some playing around with resources. We could also probably fake a multi device setup on a single CPU, but even then we would probably need to run this in a separate process from our main test.

Let's do that as a follow up, as it is fairly involved.

@mattdangerw mattdangerw merged commit 78ef245 into keras-team:master May 3, 2023
@jbischof
Copy link
Contributor

jbischof commented May 3, 2023

Sorry I was actually just asking if we've ever tried a manual run with this code as a sanity check.

chenmoneygithub pushed a commit that referenced this pull request May 4, 2023
* Add DTensor layout map class method for OPT

This drafts the dtensor API we want to add for the OPT backbone
and task models.

Alternately, we could start adding these to the base Backbone and
Task methods, just keeping it simple with just the OPT classes for now.

* Workaround for DTensor failures

* Copy edits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants