Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add model checkpointing to push_to_hub and PushToHubCallback #14492

Merged
merged 19 commits into from
Nov 29, 2021

Conversation

Rocketknight1
Copy link
Member

No description provided.

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this to the callback!

src/transformers/file_utils.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few last comments!

"Checkpoint loading failed as no optimizer is attached to the model. "
"This is most likely caused by the model not being compiled."
)
repo = self._create_or_get_repo(repo_path_or_name, organization)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we said to use Repository here to avoid creating a new repo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this, but it resulted in some code duplication in order to make everything work - I kept it as-is but added a comment to explain instead. If you'd still prefer to avoid the call, let me know!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see what @LysandreJik thinks then.

Copy link
Member

@LysandreJik LysandreJik Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure it won't actually create a remote repository if it doesn't exist? If you're certain of it it's fine for me to go with self._create_or_get_repo but I'm quite unsure that it won't create a remote repo.

Also which method are you using, is it this method here?

def _create_or_get_repo(
cls,
repo_path_or_name: Optional[str] = None,
repo_url: Optional[str] = None,
organization: Optional[str] = None,
private: bool = None,
use_auth_token: Optional[Union[bool, str]] = None,
) -> Repository:
if repo_path_or_name is None and repo_url is None:
raise ValueError("You need to specify a `repo_path_or_name` or a `repo_url`.")
if use_auth_token is None and repo_url is None:
use_auth_token = True
if repo_path_or_name is None:
repo_path_or_name = repo_url.split("/")[-1]
if repo_url is None and not os.path.exists(repo_path_or_name):
repo_name = Path(repo_path_or_name).name
repo_url = cls._get_repo_url_from_name(
repo_name, organization=organization, private=private, use_auth_token=use_auth_token
)
# Create a working directory if it does not exist.
if not os.path.exists(repo_path_or_name):
os.makedirs(repo_path_or_name)
repo = Repository(repo_path_or_name, clone_from=repo_url, use_auth_token=use_auth_token)
repo.git_pull()
return repo

Because if so, don't call it with positional arguments like you do here as you're passing the organization parameter to repo_url.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good spot on the organization positional arg, fixed!

Also, the _create_or_get_repo method just does this, after sanity-checking the arguments:

repo = Repository(repo_path_or_name, clone_from=repo_url, use_auth_token=use_auth_token)
repo.git_pull()
return repo

So I think as long as we don't push it afterwards, we won't create a new remote repo if one doesn't exist already.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this discussion is real proof we need to just put the code used here and not try to use some magic of the Mixin. The library always prefers an explicit approach at the cost of duplicate code, and I think this is an instance where we should just apply the code needed.

The Mixin might also change in the feature, following the RFC on huggingface_hub, so those private methods might actually introduce some behavior we don't want in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the Repository(repo_path_or_name, clone_from=repo_url, use_auth_token=use_auth_token) line actually creates the repo if you try to clone it but it doesn't exist:

https://github.com/huggingface/huggingface_hub/blob/10c69146969ad7f9e1add075c1ef4ec15e42e85f/src/huggingface_hub/repository.py#L542-L549

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The push event is unrelated to the creation of the repository. No push to a remote repository can be done if the repository does not exist. The repository creation happens at the repository creation, not at the repository update (push).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm sorry, I didn't realize that's how it worked! And yes, you're right, this is probably a sign that we need to be explicit. I'll try to figure out some code that just checks without accidentally creating a new repo.

src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
src/transformers/keras_callbacks.py Outdated Show resolved Hide resolved
src/transformers/keras_callbacks.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adjusting!

src/transformers/modeling_tf_utils.py Outdated Show resolved Hide resolved
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@Rocketknight1 Rocketknight1 merged commit 6fc38ad into master Nov 29, 2021
@Rocketknight1 Rocketknight1 deleted the resume_training branch November 29, 2021 17:36
Albertobegue pushed a commit to Albertobegue/transformers that referenced this pull request Jan 27, 2022
…face#14492)

* Add checkpointing to push_to_hub and PushToHubCallback

* Add checkpoint loading

* Add missing default value

* Correct method name

* make style

* Moving everything to the right location

* make style

* Revert changes to file_utils.py

* Update src/transformers/keras_callbacks.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/keras_callbacks.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Adding docstrings and comments to clarify code

* make style

* Fix organization positional arg

* Fix load_repo_checkpoint to no longer accidentally create empty repos

* make style

* Remove unnecessary 'organization' argument in load_repo_checkpoint

* Avoid private `_create_or_get_repo` method

* make style

* Update src/transformers/modeling_tf_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants