Add Huggingface Integration #916

Open
wants to merge 38 commits into
base: master

pranayasinghcsmpl
Contributor

Fixes #727

Proposed Changes

  • Added Hugging Face Hub upload and download functionality as a subcommand.
  • Added the library name, version, and git hash to the tags of Hugging Face uploads.
  • Added functionality to save a copy of config.yaml during training.

Checklist

  • CONTRIBUTING guide has been followed.
  • PR is based on the current GaNDLF master.
  • Non-breaking change (does not break existing functionality): provide as many details as possible for any breaking change.
  • Function/class source code documentation added/updated (ensure typing is used to provide type hints, including but not limited to using Optional if a variable has a pre-defined value).
  • Code has been blacked for style consistency and linting.
  • If applicable, version information has been updated in GANDLF/version.py.
  • If adding a git submodule, add to list of exceptions for black styling in pyproject.toml file.
  • Usage documentation has been updated, if appropriate.
  • Tests added or modified to cover the changes; if coverage is reduced, please give explanation.
  • If customized dependency installation is required (i.e., a separate pip install step is needed for PR to be functional), please ensure it is reflected in all the files that control the CI, namely: python-test.yml, and all docker files [1,2,3].

@pranayasinghcsmpl pranayasinghcsmpl requested a review from a team as a code owner August 14, 2024 12:03
Contributor

github-actions bot commented Aug 14, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

Collaborator

@sarthakpati sarthakpati left a comment

Before I start reviewing this in earnest, I would need at least the following 2 pieces of information to be added to the PR:

  1. Documentation: (it is absolutely fine to have a bullet point list of items that link to the main HF docs)
  2. Tests

I believe both of these were present in the previous PR.

setup.py (outdated, resolved)
@sarthakpati
Collaborator

Hi @Wauplin and @NielsRogge - this PR looks good from my end. Do you have any feedback?

@Wauplin Wauplin left a comment

Hey there 👋 Thanks for the ping!

I left a few comments from an outsider's point of view. I do think the CLI should be more opinionated (that is, have fewer options and make decisions for the user); otherwise we pretty much end up with a CLI close to what huggingface-cli upload and huggingface-cli download already do.

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
from pathlib import Path
from GANDLF.utils import get_git_hash

readme_template = """

Is this a simple copy of the model card template found here? If so, I would suggest either:

  • directly reuse the template from huggingface_hub (i.e. ModelCard.from_template(card_data) without the template_str).
  • or define your own template but in this case you should only put the relevant fields and descriptions for your library (instead of having all fields as empty)
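
The first option can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `build_tags` and `build_default_card` are hypothetical helper names, and the choice of which tags to include is an assumption based on the PR description (version and git hash). Calling `ModelCard.from_template` without `template_str` or `template_path` makes huggingface_hub fall back to its built-in template.

```python
from typing import List, Optional


def build_tags(version: Optional[str], git_hash: Optional[str]) -> List[str]:
    """Collect the non-empty provenance tags (hypothetical helper)."""
    return [tag for tag in (version, git_hash) if tag]


def build_default_card(tags: List[str]):
    """Render a model card with huggingface_hub's built-in template."""
    # Imported lazily so the tag helper works without the dependency installed.
    # Rendering a template also requires Jinja2 to be available.
    from huggingface_hub import ModelCard, ModelCardData

    card_data = ModelCardData(library_name="GaNDLF", tags=tags)
    # No template_str / template_path: the library's default template is used.
    return ModelCard.from_template(card_data)
```

The returned card can then be written next to the model files with `card.save(...)` before uploading.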

Contributor Author

Thanks @Wauplin for making me aware of this. I will definitely go through it and make the required changes as you mentioned.

Collaborator

Hi @Wauplin,

We had an internal discussion on the best way to present potential model uploaders with a specific set of required options for the model card. Thus far, we have landed on using a custom model card. The reason for keeping all the fields present is to let a user provide more information than what we require.

Here, we have put the string "REQUIRED_FOR_GANDLF" in the fields that the user must populate, and the rest have been left as they appear in the template.

In the code, we plan to add 2 checks:

  1. If "REQUIRED_FOR_GANDLF" is found, we present an error to the user saying that this field needs to be populated with appropriate information.
  2. The Repository key should always be https://github.com/mlcommons/GaNDLF.
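
A minimal sketch of what these two checks could look like. The function name `validate_model_card` is hypothetical, and the sketch assumes the Repository field appears in the card as a `**Repository:**` line followed by the URL, as in the Hugging Face model card template; the actual implementation may parse the card differently.

```python
import re
from typing import List

REQUIRED_PLACEHOLDER = "REQUIRED_FOR_GANDLF"
EXPECTED_REPOSITORY = "https://github.com/mlcommons/GaNDLF"

# Assumed field layout: "- **Repository:** <url>"
_REPOSITORY_RE = re.compile(r"\*\*Repository:\*\*\s*(\S+)")


def validate_model_card(text: str) -> List[str]:
    """Return a list of problems found in the model card text (empty if none)."""
    errors = []

    # Check 1: every required placeholder must have been replaced.
    for number, line in enumerate(text.splitlines(), start=1):
        if REQUIRED_PLACEHOLDER in line:
            errors.append(
                f"line {number}: replace {REQUIRED_PLACEHOLDER} with appropriate information"
            )

    # Check 2: the Repository key must point at the GaNDLF repository.
    match = _REPOSITORY_RE.search(text)
    if match is None:
        errors.append("Repository field is missing")
    elif match.group(1).rstrip("/") != EXPECTED_REPOSITORY:
        errors.append(f"Repository must be {EXPECTED_REPOSITORY}")

    return errors
```

Collecting all problems into a list, rather than raising on the first one, lets the CLI report every field the user still has to fill in at once.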

Thoughts?

This seems like a sensible idea to me, yes!

Collaborator

Brilliant, thanks for the confirmation! We'll get on it right away. 😄

Contributor Author

@sarthakpati @Wauplin How can we test this file for the upload functionality, given that we only have entry point tests? Do we need to specify a particular directory there?

Collaborator

Perhaps you can leverage one of the existing training tests to test the upload. I would recommend this one, since this would only upload a single model.

Ensure you put an appropriate description for it (such as "Unit testing model") to make its purpose clear to anyone viewing it. Is there a way to update an existing model, @Wauplin?

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
tags += [git_hash]

card_data = ModelCardData(library_name="GaNDLF", tags=tags)
card = ModelCard.from_template(card_data, template_str=readme_template)

See comment above about template_str

Comment on lines +244 to +259
def download_from_hub(
    repo_id: str,
    revision: Union[str, None] = None,
    cache_dir: Union[str, None] = None,
    local_dir: Union[str, None] = None,
    force_download: bool = False,
    token: Union[str, None] = None,
):
    snapshot_download(
        repo_id=repo_id,
        revision=revision,
        cache_dir=cache_dir,
        local_dir=local_dir,
        force_download=force_download,
        token=token,
    )

I am not sure this alias is really needed. I would simply call snapshot_download in other places in the code.

Collaborator

Agree.

I still think the alias is not needed and that snapshot_download could be used by default

GANDLF/cli/huggingface_hub_handler.py (outdated, resolved)
GANDLF/cli/huggingface_hub_handler.py (resolved)
GANDLF/cli/huggingface_hub_handler.py (resolved)
Co-authored-by: Lucain <lucainp@gmail.com>
@sarthakpati
Collaborator

@pranayasinghcsmpl some lint fixes (unused variables and whatnot) will be needed for this PR. Thanks for taking care of it!

Thanks for your comments and suggestions, @Wauplin!

codecov bot commented Sep 13, 2024

Codecov Report

Attention: Patch coverage is 97.05882% with 4 lines in your changes missing coverage. Please review.

Project coverage is 94.61%. Comparing base (e066e88) to head (5e8a97b).
Report is 15 commits behind head on master.

Files with missing lines                    Patch %   Lines
testing/test_full.py                        94.23%    3 Missing ⚠️
GANDLF/entrypoints/hf_hub_integration.py    96.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #916      +/-   ##
==========================================
+ Coverage   94.58%   94.61%   +0.03%     
==========================================
  Files         161      164       +3     
  Lines        9567     9701     +134     
==========================================
+ Hits         9049     9179     +130     
- Misses        518      522       +4     
Flag         Coverage Δ
unittests    94.61% <97.05%> (+0.03%) ⬆️

testing/test_full.py (outdated, resolved)
@sarthakpati
Collaborator

Support ticket generated with Codacy to explore the coverage issue.

@sarthakpati
Collaborator

Codacy folks suggested not to use coverage reporter for anything coming in from other forks 🙄

Anyway, we should be good to go from my end. @Wauplin is this PR good to merge for you?

@Wauplin Wauplin left a comment

Integration looks good yes :) I left a few comments but nothing blocking on my side.
Thanks for the iterations!

Comment on lines +93 to +102
api = HfApi(token=token)

try:
    api.create_repo(repo_id)
except Exception as e:
    print(f"Error: {e}")

api = HfApi(token=token)

repo_id = api.create_repo(repo_id, exist_ok=True).repo_id

Suggested change
- api = HfApi(token=token)
- try:
-     api.create_repo(repo_id)
- except Exception as e:
-     print(f"Error: {e}")
- api = HfApi(token=token)
- repo_id = api.create_repo(repo_id, exist_ok=True).repo_id
+ api = HfApi(token=token)
+ try:
+     repo_id = api.create_repo(repo_id).repo_id
+ except Exception as e:
+     print(f"Error: {e}")

no need to create the repo twice

GANDLF/cli/huggingface_hub_handler.py (resolved)

api.upload_folder(
    repo_id=repo_id,
    token=token,

Suggested change
-     token=token,

no need for this since already provided in HfApi
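
Taken together with the earlier note about creating the repo only once, the upload path could be sketched as below. This is an illustrative variant, not the PR's code: `push_folder` is a hypothetical helper, `api` is expected to be an `HfApi(token=token)` instance, and the sketch keeps `exist_ok=True` from the original code so that an existing repo is reused rather than reported as an error (the suggested change above instead catches and prints the exception).

```python
from typing import Any


def push_folder(api: Any, repo_id: str, folder_path: str) -> str:
    """Create the repo once (if needed) and upload a folder to it."""
    # A single create_repo call; exist_ok=True makes it safe to re-run.
    repo_id = api.create_repo(repo_id, exist_ok=True).repo_id

    # No token argument here: the HfApi instance already carries it.
    api.upload_folder(repo_id=repo_id, folder_path=folder_path)
    return repo_id


# Example usage (requires huggingface_hub and a valid token):
# from huggingface_hub import HfApi
# push_folder(HfApi(token=token), "user/my-model", "./model_dir")
```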

Comment on lines +73 to +87
@click.option(
    "--allow-patterns",
    "-ap",
    help="Uploading: If provided, only files matching at least one pattern are uploaded.",
)
@click.option(
    "--ignore-patterns",
    "-ip",
    help="Uploading: If provided, files matching any of the patterns are not uploaded.",
)
@click.option(
    "--delete-patterns",
    "-dp",
    help="Uploading: If provided, remote files matching any of the patterns will be deleted from the repo while committing new files. This is useful if you don't know which files have already been uploaded.",
)

Suggested change
- @click.option(
-     "--allow-patterns",
-     "-ap",
-     help="Uploading: If provided, only files matching at least one pattern are uploaded.",
- )
- @click.option(
-     "--ignore-patterns",
-     "-ip",
-     help="Uploading: If provided, files matching any of the patterns are not uploaded.",
- )
- @click.option(
-     "--delete-patterns",
-     "-dp",
-     help="Uploading: If provided, remote files matching any of the patterns will be deleted from the repo while committing new files. This is useful if you don't know which files have already been uploaded.",
- )

I don't think this is needed since you are in control of what needs to be uploaded, right?

Comment on lines +88 to +93
@click.option(
    "--hf-template",
    "-hft",
    help="Path to the model card template; required when uploading a model",
    type=click.Path(exists=True, file_okay=True, dir_okay=False),
)

Maybe default it to hugging_face.md to reduce friction? Users are free to provide another template if they want but having one by default should reduce friction and help grow usage.
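
One way such a default could be resolved is sketched below, under the assumption that hugging_face.md ships in a known directory of the package. The `resolve_template` helper and its `package_dir` argument are hypothetical and not part of the PR.

```python
from pathlib import Path
from typing import Optional

DEFAULT_TEMPLATE_NAME = "hugging_face.md"


def resolve_template(hf_template: Optional[str], package_dir: Path) -> Path:
    """Return the user-supplied template path, or the bundled default."""
    if hf_template:
        path = Path(hf_template)
    else:
        # Fall back to the template assumed to ship with the package.
        path = package_dir / DEFAULT_TEMPLATE_NAME
    if not path.is_file():
        raise FileNotFoundError(f"Model card template not found: {path}")
    return path
```

Resolving the default against the package directory rather than the current working directory keeps the command usable from any location; the resulting path could also be supplied as the `default=` of the `--hf-template` option.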

@@ -82,6 +82,7 @@
      "typer==0.9.0",
      "colorlog",
      "opacus==1.5.2",
+     "huggingface-hub==0.23.4",

(nit) latest is 0.25.1

Suggested change
- "huggingface-hub==0.23.4",
+ "huggingface-hub==0.25.1",

Development

Successfully merging this pull request may close these issues.

Add Hugging Face Hub integration
3 participants