
[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder #32

Merged

Conversation

@patrickvonplaten (Contributor) commented Nov 12, 2021

Hey PyCTCDecode team,

Update: Add load_from_hf_hub to BeamSearchDecoder instead

This PR proposes adding the ability to load KenLM models directly from the Hugging Face Hub.
I've uploaded an example KenLM model at https://huggingface.co/kensho/beamsearch_decoder_dummy so that you can try out loading a beam search decoder from the hub as follows:

from pyctcdecode import BeamSearchDecoderCTC

decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy")

Models are hosted for free on the Hugging Face Hub with the goal of making it easier to share and version models. The user is no longer required to download the raw model manually (e.g. via wget); instead, the model can be loaded in a Python script with a single line of code: decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy"). The loading method automatically caches the downloaded files, so the user only has to download the model once.
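[Editorial note: under the hood this is a per-file download with local caching. A minimal sketch, assuming a recent huggingface_hub release with hf_hub_download; the file names below are illustrative assumptions, not the dummy repo's actual layout:

from huggingface_hub import hf_hub_download

repo_id = "kensho/beamsearch_decoder_dummy"
# Hypothetical file names, for illustration only.
for filename in ("alphabet.json", "attrs.json", "kenlm_model.arpa"):
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    print(local_path)  # a cached local path; repeat calls skip the download
]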

@mikeyshulman @gkucsko @poneill - please let me know what you think about the integration and whether anything can be improved :-)

@patrickvonplaten changed the title from "[Integration with Hugging Face] Aad load_from_hub to LanguageModel" to "[Integration with 🤗 Hugging Face] Add load_from_hub to LanguageModel" on Nov 12, 2021
@mikeyshulman (Contributor) left a comment:

This looks great!

My only question is about testing. While we don't want our tests to actually call out to hf hub, maybe we can still add a test. Even if it's thin and mocks out the load_from_hf_hub to just return the binary contents of the little arpa file checked into the tests directory, it will make sure the code actually runs and returns a kosher LanguageModel. I'd also be in favor of putting huggingface_hub in the dev requirements in setup.py.

As an aside, are there any other guidelines/best practices HF recommends to package developers to make sure hub integration works?
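[Editorial note: for illustration, such a thin mocked test might look like the following, written against the decoder-level API the PR later settled on. This is a sketch only: it assumes the loader resolves snapshot_download from huggingface_hub at call time, and the patch target and fixture path are invented, not the PR's actual test:

import unittest
from unittest import mock

from pyctcdecode import BeamSearchDecoderCTC


class TestLoadFromHub(unittest.TestCase):
    def test_load_from_hf_hub_mocked(self):
        # Patch the hub call so the test never touches the network and
        # instead returns a small decoder directory checked into tests/.
        with mock.patch(
            "huggingface_hub.snapshot_download",  # assumed patch target
            return_value="pyctcdecode/tests/sample_decoder",  # assumed fixture path
        ) as mock_download:
            decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/dummy")
        mock_download.assert_called_once()
        self.assertIsInstance(decoder, BeamSearchDecoderCTC)
]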

@patrickvonplaten (Contributor, Author) replied:

> This looks great! My only question is about testing. [...] I'd also be in favor of putting huggingface_hub in the dev requirements in setup.py.

Awesome!

Yeah, that's a good question about testing! Actually, what would be nice is to add some functionality so that if pretrained_path is a local path, the file is loaded directly by simply passing the file name to init(). This could also be tested very easily - I'll update the PR :-)
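[Editorial note: in rough terms, that behavior could look like the following sketch against the eventual public API. load_from_dir is pyctcdecode's existing directory loader; the wrapper function itself is invented for illustration:

import os

from pyctcdecode import BeamSearchDecoderCTC


def load_decoder(pretrained_path: str) -> BeamSearchDecoderCTC:
    # An existing local directory is loaded directly, skipping the hub
    # round-trip entirely; this branch is easy to unit test offline.
    if os.path.isdir(pretrained_path):
        return BeamSearchDecoderCTC.load_from_dir(pretrained_path)
    # Anything else is treated as a hub repo id and downloaded + cached.
    return BeamSearchDecoderCTC.load_from_hf_hub(pretrained_path)
]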

@gkucsko (Contributor) left a comment:

awesome, very exciting!

Inline comment on pyctcdecode/language_model.py:

    "See https://pypi.org/project/huggingface-hub/ for installation."
)

# download and cache kenLM model

@patrickvonplaten (Contributor, Author) commented: download each file separately

A contributor replied:

Is there no way to download the whole thing as a tarball? If not, there's the potential to download a kenlm file and a unigrams file that is incompatible with it.

@patrickvonplaten (Contributor, Author) replied:

I think we could find a solution to download a tarball instead if you prefer!

IMO there would be the following downsides though:

  1. Users cannot verify the files online on the hub. E.g. right now one can easily check the saved configs of the language model at https://huggingface.co/kensho/dummy_full_language_model/blob/main/attrs.json - this would not be possible if a .tar file were saved instead.
  2. The downloaded tar file would be cached instead of each of the three individual files. This means that after caching, the tar file would be untarred every time load_from_hf_hub is called; I don't think it's possible to download a tarball, untar it, and then cache the whole directory.

Especially 2.) does not make for a great user experience IMO.

@patrickvonplaten (Contributor, Author) added:

Also cc'ing @osanseviero here - do we have examples where whole folders are saved as a tarball on the hub?

@osanseviero replied:

@patrickvonplaten there might be a couple of repos that have compressed formats, but in general that's discouraged for the reasons you give.

Quick question: instead of downloading each file separately, why not use snapshot_download and get all the repo files in a single go? It downloads multiple files and has caching built in.
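[Editorial note: for reference, a minimal sketch of that pattern, using the dummy repo from above:

from huggingface_hub import snapshot_download

# One call fetches every file in the repo; the return value is the local
# snapshot directory, and repeated calls are served from the cache.
local_dir = snapshot_download(repo_id="kensho/beamsearch_decoder_dummy")
print(local_dir)
]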

@mikeyshulman (Contributor) left a comment:

looking for others' thoughts here ideally


@patrickvonplaten changed the title from "[Integration with 🤗 Hugging Face] Add load_from_hub to LanguageModel" to "[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder" on Nov 29, 2021
@@ -688,8 +689,6 @@ def parse_directory_contents(filepath: str) -> Dict[str, Union[str, None]]:
contents = os.listdir(filepath)
# filter out hidden files
contents = [c for c in contents if not c.startswith(".") and not c.startswith("__")]
if len(contents) not in {1, 2}: # always alphabet, sometimes language model
@patrickvonplaten (Contributor, Author) commented on the diff above:

Can we relax this check here?

It is already checked that:

  • the "alphabet.json" file is present
  • the "language_model" directory is present, and its contents are then checked aggressively further down the road

IMO there is no "danger" in having more than those files in the directory that is to be loaded. E.g. on huggingface.co we always have a README.md in the folder as well - see: https://huggingface.co/kensho/beamsearch_decoder_dummy/tree/main

Would it be ok for you to remove those lines? @mikeyshulman
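[Editorial note: concretely, the relaxed check being proposed could look something like this. A hypothetical sketch - the helper name and error message are invented, not the PR's actual code:

import os


def validate_decoder_dir(filepath: str) -> None:
    """Require the known entries rather than capping the total file
    count, so extras such as README.md are tolerated."""
    contents = os.listdir(filepath)
    if "alphabet.json" not in contents:
        raise ValueError(f"Could not find alphabet.json in {filepath}")
    # A "language_model" subdirectory is optional here; when present,
    # its contents are validated further down the road.
]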

@mikeyshulman (Contributor) replied:

I'm ok with it. Do the remaining tests pass? I think we might need to change e.g.
https://github.com/kensho-technologies/pyctcdecode/blob/main/pyctcdecode/tests/test_decoder.py#L546

@patrickvonplaten (Contributor, Author) replied:

I'll update the tests

@mikeyshulman (Contributor) left a comment:

this looks excellent to me. Very clean!
@gkucsko do you want to have one final look?


@gkucsko (Contributor) left a comment:

nice, lgtm. very clean now

@patrickvonplaten (Contributor, Author) commented:

@mikeyshulman @gkucsko - thanks a lot for the review! I applied the proposed changes.
All tests except pyctcdecode/tests/test_decoder.py::TestSerialization::test_load_from_hub_offline are now passing. That test does pass locally for me, but we'll need to wait until huggingface/huggingface_hub#505 is merged and a patch release is out.

So it's on us now to finish this ;-) I'll ping you here again once the PR is merged!

@patrickvonplaten (Contributor, Author) commented:

huggingface/huggingface_hub#505 is merged and released on pip. All tests are now passing locally. If this PR is ok for you, I think it's good to go 🚀 @mikeyshulman @gkucsko

@gkucsko (Contributor) left a comment:

Great, lgtm!
