[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder #32
Conversation
This looks great!
My only question is about testing. While we don't want our tests to actually call out to the HF hub, maybe we can still add a test. Even if it's thin and mocks out `load_from_hf_hub` to just return the binary contents of the little arpa file checked into the tests directory, it will make sure the code actually runs and returns a kosher `LanguageModel`. I'd also be in favor of putting `huggingface_hub` in the dev requirements in `setup.py`.
As an aside, are there any other guidelines/best practices HF recommends to package developers to make sure hub integration works?
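A thin mocked test along these lines could look like the following sketch. Note that `load_from_hf_hub`, `build_lm`, and the `tests/sample.arpa` path are hypothetical stand-ins, not the real pyctcdecode names:

```python
import unittest
from unittest import mock

# Hypothetical stand-in for the hub download call; the real test would
# patch pyctcdecode's own load_from_hf_hub instead.
def load_from_hf_hub(repo_id: str) -> str:
    raise RuntimeError("tests must not hit the network")

def build_lm(repo_id: str) -> dict:
    # In the real code this would construct a LanguageModel from the
    # downloaded kenlm file; here a dict stands in for it.
    path = load_from_hf_hub(repo_id)
    return {"kenlm_path": path}

class TestHubLoading(unittest.TestCase):
    def test_load_is_mocked(self):
        # Patch the hub call to return the small .arpa file checked into
        # the tests directory instead of downloading anything.
        with mock.patch(f"{__name__}.load_from_hf_hub",
                        return_value="tests/sample.arpa"):
            lm = build_lm("kensho/dummy")
            self.assertEqual(lm["kenlm_path"], "tests/sample.arpa")
```

The patch guarantees the test stays offline while still exercising the code path that turns a downloaded file into a language model.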
Awesome! Yeah, that's a good question about testing! Actually what would be nice is to add some functionality that if
awesome, very exciting!
pyctcdecode/language_model.py (Outdated)
        "See https://pypi.org/project/huggingface-hub/ for installation."
    )

    # download and cache kenLM model
Download each file separately.
Is there no way to download the whole thing as a tarball? If not, there's the potential to download a kenlm file and a unigrams file that are incompatible with each other.
I think we could find a solution to download a tarball instead if you prefer! IMO there would be the following downsides though:
1. Users cannot verify the files online on the hub. E.g. right now one can easily check the saved configs of the language model here: https://huggingface.co/kensho/dummy_full_language_model/blob/main/attrs.json => This would not be possible if a `.tar` file is saved instead.
2. The downloaded tar file would be cached instead of each of the three files. This means that after caching, the tar file would be untarred every time `load_from_hf_hub` is called. I don't think it's possible to download a tarball, untar it, and then cache the whole directory.

Especially 2. does not give a great user experience IMO.
Also cc'ing @osanseviero here - do we have examples where whole folders are saved as a `.tar` ball on the hub?
@patrickvonplaten there might be a couple of repos that have compressed formats, but in general that's discouraged for the reasons you give.
Qq: instead of downloading each file separately, why not use `snapshot_download` and get all the repo files in a single go? This downloads multiple files and has caching.
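A hedged sketch of what the `snapshot_download` route could look like. The helper name `load_decoder_dir` is hypothetical, and the `ImportError` guard mirrors the one already quoted from `language_model.py`:

```python
# Sketch of the snapshot_download idea: fetch every file in the repo in one
# call, with caching, so the kenlm binary and the unigrams file always come
# from the same repo revision.
try:
    from huggingface_hub import snapshot_download
except ImportError:
    snapshot_download = None

def load_decoder_dir(repo_id: str) -> str:
    """Download (or reuse from cache) all files of a hub repo at once."""
    if snapshot_download is None:
        raise ImportError(
            "You need huggingface_hub to load models from the hub. "
            "See https://pypi.org/project/huggingface-hub/ for installation."
        )
    # Returns the path of a local folder containing every repo file.
    return snapshot_download(repo_id=repo_id)
```

Because all files land in one snapshot directory, the incompatible-files concern raised above goes away without resorting to a tarball.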
looking for others' thoughts here ideally
@@ -688,8 +689,6 @@ def parse_directory_contents(filepath: str) -> Dict[str, Union[str, None]]:
     contents = os.listdir(filepath)
     # filter out hidden files
     contents = [c for c in contents if not c.startswith(".") and not c.startswith("__")]
-    if len(contents) not in {1, 2}:  # always alphabet, sometimes language model
Can we relax this error here? It is already checked that:
- the `"alphabet.json"` file is present
- the `"language_model"` directory is present, which is then further down the road checked aggressively

IMO there is no "danger" in having more than those files in the directory that is to be loaded. E.g. on huggingface.co we always have a `README.md` in the folder as well - see: https://huggingface.co/kensho/beamsearch_decoder_dummy/tree/main

Would it be ok for you to remove those lines here? @mikeyshulman
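For illustration, a relaxed `parse_directory_contents` might only require `alphabet.json` and tolerate extra files such as a hub `README.md`. This is a sketch of the proposed change under those assumptions, not the merged code:

```python
import os
from typing import Dict, Union

def parse_directory_contents(filepath: str) -> Dict[str, Union[str, None]]:
    contents = os.listdir(filepath)
    # filter out hidden files
    contents = [c for c in contents if not c.startswith(".") and not c.startswith("__")]
    # Only the alphabet is mandatory; any extra files are ignored.
    if "alphabet.json" not in contents:
        raise ValueError(f"Could not find alphabet.json in {filepath}")
    lm_dir = os.path.join(filepath, "language_model")
    return {
        "alphabet": os.path.join(filepath, "alphabet.json"),
        # present only sometimes; checked more aggressively further down
        "language_model": lm_dir if os.path.isdir(lm_dir) else None,
    }
```

With the strict `len(contents) not in {1, 2}` check removed, a repo folder containing a `README.md` loads cleanly.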
I'm ok with it. Do the remaining tests pass? I think we might need to change e.g.
https://github.com/kensho-technologies/pyctcdecode/blob/main/pyctcdecode/tests/test_decoder.py#L546
I'll update the tests
this looks excellent to me. Very clean!
@gkucsko do you want to have one final look?
nice, lgtm. very clean now
@mikeyshulman @gkucsko - thanks a lot for the review! I applied the proposed changes. So it's on us now to finish this ;-) I'll ping you here again once the PR is merged!
huggingface/huggingface_hub#505 is merged and released.
Great, lgtm!
Hey PyCTCDecode team,
Update: Add `load_from_hf_hub` to BeamSearchDecoder instead

This PR is a proposal to add the possibility to load KenLM models directly from the Hugging Face hub. I've uploaded an example kenLM model under https://huggingface.co/kensho/beamsearch_decoder_dummy so that you can try out loading a beam search decoder from the hub as follows:

decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy")

Models are hosted for free on the Hugging Face hub with the goal of making it easy to share and version models. In this case, the user is not required to download the raw model manually (e.g. via `wget`), but can instead load it with the single line of code above. The loading method automatically caches the downloaded files so that the user only has to download the model once.

@mikeyshulman @gkucsko @poneill - please let me know what you think about the integration and whether anything can be improved :-)