
[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder #32

Merged

Conversation

@patrickvonplaten (Contributor) commented Nov 12, 2021

Hey PyCTCDecode team,

Update: Add load_from_hf_hub to BeamSearchDecoder instead

This PR proposes adding the ability to load KenLM models directly from the Hugging Face Hub.
I've uploaded an example KenLM model at https://huggingface.co/kensho/beamsearch_decoder_dummy so that you can try out loading a beam search decoder from the hub as follows:

from pyctcdecode import BeamSearchDecoderCTC

decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy")

Models are hosted for free on the Hugging Face Hub with the goal of making it easier to share and version models. The user is no longer required to download the raw model manually (e.g. via wget); instead, the model can be loaded in a Python script with a single line of code: decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/beamsearch_decoder_dummy"). The loading method automatically caches the downloaded files, so the user only has to download the model once.
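[Editorial note: under the hood this is a per-file download with local caching. A minimal sketch, assuming a recent huggingface_hub release with hf_hub_download; the file names below are illustrative assumptions, not the dummy repo's actual layout:

from huggingface_hub import hf_hub_download

repo_id = "kensho/beamsearch_decoder_dummy"
# Hypothetical file names, for illustration only.
for filename in ("alphabet.json", "attrs.json", "kenlm_model.arpa"):
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    print(local_path)  # a cached local path; repeat calls skip the download
]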

@mikeyshulman @gkucsko @poneill - please let me know what you think about the integration and whether anything can be improved :-)

@patrickvonplaten changed the title from "[Integration with Hugging Face] Aad load_from_hub to LanguageModel" to "[Integration with 🤗 Hugging Face] Add load_from_hub to LanguageModel" on Nov 12, 2021
@mikeyshulman (Contributor) left a comment:

This looks great!

My only question is about testing. While we don't want our tests to actually call out to hf hub, maybe we can still add a test. Even if it's thin and mocks out the load_from_hf_hub to just return the binary contents of the little arpa file checked into the tests directory, it will make sure the code actually runs and returns a kosher LanguageModel. I'd also be in favor of putting huggingface_hub in the dev requirements in setup.py.

As an aside, are there any other guidelines/best practices HF recommends to package developers to make sure hub integration works?
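[Editorial note: for illustration, such a thin mocked test might look like the following, written against the decoder-level API the PR later settled on. This is a sketch only: it assumes the loader resolves snapshot_download from huggingface_hub at call time, and the patch target and fixture path are invented, not the PR's actual test:

import unittest
from unittest import mock

from pyctcdecode import BeamSearchDecoderCTC


class TestLoadFromHub(unittest.TestCase):
    def test_load_from_hf_hub_mocked(self):
        # Patch the hub call so the test never touches the network and
        # instead returns a small decoder directory checked into tests/.
        with mock.patch(
            "huggingface_hub.snapshot_download",  # assumed patch target
            return_value="pyctcdecode/tests/sample_decoder",  # assumed fixture path
        ) as mock_download:
            decoder = BeamSearchDecoderCTC.load_from_hf_hub("kensho/dummy")
        mock_download.assert_called_once()
        self.assertIsInstance(decoder, BeamSearchDecoderCTC)
]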

@patrickvonplaten (Contributor, Author) replied:

> This looks great! My only question is about testing. [...] I'd also be in favor of putting huggingface_hub in the dev requirements in setup.py.

Awesome!

Yeah, that's a good question about testing! Actually, what would be nice is to add some functionality so that if pretrained_path is a local path, the file is loaded directly by simply passing the file name to init(). This could also be tested very easily - I'll update the PR :-)
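[Editorial note: in rough terms, that behavior could look like the following sketch against the eventual public API. load_from_dir is pyctcdecode's existing directory loader; the wrapper function itself is invented for illustration:

import os

from pyctcdecode import BeamSearchDecoderCTC


def load_decoder(pretrained_path: str) -> BeamSearchDecoderCTC:
    # An existing local directory is loaded directly, skipping the hub
    # round-trip entirely; this branch is easy to unit test offline.
    if os.path.isdir(pretrained_path):
        return BeamSearchDecoderCTC.load_from_dir(pretrained_path)
    # Anything else is treated as a hub repo id and downloaded + cached.
    return BeamSearchDecoderCTC.load_from_hf_hub(pretrained_path)
]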

@gkucsko (Contributor) left a comment:

awesome, very exciting!

Inline comment on pyctcdecode/language_model.py:

    "See https://pypi.org/project/huggingface-hub/ for installation."
)

# download and cache kenLM model

@patrickvonplaten (Contributor, Author) commented: download each file separately

A contributor replied:

Is there no way to download the whole thing as a tarball? If not, there's the potential to download a kenlm file and a unigrams file that is incompatible with it.

@patrickvonplaten (Contributor, Author) replied:

I think we could find a solution to download a tarball instead if you prefer!

IMO there would be the following downsides though:

  1. Users cannot verify the files online on the hub. E.g. right now one can easily check the saved configs of the language model at https://huggingface.co/kensho/dummy_full_language_model/blob/main/attrs.json - this would not be possible if a .tar file were saved instead.
  2. The downloaded tar file would be cached instead of each of the three individual files. This means that after caching, the tar file would be untarred every time load_from_hf_hub is called; I don't think it's possible to download a tarball, untar it, and then cache the whole directory.

Especially 2.) does not make for a great user experience IMO.

@patrickvonplaten (Contributor, Author) added:

Also cc'ing @osanseviero here - do we have examples where whole folders are saved as a tarball on the hub?

@osanseviero replied:

@patrickvonplaten there might be a couple of repos that have compressed formats, but in general that's discouraged for the reasons you give.

Quick question: instead of downloading each file separately, why not use snapshot_download and get all the repo files in a single go? It downloads multiple files and has caching built in.
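[Editorial note: for reference, a minimal sketch of that pattern, using the dummy repo from above:

from huggingface_hub import snapshot_download

# One call fetches every file in the repo; the return value is the local
# snapshot directory, and repeated calls are served from the cache.
local_dir = snapshot_download(repo_id="kensho/beamsearch_decoder_dummy")
print(local_dir)
]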

@mikeyshulman (Contributor) left a comment:

looking for others' thoughts here ideally


@patrickvonplaten changed the title from "[Integration with 🤗 Hugging Face] Add load_from_hub to LanguageModel" to "[Integration with 🤗 Hugging Face] Add load_from_hub to BeamSearchDecoder" on Nov 29, 2021
@@ -688,8 +689,6 @@ def parse_directory_contents(filepath: str) -> Dict[str, Union[str, None]]:
contents = os.listdir(filepath)
# filter out hidden files
contents = [c for c in contents if not c.startswith(".") and not c.startswith("__")]
if len(contents) not in {1, 2}: # always alphabet, sometimes language model
@patrickvonplaten (Contributor, Author) commented on the diff above:

Can we relax this check here?

It is already checked that:

  • the "alphabet.json" file is present
  • the "language_model" directory is present, and its contents are then checked aggressively further down the road

IMO there is no "danger" in having more than those files in the directory that is to be loaded. E.g. on huggingface.co we always have a README.md in the folder as well - see: https://huggingface.co/kensho/beamsearch_decoder_dummy/tree/main

Would it be ok for you to remove those lines? @mikeyshulman
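[Editorial note: concretely, the relaxed check being proposed could look something like this. A hypothetical sketch - the helper name and error message are invented, not the PR's actual code:

import os


def validate_decoder_dir(filepath: str) -> None:
    """Require the known entries rather than capping the total file
    count, so extras such as README.md are tolerated."""
    contents = os.listdir(filepath)
    if "alphabet.json" not in contents:
        raise ValueError(f"Could not find alphabet.json in {filepath}")
    # A "language_model" subdirectory is optional here; when present,
    # its contents are validated further down the road.
]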

@mikeyshulman (Contributor) replied:

I'm ok with it. Do the remaining tests pass? I think we might need to change e.g.
https://github.com/kensho-technologies/pyctcdecode/blob/main/pyctcdecode/tests/test_decoder.py#L546

@patrickvonplaten (Contributor, Author) replied:

I'll update the tests

@mikeyshulman (Contributor) left a comment:

this looks excellent to me. Very clean!
@gkucsko do you want to have one final look?


@gkucsko (Contributor) left a comment:

nice, lgtm. very clean now

@patrickvonplaten (Contributor, Author) commented:

@mikeyshulman @gkucsko - thanks a lot for the review! I applied the proposed changes.
All tests except pyctcdecode/tests/test_decoder.py::TestSerialization::test_load_from_hub_offline are now passing. That test does pass locally for me, but we'll need to wait until huggingface/huggingface_hub#505 is merged and a patch release is out.

So it's on us now to finish this ;-) I'll ping you here again once the PR is merged!

@patrickvonplaten (Contributor, Author) commented:

huggingface/huggingface_hub#505 is merged and released on pip. All tests are now passing locally. If this PR is ok for you, I think it's good to go 🚀 @mikeyshulman @gkucsko

@gkucsko (Contributor) left a comment:

Great, lgtm!
