[model weights caching] model upload doesn't check model weights hash #6916

Closed
stas00 opened this issue Sep 2, 2020 · 15 comments · Fixed by #8324

stas00 (Contributor) commented Sep 2, 2020

I re-uploaded model weights via transformers-cli upload and noticed that when I tried to use them, they didn't get re-downloaded; the library continued to use the cached version.

The problem seems to come from the fact that the other uploaded files haven't changed, only the model weights.

I double checked that the md5sum of the old weights file is different from the new one.

I re-uploaded the whole folder using:

transformers-cli upload fsmt-wmt19-en-de

If I hunt down the cached files (not an easy task) and delete them, it does re-download the new version.

If I diff the original cached weights file against the updated one (which gets re-downloaded once I move the original cached file out of the way), they aren't the same:

Binary files before/d97352d9f1f96ee4c6055f203812035b4597258a837db1f4f0803a2932cc3071.53ce64c7097bfcd85418af04a21b4a897c78c8440de3af078e577727ad9de3a0 and after/d97352d9f1f96ee4c6055f203812035b4597258a837db1f4f0803a2932cc3071.53ce64c7097bfcd85418af04a21b4a897c78c8440de3af078e577727ad9de3a0 differ
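
(A minimal aside on where those long filenames come from, assuming the naming scheme transformers used at the time: a cache entry is named sha256 of the download URL, a dot, then sha256 of the ETag. The URL and ETag below are illustrative only.)

import hashlib

# Sketch of the cache naming scheme (transformers circa v3.x):
# sha256(url) + "." + sha256(etag), stored under the cache dir
# (e.g. ~/.cache/torch/transformers at the time).
def cache_filename(url: str, etag: str) -> str:
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    etag_hash = hashlib.sha256(etag.encode("utf-8")).hexdigest()
    return f"{url_hash}.{etag_hash}"

# Illustrative values, just to show the shape of the resulting filename.
print(cache_filename(
    "https://cdn.huggingface.co/stas/fsmt-wmt19-en-de/pytorch_model.bin",
    '"example-etag"',
))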

Could we please include the model weights file in the hash calculation?

Thank you.

LysandreJik (Member) commented:

I can confirm it was previously checking the model weights and re-downloading if the weights had been changed. Investigating.

LysandreJik (Member) commented:

This is due to the CDN caching files with a 24-hour delay. After 24 hours it should download your file, but if you want it now you can use the use_cdn flag and set it to False. You can see the documentation for this here.
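
(A minimal sketch of that call, reusing stas00's model name from above:)

from transformers import AutoModel

# Bypass the Cloudfront CDN and fetch directly from S3, so a freshly
# re-uploaded pytorch_model.bin is picked up right away.
model = AutoModel.from_pretrained("stas/fsmt-wmt19-en-de", use_cdn=False)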

stas00 (Contributor, Author) commented Sep 4, 2020

Thank you for the hint, @LysandreJik. So from_pretrained(mname, use_cdn=False).

But that might be tricky for end users who won't know that the code base has changed and that the model weights they get are out of sync with it.

Is there a way to signal CDN to invalidate the cache for some files? It could then be done from the upload util.

stas00 (Contributor, Author) commented Sep 4, 2020

FWIW, I wrote a one-liner to force a cache update for the 4 models I'm working on at the moment.

PYTHONPATH="src" python -c 'from transformers import AutoModel; [AutoModel.from_pretrained("stas/fsmt-wmt19-"+p, use_cdn=False) for p in ["en-ru","ru-en","en-de","de-en"]]'

I now have that in my script, so I don't need to think about it.

stas00 (Contributor, Author) commented Sep 4, 2020

@LysandreJik, unfortunately this doesn't solve the issue.

AutoModel.from_pretrained(mname, use_cdn=False)

It indeed forces a download of the recently updated model, but then if this flag is no longer used in the application, it still downloads the CDN-cached version and ends up using the wrong one.

So, basically, this results in 2 copies (different hashes) sitting in the cache dir.

And normal usage without use_cdn=False looks up the old version and not the new one (so things like run_eval.py still use the old one).

Thanks.

julien-c (Member) commented Sep 4, 2020

Can you run AutoModel.from_pretrained(mname, use_cdn=False) in a debugger and check whether the downloaded URL is a https://cdn.huggingface.co or a https://s3.amazonaws.com/models.huggingface.co URL?
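
(For reference, a minimal sketch of checking the resolved URLs without a debugger, assuming the hf_bucket_url helper in transformers.file_utils from that era:)

from transformers.file_utils import hf_bucket_url

# With the CDN the weights URL should resolve to https://cdn.huggingface.co/...,
# without it to https://s3.amazonaws.com/models.huggingface.co/...
print(hf_bucket_url("stas/fsmt-wmt19-en-de", filename="pytorch_model.bin", use_cdn=True))
print(hf_bucket_url("stas/fsmt-wmt19-en-de", filename="pytorch_model.bin", use_cdn=False))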

stas00 (Contributor, Author) commented Sep 4, 2020

I can do that, but I already checked that it downloads the updated model with use_cdn=False. But then if you run it again without use_cdn=False, it ignores the new download and uses the old model again (if I delete the cached version, running without use_cdn=False re-downloads the old CDN version).

julien-c (Member) commented Sep 4, 2020

Oh yeah, OK, I see. Can you run run_eval.py on a local folder path then?

stas00 (Contributor, Author) commented Sep 4, 2020

Can you run run_eval.py on a local folder path then?

Yes. Except others can't as they don't have my local copy.

e.g. @sshleifer wants to eval my PR #6940, but now has to wait till tomorrow for the CDN cache to expire (or hack around it).

Last night I uploaded an experimental model, which proved to be invalid. I thought I had re-downloaded it fine since it was working OK, and made a PR, except I was actually testing against the stale cached version, which happened to be a good one.

stas00 (Contributor, Author) commented Sep 6, 2020

Can we please re-open this ticket? It hasn't been resolved.

sshleifer reopened this Sep 6, 2020
julien-c (Member) commented Sep 7, 2020

Can we add a --no_cdn boolean flag to run_eval.py that would then call AutoModelForSeq2SeqLM.from_pretrained(use_cdn=False)?

In our dev workflow we mostly don't use the CDN while the files are still in flux. Cloudfront invalidation comes with its own set of issues, so it's better to view the CDN as a means to distribute permanent files. (For this reason we don't serve config.json files from Cloudfront.)
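
(A minimal sketch of what such a flag could look like; the wiring below is hypothetical, not the actual run_eval.py code:)

import argparse

from transformers import AutoModelForSeq2SeqLM

parser = argparse.ArgumentParser()
parser.add_argument("model_name")
parser.add_argument("--no_cdn", action="store_true",
                    help="fetch model weights from S3 instead of the Cloudfront CDN")
args = parser.parse_args()

# Hypothetical wiring: map --no_cdn onto the existing use_cdn kwarg.
model = AutoModelForSeq2SeqLM.from_pretrained(args.model_name, use_cdn=not args.no_cdn)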

stas00 (Contributor, Author) commented Sep 7, 2020

Can we add a --no_cdn boolean flag to run_eval.py that would then call AutoModelForSeq2SeqLM.from_pretrained(use_cdn=False)?

It could be done, but I have a feeling there will then be others.

Perhaps an alternative solution would be to introduce an env var that transparently overrides the CDN in any situation without needing to change every script? TRANSFORMERS_USE_CDN=False?
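
(Something along these lines, purely as a sketch of the proposal; TRANSFORMERS_USE_CDN and the helper below are hypothetical:)

import os

from transformers import AutoModel

# Hypothetical helper: honor a TRANSFORMERS_USE_CDN env var, defaulting to True.
def use_cdn_from_env() -> bool:
    return os.environ.get("TRANSFORMERS_USE_CDN", "true").lower() not in ("0", "false", "no")

model = AutoModel.from_pretrained("stas/fsmt-wmt19-en-de", use_cdn=use_cdn_from_env())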

In our dev workflow we mostly don't use the CDN while the files are still in flux. Cloudfront invalidation comes with its own set of issues, so it's better to view the CDN as a means to distribute permanent files. (For this reason we don't serve config.json files from Cloudfront.)

Understood!

How do you let others test the model files? By putting them on Dropbox or somewhere and sharing the link?

julien-c (Member) commented Sep 8, 2020

No, just S3 links!


stale bot commented Nov 7, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Nov 7, 2020
stas00 (Contributor, Author) commented Nov 8, 2020

#8324 should resolve this.

stas00 closed this as completed Nov 8, 2020
julien-c linked a pull request Nov 8, 2020 that will close this issue