
Where does the pre-trained BERT model get cached on my system by default? #2323

Closed
13Ashu opened this issue Dec 26, 2019 · 9 comments

@13Ashu

13Ashu commented Dec 26, 2019

❓ Questions & Help

I used model_class.from_pretrained('bert-base-uncased') to download and use the model. The next time I run this command, it picks up the model from the cache. But when I look in the cache, I see several files over 400M with long random names. How do I know which one is the bert-base-uncased or distilbert-base-uncased model? Maybe I am looking in the wrong place.

@shashankMadan-designEsthetics

AFAIK, the cache folder is hidden. You can download the files manually and save them to a location of your choice. The two files to download are config.json and the .bin weights file; you can then load them through from_pretrained. For example, to instantiate BERT: BertForMaskedLM.from_pretrained('/Users/<Your location>/<your folder name>')
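Building on that, here is a minimal sketch (the directory path and helper name are my own, for illustration) for checking that a local folder already holds the files from_pretrained() needs before pointing it there:

```python
import os

def missing_model_files(model_dir, required=("config.json", "pytorch_model.bin")):
    """Return which of the required files are not yet present in model_dir."""
    return [f for f in required if not os.path.exists(os.path.join(model_dir, f))]

# Hypothetical location where you saved the downloaded files.
model_dir = os.path.expanduser("~/models/bert-base-uncased")
print("missing:", missing_model_files(model_dir))

# Once nothing is missing:
# from transformers import BertForMaskedLM
# model = BertForMaskedLM.from_pretrained(model_dir)
```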

@aaugustin
Contributor

Each file in the cache comes with a .json file describing what's inside.

This isn't part of transformers' public API and may change at any time in the future.

Anyway, here's how you can locate a specific file:

$ cd ~/.cache/torch/transformers
$ grep /bert-base-uncased *.json
26bc1ad6c0ac742e9b52263248f6d0f00068293b33709fae12320c0e35ccfbbb.542ce4285a40d23a559526243235df47c5f75c197f04f37d1a0c124c32c9a084.json:{"etag": "\"64800d5d8528ce344256daf115d4965e\"", "url": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt"}
4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.bf3b9ea126d8c0001ee8a1e8b92229871d06d36d8808208cc2449280da87785c.json:{"etag": "\"74d4f96fdabdd865cbdbe905cd46c1f1\"", "url": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json"}
d667df51ec24c20190f01fb4c20a21debc4c4fc12f7e2f5441ac0a99690e3ee9.4733ec82e81d40e9cf5fd04556267d8958fb150e9339390fc64206b7e5a79c83.h5.json:{"etag": "\"41a0e56472bad33498744818c8b1ef2c-64\"", "url": "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-tf_model.h5"}

Here, bert-base-uncased-tf_model.h5 is cached as d667df51ec24c20190f01fb4c20a21debc4c4fc12f7e2f5441ac0a99690e3ee9.4733ec82e81d40e9cf5fd04556267d8958fb150e9339390fc64206b7e5a79c83.h5.
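Since each cached blob has a .json sidecar recording its source URL, the grep above can also be done in Python. A sketch, assuming the pre-v4 cache layout shown above (the function name is mine):

```python
import json
import os

def index_cache(cache_dir):
    """Map each source URL to the cache filename of its blob, using the
    <hash>.json sidecars that record {"etag": ..., "url": ...}."""
    mapping = {}
    for name in os.listdir(cache_dir):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(cache_dir, name)) as f:
            meta = json.load(f)
        url = meta.get("url")
        if url:
            mapping[url] = name[: -len(".json")]
    return mapping

# Example: list every cached file that came from a bert-base-uncased URL.
cache_dir = os.path.expanduser("~/.cache/torch/transformers")
if os.path.isdir(cache_dir):
    for url, blob in index_cache(cache_dir).items():
        if "bert-base-uncased" in url:
            print(blob, "<-", url)
```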

@aaugustin
Contributor

The discussion in #2157 could be useful too.

@Mayar2009

Hi!
What if I use Colab? Then how can I find the cache file? @aaugustin

@kaniblu

kaniblu commented May 1, 2020

For anyone who landed here wondering whether the cache directory can be changed globally: set the PYTORCH_TRANSFORMERS_CACHE environment variable in the shell before running the Python interpreter.
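This can also be done from within Python, as long as it happens before transformers is imported, since the cache location is read at import time (the directory below is an arbitrary choice for illustration):

```python
import os

# Must be set before `import transformers`; the library reads the
# cache location when the module is first imported.
os.environ["PYTORCH_TRANSFORMERS_CACHE"] = os.path.expanduser("~/my_transformers_cache")

# import transformers  # would now cache downloads under ~/my_transformers_cache
print(os.environ["PYTORCH_TRANSFORMERS_CACHE"])
```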

@attardi

attardi commented Jun 6, 2020

You can find it the same way transformers does:

from transformers.file_utils import hf_bucket_url, cached_path

pretrained_model_name = 'DeepPavlov/rubert-base-cased'
# Build the download URL for the weights file...
archive_file = hf_bucket_url(
    pretrained_model_name,
    filename='pytorch_model.bin',
    use_cdn=True,
)
# ...then resolve it to its location in the local cache (downloading if needed).
resolved_archive_file = cached_path(archive_file)

@persunde

For me, Hugging Face changed the default cache folder to:

~/.cache/huggingface/transformers
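So depending on the transformers version, the default cache may be in either place. A small sketch that checks both (the helper name is mine; it assumes no cache-related environment variable overrides the defaults):

```python
import os

DEFAULT_CACHE_DIRS = [
    "~/.cache/huggingface/transformers",  # newer default
    "~/.cache/torch/transformers",        # older default
]

def find_cache_dir(candidates=DEFAULT_CACHE_DIRS):
    """Return the first default cache directory that exists, else None."""
    for path in candidates:
        full = os.path.expanduser(path)
        if os.path.isdir(full):
            return full
    return None

print(find_cache_dir())
```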

@johntiger1

(quoting @attardi's snippet above)

Thank you, this worked for me!

Note that I had to remove the use_cdn option. Additionally, it does not seem to tell you where vocab.txt and the other files are located.

@Phobia-Cosmos
Copy link

(quoting @johntiger1's comment above)

Note that hf_bucket_url has since been removed, so this no longer works. See ImportError: cannot import name 'hf_bucket_url' from 'transformers.file_utils' #22390.
