Race condition when preparing pretrained model in distributed training #44

Closed
llidev opened this issue Nov 20, 2018 · 4 comments

@llidev
Contributor

llidev commented Nov 20, 2018

Hi,

I launched two processes per node to run distributed run_classifier.py. However, I occasionally get the error below:

11/20/2018 09:31:48 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmpa25_y4es to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba

 93%|█████████▎| 381028352/407873900 [00:11<00:01, 14366075.22B/s]
 94%|█████████▍| 383812608/407873900 [00:11<00:01, 16210783.00B/s]
 95%|█████████▍| 386455552/407873900 [00:11<00:01, 16205260.89B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   creating metadata file for /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   removing temp file /tmp/tmpa25_y4es

 95%|█████████▌| 388946944/407873900 [00:11<00:01, 18097539.03B/s]11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   loading archive file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz from cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.modeling -   extracting archive file /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba to temp dir /tmp/tmpvxvnr8_1

 97%|█████████▋| 393660416/407873900 [00:11<00:00, 22199883.93B/s]
 98%|█████████▊| 399411200/407873900 [00:11<00:00, 27211860.00B/s]
 99%|█████████▉| 405128192/407873900 [00:11<00:00, 32287252.94B/s]
100%|██████████| 407873900/407873900 [00:11<00:00, 34098120.40B/s]
11/20/2018 09:31:49 - INFO - pytorch_pretrained_bert.file_utils -   copying /tmp/tmp5fcm4v8x to cache at /root/.pytorch_pretrained_bert/9c41111e2de84547a463fd39217199738d1e3deb72d4fec4399e6e241983c6f0.ae3cef932725ca7a30cdcb93fc6e09150a55e2a130ec7af63975a16c153ae2ba
Traceback (most recent call last):
  File "examples/run_classifier.py", line 629, in <module>
    main()
  File "examples/run_classifier.py", line 485, in main
    model = BertForSequenceClassification.from_pretrained(args.bert_model, len(label_list))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/site-packages/pytorch_pretrained_bert-0.1.2-py3.6.egg/pytorch_pretrained_bert/modeling.py", line 495, in from_pretrained
    archive.extractall(tempdir)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2007, in extractall
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2049, in extract
    numeric_owner=numeric_owner)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2119, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 2168, in makefile
    copyfileobj(source, target, tarinfo.size, ReadError, bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/tarfile.py", line 248, in copyfileobj
    buf = src.read(bufsize)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/azureml-envs/azureml_49b6ba977c83839baa597001c9b55a6f/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached

It looks like a race condition where two processes are simultaneously writing the model file to /root/.pytorch_pretrained_bert/.

Please advise on any workaround. Thanks!

@llidev
Contributor Author

llidev commented Nov 20, 2018

My current workaround is to set the env var PYTORCH_PRETRAINED_BERT_CACHE to a different path per process before importing pytorch_pretrained_bert. But I think the module itself should handle this properly.
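A minimal sketch of that workaround, assuming the per-process rank is exposed through a LOCAL_RANK environment variable (anything unique per process would do):

```python
import os

# Point each process at its own cache directory *before* the package is imported,
# so concurrent workers never download/extract into the same cache file.
# LOCAL_RANK is an assumption; substitute whatever rank id your launcher provides.
local_rank = os.environ.get("LOCAL_RANK", "0")
os.environ["PYTORCH_PRETRAINED_BERT_CACHE"] = "/tmp/bert_cache_rank_{}".format(local_rank)

from pytorch_pretrained_bert.modeling import BertForSequenceClassification  # noqa: E402

# Same call style as in run_classifier.py (model name, number of labels).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", 2)
```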

@thomwolf
Member

I see, thanks for the feedback. I will find a way to make that better in the next release. Not sure we need to store the models gzipped anyway, since they mostly contain a torch dump which is already compressed.

@thomwolf
Member

Ok, I've added a cache_dir option to from_pretrained on master to specify a different cache dir for a script. I will release the updated version on pip today. Thanks for the feedback.
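A sketch of how the new option could be used to give each distributed worker its own cache (the rank source here is an assumption; run_classifier.py would use args.local_rank):

```python
import os

from pytorch_pretrained_bert.modeling import BertForSequenceClassification

# One cache directory per process so no two workers race on the same archive.
local_rank = os.environ.get("LOCAL_RANK", "0")
cache_dir = "/root/.pytorch_pretrained_bert/distributed_{}".format(local_rank)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    2,                    # num_labels, positional as in run_classifier.py
    cache_dir=cache_dir,  # the new option added on master
)
```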

@llidev
Contributor Author

llidev commented Nov 27, 2018

Thanks for fixing this.

Since the way I use this repo is to add ./pytorch_pretrained_bert to PYTHONPATH, I think directly adding the following import to run_classifier.py and run_squad.py is more appropriate in my case:

from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE

which is included in my PR: #58
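As a rough sketch of that approach (the exact change is in the PR above), the imported default cache path can be suffixed with the process rank and passed through the new cache_dir option:

```python
import os

from pytorch_pretrained_bert.file_utils import PYTORCH_PRETRAINED_BERT_CACHE
from pytorch_pretrained_bert.modeling import BertForSequenceClassification

# Illustrative only: in run_classifier.py the rank would come from args.local_rank.
local_rank = os.environ.get("LOCAL_RANK", "0")
cache_dir = os.path.join(
    str(PYTORCH_PRETRAINED_BERT_CACHE), "distributed_{}".format(local_rank)
)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", 2, cache_dir=cache_dir
)
```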
