Sync preprocesses before loading the processor at run_speech_recognition_ctc.py (huggingface#21926)
sgugger merged 4 commits into huggingface:main
Conversation
Make sure all processes wait until data is saved before loading the processor from the output_dir

The documentation is not available anymore as the PR was closed or merged.
sanchit-gandhi
left a comment
Thanks @mpenagar, great catch! Would you also mind updating the seq2seq ASR fine-tuning script too?
It's exactly the same update that's required!

Updated the seq2seq ASR fine-tuning script. I'm not very good with GitHub; I guess there is no need to do a new PR.
Wait... there is something I don't get correctly. As far as I understand from the documentation, any code inside a `main_process_first()` block should be executed by the main process only. But in my experience, the code is executed by all the processes, not just the main one. Take this minimal `example.py`:

```python
from transformers import TrainingArguments, HfArgumentParser
from transformers.trainer_utils import is_main_process

def main():
    parser = HfArgumentParser((TrainingArguments,))
    training_args, = parser.parse_args_into_dataclasses()
    rank = training_args.local_rank
    main_process = is_main_process(rank)
    print(f'\nBEFORE WITH - local_rank={rank} is_main_process={main_process}')
    with training_args.main_process_first():
        print(f'\nINSIDE WITH - local_rank={rank}')

if __name__ == "__main__":
    main()
```

If I execute it on a 4 GPU node, the synching is working, but all processes execute the "INSIDE" print. What am I getting wrong?
No, as the name indicates, it executes the code in the context manager on the main process first, and then on all the others. The code is indeed executed by all processes, just in a certain order. Since with 🤗 Datasets everything is cached, executing the preprocessing inside that context manager means that process 0 will do the preprocessing, and then all the others will load the result from the cache without needing to redo it.
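This ordering can be illustrated without a GPU cluster. Below is a stdlib-only toy sketch (the real `main_process_first` in `transformers` uses `torch.distributed` barriers; threads and a `threading.Barrier` merely stand in for ranks here) showing that every "process" runs the block, but rank 0 runs it first and the rest then find the cached result:

```python
import contextlib
import threading

NUM_RANKS = 4
barrier = threading.Barrier(NUM_RANKS)
cache = {}        # stands in for the 🤗 Datasets on-disk cache
results = []      # (rank, what-it-did), in execution order

@contextlib.contextmanager
def main_process_first(rank):
    # Toy version of the semantics: non-main ranks block until rank 0
    # has finished the body; rank 0 releases them when it exits.
    if rank != 0:
        barrier.wait()
    try:
        yield
    finally:
        if rank == 0:
            barrier.wait()

def preprocess(rank):
    with main_process_first(rank):
        # Every rank executes this block, just in a certain order:
        # rank 0 "preprocesses", the others find the result cached.
        if "data" not in cache:
            cache["data"] = [x * 2 for x in range(3)]
            results.append((rank, "computed"))
        else:
            results.append((rank, "cached"))

threads = [threading.Thread(target=preprocess, args=(r,)) for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])           # rank 0 did the work first
print(sorted(results[1:]))  # the remaining ranks all hit the cache
```

All four "ranks" enter the block, but only rank 0 computes; the others wait at the barrier and then read the cached result, which is exactly the behavior described above.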
Ok, then the PR is not correct, since all the processes will try to write the JSON files. I removed the original snippet that should instead be inside the `main_process_first` block.
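The concern above can be sketched in isolation. In this hedged, self-contained example (the `vocab_dict` contents and file names are invented for illustration; the real script derives them from the dataset and `TrainingArguments`), only the main rank writes the vocab file, so no rank can read it before it exists:

```python
import json
import os
import tempfile

# Illustrative stand-ins: in the real script, local_rank comes from
# TrainingArguments and vocab_dict is built from the training data.
local_rank = 0                        # -1 (no distributed) or 0 = main process
vocab_dict = {"a": 0, "b": 1, "|": 2}

output_dir = tempfile.mkdtemp()
vocab_file = os.path.join(output_dir, "vocab.json")

# Only the main process writes the file; in the distributed script this
# guard would sit inside `with training_args.main_process_first():` so
# the other ranks reach the load only after the write has completed.
if local_rank in (-1, 0):
    with open(vocab_file, "w", encoding="utf-8") as f:
        json.dump(vocab_dict, f)

# Every process can now safely load the freshly written vocab.
with open(vocab_file, encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded == vocab_dict)  # True
```

Without the rank guard, every process would open the same file for writing, which is the duplicated-write problem being pointed out.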
Indeed, your changes are perfect. Is this ready to be merged now?
It is working on my end without any problem.
Thanks for iterating and fixing the seq2seq training script as well @mpenagar! I'll merge once you've confirmed this is ready!
Is this good for merge @mpenagar? Changes LGTM!
Yes, it is ready. Anyway, I don't know how GitHub works. Should I close the PR (there is a "Close" button there)?
Awesome, thanks for confirming @mpenagar and for your contribution 🤗 |
…ion_ctc.py (huggingface#21926)

* Update run_speech_recognition_ctc.py: make sure all processes wait until data is saved before loading the processor from the output_dir
* Make sure all processes wait until data is saved before loading the processor from the output_dir
* Update run_speech_recognition_ctc.py
* Update run_speech_recognition_seq2seq.py
What does this PR do?

Make sure all processes wait until data is saved before loading the processor from the `output_dir` in the `pytorch/speech-recognition/run_speech_recognition_ctc.py` example.

Issue: processes could try to load the processor from `output_dir` before it is saved.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.