Sync preprocesses before loading the processor at run_speech_recognition_ctc.py (#21926)

Merged
sgugger merged 4 commits into huggingface:main from mpenagar:main
Apr 5, 2023

Conversation

mpenagar (Contributor) commented Mar 3, 2023

What does this PR do?

Make sure all processes wait until data is saved before loading the processor from the output_dir in the pytorch/speech-recognition/run_speech_recognition_ctc.py example.

Issue:

  • Non-main processes might try to load the processor from the output_dir before it has been saved.
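The race can be illustrated with a minimal, self-contained sketch (the names `fake_output_dir` and `load_processor` are invented stand-ins for the real filesystem and AutoProcessor.from_pretrained, not code from the script): a replica that reads the output_dir before the main process has written the processor files finds nothing there.

```python
# Illustrative stand-in for output_dir on disk (all names invented for this sketch).
fake_output_dir = {}

def load_processor(fs):
    # Stand-in for loading the processor from output_dir;
    # returns None when the config files are not there yet.
    return fs.get("preprocessor_config.json")

# A replica reads too early: nothing has been saved yet.
too_early = load_processor(fake_output_dir)

# The main process saves; after that, the replica's load succeeds.
fake_output_dir["preprocessor_config.json"] = {"feature_size": 1}
after_save = load_processor(fake_output_dir)
```

The PR's fix is to synchronize the processes so that the "save" step above is guaranteed to have happened before any replica attempts the load.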

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Make sure all processes wait until data is saved before loading the processor from the output_dir
HuggingFaceDocBuilderDev commented Mar 3, 2023

The documentation is not available anymore as the PR was closed or merged.

sgugger (Collaborator) commented Mar 3, 2023

cc @sanchit-gandhi

sanchit-gandhi (Contributor) left a comment

Thanks @mpenagar, great catch! Would you mind updating the seq2seq ASR fine-tuning script too?

if is_main_process(training_args.local_rank):

It's exactly the same update that's required there!

mpenagar (Contributor, Author) commented Mar 7, 2023

Updated the seq2seq ASR fine-tuning script. I'm not very good with GitHub; I guess there is no need to open a new PR.

mpenagar (Contributor, Author) commented Mar 8, 2023

Wait... there is something I'm not getting right.

As far as I understand from the documentation, any code inside a with training_args.main_process_first(): block should be executed only by the main process:

A context manager for torch distributed environment where
one needs to do something on the main process, while blocking
replicas, and when it's finished releasing the replicas.

One such use is for datasets’s map feature which to be efficient
should be run once on the main process, which upon completion
saves a cached version of results and which then automatically
gets loaded by the replicas.

But in my experience, the code is executed by all the processes, not just the main one. Take this minimal `example.py`:

from transformers import TrainingArguments, HfArgumentParser
from transformers.trainer_utils import is_main_process

def main():
    parser = HfArgumentParser((TrainingArguments,))
    (training_args,) = parser.parse_args_into_dataclasses()
    rank = training_args.local_rank
    main_process = is_main_process(rank)
    print(f'\nBEFORE WITH - local_rank={rank} is_main_process={main_process}')
    with training_args.main_process_first():
        print(f'\nINSIDE WITH - local_rank={rank}')

if __name__ == "__main__":
    main()

If I execute it in a 4 GPU node:

OMP_NUM_THREADS=1 python3 -m torch.distributed.launch --nproc_per_node 4 example.py --output_dir none

The synchronization is working, but all processes execute the "INSIDE" print.

Executing with the newer torchrun does the same:

OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 example.py --output_dir none

What am I getting wrong?

sgugger (Collaborator) commented Mar 8, 2023

No, as the name indicates, it executes the code in the context manager on the main process first, and then on all the others. The code is indeed executed by all processes, just in a certain order.

Since with Datasets everything is cached, executing the preprocessing inside that context manager means that process 0 will do the preprocessing, and then all the others will load the result from the cache without needing to redo it.
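This ordering can be mimicked with threads in a small, illustrative stand-in for main_process_first() (every name here is invented for the sketch; the real context manager uses torch.distributed barriers, not an Event): every rank runs the body, but rank 0 is guaranteed to finish it before any replica starts.

```python
import threading

world_size = 4
order = []                        # records which rank ran the "with" body, in order
order_lock = threading.Lock()
main_done = threading.Event()     # simplified stand-in for the distributed barrier

def run_body(rank):
    # Replicas block until the main process has finished the body,
    # mirroring `with training_args.main_process_first():`.
    if rank != 0:
        main_done.wait()
    with order_lock:
        order.append(rank)        # every rank executes the body
    if rank == 0:
        main_done.set()           # release the replicas

threads = [threading.Thread(target=run_body, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After joining, `order` starts with rank 0 and contains all four ranks: main first, then everyone else, exactly as sgugger describes.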

mpenagar (Contributor, Author) commented Mar 8, 2023

Ok, then the PR is not correct, since all the processes will try to write the JSON files. I removed the original:

if is_main_process(training_args.local_rank):

that should be there inside the with block...
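The corrected pattern can be sketched with the same thread-based simulation (all names invented; a threading.Barrier stands in for the distributed synchronization): all processes enter the block, only rank 0 writes the files, and every process loads only after the write is done.

```python
import threading

world_size = 4
fake_fs = {}                       # illustrative stand-in for output_dir
writes = []                        # records which ranks actually wrote
write_lock = threading.Lock()
barrier = threading.Barrier(world_size)
loaded = [None] * world_size

def run(rank):
    # Every process enters the "with main_process_first()" body,
    # but only the main process writes the JSON files.
    if rank == 0:
        with write_lock:
            writes.append(rank)
        fake_fs["preprocessor_config.json"] = "saved"
    barrier.wait()                 # replicas wait until the save is done
    loaded[rank] = fake_fs["preprocessor_config.json"]

threads = [threading.Thread(target=run, args=(r,)) for r in range(world_size)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Exactly one write happens, yet every rank loads successfully: that is the combination of the is_main_process check inside the main_process_first block that the final version of the PR uses.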

sgugger (Collaborator) commented Mar 8, 2023

Indeed, your changes are perfect. Is this ready to be merged now?

mpenagar (Contributor, Author) commented Mar 9, 2023

It is working on my end without any problem.

mpenagar changed the title from "Sync preocesses before loading the processor at run_speech_recognition_ctc.py" to "Sync preprocesses before loading the processor at run_speech_recognition_ctc.py" Mar 9, 2023
sanchit-gandhi (Contributor) left a comment

Thanks for iterating and fixing the seq2seq training script as well @mpenagar! I'll merge once you've confirmed this is ready!

huggingface deleted a comment from github-actions bot Apr 4, 2023
sanchit-gandhi (Contributor) commented Apr 4, 2023

Is this good for merge @mpenagar? Changes LGTM!

mpenagar (Contributor, Author) commented Apr 5, 2023

Yes, it is ready. Anyway, I don't know how GitHub works. Should I close the PR (there is a "Close" button there)?

sgugger merged commit d5239ba into huggingface:main Apr 5, 2023
sanchit-gandhi (Contributor) commented:

Awesome, thanks for confirming @mpenagar and for your contribution 🤗

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
…ion_ctc.py (huggingface#21926)

* Update run_speech_recognition_ctc.py

Make sure all processes wait until data is saved before loading the processor from the output_dir

* Make sure all processes wait until data is saved before loading the processor from the output_dir

* Update run_speech_recognition_ctc.py

* Update run_speech_recognition_seq2seq.py

4 participants