
Added mask_time_prob and mask_time_length arguments to wav2vec2 pretraining script and readme #19997

Conversation

@mpierrau (Contributor) commented Nov 1, 2022

What does this PR do?

This PR was requested by @patrickvonplaten following my question and the ensuing discussion in the Discord ask-for-help channel, under the title "Wav2vec2 - why is mask_time_prob=0.05?".

This PR adds the arguments mask_time_prob and mask_time_length to the examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py script, and adds a corresponding example of their use to the README.md.

mask_time_prob is a variable that describes two things, depending on context:

  1. during pre-training, the fraction of the encoded feature vectors to be masked for the contrastive learning task
  2. during fine-tuning, the masking probability used to imitate SpecAugment

In this script, we are considering it in the context of 1).

mask_time_length is a variable describing the length (in number of frames) of each applied mask span. It is added for completeness.
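
For reference, here is a minimal sketch of what the added CLI arguments could look like in the script's argument parser. This is illustrative only; the exact help strings and wiring in the PR may differ, and no defaults are set so that the model config values apply unless the user overrides them (see also the note under Background below).

```
import argparse

parser = argparse.ArgumentParser()
# Sketch of the two new arguments; leaving default=None lets the values from
# the model config take effect unless the user passes the flags explicitly.
parser.add_argument(
    "--mask_time_prob",
    type=float,
    default=None,
    help="Fraction of encoded feature vectors to mask for the contrastive task during pre-training.",
)
parser.add_argument(
    "--mask_time_length",
    type=int,
    default=None,
    help="Length (in frames) of each masked span.",
)
```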

Background

In the original wav2vec 2.0 article, the masking setup corresponds to mask_time_prob=0.65, which (due to overlapping spans) results in an effective masking of approximately 49% of the feature vectors during pre-training. mask_time_length corresponds to the variable M in the article and is set to 10 there.
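
As a rough sanity check of that ~49% figure, the following small simulation estimates the effective coverage under simplified assumptions (span starts drawn uniformly and independently, fixed sequence length; the actual masking logic in transformers differs in details such as sampling without replacement):

```
import numpy as np

def effective_masking(mask_time_prob=0.65, mask_time_length=10, seq_len=500, n_trials=1000, seed=0):
    """Estimate the fraction of frames actually masked once spans overlap."""
    rng = np.random.default_rng(seed)
    coverage = []
    for _ in range(n_trials):
        mask = np.zeros(seq_len, dtype=bool)
        # Nominally mask_time_prob * seq_len frames, spread over spans of
        # mask_time_length frames each.
        num_spans = int(mask_time_prob * seq_len / mask_time_length)
        starts = rng.integers(0, seq_len - mask_time_length, size=num_spans)
        for start in starts:
            mask[start : start + mask_time_length] = True
        coverage.append(mask.mean())
    return float(np.mean(coverage))

print(effective_masking())  # roughly 0.48, i.e. about half the frames end up masked
```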

However, the config file of wav2vec2-base has mask_time_prob=0.05. This is because the model is usually used for fine-tuning rather than (continued) pre-training, and for fine-tuning 0.05 is a better hyperparameter value (see Appendix B of the wav2vec 2.0 article). This is a bit confusing.

The config file of the wav2vec2-base-v2 model, which was used in Patrick's experiments (see the speech-pretraining README), confirms that mask_time_prob=0.65 was indeed used for pre-training.

The values 0.65 and 10 are set as default values for the DataCollatorForWav2Vec2Pretraining class defined in the script (since users may extract this class from the script), but no defaults are given in the argparser. The argument values are also specified in the wav2vec2-base and wav2vec2-base-v2 model configs, and if defaults were set in the argparser, the model config values would never be applied, even though applying them may be desired. Hence, a parser argument only takes effect if it is explicitly specified when executing the script (see the sketch below).
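
A minimal sketch of that precedence logic, assuming the usual argparse namespace `args` and a model config `config` loaded earlier in the script (names are illustrative, not the exact code in the PR):

```
# Only override the model config when the user passed the flag explicitly;
# otherwise keep the value from e.g. the wav2vec2-base-v2 config.
mask_time_prob = config.mask_time_prob if args.mask_time_prob is None else args.mask_time_prob
mask_time_length = config.mask_time_length if args.mask_time_length is None else args.mask_time_length
```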

I believe this PR also raises a bigger question, namely whether mask_time_prob should be split into two different variables to avoid confusion in the future.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Participants in discussion on Discord: @osanseviero @patrickvonplaten

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sgugger (Collaborator) commented Nov 1, 2022

cc @sanchit-gandhi

@patrickvonplaten (Contributor) left a comment

This looks good to me, but it would be good to get a review from @sanchit-gandhi here as well.

@sanchit-gandhi (Contributor) left a comment

Hey @mpierrau! Sorry for the long delay in getting you a final review. Thanks for digging into this and documenting the motivations so well - the change is most welcome!

Regarding your question:

whether mask_time_prob should be split into two different variables to avoid confusion in the future

I would say 'no' to this. At heart, mask_time_prob controls the same behaviour for pre-training and fine-tuning. It's simply the interpretation of this behaviour that is different:

  • For pre-training, we're masking latents for our contrastive loss
  • For fine-tuning, we're applying SpecAug in the latent space

-> the behaviour is the same in both cases (masking in the latent space); only the interpretation of what the masking means for training differs

What we could do is clarify the meaning of this variable in the docs (with something similar to the above) - this would help clear the confusion without introducing extra variables and the potential for breaking changes!
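
As a purely illustrative example of that point (the values come from the discussion above, not from any particular checkpoint):

```
from transformers import Wav2Vec2Config

# The same config fields drive both behaviours; only the interpretation differs.
pretraining_config = Wav2Vec2Config(mask_time_prob=0.65, mask_time_length=10)  # masking latents for the contrastive loss
finetuning_config = Wav2Vec2Config(mask_time_prob=0.05, mask_time_length=10)   # SpecAugment-style masking in the latent space
```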

@github-actions bot commented Dec 8, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@mpierrau (Contributor, Author) commented Dec 8, 2022

Hey! I'm not sure what the next step is here; like I said, this is my first PR :) Some checks failed and I'm not sure why. All tests passed before I committed @sanchit-gandhi's suggestions, but it seems they are mainly failing due to timeouts?

@sgugger (Collaborator) commented Dec 8, 2022

You will need to rebase your PR on main for the tests to pass, as your branch does not have the fixes for the last release of TensorFlow.

@sanchit-gandhi (Contributor) commented

Hey @mpierrau! Exciting to see that you've picked up this PR again! Let me know if you need any final help - we're close to merging now!

As Sylvain mentioned, you'll need to rebase onto main to fix the failing tests (see step 5 in this guide: https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#create-a-pull-request, and just remember to force push as described there 😉)

gante and others added 20 commits December 15, 2022 15:32
* Fixing the doctests failures.

* Fixup.
…in pytorch cpu/cuda amp autocast (huggingface#20289)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* Add docstrings for canine model

* Update CanineForTokenClassification

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Add ResNetBackbone

* Define channels and strides as property

* Remove file

* Add test for backbone

* Update BackboneOutput class

* Remove strides property

* Fix docstring

* Add backbones to SHOULD_HAVE_THEIR_OWN_PAGE

* Fix auto mapping name

* Add sanity check for out_features

* Set stage names based on depths

* Update to tuple

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
- simplifies the device checking test
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* fix the doc to specify that add_prefix_space = False

* add correct expected output
* [ASR Examples] Update README for seq2seq

* add language info

* add training results

* re-word
* Add padding transformation

* Add in upstream changes

* Update tests & docs

* Code formatting tuples in docstring
* Add AnyPrecisionAdamW optimizer

* Add optim_args argument to TrainingArgs

* Add tests for AnyPrecisionOptimizer

* Change AnyPrecisionAdam default params to float32

* Move default_anyprecision_kwargs in trainer test

* Rename AnyPrecisionAdamW
… consistency. (huggingface#20280)

* [Proposal] Breaking change `zero-shot-object-detection` for improved
consistency.

This is a proposal to modify the output of `zero-shot-object-detection`
to provide better alignment with other pipelines.

The output is now strictly the same as `object-detection` whereas before
it would output lists of lists.

The name `candidate_labels` is used throughout for consistency with
other `zero-shot` pipelines.

The pipeline is changed to `ChunkPipeline` to support batching cleanly.

This removes all the lists and list of lists shenanigans, it's now a
matter of the base pipeline handling all this not this specific one.

**Breaking change**: It removes the possibility of complex calls like `pipe(images = [image1, image2],
text_queries=[candidates1, candidates2])` in favour of only
`pipe([{"image": image1, "candidate_labels": candidates1}, {"image": image2, "candidate_labels": candidates2}])`
when dealing with lists and/or datasets.
We could keep them, but it would add a lot of complexity to the code
base; since the pipeline is rather young, I'd rather break compatibility to keep the
code simpler, but we can revert this.

**Breaking change**: The name of the argument is now `image` instead of
`images`, since it expects only one image by default. This is revertible
like the previous change.

**Breaking change**: The output type is now simplified and flattened:

`pipe(inputs) == [{**object1}, {**object2}]`
instead of the previous
`pipe(inputs) == [[{**object1}, {**object1}], [{**object2}]]`
where the different instances would be grouped by candidate labels
within lists.
IMHO this is not really desirable, since it would output empty lists and
only adds superfluous indirection compared to
`zero-shot-object-detection`.

It is relatively change-free in terms of the results; it does change the
computation, however, since the batching is now handled by the pipeline
itself. It **did** change the results for the small models, so there
seems to be a real difference in how the models handle this.

* Fixing the doctests.

* Behind is_torch_available.
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* remove double brackets

* oops get other bracket
amyeroberts and others added 25 commits December 15, 2022 15:32
* Add require_vision decorator

* Fixup

* Use requires_backends

* Add requires_backend to utils functions
* Add templates for gpt-sw3

* Add templates for gpt-sw3

* Added sentencepiece tokenizer

* intermediate commit with many changes

* fixed conflicts

* Init commit for tokenization port

* Tokenization progress

* Remove fast tokenizer

* Clean up and rename spm.model -> spiece.model

* Remove TF -> PT conversion script template, Clean up Megatron -> PT script

* Optimize encode & decode performance

* added new attention

* added new attention

* attention for gpt-sw3 working

* attention good

* Cache is now working

* fixed attention mask so that it works with causal attention

* fixed badbmm bug for cpu and caching

* updated config with correct parameters

* Refactor and leave optimizations as separate functions to avoid breaking expected functionality

* Fix special tokens mapping for both tokenizers

* cleaning up of code and comments

* HF compatible attention outputs

* Tokenizer now passing tests, add documentation

* Update documentation

* reverted back to base implementation after checking that it is identical to pretrained model

* updated gpt-sw3 config

* updated conversion script

* aligned parameters with gpt-sw3 config

* changed default scale_attn_by_inverse_layer_idx to true

* removed flag from conversion script

* added temporary model path

* reverted back to functioning convert script

* small changes to default config

* updated tests for gpt-sw3

* make style, make quality, minor cleanup

* Change local paths to testing online repository

* Change name: GptSw3 -> GPTSw3

* Remove GPTSw3TokenizerFast references

* Use official model repository and add more model sizes

* Added reference to 6.7b model

* Add GPTSw3DoubleHeadsModel to IGNORE_NON_AUTO_CONFIGURED, like GPT2DoubleHeadsModel

* Remove pointers to non-existing TFGPTSw3

* Add GPTSw3 to docs/_toctree.yml

* Remove TF artifacts from GPTSw3 in __init__ files

* Update README:s with 'make fix-copies'

* Add 20b model to archive list

* Add documentation for GPT-Sw3

* Fix typo in documentation for GPT-Sw3

* Do 'make fix-copies' again after having updated docs

* Fix some typos in docs

* Update src/transformers/models/gpt_sw3/configuration_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/configuration_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/__init__.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/__init__.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/convert_megatron_to_pytorch.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update tests/models/gpt_sw3/test_tokenization_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/modeling_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Resolve comments from PR feedback

* Resolve more comments from PR feedback, also set use_cache=True in convert script

* Add '# Copied from' comments for GPTSw3 modeling

* Set 'is_parallelizable = False'

* Remove '# Copied from' where code was modified and add 'with x->y' when appropriate

* Remove parallelize in mdx

* make style, make quality

* Update GPTSw3Config default values and corresponding documentation

* Update src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/__init__.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Clean up and protect GPTSw3Tokenizer imports with is_sentencepiece_available

* Make style, make quality

* Add dummy object for GPTSw3Tokenizer via 'make fix-copies'

* make fix-copies

* Remove GPTSw3 modeling classes

* make style, make quality

* Add GPTSw3 auto-mappings for other GPT2 heads

* Update docs/source/en/model_doc/gpt-sw3.mdx

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/convert_megatron_to_pytorch.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Update src/transformers/models/gpt_sw3/tokenization_gpt_sw3.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* Remove old TODO-comment

* Add example usage to GPTSw3Tokenizer docstring

* make style, make quality

* Add implementation details and example usage to gpt-sw3.mdx

Co-authored-by: JoeyOhman <joeyoh@kth.se>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…ingface#20731)

* Disambiguate test for required_input in tokenization base file.

* Add test for size
* Add decorator for flaky tests

* Fix up
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add `keep_in_fp32_modules` support

* pass it as class attribute

* few modifs

- make tests `slow`
- fix logic

* better logic

* fix failing test

* `bfloat16` support

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix

* simplify tests

* simplify tests

* fix test

* modify message

* more checks

* fix failing tests

* add more conditions

- add `is_accelerate_available`
- fixes pipleine tests that failed

* add suggestions

* Update src/transformers/modeling_utils.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix failing `bnb` test

* add last safety checker

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Fix the pipeline test regarding TF

* Fix the pipeline test regarding TF

* update comment

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Fix AdamWeightDecay for TF

* Fix AdamWeightDecay for TF

* make fixup
…ngface#20728)

`image = to_channel_dimension_format(image, ChannelDimension.LAST)`
is redundant as this same conversion is also applied in to_pil_image().

This redundant call actually makes the training fail in rare cases.
The problem can be reproduced with the following code snippet:
```
import torch

from transformers.models.clip import CLIPFeatureExtractor
vision_processor = CLIPFeatureExtractor.from_pretrained('openai/clip-vit-large-patch14')
images = [
    torch.rand(size=(3, 2, 10), dtype=torch.float),
    torch.rand(size=(3, 10, 1), dtype=torch.float),
    torch.rand(size=(3, 1, 10), dtype=torch.float)
]
for image in images:
    processed_image = vision_processor(images=image, return_tensors="pt")['pixel_values']
    print(processed_image.shape)
    assert processed_image.shape == torch.Size([1, 3, 224, 224])
```

The last image has a height of 1 pixel.
The second call to to_channel_dimension_format() will transpose the image, and the height
dimension is then wrongly treated as the channels dimension.
Because of this, the following normalize() step results in an
exception.
* Add first draft

* Add out_features attribute to config

* Add corresponding test

* Add Dinat backbone

* Add BackboneMixin

* Add Backbone mixin, improve tests

* Fix embeddings

* Fix bug

* Improve backbones

* Fix Nat backbone tests

* Fix Dinat backbone tests

* Apply suggestions

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
* added model resources for xlm-roberta

* added model resources for xlm-roberta

* resolve suggested changes

* add resources to xlm-roberta
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…ce#20758)

Uninstall torch_tensorrt for now

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Improve tests

* Improve TF tests

* Apply suggestion

* Fix test

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
fix: Fix the issue where the Trainer cannot use the use_legacy_prediction_loop argument

Resolves the AttributeError: 'PredictionOutput' object has no attribute 'num_samples' that occurs when the use_legacy_prediction_loop argument is used, i.e. when predicting with prediction_loop during the predict stage

Co-authored-by: ZhouHang <zhouhang@idataway.com>
* weight -> weights

* model embedding resize does not work with both v2 and normal

* remove useless test
* Replaces xxx_required with requires_backends

* Fixup
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* Add Swin backbone

* Remove line

* Add code example

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
* Even more validation.

* Fixing order.
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
@mpierrau (Contributor, Author) commented

Hey, sorry, I somehow managed to miss the force push flag anyway... I hope it still works?

@sanchit-gandhi (Contributor) commented Dec 15, 2022

Hey @mpierrau! Unfortunately the commit history gets messed up after a rebase + non-force push. Not to worry though! Let's open a new PR with your changes in favour of this one. You can create a new branch and copy over the relevant file (run_pretraining_...):

```
git checkout -b new-branch-mask-time-prob
git restore --source adding-mask_time_prob-args-to-wav2vec2-pretraining-script -- /path/to/relevant/file
```

You can then commit, rebase, and force push to origin to open a new PR with just the required changes.

@sanchit-gandhi (Contributor) commented

Closing in favour of #20985.
