main to ssl synthesis #9

Merged
57 commits merged on Jul 28, 2022
Commits
4638799
Megatron BART BOS / EOS bug fix (#4495)
michalivne Jul 6, 2022
ad61479
GPT Prompt Learning Improvements (#4496)
vadam5 Jul 6, 2022
ab6c46b
Megatron perceiver with tensor parallelism only (#4318)
MaximumEntropy Jul 6, 2022
4dd0e56
NMESC speaker counting algorithm update (#4500)
tango4j Jul 7, 2022
f21cf36
Fix dataset parameter typo on tacotron2 example yaml (#4471)
saarus72 Jul 7, 2022
cf95f93
Noam lr sched: do not force min_lr after max_steps (#4472)
alancucki Jul 7, 2022
1f97094
Refactor for punctuation model (#4367)
jubick1337 Jul 7, 2022
4d0bacb
bug fix - sample rate was being ignored in vocoder dataset when not l…
paarthneekhara Jul 7, 2022
2089016
Add ITN pt (#4516)
guidefloripa Jul 7, 2022
b4edbca
Fixed WER initialization in ASR_with_Nemo notebook (#4523)
anteju Jul 8, 2022
01f0422
Update cmudict (#4510)
jasro23 Jul 8, 2022
81df7a9
[Add] Support for Different LRs with Param Groups (#4508)
stevehuang52 Jul 8, 2022
c0f5bff
Weighted bucketing (#4474)
tbartley94 Jul 8, 2022
2af11fe
Add silence handling for speaker diarization pipeline (#4512)
nithinraok Jul 8, 2022
a8266e4
Fix runtime check (#4501)
borisfom Jul 8, 2022
726ad22
Update finetune label models (#4504)
nithinraok Jul 9, 2022
1428763
[ASR][Breaking Change] Update signature of Hypothesis alignments (#4511)
titu1994 Jul 9, 2022
309a81a
Weighted bucketing (#4530)
tbartley94 Jul 11, 2022
7f75191
Additional sentencepiece args - Byte fallback, split digits, split_on…
MaximumEntropy Jul 11, 2022
dc6bea2
Add support for ASR Adapter Auxiliary Losses (#4480)
titu1994 Jul 12, 2022
bc29ef2
update (#4520)
stevehuang52 Jul 12, 2022
b70ec73
fix duplex inference with grammars (#4517)
ekmb Jul 12, 2022
8e186eb
Add Bucketing support to TarredAudioToClassificationLabelDataset (#4465)
entn-at Jul 13, 2022
ff588a7
Add MTEncDec Finetune support (#4540)
aklife97 Jul 13, 2022
8b67ec6
Add nsys profiling (#4539)
ericharper Jul 13, 2022
4e43b7c
Update megatron prompt learning interface to dialogue (#4545)
Zhilin123 Jul 14, 2022
d66c8e9
Merge branch 'main' into main
XuesongYang Jul 14, 2022
7801639
remove the variable that is not used in the context. (#4547)
XuesongYang Jul 14, 2022
99c7661
update fastpitch to add export controls (#4509)
blisc Jul 14, 2022
fa2e55e
Adding multispeaker fastpitch and hifigan en model links to available…
subhankar-ghosh Jul 15, 2022
7d9b166
added MLM Scoring (#4476)
yzhang123 Jul 15, 2022
8abe0f4
Removed NLPDDPPlugin Import check (#4555)
vadam5 Jul 15, 2022
65b9b57
Add length ratio filtering script (#4551)
MaximumEntropy Jul 15, 2022
fea3775
Add Tokenization and Normalization pre-proecssing script for NMT (#4557)
aklife97 Jul 16, 2022
56694f0
handled n segments for a different sampling rate than original sampli…
paarthneekhara Jul 17, 2022
23a3496
Merge branch 'main' into main
paarthneekhara Jul 17, 2022
85fd5a9
Added case for n_segments 0, warning for n_segments greater than file…
paarthneekhara Jul 20, 2022
86fea2a
[Fix] Relative audio path in speech data explorer (#4570)
anteju Jul 21, 2022
e67c4ca
[Add] Catalan ASR NGC Resource (#4576)
stevehuang52 Jul 21, 2022
6442e33
Option to disregard document boundaries for t5, bart, ul2 (#4481)
MaximumEntropy Jul 22, 2022
2574f53
Merge branch 'NVIDIA:main' into main
paarthneekhara Jul 24, 2022
6b9617d
Integrating support for GPT/T5/BART for Question Answering (#4532)
ameyasm1154 Jul 25, 2022
468a3f3
add kw asr models, add itn ru checkpoint (tagger-based) (#4595)
bene-ges Jul 25, 2022
c324499
Add DALI pipeline to SSL model (#4592)
piraka9011 Jul 25, 2022
faf8ad8
divided parallel ci tests to reduce memory usage (#4600)
ameyasm1154 Jul 26, 2022
7890979
fix tarred dataset len when num shards is not divisible by workers (#…
itzsimpl Jul 26, 2022
5686fe2
[TTS][ASR] customize arguments for trimming the leading/trailing sile…
XuesongYang Jul 26, 2022
793cf48
Updating the default parameters in the example adapters config file (…
shan18 Jul 26, 2022
f1bf6c2
NeMo Megatron: Add sequence parallelism and selective activation che…
ericharper Jul 26, 2022
aa0a98c
Update Offline ASR with CTC Decoding (#4608)
titu1994 Jul 26, 2022
cbf3f66
normalize_batch error msg (#4614)
piraka9011 Jul 27, 2022
90ad5af
Support listing Hugging Face model info (#4619)
titu1994 Jul 27, 2022
2841c28
[TTS] Fix off-by-1 bug in Beta Binomial Prior (#4616)
rlangman Jul 28, 2022
16c96ba
Update diarization data loader to train meeting data (#4567)
tango4j Jul 28, 2022
96021f4
Add Squeezeformer to ASR (#4416)
titu1994 Jul 28, 2022
4f5ea8a
Merge branch 'NVIDIA:main' into main
paarthneekhara Jul 28, 2022
643002a
Merge branch 'ssl_synthesis' into main_ssl_merge
paarthneekhara Jul 28, 2022
299 changes: 209 additions & 90 deletions Jenkinsfile

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.rst
@@ -45,7 +45,7 @@ Key Features

* Speech processing
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supports CTC and Transducer/RNNT losses/decoders
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
13 changes: 12 additions & 1 deletion docs/source/asr/asr_all.bib
@@ -1045,4 +1045,15 @@ @misc{ssl_inter
publisher = {arXiv},
year = {2021},
copyright = {arXiv.org perpetual, non-exclusive license}
}

@misc{kim2022squeezeformer,
doi = {10.48550/ARXIV.2206.00888},
url = {https://arxiv.org/abs/2206.00888},
author = {Kim, Sehoon and Gholami, Amir and Shaw, Albert and Lee, Nicholas and Mangalam, Karttikeya and Malik, Jitendra and Mahoney, Michael W. and Keutzer, Kurt},
keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Squeezeformer: An Efficient Transformer for Automatic Speech Recognition},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
11 changes: 11 additions & 0 deletions docs/source/asr/configs.rst
@@ -502,6 +502,17 @@ specify the tokenizer if you want to use sub-word encoding instead of character-
The encoder section includes the details about the Conformer-CTC encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.ConformerEncoder<./api.html#nemo.collections.asr.modules.ConformerEncoder>`.

Squeezeformer-CTC
~~~~~~~~~~~~~~~~~

The config files for the Squeezeformer-CTC model contain character-based encoding and sub-word encoding at
``<NeMo_git_root>/examples/asr/conf/squeezeformer/squeezeformer_ctc_char.yaml`` and ``<NeMo_git_root>/examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml``
respectively. The components of the `Squeezeformer-CTC <./models.html#Squeezeformer-CTC>`__ configs are similar to those of the `Conformer-CTC <./configs.html#Conformer-CTC>`__ configs.

The encoder section includes the details about the Squeezeformer-CTC encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.SqueezeformerEncoder<./api.html#nemo.collections.asr.modules.SqueezeformerEncoder>`.
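
As a quick, hedged illustration of how these configs might be used, the command below points one of the standard NeMo CTC example scripts at the Squeezeformer BPE config. The script path, override names, and data/tokenizer paths are assumptions based on the usual NeMo examples layout, not something specified in this change.

.. code-block:: bash

    # Sketch only: assumes the standard CTC-BPE example script and a pre-built
    # tokenizer directory; substitute your own checkout, manifests, and tokenizer.
    python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
        --config-path=../conf/squeezeformer \
        --config-name=squeezeformer_ctc_bpe \
        model.train_ds.manifest_filepath=/data/train_manifest.json \
        model.validation_ds.manifest_filepath=/data/dev_manifest.json \
        model.tokenizer.dir=/data/tokenizer_spe_unigram_v1024 \
        model.tokenizer.type=bpe \
        trainer.devices=1 \
        trainer.max_epochs=100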


ContextNet
~~~~~~~~~~

3 changes: 3 additions & 0 deletions docs/source/asr/data/benchmark_rw.csv
@@ -0,0 +1,3 @@
Model,Model Base Class,Model Card
stt_rw_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_ctc_large"
stt_rw_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_transducer_large"
3 changes: 3 additions & 0 deletions docs/source/asr/data/scores/rw/conformer_rw.csv
@@ -0,0 +1,3 @@
Model Name,Language,MCV Test-Set v9.0 (rw)
stt_rw_conformer_ctc_large,rw,18.22
stt_rw_conformer_transducer_large,rw,16.19
66 changes: 36 additions & 30 deletions docs/source/asr/datasets.rst
@@ -1,7 +1,7 @@
Datasets
========

NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started
with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below.

If the user has their own data and wants to preprocess it for use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section.
@@ -13,8 +13,8 @@ If the user already has a dataset that you want to convert to a tarred format, r
LibriSpeech
-----------

Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. At least 250GB free
space is required.

.. code-block:: bash

@@ -37,18 +37,18 @@ Fisher English Training Speech

Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection.

In brief, the following scripts convert the ``.sph`` files to ``.wav``, slice those files into smaller audio samples, match the
smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets
(with one manifest each).

.. note::
- 106 GB of space is required to run the ``.wav`` conversion
- additional 105 GB is required for the slicing and matching
- ``sph2pipe`` is required in order to run the ``.wav`` conversion

**Instructions**

The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure
that looks similar to the following:

.. code-block:: bash
@@ -67,7 +67,7 @@ that looks similar to the following:
├── fe_03_p2_sph3
└── ...

The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``)
are located in the remaining directories in an ``audio`` subdirectory.

#. Convert the audio files from ``.sph`` to ``.wav`` by running:
@@ -78,7 +78,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
python fisher_audio_to_wav.py \
--data_root=<fisher_root> --dest_root=<conversion_target_dir>

This will place the unsliced ``.wav`` files in ``<conversion_target_dir>/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several
minutes to run.

#. Process the transcripts and slice the audio data.
@@ -90,7 +90,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
--dest_root=<processing_target_dir> \
--remove_noises

This script splits the full dataset into train, validation, test sets, and places the audio slices in the corresponding folders
in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path.

This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files.
@@ -100,8 +100,8 @@ are located in the remaining directories in an ``audio`` subdirectory.

Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection.

Similarly to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and
transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments
are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file.

.. note::
@@ -123,7 +123,7 @@ You can optionally include ``--min_slice_duration=<num_seconds>`` if you would l
AN4 Dataset
-----------

This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out
addresses, names, etc. Information about this dataset can be found on the `official CMU site <http://www.speech.cs.cmu.edu/databases/an4/>`_.

#. `Download and extract the dataset <http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz>`_ (which is labeled "NIST's Sphere audio (.sph) format (64M)").
@@ -153,14 +153,14 @@ After the script finishes, the ``data`` folder should contain a ``data_aishell``
Aishell-2
---------

To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push
these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run:

.. code-block:: bash

python process_aishell2_data.py --audio_folder=<data directory> --dest_folder=<destination directory>

After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory.

Preparing Custom ASR Data
-------------------------
@@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
WAV files as they are the default and have been most thoroughly tested.

There should be one manifest file per dataset that will be passed in; therefore, if the user wants separate training and validation
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
versa.

Each line of the manifest should be in the following format:
@@ -210,16 +210,22 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below).

.. note::
For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly
inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``,
``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special
tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.

As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
of the manifest after filtering is the correct size of the dataset for reporting training progress.
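
As a hedged sketch, a tarred training set might be wired up with overrides like the following. The paths are illustrative only, and the field names mirror the dataset options described in this section; the ``_OP_``/``_CL_`` tags expand as explained in the note above.

.. code-block:: bash

    # Illustrative overrides for a tarred training dataset; adjust paths to your data.
    model.train_ds.is_tarred=true
    model.train_ds.tarred_audio_filepaths='/data/shard__OP_1..64_CL_.tar'
    model.train_ds.manifest_filepath=/data/tarred_audio_manifest.json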

The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime.
Note that this strategy, on specific occasions (when the number of shards is not divisible by ``world_size``), will not sample
the entire dataset. As an alternative, the ``replicate`` strategy preallocates the entire set of shards to every worker and does not
change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset
independently of the others. Note, though, that more than one worker may sample the same shard, and may even sample the same data points!
As such, there is no guarantee that all samples in the dataset will be sampled at least once during one epoch. For these reasons,
it is not advisable to use tarred datasets as validation or test datasets.
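
For example, switching from the default strategy could look like the override below; this is a sketch that mirrors the ``tarred_shard_strategy`` field described above, not a recommendation for any particular setup.

.. code-block:: bash

    # Default is 'scatter'; 'replicate' gives every worker all shards,
    # at the cost of possible duplicate sampling across workers.
    model.train_ds.tarred_shard_strategy=replicate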

For more information about the individual tarred datasets and the parameters available, including shuffling options,
see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ section.
@@ -228,7 +234,7 @@ see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ secti
If using multiple workers, the number of shards should be divisible by the world size to ensure an even
split among workers. If it is not divisible, logging will give a warning and training will proceed, but it will likely hang at the last epoch.
In addition, if using distributed processing, each shard must have the same number of entries after filtering is
applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's
program may hang if the shards are uneven.

Conversion to Tarred Datasets
@@ -262,9 +268,9 @@ The files in the target directory should look similar to the following:
├── metadata.yaml
└── tarred_audio_manifest.json

Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
filenames are unique in the tarred dataset; the filepaths do not contain "-sub", and forward slashes in each ``audio_filepath`` are
simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.
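
The following is a rough bash sketch of that mapping, included purely for illustration; it is not the exact logic of the conversion script.

.. code-block:: bash

    # Illustrative only: strip "-sub" and turn forward slashes into underscores.
    path="/data/directory1/file.wav"
    flat="${path//-sub/}"
    flat="${flat//\//_}"
    echo "$flat"   # prints _data_directory1_file.wav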

Bucketing Datasets
@@ -325,9 +331,9 @@ Currently bucketing feature is just supported for tarred datasets.
Upsampling Datasets
-------------------

Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling.

Weighting is managed with the `bucketing_weights` parameter. After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training.

For example, by passing `[2,1,1,3]` to the code below:

@@ -363,7 +369,7 @@ If using adaptive bucketing, note that the same batch size will be assigned to e
model.train_ds.bucketing_weights=[2,1,1,3]
model.train_ds.bucketing_batch_size=[4,4,4,2]

All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size` to be set to 1.
If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter.

It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training.
Binary file added docs/source/asr/images/squeezeformer.png