main to ssl synthesis #9

Merged
57 commits merged on Jul 28, 2022
Commits
4638799
Megatron BART BOS / EOS bug fix (#4495)
michalivne Jul 6, 2022
ad61479
GPT Prompt Learning Improvements (#4496)
vadam5 Jul 6, 2022
ab6c46b
Megatron perceiver with tensor parallelism only (#4318)
MaximumEntropy Jul 6, 2022
4dd0e56
NMESC speaker counting algorithm update (#4500)
tango4j Jul 7, 2022
f21cf36
Fix dataset parameter typo on tacotron2 example yaml (#4471)
saarus72 Jul 7, 2022
cf95f93
Noam lr sched: do not force min_lr after max_steps (#4472)
alancucki Jul 7, 2022
1f97094
Refactor for punctuation model (#4367)
jubick1337 Jul 7, 2022
4d0bacb
bug fix - sample rate was being ignored in vocoder dataset when not l…
paarthneekhara Jul 7, 2022
2089016
Add ITN pt (#4516)
guidefloripa Jul 7, 2022
b4edbca
Fixed WER initialization in ASR_with_Nemo notebook (#4523)
anteju Jul 8, 2022
01f0422
Update cmudict (#4510)
jasro23 Jul 8, 2022
81df7a9
[Add] Support for Different LRs with Param Groups (#4508)
stevehuang52 Jul 8, 2022
c0f5bff
Weighted bucketing (#4474)
tbartley94 Jul 8, 2022
2af11fe
Add silence handling for speaker diarization pipeline (#4512)
nithinraok Jul 8, 2022
a8266e4
Fix runtime check (#4501)
borisfom Jul 8, 2022
726ad22
Update finetune label models (#4504)
nithinraok Jul 9, 2022
1428763
[ASR][Breaking Change] Update signature of Hypothesis alignments (#4511)
titu1994 Jul 9, 2022
309a81a
Weighted bucketing (#4530)
tbartley94 Jul 11, 2022
7f75191
Additional sentencepiece args - Byte fallback, split digits, split_on…
MaximumEntropy Jul 11, 2022
dc6bea2
Add support for ASR Adapter Auxiliary Losses (#4480)
titu1994 Jul 12, 2022
bc29ef2
update (#4520)
stevehuang52 Jul 12, 2022
b70ec73
fix duplex inference with grammars (#4517)
ekmb Jul 12, 2022
8e186eb
Add Bucketing support to TarredAudioToClassificationLabelDataset (#4465)
entn-at Jul 13, 2022
ff588a7
Add MTEncDec Finetune support (#4540)
aklife97 Jul 13, 2022
8b67ec6
Add nsys profiling (#4539)
ericharper Jul 13, 2022
4e43b7c
Update megatron prompt learning interface to dialogue (#4545)
Zhilin123 Jul 14, 2022
d66c8e9
Merge branch 'main' into main
XuesongYang Jul 14, 2022
7801639
remove the variable that is not used in the context. (#4547)
XuesongYang Jul 14, 2022
99c7661
update fastpitch to add export controls (#4509)
blisc Jul 14, 2022
fa2e55e
Adding multispeaker fastpitch and hifigan en model links to available…
subhankar-ghosh Jul 15, 2022
7d9b166
added MLM Scoring (#4476)
yzhang123 Jul 15, 2022
8abe0f4
Removed NLPDDPPlugin Import check (#4555)
vadam5 Jul 15, 2022
65b9b57
Add length ratio filtering script (#4551)
MaximumEntropy Jul 15, 2022
fea3775
Add Tokenization and Normalization pre-proecssing script for NMT (#4557)
aklife97 Jul 16, 2022
56694f0
handled n segments for a different sampling rate than original sampli…
paarthneekhara Jul 17, 2022
23a3496
Merge branch 'main' into main
paarthneekhara Jul 17, 2022
85fd5a9
Added case for n_segments 0, warning for n_segments greater than file…
paarthneekhara Jul 20, 2022
86fea2a
[Fix] Relative audio path in speech data explorer (#4570)
anteju Jul 21, 2022
e67c4ca
[Add] Catalan ASR NGC Resource (#4576)
stevehuang52 Jul 21, 2022
6442e33
Option to disregard document boundaries for t5, bart, ul2 (#4481)
MaximumEntropy Jul 22, 2022
2574f53
Merge branch 'NVIDIA:main' into main
paarthneekhara Jul 24, 2022
6b9617d
Integrating support for GPT/T5/BART for Question Answering (#4532)
ameyasm1154 Jul 25, 2022
468a3f3
add kw asr models, add itn ru checkpoint (tagger-based) (#4595)
bene-ges Jul 25, 2022
c324499
Add DALI pipeline to SSL model (#4592)
piraka9011 Jul 25, 2022
faf8ad8
divided parallel ci tests to reduce memory usage (#4600)
ameyasm1154 Jul 26, 2022
7890979
fix tarred dataset len when num shards is not divisible by workers (#…
itzsimpl Jul 26, 2022
5686fe2
[TTS][ASR] customize arguments for trimming the leading/trailing sile…
XuesongYang Jul 26, 2022
793cf48
Updating the default parameters in the example adapters config file (…
shan18 Jul 26, 2022
f1bf6c2
NeMo Megatron: Add sequence parallelism and selective activation che…
ericharper Jul 26, 2022
aa0a98c
Update Offline ASR with CTC Decoding (#4608)
titu1994 Jul 26, 2022
cbf3f66
normalize_batch error msg (#4614)
piraka9011 Jul 27, 2022
90ad5af
Support listing Hugging Face model info (#4619)
titu1994 Jul 27, 2022
2841c28
[TTS] Fix off-by-1 bug in Beta Binomial Prior (#4616)
rlangman Jul 28, 2022
16c96ba
Update diarization data loader to train meeting data (#4567)
tango4j Jul 28, 2022
96021f4
Add Squeezeformer to ASR (#4416)
titu1994 Jul 28, 2022
4f5ea8a
Merge branch 'NVIDIA:main' into main
paarthneekhara Jul 28, 2022
643002a
Merge branch 'ssl_synthesis' into main_ssl_merge
paarthneekhara Jul 28, 2022
299 changes: 209 additions & 90 deletions Jenkinsfile

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion README.rst
@@ -45,7 +45,7 @@ Key Features

* Speech processing
* `Automatic Speech Recognition (ASR) <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html>`_
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supported models: Jasper, QuartzNet, CitriNet, Conformer-CTC, Conformer-Transducer, Squeezeformer-CTC, Squeezeformer-Transducer, ContextNet, LSTM-Transducer (RNNT), LSTM-CTC, ...
* Supports CTC and Transducer/RNNT losses/decoders
* Beam Search decoding
* `Language Modelling for ASR <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html>`_: N-gram LM in fusion with Beam Search decoding, Neural Rescoring with Transformer
13 changes: 12 additions & 1 deletion docs/source/asr/asr_all.bib
@@ -1045,4 +1045,15 @@ @misc{ssl_inter
publisher = {arXiv},
year = {2021},
copyright = {arXiv.org perpetual, non-exclusive license}
}

@misc{kim2022squeezeformer,
doi = {10.48550/ARXIV.2206.00888},
url = {https://arxiv.org/abs/2206.00888},
author = {Kim, Sehoon and Gholami, Amir and Shaw, Albert and Lee, Nicholas and Mangalam, Karttikeya and Malik, Jitendra and Mahoney, Michael W. and Keutzer, Kurt},
keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {Squeezeformer: An Efficient Transformer for Automatic Speech Recognition},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
11 changes: 11 additions & 0 deletions docs/source/asr/configs.rst
@@ -502,6 +502,17 @@ specify the tokenizer if you want to use sub-word encoding instead of character-
The encoder section includes the details about the Conformer-CTC encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.ConformerEncoder<./api.html#nemo.collections.asr.modules.ConformerEncoder>`.

Squeezeformer-CTC
~~~~~~~~~~~~~~~~~

The config files for the Squeezeformer-CTC model contain character-based encoding and sub-word encoding at
``<NeMo_git_root>/examples/asr/conf/squeezeformer/squeezeformer_ctc_char.yaml`` and ``<NeMo_git_root>/examples/asr/conf/squeezeformer/squeezeformer_ctc_bpe.yaml``
respectively. The components of the `Squeezeformer-CTC <./models.html#Squeezeformer-CTC>`__ configs are similar to those of the `Conformer-CTC <./configs.html#Conformer-CTC>`__ configs.

The encoder section includes the details about the Squeezeformer-CTC encoder architecture. You may find more information in the
config files and also :doc:`nemo.collections.asr.modules.SqueezeformerEncoder<./api.html#nemo.collections.asr.modules.SqueezeformerEncoder>`.
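
As a quick, hedged illustration of how these configs might be used, the command below points one of the standard NeMo CTC example scripts at the Squeezeformer BPE config. The script path, override names, and data/tokenizer paths are assumptions based on the usual NeMo examples layout, not something specified in this change.

.. code-block:: bash

    # Sketch only: assumes the standard CTC-BPE example script and a pre-built
    # tokenizer directory; substitute your own checkout, manifests, and tokenizer.
    python examples/asr/asr_ctc/speech_to_text_ctc_bpe.py \
        --config-path=../conf/squeezeformer \
        --config-name=squeezeformer_ctc_bpe \
        model.train_ds.manifest_filepath=/data/train_manifest.json \
        model.validation_ds.manifest_filepath=/data/dev_manifest.json \
        model.tokenizer.dir=/data/tokenizer_spe_unigram_v1024 \
        model.tokenizer.type=bpe \
        trainer.devices=1 \
        trainer.max_epochs=100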


ContextNet
~~~~~~~~~~

3 changes: 3 additions & 0 deletions docs/source/asr/data/benchmark_rw.csv
@@ -0,0 +1,3 @@
Model,Model Base Class,Model Card
stt_rw_conformer_ctc_large,EncDecCTCModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_ctc_large"
stt_rw_conformer_transducer_large,EncDecRNNTBPEModel,"https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_rw_conformer_transducer_large"
3 changes: 3 additions & 0 deletions docs/source/asr/data/scores/rw/conformer_rw.csv
@@ -0,0 +1,3 @@
Model Name,Language,MCV Test-Set v9.0 (rw)
stt_rw_conformer_ctc_large,rw,18.22
stt_rw_conformer_transducer_large,rw,16.19
66 changes: 36 additions & 30 deletions docs/source/asr/datasets.rst
@@ -1,7 +1,7 @@
Datasets
========

NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started
with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below.

If the user has their own data and wants to preprocess it for use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section.
@@ -13,8 +13,8 @@ If the user already has a dataset that you want to convert to a tarred format, r
LibriSpeech
-----------

Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. At least 250GB free
space is required.

.. code-block:: bash

@@ -37,18 +37,18 @@ Fisher English Training Speech

Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection.

In brief, the following scripts convert the ``.sph`` files to ``.wav``, slice those files into smaller audio samples, match the
smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets
(with one manifest each).

.. note::
- 106 GB of space is required to run the ``.wav`` conversion
- additional 105 GB is required for the slicing and matching
- ``sph2pipe`` is required in order to run the ``.wav`` conversion

**Instructions**

The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure
that looks similar to the following:

.. code-block:: bash
@@ -67,7 +67,7 @@ that looks similar to the following:
├── fe_03_p2_sph3
└── ...

The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``)
are located in the remaining directories in an ``audio`` subdirectory.

#. Convert the audio files from ``.sph`` to ``.wav`` by running:
@@ -78,7 +78,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
python fisher_audio_to_wav.py \
--data_root=<fisher_root> --dest_root=<conversion_target_dir>

This will place the unsliced ``.wav`` files in ``<conversion_target_dir>/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several
minutes to run.

#. Process the transcripts and slice the audio data.
@@ -90,7 +90,7 @@ are located in the remaining directories in an ``audio`` subdirectory.
--dest_root=<processing_target_dir> \
--remove_noises

This script splits the full dataset into train, validation, test sets, and places the audio slices in the corresponding folders
in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path.

This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files.
@@ -100,8 +100,8 @@ are located in the remaining directories in an ``audio`` subdirectory.

Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection.

Similarly to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and
transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments
are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file.

.. note::
@@ -123,7 +123,7 @@ You can optionally include ``--min_slice_duration=<num_seconds>`` if you would l
AN4 Dataset
-----------

This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out
addresses, names, etc. Information about this dataset can be found on the `official CMU site <http://www.speech.cs.cmu.edu/databases/an4/>`_.

#. `Download and extract the dataset <http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz>`_ (which is labeled "NIST's Sphere audio (.sph) format (64M)").
@@ -153,14 +153,14 @@ After the script finishes, the ``data`` folder should contain a ``data_aishell``
Aishell-2
---------

To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push
these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run:

.. code-block:: bash

python process_aishell2_data.py --audio_folder=<data directory> --dest_folder=<destination directory>

After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory.

Preparing Custom ASR Data
-------------------------
@@ -171,7 +171,7 @@ The audio files can be of any format supported by `Pydub <https://github.com/jia
WAV files as they are the default and have been most thoroughly tested.

There should be one manifest file per dataset that will be passed in; therefore, if the user wants separate training and validation
datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice
versa.

Each line of the manifest should be in the following format:
@@ -210,16 +210,22 @@ of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single
``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below).

.. note::
For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly
inside SLURM scripts. Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``,
``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special
tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use.

As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length
of the manifest after filtering is the correct size of the dataset for reporting training progress.
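
As a hedged sketch, a tarred training set might be wired up with overrides like the following. The paths are illustrative only, and the field names mirror the dataset options described in this section; the ``_OP_``/``_CL_`` tags expand as explained in the note above.

.. code-block:: bash

    # Illustrative overrides for a tarred training dataset; adjust paths to your data.
    model.train_ds.is_tarred=true
    model.train_ds.tarred_audio_filepaths='/data/shard__OP_1..64_CL_.tar'
    model.train_ds.manifest_filepath=/data/tarred_audio_manifest.json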

The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with
multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime.
Note that this strategy, on specific occasions (when the number of shards is not divisible by ``world_size``), will not sample
the entire dataset. As an alternative, the ``replicate`` strategy preallocates the entire set of shards to every worker and does not
change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset
independently of the others. Note, though, that more than one worker may sample the same shard, and may even sample the same data points!
As such, there is no guarantee that all samples in the dataset will be sampled at least once during one epoch. For these reasons,
it is not advisable to use tarred datasets as validation or test datasets.
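
For example, switching from the default strategy could look like the override below; this is a sketch that mirrors the ``tarred_shard_strategy`` field described above, not a recommendation for any particular setup.

.. code-block:: bash

    # Default is 'scatter'; 'replicate' gives every worker all shards,
    # at the cost of possible duplicate sampling across workers.
    model.train_ds.tarred_shard_strategy=replicate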

For more information about the individual tarred datasets and the parameters available, including shuffling options,
see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ section.
@@ -228,7 +234,7 @@ see the corresponding class APIs in the `Datasets <./api.html#Datasets>`__ secti
If using multiple workers, the number of shards should be divisible by the world size to ensure an even
split among workers. If it is not divisible, logging will give a warning and training will proceed, but it will likely hang at the last epoch.
In addition, if using distributed processing, each shard must have the same number of entries after filtering is
applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's
program may hang if the shards are uneven.

Conversion to Tarred Datasets
@@ -262,9 +268,9 @@ The files in the target directory should look similar to the following:
├── metadata.yaml
└── tarred_audio_manifest.json

Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that
filenames are unique in the tarred dataset; the filepaths do not contain "-sub", and forward slashes in each ``audio_filepath`` are
simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav``
in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``.
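
The following is a rough bash sketch of that mapping, included purely for illustration; it is not the exact logic of the conversion script.

.. code-block:: bash

    # Illustrative only: strip "-sub" and turn forward slashes into underscores.
    path="/data/directory1/file.wav"
    flat="${path//-sub/}"
    flat="${flat//\//_}"
    echo "$flat"   # prints _data_directory1_file.wav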

Bucketing Datasets
@@ -325,9 +331,9 @@ Currently bucketing feature is just supported for tarred datasets.
Upsampling Datasets
-------------------

Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling.

Weighting is managed with the `bucketing_weights` parameter. After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training.

For example, by passing `[2,1,1,3]` to the code below:

@@ -363,7 +369,7 @@ If using adaptive bucketing, note that the same batch size will be assigned to e
model.train_ds.bucketing_weights=[2,1,1,3]
model.train_ds.bucketing_batch_size=[4,4,4,2]

All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size` to be set to 1.
If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter.

It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training.
Binary file added docs/source/asr/images/squeezeformer.png