By Divyanshu Kakwani

- Data Preparation:
    - Datasets used for training:  Detailed analysis of the training dataset can be [found here](https://maglev.nvda.ai/ide/redash/dashboard/speech-mlops-final-dataset-dashboard?p_database=jarvis_speech&p_language=DE&p_asr_set=DE_ASR_SET%202.0)
        - Note: the durations in brackets below are the durations after cleaning
        - Public datasets: MCV 7.0 (571 hours), MLS (1918 hours), VoxPopuli (214 hours)
        - Proprietary datasets: 2 Magic datasets (382 hours), 3 Speechocean datasets (447 hours)
        - Total duration (public + proprietary) = 3533 hours (the rounded per-dataset figures above add up to 3532)
        - I am not sure how much impact each additional dataset has.
    - Dataset ingestion (script converting the dataset to standard manifest format):  
        - [RIVA scripts here](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/tree/main/scripts/data_ingestion/datasets): You can find MLS, MCV, and VoxPopuli in their respective folders.
    - Normalization: Normalize the transcript text, the audio (sample rate, channels), and the metadata values
        - For text normalization, we use NeMo's code; if it is not available for a particular language, we develop simpler, makeshift normalization code within RIVA.
        - [RIVA's normalization here](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/tree/main/scripts/data_normalization/de)
        - [NeMo's text normalization here](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/de)
    - Run inference on this dataset using an existing/previous model: The predictions produced by this model are used in the filtering stage
        1. Metric Computation: Compute WER/CER w.r.t. the existing/previous model.
        2. Note that this step is carried out to filter out noisy samples. Since for public notebooks we may not have any existing/previous model, I think this stage can be skipped.
    - Filtering: filter out samples that are too long, too short, or empty, and samples with very high WER/CER w.r.t. our previous models
        1. [Data curation scripts here](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/tree/main/scripts/data_curation) 
    - Data Tarring and uploading to storage container:
        1. [Scripts here](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/tree/main/scripts/data_preparation), [NeMo scripts here](https://github.com/NVIDIA/NeMo/blob/main/scripts/speech_recognition/convert_to_tarred_audio_dataset.py)
        2. I think RIVA's scripts are built on top of NeMo's script, with some platform-specific customizations.
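
The metric-computation and filtering steps above can be sketched as follows. This is only a minimal illustration, not the code in the data curation scripts: it assumes a NeMo-style JSON-lines manifest with `audio_filepath`, `duration` and `text` fields, assumes the previous model's prediction is stored in a `pred_text` field, and uses made-up thresholds (`min_dur`, `max_dur`, `max_wer`) that would need tuning per dataset.

```python
import json


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sample_wer(ref_text, hyp_text):
    """Word error rate of one sample; ref_text must contain at least one word."""
    ref = ref_text.split()
    return edit_distance(ref, hyp_text.split()) / len(ref)


def keep_sample(entry, min_dur=0.5, max_dur=20.0, max_wer=0.75):
    """Thresholds are illustrative assumptions, not the values we use."""
    text = entry.get("text", "").strip()
    if not text:                                      # empty transcript
        return False
    if not (min_dur <= entry["duration"] <= max_dur):  # too short / too long
        return False
    pred = entry.get("pred_text")                     # assumed field name
    if pred is not None and sample_wer(text, pred) > max_wer:
        return False
    return True


def filter_manifest(lines, **thresholds):
    """Parse JSON-lines manifest entries and keep those passing every check."""
    entries = (json.loads(line) for line in lines if line.strip())
    return [e for e in entries if keep_sample(e, **thresholds)]
```

Samples without a `pred_text` field are kept, which matches the case where no previous model exists and the WER-based check is skipped.
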
- Model Training
    - Acoustic model:
        - Model architectures we use: Citrinet, Conformer
        - Training from scratch vs. fine-tuning: see [this blog](https://developer.nvidia.com/blog/jump-start-training-for-speech-recognition-models-with-nemo/) for an example. I am not aware of other experiments.
        - Hyperparameters: we document the hyperparameters we use in the [model release checklist](https://drive.google.com/drive/u/0/folders/1rCB7qWDkgNvVM5tNI48gmv4AuG43E1HP). I am not aware of a written guideline for choosing the hyperparameters, but some tips can be found in the [Citrinet paper](https://arxiv.org/pdf/2104.01721.pdf)
        - Training script: [Citrinet training config](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/blob/main/workflows/wf_asr_train.yaml), [Conformer training config](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/blob/main/workflows/wf_asr_train_conformer.yaml) and [training script](https://gitlab-master.nvidia.com/jarvis/speech-mlops/-/blob/main/scripts/model_training/asr_train.sh). It uses NeMo under the hood (the path to the script can be found in the TRAINING_SCRIPT variable in the config).
    - LM Model:
        - Model: KenLM
        - Training set: we create the training set by combining all the transcript text in our ASR set.
        - [Training Script](https://drive.google.com/drive/u/0/folders/1UzXMzS9uZjdjK-TCimugROROf2uVFAeu)
        - [Flashlight Decoder parameter tuning](https://gitlab-master.nvidia.com/dl/riva/riva-speech/-/tree/main/quickstart/asr_lm_tools) - this is part of the RIVA quickstart scripts
        - Impact: We generally get 1-2% WER improvement by using LMs.
    - P&C Model:
        - Model: BERT-based
        - [Training dataset](https://github.com/NVIDIA/NeMo/blob/feat/punc_tarred/examples/speech_translation/punct_ds_preparation/prepare_big_data_for_punctuation_capitalization_task_simple.py)
        - [Training Script](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py)
    - Training tools
        - Weights & Biases (wandb): for monitoring training
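
As an illustration of how the LM training set described above could be assembled, here is a minimal sketch that collects the transcript text from manifest entries into a one-sentence-per-line corpus. It is not the linked training script: the `text` field name assumes NeMo-style manifests, and the function names and the optional lowercasing are hypothetical choices.

```python
import json


def build_lm_corpus(manifest_lines, lowercase=False):
    """Collect one cleaned transcript per manifest entry."""
    corpus = []
    for line in manifest_lines:
        if not line.strip():
            continue
        text = json.loads(line).get("text", "")
        text = " ".join(text.split())   # collapse runs of whitespace
        if lowercase:                   # optional; should match the ASR vocabulary
            text = text.lower()
        if text:                        # skip empty transcripts
            corpus.append(text)
    return corpus


def write_corpus(corpus, path):
    """Write the corpus as UTF-8 text, one sentence per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(corpus) + "\n")
```

The resulting file can then be fed to KenLM, for example with `lmplz -o 4 < corpus.txt > lm.arpa`; the n-gram order of 4 here is only an example, not the value we use.
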
- Model Testing
    - Testing on a good evaluation set: We compile multiple test datasets to form our evaluation set. Examples [here](https://drive.google.com/drive/u/2/folders/1DgD2jkPQxt0rDCgNkqwC9kKKF-fk3mYa). Our eval set is internal only, so I think the user (of the tutorial notebook) can compile their own set of test datasets and benchmark the model in one of two ways:
        - Evaluate using NeMo: use the NeMo library, which covers offline mode only. I think NeMo provides a sample notebook for this.
        - Evaluate using RIVA: use the quickstart scripts to evaluate the model in streaming and offline modes. [Official documentation here](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html)
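
For a user-compiled evaluation set, benchmarking boils down to computing WER per test dataset and over all of them together. The sketch below shows one way to do that in plain Python, assuming the references and the model's transcripts are already available as text pairs. The function names and the example dataset names are hypothetical, and the overall figure is micro-averaged (total errors over total reference words), which is an assumption about how one might aggregate, not a description of our internal tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs: total errors / total reference words."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_tokens = ref.split()
        errors += edit_distance(ref_tokens, hyp.split())
        words += len(ref_tokens)
    return errors / words if words else float("nan")


def benchmark(test_sets):
    """Return WER per named test set, plus a micro-averaged 'overall' entry."""
    report = {name: corpus_wer(pairs) for name, pairs in test_sets.items()}
    all_pairs = [pair for pairs in test_sets.values() for pair in pairs]
    report["overall"] = corpus_wer(all_pairs)
    return report


# Example with two tiny, made-up test sets.
report = benchmark({
    "read_speech": [("guten morgen", "guten morgen"),
                    ("wie geht es dir", "wie geht es")],
    "spontaneous": [("ich komme morgen", "ich komme heute morgen")],
})
for name, wer in report.items():
    print(f"{name}: {wer:.2%}")
```

Micro-averaging weights each test set by its number of reference words, so a large dataset dominates the overall number; reporting the per-dataset values alongside it keeps small but important test sets visible.
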
- Deployment
    - Internal review process (just FYI): We benchmark models (performance: latency/throughput; accuracy: WER) to make sure the model is good to release. Once it passes the tests, it is published on NGC (all of the .nemo, .riva and .rmir files)
    - Now, given the final .nemo model, here are the steps to deploy it on RIVA:
        - Download the RIVA Quickstart scripts: they provide nemo2riva, servicemaker, riva-speech server and client images
        - Build .riva: use the nemo2riva command in the servicemaker container
        - Build RMIR: use the riva-build tool in the servicemaker container. The riva-build command can be found [here](https://gitlab-master.nvidia.com/dl/riva/riva-speech/-/blob/main/python/servicemaker/tests/asr_e2e_helper.sh)
        - Deploy the model and start the server
        - All of these steps are covered in the [riva-speech docs](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/index.html). Here is [another document](https://confluence.nvidia.com/display/AIJ/3.1+-+Evaluating+ASR+Models) I wrote that might help you.
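
To make the order of the deployment steps concrete, here is a small sketch that only assembles the three command lines (nemo2riva, riva-build, riva-deploy) without running anything. The file paths, the model repository directory and the encryption key are placeholders, and the ASR-specific riva-build flags (decoder, streaming settings and so on) are deliberately left out; take those from the linked helper script and the official docs.

```python
import shlex


def riva_deploy_commands(nemo_path, riva_path, rmir_path, model_repo, key):
    """Return the deployment commands as argv lists; nothing is executed.

    All arguments are placeholders chosen by the caller. The commands are
    meant to be run inside the servicemaker container.
    """
    return [
        # 1. Convert the .nemo checkpoint to a .riva file.
        ["nemo2riva", "--out", riva_path, "--key", key, nemo_path],
        # 2. Build the RMIR; ASR-specific flags are omitted in this sketch.
        ["riva-build", "speech_recognition",
         f"{rmir_path}:{key}", f"{riva_path}:{key}"],
        # 3. Deploy the RMIR into the model repository used by the server.
        ["riva-deploy", f"{rmir_path}:{key}", model_repo],
    ]


# Print the commands for a hypothetical German Citrinet model.
for argv in riva_deploy_commands(
        "de_citrinet.nemo", "de_citrinet.riva", "de_citrinet.rmir",
        "/data/models", "tlt_encode"):
    print(shlex.join(argv))
```

After these three commands, the server is started with the quickstart scripts as described in the steps above.
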
