<a href="https://colab.research.google.com/github/nanekeshishyan/speech-tech-rau/blob/main/hw2_nane_keshishyan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This is the task for the second homework (HW2)**

The goal is to obtain the Armenian MCV (Mozilla Common Voice) dataset and train an Armenian ASR model.

The quality metric is WER (word error rate) on the Armenian MCV test subset.
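As a rough illustration of the metric, here is a minimal pure-Python sketch of corpus-level WER: the word-level edit distance summed over all utterances, divided by the total number of reference words. The helper names `edit_distance` and `corpus_wer` are hypothetical and independent of NeMo's `word_error_rate`, which the notebook imports later.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_wer(references, hypotheses):
    """Total word errors divided by total reference words."""
    errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return errors / total_words
```

Because the score is computed on whole words, differences in case or punctuation between reference and hypothesis count as errors.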

In [None]:
"""
You can either run this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect


NOTE: User is responsible for checking the content of datasets and the applicable licenses and determining if suitable for the intended use.
"""

# Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg libsox-fmt-mp3
!pip install text-unidecode
!pip install "matplotlib>=3.3.2"

## Install NeMo
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

In [None]:
import os
import glob
import subprocess
import tarfile
import wget
import copy
from omegaconf import OmegaConf, open_dict


In [None]:
data_dir = 'datasets/'

if not os.path.exists(data_dir):
  os.makedirs(data_dir, exist_ok=True)

if not os.path.exists("scripts"):
  os.makedirs("scripts")

import nemo
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.metrics.wer import word_error_rate
from nemo.utils import logging, exp_manager

**Download dataset**

We will use the NeMo script in the scripts directory to download and prepare the Mozilla Common Voice (MCV) dataset for Armenian.

The data preparation script will download the audio files and respective transcripts and then process the audio into mono-channel 16 kHz wave files that can be easily used for training ASR models.
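To sanity-check the prepared data, a manifest can be summarized with a few lines of Python. This is a minimal sketch that assumes the usual NeMo manifest layout: one JSON object per line with at least a `duration` field in seconds, alongside `audio_filepath` and `text`. The function name `summarize_manifest` is hypothetical.

```python
import json


def summarize_manifest(path):
    """Return (number of utterances, total hours) for a JSON-lines manifest."""
    count = 0
    seconds = 0.0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            entry = json.loads(line)
            count += 1
            seconds += float(entry["duration"])
    return count, seconds / 3600.0
```

Calling it on each split's manifest once the conversion has finished shows how many utterances and hours are available for training and evaluation.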

**Hugging Face**

Now, let's download the Mozilla Common Voice Armenian dataset. The cells below fetch every split (train, validation, test, other, and invalidated). For strong results, you would likely need other datasets too, bringing the total to over 1k hours.

Website steps:

Visit https://huggingface.co/settings/profile

Open "Access Tokens" in the list of items.

Create a new token: give it a name; "read" access is sufficient.

SAVE THAT TOKEN (API KEY). You will paste it in the next step.

Visit the HuggingFace Dataset page for [Mozilla Common Voice 16.1](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1)

There should be a section that asks you for your approval.

Make sure you are logged in and then read that agreement.

If and only if you agree to the text, then accept the terms.

Code steps:

* Run login() in the cell below.

* Paste your saved HF token (API key) into the text box.

In [None]:
from huggingface_hub import login
login()


In [None]:
VERSION = "mozilla-foundation/common_voice_16_1"
LANGUAGE = "hy-AM"

In [None]:
tokenizer_dir = os.path.join('tokenizers', LANGUAGE)
manifest_dir = os.path.join('datasets', LANGUAGE, VERSION, LANGUAGE)

In [None]:
# If something goes wrong during data processing, un-comment the following line to delete the cached dataset
# !rm -rf datasets/$LANGUAGE
!mkdir -p datasets

The following cells will download the Armenian MCV corpus, preprocess the audio, and prepare manifest files that can be used directly by NeMo models.

We will use the convert_hf_dataset_to_nemo.py script, which lives in the scripts/speech_recognition directory of the NeMo repo if you have cloned it.

In [None]:
if not os.path.exists("convert_hf_dataset_to_nemo.py"):
    !wget https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/speech_recognition/convert_hf_dataset_to_nemo.py

In [None]:
!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/$LANGUAGE \
    path=$VERSION \
    name=$LANGUAGE \
    split="train" \
    ensure_ascii=False \
    use_auth_token=True

!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/$LANGUAGE \
    path=$VERSION \
    name=$LANGUAGE \
    split="validation" \
    ensure_ascii=False \
    use_auth_token=True

!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/$LANGUAGE \
    path=$VERSION \
    name=$LANGUAGE \
    split="test" \
    ensure_ascii=False \
    use_auth_token=True

!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/$LANGUAGE \
    path=$VERSION \
    name=$LANGUAGE \
    split="other" \
    ensure_ascii=False \
    use_auth_token=True

!python convert_hf_dataset_to_nemo.py \
    output_dir=datasets/$LANGUAGE \
    path=$VERSION \
    name=$LANGUAGE \
    split="invalidated" \
    ensure_ascii=False \
    use_auth_token=True

The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_name='hfds_config', config_path=None)
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
Downloading builder script: 100% 8.17k/8.17k [00:00<00:00, 31.2MB/s]
Downloading readme: 100% 12.3k/12.3k [00:00<00:00, 33.3MB/s]
Downloading extra modules: 100% 3.74k/3.74k [00:00<00:00, 20.4MB/s]
Downloading extra modules: 100% 77.3k/77.3k [00:00<00:00, 36.1MB/s]
Downloading data: 100% 14.6k/14.6k [00:00<00:00, 22.9MB/s]
Downloading data: 100% 123M/123M [00:01<00:00, 89.7MB/s]
Downloading data: 100% 89.6M/89.6M [00:01<00:00, 84.2MB/s]
Downloading data: 100% 101M/101M [00:01<00:0

In [None]:
train_manifest = f"{manifest_dir}/train/train_mozilla-foundation_common_voice_16_1_manifest.json"
dev_manifest = f"{manifest_dir}/validation/validation_mozilla-foundation_common_voice_16_1_manifest.json"
test_manifest = f"{manifest_dir}/test/test_mozilla-foundation_common_voice_16_1_manifest.json"
other_manifest = f"{manifest_dir}/other/other_mozilla-foundation_common_voice_16_1_manifest.json"
invalidated_manifest = f"{manifest_dir}/invalidated/invalidated_mozilla-foundation_common_voice_16_1_manifest.json"

In [None]:
train_manifest_full = f"{manifest_dir}/train_full_mozilla-foundation_common_voice_16_1_manifest.json"
!cat $train_manifest $other_manifest $invalidated_manifest > $train_manifest_full

**Hint**: Convert texts to lowercase and remove punctuation to improve WER.
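One way to apply this hint to the data itself is to normalize the `text` field of every manifest entry, so that training and evaluation transcripts match what the tokenizer sees. The sketch below is an assumption-laden illustration, not part of NeMo: `normalize_text` and `normalize_manifest` are hypothetical helpers, and the rule (lowercase, drop everything that is not a word character or whitespace) mirrors the regex used elsewhere in this notebook. Armenian punctuation such as `։` is removed by the same rule, while underscores are kept because `\w` matches them.

```python
import json
import re

_PUNCT = re.compile(r"[^\w\s]")   # anything that is not a letter, digit, underscore or space
_SPACES = re.compile(r"\s+")


def normalize_text(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = _PUNCT.sub("", text.lower())
    return _SPACES.sub(" ", text).strip()


def normalize_manifest(src_path, dst_path):
    """Write a copy of a JSON-lines manifest with normalized transcripts."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue
            entry = json.loads(line)
            entry["text"] = normalize_text(entry["text"])
            dst.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

Note that punctuation is deleted rather than replaced by a space, so hyphenated words are joined; swap the replacement string for `" "` if that is not desired.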

In [None]:
if not os.path.exists("scripts/process_asr_text_tokenizer.py"):
  !wget -P scripts/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/scripts/tokenizers/process_asr_text_tokenizer.py


--2024-04-03 13:25:39--  https://raw.githubusercontent.com/NVIDIA/NeMo/main/scripts/tokenizers/process_asr_text_tokenizer.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16631 (16K) [text/plain]
Saving to: ‘scripts/process_asr_text_tokenizer.py’


2024-04-03 13:25:39 (131 MB/s) - ‘scripts/process_asr_text_tokenizer.py’ saved [16631/16631]



In [None]:
import re

# Patch the downloaded script so that transcripts are lowercased and punctuation is removed
script_path = "scripts/process_asr_text_tokenizer.py"

with open(script_path, 'r') as file:
  script = file.read()

modified_script = re.sub(
    r'line\["text"\].strip\(\)',
    r're.sub(r"[^\\w\\s]", "", line["text"].lower().strip())',
    script)

with open(script_path, 'w') as file:
  file.write(modified_script)

**Hint**: Play with `VOCAB_SIZE` to improve WER.
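When choosing a value, it can help to know how many distinct characters the transcripts contain: with full character coverage, the vocabulary presumably needs at least one entry per character, so that count is a rough lower bound. The sketch below is only an illustration; `character_inventory` is a hypothetical helper, and the texts would come from the `text` fields of the manifests.

```python
from collections import Counter


def character_inventory(texts):
    """Count how often each non-whitespace character occurs in the transcripts."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if not ch.isspace())
    return counts
```

For example, `len(character_inventory(texts))` gives the number of distinct characters, and `character_inventory(texts).most_common()` reveals rare symbols that may be worth normalizing away before training the tokenizer.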

In [None]:
TOKENIZER_TYPE = "bpe" # "bpe", "unigram"
VOCAB_SIZE = 128 + 4

In [None]:
!python scripts/process_asr_text_tokenizer.py \
  --manifest=$train_manifest_full,$dev_manifest \
  --vocab_size=$VOCAB_SIZE \
  --data_root=$tokenizer_dir \
  --tokenizer="spe" \
  --spe_type=$TOKENIZER_TYPE \
  --spe_character_coverage=1.0 \
  --no_lower_case \
  --log

INFO:root:Finished extracting manifest : datasets/hy-AM/mozilla-foundation/common_voice_16_1/hy-AM/train_full_mozilla-foundation_common_voice_16_1_manifest.json
INFO:root:Finished extracting manifest : datasets/hy-AM/mozilla-foundation/common_voice_16_1/hy-AM/validation/validation_mozilla-foundation_common_voice_16_1_manifest.json
INFO:root:Finished extracting all manifests ! Number of sentences : 12707
[NeMo I 2024-04-03 13:30:35 sentencepiece_tokenizer:317] Processing tokenizers/hy-AM/text_corpus/document.txt and store at tokenizers/hy-AM/tokenizer_spe_bpe_v132
sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=tokenizers/hy-AM/text_corpus/document.txt --model_prefix=tokenizers/hy-AM/tokenizer_spe_bpe_v132/tokenizer --vocab_size=132 --shuffle_input_sentence=true --hard_vocab_limit=false --model_type=bpe --character_coverage=1.0 --bos_id=-1 --eos_id=-1
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: tokenizers/hy-AM/text_corpus/docu

**Hint**: Try different models.

In [None]:
model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_ctc_large", map_location='cpu')
#print(nemo_asr.models.EncDecCTCModel.list_available_models())
#model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En", map_location='cpu')

[NeMo I 2024-04-03 13:59:23 cloud:58] Found existing object /root/.cache/torch/NeMo/NeMo_1.23.0rc0/stt_en_fastconformer_ctc_large/00a071a9dac048acc3aeea942b0bfa40/stt_en_fastconformer_ctc_large.nemo.
[NeMo I 2024-04-03 13:59:23 cloud:64] Re-using file from: /root/.cache/torch/NeMo/NeMo_1.23.0rc0/stt_en_fastconformer_ctc_large/00a071a9dac048acc3aeea942b0bfa40/stt_en_fastconformer_ctc_large.nemo
[NeMo I 2024-04-03 13:59:23 common:815] Instantiating model from pre-trained checkpoint
[NeMo I 2024-04-03 13:59:24 mixins:172] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-04-03 13:59:25 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 1
    shuffle: true
    num_workers: 8
    pin_memory: true
    use_start_end_token: false
    trim_silence: false
    max_duration: 20
    min_duration: 0.1
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
    
[NeMo W 2024-04-03 13:59:25 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    num_workers: 8
    pin_m

[NeMo I 2024-04-03 13:59:25 features:289] PADDING: 0
[NeMo I 2024-04-03 13:59:27 save_restore_connector:263] Model EncDecCTCModelBPE was successfully restored from /root/.cache/torch/NeMo/NeMo_1.23.0rc0/stt_en_fastconformer_ctc_large/00a071a9dac048acc3aeea942b0bfa40/stt_en_fastconformer_ctc_large.nemo.


In [None]:
import torch
import torch.nn as nn

freeze_encoder = True  # set to False to fine-tune the encoder as well

def enable_bn_se(m):
    if type(m) == nn.BatchNorm1d:
        m.train()
        for param in m.parameters():
            param.requires_grad_(True)

    if 'SqueezeExcite' in type(m).__name__:
        m.train()
        for param in m.parameters():
            param.requires_grad_(True)

if freeze_encoder:
  model.encoder.freeze()
  model.encoder.apply(enable_bn_se)
  logging.info("Model encoder has been frozen")
else:
  model.encoder.unfreeze()
  logging.info("Model encoder has been un-frozen")

[NeMo I 2024-04-03 13:59:30 <ipython-input-46-180610cd99a3>:20] Model encoder has been frozen


In [None]:
TOKENIZER_DIR = os.path.join(tokenizer_dir, f"tokenizer_spe_{TOKENIZER_TYPE}_v{VOCAB_SIZE}")

# Replace the English tokenizer with the Armenian one and rebuild the decoder for the new vocabulary.
# NeMo uses the type "bpe" for any SentencePiece tokenizer, whether it was trained as BPE or unigram.
model.change_vocabulary(new_tokenizer_dir=TOKENIZER_DIR, new_tokenizer_type="bpe")

In [None]:
cfg = copy.deepcopy(model.cfg)

# Setup new tokenizer
cfg.tokenizer.dir = TOKENIZER_DIR
cfg.tokenizer.type = "bpe"

# Set tokenizer config
model.cfg.tokenizer = cfg.tokenizer

In [None]:
# Inspect the current training dataset config
print(OmegaConf.to_yaml(cfg.train_ds))

In [None]:
# Setup train, validation, test configs
with open_dict(cfg):
  # Train dataset
  cfg.train_ds.manifest_filepath = f"{train_manifest_full},{dev_manifest}"
  cfg.train_ds.batch_size = 32
  cfg.train_ds.num_workers = 8
  cfg.train_ds.pin_memory = True
  cfg.train_ds.use_start_end_token = False
  cfg.train_ds.trim_silence = True

  # Validation dataset
  cfg.validation_ds.manifest_filepath = test_manifest
  cfg.validation_ds.batch_size = 8
  cfg.validation_ds.num_workers = 8
  cfg.validation_ds.pin_memory = True
  cfg.validation_ds.use_start_end_token = False
  cfg.validation_ds.trim_silence = True

  # Test dataset
  cfg.test_ds.manifest_filepath = test_manifest
  cfg.test_ds.batch_size = 8
  cfg.test_ds.num_workers = 8
  cfg.test_ds.pin_memory = True
  cfg.test_ds.use_start_end_token = False
  cfg.test_ds.trim_silence = True

In [None]:
# setup model with new configs
model.setup_training_data(cfg.train_ds)
model.setup_multiple_validation_data(cfg.validation_ds)
model.setup_multiple_test_data(cfg.test_ds)

[NeMo I 2024-04-03 13:59:44 collections:196] Dataset loaded with 12702 files totalling 18.82 hours
[NeMo I 2024-04-03 13:59:44 collections:197] 5 files were filtered totalling 0.13 hours


    


[NeMo I 2024-04-03 13:59:45 collections:196] Dataset loaded with 2853 files totalling 4.55 hours
[NeMo I 2024-04-03 13:59:45 collections:197] 0 files were filtered totalling 0.00 hours
[NeMo I 2024-04-03 13:59:45 collections:196] Dataset loaded with 2853 files totalling 4.55 hours
[NeMo I 2024-04-03 13:59:45 collections:197] 0 files were filtered totalling 0.00 hours


In [None]:
print(OmegaConf.to_yaml(cfg.optim))

name: adamw
lr: 0.001
betas:
- 0.9
- 0.98
weight_decay: 0.001
sched:
  name: CosineAnnealing
  warmup_steps: 15000
  warmup_ratio: null
  min_lr: 0.0001



In [None]:
with open_dict(model.cfg.optim):
  model.cfg.optim.lr = 0.025
  model.cfg.optim.weight_decay = 0.001
  model.cfg.optim.sched.warmup_steps = None  # Remove default number of steps of warmup
  model.cfg.optim.sched.warmup_ratio = 0.10  # 10 % warmup
  model.cfg.optim.sched.min_lr = 1e-9

with open_dict(model.cfg.spec_augment):
  model.cfg.spec_augment.freq_masks = 2
  model.cfg.spec_augment.freq_width = 25
  model.cfg.spec_augment.time_masks = 10
  model.cfg.spec_augment.time_width = 0.05

model.spec_augmentation = model.from_config_dict(model.cfg.spec_augment)
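To get a feel for what the optimizer settings above imply, here is a minimal sketch of a linear warmup followed by cosine annealing. It is an approximation for illustration only and is not NeMo's `CosineAnnealing` implementation; the function name `warmup_cosine_lr` and the `total_steps` argument are hypothetical, while the default values are the ones set in the cell above.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=0.025, warmup_ratio=0.10, min_lr=1e-9):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up linearly towards base_lr during warmup
        return base_lr * (step + 1) / (warmup_steps + 1)
    # Fraction of the post-warmup phase that has elapsed, clipped to [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, for instance, the first 100 steps are warmup, the rate peaks at `base_lr` at step 100, and it decays to `min_lr` by the final step.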

In [None]:
use_cer = False
log_prediction = True

model.wer.use_cer = use_cer
model.wer.log_prediction = log_prediction

In [None]:
import torch
import pytorch_lightning as ptl

if torch.cuda.is_available():
  accelerator = 'gpu'
else:
  accelerator = 'cpu'

EPOCHS = 50  # will take approximately 4 hours

trainer = ptl.Trainer(devices=1,
                      accelerator=accelerator,
                      max_epochs=EPOCHS,
                      accumulate_grad_batches=1,
                      enable_checkpointing=False,
                      logger=False,
                      log_every_n_steps=5,
                      check_val_every_n_epoch=10)

# Setup model with the trainer
model.set_trainer(trainer)

# finally, update the model's internal config
model.cfg = model._cfg

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


In [None]:
from nemo.utils import exp_manager

# Environment variable generally used for multi-node multi-gpu training.
# In notebook environments, this flag is unnecessary and can cause logs of multiple training runs to overwrite each other.
os.environ.pop('NEMO_EXPM_VERSION', None)

config = exp_manager.ExpManagerConfig(
    exp_dir=f'experiments/lang-{LANGUAGE}/',
    name=f"ASR-Model-Language-{LANGUAGE}",
    checkpoint_callback_params=exp_manager.CallbackParams(
        monitor="val_wer",
        mode="min",
        always_save_nemo=True,
        save_best_model=True,
    ),
)

config = OmegaConf.structured(config)

logdir = exp_manager.exp_manager(trainer, config)

[NeMo I 2024-04-03 14:00:07 exp_manager:396] Experiments will be logged at experiments/lang-hy-AM/ASR-Model-Language-hy-AM/2024-04-03_14-00-07
[NeMo I 2024-04-03 14:00:07 exp_manager:856] TensorboardLogger has been set up


In [None]:
try:
  from google import colab
  COLAB_ENV = True
except (ImportError, ModuleNotFoundError):
  COLAB_ENV = False

# Load the TensorBoard notebook extension
if COLAB_ENV:
  %load_ext tensorboard
  %tensorboard --logdir /content/experiments/lang-$LANGUAGE/ASR-Model-Language-$LANGUAGE/
else:
  print("To use tensorboard, please use this notebook in a Google Colab environment.")

In [None]:
%%time
trainer.fit(model)

Please save and download your model.

In [None]:
save_path = f"Model-{LANGUAGE}.nemo"
model.save_to(save_path)
print(f"Model saved at path : {os.path.join(os.getcwd(), save_path)}")

Model saved at path : /content/Model-hy-AM.nemo
