<a href="https://colab.research.google.com/github/paschalk/Nemo-Swahili-ASR/blob/main/NemoSTT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
"""
You can run either this notebook locally (if you have all the dependencies and a GPU) or on Google Colab.

Instructions for setting up Colab are as follows:
1. Open a new Python 3 notebook.
2. Import this notebook from GitHub (File -> Upload Notebook -> "GITHUB" tab -> copy/paste GitHub URL)
3. Connect to an instance with a GPU (Runtime -> Change runtime type -> select "GPU" for hardware accelerator)
4. Run this cell to set up dependencies.
5. Restart the runtime (Runtime -> Restart Runtime) for any upgraded packages to take effect
"""
# If you're using Google Colab and not running locally, run this cell.

## Install dependencies
!pip install wget
!apt-get install sox libsndfile1 ffmpeg
!pip install text-unidecode
!pip install matplotlib>=3.3.2

## Install NeMo
BRANCH = 'r1.15.0'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]

"""
Remember to restart the runtime for the kernel to pick up any upgraded packages (e.g. matplotlib)!
Alternatively, you can uncomment the exit() below to crash and restart the kernel, in the case
that you want to use the "Run All Cells" (or similar) option.
"""
# exit()

In [2]:
pip install jiwer

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Taking a look at our data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import os
# This is where the an4/ directory will be placed.
# Change this if you don't want the data to be extracted in the current directory.
data_dir = '.'

if not os.path.exists(data_dir):
  os.makedirs(data_dir)

### Example

In [None]:
# Example file to check data

import librosa
import IPython.display as ipd

# Load and listen to the audio file
example_file = '/content/drive/Shareddrives/IMARIKA/train_wavs/common_voice_sw_29914942.wav'
audio, sample_rate = librosa.load(example_file)

ipd.Audio(example_file, rate=sample_rate)

KeyboardInterrupt: ignored

#### Plot the waveform

In [None]:
%matplotlib inline
import librosa.display
import matplotlib.pyplot as plt

# Plot our example audio file's waveform
plt.rcParams['figure.figsize'] = (15,7)
plt.title('Waveform of Audio Example')
plt.ylabel('Amplitude')

_ = librosa.display.waveshow(audio)

#### Spectogram
Apply Fourier Transform on the signal 

In [None]:
import numpy as np

# Get spectrogram using Librosa's Short-Time Fourier Transform (stft)
spec = np.abs(librosa.stft(audio))
spec_db = librosa.amplitude_to_db(spec, ref=np.max)  # Decibels

# Use log scale to view frequencies
librosa.display.specshow(spec_db, y_axis='log', x_axis='time')
plt.colorbar()
plt.title('Audio Spectrogram');

#### Mel Spectogram
- Perceptual scale of pitches judged by listeners to be equla in distance from one another
- The mel scale normalizes this such that equal distances sound like equal differences to the human ear


In [None]:
# Generate Mel Spectogram for the sample audio

n_fft = 2048
hop_length = 512
n_mels = 128
mel_spec = librosa.feature.melspectrogram(
    y=audio, sr=sample_rate, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

# Display Mel spectrogram
librosa.display.specshow(mel_spec_db, x_axis='time', y_axis='mel')
plt.colorbar()
plt.title('Mel Spectrogram')
plt.show()

### Training from Scratch

#### Using Nemo Model(STT En Citrinet 1024)

In [5]:
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_conformer_ctc_large")



[NeMo W 2023-05-19 06:19:25 optimizers:55] Apex was not found. Using the lamb or fused_adam optimizer will error out.
[NeMo W 2023-05-19 06:19:26 experimental:27] Module <class 'nemo.collections.asr.models.audio_to_audio_model.AudioToAudioModel'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-19 06:19:28 experimental:27] Module <class 'nemo.collections.asr.modules.audio_modules.SpectrogramToMultichannelFeatures'> is experimental, not ready for production and is not fully supported. Use at your own risk.
    
[NeMo W 2023-05-19 06:19:29 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.BaseAudioDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2023-05-19 06:19:29 experimental:27] Module <class 'nemo.collections.asr.data.audio_to_audio.AudioToTargetDataset'> is experimental, not ready for production and is not fully supported. Use at your own ris

[NeMo I 2023-05-19 06:19:29 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_conformer_ctc_large/versions/1.10.0/files/stt_en_conformer_ctc_large.nemo to /root/.cache/torch/NeMo/NeMo_1.15.0/stt_en_conformer_ctc_large/afb212c5bcf904e326b5e5751e7c7465/stt_en_conformer_ctc_large.nemo
[NeMo I 2023-05-19 06:19:59 common:913] Instantiating model from pre-trained checkpoint
[NeMo I 2023-05-19 06:20:01 mixins:170] Tokenizer SentencePieceTokenizer initialized with 128 tokens


[NeMo W 2023-05-19 06:20:01 modelPT:156] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath:
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket1/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket2/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket3/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket4/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket5/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket6/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket7/tarred_audio_manifest.json
    - - /data2/nemo_asr/nemo_asr_set_3.0/bucket8/tarred_audio_manifest.json
    sample_rate: 16000
    batch_size: 1
    shuffle: true
    num_workers: 4
    pin_memory: true
    use_start_end_token: false
    trim_

[NeMo I 2023-05-19 06:20:01 features:267] PADDING: 0
[NeMo I 2023-05-19 06:20:04 save_restore_connector:243] Model EncDecCTCModelBPE was successfully restored from /root/.cache/torch/NeMo/NeMo_1.15.0/stt_en_conformer_ctc_large/afb212c5bcf904e326b5e5751e7c7465/stt_en_conformer_ctc_large.nemo.


In [6]:
# Transcription before retraining

files = ['/content/drive/Shareddrives/IMARIKA/train_wavs/common_voice_sw_29914944.wav']
for fname, transcription in zip(files, asr_model.transcribe(paths2audio_files=files)):
  print(f"Prediction: {transcription}")

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Prediction: higher in na pass kutons were nakutumio i pa savio orkutembel era


#### Create Data Manifests
- NeMo data sets take in a standardized manifest format where each line corresponds to one sample of audio, such that the number of lines in a manifest is equal to the number of samples that are represented by that manifest. A line must contain the path to an audio file, the corresponding transcript (or path to a transcript file), and the duration of the audio sample.
- Here's an example of what one line in a NeMo-compatible manifest might look like:
```
{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"}
```

In [7]:
data_dir = '.'

if not os.path.exists(data_dir):
  os.makedirs(data_dir)

In [None]:
import os
import json
import pandas as pd
import librosa

# Function to build a manifest
def build_manifest(transcripts_path, manifest_path, wav_path):
    df = pd.read_csv(transcripts_path, sep='\t')
    with open(manifest_path, 'w') as fout:
        for _, row in df.iterrows():
            transcript = row['sentence'].lower().strip()
            file_id = row['path']
            file_id = os.path.splitext(file_id)[0]
            file_id = file_id + '.wav'
            audio_path = os.path.join(wav_path, file_id)
            duration = librosa.get_duration(path=audio_path)

            # Write the metadata to the manifest
            metadata = {
                "audio_filepath": audio_path,
                "duration": duration,
                "text": transcript
            }
            json.dump(metadata, fout)
            fout.write('\n')

In [None]:
# Building Manifests

train_transcripts = '/content/drive/Shareddrives/IMARIKA/train.tsv'
train_manifest = '/content/train_manifest.json'
if not os.path.isfile(train_manifest): 
    build_manifest(train_transcripts, train_manifest, '/content/drive/Shareddrives/IMARIKA/train_wavs')
    print("Training manifest created.")


# test_transcripts = '/content/drive/Shareddrives/IMARIKA/test.tsv'
# test_manifest = '/content/test_manifest.json'
# if not os.path.isfile(test_manifest):
#     build_manifest(test_transcripts, test_manifest, '/content/drive/Shareddrives/IMARIKA/test_wavs')
#     print("Test manifest created.")
# print("***Done***")



***Done***


In [8]:
import os
import json
import pandas as pd
import librosa

# Function to build a manifest
def build_manifest(transcripts_path, manifest_path, wav_path, num_samples=None):
    df = pd.read_csv(transcripts_path, sep='\t')

    if num_samples is not None:
        df = df.head(num_samples)

    with open(manifest_path, 'w') as fout:
        for _, row in df.iterrows():
            transcript = row['sentence'].lower().strip()
            file_id = row['path']
            file_id = os.path.splitext(file_id)[0]
            file_id = file_id + '.wav'
            audio_path = os.path.join(wav_path, file_id)
            duration = librosa.get_duration(path=audio_path)

            # Write the metadata to the manifest
            metadata = {
                "audio_filepath": audio_path,
                "duration": duration,
                "text": transcript
            }
            json.dump(metadata, fout)
            fout.write('\n')


In [9]:
# Building Manifests

train_transcripts = '/content/drive/Shareddrives/IMARIKA/train.tsv'
train_manifest = '/content/train_manifest.json'
if not os.path.isfile(train_manifest): 
    build_manifest(train_transcripts, train_manifest, '/content/drive/Shareddrives/IMARIKA/train_wavs', num_samples=2100)
    print("Training manifest created.")


test_transcripts = '/content/drive/Shareddrives/IMARIKA/test.tsv'
test_manifest = '/content/test_manifest.json'
if not os.path.isfile(test_manifest):
    build_manifest(test_transcripts, test_manifest, '/content/drive/Shareddrives/IMARIKA/test_wavs', num_samples=800)
    print("Test manifest created.")
print("***Done***")

Training manifest created.
Test manifest created.
***Done***


#### Specify Model using YAML config file

In [11]:
# --- Config Information ---#

try:
    from ruamel.yaml import YAML
except ModuleNotFoundError:
    from ruamel_yaml import YAML
config_path = './configs/config.yaml'

if not os.path.exists(config_path):
    # Grab the config we'll use in this example
    BRANCH = 'r1.15.0'
    !mkdir configs
    !wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml

yaml = YAML(typ='safe')
with open(config_path) as f:
    params = yaml.load(f)
print(params)

{'name': 'QuartzNet15x5', 'sample_rate': 16000, 'repeat': 1, 'dropout': 0.0, 'separable': True, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'model': {'train_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'batch_size': 32, 'trim_silence': True, 'max_duration': 16.7, 'shuffle': True, 'num_workers': 8, 'pin_memory': True, 'is_tarred': False, 'tarred_audio_filepaths': None, 'shuffle_n': 2048, 'bucketing_strategy': 'synced_randomized', 'bucketing_batch_size': None}, 'validation_ds': {'manifest_filepath': '???', 'sample_rate': 16000, 'labels': [' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', "'"], 'batch_size': 32, 'shuffle': False, 'num_work

#### Training with PyTorch

In [13]:
import pytorch_lightning as pl
trainer = pl.Trainer(devices=1, accelerator='gpu', max_epochs=50)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Next, we instatiate our ASR model based on the configured YAML file

In [14]:
from omegaconf import DictConfig
params['model']['train_ds']['manifest_filepath'] = train_manifest
params['model']['validation_ds']['manifest_filepath'] = test_manifest


In [17]:
first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)

[NeMo I 2023-05-19 07:52:57 audio_to_text_dataset:43] Model level config does not contain `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2023-05-19 07:52:57 audio_to_text_dataset:43] Model level config does not contain `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2023-05-19 07:52:58 collections:193] Dataset loaded with 2100 files totalling 3.33 hours
[NeMo I 2023-05-19 07:52:58 collections:194] 0 files were filtered totalling 0.00 hours


    


[NeMo I 2023-05-19 07:52:58 audio_to_text_dataset:43] Model level config does not contain `sample_rate`, please explicitly provide `sample_rate` to the dataloaders.
[NeMo I 2023-05-19 07:52:58 audio_to_text_dataset:43] Model level config does not contain `labels`, please explicitly provide `labels` to the dataloaders.
[NeMo I 2023-05-19 07:52:58 collections:193] Dataset loaded with 800 files totalling 1.29 hours
[NeMo I 2023-05-19 07:52:58 collections:194] 0 files were filtered totalling 0.00 hours
[NeMo I 2023-05-19 07:52:58 features:267] PADDING: 16


In [None]:
# Start training

trainer.fit(first_asr_model)

INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


[NeMo I 2023-05-19 07:53:23 modelPT:616] Optimizer config = Novograd (
    Parameter Group 0
        amsgrad: False
        betas: [0.8, 0.5]
        eps: 1e-08
        grad_averaging: False
        lr: 0.01
        weight_decay: 0.001
    )
[NeMo I 2023-05-19 07:53:23 lr_scheduler:910] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f823c3f8550>" 
    will be used during training (effective maximum steps = 3300) - 
    Parameters : 
    (warmup_steps: null
    warmup_ratio: null
    min_lr: 0.0
    last_epoch: -1
    max_steps: 3300
    )


INFO:pytorch_lightning.callbacks.model_summary:
  | Name              | Type                              | Params
------------------------------------------------------------------------
0 | preprocessor      | AudioToMelSpectrogramPreprocessor | 0     
1 | encoder           | ConvASREncoder                    | 1.2 M 
2 | decoder           | ConvASRDecoder                    | 29.7 K
3 | loss              | CTCLoss                           | 0     
4 | spec_augmentation | SpectrogramAugmentation           | 0     
5 | _wer              | WER                               | 0     
------------------------------------------------------------------------
1.2 M     Trainable params
0         Non-trainable params
1.2 M     Total params
4.836     Total estimated model params size (MB)


Sanity Checking: 0it [00:00, ?it/s]

    


Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

In [None]:
files = ['/content/original.wav']
for fname, transcription in zip(files, first_asr_model.transcribe(paths2audio_files=files)):
  print(f"Prediction: {transcription}")

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Prediction: tla la za tllaianasiabasfiptaoji tillza toleafia afuita itallaptaewa hiygahafiaai


In [None]:
files = ['/content/drive/Shareddrives/IMARIKA/train_wavs/common_voice_sw_29914944.wav']
for fname, transcription in zip(files, first_asr_model.transcribe(paths2audio_files=files)):
  print(f"Prediction: {transcription}")

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Prediction: aayyanapasa kuunz wa na kutuma ipasayo kwa kutembelewa 


In [None]:
# Check Word Error Rate

import jiwer
import json

# Load the manifest file
# with open('/content/train_manifest_up.json') as f:
#     manifest = json.load(f)

manifest = [json.loads(line) for line in open('/content/train_manifest.json', 'r')]

audio_file = manifest[2]['audio_filepath']
print(f"Processing audio file {audio_file}")

files = ['/content/drive/Shareddrives/IMARIKA/train_wavs/common_voice_sw_29914944.wav']

fname, transcription in zip(files, first_asr_model.transcribe(paths2audio_files=files))
print(f"Predicted: {transcription}")

# Find the corresponding ground truth transcription
ground_truth = None
for item in manifest:
    if item['audio_filepath'] == audio_file:
        ground_truth = item['text']
        break

if ground_truth is None:
    raise ValueError(f"Audio file not found in manifest")

print(f"Ground truth: {ground_truth}")

# Calculate the WER between the predicted and ground truth transcriptions
wer = jiwer.wer(ground_truth, transcription)

print(f"WER: {wer:.4f}")


Processing audio file /content/drive/Shareddrives/IMARIKA/train_wavs/common_voice_sw_29914944.wav


Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

Predicted: hae ya napasa kutuzwa na kutumiaipasao kwa kutemdelewa
Ground truth: haya yanapaswa kutunzwa na kutumiwa ipasavyo kwa kutembelewa
WER: 0.8750


In [None]:
import json

# Load the contents of the file
with open('/content/train_manifest.json', 'r') as f:
    contents = f.read()

# Split the contents into lines
lines = contents.split('\n')

# Remove any empty lines
lines = [line.strip() for line in lines if line.strip()]

# Add commas to the end of each line except the last one
formatted_lines = [line + ',' if index < len(lines) - 1 else line for index, line in enumerate(lines)]

# Join the lines back together with newlines
formatted_contents = '\n'.join(formatted_lines)


# Write the formatted contents back to the file
with open('/content/train_manifest_up.json', 'w') as f:
    f.write(formatted_contents)
