# Amazon SageMaker Real-Time Hosting with NeMo ASR

This notebook shows how to use [SageMaker's real-time inference endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) to host an [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) ASR model for real-time audio-to-text transcription. In this notebook you will:

1. Install dependencies for NeMo ASR
2. Load a NeMo model (in this case, the ReazonSpeech v2 model)
3. Run inference locally on an example audio dataset
4. Create a SageMaker model
5. Deploy the SageMaker model to a real-time endpoint
6. Run inference on the SageMaker endpoint
7. Delete the SageMaker endpoint

# Install NeMo ASR and its dependencies

The cells below will install the Python packages needed to use NeMo ASR models and evaluate the transcription results.

In [1]:
%pip install -r src/requirements.txt --quiet

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import numpy as np
import torch
import sagemaker
import time
from scipy.io import wavfile
from tqdm.notebook import tqdm

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


# Download a sample audio file and load it into an np.ndarray

Download a sample audio clip from the `reazonspeech` dataset and convert it to float32 PCM, which is a suitable input format for a NeMo `ASRModel`.

In [3]:
![[ -e speech-001.wav ]] || wget https://research.reazon.jp/_static/speech-001.wav

In [4]:
# Use the Audio class from IPython.display to play the audio file inline.
from IPython.display import Audio

# Path to the sample audio file downloaded above
audiofile = 'speech-001.wav'

# Render an audio player for the file
Audio(audiofile)

In [5]:
sr, int16pcm = wavfile.read(audiofile)

In [6]:
float32pcm = int16pcm.astype(np.float32)
float32pcm /= 32767
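
The conversion above assumes 16-bit mono PCM. As a minimal sketch of a slightly more defensive variant, the hypothetical helper `to_float32_mono` below (not part of the original notebook) also accepts other signed integer widths and downmixes stereo input. It divides by 32768 rather than 32767 for int16 so the result stays within [-1, 1); both conventions are in common use.

```python
import numpy as np


def to_float32_mono(pcm: np.ndarray) -> np.ndarray:
    """Convert PCM samples to float32 in [-1.0, 1.0) and downmix to mono.

    Hypothetical helper for illustration; assumes signed integer or float input
    shaped (samples,) or (samples, channels), as returned by scipy.io.wavfile.read.
    """
    if pcm.dtype.kind == "i":
        # Signed integer PCM: scale by 2**(bits-1), e.g. 32768 for int16.
        scale = float(np.iinfo(pcm.dtype).max) + 1.0
        audio = pcm.astype(np.float32) / scale
    elif pcm.dtype.kind == "f":
        # Float WAV data is assumed to be normalized already.
        audio = pcm.astype(np.float32)
    else:
        raise ValueError(f"Unsupported sample dtype: {pcm.dtype}")

    if audio.ndim == 2:
        # Average the channels to obtain a mono signal.
        audio = audio.mean(axis=1)
    return audio
```

It would be called as `float32pcm = to_float32_mono(int16pcm)`. Checking that `sr` matches the sample rate the model was trained with is a separate step that this sketch does not perform.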

# Run example inference locally using the ReazonSpeech v2 NeMo model

Download the model file from Hugging Face and load it with `ASRModel.restore_from()`. You will then test the model by running inference on the sample audio.

In [7]:
# download the model file if it is absent in the directory
![[ -e reazonspeech-nemo-v2.nemo ]] || wget https://huggingface.co/reazon-research/reazonspeech-nemo-v2/resolve/main/reazonspeech-nemo-v2.nemo

In [8]:
from nemo.collections.asr.models import ASRModel
model_path = 'reazonspeech-nemo-v2.nemo'
asr_model = ASRModel.restore_from(model_path)

[NeMo I 2024-03-15 05:35:49 mixins:172] Tokenizer SentencePieceTokenizer initialized with 3000 tokens


[NeMo W 2024-03-15 05:35:50 modelPT:165] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: dataset/train.json
    sample_rate: 16000
    batch_size: 32
    shuffle: true
    num_workers: 8
    pin_memory: true
    max_duration: 30
    min_duration: 0.1
    use_start_end_token: false
    trim_silence: false
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    bucketing_strategy: fully_randomized
    bucketing_batch_size: null
    
[NeMo W 2024-03-15 05:35:50 modelPT:172] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: dataset/valid.json
    sample_rate: 16000
    batch_size: 16
    shuffle: false

[NeMo I 2024-03-15 05:35:50 features:289] PADDING: 0
[NeMo I 2024-03-15 05:35:56 rnnt_models:220] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-03-15 05:35:56 rnnt_models:220] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-03-15 05:35:56 rnnt_models:220] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-03-15 05:35:59 save_restore_connector:263] Model EncDecRNNTBPEModel was successfully restored from /home/ec2-user/SageMaker/nemo-asr-inference-for-amazon-sagemaker/reazonspeech-nemo-v2.nemo.


In [9]:
results = asr_model.transcribe([float32pcm] * 10, batch_size=1)

Transcribing:   0%|          | 0/10 [00:00<?, ?it/s]
Beam search progress::   0%|          | 0/1 [00:00<?, ?sample/s][A
Beam search progress:: 100%|██████████| 1/1 [00:00<00:00,  2.18sample/s][A
Transcribing:  10%|█         | 1/10 [00:01<00:12,  1.35s/it]
Beam search progress::   0%|          | 0/1 [00:00<?, ?sample/s][A
Beam search progress:: 100%|██████████| 1/1 [00:00<00:00,  2.35sample/s][A
Transcribing:  20%|██        | 2/10 [00:01<00:06,  1.15it/s]
Beam search progress::   0%|          | 0/1 [00:00<?, ?sample/s][A
Beam search progress:: 100%|██████████| 1/1 [00:00<00:00,  2.34sample/s][A
Transcribing:  30%|███       | 3/10 [00:02<00:05,  1.39it/s]
Beam search progress::   0%|          | 0/1 [00:00<?, ?sample/s][A
Beam search progress:: 100%|██████████| 1/1 [00:00<00:00,  2.34sample/s][A
Transcribing:  40%|████      | 4/10 [00:02<00:03,  1.54it/s]
Beam search progress::   0%|          | 0/1 [00:00<?, ?sample/s][A
Beam search progress:: 100%|██████████| 1/1 [00:00<00:00,  

In [10]:
from omegaconf import OmegaConf
# Create a DictConfig for greedy decoding
greedy_decoding_cfg = OmegaConf.create({
    'strategy': 'greedy',
    # Add other necessary configuration parameters here
})

In [11]:
asr_model.change_decoding_strategy(greedy_decoding_cfg)

[NeMo I 2024-03-15 05:36:05 rnnt_models:220] Using RNNT Loss : warprnnt_numba
    Loss warprnnt_numba_kwargs: {'fastemit_lambda': 0.0, 'clamp': -1.0}
[NeMo I 2024-03-15 05:36:05 rnnt_bpe_models:490] Changed decoding strategy to 
    model_type: rnnt
    strategy: greedy
    compute_hypothesis_token_set: false
    preserve_alignments: null
    confidence_cfg:
      preserve_frame_confidence: false
      preserve_token_confidence: false
      preserve_word_confidence: false
      exclude_blank: true
      aggregation: min
      method_cfg:
        name: entropy
        entropy_type: tsallis
        alpha: 0.33
        entropy_norm: exp
        temperature: DEPRECATED
    fused_batch_size: null
    compute_timestamps: null
    compute_langs: false
    word_seperator: ' '
    rnnt_timestamp_type: all
    greedy:
      max_symbols_per_step: 10
      preserve_alignments: false
      preserve_frame_confidence: false
      confidence_method_cfg:
        name: entropy
        entropy_type: tsal

In [12]:
result = asr_model.transcribe([float32pcm] * 10, batch_size=1)

Transcribing: 100%|██████████| 10/10 [00:02<00:00,  4.37it/s]


# SageMaker Inference

In this section, you will deploy the NeMo model from the previous section to a real-time API endpoint on Amazon SageMaker. Start by instantiating a SageMaker session and defining the Amazon S3 path for your model artifacts, as shown below.

In [13]:
sess = sagemaker.session.Session()
bucket = sess.default_bucket()
prefix = 'nemo-reazonspeech-deploy/'
s3_uri = f's3://{bucket}/{prefix}'

## Create Model Artifacts in S3

SageMaker expects the model artifact as a `model.tar.gz` archive in Amazon S3. The `.nemo` checkpoint downloaded earlier already bundles the model weights together with the model configuration, so it can be packaged as-is without re-saving it through PyTorch.

The cell below packages the checkpoint into a tar.gz file and uploads it to Amazon S3. This archive is the model artifact referenced for real-time inference.

In [14]:
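# Package the .nemo checkpoint at the root of model.tar.gz
!mkdir -p model
!mv reazonspeech-nemo-v2.nemo model
!cd model && tar -czvf model.tar.gz reazonspeech-nemo-v2.nemo
!mv model/model.tar.gz .
# List the archive contents to verify the layout
!tar -tvf model.tar.gz
# Upload the archive to S3 and keep its URI for the SageMaker model
model_uri = sess.upload_data('model.tar.gz', bucket=bucket, key_prefix=f"{prefix}model")
# Remove local copies (this also deletes the downloaded .nemo file)
!rm model.tar.gz
!rm -rf model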

reazonspeech-nemo-v2.nemo
-rw-rw-r-- ec2-user/ec2-user 2477946880 2024-01-30 02:10 reazonspeech-nemo-v2.nemo
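
The same packaging step can be done from Python instead of shell commands. The sketch below uses the standard-library `tarfile` module; the helper name `package_model` is hypothetical and not part of the original notebook.

```python
import tarfile
from pathlib import Path
from typing import List


def package_model(model_file: str, archive_path: str = "model.tar.gz") -> List[str]:
    """Write model_file into a gzip-compressed tar archive and return its member names.

    Hypothetical helper: the file is stored at the archive root (no directory
    prefix), matching the layout produced by the shell commands above.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(model_file, arcname=Path(model_file).name)

    # Re-open the archive to verify what was written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Calling `package_model('reazonspeech-nemo-v2.nemo')` would produce an archive suitable for `sess.upload_data`. Unlike the `mv`-based shell version, it leaves the original `.nemo` file in place.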


## Create SageMaker Model Object

Once the model artifact has been uploaded to S3, you will use the SageMaker SDK to create a `model` object that references three things: the model artifact in S3, one of SageMaker's PyTorch inference containers, and the inference code stored in the `src` directory of this repository. `inference.py` is the code executed at runtime, while `requirements.txt` tells SageMaker which NeMo-related libraries to install in the container.

In [15]:
image = sagemaker.image_uris.retrieve(
    framework='pytorch',
    region=sess.boto_region_name,  # use the session's region rather than a hardcoded one
    image_scope='inference',
    version='2.1',
    instance_type='ml.g4dn.xlarge',
)

model_name = f'nemo-reazonspeech-{int(time.time())}'
nemo_model_sm = sagemaker.model.Model(
    model_data=model_uri,
    image_uri=image,
    role=sagemaker.get_execution_role(),
    entry_point="inference.py",
    source_dir='src',
    name=model_name,
)
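
The repository's `src/inference.py` is not reproduced in this notebook. As a rough sketch only, and not the actual file, the handlers below illustrate the four hooks that SageMaker's PyTorch inference container looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). Several details are assumptions: that the `.nemo` file sits at the root of the extracted archive, that integer PCM is scaled by 32767 as in the local cells, and that `transcribe` may return either plain strings, hypothesis objects with a `.text` attribute, or a `(best, all)` tuple depending on the NeMo version.

```python
import io
import os

import numpy as np


def model_fn(model_dir):
    """Load the checkpoint that SageMaker extracted from model.tar.gz."""
    # Imported lazily so the other handlers can be exercised without NeMo installed.
    from nemo.collections.asr.models import ASRModel

    # Assumption: the .nemo file is at the root of the archive.
    return ASRModel.restore_from(os.path.join(model_dir, "reazonspeech-nemo-v2.nemo"))


def input_fn(request_body, request_content_type):
    """Deserialize the .npy payload sent by NumpySerializer into float32 PCM."""
    if request_content_type != "application/x-npy":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    pcm = np.load(io.BytesIO(request_body), allow_pickle=False)
    if pcm.dtype.kind == "i":
        # Assumption: int16 input, scaled as in the local cells.
        return pcm.astype(np.float32) / 32767
    return pcm.astype(np.float32)


def predict_fn(input_data, model):
    """Transcribe a single utterance and return the text."""
    result = model.transcribe([input_data], batch_size=1)
    # Assumption: some NeMo versions return (best_hypotheses, all_hypotheses).
    if isinstance(result, tuple):
        result = result[0]
    first = result[0]
    return first if isinstance(first, str) else first.text


def output_fn(prediction, accept):
    """Return the transcription as UTF-8 bytes."""
    return prediction.encode("utf-8")
```

Keeping the int16-to-float32 conversion on the server side means the client can send the raw array read from the WAV file, which is what the test cells later in this notebook do.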

## Deploy to a Real Time Endpoint

Deploy the `model` object to SageMaker with the `deploy` method. This notebook uses an `ml.g4dn.xlarge` instance, one of AWS's lower-cost GPU instance types, for accelerated inference.

In [16]:
endpoint_name = f'nemo-reazonspeech-endpoint-{int(time.time())}'
nemo_model_sm.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
    endpoint_name=endpoint_name,
    wait=True,
)

----------!
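
With `wait=True`, `deploy` blocks until the endpoint is ready. If you deploy with `wait=False` instead, you need to check the endpoint status yourself. The sketch below polls the SageMaker `describe_endpoint` API through a client passed in by the caller; the helper name `wait_until_in_service` and its default intervals are illustrative assumptions.

```python
import time


def wait_until_in_service(sm_client, endpoint_name, poll_seconds=30.0, timeout_seconds=1800.0):
    """Poll describe_endpoint until the endpoint reports 'InService'.

    Hypothetical helper; sm_client is expected to behave like
    boto3.client("sagemaker").
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Endpoint {endpoint_name} failed to deploy")
        if time.monotonic() > deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

A call would look like `wait_until_in_service(boto3.client("sagemaker"), endpoint_name)`. boto3 also provides an `endpoint_in_service` waiter that serves the same purpose.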

## Test Inference

Once the model has been deployed, you can connect to the endpoint using the `Predictor` class in the SageMaker SDK and call its `predict` method to transcribe the same audio signal used earlier in this notebook. The request sends the raw `int16pcm` array, serialized with `NumpySerializer`, and the response is the transcription as UTF-8 encoded bytes. The results should be consistent between the local execution and the API call.

In [17]:
nemo_endpoint = sagemaker.predictor.Predictor(endpoint_name)
nemo_endpoint.serializer = sagemaker.serializers.NumpySerializer()

assert nemo_endpoint.endpoint_context().properties['Status'] == 'InService'

In [18]:
inp = int16pcm
out = nemo_endpoint.predict(inp)
print(f'Example Transcription: \n{out}')

Example Transcription: 
b'\xe6\xb0\x97\xe8\xb1\xa1\xe5\xba\x81\xe3\x81\xaf\xe9\x9b\xaa\xe3\x82\x84\xe8\xb7\xaf\xe9\x9d\xa2\xe3\x81\xae\xe5\x87\x8d\xe7\xb5\x90\xe3\x81\xab\xe3\x82\x88\xe3\x82\x8b\xe4\xba\xa4\xe9\x80\x9a\xe3\x81\xb8\xe3\x81\xae\xe5\xbd\xb1\xe9\x9f\xbf\xe3\x80\x81\xe6\x9a\xb4\xe9\xa2\xa8\xe9\x9b\xaa\xe3\x82\x84\xe9\xab\x98\xe6\xb3\xa2\xe3\x81\xab\xe8\xad\xa6\xe6\x88\x92\xe3\x81\x99\xe3\x82\x8b\xe3\x81\xa8\xe3\x81\xa8\xe3\x82\x82\xe3\x81\xab\xe9\x9b\xaa\xe5\xb4\xa9\xe3\x82\x84\xe5\xb1\x8b\xe6\xa0\xb9\xe3\x81\x8b\xe3\x82\x89\xe3\x81\xae\xe8\x90\xbd\xe9\x9b\xaa\xe3\x81\xab\xe3\x82\x82\xe5\x8d\x81\xe5\x88\x86\xe6\xb3\xa8\xe6\x84\x8f\xe3\x81\x99\xe3\x82\x8b\xe3\x82\x88\xe3\x81\x86\xe5\x91\xbc\xe3\x81\xb3\xe3\x81\x8b\xe3\x81\x91\xe3\x81\xa6\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99\xe3\x80\x82'


In [19]:
out.decode()

'気象庁は雪や路面の凍結による交通への影響、暴風雪や高波に警戒するとともに雪崩や屋根からの落雪にも十分注意するよう呼びかけています。'

## Sequential Latency Test

You can also run a latency test to see how quickly the g4dn instance processes single, sequential requests. The first cell warms up the endpoint, and the next cell times individual requests to it.

In [20]:
# warm up the instance
for i in range(10):
    out = nemo_endpoint.predict(inp)

In [21]:
%%timeit
out = nemo_endpoint.predict(inp)

265 ms ± 7.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [22]:
out.decode()

'気象庁は雪や路面の凍結による交通への影響、暴風雪や高波に警戒するとともに雪崩や屋根からの落雪にも十分注意するよう呼びかけています。'
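
`%%timeit` reports a mean and standard deviation, but for a real-time endpoint the tail of the latency distribution is often of interest as well. As a minimal sketch, the hypothetical helper `measure_latency` below times sequential calls to any callable and summarizes them with nearest-rank percentiles; the request count is an arbitrary assumption.

```python
import math
import statistics
import time


def measure_latency(call, n_requests=50):
    """Time sequential calls and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    latencies_ms.sort()

    def percentile(p):
        # Nearest-rank percentile on the sorted samples.
        rank = max(1, math.ceil(p / 100.0 * len(latencies_ms)))
        return latencies_ms[rank - 1]

    return {
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": percentile(50),
        "p95_ms": percentile(95),
        "max_ms": latencies_ms[-1],
    }
```

Against the endpoint it could be used as `measure_latency(lambda: nemo_endpoint.predict(inp))`. The measured time includes network round trips from the notebook instance, not only model inference.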

## Optional: Clean Up Endpoint

Once you have finished testing your endpoint, you can delete it. This is good practice: removing experimental endpoints when they are not in use reduces your SageMaker costs.

In [23]:
nemo_endpoint.delete_endpoint()
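
The call above removes the endpoint, but the SageMaker model object and the `model.tar.gz` in S3 are not touched by it. As an alternative to the cell above (not in addition to it), the sketch below removes the endpoint, its endpoint configuration, and the associated model objects through the SageMaker API. The helper name `delete_endpoint_resources` is hypothetical, and it assumes every production variant references a model by name, as is the case for an endpoint created with `deploy` in this notebook.

```python
def delete_endpoint_resources(sm_client, endpoint_name):
    """Delete an endpoint together with its endpoint config and model objects.

    Hypothetical helper; sm_client is expected to behave like
    boto3.client("sagemaker"). The S3 model artifact is left untouched.
    """
    # Look up the config and models before deleting anything.
    endpoint = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sm_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sm_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}
```

It would be called as `delete_endpoint_resources(boto3.client("sagemaker"), endpoint_name)`.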