# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates an English audio file into Spanish speech.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with an English speech recognition model.
* Translate the text to Spanish with a machine translation model.
* Generate audio with text-to-speech models fine-tuned on Spanish speech.

## Import all necessary packages

In [13]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

import mlflow
import os
from mlflow.types.schema import Schema, ColSpec
from mlflow.types import ParamSchema, ParamSpec
from mlflow.models import ModelSignature
import io
import base64
import soundfile as sf
import json
import logging

## Test whether NeMo is properly imported

In this cell, we list the NeMo Automatic Speech Recognition models available on NGC, which confirms that our Workspace can load NeMo and connect to NGC.

* ``list_available_models()`` lists all models currently available on NGC, along with their names.



In [3]:
# Here is an example of all CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()
# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pre
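The listing above can get long, so it may help to filter it by name. The following is a minimal sketch, assuming each entry exposes a `pretrained_model_name` attribute as the printed output suggests; `filter_models` is a hypothetical helper, not part of NeMo, and the sample entries are stand-ins so the snippet runs without NGC access.

```python
from types import SimpleNamespace


def filter_models(model_infos, keyword):
    """Return the names of models whose pretrained_model_name contains keyword.

    The match is case-insensitive. `model_infos` is any iterable of objects
    with a `pretrained_model_name` attribute (an assumption based on the
    output printed above).
    """
    keyword = keyword.lower()
    return [
        info.pretrained_model_name
        for info in model_infos
        if keyword in info.pretrained_model_name.lower()
    ]


# Stand-in entries mimicking the objects printed above. In the notebook you
# would pass nemo_asr.models.EncDecCTCModel.list_available_models() instead.
sample_infos = [
    SimpleNamespace(pretrained_model_name="QuartzNet15x5Base-En"),
    SimpleNamespace(pretrained_model_name="stt_en_quartznet15x5"),
]

quartznet_names = filter_models(sample_infos, "quartznet")
print(quartznet_names)
```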

## Loading from local saved models

Here, instead of downloading the models from NGC in code, we load models that were previously downloaded through the AI Studio assets manager.

In [4]:
# Speech Recognition model - English Citrinet checkpoint (stt_en_citrinet_1024_gamma_0_25)
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("/home/jovyan/datafabric/NeMo/stt_en_citrinet_1024_gamma_0_25.nemo")

# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.restore_from("/home/jovyan/datafabric/NeMo/nmt_en_es_transformer12x2.nemo")

# Spectrogram generator which takes text as an input and produces spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from("/home/jovyan/datafabric/NeMo/tts_es_fastpitch_multispeaker.nemo")

# Vocoder model which takes spectrogram and produces actual audio
vocoder = nemo_tts.models.HifiGanModel.restore_from("/home/jovyan/datafabric/NeMo/tts_es_hifigan_ft_fastpitch_multispeaker.nemo")

[NeMo I 2024-02-19 16:27:40 mixins:170] Tokenizer SentencePieceTokenizer initialized with 1024 tokens


[NeMo W 2024-02-19 16:27:40 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    trim_silence: false
    max_duration: 20.0
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    use_start_end_token: false
    
[NeMo W 2024-02-19 16:27:40 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: null
    sample_rate: 16000
    batch_size: 32
    shuffle: false
    use_start_end_token: false
    
[NeMo W 2024-02-19 16:27:40 modelPT:174] Please call the ModelPT.setup_test_data() or ModelPT.setup_multiple_test_data() method and provide a v

[NeMo I 2024-02-19 16:27:40 features:289] PADDING: 16
[NeMo I 2024-02-19 16:27:45 save_restore_connector:249] Model EncDecCTCModelBPE was successfully restored from /home/jovyan/datafabric/NeMo/stt_en_citrinet_1024_gamma_0_25.nemo.
[NeMo I 2024-02-19 16:28:35 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpnzh2i15v/tokenizer.32000.BPE.model with r2l: False.
[NeMo I 2024-02-19 16:28:35 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpnzh2i15v/tokenizer.32000.BPE.model with r2l: False.


[NeMo W 2024-02-19 16:28:35 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/batches.tokens.16000.pkl
    tgt_file_name: /raid/batches.tokens.16000.pkl
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: true
    reverse_lang_direction: false
    load_from_tarred_dataset: false
    metadata_path: null
    tar_shuffle_n: 100
    
[NeMo W 2024-02-19 16:28:35 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config

[NeMo I 2024-02-19 16:28:41 nlp_overrides:752] Model MTEncDecModel was successfully restored from /home/jovyan/datafabric/NeMo/nmt_en_es_transformer12x2.nemo.


[NeMo W 2024-02-19 16:28:46 deprecated:63] Function ``g2p_backward_compatible_support`` is deprecated. But it will not be removed until a further notice. G2P object root directory `nemo_text_processing.g2p` has been replaced with `nemo.collections.tts.g2p`. Please use the latter instead as of NeMo 1.18.0.
[NeMo W 2024-02-19 16:28:46 experimental:26] `<class 'nemo.collections.tts.g2p.models.i18n_ipa.IpaG2p'>` is experimental and not ready for production yet. Use at your own risk.
[NeMo W 2024-02-19 16:28:47 i18n_ipa:124] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2024-02-19 16:28:47 experimental:26] `<class 'nemo.collections.common.tokenizers.text_to_speech.tts_tokenizers.IPATokenizer'>` is experimental and not ready for production yet. Use at your own ri

[NeMo I 2024-02-19 16:28:47 features:289] PADDING: 1
[NeMo I 2024-02-19 16:28:47 save_restore_connector:249] Model FastPitchModel was successfully restored from /home/jovyan/datafabric/NeMo/tts_es_fastpitch_multispeaker.nemo.


[NeMo W 2024-02-19 16:28:59 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.VocoderDataset
      manifest_filepath: /home/rlangman/Data/openslr/spanish/ipa/train_hifi_gta_manifest.json
      sample_rate: 44100
      n_segments: 16384
      max_duration: null
      min_duration: 0.75
      load_precomputed_mel: true
      hop_length: 512
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 16
      num_workers: 4
      pin_memory: true
    
[NeMo W 2024-02-19 16:28:59 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo

[NeMo I 2024-02-19 16:28:59 features:289] PADDING: 0
[NeMo I 2024-02-19 16:28:59 features:297] STFT using exact pad
[NeMo I 2024-02-19 16:28:59 features:289] PADDING: 0
[NeMo I 2024-02-19 16:28:59 features:297] STFT using exact pad


    


[NeMo I 2024-02-19 16:29:00 save_restore_connector:249] Model HifiGanModel was successfully restored from /home/jovyan/datafabric/NeMo/tts_es_hifigan_ft_fastpitch_multispeaker.nemo.
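Because the models are restored from local files, a missing or misnamed asset only shows up as an error partway through the slow loading cell. A quick check beforehand can save time. This is a sketch under the assumption that the four `.nemo` files sit in the directory used above; `missing_checkpoints` and `NEMO_FILES` are hypothetical names introduced for illustration.

```python
from pathlib import Path


def missing_checkpoints(model_dir, filenames):
    """Return the subset of filenames that do not exist as files in model_dir."""
    base = Path(model_dir)
    return [name for name in filenames if not (base / name).is_file()]


# File names taken from the restore_from() calls in the cell above.
NEMO_FILES = [
    "stt_en_citrinet_1024_gamma_0_25.nemo",
    "nmt_en_es_transformer12x2.nemo",
    "tts_es_fastpitch_multispeaker.nemo",
    "tts_es_hifigan_ft_fastpitch_multispeaker.nemo",
]

missing = missing_checkpoints("/home/jovyan/datafabric/NeMo", NEMO_FILES)
if missing:
    print("Missing checkpoints:", missing)
else:
    print("All checkpoints found")
```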


In [5]:
audio_sample = 'ForrestGump.mp3'
# Fail early if the sample file is missing
assert os.path.exists(audio_sample), f"Audio file not found: {audio_sample}"
# To listen to it, click the play button below
IPython.display.Audio(audio_sample)

In [6]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

["my mom always said life like a box of chocolates never know what you're going to get"]


In [7]:
cuda_nmt_model = nmt_model.cuda()
translated_text = cuda_nmt_model.translate(transcribed_text)

print(translated_text)

['mi mamá siempre dijo la vida como una caja de chocolates nunca saben lo que vas a conseguir']


In [8]:
cuda_spectrogram_generator = spectrogram_generator.cuda()
cuda_vocoder = vocoder.cuda()


parsed = cuda_spectrogram_generator.parse(translated_text[0])
spectrogram = cuda_spectrogram_generator.generate_spectrogram(tokens=parsed, speaker=0)
audio = cuda_vocoder.convert_spectrogram_to_audio(spec=spectrogram)
IPython.display.Audio(audio.to('cpu').detach().numpy(), rate=44100)


[NeMo W 2024-02-19 16:29:13 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2024-02-19 16:29:13 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.
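The three cells above form one pipeline: speech recognition, then translation, then speech synthesis. They could be wrapped into a single function along the following lines. This is only a sketch: `translate_speech` is a hypothetical name, it assumes `transcribe()` and `translate()` return lists of strings as in the outputs shown above, and it calls `.eval()` on the TTS models (a standard `torch.nn.Module` method) on the assumption that this addresses the "meant to be called in eval mode" warnings.

```python
def translate_speech(audio_path, asr_model, nmt_model,
                     spectrogram_generator, vocoder, speaker=0):
    """Chain ASR -> NMT -> TTS and return (source_text, translated_text, audio).

    The models are passed in so the caller decides where they live
    (CPU or GPU); this function does not move them.
    """
    # 1. Speech recognition: audio file -> English text
    source_text = asr_model.transcribe([audio_path])[0]

    # 2. Machine translation: English text -> Spanish text
    translated_text = nmt_model.translate([source_text])[0]

    # 3. Speech synthesis: Spanish text -> spectrogram -> waveform
    spectrogram_generator.eval()
    vocoder.eval()
    tokens = spectrogram_generator.parse(translated_text)
    spectrogram = spectrogram_generator.generate_spectrogram(
        tokens=tokens, speaker=speaker
    )
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

    return source_text, translated_text, audio
```

With the objects from this notebook, a call might look like `translate_speech(audio_sample, asr_model, cuda_nmt_model, cuda_spectrogram_generator, cuda_vocoder)`.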


In [24]:
class NemoTranslationModel(mlflow.pyfunc.PythonModel):
    def _preprocess_audio(self, inputs):
        """
        Deserializes base64-serialized audio to a NumPy array.
        Assumes the audio is in WAV format, for simplicity.
        """
        serialized_audio = inputs['source_serialized_audio'][0]
        audio_buffer = io.BytesIO(base64.b64decode(serialized_audio))
        audio, samplerate = sf.read(audio_buffer)
        return audio  # Return the audio as a NumPy array

    def text_to_audio(self, text):
        """
        Generates audio from text using the TTS models (FastPitch and HiFi-GAN).
        """
        parsed = self.spectrogram_generator.cuda().parse(text)
        # The FastPitch checkpoint is multi-speaker, so pass a speaker ID as in the interactive cell above
        spectrogram = self.spectrogram_generator.cuda().generate_spectrogram(tokens=parsed, speaker=0)
        audio = self.vocoder.cuda().convert_spectrogram_to_audio(spec=spectrogram)
        return audio.to('cpu').detach().numpy()

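    def serialize_audio(self, audio_np):
        """
        Serializes an audio NumPy array to a base64 string representing a WAV file.
        """
        with io.BytesIO() as audio_buffer:
            # The vocoder output has a leading batch dimension; soundfile expects (frames,) or (frames, channels).
            # 44100 Hz matches the sample rate of the TTS checkpoints loaded above.
            sf.write(audio_buffer, audio_np.squeeze(), samplerate=44100, format='WAV')
            audio_buffer.seek(0)
            audio_base64 = base64.b64encode(audio_buffer.read()).decode('utf-8')
        return audio_base64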

    def load_context(self, context):
        """
        Loads the necessary models for translation and speech synthesis.
        """
        model_name = context.artifacts["model"]
        self.asr_model = nemo_asr.models.EncDecCTCModel.restore_from(f"{model_name}/enc_dec_CTC.nemo")
        self.nmt_model = nemo_nlp.models.MTEncDecModel.restore_from(f"{model_name}/MT_enc_dec.nemo")
        self.spectrogram_generator = nemo_tts.models.FastPitchModel.restore_from(f"{model_name}/fast_pitch.nemo")
        self.vocoder = nemo_tts.models.HifiGanModel.restore_from(f"{model_name}/hifi_gan.nemo")

    def predict(self, context, model_input, params=None):
        # params may be None when the caller passes no inference parameters
        use_audio = (params or {}).get("use_audio", False)
        source_text = ""
        
        if use_audio:
            audio_data = self._preprocess_audio(model_input)
            source_text = self.asr_model.cuda().transcribe([audio_data])[0]
        else:
            source_text = model_input['source_text'][0] if 'source_text' in model_input else ""
        
        translated_text = self.nmt_model.cuda().translate([source_text])[0]
        translated_audio = ""
        if use_audio:
            audio = self.text_to_audio(translated_text)
            translated_audio = self.serialize_audio(audio)
        
        response = {"translated_text": translated_text, "translated_serialized_audio": translated_audio}
        
        return response

    @classmethod
    def log_model(cls, model_name, nemo_models, demo_folder):
        """
        Saves the NeMo checkpoints and logs the model to MLflow.
        """
        if not os.path.isdir(model_name):
            os.mkdir(model_name)

        for key, model in nemo_models.items():
            model.save_to(f"{model_name}/{key}.nemo")

        input_schema = Schema([
            ColSpec("string", "source_text"),
            ColSpec("string", "source_serialized_audio"),
        ])
        output_schema = Schema([
            ColSpec("string", "translated_text"),
            ColSpec("string", "translated_serialized_audio"),
        ])
        params_schema = ParamSchema([
            ParamSpec("use_audio", "boolean", False)
        ])
        signature = ModelSignature(inputs=input_schema, outputs=output_schema, params=params_schema)

        mlflow.pyfunc.log_model(
            model_name,
            python_model=cls(),
            artifacts={"model": model_name, "demo": demo_folder},
            signature=signature
        )
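The class above moves audio in and out as base64-encoded WAV strings. The round trip can be tried in isolation with the sketch below. It uses the standard-library `wave` module in place of `soundfile` so that it runs without extra dependencies; `make_test_tone_wav` is a hypothetical helper, and the 44100 Hz rate is taken from the vocoder configuration printed earlier.

```python
import base64
import io
import math
import struct
import wave


def make_test_tone_wav(samplerate=44100, seconds=0.1, freq=440.0):
    """Return the bytes of a mono 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(samplerate * seconds)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / samplerate)))
        for i in range(n_frames)
    )
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 16-bit samples
        wav_file.setframerate(samplerate)
        wav_file.writeframes(frames)
    return buffer.getvalue()


# Encode: WAV bytes -> base64 text (what a client would send or receive)
wav_bytes = make_test_tone_wav()
encoded = base64.b64encode(wav_bytes).decode("utf-8")

# Decode: base64 text -> WAV bytes -> readable audio
decoded = base64.b64decode(encoded)
with wave.open(io.BytesIO(decoded), "rb") as wav_file:
    framerate = wav_file.getframerate()
    frame_count = wav_file.getnframes()

print(framerate, frame_count)
```

The `encoded` string is the kind of value the `source_serialized_audio` input column and the `translated_serialized_audio` output field are meant to carry.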

In [23]:

mlflow.set_experiment(experiment_name='NeMo_translation')

with mlflow.start_run(run_name='NeMo_en_es_translation') as run:

    model_set = {
        "enc_dec_CTC": asr_model,
        "MT_enc_dec": nmt_model,
        "fast_pitch": spectrogram_generator,
        "hifi_gan": vocoder,
    }

    NemoTranslationModel.log_model(model_name='nemo_en_es', nemo_models=model_set, demo_folder="demo")

    mlflow.register_model(
        model_uri=f"runs:/{run.info.run_id}/nemo_en_es",
        name="NeMo JSON",
    )


Downloading artifacts:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading artifacts:   0%|          | 0/3 [00:00<?, ?it/s]

Registered model 'NeMo JSON' already exists. Creating a new version of this model...
2024/02/19 17:28:47 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: NeMo JSON, version 2
Created version '2' of model 'NeMo JSON'.
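Once the model is registered, a client needs a request body that matches the signature defined in `log_model`. The sketch below builds one. It rests on an assumption: that the model is served through MLflow's REST scoring server, which in MLflow 2.x accepts a `dataframe_split` body with an optional `params` object. `build_request` is a hypothetical helper; the column names and the `use_audio` parameter come from the schemas above.

```python
import base64
import json


def build_request(source_text="", wav_bytes=None):
    """Build a JSON request body for the translation model.

    If wav_bytes is given, it is base64-encoded and use_audio is set to True;
    otherwise the request carries only text.
    """
    use_audio = wav_bytes is not None
    serialized_audio = base64.b64encode(wav_bytes).decode("utf-8") if use_audio else ""
    payload = {
        "dataframe_split": {
            # Column names match the input schema defined in log_model()
            "columns": ["source_text", "source_serialized_audio"],
            "data": [[source_text, serialized_audio]],
        },
        "params": {"use_audio": use_audio},
    }
    return json.dumps(payload)


text_request = build_request(source_text="life is like a box of chocolates")
print(text_request)
```

Both columns are always sent, with an empty string for the unused one, because the input schema declares both as required string columns.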
