# Getting Started: Sample Conversational AI application
This notebook shows how to use NVIDIA NeMo (https://github.com/NVIDIA/NeMo) to build a toy demo that translates a Mandarin audio file into an English one.

The demo shows how to:

* Instantiate pre-trained NeMo models from NVIDIA NGC.
* Transcribe audio with a (Mandarin) speech recognition model.
* Translate text with a machine translation model.
* Generate audio with text-to-speech models.

## Installation
NeMo can be installed with a single pip command.
This takes about 4 minutes.

(The installation method below should work inside your new Conda environment or in an NVIDIA docker container.)

In [None]:
BRANCH = 'main'
!python -m pip install git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]


DEPRECATION: git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all] contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617
Collecting nemo_toolkit[all]
  Cloning https://github.com/NVIDIA/NeMo.git (to revision main) to /tmp/pip-install-de5erhw2/nemo-toolkit_16ce1005865b4248af09b3f2be30992f
  Running command git clone --filter=blob:none --quiet https://github.com/NVIDIA/NeMo.git /tmp/pip-install-de5erhw2/nemo-toolkit_16ce1005865b4248af09b3f2be30992f
  Resolved https://github.com/NVIDIA/NeMo.git to commit 43c93d8a5578dadf4f56f21eb9cf0f0870e60fb7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub (from nemo_toolkit[all])
  Downl
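
Once the install has finished, it can be useful to confirm that the package is actually visible to the running kernel. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and `nemo_toolkit` is assumed to be the distribution name because it is the name used in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# "nemo_toolkit" is the distribution name taken from the pip command above.
version = installed_version("nemo_toolkit")
print("NeMo version:", version if version else "not installed")
```

If this prints "not installed" right after a successful install, restarting the kernel is usually the first thing to try.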

## Import all necessary packages

In [None]:
# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts
# We'll use this to listen to audio
import IPython

## Instantiate pre-trained NeMo models

Every NeMo model has these methods:

* ``list_available_models()`` lists all models currently available on NGC, along with their names.

* ``from_pretrained(...)`` downloads a model directly from NGC by name and initializes it.
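
Because the full listing can be long, it may help to filter it by name before calling ``from_pretrained(...)``. The sketch below assumes each returned entry exposes a `pretrained_model_name` attribute, as the `PretrainedModelInfo` entries printed in this notebook do; `find_model_names` is a hypothetical helper, not part of NeMo.

```python
def find_model_names(model_cls, substring):
    """Return sorted names of pre-trained models whose name contains `substring`.

    `model_cls` is any class with a `list_available_models()` classmethod,
    for example nemo_asr.models.EncDecCTCModel.
    """
    needle = substring.lower()
    # Guard against classes that return None instead of an empty list.
    infos = model_cls.list_available_models() or []
    return sorted(
        info.pretrained_model_name
        for info in infos
        if needle in info.pretrained_model_name.lower()
    )


# Example (requires NeMo to be installed):
# find_model_names(nemo_asr.models.EncDecCTCModel, "zh")
```

Any name returned this way can then be passed as `model_name` to ``from_pretrained(...)``.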


In [None]:
# List all available CTC-based models:
nemo_asr.models.EncDecCTCModel.list_available_models()
# More ASR Models are available - see: nemo_asr.models.ASRModel.list_available_models()

[PretrainedModelInfo(
 	pretrained_model_name=QuartzNet15x5Base-En,
 	description=QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other. Please visit https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels for further details.,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemospeechmodels/versions/1.0.0a5/files/QuartzNet15x5Base-En.nemo
 ),
 PretrainedModelInfo(
 	pretrained_model_name=stt_en_quartznet15x5,
 	description=For details about this model, please visit https://ngc.nvidia.com/catalog/models/nvidia:nemo:stt_en_quartznet15x5,
 	location=https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_en_quartznet15x5/versions/1.0.0rc1/files/stt_en_quartznet15x5.nemo
 ),
 PretrainedModelInfo(
 	pre

In [None]:
# Speech Recognition model - Citrinet initially trained on the Multilingual LibriSpeech English corpus, then fine-tuned on the open-source Aishell-2 corpus
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_zh_citrinet_1024_gamma_0_25").cuda()

# Neural Machine Translation model
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_zh_en_transformer6x6').cuda()

# Spectrogram generator which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model which takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_en_hifigan").cuda()

[NeMo I 2023-09-20 05:29:44 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_zh_citrinet_1024_gamma_0_25/versions/1.0.0/files/stt_zh_citrinet_1024_gamma_0_25.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0rc0/stt_zh_citrinet_1024_gamma_0_25/e4a8b1119971335507d9672e03bc80f4/stt_zh_citrinet_1024_gamma_0_25.nemo
[NeMo I 2023-09-20 05:30:20 common:913] Instantiating model from pre-trained checkpoint


Streaming output truncated to the last 5000 lines.
    - 佯
    - 佰
    - 佳
    - 佶
    - 佻
    - 佼
    - 使
    - 侃
    - 侄
    - 侈
    - 例
    - 侍
    - 侏
    - 侑
    - 侗
    - 供
    - 依
    - 侠
    - 侣
    - 侥
    - 侦
    - 侧
    - 侨
    - 侬
    - 侮
    - 侯
    - 侵
    - 便
    - 促
    - 俄
    - 俊
    - 俎
    - 俏
    - 俐
    - 俑
    - 俗
    - 俘
    - 俚
    - 保
    - 俞
    - 俟
    - 信
    - 俨
    - 俩
    - 俪
    - 俭
    - 修
    - 俯
    - 俱
    - 俸
    - 俺
    - 俾
    - 倌
    - 倍
    - 倒
    - 倔
    - 倘
    - 候
    - 倚
    - 倜
    - 借
    - 倡
    - 倦
    - 倩
    - 倪
    - 倭
    - 债
    - 值
    - 倾
    - 偃
    - 假
    - 偈
    - 偌
    - 偎
    - 偏
    - 偓
    - 偕
    - 做
    - 停
    - 健
    - 偶
    - 偷
    - 偻
    - 偿
    - 傀
    - 傅
    - 傍
    - 傣
    - 傥
    - 储
    - 催
    - 傲
    - 傻
    - 像
    - 僚
    - 僧
    - 僮
    - 僵
    - 僻
    - 儋
    - 儒
    - 儡
    - 儿
    - 兀
    - 允
    - 元
    - 兄
    - 充
    - 兆
    - 先
    - 光
    - 克
    - 免
    - 兑
    - 兔
    - 兖
    - 党

[NeMo I 2023-09-20 05:30:45 features:289] PADDING: 16
[NeMo I 2023-09-20 05:30:53 save_restore_connector:249] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0rc0/stt_zh_citrinet_1024_gamma_0_25/e4a8b1119971335507d9672e03bc80f4/stt_zh_citrinet_1024_gamma_0_25.nemo.
[NeMo I 2023-09-20 05:30:53 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_zh_en_transformer6x6/versions/1.0.0rc1/files/nmt_zh_en_transformer6x6.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0rc0/nmt_zh_en_transformer6x6/eff3792e6f4420ba83436be889e92d79/nmt_zh_en_transformer6x6.nemo
[NeMo I 2023-09-20 05:31:48 common:913] Instantiating model from pre-trained checkpoint
[NeMo I 2023-09-20 05:32:00 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpz3m9g13u/tokenizer.decoder.32000.BPE.model with r2l: False.
[NeMo I 2023-09-20 05:32:00 tokenizer_utils:179] Getting YouTokenToMeTokenizer with model: /tmp/tmpz3m9g13u/tokenizer.encoder.32000.BPE

[NeMo W 2023-09-20 05:32:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/batches.tokens.16000._OP_1..144_CL_.tar
    tgt_file_name: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/batches.tokens.16000._OP_1..144_CL_.tar
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: false
    reverse_lang_direction: true
    load_from_tarred_dataset: true
    metadata_path: /raid/tarred_data_accaligned_16k_tokens_32k_vocab_cov_0.999/metadata.json
    tar_shuffle_n: 100
    
[NeMo W 2023-09-20 05:32:00 modelPT:168] If you intend to do valida

[NeMo I 2023-09-20 05:32:07 save_restore_connector:249] Model MTEncDecModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0rc0/nmt_zh_en_transformer6x6/eff3792e6f4420ba83436be889e92d79/nmt_zh_en_transformer6x6.nemo.
[NeMo I 2023-09-20 05:32:07 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.8.1/files/tts_en_fastpitch_align.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0rc0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo
[NeMo I 2023-09-20 05:32:18 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2023-09-20 05:33:00 en_us_arpabet:66] apply_to_oov_word=None, This means that some of words will remain unchanged if they are not handled by any of the rules in self.parse_one_word(). This may be intended if phonemes and chars are both valid inputs, otherwise, you may see unexpected deletions in your input.
[NeMo W 2023-09-20 05:33:00 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.torch.data.TTSDataset
      manifest_filepath: /ws/LJSpeech/nvidia_ljspeech_train_clean_ngc.json
      sample_rate: 22050
      sup_data_path: /raid/LJSpeech/supplementary
      sup_data_types:
      - align_prior_matrix
      - pitch
      n_fft: 1024
      win_length: 1024
      hop_length: 256
      window: hann
      n_mels: 80
      lowfreq: 0
      highfreq: 8000
      max_duration: null
      

[NeMo I 2023-09-20 05:33:01 features:289] PADDING: 1
[NeMo I 2023-09-20 05:33:01 save_restore_connector:249] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0rc0/tts_en_fastpitch_align/b7d086a07b5126c12d5077d9a641a38c/tts_en_fastpitch_align.nemo.
[NeMo I 2023-09-20 05:33:01 cloud:68] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.21.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2023-09-20 05:33:22 common:913] Instantiating model from pre-trained checkpoint


[NeMo W 2023-09-20 05:33:25 modelPT:161] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2023-09-20 05:33:25 modelPT:168] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2023-09-20 05:33:25 features:289] PADDING: 0


[NeMo W 2023-09-20 05:33:25 features:266] Using torch_stft is deprecated and has been removed. The values have been forcibly set to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2023-09-20 05:33:25 features:289] PADDING: 0
[NeMo I 2023-09-20 05:33:26 save_restore_connector:249] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.21.0rc0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


## Get an audio sample in Mandarin

In [None]:
# Download the audio sample we'll use
# This is a sample from the MCV 6.1 Dev dataset - the model hasn't seen it before
# IMPORTANT: The audio must be mono with a 16 kHz sampling rate
audio_sample = 'common_voice_zh-CN_21347786.mp3'
!wget 'https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3'
# To listen to it, click on the play button below
IPython.display.Audio(audio_sample)

--2023-09-20 05:33:26--  https://nemo-public.s3.us-east-2.amazonaws.com/zh-samples/common_voice_zh-CN_21347786.mp3
Resolving nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)... 52.219.93.194, 52.219.97.66, 3.5.129.114, ...
Connecting to nemo-public.s3.us-east-2.amazonaws.com (nemo-public.s3.us-east-2.amazonaws.com)|52.219.93.194|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 24813 (24K) [audio/mp3]
Saving to: ‘common_voice_zh-CN_21347786.mp3’


2023-09-20 05:33:27 (320 KB/s) - ‘common_voice_zh-CN_21347786.mp3’ saved [24813/24813]
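
When trying your own recordings, the mono / 16 kHz requirement noted in the cell above can be checked before transcription. This is a minimal sketch for WAV files only, using the standard-library `wave` module; it cannot read the MP3 sample downloaded here, which would have to be converted to WAV first. `is_mono_16k_wav` is a hypothetical helper name.

```python
import wave


def is_mono_16k_wav(path, expected_rate=16000):
    """Return True if the WAV file at `path` is mono with the expected sample rate."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getnchannels() == 1
            and wav_file.getframerate() == expected_rate
        )


# Example, assuming a local file named my_recording.wav exists:
# is_mono_16k_wav("my_recording.wav")
```

A file that fails this check would need to be downmixed or resampled with an audio tool of your choice before being passed to the ASR model.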



## Transcribe audio file
We will use the speech recognition model to convert the audio into text.


In [None]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

['我们尽了最大努力']


## Translate Chinese text into English
NeMo's NMT models have a handy ``.translate()`` method.

In [None]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)

['We tried our best']
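
The two steps so far can be chained for several files at once. The sketch below is one possible wrapper, assuming (as in the outputs above) that `transcribe()` returns a list of strings and that `translate()` accepts and returns a list of strings; other NeMo versions may return richer objects from `transcribe()`. `transcribe_and_translate` is a hypothetical helper name.

```python
def transcribe_and_translate(audio_paths, asr_model, nmt_model):
    """Run ASR, then NMT, and return (source_text, english_text) pairs.

    `asr_model` must provide `.transcribe(list_of_paths)` and `nmt_model`
    must provide `.translate(list_of_strings)`, like the models loaded above.
    """
    transcripts = asr_model.transcribe(audio_paths)
    translations = nmt_model.translate(transcripts)
    return list(zip(transcripts, translations))


# Example with the models and sample from this notebook:
# for source, english in transcribe_and_translate([audio_sample], asr_model, nmt_model):
#     print(source, "->", english)
```

Passing the models in as arguments keeps the helper independent of the notebook's global variables, so it can be reused with other language pairs.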


## Generate English audio from text
Speech generation from text typically has two steps:
* Generate a spectrogram from the text. In this example we will use the FastPitch model for this.
* Generate the actual audio from the spectrogram. In this example we will use the HifiGan model for this.


In [None]:
# A helper function which combines FastPitch and HifiGan to go directly from
# text to audio
def text_to_audio(text):
  parsed = spectrogram_generator.parse(text)
  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
  return audio.to('cpu').detach().numpy()

In [None]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)

[NeMo W 2023-09-20 05:33:40 fastpitch:291] parse() is meant to be called in eval mode.
[NeMo W 2023-09-20 05:33:40 fastpitch:368] generate_spectrogram() is meant to be called in eval mode.
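
To keep the generated speech rather than only play it inline, the samples can be written to a WAV file. This is a minimal standard-library sketch, assuming the samples are floats in the range [-1.0, 1.0] and using the same 22050 Hz rate passed to `IPython.display.Audio` above; `save_wav` is a hypothetical helper, not a NeMo function.

```python
import array
import sys
import wave


def save_wav(samples, path, sample_rate=22050):
    """Write float samples in [-1.0, 1.0] to a 16-bit mono WAV file."""
    # Clip each sample to [-1, 1], then scale to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm.tobytes())


# Example, assuming text_to_audio() returns an array of shape (1, num_samples):
# save_wav(text_to_audio(english_text[0])[0], "english.wav")
```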


## Next steps
A demo like this is great for prototyping and experimentation. For production deployment, however, you would want to use a service like [NVIDIA Riva](https://developer.nvidia.com/riva).

**NeMo is built for training.** You can fine-tune all the models used in this example, or train them from scratch on your own data. We recommend checking out the following more in-depth tutorials next:

* [NeMo fundamentals](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/00_NeMo_Primer.ipynb)
* [NeMo models](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/01_NeMo_Models.ipynb)
* [Speech Recognition](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_NeMo.ipynb)
* [Punctuation and Capitalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb)
* [Speech Synthesis](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/tts/Inference_ModelSelect.ipynb)


You can find scripts for training and fine-tuning ASR, NLP and TTS models [here](https://github.com/NVIDIA/NeMo/tree/main/examples).