# Audio translator



This notebook demonstrates several NeMo features by building an audio translator. The translator takes an audio file as input and produces an audio file in the target language.

You can find the online Google Colab notebook for this demo [here](https://colab.research.google.com/drive/1nSxiTzLYxA9_PPsEK9VU-JIW9orPkVQ1?usp=sharing).

# How does it work?

The audio translator:

*   Converts audio to written text using ASR
*   Translates the written text to the target language
*   Creates a TTS audio file in the target language from the translated text
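
The three steps above can be sketched as a chain of functions. This is only a minimal sketch: `translate_audio` and the three callables passed to it are hypothetical names, not part of NeMo, and the stand-in lambdas merely mimic the shape of the data that flows between the stages.

```python
from typing import Any, Callable, List


def translate_audio(
    audio_path: str,
    transcribe: Callable[[List[str]], List[str]],
    translate: Callable[[List[str]], List[str]],
    synthesize: Callable[[str], Any],
) -> Any:
    """Chain ASR -> translation -> TTS for a single audio file."""
    source_text = transcribe([audio_path])  # step 1: audio -> written text
    target_text = translate(source_text)    # step 2: text -> translated text
    return synthesize(target_text[0])       # step 3: translated text -> audio


# Stand-in stages (assumptions for illustration only, not real models).
demo = translate_audio(
    "recording.wav",
    transcribe=lambda paths: ["tengo once años"],
    translate=lambda texts: ["I'm eleven years old"],
    synthesize=lambda text: f"<audio for: {text}>",
)
print(demo)
```

In the rest of the notebook, the real NeMo models take the place of these stand-ins.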


The models in this notebook are loaded onto a GPU, so it works best on Google Colab with a recording that isn't too long (under 1 minute).

If you run into issues with your own recording, try a shorter one to see whether that works.
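
If you are unsure whether a recording is under the 1-minute mark, you can check its duration before running the rest of the notebook. The sketch below uses Python's standard `wave` module, so it only handles uncompressed WAV files (not MP3); the helper names `wav_duration_seconds` and `is_short_enough` are made up for this example.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_short_enough(path: str, max_seconds: float = 60.0) -> bool:
    """True if the recording is shorter than max_seconds."""
    return wav_duration_seconds(path) < max_seconds
```

For example, `is_short_enough('my-recording.wav')` would return `False` for a two-minute clip, where `my-recording.wav` is a placeholder for your own file.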

# Importing the tools used

First, let's install and import the right collections.

In [None]:
!pip install nemo_toolkit[all]

Collecting nemo_toolkit[all]
  Downloading nemo_toolkit-1.4.0-py3-none-any.whl (2.6 MB)
Collecting wget
  Downloading wget-3.2.zip (10 kB)
Collecting unidecode
  Downloading Unidecode-1.3.2-py3-none-any.whl (235 kB)
Collecting ruamel.yaml
  Downloading ruamel.yaml-0.17.17-py3-none-any.whl (109 kB)
Collecting frozendict
  Downloading frozendict-2.0.7-py3-none-any.whl (8.3 kB)
Collecting sentencepiece<1.0.0
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting onnx>=1.7.0
  Downloading onnx-1.10.2-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (12.7 MB)
Collecting pesq
  Downloading pesq-0.0.3.tar.gz (35

In [None]:
# From NeMo, we import the following:

# Import NeMo and its ASR, NLP and TTS collections
import nemo
# Import Speech Recognition collection
import nemo.collections.asr as nemo_asr
# Import Natural Language Processing collection
import nemo.collections.nlp as nemo_nlp
# Import Speech Synthesis collection
import nemo.collections.tts as nemo_tts

# To listen to our audio files
import IPython.display

[NeMo W 2021-11-07 11:27:59 optimizers:47] Apex was not found. Using the lamb optimizer will error out.
[NeMo W 2021-11-07 11:28:01 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali._AudioTextDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-07 11:28:04 experimental:28] Module <class 'nemo.collections.nlp.data.text_normalization.decoder_dataset.TextNormalizationDecoderDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-07 11:28:04 experimental:28] Module <class 'nemo.collections.nlp.data.text_normalization.tagger_dataset.TextNormalizationTaggerDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
[NeMo W 2021-11-07 11:28:04 experimental:28] Module <class 'nemo.collections.nlp.data.text_normalization.test_dataset.TextNormalizationTestDataset'> is experimental, not ready for pro

Next, we need to clarify which specific models from our collections we'd like to use. In our example, we use a Spanish recording as input, and we want our output to be in English.

We need:

*   An ASR model in the language of our audio file
*   A translation model that translates from the language of our audio file to our target language
*   A spectrogram generator in our target language
*   A vocoder that can turn our spectrogram into an audio file
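
One way to keep the four choices together is a small record of model names. The sketch below uses the pretrained model names that this notebook loads; the `TranslatorModels` class and the `languages_line_up` check are hypothetical helpers, and the check assumes the language codes sit in the positions seen in these particular names (for example `stt_es_...` and `nmt_es_en_...`), which may not hold for every model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslatorModels:
    """Names of the four pretrained models that make up one translator."""
    asr: str
    nmt: str
    spectrogram_generator: str
    vocoder: str


SPANISH_TO_ENGLISH = TranslatorModels(
    asr="stt_es_quartznet15x5",
    nmt="nmt_es_en_transformer12x2",
    spectrogram_generator="tts_en_fastpitch",
    vocoder="tts_hifigan",
)


def languages_line_up(models: TranslatorModels) -> bool:
    """Check ASR language == NMT source and NMT target == TTS language.

    Assumes names shaped like 'stt_<lang>_...', 'nmt_<src>_<tgt>_...' and
    'tts_<lang>_...'; this is an assumption based on the names used here.
    """
    asr_lang = models.asr.split("_")[1]
    nmt_src, nmt_tgt = models.nmt.split("_")[1:3]
    tts_lang = models.spectrogram_generator.split("_")[1]
    return asr_lang == nmt_src and nmt_tgt == tts_lang


print(languages_line_up(SPANISH_TO_ENGLISH))
```

A check like this can catch a mismatched pair early, such as a Spanish ASR model combined with a translation model that expects English input.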





In [None]:
# Speech Recognition model 
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_es_quartznet15x5").cuda()

# Neural Machine Translation model 
nmt_model = nemo_nlp.models.MTEncDecModel.from_pretrained(model_name='nmt_es_en_transformer12x2').cuda()

# Spectrogram generator, which takes text as input and produces a spectrogram
spectrogram_generator = nemo_tts.models.FastPitchModel.from_pretrained(model_name="tts_en_fastpitch").cuda()

# Vocoder model, which takes a spectrogram and produces the actual audio
vocoder = nemo_tts.models.HifiGanModel.from_pretrained(model_name="tts_hifigan").cuda()

[NeMo I 2021-11-07 11:28:05 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/stt_es_quartznet15x5/versions/1.0.0rc1/files/stt_es_quartznet15x5.nemo to /root/.cache/torch/NeMo/NeMo_1.4.0/stt_es_quartznet15x5/a65f9c865cfd58f57bfba25a7e44e8e2/stt_es_quartznet15x5.nemo
[NeMo I 2021-11-07 11:28:12 common:702] Instantiating model from pre-trained checkpoint


[NeMo W 2021-11-07 11:28:13 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /raid/noneval.json
    sample_rate: 16000
    labels:
    - ' '
    - a
    - b
    - c
    - d
    - e
    - f
    - g
    - h
    - i
    - j
    - k
    - l
    - m
    - 'n'
    - o
    - p
    - q
    - r
    - s
    - t
    - u
    - v
    - w
    - x
    - 'y'
    - z
    - ''''
    - á
    - é
    - í
    - ó
    - ú
    - ñ
    - ü
    batch_size: 16
    trim_silence: true
    max_duration: 16.7
    shuffle: true
    is_tarred: false
    tarred_audio_filepaths: null
    num_workers: 8
    pin_memory: true
    
[NeMo W 2021-11-07 11:28:13 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup th

[NeMo I 2021-11-07 11:28:13 features:262] PADDING: 16
[NeMo I 2021-11-07 11:28:13 features:279] STFT using torch
[NeMo I 2021-11-07 11:28:34 save_restore_connector:143] Model EncDecCTCModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.4.0/stt_es_quartznet15x5/a65f9c865cfd58f57bfba25a7e44e8e2/stt_es_quartznet15x5.nemo.
[NeMo I 2021-11-07 11:28:34 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/nmt_es_en_transformer12x2/versions/1.0.0rc1/files/nmt_es_en_transformer12x2.nemo to /root/.cache/torch/NeMo/NeMo_1.4.0/nmt_es_en_transformer12x2/42fbff52240a2c8cb1127d2a97201f6d/nmt_es_en_transformer12x2.nemo
[NeMo I 2021-11-07 11:29:20 common:702] Instantiating model from pre-trained checkpoint
[NeMo I 2021-11-07 11:29:40 tokenizer_utils:136] Getting YouTokenToMeTokenizer with model: /tmp/tmp8xvj8hvi/tokenizer.32000.BPE.model with r2l: False.
[NeMo I 2021-11-07 11:29:40 tokenizer_utils:136] Getting YouTokenToMeTokenizer with model: /tmp/tmp8xvj8hvi/to

[NeMo W 2021-11-07 11:29:40 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    src_file_name: /raid/sharded_tarfiles_60_even/batches.tokens.16000._OP_1..302_CL_.tar
    tgt_file_name: /raid/sharded_tarfiles_60_even/batches.tokens.16000._OP_1..302_CL_.tar
    tokens_in_batch: 16000
    clean: true
    max_seq_length: 512
    cache_ids: false
    cache_data_per_node: false
    use_cache: false
    shuffle: true
    num_samples: -1
    drop_last: false
    pin_memory: false
    num_workers: 8
    load_from_cached_dataset: false
    reverse_lang_direction: true
    load_from_tarred_dataset: true
    metadata_path: /raid/sharded_tarfiles_60_even/metadata.json
    tar_shuffle_n: 100
    
[NeMo W 2021-11-07 11:29:40 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validat

[NeMo I 2021-11-07 11:29:46 save_restore_connector:143] Model MTEncDecModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.4.0/nmt_es_en_transformer12x2/42fbff52240a2c8cb1127d2a97201f6d/nmt_es_en_transformer12x2.nemo.
[NeMo I 2021-11-07 11:29:46 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_en_fastpitch/versions/1.0.0/files/tts_en_fastpitch.nemo to /root/.cache/torch/NeMo/NeMo_1.4.0/tts_en_fastpitch/9651f9eb32324e98f965b98e94978217/tts_en_fastpitch.nemo
[NeMo I 2021-11-07 11:29:58 common:702] Instantiating model from pre-trained checkpoint


[NeMo W 2021-11-07 11:30:00 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    manifest_filepath: /raid/LJSpeech/nvidia_ljspeech_train.json
    max_duration: null
    min_duration: 0.1
    sample_rate: 22050
    trim: false
    parser: null
    drop_last: true
    shuffle: true
    batch_size: 48
    num_workers: 12
    
[NeMo W 2021-11-07 11:30:00 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    manifest_filepath: /raid/LJSpeech/nvidia_ljspeech_val.json
    max_duration: null
    min_duration: 0.1
    sample_rate: 22050
    trim: false
    parser: null
    drop_last: false
    shuffle: false
    batch_size: 48
    num_workers: 8
   

[NeMo I 2021-11-07 11:30:00 features:262] PADDING: 1
[NeMo I 2021-11-07 11:30:00 features:279] STFT using torch
[NeMo I 2021-11-07 11:30:01 save_restore_connector:143] Model FastPitchModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.4.0/tts_en_fastpitch/9651f9eb32324e98f965b98e94978217/tts_en_fastpitch.nemo.
[NeMo I 2021-11-07 11:30:01 cloud:66] Downloading from: https://api.ngc.nvidia.com/v2/models/nvidia/nemo/tts_hifigan/versions/1.0.0rc1/files/tts_hifigan.nemo to /root/.cache/torch/NeMo/NeMo_1.4.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo
[NeMo I 2021-11-07 11:30:19 common:702] Instantiating model from pre-trained checkpoint


[NeMo W 2021-11-07 11:30:23 modelPT:131] If you intend to do training or fine-tuning, please call the ModelPT.setup_training_data() method and provide a valid configuration file to setup the train data loader.
    Train config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/train_finetune.txt
      min_duration: 0.75
      n_segments: 8192
    dataloader_params:
      drop_last: false
      shuffle: true
      batch_size: 64
      num_workers: 4
    
[NeMo W 2021-11-07 11:30:23 modelPT:138] If you intend to do validation, please call the ModelPT.setup_validation_data() or ModelPT.setup_multiple_validation_data() method and provide a valid configuration file to setup the validation data loader(s). 
    Validation config : 
    dataset:
      _target_: nemo.collections.tts.data.datalayers.MelAudioDataset
      manifest_filepath: /home/fkreuk/data/val_finetune.txt
      min_duration: 3
      n_segments: 66150


[NeMo I 2021-11-07 11:30:23 features:262] PADDING: 0
[NeMo I 2021-11-07 11:30:23 features:279] STFT using torch


[NeMo W 2021-11-07 11:30:23 features:240] Using torch_stft is deprecated and will be removed in 1.1.0. Please set stft_conv and stft_exact_pad to False for FilterbankFeatures and AudioToMelSpectrogramPreprocessor. Please set exact_pad to True as needed.


[NeMo I 2021-11-07 11:30:23 features:262] PADDING: 0
[NeMo I 2021-11-07 11:30:23 features:279] STFT using torch
[NeMo I 2021-11-07 11:30:24 save_restore_connector:143] Model HifiGanModel was successfully restored from /root/.cache/torch/NeMo/NeMo_1.4.0/tts_hifigan/e6da322f0f7e7dcf3f1900a9229a7e69/tts_hifigan.nemo.


In [None]:
# If you'd like to use this notebook with other models, uncomment this block to list the available ones
# nemo_nlp.models.MTEncDecModel.list_available_models()
# nemo_asr.models.EncDecCTCModel.list_available_models()
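
The listings returned by `list_available_models()` can be long, so it may help to filter them by name. The sketch below is a hypothetical helper that assumes each entry exposes a `pretrained_model_name` attribute, as the entries in the NeMo version used here appear to; the stand-in entries are built from the model names loaded earlier in this notebook so that the example runs without NeMo.

```python
from types import SimpleNamespace
from typing import Iterable, List


def model_names_with_prefix(models: Iterable, prefix: str) -> List[str]:
    """Return the sorted model names that start with the given prefix."""
    return sorted(
        m.pretrained_model_name
        for m in models
        if m.pretrained_model_name.startswith(prefix)
    )


# Stand-in entries (illustration only) using the names from this notebook.
stand_in_listing = [
    SimpleNamespace(pretrained_model_name=name)
    for name in (
        "stt_es_quartznet15x5",
        "nmt_es_en_transformer12x2",
        "tts_en_fastpitch",
        "tts_hifigan",
    )
]
print(model_names_with_prefix(stand_in_listing, "tts_"))
```

In the notebook itself you would pass the real listing instead, for example `model_names_with_prefix(nemo_nlp.models.MTEncDecModel.list_available_models(), "nmt_es_")`.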

# Let's start translating

Add the path to the audio file you would like to have translated.

In [None]:
# Feel free to add your own audio here, but if you don't have an audio sample yet, you can use the following
!wget 'https://www.lightbulblanguages.co.uk/resources/sp-audio/tengo-once-anos.mp3'

--2021-11-07 11:30:25--  https://www.lightbulblanguages.co.uk/resources/sp-audio/tengo-once-anos.mp3
Resolving www.lightbulblanguages.co.uk (www.lightbulblanguages.co.uk)... 65.39.193.60
Connecting to www.lightbulblanguages.co.uk (www.lightbulblanguages.co.uk)|65.39.193.60|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14750 (14K) [audio/mpeg]
Saving to: ‘tengo-once-anos.mp3’


2021-11-07 11:30:26 (83.9 KB/s) - ‘tengo-once-anos.mp3’ saved [14750/14750]



In [None]:
# Path to the audio sample we'll translate (downloaded in the previous cell)
# IMPORTANT: The audio must be mono with a 16 kHz sampling rate
audio_sample = 'tengo-once-anos.mp3'
# To use your own recording instead, point audio_sample at it, for example:
# audio_sample = 'amarens-wind-09092021-spanish.wav'

# To listen to it, click on the play button below
IPython.display.Audio(audio_sample)
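
If you bring your own WAV recording, you can check the mono and 16 kHz requirement before transcribing. This is a minimal sketch using Python's standard `wave` module; `check_asr_input` is a hypothetical helper name, and the check only works for uncompressed WAV files (opening an MP3 this way raises `wave.Error`).

```python
import wave
from typing import List


def check_asr_input(path: str, expected_rate: int = 16000) -> List[str]:
    """Return a list of problems found in a WAV file (empty if it looks fine)."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        rate = wav_file.getframerate()
    if channels != 1:
        problems.append(f"expected mono audio, found {channels} channels")
    if rate != expected_rate:
        problems.append(f"expected {expected_rate} Hz, found {rate} Hz")
    return problems
```

An empty list means the file matches the stated requirement; otherwise, resample or downmix the recording with an audio tool of your choice before continuing.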

Next, we'll transcribe the text from the audio sample and print the transcribed text. 

In [None]:
transcribed_text = asr_model.transcribe([audio_sample])
print(transcribed_text)

Transcribing:   0%|          | 0/1 [00:00<?, ?it/s]

[NeMo W 2021-11-07 11:31:42 patch_utils:50] torch.stft() signature has been updated for PyTorch 1.7+
    Please update PyTorch to remain compatible with later versions of NeMo.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
      return torch.floor_divide(self, other)
    


['tengo once años']


Then, we translate the transcribed text to our target language.

In [None]:
english_text = nmt_model.translate(transcribed_text)
print(english_text)

["I'm eleven years old"]


Lastly, we convert the translated text into speech using a spectrogram generator and a vocoder.

In [None]:
# A helper function which combines FastPitch and HifiGan to go directly from 
# text to audio
def text_to_audio(text):
  parsed = spectrogram_generator.parse(text)
  spectrogram = spectrogram_generator.generate_spectrogram(tokens=parsed)
  audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
  return audio.to('cpu').detach().numpy()

Now we have our output.

In [None]:
# Listen to generated audio in English
IPython.display.Audio(text_to_audio(english_text[0]), rate=22050)
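
The cell above only plays the result. To keep it as an audio file, the samples can be written to disk. The sketch below is one possible approach that uses only the Python standard library; `save_wav` is a hypothetical helper, and it assumes the samples are mono floats in the range [-1, 1] at the 22050 Hz rate used for playback above.

```python
import array
import wave
from typing import Iterable


def save_wav(path: str, samples: Iterable[float], rate: int = 22050) -> int:
    """Write mono float samples as 16-bit PCM WAV; return the frame count."""
    # Clip each sample to [-1, 1] and scale it to the signed 16-bit range.
    pcm = array.array(
        "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)
```

Assuming `text_to_audio` returns an array of shape `(1, N)`, a call such as `save_wav('translated.wav', text_to_audio(english_text[0])[0])` would write the English speech to `translated.wav`, a file name chosen here only as an example.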