<a href="https://colab.research.google.com/github/imrankedim/Q-A-NLP/blob/master/Ikedim_Amharic_TTS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Amharic Text to Speech Synthesis
-------------------------------------------------------------------
*Author: NVIDIA*

*Customized by: Emran Abdulkdim*

**WaveGlow model for generating speech from mel spectrograms (generated by Tacotron2)**

<img src="https://pytorch.org/assets/images/waveglow_diagram.png" alt="alt" width="50%"/>

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [3]:
import torch
print(torch.__version__)

1.6.0+cu101


In [4]:
import torch
AmharicTTS = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')

Downloading: "https://github.com/nvidia/DeepLearningExamples/archive/torchhub.zip" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth


The cell above loads the WaveGlow model pre-trained on the [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).

### Model Description

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesise natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model (also available via torch.hub) produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow is a flow-based model that consumes the mel spectrograms to generate speech.
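
To make the link between mel spectrograms and audio concrete, the sketch below estimates how long the generated speech is for a given number of mel frames. It assumes a hop of 256 samples per frame (the stride of the `upsample` layer in the WaveGlow printout further down) and the 22,050 Hz sampling rate of LJ Speech; the function name `mel_frames_to_seconds` is made up for this illustration and is not part of the NVIDIA code.

```python
HOP_LENGTH = 256     # assumed: stride of WaveGlow's upsample layer, as printed in this notebook
SAMPLE_RATE = 22050  # assumed: LJ Speech sampling rate in Hz


def mel_frames_to_seconds(n_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Approximate audio duration for a mel spectrogram with n_frames time steps."""
    # Each mel frame is expanded to roughly hop_length audio samples.
    return n_frames * hop_length / sample_rate


if __name__ == "__main__":
    for frames in (100, 500, 1000):
        print(f"{frames} frames is about {mel_frames_to_seconds(frames):.2f} s of audio")
```

This is only a rough estimate; the actual output length depends on how the vocoder trims the upsampled spectrogram.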

### Example

In the example below:
- pretrained Tacotron2 and WaveGlow models are loaded from torch.hub
- Tacotron2 generates a mel spectrogram from a tensor representation of the input text (here, an Amharic sentence)
- WaveGlow generates sound from the mel spectrogram
- the output sound is saved in a 'sample_amharic_audio.wav' file

To run the example you need some extra Python packages installed.
These are needed for preprocessing the text and audio, as well as for display and input / output.

In [5]:
%%bash
pip install numpy scipy librosa unidecode inflect

Collecting unidecode
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.1.1
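
Once the install finishes, a quick check can confirm that every package is importable before the models are run. This is a minimal sketch using only the standard library; the helper name `missing_packages` is an assumption made for this example.

```python
import importlib.util

# Import names of the packages installed in the cell above.
REQUIRED = ["numpy", "scipy", "librosa", "unidecode", "inflect"]


def missing_packages(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, re-running the install cell (or restarting the runtime) is usually the first thing to try.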


In [6]:
import numpy as np
from scipy.io.wavfile import write

Prepare the WaveGlow model for inference

In [7]:
AmharicTTS = AmharicTTS.remove_weightnorm(AmharicTTS)
AmharicTTS = AmharicTTS.to('cuda')
AmharicTTS.eval()

WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0): WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (1): Conv1d(51

Load Tacotron2 from PyTorch Hub



In [8]:
tacotron2ForAmharicTTS = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2ForAmharicTTS = tacotron2ForAmharicTTS.to('cuda')
tacotron2ForAmharicTTS.eval()

Using cache found in /root/.cache/torch/hub/nvidia_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2pyt_fp32/versions/1/files/nvidia_tacotron2pyt_fp32_20190306.pth


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (2): Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (lin

Now, let's make the model read Amharic text


In [9]:
amharicText1 = ("ኮሮና ቫይረስ ወረርሽኙን መከላከያ መንገዶች ከባድ አይደሉም ነገር ግን " + 
               "መሰላቸት እና መዘናጋት ይታያል ሁላችንም እኩል ኃላፊነት ሊሰማን ይገባል።")
amharicText2 = "ከመሞቱ በፊት የተወሰደው የኮሮናቫይረስ ናሙና ውጤት ፖዘቲቭ የሆነ አስከሬኑን ዘመዶች ወሰዱት። "
amharicText3 = "ጠ/ሚ ዶ/ር ዐቢይ ዕርቅን ማውረድ ለቀጠናው ውህደት መሠረት ነው ሲሉ ተናገሩ"
amharicText4 = "ባለፉት 24 ሰዓት በተደረገው 17,323 የላብራቶሪ ምርመራ 1,038 ሰዎች በቫይረሱ ተይዘዋል"
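
The sentences above are written in Ethiopic script, while the pretrained Tacotron2 checkpoint was trained on LJ Speech. The `transliteration_cleaners` option used in the next cell is, as far as the NVIDIA text module documents it, a pipeline for non-English text that transliterates to ASCII before symbol lookup. The sketch below only inspects a string to show how much of it lies outside ASCII; `script_summary` is a hypothetical helper and does not reproduce the cleaner itself.

```python
import unicodedata
from collections import Counter


def script_summary(text):
    """Count characters by rough class: 'ethiopic', 'ascii', or 'other'."""
    counts = Counter()
    for ch in text:
        if unicodedata.name(ch, "").startswith("ETHIOPIC"):
            counts["ethiopic"] += 1   # Ethiopic syllables, digits and punctuation
        elif ord(ch) < 128:
            counts["ascii"] += 1      # spaces, Latin letters, Western digits
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sample = "ባለፉት 24 ሰዓት"
    print(script_summary(sample))
```

A high 'ethiopic' count means almost the whole input depends on the transliteration step, so the quality of that romanisation has a direct effect on what the model pronounces.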

Now chain pre-processing -> Tacotron2 -> WaveGlow

In [10]:
# preprocessing
sequence = np.array(tacotron2ForAmharicTTS.text_to_sequence(amharicText2, ['transliteration_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)

# run the models
with torch.no_grad():
    _, mel, _, _ = tacotron2ForAmharicTTS.infer(sequence)
    audio = AmharicTTS.infer(mel)
amharic_audio_numpy = audio[0].data.cpu().numpy()
rate = 22050  # LJ Speech sampling rate in Hz

You can write it to a file and listen to it

In [11]:
write("sample_amharic_audio.wav", rate, amharic_audio_numpy)
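
`scipy.io.wavfile.write` chooses the WAV sample format from the array's dtype, so a float32 array is stored as 32-bit floating-point audio. If a player or downstream tool expects 16-bit PCM, the samples can be converted first. The following is a minimal sketch using only the standard library; `write_pcm16` is a name invented here, and it assumes the samples lie in [-1.0, 1.0] (values outside that range are clipped).

```python
import math
import struct
import wave


def write_pcm16(path, samples, rate=22050):
    """Write an iterable of floats in [-1.0, 1.0] as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))        # clip to the valid range
        frames += struct.pack("<h", int(round(clipped * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))


if __name__ == "__main__":
    # Demo: one second of a 440 Hz tone at half amplitude.
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
    write_pcm16("tone_pcm16.wav", tone)
```

With the notebook's variables this could be called as `write_pcm16("sample_amharic_audio_pcm16.wav", amharic_audio_numpy)`, since a one-dimensional NumPy array can be iterated like a list of floats.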

Alternatively, play it right away in a notebook with IPython widgets

In [12]:
from IPython.display import Audio
Audio(amharic_audio_numpy, rate=rate)

### Details
For detailed information on model input and output, training recipes, inference, and performance, visit [GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) and/or [NGC](https://ngc.nvidia.com/catalog/model-scripts/nvidia:tacotron_2_and_waveglow_for_pytorch).

### References

 - [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
 - [WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002)
 - [Tacotron2 and WaveGlow on NGC](https://ngc.nvidia.com/catalog/model-scripts/nvidia:tacotron_2_and_waveglow_for_pytorch)
 - [Tacotron2 and WaveGlow on GitHub](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)