# Matcha TTS + vocos

This notebook shows how to synthesize audio from a mel spectrogram generated with Matcha, using the Vocos model as the vocoder.

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
%%capture
!pip install matcha-tts
!apt-get install -y espeak
!pip install git+https://github.com/langtech-bsc/vocos.git@matcha

Let's generate a sentence with Matcha. By default, the model creates an audio file with HiFi-GAN and also saves the mel spectrogram to a .npy file.


In [3]:
!matcha-tts --text "Hello This is Matcha TTS with vocos vocoder" --model "matcha_vctk" --spk 5

[-] GPU not available or forced CPU run! Using CPU
[!] Configurations: 
	- Model: matcha_vctk
	- Vocoder: hifigan_univ_v1
	- Temperature: 0.667
	- Speaking rate: 0.85
	- Number of ODE steps: 10
	- Speaker: 5
[+] Model already present at /root/.local/share/matcha_tts/matcha_vctk.ckpt!
[+] Model already present at /root/.local/share/matcha_tts/hifigan_univ_v1!
[!] Loading matcha_vctk!
[+] matcha_vctk loaded!
[!] Loading hifigan_univ_v1!
Removing weight norm...
[+] hifigan_univ_v1 loaded!
[1] - Input text: Hello This is Matcha TTS with vocos vocoder
[1] - Phonetised text: həlˈoʊ ðɪs ɪz mˈætʃə tˌiːtˌiːˈɛs wɪð vˈoʊkoʊz vˈoʊkoʊdɚ
[🍵] Whisking Matcha-T(ea)TS for: 1
[🍵-1] Matcha-TTS RTF: 0.4023
[🍵-1] Matcha-TTS + VOCODER RTF: 1.7533
[+] Waveform saved: /content/utterance_001_speaker_005.wav
[🍵] Average Matcha-TTS RTF: 0.4023 ± 0.0
[🍵] Average Matcha-TTS + VOCODER RTF: 1.7533 ± 0.0
[🍵] Enjoy the freshly whisked 🍵 Matcha-TTS!


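We can now read the mel spectrogram and pass it to the Vocos model to generate the audio. Note how much lower the Vocos RTF (real-time factor: inference time divided by the duration of the generated audio) is than HiFi-GAN's. In the run above, adding the HiFi-GAN vocoder raised the RTF from 0.4023 to 1.7533.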

In [5]:
from vocos import Vocos
import numpy as np
import torch
from IPython.display import Audio, display
from time import perf_counter

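# Load the pretrained Vocos vocoder from the Hugging Face Hub
vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz")

# Load the mel spectrogram that matcha-tts saved alongside the waveform
mel = torch.tensor(np.load("/content/utterance_001_speaker_005.npy"))

# Time only the vocoder's decoding step
t0 = perf_counter()
audio = vocos.decode(mel)
vocos_infer_secs = perf_counter() - t0

# RTF = inference time / duration of the generated audio (22050 Hz sample rate)
print(f"RTF: {vocos_infer_secs / (audio.shape[1] / 22050)}")
print("Hifigan")
display(Audio(filename="/content/utterance_001_speaker_005.wav"))
print("Vocos")
display(Audio(data=audio, rate=22050))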


RTF: 0.05597393930363124
Hifigan


Vocos

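A single timing like the one above can be noisy, so it may help to average the RTF over a few runs. The following is a minimal sketch, not part of Matcha or Vocos: `real_time_factor` and `timed_rtf` are hypothetical helper names, and `synthesize` is assumed to be any zero-argument callable that returns the number of audio samples it produced.

```python
from statistics import mean
from time import perf_counter


def real_time_factor(infer_secs, num_samples, sample_rate=22050):
    """Return inference time divided by the duration of the generated audio."""
    return infer_secs / (num_samples / sample_rate)


def timed_rtf(synthesize, runs=3, sample_rate=22050):
    """Average the RTF of `synthesize` over several runs.

    `synthesize` is a hypothetical zero-argument callable that returns
    the number of audio samples it generated.
    """
    rtfs = []
    for _ in range(runs):
        t0 = perf_counter()
        num_samples = synthesize()
        elapsed = perf_counter() - t0
        rtfs.append(real_time_factor(elapsed, num_samples, sample_rate))
    return mean(rtfs)
```

With the objects from the previous cell, this could be called as `timed_rtf(lambda: vocos.decode(mel).shape[1])`. An RTF below 1 means the audio is generated faster than it takes to play back.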