# Matcha-TTS + Vocos

This notebook shows how to turn a mel spectrogram generated with Matcha-TTS into audio using the Vocos vocoder.

In [1]:
from huggingface_hub import notebook_login
notebook_login()


In [2]:
!pip install matcha-tts
!apt-get install espeak
!pip install git+https://github.com/langtech-bsc/vocos.git@matcha

Collecting matcha-tts
  Downloading matcha_tts-0.0.5.1-cp310-cp310-manylinux1_x86_64.whl (302 kB)
Collecting lightning>=2.0.0 (from matcha-tts)
  Downloading lightning-2.2.1-py3-none-any.whl (2.1 MB)
Collecting torchmetrics>=0.11.4 (from matcha-tts)
  Downloading torchmetrics-1.3.2-py3-none-any.whl (841 kB)
Collecting hydra-core==1.3.2 (from matcha-tts)
  Downloading hydra_core-1.3.2-py3-none-any.whl (154 kB)
Collecting hydra-colorlog==1.2.0 (from matcha-tts)
  Downloading hydra_colorlog-1.

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  espeak-data libespeak1 libportaudio2 libsonic0
The following NEW packages will be installed:
  espeak espeak-data libespeak1 libportaudio2 libsonic0
0 upgraded, 5 newly installed, 0 to remove and 39 not upgraded.
Need to get 1,382 kB of archives.
After this operation, 3,178 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libportaudio2 amd64 19.6.0-1.1 [65.3 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/main amd64 libsonic0 amd64 0.2.0-11build1 [10.3 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 espeak-data amd64 1.48.15+dfsg-3 [1,085 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libespeak1 amd64 1.48.15+dfsg-3 [156 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe amd64 espeak amd64 1.48.15+dfsg-3 [64.2 kB]
Fetched 1,382 kB in 2s (690 kB

Let's synthesize a sentence with Matcha-TTS. By default, the command creates an audio file with the HiFi-GAN vocoder and also saves the mel spectrogram to a .npy file.


In [3]:
!matcha-tts --text "Hello This is Matcha TTS with vocos vocoder" --model "matcha_vctk" --spk 5

[-] GPU not available or forced CPU run! Using CPU
[!] Configurations: 
	- Model: matcha_vctk
	- Vocoder: hifigan_univ_v1
	- Temperature: 0.667
	- Speaking rate: 0.85
	- Number of ODE steps: 10
	- Speaker: 5
[-] Model not found at /root/.local/share/matcha_tts/matcha_vctk.ckpt! Will download it
[-] Model not found at /root/.local/share/matcha_tts/hifigan_univ_v1! Will download it
[!] Loading matcha_vctk!
[+] matcha_vctk loaded!
[!] Loading hifigan_univ_v1!
Removing weight norm...
[+] hifigan_univ_v1 loaded!
[1] - Input text: Hello This is Matcha TTS with vocos vocoder
[1] - Phonetised text: həlˈoʊ ðɪs ɪz mˈætʃə tˌiːtˌiːˈɛs wɪð vˈoʊkoʊz vˈoʊkoʊdɚ
[🍵] Whisking Matcha-T(ea)TS for: 1
[🍵-1] Matcha-TTS RTF: 0.3600
[🍵-1] Matcha-TTS + VOCODER RTF: 2.0542
[+] Waveform saved: /content/utterance_001_speaker_005.wav
[🍵] Average Matcha-TTS RTF: 0.3600 ± 0.0
[🍵] Average Matcha-TTS + VOCODER RTF: 2.0542 ± 0.0
[🍵] Enjoy the freshly whisked 🍵 Matcha-TTS!
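
Before handing the saved spectrogram to another vocoder, it can help to check how much audio it should produce. The sketch below converts a frame count into a duration. The hop length of 256 samples is an assumption about the mel front end, not something shown in this notebook, so check it against the model configuration; the helper name `frames_to_seconds` is purely illustrative.

```python
SAMPLE_RATE = 22050  # playback rate used in this notebook
HOP_LENGTH = 256     # ASSUMPTION: samples per mel frame; verify in the model config


def frames_to_seconds(n_frames: int,
                      hop_length: int = HOP_LENGTH,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Approximate duration of the audio decoded from `n_frames` mel frames."""
    if n_frames < 0:
        raise ValueError("n_frames must be non-negative")
    return n_frames * hop_length / sample_rate


# With the real file, the frame count would come from the last axis, e.g.
#   mel = np.load("/content/utterance_001_speaker_005.npy")
#   n_frames = mel.shape[-1]
# The value below is a made-up stand-in so the snippet runs on its own.
n_frames = 259
print(f"approx. duration: {frames_to_seconds(n_frames):.2f} s")
```

If the duration computed this way differs noticeably from the length of the generated waveform, the assumed hop length or sample rate probably does not match the model.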


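We can now load the mel spectrogram and pass it to the Vocos model to generate the audio. Note that the RTF measured in this cell covers the vocoder only, whereas the 2.0542 logged above covers Matcha-TTS plus HiFi-GAN; subtracting the Matcha-TTS RTF (0.3600) leaves roughly 1.69 for HiFi-GAN alone, which is still far higher than the Vocos RTF on the same CPU.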

In [4]:
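from time import perf_counter

import numpy as np
import torch
from IPython.display import Audio, display
from vocos import Vocos

SAMPLE_RATE = 22050  # 22.05 kHz, as used throughout this notebook

vocos = Vocos.from_pretrained("BSC-LT/vocos-mel-22khz")

# Load the mel spectrogram saved by matcha-tts in the previous cell
mel = torch.from_numpy(np.load("/content/utterance_001_speaker_005.npy"))

# Time only the vocoder call; gradients are not needed for inference
t0 = perf_counter()
with torch.no_grad():
    audio = vocos.decode(mel)
vocos_infer_secs = perf_counter() - t0

# RTF = inference time / duration of the generated audio
print(f"RTF: {vocos_infer_secs / (audio.shape[-1] / SAMPLE_RATE)}")
print("Hifigan")
display(Audio(filename="/content/utterance_001_speaker_005.wav"))
print("Vocos")
display(Audio(data=audio.squeeze().cpu().numpy(), rate=SAMPLE_RATE))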


RTF: 0.05771143309665032
Hifigan


Vocos
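
To make the comparison explicit, the sketch below wraps the RTF formula used in the cell above in a small helper and estimates a vocoder-only figure for HiFi-GAN from the numbers in the CLI log. The helper names `real_time_factor` and `timed` are illustrative, and the subtraction assumes that the combined "Matcha-TTS + VOCODER" figure is simply the sum of the two stages, so treat the result as a rough estimate from a single CPU run.

```python
from time import perf_counter


def real_time_factor(elapsed_secs: float, n_samples: int,
                     sample_rate: int = 22050) -> float:
    """RTF = processing time / duration of the generated audio."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return elapsed_secs / (n_samples / sample_rate)


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    t0 = perf_counter()
    result = fn(*args, **kwargs)
    return result, perf_counter() - t0


# Figures copied from the outputs above
matcha_rtf = 0.3600
matcha_plus_hifigan_rtf = 2.0542
vocos_rtf = 0.05771143309665032

# ASSUMPTION: combined RTF = acoustic model RTF + vocoder RTF
hifigan_rtf_estimate = matcha_plus_hifigan_rtf - matcha_rtf

print(f"HiFi-GAN RTF (estimated): {hifigan_rtf_estimate:.4f}")
print(f"Vocos RTF (measured):     {vocos_rtf:.4f}")
print(f"approx. speed-up:         {hifigan_rtf_estimate / vocos_rtf:.1f}x")
```

In the notebook, `timed(vocos.decode, mel)` could replace the manual `perf_counter()` bookkeeping. A single timed call also includes one-off warm-up costs, so averaging over a few runs would give a steadier number.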
