💥 This repository contains the speech generation test set proposed in our work Seed-TTS, along with the scripts for computing the evaluation metrics. For security reasons, the source code and model weights of Seed-TTS will not be released. You are welcome to try the speech generation feature in ByteDance products. 💥
To evaluate the zero-shot speech generation ability of our model, we propose an out-of-domain objective evaluation test set. This test set consists of samples extracted from English (EN) and Mandarin (ZH) public corpora that are used to measure the model's performance on various objective metrics. Specifically, we employ 1,000 samples from the Common Voice dataset and 2,000 samples from the DiDiSpeech-2 dataset.
To install all dependencies, run:
pip3 install -r requirements.txt
The word error rate (WER) and speaker similarity (SIM) metrics are adopted for objective evaluation.
- For WER, we employ Whisper-large-v3 and Paraformer-zh as the automatic speech recognition (ASR) engines for English and Mandarin, respectively.
- For SIM, we use WavLM-large fine-tuned on the speaker verification task (model link) to extract speaker embeddings, then compute the cosine similarity between the embedding of each test utterance and that of its reference clip.
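The two metrics above can be illustrated with a minimal, dependency-free sketch. It is not the released evaluation code: `word_error_rate` and `cosine_similarity` are hypothetical helper names, the inputs are assumed to be already-normalized, whitespace-tokenized transcripts, and the embeddings are assumed to be plain lists of floats. The actual scripts may normalize and tokenize text differently, particularly for Mandarin, where whitespace splitting would not apply.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

In this sketch, a lower WER means the ASR transcript of the synthesized audio is closer to the target text, and a SIM closer to 1 means the speaker embeddings of the two clips point in nearly the same direction.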
You can download the test set for all tasks from this link. The test set is organized mainly through meta files. Each line of a meta file has the following fields: filename | the text of the prompt | the audio of the prompt | the text to be synthesized | the ground-truth counterpart of the text to be synthesized (if it exists). Different tasks use different meta files:
- Zero-shot text-to-speech (TTS):
  - EN: seed-tts-eval/en/meta.lst
  - ZH: seed-tts-eval/zh/meta.lst
  - ZH (hard case): seed-tts-eval/zh/hardcase.lst
- Zero-shot voice conversion (VC):
  - EN: seed-tts-eval/en/non_para_reconstruct_meta.lst
  - ZH: seed-tts-eval/zh/non_para_reconstruct_meta.lst
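As a rough illustration of the meta file layout described above, the following sketch parses one line into named fields. It assumes the fields are separated by a literal `|`, that no field contains that character, and that the fifth field may be absent; `MetaEntry` and `parse_meta_line` are hypothetical names, not part of the released scripts.

```python
from typing import NamedTuple, Optional


class MetaEntry(NamedTuple):
    filename: str
    prompt_text: str
    prompt_audio: str
    target_text: str
    ground_truth: Optional[str]  # None when the line has no fifth field


def parse_meta_line(line: str) -> MetaEntry:
    """Split one meta-file line on '|' into its four or five fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("|")]
    if len(fields) == 4:
        fields.append("")  # ground truth is optional
    if len(fields) != 5:
        raise ValueError(f"expected 4 or 5 fields, got {len(fields)}: {line!r}")
    filename, prompt_text, prompt_audio, target_text, ground_truth = fields
    return MetaEntry(
        filename, prompt_text, prompt_audio, target_text, ground_truth or None
    )
```

For example, `parse_meta_line("utt1|hello there|prompts/utt1.wav|good morning")` would yield an entry whose `ground_truth` is `None`; the file names here are made up for illustration.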
We also release the evaluation code for both metrics:
# WER
bash cal_wer.sh {the path of the meta file} {the directory of synthesized audio} {language: zh or en}
# SIM
bash cal_sim.sh {the path of the meta file} {the directory of synthesized audio} {path/wavlm_large_finetune.pth}
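Before running either script, it can help to confirm that every entry in the meta file has a synthesized counterpart. The sketch below assumes each synthesized file is named after the first meta field with a `.wav` extension and sits directly in the synthesized audio directory; that naming convention is an assumption, so check it against the scripts before relying on it. `missing_outputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_outputs(meta_path, synth_dir, extension=".wav"):
    """Return the meta filenames that have no synthesized audio file.

    Assumes (not confirmed by the README) that the synthesized file for an
    entry is '<filename><extension>' inside synth_dir.
    """
    synth_dir = Path(synth_dir)
    missing = []
    with open(meta_path, encoding="utf-8") as meta_file:
        for line in meta_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name = line.split("|", 1)[0].strip()
            if not (synth_dir / f"{name}{extension}").is_file():
                missing.append(name)
    return missing
```

An empty list from `missing_outputs` suggests the directory covers the whole meta file under the assumed naming scheme.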