Awesome Zero Shot TTS


Gallery

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
[2407.02243] Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment, arXiv, 2406.17957, arxiv, pdf, citation: -1

Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg · (t5tts.github)
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS, arXiv, 2406.18009, arxiv, pdf, citation: -1

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan · (microsoft)
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model, arXiv, 2406.17310, arxiv, pdf, citation: -1

Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim · (arxiv)
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment, arXiv, 2406.07855, arxiv, pdf, citation: -1

Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei
MARS5-TTS - Camb-ai

MARS5 speech model (TTS) from CAMB.AI
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling, arXiv, 2406.05681, arxiv, pdf, citation: -1

Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model, arXiv, 2406.04904, arxiv, pdf, citation: -1

Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers, arXiv, 2406.05370, arxiv, pdf, citation: -1

Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei · (aka)
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes, arXiv, 2406.02897, arxiv, pdf, citation: -1

Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer, arXiv, 2406.00976, arxiv, pdf, citation: -1

Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu · (youngsheen.github) · (GPST - youngsheen)
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec, arXiv, 2406.01205, arxiv, pdf, citation: -1

Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang · (ControlSpeech - jishengpeng) · (controlspeech.github)
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models, arXiv, 2406.02430, arxiv, pdf, citation: -1

Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao · (bytedancespeech.github)

· (seed-tts-eval - BytedanceSpeech)
FlashSpeech: Efficient Zero-Shot Speech Synthesis, arXiv, 2404.14700, arxiv, pdf, citation: -1

Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He · (flashspeech.github)
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis, arXiv, 2404.03204, arxiv, pdf, citation: -1

Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li · (ralle-demo.github)
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild, arXiv, 2403.16973, arxiv, pdf, citation: -1

Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath

· (jasonppy.github) · (VoiceCraft - jasonppy) · (jasonppy.github)
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis, arXiv, 2307.07218, arxiv, pdf, citation: -1

Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang

· (boostprompt.github)
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | OpenReview

· (scholar-inbox) · (clam-tts.github)
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling, arXiv, 2403.05989, arxiv, pdf, citation: -1

Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen · (anonymous.4open)
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models, arXiv, 2403.03100, arxiv, pdf, citation: -1

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang

· (speechresearch.github)
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech, arXiv, 2402.09378, arxiv, pdf, citation: 1

Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

· (mobilespeech.github)
metavoice-src - metavoiceio

AI for human-level speech intelligence · (huggingface) · (ttsdemo.themetavoice)
WhisperSpeech - collabora

An Open Source text-to-speech system built by inverting Whisper. · (collabora.github)
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech, arXiv, 2401.14321, arxiv, pdf, citation: -1

Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu · (cpdu.github)
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering, arXiv, 2401.07333, arxiv, pdf, citation: -1

Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen · (ereboas.github)
OpenVoice: Versatile Instant Voice Cloning, arXiv, 2312.01479, arxiv, pdf, citation: -1

Zengyi Qin, Wenliang Zhao, Xumin Yu, Xin Sun · (openvoice - myshell-ai)
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis, arXiv, 2311.12454, arxiv, pdf, citation: -1

Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee · (HierSpeechpp - sh-lee-prml)
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models, arXiv, 2308.16692, arxiv, pdf, citation: 13

Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu

· (speechtokenizer - zhangxinfd)
xtts - coqui 🤗

· (huggingface) · (tts.readthedocs)
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting - NVIDIA ADLR

· (openreview)
PromptTTS 2: Describing and Generating Voices with Text Prompt, arXiv, 2309.02285, arxiv, pdf, citation: 3

Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song · (speechresearch.github)
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer, arXiv, 2308.06873, arxiv, pdf, citation: 10

Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts, arXiv, 2307.07218, arxiv, pdf, citation: 3

Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma · (mega-tts.github)
GPT-SoVITS - RVC-Boss

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
- A low-cost AI voice-cloning tool, developed independently over two months and free for everyone! [GPT-SoVITS] - bilibili
fish-speech - fishaudio

Brand new TTS solution · (speech.fish)

· (bilibili)
Pheme: Efficient and Conversational Speech Generation, arXiv, 2401.02839, arxiv, pdf, citation: -1

Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić · (arxiv) · (pheme - PolyAI-LDN) · (polyai-ldn.github)
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech, arXiv, 2302.04215, arxiv, pdf, citation: 14

Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky

· (MQTTS - b04901014)
SC-CNN - hcy71o

SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
SC-CNN-demo: "Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems"
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus, arXiv, 2203.15447, arxiv, pdf, citation: 15

Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim · (TransferTTS - hcy71o) · (SC-VITS - hcy71o)
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech, arXiv, 2205.07211, arxiv, pdf, citation: 28

Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao · (generspeech.github) · (GenerSpeech - Rongjiehuang)
Make-A-Voice: Unified Voice Synthesis With Discrete Representation, arXiv, 2305.19269, arxiv, pdf, citation: 6

Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
- Make-A-Voice
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias, arXiv, 2306.03509, arxiv, pdf, citation: 12

Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin
- Mega-TTS | demo-page
SoundStorm: Efficient Parallel Audio Generation, arXiv, 2305.09636, arxiv, pdf, citation: 18

Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
- SoundStorm
- GitHub - lucidrains/soundstorm-pytorch: Implementation of SoundStorm, Efficient Parallel Audio Generation from Google Deepmind, in Pytorch
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers, arXiv, 2304.09116, arxiv, pdf, citation: 43

Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian
- GitHub - lucidrains/naturalspeech2-pytorch: Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
- NaturalSpeech2
- NaturalSpeech2 - a Hugging Face Space by amphion
- NaturalSpeech 2
- Microsoft's NaturalSpeech 2 is here: speech synthesis based on diffusion models - bilibili
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling, arXiv, 2303.03926, arxiv, pdf, citation: 37

Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li
- GitHub - Plachtaa/VALL-E-X: An open source implementation of Microsoft's VALL-E X zero-shot TTS model. Demo is available in https://plachtaa.github.io
- VALL-E X
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision, arXiv, 2302.03540, arxiv, pdf, citation: 45

Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour
- SPEAR-TTS
- GitHub - collabora/spear-tts-pytorch: An unofficial PyTorch implementation of SPEAR-TTS.
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv, 2301.02111, arxiv, pdf, citation: 182

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li
- VALL-E
- GitHub - enhuiz/vall-e: An unofficial PyTorch implementation of the audio LM VALL-E
HierSpeech: Bridging the Gap between Text and Speech by Hierarchical Variational Inference using Self-supervised Representations for Speech Synthesis | OpenReview

· (sh-lee-prml.github) · (HierSpeech - CODEJIN)
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech, IEEE Signal Processing Letters, 2022, arxiv, pdf, citation: 7

Byoung Jin Choi, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim · (byoungjinchoi.github)
- GitHub - hcy71o/SNAC
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone, ICML, 2022, arxiv, pdf, citation: 164

Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti
- YourTTS

Projects

ai-voice-cloning - JarodMica
Vokan - ShoukanLabs 🤗

Products

Cartesia
Introducing Rapid Voice Cloning: Create AI Voices in Seconds

References

open-tts-tracker - Vaibhavs10
- open_tts_tracker
AR-NAR-TTS.pdf
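[Machine Learning 2023] Speech Foundation Models (taught by teaching assistant 張凱為) (1/2) - YouTube
[Machine Learning 2023] Speech Foundation Models (taught by teaching assistant 張凱為) (2/2) - YouTube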
ml2023-course-data/張凱爲-x-機器學習-x-語音基石模型.pdf
