-
Lightweight Zero-shot Text-to-Speech with Mixture of Adapters
-
Robust Zero-Shot Text-to-Speech Synthesis with Reverse Inference Optimization,
arXiv, 2407.02243
-
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment,
arXiv, 2406.17957
, arxiv, pdf, citation: -1
Paarth Neekhara, Shehzeen Hussain, Subhankar Ghosh, Jason Li, Rafael Valle, Rohan Badlani, Boris Ginsburg · (t5tts.github)
-
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS,
arXiv, 2406.18009
, arxiv, pdf, citation: -1
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan · (microsoft)
-
High Fidelity Text-to-Speech Via Discrete Tokens Using Token Transducer and Group Masked Language Model,
arXiv, 2406.17310
, arxiv, pdf, citation: -1
Joun Yeop Lee, Myeonghun Jeong, Minchan Kim, Ji-Hyun Lee, Hoon-Young Cho, Nam Soo Kim · (arxiv)
-
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment,
arXiv, 2406.07855
, arxiv, pdf, citation: -1
Bing Han, Long Zhou, Shujie Liu, Sanyuan Chen, Lingwei Meng, Yanming Qian, Yanqing Liu, Sheng Zhao, Jinyu Li, Furu Wei
-
MARS5-TTS - Camb-ai
MARS5 speech model (TTS) from CAMB.AI
-
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling,
arXiv, 2406.05681
, arxiv, pdf, citation: -1
Yuepeng Jiang, Tao Li, Fengyu Yang, Lei Xie, Meng Meng, Yujun Wang
-
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,
arXiv, 2406.04904
, arxiv, pdf, citation: -1
Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi
-
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers,
arXiv, 2406.05370
, arxiv, pdf, citation: -1
Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei · (aka)
-
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes,
arXiv, 2406.02897
, arxiv, pdf, citation: -1
Trung Dang, David Aponte, Dung Tran, Kazuhito Koishida
-
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer,
arXiv, 2406.00976
, arxiv, pdf, citation: -1
Yongxin Zhu, Dan Su, Liqiang He, Linli Xu, Dong Yu · (youngsheen.github) · (GPST - youngsheen)
-
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec,
arXiv, 2406.01205
, arxiv, pdf, citation: -1
Shengpeng Ji, Jialong Zuo, Minghui Fang, Siqi Zheng, Qian Chen, Wen Wang, Ziyue Jiang, Hai Huang, Xize Cheng, Rongjie Huang · (ControlSpeech - jishengpeng)
· (controlspeech.github)
-
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models,
arXiv, 2406.02430
, arxiv, pdf, citation: -1
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao · (bytedancespeech.github)
· (seed-tts-eval - BytedanceSpeech)
-
FlashSpeech: Efficient Zero-Shot Speech Synthesis,
arXiv, 2404.14700
, arxiv, pdf, citation: -1
Zhen Ye, Zeqian Ju, Haohe Liu, Xu Tan, Jianyi Chen, Yiwen Lu, Peiwen Sun, Jiahao Pan, Weizhen Bian, Shulin He · (flashspeech.github)
-
RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis,
arXiv, 2404.03204
, arxiv, pdf, citation: -1
Detai Xin, Xu Tan, Kai Shen, Zeqian Ju, Dongchao Yang, Yuancheng Wang, Shinnosuke Takamichi, Hiroshi Saruwatari, Shujie Liu, Jinyu Li · (ralle-demo.github)
-
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild,
arXiv, 2403.16973
, arxiv, pdf, citation: -1
Puyuan Peng, Po-Yao Huang, Daniel Li, Abdelrahman Mohamed, David Harwath
· (jasonppy.github) · (VoiceCraft - jasonppy)
-
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis,
arXiv, 2307.07218
, arxiv, pdf, citation: -1
Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Zhenhui Ye, Shengpeng Ji, Qian Yang, Chen Zhang, Pengfei Wei, Chunfeng Wang
-
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech | OpenReview
· (scholar-inbox) · (clam-tts.github)
-
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling,
arXiv, 2403.05989
, arxiv, pdf, citation: -1
Chunhui Wang, Chang Zeng, Bowen Zhang, Ziyang Ma, Yefan Zhu, Zifeng Cai, Jian Zhao, Zhonglin Jiang, Yong Chen · (anonymous.4open)
-
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models,
arXiv, 2403.03100
, arxiv, pdf, citation: -1
Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang
-
MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech,
arXiv, 2402.09378
, arxiv, pdf, citation: 1
Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao
-
metavoice-src - metavoiceio
AI for human-level speech intelligence · (huggingface) · (ttsdemo.themetavoice)
-
WhisperSpeech - collabora
An Open Source text-to-speech system built by inverting Whisper. · (collabora.github)
-
VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech,
arXiv, 2401.14321
, arxiv, pdf, citation: -1
Chenpeng Du, Yiwei Guo, Hankun Wang, Yifan Yang, Zhikang Niu, Shuai Wang, Hui Zhang, Xie Chen, Kai Yu · (cpdu.github)
-
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering,
arXiv, 2401.07333
, arxiv, pdf, citation: -1
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen · (ereboas.github)
-
OpenVoice: Versatile Instant Voice Cloning,
arXiv, 2312.01479
, arxiv, pdf, citation: -1
Zengyi Qin, Wenliang Zhao, Xumin Yu, Xin Sun · (openvoice - myshell-ai)
-
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis,
arXiv, 2311.12454
, arxiv, pdf, citation: -1
Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, Seong-Whan Lee · (HierSpeechpp - sh-lee-prml)
-
SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models,
arXiv, 2308.16692
, arxiv, pdf, citation: 13
Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
· (speechtokenizer - zhangxinfd)
-
xtts - coqui 🤗
· (huggingface) · (tts.readthedocs)
-
P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting - NVIDIA ADLR
· (openreview)
-
PromptTTS 2: Describing and Generating Voices with Text Prompt,
arXiv, 2309.02285
, arxiv, pdf, citation: 3
Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song · (speechresearch.github)
-
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer,
arXiv, 2308.06873
, arxiv, pdf, citation: 10
Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka
-
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts,
arXiv, 2307.07218
, arxiv, pdf, citation: 3
Ziyue Jiang, Jinglin Liu, Yi Ren, Jinzheng He, Chen Zhang, Zhenhui Ye, Pengfei Wei, Chunfeng Wang, Xiang Yin, Zejun Ma · (mega-tts.github)
-
GPT-SoVITS - RVC-Boss
1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
-
fish-speech - fishaudio
Brand new TTS solution · (speech.fish)
· (bilibili)
-
Pheme: Efficient and Conversational Speech Generation,
arXiv, 2401.02839
, arxiv, pdf, citation: -1
Paweł Budzianowski, Taras Sereda, Tomasz Cichy, Ivan Vulić · (arxiv) · (pheme - PolyAI-LDN)
· (polyai-ldn.github)
-
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech,
arXiv, 2302.04215
, arxiv, pdf, citation: 14
Li-Wei Chen, Shinji Watanabe, Alexander Rudnicky
· (MQTTS - b04901014)
-
SC-CNN - hcy71o
SC-CNN: Effective Speaker Conditioning Method for Zero-Shot Multi-Speaker Text-to-Speech Systems
-
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus,
arXiv, 2203.15447
, arxiv, pdf, citation: 15
Minchan Kim, Myeonghun Jeong, Byoung Jin Choi, Sunghwan Ahn, Joun Yeop Lee, Nam Soo Kim · (TransferTTS - hcy71o)
· (SC-VITS - hcy71o)
-
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech,
arXiv, 2205.07211
, arxiv, pdf, citation: 28
Rongjie Huang, Yi Ren, Jinglin Liu, Chenye Cui, Zhou Zhao · (generspeech.github) · (GenerSpeech - Rongjiehuang)
-
Make-A-Voice: Unified Voice Synthesis With Discrete Representation,
arXiv, 2305.19269
, arxiv, pdf, citation: 6
Rongjie Huang, Chunlei Zhang, Yongqi Wang, Dongchao Yang, Luping Liu, Zhenhui Ye, Ziyue Jiang, Chao Weng, Zhou Zhao, Dong Yu
-
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias,
arXiv, 2306.03509
, arxiv, pdf, citation: 12
Ziyue Jiang, Yi Ren, Zhenhui Ye, Jinglin Liu, Chen Zhang, Qian Yang, Shengpeng Ji, Rongjie Huang, Chunfeng Wang, Xiang Yin
-
SoundStorm: Efficient Parallel Audio Generation,
arXiv, 2305.09636
, arxiv, pdf, citation: 18
Zalán Borsos, Matt Sharifi, Damien Vincent, Eugene Kharitonov, Neil Zeghidour, Marco Tagliasacchi
-
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers,
arXiv, 2304.09116
, arxiv, pdf, citation: 43
Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian
-
Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling,
arXiv, 2303.03926
, arxiv, pdf, citation: 37
Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li
-
Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision,
arXiv, 2302.03540
, arxiv, pdf, citation: 45
Eugene Kharitonov, Damien Vincent, Zalán Borsos, Raphaël Marinier, Sertan Girgin, Olivier Pietquin, Matt Sharifi, Marco Tagliasacchi, Neil Zeghidour
-
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers,
arXiv, 2301.02111
, arxiv, pdf, citation: 182
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li
-
· (sh-lee-prml.github) · (HierSpeech - CODEJIN)
-
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech,
IEEE Signal Processing Letters, 2022
, arxiv, pdf, citation: 7
Byoung Jin Choi, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim · (byoungjinchoi.github)
-
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone,
ICML, 2022
, arxiv, pdf, citation: 164
Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti
- ai-voice-cloning - JarodMica
- Vokan - ShoukanLabs 🤗