
Is it possible to train on a Chinese dataset? #2

Open · lucasjinreal opened this issue Jun 11, 2021 · 50 comments

@lucasjinreal

Is it possible to train on a Chinese dataset?

@jaywalnut310 (Owner)

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes.
This model takes phonemes as input rather than characters.
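For context, a minimal sketch of what that conversion might look like with the phonemizer package, assuming an espeak-ng build with Mandarin ("cmn") support is installed; the sample text is arbitrary:

```python
# A sketch, assuming phonemizer plus an espeak-ng with Mandarin ("cmn") support.
from phonemizer import phonemize

phones = phonemize("你好,世界", language="cmn", backend="espeak")
print(phones)  # IPA-like phoneme string produced by espeak-ng
```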

@lucasjinreal (Author)

@jaywalnut310 I will probably try the Biaobei dataset for Chinese; I am a total newbie in TTS, though. Let me take a deeper look. What would phonemes look like in Chinese?

@LG-SS commented Jun 11, 2021

> @jaywalnut310 I will probably try the Biaobei dataset for Chinese; I am a total newbie in TTS, though. Let me take a deeper look. What would phonemes look like in Chinese?

The phonemes in Chinese are initials and finals with tone. For example, "ni2 hao3" can be converted into "n i2 h ao3".
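A sketch of one common way to obtain such initial/final-with-tone sequences, using the third-party pypinyin package (an assumption on my part; any Chinese G2P frontend would do):

```python
# A sketch using the third-party pypinyin package. Note pypinyin returns
# dictionary tones, so 你 comes out as tone 3 rather than the sandhi form ni2.
from pypinyin import lazy_pinyin, Style

text = "你好"
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)    # ['n', 'h']
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)  # ['i3', 'ao3']
# Interleave, skipping the empty initial of zero-initial syllables such as 爱 (ai4).
phonemes = [p for pair in zip(initials, finals) for p in pair if p]
print(" ".join(phonemes))  # n i3 h ao3
```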

@LG-SS commented Jun 11, 2021

> Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes.
> This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I have not been able to find it on arXiv or Google Scholar so far.

@jaywalnut310 (Owner)

> Curiously, has the paper been released publicly? I have not been able to find it on arXiv or Google Scholar so far.

@LG-SS The paper is now available: https://arxiv.org/abs/2106.06103

@lucasjinreal (Author)

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2? I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in. Is VITS faster or slower at predicting a sentence of the same length? By how much?

@leminhnguyen

@jaywalnut310 Is this model autoregressive or non-autoregressive?

@jaywalnut310 (Owner)

> @jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2? I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in. Is VITS faster or slower at predicting a sentence of the same length? By how much?

Hi @jinfagang. Our previous work, Glow-TTS, reported a synthesis speed comparison between Tacotron 2 and Glow-TTS. Since VITS synthesizes faster than Glow-TTS + HiFi-GAN (vocoder), it should be much faster than Tacotron 2 + HiFi-GAN (vocoder).

[Table from the Glow-TTS paper comparing the synthesis speed of Tacotron 2 and Glow-TTS]

@jaywalnut310 (Owner)

> @jaywalnut310 Is this model autoregressive or non-autoregressive?

Hi @leminhnguyen, this model is non-autoregressive.

@leminhnguyen commented Jun 16, 2021

@jaywalnut310 Thank you! I have some questions:

  1. What about controllability?
  2. Can we change the duration, energy, or pitch?
  3. In the paper, you mention FastSpeech 2 in the related work. Did you compare the speed of FastSpeech 2 and VITS?

@lucasjinreal (Author)

@jaywalnut310 I listened to the sample audio from VITS; it is much more natural than Tacotron 2. So it is both better and faster, and well worth trying. Do you have a Chinese pretrained model, by the way?

@jaywalnut310 (Owner) commented Jun 16, 2021

@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually. You can change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand by how much the energy and pitch will change. And I only compared quality against official open-source implementations (unfortunately, FastSpeech 2 has none).

@jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tried it and shared the result later.
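For readers wondering where those knobs live: a hedged sketch of the repository's inference call (cf. inference.ipynb), where net_g is assumed to be a trained SynthesizerTrn and x / x_lengths a phoneme-id sequence from get_text. Pitch and energy are not exposed as parameters; changing them means editing z inside infer, as described above.

```python
import torch

# Sketch based on the repo's inference.ipynb; net_g, x, x_lengths are assumed
# to be a trained SynthesizerTrn and a phoneme-id sequence from get_text.
with torch.no_grad():
    audio = net_g.infer(
        x, x_lengths,
        noise_scale=0.667,   # sampling variance of the prior (varies prosody)
        noise_scale_w=0.8,   # variance of the stochastic duration predictor
        length_scale=1.2,    # global duration control: >1 slower, <1 faster
    )[0][0, 0].cpu().numpy()
```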

@lucasjinreal (Author)

@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organize it?

@TaoTeCha commented Jun 21, 2021

> Well, VITS provides controllability to some extent. You can control and change the duration manually. You can change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand by how much the energy and pitch will change. And I only compared quality against official open-source implementations (unfortunately, FastSpeech 2 has none).

@jaywalnut310 I am only familiar with Tacotron and have not yet used a model with variability. What parameters should I change in the inference code to vary duration or pitch? Or are you saying this needs to be done during training?

@WadoodAbdul

@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?

@MaxMax2016

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. So amazing!

@lucasjinreal (Author)

@dtx525942103 That's amazing! It can synthesize such long audio! Do you plan to open-source your Chinese training code?

@hemath1001

> vits_Chinese.zip I was surprised to find that VITS has no limit on phoneme length. So amazing!

@dtx525942103 Great work! May I ask which dataset you used for training? And roughly how much audio is needed to extract a new Chinese voice? Thank you very much!

@MaxMax2016

I used the DB1 dataset; it has 10,000 sentences.

@hemath1001 commented Jan 18, 2022

> I used the DB1 dataset; it has 10,000 sentences.

@dtx525942103 Thanks for the reply! Could you tell me the full name of the dataset? I couldn't find anything under that abbreviation T_T. Is it DataBaker?

@lucasjinreal (Author)

@dtx525942103 Same question here.

@hemath1001

weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

Thank you so much!

@lucasjinreal (Author)

I have trained for about 1000 epochs. It is not fully trained, but the results already seem impressive.

I have uploaded several Mandarin examples for anyone interested:
中文语音合成实例.zip

@yuyu122 commented Mar 11, 2022

> Is it possible to train on a Chinese dataset?

Hello, did this error occur for you when using the phonemizer function with the backend parameter set to espeak? (RuntimeError: failed to find espeak library)
I would like to know how to install espeak. Thank you!
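A sketch of the usual fix, assuming the phonemizer package: install the espeak-ng system package (e.g. apt-get install espeak-ng on Debian/Ubuntu, brew install espeak-ng on macOS), and if the shared library lives somewhere non-standard, point phonemizer at it explicitly. The library path below is only an example:

```python
import os

# If phonemizer cannot locate the espeak-ng shared library, set this variable
# before importing phonemizer (example path; adjust to your system):
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1"

from phonemizer import phonemize
print(phonemize("hello world", language="en-us", backend="espeak"))
```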

@yt605155624

@dtx525942103 Hello, your trained model sounds great. Did you set add_blank=True when training?

@wac81 commented Jun 15, 2022

Yes, you have to add this argument in the config file.

I provide a Chinese example model in this repo:
https://github.com/wac81/vits_chinese
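For reference, the blank insertion is switched on with "add_blank": true in the data section of the config; the data loader then interleaves blank token id 0 between phoneme ids via commons.intersperse from the upstream repo:

```python
def intersperse(lst, item):
    # commons.intersperse from the upstream repo: with add_blank enabled, the
    # data loader interleaves a blank token (id 0) around every phoneme id,
    # doubling the sequence length.
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

print(intersperse([5, 9, 3], 0))  # [0, 5, 0, 9, 0, 3, 0]
```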

@wgc7998 commented Aug 5, 2022

> vits_Chinese.zip I was surprised to find that VITS has no limit on phoneme length. So amazing!

I would really like to ask: why does the posterior encoder use the linear spectrogram instead of using the mel spectrogram directly? As far as I can see, the mel reconstruction loss in the paper is also computed on mel spectrograms.

@MaxMax2016

> I would really like to ask: why does the posterior encoder use the linear spectrogram instead of using the mel spectrogram directly? As far as I can see, the mel reconstruction loss in the paper is also computed on mel spectrograms.

According to the paper, using the linear spectrogram works better than using the mel spectrogram.
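For anyone comparing the two inputs, here is a sketch of the linear-spectrogram computation, mirroring mel_processing.spectrogram_torch in the repo (the parameter values are the LJSpeech-config defaults). A mel spectrogram would further compress these bins through a mel filterbank, discarding spectral detail the posterior encoder can use:

```python
import torch

def linear_spectrogram(wav, n_fft=1024, hop=256, win=1024):
    # Magnitude STFT as fed to the VITS posterior encoder; a sketch mirroring
    # mel_processing.spectrogram_torch (values are the LJSpeech defaults).
    # wav: (B, T) float waveform in [-1, 1].
    pad = (n_fft - hop) // 2
    wav = torch.nn.functional.pad(
        wav.unsqueeze(1), (pad, pad), mode="reflect").squeeze(1)
    spec = torch.stft(wav, n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win), center=False,
                      return_complex=True)
    return spec.abs()  # (B, n_fft // 2 + 1, frames): full-band linear bins
```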

@sixyang commented Aug 17, 2022

Is your code no longer open source? I can't find it anymore.

@tuannvhust commented Oct 26, 2022

@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, and so on?

@hermanseu

@MaxMax2016 @jinfagang Hi, have you run into mispronunciation problems? Some characters are mispronounced by my trained model, for example the two characters 球员 in the sample below.
sample.zip

@FanhuaandLuomu

@hermanseu What does your input look like?

@hermanseu

@FanhuaandLuomu The input is a sequence of pinyin initials and finals. I previously worried that inserting blanks would double the input sequence length, slowing down the engineering implementation and hurting first-packet latency and RTF. After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

@FanhuaandLuomu commented Jan 19, 2023

> @FanhuaandLuomu The input is a sequence of pinyin initials and finals. I previously worried that inserting blanks would double the input sequence length, slowing down the engineering implementation and hurting first-packet latency and RTF. After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

@hermanseu Right, I also added blanks before and it fixed the swallowed-syllable problem. Are you running inference on GPU, or have you switched to streaming?

@hermanseu

> @hermanseu Right, I also added blanks before and it fixed the swallowed-syllable problem. Are you running inference on GPU, or have you switched to streaming?

@FanhuaandLuomu

  1. Honestly, I still haven't figured out the root cause of the swallowed syllables. I checked the mean and variance of the predicted latent variable: the mean is very small, and the variance looks a bit large. Using only the mean as the latent variable, syllables still get swallowed. I suspect the loss terms are not balanced well, so the mean is learned slightly off, but I got busy with other things and didn't continue the experiments. If you have any ideas, please share!
  2. I have a pipeline that dumps the trained model to a binary format; C/C++ code then reads the binary model and runs inference on CPU with streaming output.

@15755841658

@hermanseu Could you explain how you made it streaming? I split the input before decoding, but the final audio sounds slightly worse than the non-streaming output.

@hermanseu

@15755841658 My streaming is not streaming end to end. Because of the attention mechanism in the encoder and the flipping operation in the flow, those two parts cannot be streamed. The decoder is a fully convolutional network (including transposed convolutions); each convolutional layer can produce an output as soon as one kernel's worth of input has arrived, and the output quality should be the same as in the non-streaming case. The difference between streaming and non-streaming is whether the decoder input is fed frame by frame (or chunk by chunk) rather than all at once. The decoder is also the most time-consuming part of the whole pipeline, so once decoding is streamed, the whole pipeline effectively streams. Also, for long inputs, the frontend splits the text into moderately sized sentences before synthesis. Hope this helps.

@pengzhendong

> @15755841658 My streaming is not streaming end to end. Because of the attention mechanism in the encoder and the flipping operation in the flow, those two parts cannot be streamed. The decoder is a fully convolutional network (including transposed convolutions); each convolutional layer can produce an output as soon as one kernel's worth of input has arrived, and the output quality should be the same as in the non-streaming case. The difference between streaming and non-streaming is whether the decoder input is fed frame by frame (or chunk by chunk) rather than all at once. The decoder is also the most time-consuming part of the whole pipeline, so once decoding is streamed, the whole pipeline effectively streams. Also, for long inputs, the frontend splits the text into moderately sized sentences before synthesis. Hope this helps.

You also need a certain length of overlap; the exact length has to be computed from the amount of padding in each decoder layer.

[Diagram illustrating the overlap required by the decoder's per-layer padding]
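A hedged sketch of that chunk-plus-overlap scheme over the VITS decoder (the Generator in models.py, called as dec(z, g=g)). The chunk and overlap values here are illustrative, not derived; the overlap is only truly lossless once it covers the decoder's full receptive field as computed from its per-layer padding:

```python
import torch

def stream_decode(dec, z, g=None, chunk=60, overlap=16, hop=256):
    # Chunked inference through the VITS decoder (a HiFi-GAN-style generator).
    # Each chunk carries `overlap` extra latent frames of left/right context so
    # the decoder's receptive field is satisfied; the matching samples are then
    # trimmed before concatenation. hop = total upsampling factor (256 in the
    # 22.05 kHz configs). chunk/overlap are illustrative values only.
    n = z.size(2)
    pieces = []
    for start in range(0, n, chunk):
        stop = min(n, start + chunk)
        s, e = max(0, start - overlap), min(n, stop + overlap)
        with torch.no_grad():
            y = dec(z[:, :, s:e], g=g)  # (B, 1, (e - s) * hop)
        left, right = (start - s) * hop, (e - stop) * hop
        pieces.append(y[:, :, left:y.size(2) - right])
    return torch.cat(pieces, dim=2)  # (B, 1, n * hop), playable chunk by chunk
```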

@JohnHerry

> After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

Is the RTF of 0.03 on GPU or on CPU?

@JohnHerry

> You also need a certain length of overlap; the exact length has to be computed from the amount of padding in each decoder layer.

That sounds like the acoustic-model-plus-vocoder setup. VITS shouldn't need it, should it?
Also, we have tried the overlap approach with FS2 + HiFiGAN. After cutting away the overlap and splicing the waveforms, the resulting audio still had noise, especially now that TTS products require volume control: once the synthesized result is amplified, the splicing pops become very audible, unless each chunk is large enough, which in turn hurts the streaming experience. Of course, all of that is two-stage acoustic-model-plus-vocoder synthesis, so it may have nothing to do with VITS.

@pengzhendong

@JohnHerry I am talking about VITS. Given a long piece of text, the goal here is to play the audio while it is still being synthesized. With overlap it can be lossless in theory.

@JohnHerry

> @JohnHerry I am talking about VITS. Given a long piece of text, the goal here is to play the audio while it is still being synthesized. With overlap it can be lossless in theory.

No. The vocoder part of VITS, i.e. the decoder, is just a HiFi-GAN. Even with ground-truth audio features (mel spectrograms), if you synthesize with overlap, cut away the overlapped parts, and splice the waveforms, the result is still lossy. You can try it.

@pengzhendong

> No. The vocoder part of VITS, i.e. the decoder, is just a HiFi-GAN. Even with ground-truth audio features (mel spectrograms), if you synthesize with overlap, cut away the overlapped parts, and splice the waveforms, the result is still lossy. You can try it.

Then your overlap length was too small.

@hermanseu

> Is the RTF of 0.03 on GPU or on CPU?

The model is re-implemented in C/C++; on a 2.20 GHz CPU the RTF is 0.03.

@JohnHerry commented Mar 22, 2023

> The model is re-implemented in C/C++; on a 2.20 GHz CPU the RTF is 0.03.

Thanks. For C++ we usually convert the torch model to TorchScript and load it with libtorch. Is your C/C++ implementation a from-scratch wrapper of the model's operators and parameters? Which layers or operations can be compressed away? I just tested in Python with g_net moved to a 2.7 GHz CPU device, and the RTF came out around 0.07 to 0.08, which was better than I expected. (My model runs at 16 kHz, with the decoder parameters adjusted accordingly, so it is certainly much smaller; even so, the G model is about 300 MB on disk.)

@hermanseu

@JohnHerry We don't use libtorch; the underlying logic is implemented by ourselves, which makes customization and modification easy, e.g. for streaming output. The network layers are implemented in C, the frontend processing in C++, and the low-level matrix operations use MKL. The PyTorch model is dumped to a binary; the whole floating-point model is about 85 MB in total. The C/C++ results are identical to the PyTorch results, with no loss. I also mainly work on 16 kHz TTS.

@JohnHerry

Got it, thanks.

@JohnHerry

I noticed that our tests saturate all of the machine's CPU cores. Your 0.03 result is on a 2.2 GHz CPU with a single core and a single thread, right?

@hermanseu

> I noticed that our tests saturate all of the machine's CPU cores. Your 0.03 result is on a 2.2 GHz CPU with a single core and a single thread, right?

@JohnHerry Some of my network layers are computed concurrently; single-core, single-thread is around 0.05.

@JohnHerry

Thank you very much.
