
Is it possible to train on a Chinese dataset? #2

Open · lucasjinreal opened this issue Jun 11, 2021 · 50 comments

@lucasjinreal

Is it possible to train on a Chinese dataset?

@jaywalnut310 (Owner)

Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes.
This model takes phonemes as input rather than characters.
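For context, a minimal sketch of what that conversion might look like with the phonemizer package, assuming an espeak-ng build with Mandarin ("cmn") support is installed; the sample text is arbitrary:

```python
# A sketch, assuming phonemizer plus an espeak-ng with Mandarin ("cmn") support.
from phonemizer import phonemize

phones = phonemize("你好,世界", language="cmn", backend="espeak")
print(phones)  # IPA-like phoneme string produced by espeak-ng
```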

@lucasjinreal (Author)

@jaywalnut310 I will probably try the Biaobei dataset for Chinese; I am a total newbie in TTS, though. Let me take a deeper look. What would phonemes look like in Chinese?

@LG-SS commented Jun 11, 2021

> @jaywalnut310 I will probably try the Biaobei dataset for Chinese; I am a total newbie in TTS, though. Let me take a deeper look. What would phonemes look like in Chinese?

The phonemes in Chinese are initials and finals with tone. For example, "ni2 hao3" can be converted into "n i2 h ao3".
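A sketch of one common way to obtain such initial/final-with-tone sequences, using the third-party pypinyin package (an assumption on my part; any Chinese G2P frontend would do):

```python
# A sketch using the third-party pypinyin package. Note pypinyin returns
# dictionary tones, so 你 comes out as tone 3 rather than the sandhi form ni2.
from pypinyin import lazy_pinyin, Style

text = "你好"
initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)    # ['n', 'h']
finals = lazy_pinyin(text, style=Style.FINALS_TONE3, strict=False)  # ['i3', 'ao3']
# Interleave, skipping the empty initial of zero-initial syllables such as 爱 (ai4).
phonemes = [p for pair in zip(initials, finals) for p in pair if p]
print(" ".join(phonemes))  # n i3 h ao3
```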

@LG-SS commented Jun 11, 2021

> Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes.
> This model takes phonemes as input rather than characters.

Curiously, has the paper been released publicly? I have not been able to find it on arXiv or Google Scholar so far.

@jaywalnut310 (Owner)

> Curiously, has the paper been released publicly? I have not been able to find it on arXiv or Google Scholar so far.

@LG-SS The paper is now available: https://arxiv.org/abs/2106.06103

@lucasjinreal (Author)

@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2? I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in. Is VITS faster or slower at predicting a sentence of the same length? By how much?

@leminhnguyen

@jaywalnut310 Is this model autoregressive or non-autoregressive?

@jaywalnut310 (Owner)

> @jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2? I mean end-to-end latency; Tacotron 2 also needs a vocoder, which should be counted in. Is VITS faster or slower at predicting a sentence of the same length? By how much?

Hi @jinfagang. Our previous work, Glow-TTS, reported a synthesis speed comparison between Tacotron 2 and Glow-TTS. Since VITS synthesizes faster than Glow-TTS + HiFi-GAN (vocoder), it should be much faster than Tacotron 2 + HiFi-GAN (vocoder).

[Table from the Glow-TTS paper comparing the synthesis speed of Tacotron 2 and Glow-TTS]

@jaywalnut310 (Owner)

> @jaywalnut310 Is this model autoregressive or non-autoregressive?

Hi @leminhnguyen, this model is non-autoregressive.

@leminhnguyen commented Jun 16, 2021

@jaywalnut310 Thank you! I have some questions:

  1. What about controllability?
  2. Can we change the duration, energy, or pitch?
  3. In the paper, you mention FastSpeech 2 in the related work. Did you compare the speed of FastSpeech 2 and VITS?

@lucasjinreal (Author)

@jaywalnut310 I listened to the sample audio from VITS; it is much more natural than Tacotron 2. So it is both better and faster, and well worth trying. Do you have a Chinese pretrained model, by the way?

@jaywalnut310 (Owner) commented Jun 16, 2021

@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually. You can change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand by how much the energy and pitch will change. And I only compared quality against official open-source implementations (unfortunately, FastSpeech 2 has none).

@jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tried it and shared the result later.
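For readers wondering where those knobs live: a hedged sketch of the repository's inference call (cf. inference.ipynb), where net_g is assumed to be a trained SynthesizerTrn and x / x_lengths a phoneme-id sequence from get_text. Pitch and energy are not exposed as parameters; changing them means editing z inside infer, as described above.

```python
import torch

# Sketch based on the repo's inference.ipynb; net_g, x, x_lengths are assumed
# to be a trained SynthesizerTrn and a phoneme-id sequence from get_text.
with torch.no_grad():
    audio = net_g.infer(
        x, x_lengths,
        noise_scale=0.667,   # sampling variance of the prior (varies prosody)
        noise_scale_w=0.8,   # variance of the stochastic duration predictor
        length_scale=1.2,    # global duration control: >1 slower, <1 faster
    )[0][0, 0].cpu().numpy()
```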

@lucasjinreal (Author)

@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organize it?

@TaoTeCha commented Jun 21, 2021

> Well, VITS provides controllability to some extent. You can control and change the duration manually. You can change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand by how much the energy and pitch will change. And I only compared quality against official open-source implementations (unfortunately, FastSpeech 2 has none).

@jaywalnut310 I am only familiar with Tacotron and have not yet used a model with variability. What parameters should I change in the inference code to vary duration or pitch? Or are you saying this needs to be done during training?

@WadoodAbdul

@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?

@MaxMax2016

vits_Chinese.zip
I was surprised to find that VITS has no limit on phoneme length. So amazing!

@lucasjinreal (Author)

@dtx525942103 That's amazing! It can synthesize such long audio! Do you plan to open-source your Chinese training code?

@hemath1001

> vits_Chinese.zip I was surprised to find that VITS has no limit on phoneme length. So amazing!

@dtx525942103 Great work! May I ask which dataset you used for training? And roughly how much audio is needed to extract a new Chinese voice? Thank you very much!

@MaxMax2016

I used the DB1 dataset; it has 10,000 sentences.

@hemath1001 commented Jan 18, 2022

> I used the DB1 dataset; it has 10,000 sentences.

@dtx525942103 Thanks for the reply! Could you tell me the full name of the dataset? I couldn't find anything under that abbreviation T_T. Is it DataBaker?

@lucasjinreal (Author)

@dtx525942103 Same question here.

@hemath1001

weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar

Thank you so much!

@lucasjinreal (Author)

I have trained for about 1000 epochs. It is not fully trained, but the results already seem impressive.

I have uploaded several Mandarin examples for anyone interested:
中文语音合成实例.zip

@yuyu122 commented Mar 11, 2022

> Is it possible to train on a Chinese dataset?

Hello, did this error occur for you when using the phonemizer function with the backend parameter set to espeak? (RuntimeError: failed to find espeak library)
I would like to know how to install espeak. Thank you!
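A sketch of the usual fix, assuming the phonemizer package: install the espeak-ng system package (e.g. apt-get install espeak-ng on Debian/Ubuntu, brew install espeak-ng on macOS), and if the shared library lives somewhere non-standard, point phonemizer at it explicitly. The library path below is only an example:

```python
import os

# If phonemizer cannot locate the espeak-ng shared library, set this variable
# before importing phonemizer (example path; adjust to your system):
os.environ["PHONEMIZER_ESPEAK_LIBRARY"] = "/usr/lib/x86_64-linux-gnu/libespeak-ng.so.1"

from phonemizer import phonemize
print(phonemize("hello world", language="en-us", backend="espeak"))
```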

@yt605155624

@dtx525942103 Hello, your trained model sounds great. Did you set add_blank=True when training?

@wac81 commented Jun 15, 2022

Yes, you have to add this argument in the config file.

I provide a Chinese example model in this repo:
https://github.com/wac81/vits_chinese
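For reference, the blank insertion is switched on with "add_blank": true in the data section of the config; the data loader then interleaves blank token id 0 between phoneme ids via commons.intersperse from the upstream repo:

```python
def intersperse(lst, item):
    # commons.intersperse from the upstream repo: with add_blank enabled, the
    # data loader interleaves a blank token (id 0) around every phoneme id,
    # doubling the sequence length.
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

print(intersperse([5, 9, 3], 0))  # [0, 5, 0, 9, 0, 3, 0]
```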

@wgc7998 commented Aug 5, 2022

> vits_Chinese.zip I was surprised to find that VITS has no limit on phoneme length. So amazing!

I would really like to ask: why does the posterior encoder use the linear spectrogram instead of using the mel spectrogram directly? As far as I can see, the mel reconstruction loss in the paper is also computed on mel spectrograms.

@MaxMax2016

> I would really like to ask: why does the posterior encoder use the linear spectrogram instead of using the mel spectrogram directly? As far as I can see, the mel reconstruction loss in the paper is also computed on mel spectrograms.

According to the paper, using the linear spectrogram works better than using the mel spectrogram.
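For anyone comparing the two inputs, here is a sketch of the linear-spectrogram computation, mirroring mel_processing.spectrogram_torch in the repo (the parameter values are the LJSpeech-config defaults). A mel spectrogram would further compress these bins through a mel filterbank, discarding spectral detail the posterior encoder can use:

```python
import torch

def linear_spectrogram(wav, n_fft=1024, hop=256, win=1024):
    # Magnitude STFT as fed to the VITS posterior encoder; a sketch mirroring
    # mel_processing.spectrogram_torch (values are the LJSpeech defaults).
    # wav: (B, T) float waveform in [-1, 1].
    pad = (n_fft - hop) // 2
    wav = torch.nn.functional.pad(
        wav.unsqueeze(1), (pad, pad), mode="reflect").squeeze(1)
    spec = torch.stft(wav, n_fft, hop_length=hop, win_length=win,
                      window=torch.hann_window(win), center=False,
                      return_complex=True)
    return spec.abs()  # (B, n_fft // 2 + 1, frames): full-band linear bins
```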

@sixyang commented Aug 17, 2022

Is your code no longer open source? I can't find it anymore.

@tuannvhust commented Oct 26, 2022

@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, and so on?

@hermanseu

@MaxMax2016 @jinfagang Hi, have you run into mispronunciation problems? Some characters are mispronounced by my trained model, for example the two characters 球员 in the sample below.
sample.zip

@FanhuaandLuomu

@hermanseu What does your input look like?

@hermanseu

@FanhuaandLuomu The input is a sequence of pinyin initials and finals. I previously worried that inserting blanks would double the input sequence length, slowing down the engineering implementation and hurting first-packet latency and RTF. After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

@FanhuaandLuomu commented Jan 19, 2023

> @FanhuaandLuomu The input is a sequence of pinyin initials and finals. I previously worried that inserting blanks would double the input sequence length, slowing down the engineering implementation and hurting first-packet latency and RTF. After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

@hermanseu Right, I also added blanks before and it fixed the swallowed-syllable problem. Are you running inference on GPU, or have you switched to streaming?

@hermanseu

> @hermanseu Right, I also added blanks before and it fixed the swallowed-syllable problem. Are you running inference on GPU, or have you switched to streaming?

@FanhuaandLuomu

  1. Honestly, I still haven't figured out the root cause of the swallowed syllables. I checked the mean and variance of the predicted latent variable: the mean is very small, and the variance looks a bit large. Using only the mean as the latent variable, syllables still get swallowed. I suspect the loss terms are not balanced well, so the mean is learned slightly off, but I got busy with other things and didn't continue the experiments. If you have any ideas, please share!
  2. I have a pipeline that dumps the trained model to a binary format; C/C++ code then reads the binary model and runs inference on CPU with streaming output.

@15755841658

@hermanseu Could you explain how you made it streaming? I split the input before decoding, but the final audio sounds slightly worse than the non-streaming output.

@hermanseu

@15755841658 My streaming is not streaming end to end. Because of the attention mechanism in the encoder and the flipping operation in the flow, those two parts cannot be streamed. The decoder is a fully convolutional network (including transposed convolutions); each convolutional layer can produce an output as soon as one kernel's worth of input has arrived, and the output quality should be the same as in the non-streaming case. The difference between streaming and non-streaming is whether the decoder input is fed frame by frame (or chunk by chunk) rather than all at once. The decoder is also the most time-consuming part of the whole pipeline, so once decoding is streamed, the whole pipeline effectively streams. Also, for long inputs, the frontend splits the text into moderately sized sentences before synthesis. Hope this helps.

@pengzhendong

> @15755841658 My streaming is not streaming end to end. Because of the attention mechanism in the encoder and the flipping operation in the flow, those two parts cannot be streamed. The decoder is a fully convolutional network (including transposed convolutions); each convolutional layer can produce an output as soon as one kernel's worth of input has arrived, and the output quality should be the same as in the non-streaming case. The difference between streaming and non-streaming is whether the decoder input is fed frame by frame (or chunk by chunk) rather than all at once. The decoder is also the most time-consuming part of the whole pipeline, so once decoding is streamed, the whole pipeline effectively streams. Also, for long inputs, the frontend splits the text into moderately sized sentences before synthesis. Hope this helps.

You also need a certain length of overlap; the exact length has to be computed from the amount of padding in each decoder layer.

[Diagram illustrating the overlap required by the decoder's per-layer padding]
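A hedged sketch of that chunk-plus-overlap scheme over the VITS decoder (the Generator in models.py, called as dec(z, g=g)). The chunk and overlap values here are illustrative, not derived; the overlap is only truly lossless once it covers the decoder's full receptive field as computed from its per-layer padding:

```python
import torch

def stream_decode(dec, z, g=None, chunk=60, overlap=16, hop=256):
    # Chunked inference through the VITS decoder (a HiFi-GAN-style generator).
    # Each chunk carries `overlap` extra latent frames of left/right context so
    # the decoder's receptive field is satisfied; the matching samples are then
    # trimmed before concatenation. hop = total upsampling factor (256 in the
    # 22.05 kHz configs). chunk/overlap are illustrative values only.
    n = z.size(2)
    pieces = []
    for start in range(0, n, chunk):
        stop = min(n, start + chunk)
        s, e = max(0, start - overlap), min(n, stop + overlap)
        with torch.no_grad():
            y = dec(z[:, :, s:e], g=g)  # (B, 1, (e - s) * hop)
        left, right = (start - s) * hop, (e - stop) * hop
        pieces.append(y[:, :, left:y.size(2) - right])
    return torch.cat(pieces, dim=2)  # (B, 1, n * hop), playable chunk by chunk
```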

@JohnHerry

> After adding the blanks back, the pronunciation problems disappeared. With blanks, first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.

Is the RTF of 0.03 on GPU or on CPU?

@JohnHerry

> You also need a certain length of overlap; the exact length has to be computed from the amount of padding in each decoder layer.

That sounds like the acoustic-model-plus-vocoder setup. VITS shouldn't need it, should it?
Also, we have tried the overlap approach with FS2 + HiFiGAN. After cutting away the overlap and splicing the waveforms, the resulting audio still had noise, especially now that TTS products require volume control: once the synthesized result is amplified, the splicing pops become very audible, unless each chunk is large enough, which in turn hurts the streaming experience. Of course, all of that is two-stage acoustic-model-plus-vocoder synthesis, so it may have nothing to do with VITS.

@pengzhendong

@JohnHerry I am talking about VITS. Given a long piece of text, the goal here is to play the audio while it is still being synthesized. With overlap it can be lossless in theory.

@JohnHerry

> @JohnHerry I am talking about VITS. Given a long piece of text, the goal here is to play the audio while it is still being synthesized. With overlap it can be lossless in theory.

No. The vocoder part of VITS, i.e. the decoder, is just a HiFi-GAN. Even with ground-truth audio features (mel spectrograms), if you synthesize with overlap, cut away the overlapped parts, and splice the waveforms, the result is still lossy. You can try it.

@pengzhendong

> No. The vocoder part of VITS, i.e. the decoder, is just a HiFi-GAN. Even with ground-truth audio features (mel spectrograms), if you synthesize with overlap, cut away the overlapped parts, and splice the waveforms, the result is still lossy. You can try it.

Then your overlap length was too small.

@hermanseu

> Is the RTF of 0.03 on GPU or on CPU?

The model is re-implemented in C/C++; on a 2.20 GHz CPU the RTF is 0.03.

@JohnHerry commented Mar 22, 2023

> The model is re-implemented in C/C++; on a 2.20 GHz CPU the RTF is 0.03.

Thanks. For C++ we usually convert the torch model to TorchScript and load it with libtorch. Is your C/C++ implementation a from-scratch wrapper of the model's operators and parameters? Which layers or operations can be compressed away? I just tested in Python with g_net moved to a 2.7 GHz CPU device, and the RTF came out around 0.07 to 0.08, which was better than I expected. (My model runs at 16 kHz, with the decoder parameters adjusted accordingly, so it is certainly much smaller; even so, the G model is about 300 MB on disk.)

@hermanseu

@JohnHerry We don't use libtorch; the underlying logic is implemented by ourselves, which makes customization and modification easy, e.g. for streaming output. The network layers are implemented in C, the frontend processing in C++, and the low-level matrix operations use MKL. The PyTorch model is dumped to a binary; the whole floating-point model is about 85 MB in total. The C/C++ results are identical to the PyTorch results, with no loss. I also mainly work on 16 kHz TTS.

@JohnHerry

Got it, thanks.

@JohnHerry

I noticed that our tests saturate all of the machine's CPU cores. Your 0.03 result is on a 2.2 GHz CPU with a single core and a single thread, right?

@hermanseu

> I noticed that our tests saturate all of the machine's CPU cores. Your 0.03 result is on a 2.2 GHz CPU with a single core and a single thread, right?

@JohnHerry Some of my network layers are computed concurrently; single-core, single-thread is around 0.05.

@JohnHerry

Thank you very much.
