Is it possible to train on a Chinese dataset? #2
Comments
Definitely, yes! But you may need a text-to-phoneme converter such as Phonemizer to convert Chinese text into phonemes.
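For anyone wondering what such a frontend might look like, here is a minimal, self-contained sketch of mapping pinyin syllables to initial/final phoneme tokens. The `PINYIN_TO_PHONES` table and function name are hypothetical illustrations, not part of the repo or of Phonemizer:

```python
# Toy sketch (hypothetical, not the repo's actual frontend): convert a
# pinyin syllable sequence into phoneme-like initial/final tokens.
# A real system would cover all syllables and tones.
PINYIN_TO_PHONES = {
    "ni3":  ["n", "i3"],
    "hao3": ["h", "ao3"],
}

def pinyin_to_phones(syllables):
    """Flatten pinyin syllables into initial/final phoneme tokens."""
    phones = []
    for syl in syllables:
        phones.extend(PINYIN_TO_PHONES[syl])
    return phones

print(pinyin_to_phones(["ni3", "hao3"]))  # ['n', 'i3', 'h', 'ao3']
```

In practice one would use a full grapheme-to-phoneme tool (e.g. Phonemizer with a Mandarin backend, or a pinyin converter) rather than a hand-written table.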
@jaywalnut310 I would probably try the Biaobei data for Chinese, though I am a total newbie in TTS. Let me take a deeper look. What would the phonemes look like in Chinese?
Curious: has the paper been released publicly? I have not been able to find it on arXiv or Google Scholar yet.
@LG-SS Now the paper is available: https://arxiv.org/abs/2106.06103
@jaywalnut310 Hi, may I ask one last question: how does the latency compare with Tacotron 2 (I mean end-to-end latency; Tacotron 2 also needs a vocoder, which counts)? Is VITS faster or slower at synthesizing a sentence of the same length, and by how much?
@jaywalnut310 Is this model autoregressive or non-autoregressive?
Hi @jinfagang. In a previous work, Glow-TTS, a synthesis speed comparison between Tacotron 2 and Glow-TTS was reported. As VITS synthesizes faster than Glow-TTS + HiFi-GAN (vocoder), it should be much faster than Tacotron 2 + HiFi-GAN (vocoder).
Hi @leminhnguyen, this model is non-autoregressive.
@jaywalnut310 Thank you, I have some questions.
@jaywalnut310 I listened to the sample audio from VITS; it sounds much better and more natural than Tacotron 2, so it's both better and faster and well worth a try. Do you have a Chinese pretrained model, by the way?
@leminhnguyen Well, VITS provides controllability to some extent. You can control and change the duration manually. You can control and change the energy and pitch by manipulating the latent representation (z in our code), but you cannot predict beforehand how much the energy and pitch will change. Also, I only compared quality against open-sourced official implementations (unfortunately FastSpeech 2 is not one). @jinfagang Thank you :). I haven't trained on a Chinese dataset, but it would be great if someone tries and shares it later.
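As a concrete illustration of the manual duration control mentioned above: at inference time the duration predictor emits per-token log-durations, and a user-chosen `length_scale` rescales them before rounding up to integer frame counts (larger values mean slower speech). The log-duration values below are made up for illustration:

```python
import math

# Sketch of VITS-style duration control at inference time:
# w_ceil = ceil(exp(log_w) * length_scale), per input token.
def scaled_durations(log_w, length_scale=1.0):
    """Return integer frame counts per token after length scaling."""
    return [math.ceil(math.exp(lw) * length_scale) for lw in log_w]

log_w = [1.0, 2.0, 0.5]              # hypothetical predicted log-durations
print(scaled_durations(log_w, 1.0))  # normal speaking rate
print(scaled_durations(log_w, 1.5))  # ~1.5x more frames (slower speech)
```

Energy and pitch, by contrast, are influenced by perturbing z directly, which is why their change cannot be predicted exactly in advance.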
@jaywalnut310 I can train on the BIAOBEI dataset, which is an open-source Chinese dataset. But can you tell me how I should organise it?
@jaywalnut310 I am only familiar with Tacotron and have not yet used a model with variability. What parameters should I change in the inference code to change duration or pitch? Or does this need to be done during training?
@jinfagang @TaoTeCha Have you trained a Chinese model successfully? Also, are you planning to open-source the model?
vits_Chinese.zip
@dtx525942103 That's amazing! It can synthesize such long audio! Do you plan to open-source your Chinese training code?
@dtx525942103 Great work! May I ask which dataset you trained on? And roughly how much audio is needed to extract a new Chinese voice? Thank you very much!
I used the DB1 dataset; it has 10,000 sentences.
@dtx525942103 Thanks for the reply! Could you give the full name of the dataset? I couldn't find anything under that abbreviation T_T. Is it DataBaker?
@dtx525942103 Same question here.
Much appreciated, thank you!
I have trained for about 1000 epochs, not fully trained, but the results seem impressive. I uploaded several Mandarin examples for anyone interested:
@dtx525942103 Hi, your training results are excellent. Did you set add_blank=True during training?
Yes, you have to add this arg in the config file. I provide a Chinese example model in this repo.
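The add_blank=True behaviour boils down to interleaving a blank token (id 0) between every text symbol, which roughly doubles the input sequence length. A function equivalent to `intersperse()` in the repo's commons.py:

```python
# Interleave a blank token between every symbol, with blanks at both ends:
# [a, b, c] -> [0, a, 0, b, 0, c, 0]. Equivalent to VITS's
# commons.intersperse(), applied when add_blank=True in the config.
def intersperse(seq, item=0):
    result = [item] * (len(seq) * 2 + 1)
    result[1::2] = seq
    return result

print(intersperse([5, 9, 7]))  # [0, 5, 0, 9, 0, 7, 0]
```

The blanks give the alignment search room between adjacent symbols, which is why they help with dropped or wrong pronunciations, as discussed further down this thread.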
I'd really like to ask: why does the posterior encoder use the linear spectrogram instead of the mel spectrogram directly? As far as I can see, the mel reconstruction loss in the paper is also computed on the mel spectrogram.
The paper says that using the linear spectrogram works better than using the mel spectrogram.
Is your code no longer open-sourced? I can't find it anymore.
@jaywalnut310 @TaoTeCha You said that we can control and change the energy and pitch by manipulating the latent representation (z in the code). Can you specify how? I mean, which values of z affect energy, pitch, etc.?
@MaxMax2016 @jinfagang Hi, have you run into mispronunciation issues? My trained model mispronounces some characters, for example the two characters 球员 in the sentence below.
@hermanseu What does your input look like?
@FanhuaandLuomu The input is a sequence of pinyin initials and finals. Previously I skipped inserting blanks, worried that doubling the input sequence length would slow down the engineering implementation and hurt first-packet latency and RTF. After adding the blanks, the pronunciation problems disappeared; first-packet latency is about 100 ms and the overall RTF is around 0.03, which is acceptable.
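For readers new to the metric: RTF (real-time factor) is synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A trivial sketch with illustrative numbers:

```python
# RTF = time spent synthesizing / duration of the audio produced.
# RTF < 1 means faster than real time. Numbers below are illustrative.
def rtf(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

print(rtf(0.15, 5.0))  # ~0.03: 5 s of audio synthesized in 150 ms
```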
@hermanseu Yes, I also added blanks earlier and it fixed the swallowed-syllable problem. Are you running inference on GPU, or have you made it streaming?
|
@hermanseu How did you make it streaming? Could you give some guidance? I split the input into chunks before decoding, but the final audio sounds a bit worse than the non-streaming output.
@15755841658 My streaming isn't full-pipeline streaming either. Because of the encoder's attention mechanism and the flipping inside the flow, those two parts cannot be streamed. The decoder is a fully convolutional network (including transposed convolutions): each convolution layer can emit an output as soon as one kernel-size window is filled, so the output quality should be identical to the non-streaming case. The only difference between streaming and non-streaming decoding is whether the input is fed frame by frame (or chunk by chunk) instead of all at once. The decoder is also the most time-consuming part of the pipeline, so once it is streamed, the whole pipeline effectively looks streamed. Also, for long inputs, the frontend cuts them into moderately sized sentences before synthesis. Hope this helps.
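The chunked-decoding idea above can be sketched with a plain 1-D convolution: carrying kernel_size - 1 samples of left context between chunks makes the streamed output identical to processing the whole sequence at once. This is an illustrative toy, not the commenter's implementation:

```python
# Toy sketch of streaming a convolutional layer: process the input in
# chunks while carrying kernel_size - 1 samples of left context, so the
# chunked output matches the full-sequence output exactly.
def conv1d(x, kernel):
    """Valid 1-D convolution: output length = len(x) - len(kernel) + 1."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def conv1d_streamed(x, kernel, chunk=4):
    k = len(kernel)
    out, ctx = [], []
    for start in range(0, len(x), chunk):
        block = ctx + x[start:start + chunk]
        out.extend(conv1d(block, kernel))
        ctx = block[-(k - 1):]  # left context carried to the next chunk
    return out

signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
kernel = [0.5, 0.3, 0.2]
print(conv1d_streamed(signal, kernel) == conv1d(signal, kernel))  # True
```

A real VITS decoder also has transposed convolutions and residual blocks, each needing its own context bookkeeping, but the principle per layer is the same. Quality loss in chunked decoding usually means the carried context (or overlap) is too short.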
|
Is the RTF of 0.03 measured on GPU or CPU?
That sounds like the acoustic-model + vocoder setup. VITS shouldn't need that, right?
@JohnHerry I am talking about VITS. Given a long piece of text, the goal here is to play the audio while it is being synthesized. Overlap can be lossless in theory.
|
Then your overlap length is too small.
The model was reimplemented in C/C++; the RTF is 0.03 on a 2.20 GHz CPU.
Thanks. We usually use models from C++ by converting the torch model to TorchScript and loading it with libtorch. By your C/C++ implementation, do you mean wrapping the model's operators and parameters in C++ from the ground up? Which layers or operations can be compressed away? I just ran a quick test directly in Python, moving g_net to a 2.7 GHz CPU
@JohnHerry We don't use libtorch; the low-level logic is implemented ourselves, which makes it easy to customize and modify (for example, for streaming output). The network layers are implemented in C, the frontend processing in C++, and the low-level matrix operations use MKL. The PyTorch model is dumped to binary; the whole float model is about 85 MB in total. The C/C++ results match the PyTorch results exactly, losslessly. I mainly work on 16 kHz TTS as well.
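The "dump the PyTorch model to binary" step might look like the following sketch: serialize float32 weights into a raw blob that a C runtime can `fread()` directly. The file name and layout here are hypothetical, not the commenter's actual format:

```python
import struct

# Hypothetical sketch of dumping weights to a raw binary blob for a
# hand-written C runtime: a little-endian uint32 element count followed
# by the float32 payload. Layout and file name are assumptions.
def dump_weights(weights, path):
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(weights)))             # element count
        f.write(struct.pack(f"<{len(weights)}f", *weights))  # float32 data

def load_weights(path):
    with open(path, "rb") as f:
        (n,) = struct.unpack("<I", f.read(4))
        return list(struct.unpack(f"<{n}f", f.read(4 * n)))

dump_weights([0.5, -1.25, 3.0], "layer0.bin")
print(load_weights("layer0.bin"))  # [0.5, -1.25, 3.0]
```

On the C side the matching reader is a `fread` of the count followed by a `fread` of `n` floats; a full exporter would also record shapes and layer names.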
Got it, thanks.
I noticed that in our tests it saturates all CPU cores on the machine. Your 0.03 result is single-core, single-thread on the 2.2 GHz CPU, right?
@JohnHerry Some of my network layers run in parallel; single-core single-thread is around 0.05.
Many thanks.