Implementation status and planned TODOs #1

r9y9 · 2020-04-19T16:28:03Z

Yablon · 2020-04-20T05:42:09Z

Hi, I am one of your followers. I feel glad and excited to see the growth of a great project and want to do something for that. I am familiar with Chinese and I can help with Chinese frontend if needed.

About Vocoder
I think LPCNet vocoder may fit the need ? It gets spectrogram and pitch to generate audio signals.

About Recipes
Kaldi wants to get rid of its heavy things like shell scripts and c++ interfaces. Could we consider offerring some recipes written by python or maybe else ?

Thank you, I am on your call !

r9y9 · 2020-04-20T06:39:55Z

Hi, @Yablon, many thanks for your comments! Your help with Chinese frontend support is definitely welcome! Let me first make a Japanese version of the entire system, and then let's discuss how to extend it to other languages.

About vocoder:
Yes, LPCNet is also a good candidate. I will add it to the TODO list.

About recipe:
At the moment I am thinking that a recipe would look like https://github.com/r9y9/wavenet_vocoder/blob/master/egs/mol/run.sh. It consists of a single shell script that involves core python implementations (e.g., train.py). A recipe can be written in python (if you want). I may consider C++ for performance heavy things, but for simplicity and maintainability, I will implement most of core features in python. Does it sound okay for you?

FYI, I don't want to add Kaldi requirement to the repo. I guess it would make users having installation issues...

Yablon · 2020-04-20T07:38:48Z

@r9y9 I agree with you and hope to see the entire system.

apeguero1 · 2020-04-21T23:17:59Z

Hi @r9y9, really excited to see where this project goes!

For training acoustic models that have a WORLD vocoder target perhaps it's a good idea to take a look at WGANSing.? In addition to the actual model used, I think their preprocessing gives some insight into how to predict WORLD vocoder features efficiently.

r9y9 · 2020-04-22T03:46:10Z

Hi @apeguero1, thanks for sharing your thoughts! I will look into their paper and code to find something useful.

Seems like they used https://smcnus.comp.nus.edu.sg/nus-48e-sung-and-spoken-lyrics-corpus/ for singing voice synthesis, but unfortunately there are no musicxml files and MIDI files available, which makes the task quite difficult. I guess the dataset was designed for speech-to-singing voice conversion.

Yablon · 2020-04-22T07:03:28Z

I can help with the DSP part too. When you publish your data process pipeline, I can help with building the LPCNET vocoder for the specific spectrogram and pitch or else.

r9y9 · 2020-04-22T07:38:39Z

That would be great! I am now working on refactoring data processing code for Kiritan database. After that, I will make a simple time-lag model and a duration model (those described in https://ieeexplore.ieee.org/document/8659797).

Once we complete

data preprocessing (musicxml to feature vectors, acoustic feature extraction, etc)
time-lag model
duration model
acoustic model (this is done already),

we can start experimenting with advanced ideas including neural vocoder integration, explicit vibrato consideration, end-to-end approach, GAN, Transformer, etc. I will keep posting progress here. I suppose that I will finish making a whole system in one or two weeks hopefully.

Yablon · 2020-04-22T08:48:57Z

Great !

r9y9 · 2020-04-24T02:11:36Z

A new paper for Chinese singing voice synthesis comes up on arxiv! It was submitted to INTERSPEECH 2020. Looks very interesting.

"ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders"

arxiv: https://arxiv.org/abs/2004.11012
samples: https://bytesings.github.io/paper1.html

Yablon · 2020-04-24T07:11:13Z

Yes, it is. Tacotron(2) structure could be used everywhere and perform well. Is your implemention performs as well ?

I think tacotron structure may needs more data, and dnn based may need less and performs more stable. What's your oppion ?

r9y9 · 2020-04-24T15:01:30Z

In TTS, we typically need more than 10 hours of data to build attention-based seq2seq models. However, in contrast to TTS, SVS is highly constrained by a musical score (e.g. pitch, note duration, tempo, etc), so I suppose that we can build Tacotron-like models even on a small dataset. For example, see https://arxiv.org/abs/1910.09989.

There are pros and cons in traditional parametric-based approaches and end-to-end approaches. I want to try the traditional one first, since it is simple and enables us to perform fast iterations of experiments, which I think is important at the early stage of prototyping.

As for the Tacotron implementation, I implemented it before (https://github.com/r9y9/tacotron_pytorch) but it is now outdated. I would use https://github.com/espnet/espnet for Tacotron 2 or Transformer implementations. The toolkit is a little bit complicated, but it is well tested and worth resuing the components.

r9y9 · 2020-04-24T15:04:11Z

I pushed the data preparation scripts for kiritan database: https://github.com/r9y9/kiritan_singing. I suppose I will finish making the entire system this weekend. Please wait for a few days!

Yablon · 2020-04-24T16:14:43Z

That is exciting!

apeguero1 · 2020-04-24T16:42:56Z

That's awesome can't wait to test it! :D

r9y9 · 2020-04-25T16:25:05Z

I have implemented the time-lag model and duration model as well as the acoustic model. Now that we can generate a singing voice from a musicxml file. A generated sample can be found at https://soundcloud.com/r9y9/kiritan-01-test-svs-7?in=r9y9/sets/dnn-based-singing-voice. The quality is not good but not bad.

I pushed lots of code including feature extraction, normalization, training, inference. The inference script is too dirty at the moment and needs to be refactored. I plan to do it tomorrow.

Also, I pushed a recipe so that anyone can (ideally) reproduce my experiments. https://github.com/r9y9/dnnsvs/tree/master/egs/kiritan_singing Note that this is still WIP and may subject to change.

r9y9 · 2020-04-25T16:27:31Z

I think the recipe is helpful for researchers but not very friendly for those who are not familiar with the internal of singing voice synthesis systems. I plan to make a Jupyter notebook to demonstrate the usage and how it works.

r9y9 · 2020-04-25T16:29:41Z

I realized that SVS systems are more complicated than I initially thought. There are lots of things we need to do!

ftshijt · 2020-04-30T02:02:06Z

Hi, just notice the project. It's awesome! There are not any open-source toolkit for singing voices out there.

I'm not sure but it seems there are systems directly using the singing and alignment for training (e.g. https://github.com/seaniezhao/torch_npss). A possible direction might be pre-trained on raw data (maybe some alignment) and then refined on data with musicxml (after all, those strictly aligned data is much hard to obtain).

BTW, do you have any intention to make the project with more general framework that not only confined to synthesis? As for ESPNet, it also has tasks including asr, speech translation, and speech enhancement.

r9y9 · 2020-04-30T11:56:45Z

Hi @ftshijt. Thanks :)

The paper "A Neural Parametric Singing Synthesizer" is very interesting. They propose a multi-stream autoregressive model for vocoder parameters; that's what I planned to do next! I was inspired by the paper "Autoregressive Neural F0 Model for Statistical Parametric Speech Synthesis" https://ieeexplore.ieee.org/abstract/document/8341752/.

As for alignment, yes, it is sometimes hard to obtain. Regarding the Japanese Kiritan database, they provide annotated alignments, so I am using it (with small corrections). If there are no manual alignments, we can take the learning-based approach. For example, similar to what the authors of the above paper have done, we can use HMM to obtain alignment in an unsupervised manner.

For this project direction, I want to focus on singing voice synthesis. ESPnet is an excellent tool for many speech tasks (I am one of the authors of ESPnet-TTS paper). However, it comes with complexity. Some of my friends in the TTS community told me it was difficult to use. To simplify the codebase and make it hackable, extensible, I want to focus on SVS. That said, I want to make a generic tool to support a broader range of models, from parametric to end-to-end.

r9y9 · 2020-04-30T12:05:41Z

Not planned yet, but the speech-to-singing voice conversion task may fit in ESPnet's unified approach.

apeguero1 · 2020-05-01T17:07:39Z

Whoa! seems like OpenAI just released the GPT2 of music! I wonder how hard it would be to reproduce this without a million songs or hundreds of gpus. And if it works for songs with instrumentation then maybe it would be easier to train on a purely vocal dataset? The paper doesn't mention much about finetuning but perhaps there's some transfer learning opportunities here?

r9y9 · 2020-05-02T13:47:32Z

I was so surprised that OpenAI's model is able to generate singing voices and instrumental simultaneously. It would be easier to train on a vocal dataset and transfer learning is definitely worth trying.

r9y9 · 2020-05-02T13:49:13Z

As a minor issue, let me rename the repo from dnnsvs to nnsvs.

r9y9 · 2020-05-03T02:08:44Z

I have created a jupyter notebook to demonstrate how we can use pre-trained models to generate singing voice samples.

Neural network-based singing voice synthesis demo using kiritan_singing database (Japanese)

Nbviewer: https://nbviewer.jupyter.org/gist/r9y9/79705665ed5a94f0028839ca40992751

Here goes if any of you are interested. If you want to just see the demo, check the pre-rendered nbviewer's page. If you want a interactive demo, look the google colab's one.

r9y9 · 2020-05-03T03:02:38Z

I pushed all the code for feature extraction, training, and inference as well. Models used in the above demo can be reproduced by running the following recipe:

https://github.com/r9y9/nnsvs/tree/master/egs/kiritan_singing/00-svs-world

apeguero1 · 2020-05-03T20:30:32Z

The notebook is great! The step by step approach makes it easier to follow (: voice sounds good so far!

r9y9 · 2020-05-21T15:32:06Z

I made a new recipe for nit-song070, which is a singing voice dataset provided by the HTS working group. The dataset contains 31 songs recorded by a female Japanese singer. Data size is not huge but it is good for testing.

r9y9 · 2020-05-25T14:17:43Z

I have added another recipe for jsut-song dataset.

r9y9 · 2020-06-03T23:39:23Z

Good news: the author of NSF published a pytorch implementation of NSF: https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts

It should be easy to integrate it with our codebase.

r9y9 · 2020-06-03T23:53:31Z

out_acoustic directory contains 1) acoustic features (*-feats.npy) and 2) waveform (*-wave.npy), which can be used for training neural vocoders.

ls -l dump/kiritan/norm/train_no_dev/out_acoustic/ | head
total 1254736
-rw-rw-r-- 1 ryuichi ryuichi 2315692  5月 27 00:28 03_seg0-feats.npy
-rw-rw-r-- 1 ryuichi ryuichi 2792768  5月 27 00:28 03_seg0-wave.npy
-rw-rw-r-- 1 ryuichi ryuichi 1161492  5月 27 00:28 03_seg1-feats.npy
-rw-rw-r-- 1 ryuichi ryuichi 1400768  5月 27 00:28 03_seg1-wave.npy
-rw-rw-r-- 1 ryuichi ryuichi 1567452  5月 27 00:28 03_seg2-feats.npy
-rw-rw-r-- 1 ryuichi ryuichi 1890368  5月 27 00:28 03_seg2-wave.npy
-rw-rw-r-- 1 ryuichi ryuichi 1624764  5月 27 00:28 03_seg3-feats.npy
-rw-rw-r-- 1 ryuichi ryuichi 1959488  5月 27 00:28 03_seg3-wave.npy
-rw-rw-r-- 1 ryuichi ryuichi 2060972  5月 27 00:28 03_seg4-feats.npy

poria-cat · 2020-06-18T02:40:59Z

I think transformer is not a good choice for svs. Songs' mel spectrogram length always large than text speech audio, so it will out of memmory (such as decoder mask and ).
Tacotron(2) is not a good choice too. it need too many time and data.
These are the answers I got on the kiritan database 😞

ak9250 · 2020-09-07T17:21:54Z

this paper looks interesting https://speechresearch.github.io/hifisinger/

prepare_featuress の高速化を ProcessPoolExecutor を使用するように変更

r9y9 · 2022-04-12T12:49:55Z

https://soundcloud.com/r9y9/sets/nnsvs-and-neutrino-comparison

I think the latest nnsvs is finally comparable with neutrino (that was the goal of this project). Still, there's a pretty of room for improvement of acoustic modeling though.

poria-cat · 2022-04-12T12:53:36Z

https://soundcloud.com/r9y9/sets/nnsvs-and-neutrino-comparison

I think the latest nnsvs is finally comparable with neutrino (that was the goal of this project). Still, there's a pretty of room for improvement of acoustic modeling though.

WoW, This sounds great already!

r9y9 · 2022-07-04T16:33:16Z

I moved this repository to https://github.com/nnsvs/nnsvs since I plan to add some related repositories in the future. Nothing is changed in terms of functionality.

r9y9 · 2022-11-27T04:52:13Z

I guess I can finally close this issue once #167 is ready.

r9y9 · 2022-11-30T18:13:32Z

NNSVS 2020/05: https://on.soundcloud.com/6912N
NNSVS 2022/12: https://on.soundcloud.com/JHBPs
NEUTRINO: https://on.soundcloud.com/EPf7T

I think I have finally done achieving NEUTRINO-level quality. Closing this issue, finally.

r9y9 added the discussion label Apr 19, 2020

r9y9 mentioned this issue Sep 7, 2020

Improved acoustic model support: introducing autoregressive structure #15

Closed

4 tasks

r9y9 pushed a commit that referenced this issue Sep 27, 2021

Merge pull request #1 from oatsu-gh/for_pull_request

059c157

prepare_featuress の高速化を ProcessPoolExecutor を使用するように変更

r9y9 mentioned this issue Mar 6, 2022

Experimental: Add support for explicit vibrato modeling #73

Merged

r9y9 self-assigned this May 21, 2022

r9y9 added this to the v0.1.0 release milestone Jun 9, 2022

r9y9 closed this as completed Nov 30, 2022

r9y9 mentioned this issue Nov 30, 2022

Discussion: NNSVS vs. NEUTRINO #101

Open

Implementation status and planned TODOs #1

Implementation status and planned TODOs #1

Comments

r9y9 commented Apr 19, 2020 • edited

MIlestones

Fundamental components

Demo

Dataset

Frontend

DSP

Acoustic model

Timing model & duration model

Vocoder

Command-line tools

Data loader

Design TODOs

Software quality

Recipes

Misc

References

Yablon commented Apr 20, 2020 • edited

r9y9 commented Apr 20, 2020

Yablon commented Apr 20, 2020

apeguero1 commented Apr 21, 2020

r9y9 commented Apr 22, 2020

Yablon commented Apr 22, 2020

r9y9 commented Apr 22, 2020

Yablon commented Apr 22, 2020

r9y9 commented Apr 24, 2020

Yablon commented Apr 24, 2020

r9y9 commented Apr 24, 2020

r9y9 commented Apr 24, 2020

Yablon commented Apr 24, 2020

apeguero1 commented Apr 24, 2020

r9y9 commented Apr 25, 2020

r9y9 commented Apr 25, 2020

r9y9 commented Apr 25, 2020

ftshijt commented Apr 30, 2020

r9y9 commented Apr 30, 2020

r9y9 commented Apr 30, 2020

apeguero1 commented May 1, 2020

r9y9 commented May 2, 2020

r9y9 commented May 2, 2020

r9y9 commented May 3, 2020 • edited

Neural network-based singing voice synthesis demo using kiritan_singing database (Japanese)

r9y9 commented May 3, 2020

apeguero1 commented May 3, 2020

r9y9 commented May 21, 2020

r9y9 commented May 25, 2020

r9y9 commented Jun 3, 2020

r9y9 commented Jun 3, 2020

poria-cat commented Jun 18, 2020

ak9250 commented Sep 7, 2020

r9y9 commented Apr 12, 2022

poria-cat commented Apr 12, 2022

r9y9 commented Jul 4, 2022

r9y9 commented Nov 27, 2022

r9y9 commented Nov 30, 2022

r9y9 commented Apr 19, 2020 •

edited

Yablon commented Apr 20, 2020 •

edited

r9y9 commented May 3, 2020 •

edited