
Rhythmizers for other languages #62

Closed
haru0l opened this issue Mar 3, 2023 · 3 comments

Comments

@haru0l

haru0l commented Mar 3, 2023

Hi again, I was wondering if there was any documentation regarding the rhythmizers, as I would like to train one for Japanese...

Side question, would a rhythmizer work on CVVC languages such as English or Polish?

@yqzhishen
Member

Rhythmizers are actually a temporary solution for phoneme duration prediction in MIDI-less models. A rhythmizer contains the FastSpeech2 encoder module and the DurationPredictor module from MIDI-A mode.
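To make that composition concrete, here is a minimal schematic sketch. Every class and function name below is an illustrative stand-in, not DiffSinger's actual code or API; it only shows the shape of the pipeline: an encoder pass producing one hidden state per phoneme, then a predictor producing one duration per phoneme.

```python
# Schematic sketch of a rhythmizer: an encoder produces hidden states per
# phoneme, and a duration predictor maps them to frame counts.
# All names here are illustrative stand-ins, not DiffSinger's real modules.

class ToyEncoder:
    """Stand-in for the FastSpeech2 encoder: embeds each phoneme ID."""
    def __init__(self, dim=4):
        self.dim = dim

    def __call__(self, phoneme_ids):
        # A real encoder returns contextual hidden states; here we just
        # produce a fixed-size vector per phoneme.
        return [[float(p)] * self.dim for p in phoneme_ids]

class ToyDurationPredictor:
    """Stand-in for the DurationPredictor: hidden state -> frame count."""
    def __call__(self, hidden_states):
        # A real predictor is a learned network emitting log-durations;
        # here we fake it with a deterministic rule, clamped to >= 1 frame.
        return [max(1, round(sum(h))) for h in hidden_states]

def rhythmize(phoneme_ids):
    """Predict one duration (in frames) per input phoneme."""
    hidden = ToyEncoder()(phoneme_ids)
    return ToyDurationPredictor()(hidden)

print(rhythmize([2, 5, 1]))  # -> [8, 20, 4], one frame count per phoneme
```

The key property this mirrors is that the rhythmizer needs only the phoneme sequence as input, which is exactly what a MIDI-less model has available.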

Models of MIDI-A mode can predict phoneme durations well and generate nice spectrograms, but their datasets are hard to label (you need MIDI sequences and slurs), and they predict pitch poorly, even though they do have a PitchPredictor. That is why we are deprecating this mode in this forked repository.

To get a rhythmizer, you first need to choose or design a phoneme dictionary. Then you should label your dataset in the opencpop segments format. Please note that the MIDI duration transcriptions of opencpop are in consonant-vowel format, whereas you need to label your dataset in vowel-consonant format; that is, the beginning of each note should be aligned with the beginning of the vowel instead of the consonant (see issue). Here is an example of the labels that we converted from the original opencpop transcriptions: transcriptions-strict-revised2.txt. The last step is to preprocess your dataset and train a MIDI-A model with this config. After that, you can export the part used for duration prediction with this script.
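The vowel alignment just described can be illustrated with a toy re-alignment step. This is a hedged sketch with a made-up tuple layout and integer millisecond timings, not the real opencpop transcription format; it only shows how each note onset moves from the consonant's start to the vowel's start.

```python
def vowel_align(syllables):
    """Shift each note onset from consonant start to vowel start.

    Each syllable is (consonant, vowel, cons_ms, vowel_ms, note_onset_ms),
    an illustrative layout. In consonant-vowel labels the note starts with
    the consonant; vowel-consonant labels start the note at the vowel, so
    the consonant's duration is pushed before the note.
    """
    realigned = []
    for cons, vowel, c_ms, v_ms, onset_ms in syllables:
        # The note now begins where the vowel begins.
        realigned.append((cons, vowel, c_ms, v_ms, onset_ms + c_ms))
    return realigned

# 'sh' lasts 80 ms; the note that used to start at 1000 ms (with the
# consonant) now starts at 1080 ms (with the vowel 'i').
print(vowel_align([("sh", "i", 80, 420, 1000)]))
# -> [('sh', 'i', 80, 420, 1080)]
```

In a real dataset the same shift has to be applied consistently across every syllable boundary, which is why relabeling an existing consonant-aligned transcription is non-trivial.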

For CVVC languages like English and Polish, the answer is no, because we can currently only handle two-phase (CV) phoneme systems like Chinese and Japanese. MIDI-A, MIDI-B, duration predictors, data labels and all other word-phoneme related machinery will be redesigned in the future, and at that point you can expect full support for all languages. No rhythmizers will be needed then - everyone will be able to train their own variance adaptors (containing duration and pitch models and much more) through standard pipelines, as easily as preparing and training MIDI-less acoustic models today.
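The two-phase (CV) assumption means every syllable decomposes into at most one consonant plus one vowel, which a dictionary can express as a fixed mapping. A sketch with a tiny made-up pinyin-style dictionary (a toy subset, not a real DiffSinger dictionary file):

```python
# Toy CV dictionary mapping each syllable to (consonant, vowel);
# pure-vowel syllables have no consonant. Illustrative entries only.
CV_DICT = {
    "zhong": ("zh", "ong"),
    "guo": ("g", "uo"),
    "a": (None, "a"),
}

def syllable_to_phonemes(syllable):
    """Expand one syllable into its phoneme sequence via the CV dictionary."""
    cons, vowel = CV_DICT[syllable]
    return [vowel] if cons is None else [cons, vowel]

print(syllable_to_phonemes("zhong"))  # -> ['zh', 'ong']
print(syllable_to_phonemes("a"))      # -> ['a']
```

An English word like "strengths" (a CCCVCCC cluster) cannot be expressed as one consonant plus one vowel, which is why this per-syllable dictionary scheme does not carry over to CVVC languages.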

By the way, members of our team are already preparing a Japanese rhythmizer. Once they finish the dictionary, the rhythmizer and the MFA model, we will formally support Japanese MIDI-less mode preparation in our pipeline. If you find it difficult to prepare one on your own, it is fine to just wait for our progress.

@yqzhishen
Member

@haru0l Hi there, we have started a discussion on the design of the Japanese dictionary here: #68

We would be happy if you are interested in taking part in that.

@haru0l
Author

haru0l commented Mar 14, 2023

Will do! Also I should close this...

@haru0l haru0l closed this as completed Mar 14, 2023