Documentation: Audio + Text Feature Extraction #2
Specifically, what does each of these files look like?
They are my prepared training lists.
Update: It seems that for VCTK, the code needed for preprocessing is here: https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess

Original reply: How did you generate the training list? As suggested in the README, I downloaded VCTK and installed
This seemed to finish successfully, and now I have a dir with a
However, I don't see any
A script can be written to walk through all training samples and generate the list files; a rough sketch is below.
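A minimal sketch of such a script, assuming the mel features are saved as per-utterance .npy files and the phone labels as whitespace-separated .txt files (both layouts are assumptions, not this repo's exact format):

```python
import os
import numpy as np

def write_list(feature_dir, label_dir, out_path):
    """Walk the extracted features and write one list line per utterance."""
    with open(out_path, "w") as f:
        for name in sorted(os.listdir(feature_dir)):
            if not name.endswith(".npy"):
                continue
            utt = os.path.splitext(name)[0]
            mel = np.load(os.path.join(feature_dir, name))  # (n_frames, n_mels)
            with open(os.path.join(label_dir, utt + ".txt")) as lf:
                phones = lf.read().split()
            f.write(f"{utt} {mel.shape[0]} {len(phones)}\n")

write_list("mels/", "labels/", "train_list.txt")
```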
As for the mean_std file, that's the file used for mean and standard-deviation normalization. You should compute these statistics over your own training features.
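A minimal sketch of how such a file could be produced, assuming each utterance's mel-spectrogram is saved as a (n_frames, n_mels) .npy array (the file names here are hypothetical):

```python
import glob
import numpy as np

# Stack every training mel-spectrogram along the time axis.
frames = np.concatenate([np.load(p) for p in glob.glob("mels/*.npy")], axis=0)
mean = frames.mean(axis=0)   # per-dimension mean
std = frames.std(axis=0)     # per-dimension standard deviation
np.save("mel_mean_std.npy", np.stack([mean, std]))
```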
Thanks for explaining the format of those files. I understand how HMM phoneme forced-alignment works, and I understand how audio frames work (I've used HTK, Kaldi, and other speech toolkits). However, after spending a few hours on this today, I still have not succeeded in generating the alignments in the needed format. Here's what I've done:
I performed all these steps, but I'm still not able to generate the phone alignments. I would like to replicate the alignment as you did (with the deepvoice and CSTR tools) instead of using Kaldi or something else. Do you have preprocessing scripts you can share? I think this would be very good for your project.
Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Also, the number of acoustic frames and the number of phones in an utterance are computed by another little program. Although I prepared these two numbers in my list file, I didn't actually use them during training, as you can see from my data reader.
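Such a "little program" could be as simple as the following sketch (the file formats are assumptions):

```python
import numpy as np

def counts_for_utterance(mel_path, phone_path):
    """Return (acoustic frame number, phone number) for one utterance."""
    n_frames = np.load(mel_path).shape[0]    # rows of the (n_frames, n_mels) matrix
    with open(phone_path) as f:
        n_phones = len(f.read().split())     # whitespace-separated phone labels
    return n_frames, n_phones
```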
Why did you do this? Did you find it helped?
It seems that phoneme alignments are used in this for-loop: Can you explain this some more, please? Thanks!
Yes, these lines do use the alignment to generate text alignments. However, during batching, which you can see from the TextMelIDCollate() class, the text targets and text ranks are not used.
See the commit ab1f8d4 for cleaning up the text alignments.
This line is for the feature mean and standard-deviation normalization, i.e. subtracting the mean and dividing by the standard deviation.
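In other words, something along these lines (the file names are assumptions):

```python
import numpy as np

mean, std = np.load("mel_mean_std.npy")  # per-dimension statistics, shape (2, n_mels)
mel = np.load("mels/utt.npy")            # (n_frames, n_mels)
mel_norm = (mel - mean) / std            # zero mean, unit variance per dimension
```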
does
The paper reports using only log-scaled Mel-spectrograms for audio features, but the
I think it's okay to keep the linear spectrogram in case you want to use a Griffin-Lim vocoder.
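For illustration, a linear magnitude spectrogram can be inverted to a waveform with librosa's Griffin-Lim; the STFT parameters and file names below are assumptions, not this repo's exact settings:

```python
import numpy as np
import librosa
import soundfile as sf

# Assumed on-disk layout: (n_frames, 1 + n_fft//2); transpose for librosa.
# If the spectrogram is log-scaled, apply np.exp() before inverting.
spec = np.load("linear/utt.npy").T
wav = librosa.griffinlim(spec, n_iter=32, hop_length=256, win_length=1024)
sf.write("reconstructed.wav", wav, samplerate=16000)
```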
Yes, it is
Could you please give us more information and code about feature extraction (mel and phonemes)? It's been a few days, but I still cannot generate the proper features to match your code. Thanks in advance!
Basically, you should follow these steps:
If you have some more specific difficulties, you could provide more details.
Thanks for your support; I think I'm on the right track.
Yes, you are right. If you don't want to predict the linear spectrogram, the spectrogram features are
@huukim136 - for feature extraction, I've been using this: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/master/pre-train/reader/extract_features.py
@JRMeyer thank you so much. You saved my day!
Are the train and dev txt files generated by yourself? This repo does not include that code.
Yes, I generated something like a training list file to collect the training samples together.
How did you get the acoustic_frame_number (e.g. 135, 145, 365, 103, 57, ...)? I used the extract_mel_spec function in extract_features.py, but got different numbers of frames; however, the phone_number is the same. Here is my result:
and my params are
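One likely cause of a frame-count mismatch is the STFT padding convention: with librosa's default center=True, n_frames = 1 + len(y) // hop_length, while center=False gives 1 + (len(y) - n_fft) // hop_length. A quick sanity check (the file name and parameters are assumptions):

```python
import librosa

y, sr = librosa.load("p225_001.wav", sr=16000)
S = librosa.stft(y, n_fft=1024, hop_length=256, center=True)
print(S.shape[1], 1 + len(y) // 256)  # the two numbers should match
```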
The usage instructions are missing some information on the feature preprocessing step.

Do I need to generate the .lab files from Festival, as indicated in r9y9's deepvoice code? https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess

Do I run the deepvoice code to get the exact features expected by nonparaSeq2SeqVC? If so, which scripts are needed?

I'm going to try out the code now, and I'll send PRs on documentation when I'm confident I can add something.

Thanks!