
Documentation: Audio + Text Feature Extraction #2

Closed
JRMeyer opened this issue Dec 16, 2019 · 26 comments

@JRMeyer
Contributor

JRMeyer commented Dec 16, 2019

The usage instructions are missing some information on the feature preprocessing step.

  • It would be helpful to give more exact instructions on how to extract spectrograms and phoneme features. Does the code expect phoneme .lab files from Festival, as indicated in r9y9's deepvoice code? https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess
  • Is it possible to use the deepvoice code to get the exact features expected by nonparaSeq2SeqVC? If so, which scripts are needed?

I'm going to try out the code now, and I'll send PRs on documentation when I'm confident I can add something.

Thanks!

@JRMeyer
Contributor Author

JRMeyer commented Dec 17, 2019

@jxzhanggg
Owner

jxzhanggg commented Dec 18, 2019

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

@JRMeyer
Contributor Author

JRMeyer commented Dec 18, 2019

Update:

It seems for VCTK, the code needed for preprocessing is here: https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess

Original Reply

How did you generate the training list?

As suggested in the README, I downloaded VCTK and installed deepvoice from @r9y9, and ran the following preprocessing script:

python preprocess.py --preset=presets/deepvoice3_vctk.json vctk ~/vctk/DS_10283_2651/VCTK-Corpus/ ~/vctk/processed

This seemed to finish successfully, and now I have a dir with a train.txt file and many vctk-mel-*.npy and vctk-spec-*.npy files:

~/vctk/processed$ ls | head
train.txt
vctk-mel-00001.npy
vctk-mel-00002.npy
vctk-mel-00003.npy
vctk-mel-00004.npy
vctk-mel-00005.npy
vctk-mel-00006.npy
vctk-mel-00007.npy
vctk-mel-00008.npy
vctk-mel-00009.npy

However, I don't see any *.list files. How can I generate such files?

@jxzhanggg
Owner

jxzhanggg commented Dec 19, 2019

You can write a script that walks through all the training samples to generate the list files.
The list file format is flexible, as long as the data reader is modified to fit your list files and prepared training data.
Part of my code looks like this:

import os
import random

SPEC_DIR = '/home/jxzhang/Documents/DataSets/VCTK/spec'

train_list = []
eval_list = []
test_list = []

for speaker in seen_speakers:  # seen_speakers: list of speaker IDs, e.g. ['p225', ...]
    speaker_dir = os.path.join(SPEC_DIR, speaker)
    files = [os.path.join(speaker_dir, fn) for fn in os.listdir(speaker_dir)]
    random.shuffle(files)
    test_files = files[:20]    # 20 utterances per speaker for testing
    eval_files = files[20:30]  # the next 10 for evaluation
    train_files = files[30:]   # the rest for training

    train_list.extend(train_files)
    eval_list.extend(eval_files)
    test_list.extend(test_files)
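
The snippet above only collects the paths; a small follow-up step (not included in the repo) still has to write the spectrogram_path acoustic_frame_number phone_number lines. A minimal sketch, assuming the spectrograms are saved as (n_bins, n_frames) arrays and that a matching .lab file holds each utterance's phoneme labels (both assumptions, not the author's script):

import numpy as np

def write_list(list_path, spec_paths):
    # one 'spectrogram_path acoustic_frame_number phone_number' line per utterance
    with open(list_path, 'w') as f:
        for spec_path in spec_paths:
            n_frames = np.load(spec_path).shape[1]        # assumes (n_bins, n_frames)
            lab_path = spec_path.replace('.npy', '.lab')  # hypothetical naming scheme
            with open(lab_path) as lab:
                n_phones = sum(1 for line in lab if line.strip())
            f.write('%s %d %d\n' % (spec_path, n_frames, n_phones))

write_list('train.list', train_list)
write_list('eval.list', eval_list)
write_list('test.list', test_list)

As noted further down in the thread, the two counts are not actually consumed during training, so exact agreement matters less than the paths.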

@jxzhanggg
Owner

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.
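
A minimal sketch of that computation (not the author's script), assuming log-Mel features stored as (n_mels, n_frames) .npy arrays under a per-speaker directory layout; the glob pattern and output filename are assumptions:

import glob
import numpy as np

# accumulate running sums rather than concatenating everything, so memory stays bounded
total, total_sq, n = 0.0, 0.0, 0
for path in glob.glob('/path/to/VCTK/spec/*/log-mel-*.npy'):
    mel = np.load(path)                    # (n_mels, n_frames)
    total += mel.sum(axis=1)
    total_sq += (mel ** 2).sum(axis=1)
    n += mel.shape[1]

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)    # E[x^2] - E[x]^2, per Mel bin
np.save('mel_mean_std.npy', np.stack([mean, std]))  # the reader indexes [0] and [1]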

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

Thanks for explaining the formation of the *.list files, but I'm still confused about how to generate the phoneme alignments, which are referenced as acoustic_frame_number and phone_number in the *.list files.

I understand how HMM phoneme forced-alignment works, and I understand how audio frames work (I've used HTK, Kaldi, and other speech toolkits). However, after spending a few hours on this today, I still have not succeeded in generating the alignments in the needed format. Here's what I've done:

  • Installed HTK / Merlin / Edinburgh Speech Tools / deepvoice3
  • Set all environment variables (e.g. ESTDIR)
  • Downloaded VCTK
  • Ran the deepvoice preprocessing on VCTK with vctk_preprocess/extract.py

I performed all these steps, but I'm still not able to generate the phone alignments. I would like to replicate the alignment the way you did it (with deepvoice and CSTR tools), instead of using Kaldi or something else.

Do you have preprocessing scripts you can share? I think this would be very good for your project.

@jxzhanggg
Owner

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax
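
Since only the phoneme sequence is needed for training, stripping the timing columns from such a label file takes a few lines (a minimal sketch; the times above are HTK-style 100-ns units, and the example filename is hypothetical):

def read_phoneme_sequence(lab_path):
    # return just the phonemes from an HTK-style 'start end phone' label file
    phones = []
    with open(lab_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                phones.append(parts[2])
    return phones

# e.g. read_phoneme_sequence('p225_001.lab') -> ['p', 'l', 'iy', 'z', ...]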

@jxzhanggg
Owner

Also, the number of acoustic frames and the number of phones in an utterance are computed by another little program. Although I put these two numbers in my list file, I didn't actually use them during training, as you can see from my data reader.

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.

Why did you do this? Did you find it helped?

mel = (mel - self.mel_mean_std[0])/ self.mel_mean_std[1]

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax

It seems that phoneme alignments are used in this for-loop:

https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/c977fe9856f4b1d2b494eea4baa320d3e6ecf5c6/pre-train/reader/reader.py#L79_L83

Can you explain this some more, please? Thanks!

@JRMeyer changed the title from "Documentation: Feature Processing with DeepVoice3" to "Documentation: Audio + Text Feature Extraction" on Dec 19, 2019
@jxzhanggg
Owner

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax

It seems that phoneme alignments are used in this for-loop:

https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/c977fe9856f4b1d2b494eea4baa320d3e6ecf5c6/pre-train/reader/reader.py#L79_L83

Can you explain this some more, please? Thanks!

Yes, these lines do use the alignment to generate text alignments. However, during batching, as you can see from the TextMelIDCollate() class, the text targets and text ranks are not used.
TextMelIDLoader() prepares the text alignments, but TextMelIDCollate() drops them. I think these lines could be deleted to avoid confusion.

@jxzhanggg
Owner

jxzhanggg commented Dec 20, 2019

See commit ab1f8d4, which cleans up the text alignments.

@jxzhanggg
Owner

jxzhanggg commented Dec 20, 2019

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.

Why did you do this? Did you find it helped?

mel = (mel - self.mel_mean_std[0])/ self.mel_mean_std[1]

This line performs feature mean and standard-deviation normalization, i.e.
$$\hat{x} = (x - \mu) / \sigma$$
The normalization brings log-Mel spectrograms and linear spectrograms into a reasonable value range. In my experience, feature normalization is a useful and common way to preprocess a model's inputs.
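
(Note that at synthesis time the inverse mapping, $x = \hat{x}\,\sigma + \mu$, presumably has to be applied to the model's predictions before vocoding.)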

@JRMeyer
Contributor Author

JRMeyer commented Dec 20, 2019

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Does phone_number in this case refer to the number of phones in the text transcript?

@JRMeyer
Contributor Author

JRMeyer commented Dec 20, 2019

The paper reports using only log-scaled Mel-spectrograms as audio features, but reader.py expects both Mel-spectrograms and linear spectrograms... can we remove the linear spectrograms from reader.py?

@jxzhanggg
Owner

jxzhanggg commented Dec 21, 2019

The paper reports using only log-scaled Mel-spectrograms as audio features, but reader.py expects both Mel-spectrograms and linear spectrograms... can we remove the linear spectrograms from reader.py?

I think it's okay to keep the linear spectrograms in case you want to use a Griffin-Lim vocoder:
linear spectrograms recover higher-quality audio when Griffin-Lim is used.
If a WaveNet vocoder is used, Mel-spectrograms are the more suitable output features.
Whether to predict linear spectrograms is therefore a configuration option in hparams;
see hparams.predict_spectrogram.
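
For concreteness, here is a minimal sketch of inverting a saved log-linear spectrogram with librosa's Griffin-Lim. The STFT parameters mirror the extraction settings quoted later in this thread (hop_length=200, win_length=800 at 16 kHz) and are assumptions, not repo configuration:

import numpy as np
import librosa
import soundfile as sf

log_spec = np.load('log-spec-p225_090.npy')  # assumed shape: (1 + n_fft//2, n_frames)
magnitude = np.exp(log_spec)                 # undo the log scaling
audio = librosa.griffinlim(magnitude, hop_length=200, win_length=800)
sf.write('reconstructed.wav', audio, 16000)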

@jxzhanggg
Owner

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Does phone_number in this case refer to the number of phones in the text transcript?

Yes, it is.

@huukim136

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.

Thanks in advance!

@jxzhanggg
Owner

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.

Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.
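
A quick way to cover steps 3 and 4 is to verify that every listed spectrogram loads and normalizes to finite values before launching training. A minimal sketch, assuming the mel_mean_std.npy layout from the normalization sketch earlier in this thread:

import numpy as np

mean, std = np.load('mel_mean_std.npy')              # assumed: stacked [mean, std], each (n_mels,)
with open('train.list') as f:
    for line in f:
        spec_path = line.split()[0]
        mel = np.load(spec_path)                     # (n_mels, n_frames)
        norm = (mel - mean[:, None]) / std[:, None]  # broadcast over frames
        assert np.isfinite(norm).all(), spec_path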

@huukim136

huukim136 commented Jan 8, 2020

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.
Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.

Thanks for your support, I think I'm on the right track.
But there's one more thing I'm confused about: why do we need to extract the log-spec features at all? In your code you set the parameter predict_spectrogram to False, so the log-spec features are never used (that's my understanding; maybe I'm wrong somewhere).
Thank you.

@jxzhanggg
Owner

jxzhanggg commented Jan 8, 2020

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.
Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.

Thanks for your support, I think I'm on the right track.
But there's one more thing I'm confused about: why do we need to extract the log-spec features at all? In your code you set the parameter predict_spectrogram to False, so the log-spec features are never used (that's my understanding; maybe I'm wrong somewhere).
Thank you.

Yes, you're right. If you don't want to predict the linear spectrogram, the linear-spectrogram features are not necessary.

@JRMeyer
Contributor Author

JRMeyer commented Jan 11, 2020

@huukim136 - for feature extraction, I've been using this: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/master/pre-train/reader/extract_features.py

@huukim136

@huukim136 - for feature extraction, I've been using this: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/master/pre-train/reader/extract_features.py

@JRMeyer thank you so much. You saved my day!

@Pydataman

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Did you generate these train and dev list files yourself? This repo doesn't include that code.

@jxzhanggg
Owner

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Did you generate these train and dev list files yourself? This repo doesn't include that code.

Yes, I generated the list files myself to collect the training samples together.

@JeffC0628

JeffC0628 commented Dec 9, 2020

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

How did you get the acoustic_frame_number (e.g. 135, 145, 365, 103, 57, ...)? I used the extract_mel_spec function in extract_features.py, but I got different frame counts, although the phone_number values are the same. Here is my result:

/VCTK/spec/p225/log-spec-p225_090.npy 346 22
/VCTK/spec/p225/log-spec-p225_118.npy 315 23
/VCTK/spec/p225/log-spec-p225_014.npy 533 52
/VCTK/spec/p225/log-spec-p225_179.npy 250 11

and my parameters are:
import numpy as np
import librosa

y, sample_rate = librosa.load(filename, sr=16000)
spec = librosa.core.stft(y=y, n_fft=2048, hop_length=200, win_length=800,
                         window='hann', center=True, pad_mode='reflect')
spec = librosa.magphase(spec)[0]
log_spectrogram = np.log(spec).astype(np.float32)
mel_spectrogram = librosa.feature.melspectrogram(S=spec, sr=sample_rate, n_mels=80,
                                                 power=1.0, fmin=0.0, fmax=None,
                                                 htk=False, norm=1)
log_mel_spectrogram = np.log(mel_spectrogram).astype(np.float32)
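
For reference, with center=True the STFT frame count is 1 + len(y) // hop_length, so a mismatch like 346 vs. 135 frames usually traces back to a different sample rate, hop length, or a silence-trimming step in the original pipeline. A quick check against the settings above:

# expected frame count for a center=True STFT (a general librosa fact, not repo code)
expected_frames = 1 + len(y) // 200   # hop_length = 200 samples at 16 kHz (12.5 ms)
assert spec.shape[1] == expected_frames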
