
Documentation: Audio + Text Feature Extraction #2

Closed
JRMeyer opened this issue Dec 16, 2019 · 26 comments

@JRMeyer
Contributor

JRMeyer commented Dec 16, 2019

The usage instructions are missing some information on the feature preprocessing step.

  • It would be helpful to give more exact instructions on how to extract spectrograms and phoneme features. Does the code expect phoneme .lab files from Festival, as indicated in r9y9's deepvoice code? https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess
  • Is it possible to use the deepvoice code to get the exact features expected by nonparaSeq2SeqVC? If so, which scripts are needed?

I'm going to try out the code now, and I'll send PRs on documentation when I'm confident I can add something.

Thanks!

@JRMeyer
Contributor Author

JRMeyer commented Dec 17, 2019

@jxzhanggg
Owner

jxzhanggg commented Dec 18, 2019

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

@JRMeyer
Contributor Author

JRMeyer commented Dec 18, 2019

Update:

It seems for VCTK, the code needed for preprocessing is here: https://github.com/r9y9/deepvoice3_pytorch/tree/master/vctk_preprocess

Original Reply

How did you generate the training list?

As suggested in the README, I downloaded VCTK and installed deepvoice from @r9y9, and ran the following preprocessing script:

python preprocess.py --preset=presets/deepvoice3_vctk.json vctk ~/vctk/DS_10283_2651/VCTK-Corpus/ ~/vctk/processed

This seemed to finish successfully, and now I have a dir with a train.txt file and many vctk-mel-*.npy and vctk-spec-*.npy files:

~/vctk/processed$ ls | head
train.txt
vctk-mel-00001.npy
vctk-mel-00002.npy
vctk-mel-00003.npy
vctk-mel-00004.npy
vctk-mel-00005.npy
vctk-mel-00006.npy
vctk-mel-00007.npy
vctk-mel-00008.npy
vctk-mel-00009.npy

However, I don't see any *.list files. How can I generate such files?

@jxzhanggg
Owner

jxzhanggg commented Dec 19, 2019

You can write a script that walks through all the training samples to generate the list files.
The list file format is flexible, as long as the data reader is modified to fit your list files and prepared training data.
Part of my code looks like this:

import os
import random

SPEC_DIR = '/home/jxzhang/Documents/DataSets/VCTK/spec'

train_list = []
eval_list = []
test_list = []

for speaker in seen_speakers:  # seen_speakers: list of speaker IDs, e.g. ['p225', ...]
    speaker_dir = os.path.join(SPEC_DIR, speaker)
    files = [os.path.join(speaker_dir, fn) for fn in os.listdir(speaker_dir)]
    random.shuffle(files)
    test_files = files[:20]    # 20 utterances per speaker for testing
    eval_files = files[20:30]  # the next 10 for evaluation
    train_files = files[30:]   # the rest for training

    train_list.extend(train_files)
    eval_list.extend(eval_files)
    test_list.extend(test_files)
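
The snippet above only collects the paths; a small follow-up step (not included in the repo) still has to write the spectrogram_path acoustic_frame_number phone_number lines. A minimal sketch, assuming the spectrograms are saved as (n_bins, n_frames) arrays and that a matching .lab file holds each utterance's phoneme labels (both assumptions, not the author's script):

import numpy as np

def write_list(list_path, spec_paths):
    # one 'spectrogram_path acoustic_frame_number phone_number' line per utterance
    with open(list_path, 'w') as f:
        for spec_path in spec_paths:
            n_frames = np.load(spec_path).shape[1]        # assumes (n_bins, n_frames)
            lab_path = spec_path.replace('.npy', '.lab')  # hypothetical naming scheme
            with open(lab_path) as lab:
                n_phones = sum(1 for line in lab if line.strip())
            f.write('%s %d %d\n' % (spec_path, n_frames, n_phones))

write_list('train.list', train_list)
write_list('eval.list', eval_list)
write_list('test.list', test_list)

As noted further down in the thread, the two counts are not actually consumed during training, so exact agreement matters less than the paths.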

@jxzhanggg
Owner

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.
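
A minimal sketch of that computation (not the author's script), assuming log-Mel features stored as (n_mels, n_frames) .npy arrays under a per-speaker directory layout; the glob pattern and output filename are assumptions:

import glob
import numpy as np

# accumulate running sums rather than concatenating everything, so memory stays bounded
total, total_sq, n = 0.0, 0.0, 0
for path in glob.glob('/path/to/VCTK/spec/*/log-mel-*.npy'):
    mel = np.load(path)                    # (n_mels, n_frames)
    total += mel.sum(axis=1)
    total_sq += (mel ** 2).sum(axis=1)
    n += mel.shape[1]

mean = total / n
std = np.sqrt(total_sq / n - mean ** 2)    # E[x^2] - E[x]^2, per Mel bin
np.save('mel_mean_std.npy', np.stack([mean, std]))  # the reader indexes [0] and [1]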

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

Thanks for explaining the formation of the *.list files, but I'm still confused about how to generate the phoneme alignments, which are referenced as acoustic_frame_number and phone_number in the *.list files.

I understand how HMM phoneme forced-alignment works, and I understand how audio frames work (I've used HTK, Kaldi, and other speech toolkits). However, after spending a few hours on this today, I still have not succeeded in generating the alignments in the needed format. Here's what I've done:

  • Installed HTK / Merlin / Edinburgh Speech Tools / deepvoice3
  • Set all environment variables (e.g. ESTDIR)
  • Downloaded VCTK
  • Ran the deepvoice preprocessing on VCTK with vctk_preprocess/extract.py

I performed all these steps, but I'm still not able to generate the phone alignments. I would like to replicate the alignment the way you did it (with deepvoice and CSTR tools), instead of using Kaldi or something else.

Do you have preprocessing scripts you can share? I think this would be very good for your project.

@jxzhanggg
Owner

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax
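
Since only the phoneme sequence is needed for training, stripping the timing columns from such a label file takes a few lines (a minimal sketch; the times above are HTK-style 100-ns units, and the example filename is hypothetical):

def read_phoneme_sequence(lab_path):
    # return just the phonemes from an HTK-style 'start end phone' label file
    phones = []
    with open(lab_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) == 3:
                phones.append(parts[2])
    return phones

# e.g. read_phoneme_sequence('p225_001.lab') -> ['p', 'l', 'iy', 'z', ...]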

@jxzhanggg
Owner

Also, the number of acoustic frames and the number of phones in an utterance are computed by another little program. Although I put these two numbers in my list file, I didn't actually use them during training, as you can see from my data reader.

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.

Why did you do this? Did you find it helped?

mel = (mel - self.mel_mean_std[0])/ self.mel_mean_std[1]

@JRMeyer
Contributor Author

JRMeyer commented Dec 19, 2019

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax

It seems that phoneme alignments are used in this for-loop:

https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/c977fe9856f4b1d2b494eea4baa320d3e6ecf5c6/pre-train/reader/reader.py#L79_L83

Can you explain this some more, please? Thanks!

@JRMeyer changed the title from "Documentation: Feature Processing with DeepVoice3" to "Documentation: Audio + Text Feature Extraction" on Dec 19, 2019
@jxzhanggg
Owner

Alignments are not necessary. Although I extracted the alignments, I didn't actually use them during training.
Therefore, all you need for preparing the training data is the phoneme sequences.
Anyway, I used https://github.com/r9y9/deepvoice3_pytorch/blob/master/vctk_preprocess/prepare_vctk_labels.py to get my phoneme labels. Configuring all the paths and environments should get you segmented phoneme labels like:

0 750000 p
750000 1050000 l
1050000 3150000 iy
3150000 5600000 z
5600000 5750000 k
5750000 8450000 ao
8450000 8600000 l
8600000 9950000 s
9950000 10249999 t
10249999 12950000 eh
12950000 13200000 l
13200000 14750000 ax

It seems that phoneme alignments are used in this for-loop:

https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/c977fe9856f4b1d2b494eea4baa320d3e6ecf5c6/pre-train/reader/reader.py#L79_L83

Can you explain this some more, please? Thanks!

Yes, these lines do use the alignment to generate text alignments. However, during batching, as you can see from the TextMelIDCollate() class, the text targets and text ranks are not used.
TextMelIDLoader() prepares the text alignments, but TextMelIDCollate() drops them. I think these lines could be deleted to avoid confusion.

@jxzhanggg
Owner

jxzhanggg commented Dec 20, 2019

See commit ab1f8d4, which cleans up the text alignments.

@jxzhanggg
Owner

jxzhanggg commented Dec 20, 2019

As for generating the mean_std file: that is the file for mean and standard-deviation normalization. You should go through the extracted features and calculate their mean and standard deviation.

Why did you do this? Did you find it helped?

mel = (mel - self.mel_mean_std[0])/ self.mel_mean_std[1]

This line performs feature mean and standard-deviation normalization, i.e.
$$\hat{x} = (x - \mu) / \sigma$$
The normalization brings log-Mel spectrograms and linear spectrograms into a reasonable value range. In my experience, feature normalization is a useful and common way to preprocess a model's inputs.
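
(Note that at synthesis time the inverse mapping, $x = \hat{x}\,\sigma + \mu$, presumably has to be applied to the model's predictions before vocoding.)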

@JRMeyer
Contributor Author

JRMeyer commented Dec 20, 2019

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Does phone_number in this case refer to the number of phones in the text transcript?

@JRMeyer
Contributor Author

JRMeyer commented Dec 20, 2019

The paper reports using only log-scaled Mel-spectrograms as audio features, but reader.py expects both Mel-spectrograms and linear spectrograms... can we remove the linear spectrograms from reader.py?

@jxzhanggg
Owner

jxzhanggg commented Dec 21, 2019

The paper reports using only log-scaled Mel-spectrograms as audio features, but reader.py expects both Mel-spectrograms and linear spectrograms... can we remove the linear spectrograms from reader.py?

I think it's okay to keep the linear spectrograms in case you want to use a Griffin-Lim vocoder:
linear spectrograms recover higher-quality audio when Griffin-Lim is used.
If a WaveNet vocoder is used, Mel-spectrograms are the more suitable output features.
Whether to predict linear spectrograms is therefore a configuration option in hparams;
see hparams.predict_spectrogram.
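
For concreteness, here is a minimal sketch of inverting a saved log-linear spectrogram with librosa's Griffin-Lim. The STFT parameters mirror the extraction settings quoted later in this thread (hop_length=200, win_length=800 at 16 kHz) and are assumptions, not repo configuration:

import numpy as np
import librosa
import soundfile as sf

log_spec = np.load('log-spec-p225_090.npy')  # assumed shape: (1 + n_fft//2, n_frames)
magnitude = np.exp(log_spec)                 # undo the log scaling
audio = librosa.griffinlim(magnitude, hop_length=200, win_length=800)
sf.write('reconstructed.wav', audio, 16000)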

@jxzhanggg
Owner

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Does phone_number in this case refer to the number of phones in the text transcript?

Yes, it is.

@huukim136

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.

Thanks in advance!

@jxzhanggg
Owner

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.

Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.
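
A quick way to cover steps 3 and 4 is to verify that every listed spectrogram loads and normalizes to finite values before launching training. A minimal sketch, assuming the mel_mean_std.npy layout from the normalization sketch earlier in this thread:

import numpy as np

mean, std = np.load('mel_mean_std.npy')              # assumed: stacked [mean, std], each (n_mels,)
with open('train.list') as f:
    for line in f:
        spec_path = line.split()[0]
        mel = np.load(spec_path)                     # (n_mels, n_frames)
        norm = (mel - mean[:, None]) / std[:, None]  # broadcast over frames
        assert np.isfinite(norm).all(), spec_path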

@huukim136

huukim136 commented Jan 8, 2020

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.
Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.

Thanks for your support, I think I'm on the right track.
But there's one more thing I'm confused about: why do we need to extract the log-spec features at all? In your code you set the parameter predict_spectrogram to False, so the log-spec features are never used (that's my understanding; maybe I'm wrong somewhere).
Thank you.

@jxzhanggg
Owner

jxzhanggg commented Jan 8, 2020

Could you please give us more information and code about feature extraction (Mel spectrograms and phonemes)? It's been a few days and I still cannot generate the proper features to match your code.
Thanks in advance!

Basically, you should follow these steps:

  1. Extract log-Mel spectrograms, log-linear spectrograms, and phonemes.
  2. Calculate the mean and standard deviation of the features and store them.
  3. Modify the data reader to read and normalize the log-Mel and log-linear spectrograms, and to read the phonemes and speaker identity.
  4. Test your data reader to make sure it reads the data properly.
  5. Run the pre-train & fine-tune code.

If you run into more specific difficulties, please share more details.

Thanks for your support, I think I'm on the right track.
But there's one more thing I'm confused about: why do we need to extract the log-spec features at all? In your code you set the parameter predict_spectrogram to False, so the log-spec features are never used (that's my understanding; maybe I'm wrong somewhere).
Thank you.

Yes, you're right. If you don't want to predict the linear spectrogram, the linear-spectrogram features are not necessary.

@JRMeyer
Contributor Author

JRMeyer commented Jan 11, 2020

@huukim136 - for feature extraction, I've been using this: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/master/pre-train/reader/extract_features.py

@huukim136

@huukim136 - for feature extraction, I've been using this: https://github.com/jxzhanggg/nonparaSeq2seqVC_code/blob/master/pre-train/reader/extract_features.py

@JRMeyer thank you so much. You saved my day!

@Pydataman

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Did you generate these train and dev list files yourself? This repo doesn't include that code.

@jxzhanggg
Owner

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

Did you generate these train and dev list files yourself? This repo doesn't include that code.

Yes, I generated the list files myself to collect the training samples together.

@JeffC0628

JeffC0628 commented Dec 9, 2020

These are my prepared training lists.
Each line looks like this:
spectrogram_path acoustic_frame_number phone_number
For example:
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_090.npy 135 22
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_118.npy 145 23
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_014.npy 365 52
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_179.npy 103 11
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_309.npy 57 7
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_353.npy 142 24
/home/jxzhang/Documents/DataSets/VCTK/spec/p225/log-spec-p225_012.npy 310 49
...

How did you get the acoustic_frame_number (e.g. 135, 145, 365, 103, 57, ...)? I used the extract_mel_spec function in extract_features.py, but I got different frame counts, although the phone_number values are the same. Here is my result:

/VCTK/spec/p225/log-spec-p225_090.npy 346 22
/VCTK/spec/p225/log-spec-p225_118.npy 315 23
/VCTK/spec/p225/log-spec-p225_014.npy 533 52
/VCTK/spec/p225/log-spec-p225_179.npy 250 11

and my parameters are:
import numpy as np
import librosa

y, sample_rate = librosa.load(filename, sr=16000)
spec = librosa.core.stft(y=y, n_fft=2048, hop_length=200, win_length=800,
                         window='hann', center=True, pad_mode='reflect')
spec = librosa.magphase(spec)[0]
log_spectrogram = np.log(spec).astype(np.float32)
mel_spectrogram = librosa.feature.melspectrogram(S=spec, sr=sample_rate, n_mels=80,
                                                 power=1.0, fmin=0.0, fmax=None,
                                                 htk=False, norm=1)
log_mel_spectrogram = np.log(mel_spectrogram).astype(np.float32)
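
For reference, with center=True the STFT frame count is 1 + len(y) // hop_length, so a mismatch like 346 vs. 135 frames usually traces back to a different sample rate, hop length, or a silence-trimming step in the original pipeline. A quick check against the settings above:

# expected frame count for a center=True STFT (a general librosa fact, not repo code)
expected_frames = 1 + len(y) // 200   # hop_length = 200 samples at 16 kHz (12.5 ms)
assert spec.shape[1] == expected_frames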
