Keeping prosodic features of reference Speaker #41

Open
jucasansao opened this issue Nov 2, 2020 · 4 comments

Comments

@jucasansao

Hi @jxzhanggg,

I am trying to achieve voice conversion with this algorithm applied to prosody transfer. That is, I want to convert a reference audio (Speaker A) to the voice of a user (Speaker B), while maintaining the original phone durations and the pitch contour (with a different mean f0) of the reference speaker (Speaker A).
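
To be concrete about what I mean by "same contour, different mean f0": a common baseline outside this repo is log-domain mean/variance normalization of F0, which keeps Speaker A's contour shape but moves it to Speaker B's F0 statistics. A rough sketch (the file names are placeholders, and using pyworld for F0 extraction is just my assumption, not something this repo does):

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_f0(path):
    """Frame-level F0 from a mono wav, via pyworld's DIO + StoneMask."""
    x, fs = sf.read(path)
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)
    return pw.stonemask(x, f0, t, fs)

f0_src = extract_f0("speaker_a.wav")  # reference (Speaker A)
f0_tgt = extract_f0("speaker_b.wav")  # target voice (Speaker B)

# Statistics over voiced frames only (f0 > 0), in the log domain
log_src = np.log(f0_src[f0_src > 0])
log_tgt = np.log(f0_tgt[f0_tgt > 0])

# Keep A's contour shape, but move it to B's F0 mean and variance
f0_out = np.zeros_like(f0_src)
voiced = f0_src > 0
f0_out[voiced] = np.exp(
    (np.log(f0_src[voiced]) - log_src.mean()) / log_src.std()
    * log_tgt.std() + log_tgt.mean()
)
```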

Right now I have managed to pre-train and fine-tune the model, and the voice conversion works well: the output is very similar to the target. However, all the prosodic features from the reference were lost.

Do you have any idea where I may need to tweak things to achieve this, even at a slight cost in audio quality? Did you ever attempt this, or do you have an idea of which parameters need to be changed?

Thanks in advance!

Pedro Sousa

@KunZhou9646

Hi Pedro,

I recently read an Interspeech 2020 paper which aims to transfer the source style. I think it might be helpful for you.
https://www.isca-speech.org/archive/Interspeech_2020/pdfs/2412.pdf

Regards,
Kun

@jucasansao
Author

Thanks @KunZhou9646

Yeah, it is definitely useful.

I was trying to find a way to keep using this algorithm, and I noticed that if the fine-tuning is done with a small number of audio samples from the target speaker, the output does keep some prosodic aspects from the reference, so I was wondering if something else could be done in this code. But I'll definitely take a look at the paper you sent.
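
For example, one thing I was considering (just a sketch in PyTorch with hypothetical module names, not an option this repo exposes, and only a guess at where speaker identity vs. prosody live in the model): freeze everything except the speaker-related weights during fine-tuning, so the parts that learned prosody during pre-training stay fixed:

```python
import torch
import torch.nn as nn

# Toy stand-in for the repo's model: the real attribute names differ,
# so "speaker_encoder" etc. are hypothetical placeholders.
class ToyVCModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(80, 256)    # linguistic/prosodic path
        self.speaker_encoder = nn.Linear(80, 64)  # speaker identity path
        self.decoder = nn.Linear(256 + 64, 80)

model = ToyVCModel()

# Update only speaker-related weights; keep everything else frozen so the
# prosody learned during pre-training is preserved.
for name, param in model.named_parameters():
    param.requires_grad = "speaker" in name

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # hypothetical fine-tuning learning rate
)
```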

Thanks,
Pedro Sousa

@zhuxiaoxuhit

@jucasansao Hi, did you try the algorithm and manage to transfer the prosodic features of the reference speaker well? I'm now trying to do this work the same way.

@KunZhou9646

Hi, I did it for emotion style transfer, using a strategy of pre-training & adaptation with this repo. I published an Interspeech 2021 paper based on my results. You can find it here: https://arxiv.org/abs/2103.16809
