Keeping prosodic features of reference Speaker #41

Open
jucasansao opened this issue Nov 2, 2020 · 4 comments

Comments

@jucasansao

Hi @jxzhanggg,

I am trying to achieve voice conversion with this algorithm applied to prosody transfer. That is, I want to convert a reference audio (Speaker A) to the voice of a user (Speaker B), while maintaining the original phone durations and the pitch contour (with a different mean f0) of the reference speaker (Speaker A).
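
To be concrete about what I mean by "same contour, different mean f0": a common baseline outside this repo is log-domain mean/variance normalization of F0, which keeps Speaker A's contour shape but moves it to Speaker B's F0 statistics. A rough sketch (the file names are placeholders, and using pyworld for F0 extraction is just my assumption, not something this repo does):

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def extract_f0(path):
    """Frame-level F0 from a mono wav, via pyworld's DIO + StoneMask."""
    x, fs = sf.read(path)
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)
    return pw.stonemask(x, f0, t, fs)

f0_src = extract_f0("speaker_a.wav")  # reference (Speaker A)
f0_tgt = extract_f0("speaker_b.wav")  # target voice (Speaker B)

# Statistics over voiced frames only (f0 > 0), in the log domain
log_src = np.log(f0_src[f0_src > 0])
log_tgt = np.log(f0_tgt[f0_tgt > 0])

# Keep A's contour shape, but move it to B's F0 mean and variance
f0_out = np.zeros_like(f0_src)
voiced = f0_src > 0
f0_out[voiced] = np.exp(
    (np.log(f0_src[voiced]) - log_src.mean()) / log_src.std()
    * log_tgt.std() + log_tgt.mean()
)
```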

Right now I have managed to pre-train and fine-tune the model, and the voice conversion works well: the output is very similar to the target. However, all the prosodic features from the reference were lost.

Do you have any idea where I may need to tweak things to achieve this, even at a slight cost in audio quality? Did you ever attempt this, or do you have an idea of which parameters need to be changed?

Thanks in advance!

Pedro Sousa

@KunZhou9646

Hi Pedro,

I recently read an Interspeech 2020 paper which aims to transfer the source style. I think it might be helpful for you.
https://www.isca-speech.org/archive/Interspeech_2020/pdfs/2412.pdf

Regards,
Kun

@jucasansao
Author

Thanks @KunZhou9646

Yeah, it is definitely useful.

I was trying to find a way to keep using this algorithm, and I noticed that if the fine-tuning is done with a small number of audio samples from the target speaker, the output does keep some prosodic aspects from the reference, so I was wondering if something else could be done in this code. But I'll definitely take a look at the paper you sent.
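
For example, one thing I was considering (just a sketch in PyTorch with hypothetical module names, not an option this repo exposes, and only a guess at where speaker identity vs. prosody live in the model): freeze everything except the speaker-related weights during fine-tuning, so the parts that learned prosody during pre-training stay fixed:

```python
import torch
import torch.nn as nn

# Toy stand-in for the repo's model: the real attribute names differ,
# so "speaker_encoder" etc. are hypothetical placeholders.
class ToyVCModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Linear(80, 256)    # linguistic/prosodic path
        self.speaker_encoder = nn.Linear(80, 64)  # speaker identity path
        self.decoder = nn.Linear(256 + 64, 80)

model = ToyVCModel()

# Update only speaker-related weights; keep everything else frozen so the
# prosody learned during pre-training is preserved.
for name, param in model.named_parameters():
    param.requires_grad = "speaker" in name

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,  # hypothetical fine-tuning learning rate
)
```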

Thanks,
Pedro Sousa

@zhuxiaoxuhit

@jucasansao Hi, did you try the algorithm and manage to transfer the prosodic features of the reference speaker well? I'm now trying to do this work the same way.

@KunZhou9646

Hi, I did it for emotion style transfer, using a strategy of pre-training & adaptation with this repo. I published an Interspeech 2021 paper based on my results. You can find it here: https://arxiv.org/abs/2103.16809
