This is the official PyTorch implementation of the paper: Arbitrary Voice Conversion via Phoneme Attention
Demo: https://luckyluckyjl.github.io/TSDFVC-demo/
Abstract: Arbitrary voice conversion, also called zero-shot voice conversion, is a challenging task that involves transforming the voice of one speaker into that of another. Most existing solutions either compress the speaker information of an utterance into a fixed-length vector and fuse it directly with the deep content information without considering the underlying content, or adaptively normalize deep content features with the style to match their global statistics. To overcome this problem, we design a novel module, referred to as the Two Stride Style to Content Attention Net (TSCNet), which captures a time-varying speaking-style embedding using an attention mechanism. Considering both global statistics and local information, we propose the Two Scale Deep Fusion Voice Conversion (TSDF-VC) model for more similar and style-adaptive voice conversion. The code and pre-trained model are available at luckyluckyjl/TSDFVC.
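To illustrate the core idea of a time-varying style embedding, here is a minimal sketch of style-to-content cross-attention: each content frame (query) attends over the reference speaker's style frames (keys/values), producing a per-frame style vector instead of a single fixed-length one. This is a simplified illustration in numpy, not the paper's actual TSCNet implementation; the function name and shapes are assumptions for demonstration.

```python
import numpy as np

def style_to_content_attention(content, style):
    """Illustrative cross-attention: content frames attend over style frames.

    content: (T_c, d) array of content features (queries).
    style:   (T_s, d) array of style/reference features (keys and values).
    Returns a (T_c, d) time-varying style embedding, one vector per
    content frame, instead of a single fixed-length speaker vector.
    """
    d = content.shape[-1]
    # Scaled dot-product scores between every content and style frame.
    scores = content @ style.T / np.sqrt(d)          # (T_c, T_s)
    # Softmax over the style axis, shifted for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each content frame receives its own mixture of style frames.
    return weights @ style                           # (T_c, d)
```

Because the softmax weights differ per content frame, the resulting style embedding varies over time, which is the property the abstract contrasts against fixed-length speaker vectors.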