Add VitsSVC implementation #14

Merged (18 commits) on Dec 8, 2023
Conversation

viewfinder-annn (Collaborator):
Add an implementation of a VITS-based model for the Singing Voice Conversion (SVC) task.

@RMSnow (Collaborator) left a comment:

Two additional general comments:

  1. Add Amphion's copyright header to all the newly added files.
  2. Add comments (descriptions and detailed instructions) to the key functions.

config/vitssvc.json — review thread (resolved)
Collaborator:
Is this file the same as egs/_template/run.sh? If so, you could create a soft link to it, which would make future maintenance easier.

viewfinder-annn (Collaborator, Author):

Actually not 🥲 The VITS module needs a special Cython module initialization that the common run.sh for the SVC task does not perform:
https://github.com/viewfinder-annn/AmphionPublic/blob/main/egs/svc/VitsSVC/run.sh#L9

Would it be feasible to split the Cython module initialization into a separate .sh file? In that case we could point to it in the README and reuse egs/_template/run.sh.
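As a sketch, the split proposed here might look like the following standalone script. The file name `build_monotonic_align.sh` and the exact directory layout are assumptions for illustration; the in-place Cython build is the standard `setup.py build_ext --inplace` step that VITS-style repos use for monotonic_align.

```shell
#!/bin/bash
# Hypothetical helper script, e.g. egs/svc/VitsSVC/build_monotonic_align.sh
# (name assumed), kept separate so the common egs/_template/run.sh can be
# reused unchanged by the VitsSVC recipe.
work_dir=$(pwd)
ma_dir="$work_dir/modules/monotonic_align"

if [ -d "$ma_dir" ]; then
    # Compile the Cython extension in place, as VITS requires.
    cd "$ma_dir"
    python setup.py build_ext --inplace
    cd "$work_dir"
else
    echo "monotonic_align not found under $ma_dir; skipping Cython build"
fi
```

The VitsSVC README could then instruct users to run this script once before launching the shared run.sh.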

models/svc/base/svc_dataset.py — review thread (resolved)
models/svc/vits/vits.py — review thread (resolved)
@RMSnow (Collaborator) left a comment:

Use Black for formatting your code. See this blog.

models/svc/vits/vits.py — review thread (resolved)

return z, m, logs, x_mask

class SynthesizerTrn(nn.Module):
Collaborator:

Is this class the same as

class SynthesizerTrn(nn.Module):

If so, we need to merge them into a common one.

viewfinder-annn (Collaborator, Author):

No, the internal encoding is different: one encodes text while the other encodes acoustic conditions, which affects the forward/infer functions as well. So in my opinion they cannot be merged.

Additional review threads on models/svc/vits/vits.py and models/svc/vits/vits_trainer.py (resolved)
.gitignore — review thread

Collaborator:

Is this file necessary?

viewfinder-annn (Collaborator, Author):

Yes. As discussed before, the whisper extractor needs modules/whisper_extractor/assets/mel_filters.npz to extract features properly, but that file is excluded by the blanket *.npz ignore pattern. Adding this line re-includes the file so it stays under git version control.
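As a sketch, the relevant .gitignore lines would look something like this (the surrounding rules in Amphion's actual .gitignore are assumed; note that a `!` re-include must appear after the broad pattern it overrides, and cannot resurrect a file whose parent directory is itself ignored):

```
# Ignore all NumPy archives by default...
*.npz
# ...but keep the mel filter bank that the whisper extractor depends on.
!modules/whisper_extractor/assets/mel_filters.npz
```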

@zhizhengwu (Collaborator):
Any audio samples to support this PR?

viewfinder-annn (Collaborator, Author) commented Dec 8, 2023:

> @zhizhengwu: Any audio samples to support this PR?

Here are two samples converted from the M4Singer dataset to the Opencpop target singer, alongside the original samples and the SoVits4.1 model's outputs.
The VitsSVC model uses ContentVec and Whisper features with HiFi-GAN as the generator, and is trained from scratch for 110k steps.
The SoVits4.1 model uses Whisper features with NSF-HiFiGAN as the generator, and is fine-tuned for 110k steps on a base model pretrained for 330k steps.

|  | Tenor-3 → opencpop_female1 | Alto-6 → opencpop_female1 |
| --- | --- | --- |
| Original | Tenor-3 | Alto-6 |
| SoVits 4.1 | Tenor-3_SoVits4.1_opencpop_female1 | Alto-6_SoVits4.1_opencpop_female1 |
| VitsSVC | Tenor-3_VitsSVC_opencpop_female1 | Alto-6_VitsSVC_opencpop_female1 |

egs/svc/VitsSVC/run.sh — review thread (resolved)
export PYTHONIOENCODING=UTF-8

# monotonic_align
cd $work_dir/modules/monotonic_align
Collaborator:

Modify modules.vits accordingly, and let @lmxue know about it.

@RMSnow RMSnow merged commit 554b791 into open-mmlab:main Dec 8, 2023