add core code of valle #4

Merged: 11 commits, Dec 2, 2023

@lmxue (Collaborator) commented Dec 1, 2023

Vall-E is a zero-shot TTS architecture that uses a neural codec language model with discrete codes. This PR adds support for Vall-E in Amphion. It touches the following paths:

  • bins/tts
  • config/valle.json
  • egs/tts/VALLE
  • models/tts/valle

@RMSnow (Collaborator) left a comment

The comments below can be addressed in a follow-up improvement.

Collaborator comment:

Replace the complex name with an easy-to-understand one.


from utils.tokenizer import G2PModule, tokenize_text
from utils.symbol_table import SymbolTable
from text.g2p import preprocess_english, read_lexicon

"""
Extractor for content features

Collaborator comment:

Add comments to the extract_phoneme-related code.

@@ -539,3 +541,49 @@ def extract_utt_content_features_dataloader(cfg, metadata, num_workers):
)
for index, utt in enumerate(_metadata):
    extractor.save_feature(utt, batch_content_features[index])

if cfg.preprocess.extract_phoneme:

Collaborator comment:

The current code unnecessarily entangles SVC and TTS. Move lines 545-589 into a new function.
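
A minimal sketch of the suggested split, not the code from this PR. It assumes the phoneme step can run after the shared extraction loop; `extract_phonemes`, `extract_content_features`, the `"Text"` metadata key, and the whitespace "G2P" stand-in are all hypothetical. Only `extract_utt_content_features_dataloader` and `cfg.preprocess.extract_phoneme` come from the diff above.

```python
from types import SimpleNamespace


def extract_phonemes(cfg, metadata):
    """Hypothetical TTS-only step. A real version would run G2P per utterance;
    here each text is lower-cased and split on whitespace as a stand-in."""
    return [utt["Text"].lower().split() for utt in metadata]


def extract_utt_content_features_dataloader(cfg, metadata, num_workers):
    """Placeholder for the existing shared extraction loop (used by SVC)."""
    return {"num_utts": len(metadata)}


def extract_content_features(cfg, metadata, num_workers=1):
    """Hypothetical dispatcher: the shared path always runs, and the
    phoneme step runs only when the config enables it."""
    result = extract_utt_content_features_dataloader(cfg, metadata, num_workers)
    if cfg.preprocess.extract_phoneme:
        result["phonemes"] = extract_phonemes(cfg, metadata)
    return result


cfg = SimpleNamespace(preprocess=SimpleNamespace(extract_phoneme=True))
features = extract_content_features(cfg, [{"Text": "Hello world"}])
```

With this layout the SVC path never touches phoneme code, and the TTS-only logic can be tested on its own.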

@lmxue force-pushed the vall_dev1 branch 2 times, most recently from 02bff49 to d333659, on December 1, 2023 18:56
print("args: ", args)

parser = build_parser()
VALLEInference.add_arguments(parser)

Collaborator comment:

It looks like VALLEInference is used regardless of the model type.
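
One possible way to address this, sketched under assumptions: register model-specific arguments only after the model type is known. The `--model_type` flag, the `INFERENCE_CLASSES` registry, the `--top_k` option, and `parse_args` are hypothetical names for illustration; `VALLEInference` here is a stand-in for the class in the PR, not its actual implementation.

```python
import argparse


class VALLEInference:
    """Stand-in for the class named in the PR; only the argument hook is sketched."""

    @staticmethod
    def add_arguments(parser):
        # Hypothetical VALL-E-specific option, for illustration only.
        parser.add_argument("--top_k", type=int, default=1)


# Hypothetical registry mapping a model type to its inference class.
INFERENCE_CLASSES = {"VALLE": VALLEInference}


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_type", default="VALLE")  # hypothetical flag
    parser.add_argument("--text", help="Text to be synthesized")
    return parser


def parse_args(argv=None):
    parser = build_parser()
    # First pass: read the model type and ignore flags not registered yet.
    known, _ = parser.parse_known_args(argv)
    inference_cls = INFERENCE_CLASSES.get(known.model_type)
    if inference_cls is not None:
        # Only the selected model contributes its own arguments.
        inference_cls.add_arguments(parser)
    return parser.parse_args(argv)


args = parse_args(["--model_type", "VALLE", "--top_k", "5", "--text", "hi"])
```

The two-pass parse keeps the common flags in `build_parser` and lets each model class own its extra options.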

if 'test' not in types:
    types.append('test')
if "eval" in dataset:
    types = ["test"]

Collaborator comment:

This repeats lines 32-39.

metadata = []
for dataset_type in types:
    dataset_output = os.path.join(output_path, dataset)
    # dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))
    dataset_file = os.path.join(dataset_output, "{}.json".format(dataset_type))

Collaborator comment:

Does this duplicate line 78?

@@ -77,12 +93,13 @@ def main():
new_datasets_list.extend(filter(None, new_datasets))
cfg.dataset.extend(new_datasets_list)

# CUDA settings
# # CUDA settings

Collaborator comment:

No need to add one more '#'

@zhizhengwu (Collaborator) commented

We should provide demos/samples in a PR

config/base.json

Collaborator comment:

Move all TTS-related configs into a TTS-specific base config JSON.
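
A rough illustration of the layering this implies, written in Python rather than JSON so it can be run. The `merge_configs` helper and the `sample_rate` entry are hypothetical; `extract_phoneme` and `extract_acoustic_token` are the preprocess flags that appear in this PR's diff. How Amphion actually resolves config inheritance is not shown in this conversation.

```python
import copy


def merge_configs(base, override):
    """Recursively overlay `override` on `base` without mutating either."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Task-agnostic settings (hypothetical contents).
base_cfg = {"preprocess": {"sample_rate": 24000}}

# TTS-only settings, kept out of the shared base.
tts_base_cfg = {
    "preprocess": {"extract_phoneme": True, "extract_acoustic_token": True}
}

tts_cfg = merge_configs(base_cfg, tts_base_cfg)
```

Other tasks that load only `base_cfg` then never see the TTS flags.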

@@ -211,6 +212,11 @@ def __extract_utt_acoustic_features(dataset_output, cfg, utt):
label = audio_to_label(wav, cfg.preprocess.bits)
save_feature(dataset_output, cfg.preprocess.label_dir, uid, label)

if cfg.preprocess.extract_acoustic_token:

Collaborator comment:

Do not modify __extract_utt_acoustic_features anymore; it is no longer a common extraction pipeline. See extract_utt_acoustic_features_tts (line 221) and extract_utt_acoustic_features_vocoder (line 233) for reference. Please move all TTS acoustic feature extraction into the function at line 221.

@@ -75,9 +73,9 @@ def build_parser():
)
parser.add_argument(
"--text",
help="Text to be synthesized",
help="Text",

Collaborator comment:

'Text to be synthesized' is more informative than 'Text'

@RMSnow merged commit eea8473 into open-mmlab:main on Dec 2, 2023
@RMSnow mentioned this pull request on Dec 5, 2023

3 participants