Add MOSS-TTS-v1.5 (delay) and MOSS-Audio-Tokenizer #46447

Open
ted-mosi wants to merge 18 commits into huggingface:main from ted-mosi:add-mosstts-v1-5

ted-mosi commented Jun 5, 2026

What does this PR do?

This PR adds native Transformers support for MOSS-TTS-v1.5 (delay) and MOSS-Audio-Tokenizer.

Architecture notes

MOSS-TTS-v1.5 uses a delay-pattern TTS generation format with text tokens plus multiple audio VQ channels. The processor handles user/assistant message
construction, reference-audio tokenization, delay/de-delay conversion, and waveform decoding through the native MOSS audio tokenizer.
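
For reviewers unfamiliar with delay patterns, here is a minimal sketch of the delay/de-delay conversion. It assumes the common layout in which audio VQ channel k is shifted right by k steps. This PR's actual offsets, padding id, and tensor layout may differ, and the names `apply_delay`, `remove_delay`, and `PAD` are hypothetical, not the processor's API.

```python
from typing import List

PAD = -1  # hypothetical padding id, chosen for illustration only


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift channel k right by k steps.

    `codes` is laid out as [time][channel]; the result has
    len(codes) + num_channels - 1 rows, with `pad` in the gaps.
    """
    if not codes:
        return []
    num_steps, num_channels = len(codes), len(codes[0])
    out = [[pad] * num_channels for _ in range(num_steps + num_channels - 1)]
    for t, frame in enumerate(codes):
        for k, code in enumerate(frame):
            out[t + k][k] = code
    return out


def remove_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay: realign every channel to the same time step."""
    if not delayed:
        return []
    num_channels = len(delayed[0])
    num_steps = len(delayed) - (num_channels - 1)
    return [
        [delayed[t + k][k] for k in range(num_channels)]
        for t in range(num_steps)
    ]


# Two frames, three VQ channels.
codes = [[1, 10, 100], [2, 20, 200]]
delayed = apply_delay(codes)
restored = remove_delay(delayed)
```

In this sketch `delayed` has four rows, and `remove_delay(delayed)` returns the original two frames, which is the round trip the processor needs before handing codes to the audio tokenizer for waveform decoding.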

MOSS-Audio-Tokenizer is added as a native audio-tokenizer model so MOSS-TTS no longer depends on Hub remote code for audio tokenization.

Before submitting

AI assistance disclosure

AI assistance was used while drafting and testing parts of this PR. I reviewed the changed files, checked the implementation against existing Transformers
model patterns, and ran the relevant local tests listed below. I am responsible for the final code and for responding to review feedback.

Who can review?

@eustlb @ebezzam @vasqu

github-actions Bot commented Jun 5, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, moss_audio_tokenizer, moss_tts_delay
