The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
run-slow: xcodec2 |
This comment contains models: ["models/xcodec2"] |
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, dac, higgs_audio_v2_tokenizer, pe_audio, qwen2_5_omni, seamless_m4t, wav2vec2_bert, xcodec, xcodec2 |
ebezzam left a comment
@eustlb a self-review for X-Codec2!
Main things:
- Unique feature extraction for DAC-like and SeamlessM4T-like input processing, as the model needs both padded audio and spectrogram inputs.
- New type of components in modular: `Xcodec2FiniteScalarQuantization` and `Xcodec2ISTFTHead` (similar to what we saw in the Vocos PR)
- Small tweaks/fixes for models that Xcodec2 depended on for modular
Draft model page: https://huggingface.co/bezzam/xcodec2
```python
main_input_name = "input_features"
input_modalities = "audio"
supports_gradient_checkpointing = True
_no_split_modules = ["Wav2Vec2BertEncoderLayer"]
```
To allow loading with `device_map="auto"`.
```python
@torch.no_grad()
def _init_weights(self, module):
    """Initialize the weights"""
    super()._init_weights(module)
```
XCodec2 uses a pretrained checkpoint of Wav2Vec2-BERT, but Xcodec2's test `test_can_init_all_missing_weights` was failing because `Embedding` wasn't initialized. We can rely on the base `_init_weights` and also remove some initialization below.
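The pattern above can be sketched with a toy hierarchy (class names here are illustrative, not the actual Xcodec2 classes): the subclass defers to the parent's generic initializer, which already covers modules like `nn.Embedding`, so per-module branches don't need to be duplicated.

```python
import torch
import torch.nn as nn

# Toy sketch of delegating weight init to a base class (hypothetical names).
class BaseModel:
    def _init_weights(self, module):
        # Generic handling in the base class, e.g. embeddings.
        if isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=0.02)

class Xcodec2Like(BaseModel):
    @torch.no_grad()
    def _init_weights(self, module):
        super()._init_weights(module)  # covers nn.Embedding too
        # only model-specific extras would go here

emb = nn.Embedding(10, 4)
before = emb.weight.clone()
Xcodec2Like()._init_weights(emb)
```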
```python
self.norm1 = nn.GroupNorm(num_groups=32, num_channels=config.hidden_size, eps=1e-6, affine=True)
self.activation1 = nn.SiLU()
self.conv1 = nn.Conv1d(config.hidden_size, config.hidden_size, kernel_size=3, stride=1, padding=1)
```
Similar to `PeAudioVideoConvBlock1d`, but with slight differences that prevent a direct modular mapping here?
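For reference, a self-contained sketch of the block these layers form (hypothetical class name, `hidden_size` fixed for the example): GroupNorm, then SiLU, then a `kernel_size=3` conv whose `padding=1` keeps the time dimension unchanged.

```python
import torch
import torch.nn as nn

# Minimal sketch of the norm -> activation -> conv block (hypothetical name).
class ConvBlock1d(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        # hidden_size must be divisible by num_groups=32
        self.norm1 = nn.GroupNorm(num_groups=32, num_channels=hidden_size, eps=1e-6, affine=True)
        self.activation1 = nn.SiLU()
        # padding=1 with kernel_size=3, stride=1 preserves sequence length
        self.conv1 = nn.Conv1d(hidden_size, hidden_size, kernel_size=3, stride=1, padding=1)

    def forward(self, hidden_states):
        hidden_states = self.norm1(hidden_states)
        hidden_states = self.activation1(hidden_states)
        return self.conv1(hidden_states)

block = ConvBlock1d(hidden_size=64)
out = block(torch.randn(2, 64, 50))  # (batch, channels, time)
```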
```python
class SnakeBeta(SnakeBeta):
    pass


class AntiAliasedActivation1d(AntiAliasedActivation1d):
    pass
```
I thought just importing above would have been enough, but it wasn't generating the classes without this 🤔
```python
# Back to audio (ISTFT with "same" padding)
time_frames = torch.fft.irfft(spectrogram_complex, self.n_fft, dim=1, norm="backward")
time_frames = time_frames * self.window[None, :, None]
num_frames = spectrogram_complex.shape[-1]
output_size = (num_frames - 1) * self.hop_length + self.win_length
audio = F.fold(
    time_frames,
    output_size=(1, output_size),
    kernel_size=(1, self.win_length),
    stride=(1, self.hop_length),
)[:, 0, 0, self.padding : -self.padding]
```
`torch.istft` doesn't support the custom padding needed here for the integration tests to match the expected output.
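A toy sketch of why `F.fold` performs overlap-add here (toy sizes, windowing and normalization omitted): each column of the input is one time-domain frame, and `fold` sums the frames back at `hop_length` offsets, so overlapping regions accumulate.

```python
import torch
import torch.nn.functional as F

def overlap_add_fold(frames, win_length, hop_length):
    # frames: (batch, win_length, num_frames) -- windowed time-domain frames
    num_frames = frames.shape[-1]
    output_size = (num_frames - 1) * hop_length + win_length
    # fold treats each column as a (1, win_length) patch and sums overlaps
    return F.fold(
        frames,
        output_size=(1, output_size),
        kernel_size=(1, win_length),
        stride=(1, hop_length),
    )[:, 0, 0, :]  # (batch, output_size)

# All-ones frames with 50% overlap: interior samples are covered by two
# frames and sum to 2, the edges by one frame only.
frames = torch.ones(2, 8, 5)
audio = overlap_add_fold(frames, win_length=8, hop_length=4)
```

A real ISTFT additionally divides by the summed squared window envelope; the sketch only shows the overlap-add step that `F.fold` provides.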
```python
hidden_states = self.finite_scalar_quantization.bound(
    hidden_states
)  # For consistency with original checkpoint
quantized_out, indices = self.finite_scalar_quantization(hidden_states)
```
Calling `self.finite_scalar_quantization.bound` is a bit redundant, as it's also called within `self.finite_scalar_quantization(hidden_states)`, but the original modeling did it and it is needed to match the expected outputs.
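A toy sketch of that double call (simplified FSQ with illustrative names, ignoring the half-level offset real FSQ uses for even level counts): `quantize` applies `bound` internally, so the explicit `bound` beforehand squashes the values twice, which is what the original checkpoint's outputs depend on.

```python
import torch

# Simplified FSQ sketch -- names are illustrative, not the actual Xcodec2 API.
def bound(z, levels):
    # Squash each dim into [-(L-1)/2, (L-1)/2] via tanh
    half = (torch.tensor(levels, dtype=z.dtype) - 1) / 2
    return torch.tanh(z) * half

def quantize(z, levels):
    z = bound(z, levels)   # bound is applied again inside quantization
    codes = torch.round(z) # snap to the integer grid
    indices = (codes + (torch.tensor(levels) - 1) / 2).long()  # per-dim index
    return codes, indices

z = torch.tensor([[3.0, -3.0, 0.0]])
levels = [5, 5, 5]
# Calling bound first, then quantize, mirrors the order in the modeling code.
codes, indices = quantize(bound(z, levels), levels)
```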
```python
        return hidden_states + residual


class Xcodec2FiniteScalarQuantization(nn.Module):
```
```python
        return codes, indices


class Xcodec2ISTFTHead(nn.Module):
```
Similar to what we saw in the Vocos PR
What does this PR do?
Re-opening #37868
TODO
Original checkpoint: https://huggingface.co/HKUSTAudio/xcodec2
Original modeling code: https://huggingface.co/HKUSTAudio/xcodec2/blob/main/modeling_xcodec2.py