fix bug for float duration src_audio#1421
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the AceStepPipeline to improve the handling of float durations by transitioning from floor division to rounding for frame calculations. It also introduces a logic change in the cover task to concatenate source latents when the source audio is shorter than the generated latents. Additionally, the default shift parameter in the documentation has been updated from 1.0 to 3.0. Feedback was provided to ensure consistency by using round() instead of int() across all frame conversions to prevent off-by-one errors, and to add a null check for source_latents to avoid a potential AttributeError.
| if src_latents.shape[1] > source_latents.shape[1]: | ||
| source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1) |
There was a problem hiding this comment.
If the cover task is initiated using audio_code_string instead of src_audio, source_latents will be None. Accessing .shape[1] on a None object will raise an AttributeError. A check should be added to ensure source_latents is not None before performing the shape comparison and concatenation.
| if src_latents.shape[1] > source_latents.shape[1]: | |
| source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1) | |
| if source_latents is not None and src_latents.shape[1] > source_latents.shape[1]: | |
| source_latents = torch.cat([source_latents, src_latents[:, source_latents.shape[1]:]], dim=1) |
| padding_frames_left = int(pad_left * pipe.vae.sampling_rate) | ||
| padding_frames_right = max(int(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0) |
There was a problem hiding this comment.
Using int() for converting seconds to audio frames can lead to off-by-one errors due to floating-point precision (e.g., 0.1 * 48000 might result in 4799.999999999999 which int() floors to 4799). Since this PR aims to fix float duration issues, using round() is more appropriate and consistent with the changes made elsewhere in this file.
| padding_frames_left = int(pad_left * pipe.vae.sampling_rate) | |
| padding_frames_right = max(int(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0) | |
| padding_frames_left = round(pad_left * pipe.vae.sampling_rate) | |
| padding_frames_right = max(round(max_right * pipe.vae.sampling_rate) - src_audio.shape[-1], 0) |
| start_frame = int(start * pipe.vae.sampling_rate / 1920) | ||
| end_frame = int(end * pipe.vae.sampling_rate / 1920) |
There was a problem hiding this comment.
For consistency with the calculation of max_latent_length (line 513), these frame indices should be calculated using round() instead of int(). This ensures that float durations are mapped to the same latent frames across different parts of the pipeline.
| start_frame = int(start * pipe.vae.sampling_rate / 1920) | |
| end_frame = int(end * pipe.vae.sampling_rate / 1920) | |
| start_frame = round(start * pipe.vae.sampling_rate / 1920) | |
| end_frame = round(end * pipe.vae.sampling_rate / 1920) |
| pad_left_frames = int(padding_frames_left / 1920) | ||
| pad_right_frames = int(padding_frames_right / 1920) |
There was a problem hiding this comment.
Using round() here is more consistent with the mapping of audio duration to latent frames used in the rest of the pipeline. This helps avoid potential artifacts at the boundaries when dealing with float durations.
| pad_left_frames = int(padding_frames_left / 1920) | |
| pad_right_frames = int(padding_frames_right / 1920) | |
| pad_left_frames = round(padding_frames_left / 1920) | |
| pad_right_frames = round(padding_frames_right / 1920) |
No description provided.