The main reason for L122 was that during training we used librosa.power_to_db, which applies the same clamp with its default argument top_db=80.0 (log_spec is in log10 units, so 8.0 corresponds to 80 dB), and we wanted to replicate this without having to depend on librosa.

L123 was to put the numbers roughly into [-1, 1], but as you noticed it doesn't strictly put them within [-1, 1]. In practice, this shouldn't matter much as long as we do the same between training and inference.

whisper/whisper/audio.py

Lines 122 to 123 in 4e635c6

log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
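
To make the range argument concrete, here is a minimal sketch of the same two steps in plain Python rather than torch. The helper name `normalize_log_mel`, the `1e-10` floor before the logarithm, and the sample power values are assumptions chosen for illustration, not code from the repository. It shows that the output always spans at most 2.0 units, and that it lands in [-1, 1] only when the peak of log_spec is 0.

```python
import math


def normalize_log_mel(power_values, dynamic_range=8.0):
    """Hypothetical helper mirroring the two quoted lines on a list of power values."""
    # log10 of the power; the 1e-10 floor (an assumption here) avoids log10(0).
    log_spec = [math.log10(max(p, 1e-10)) for p in power_values]

    # Clamp everything to at most `dynamic_range` log10 units below the peak.
    # Since dB = 10 * log10(power), 8.0 log10 units equal 80 dB, which matches
    # the top_db=80.0 default mentioned above.
    floor = max(log_spec) - dynamic_range
    log_spec = [max(v, floor) for v in log_spec]

    # Shift and scale: an interval of width 8.0 becomes an interval of width 2.0.
    return [(v + 4.0) / 4.0 for v in log_spec]


# Peak power 1.0 -> peak log10 value 0 -> output spans about [-1, 1].
quiet = normalize_log_mel([1.0, 1e-3, 1e-12])

# Peak power 100.0 -> peak log10 value 2 -> output is shifted up and exceeds 1.
loud = normalize_log_mel([100.0, 1e-3, 1e-12])

print(min(quiet), max(quiet))
print(min(loud), max(loud))
```

The second call illustrates the point made in the answer: the clamp bounds the width of the range, not its position, so a louder input shifts the whole interval upward.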

As for why that became the default argument for librosa.power_to_db, maybe @bmcfee (hi!!) can answer?

Answer selected by jongwook