I am using whisper to get segments for uncoverV2.mp4, but the end time of the final segment does not match the audio duration:
{'text': " Nobody sees, nobody knows We are a secret, can't be exposed That's how it is, that's how it goes", 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 4.5600000000000005, 'text': ' Nobody sees, nobody knows', 'tokens': [50364, 9297, 8194, 11, 5079, 3255, 50592, 50592, 492, 366, 257, 4054, 11, 393, 380, 312, 9495, 50864, 50864, 663, 311, 577, 309, 307, 11, 300, 311, 577, 309, 1709, 51140], 'temperature': 0.0, 'avg_logprob': -0.39771583676338196, 'compression_ratio': 1.1428571428571428, 'no_speech_prob': 0.1816209852695465}, {'id': 1, 'seek': 0, 'start': 4.5600000000000005, 'end': 10.0, 'text': " We are a secret, can't be exposed", 'tokens': [50364, 9297, 8194, 11, 5079, 3255, 50592, 50592, 492, 366, 257, 4054, 11, 393, 380, 312, 9495, 50864, 50864, 663, 311, 577, 309, 307, 11, 300, 311, 577, 309, 1709, 51140], 'temperature': 0.0, 'avg_logprob': -0.39771583676338196, 'compression_ratio': 1.1428571428571428, 'no_speech_prob': 0.1816209852695465}, {'id': 2, 'seek': 1000, 'start': 10.0, 'end': 31.0, 'text': " That's how it is, that's how it goes", 'tokens': [50364, 663, 311, 577, 309, 307, 11, 300, 311, 577, 309, 1709, 51414], 'temperature': 0.0, 'avg_logprob': -0.247916613306318, 'compression_ratio': 1.0909090909090908, 'no_speech_prob': 0.004447056911885738}], 'language': 'en'}
The audio duration is 15 s, but the final segment ends at 31.0.
Replies: 2 comments 3 replies
Empirically, I found that this tends to go away with the large model. You can also add a few lines to clamp the maximum timestamp to the duration of the audio. Another option is to suppress any timestamp tokens that are greater than the audio duration at the decoding stage.
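For the clamping approach, something like the following might work. It is only a sketch: `clamp_segments` is a hypothetical helper (not part of whisper), and it assumes each segment is a dict with `start` and `end` in seconds, as in the output posted above.

```python
def clamp_segments(segments, duration):
    """Return copies of the segments with start/end limited to [0, duration].

    Hypothetical helper, not part of the whisper package.
    """
    clamped = []
    for seg in segments:
        seg = dict(seg)  # shallow copy so the caller's list is not mutated
        seg["start"] = min(max(seg["start"], 0.0), duration)
        # Keep end >= start, then cap it at the audio duration.
        seg["end"] = min(max(seg["end"], seg["start"]), duration)
        clamped.append(seg)
    return clamped


# Segments trimmed down from the output in the question (tokens etc. omitted).
segments = [
    {"id": 0, "start": 0.0, "end": 4.56, "text": " Nobody sees, nobody knows"},
    {"id": 1, "start": 4.56, "end": 10.0, "text": " We are a secret, can't be exposed"},
    {"id": 2, "start": 10.0, "end": 31.0, "text": " That's how it is, that's how it goes"},
]

fixed = clamp_segments(segments, duration=15.0)
print(fixed[-1]["end"])  # 15.0
```

If you do not already know the duration, one way to get it (assuming the `openai-whisper` package, where `whisper.load_audio` resamples to 16 kHz) is `len(whisper.load_audio(path)) / 16000`. Note that this only fixes the reported end time; it does not tell you where the speech actually stops.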
I am having the same problem.