Timestamps precision in milliseconds? #303

mirix · 2023-06-15T13:08:21Z

Hello,

I am using the sample code provided:

from faster_whisper import WhisperModel

model_size = 'large-v2'
model = WhisperModel(model_size, device='cpu', compute_type='int8')

segments, info = model.transcribe('Michael, Jim, Dwight epic scene [qHrN5Mf5sgo].mp3', beam_size=5)

print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
	print('[%.2fs -> %.2fs] %s' % (segment.start, segment.end, segment.text))

And the timestamp precision seems to be one second:

Detected language 'en' with probability 0.988083
[0.00s -> 7.00s]  Here's what's going to happen. I am going to have to fix you, manage you to, on a more
[7.00s -> 13.00s]  personal scale, a more micro form of management. Jim, what is that called?
[13.00s -> 14.00s]  Micro Jimin.
[14.00s -> 19.00s]  Boom. Yes. Now Jim is going to be the client. Dwight, you're going to have to sell to him
[19.00s -> 24.00s]  without being aggressive, hostile, or difficult. Let's go.
[24.00s -> 28.00s]  All right, fine. Ring, ring.
[28.00s -> 29.00s]  Hello?

Would it be possible to report milliseconds?

Another, unrelated question: if I wished to perform a per-segment analysis (say, gender, sentiment, emotion), how should I use the segment object?

Furthermore, I have tried numerous approaches to speaker diarization (I could not try the NeMo-based ones because I do not have an adequate GPU), and all of them yield very bad results in certain scenarios when it comes to speaker attribution. I am considering a brute-force approach. Are there any recommendations for a library I could use to compare a segment with the previous one in order to determine whether or not it is the same speaker?

Best,

Ed

hoonlight · 2023-06-15T13:15:06Z

Setting word_timestamps=True makes the segment timestamps more precise, down to the millisecond level.
This applies to all segments, even if you do not use the word timestamps themselves.
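
As a rough illustration of the suggestion above, here is a minimal sketch that passes word_timestamps=True to transcribe() and prints each boundary with millisecond digits. The helper names format_timestamp and transcribe_with_ms are hypothetical (they are not part of faster_whisper), and the model size and device settings are simply copied from the sample in the question.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm (hypothetical helper)."""
    total_ms = round(seconds * 1000)  # work in whole milliseconds
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def transcribe_with_ms(path: str, model_size: str = "large-v2"):
    """Yield (start, end, text) per segment, with formatted timestamps."""
    # Imported lazily so format_timestamp works without faster_whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # word_timestamps=True is the option suggested in this thread.
    segments, info = model.transcribe(path, beam_size=5, word_timestamps=True)
    for segment in segments:
        yield format_timestamp(segment.start), format_timestamp(segment.end), segment.text


if __name__ == "__main__":
    import sys

    # Usage: python script.py path/to/audio.mp3
    for start, end, text in transcribe_with_ms(sys.argv[1]):
        print(f"[{start} -> {end}] {text}")
```

If the goal is only to see more digits with the original print statement, changing '%.2fs' to '%.3fs' would also work once the option is set.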

For diarization you can try the method implemented in the repo below.

https://github.com/JaesungHuh/SimpleDiarization

guillaumekln · 2023-06-15T13:19:26Z

Even without word timestamps, the Whisper model can predict timestamps with 10-millisecond precision, but one of the Whisper authors said that "the predicted timestamps tend to be biased towards integers" (source).

mirix · 2023-06-15T13:51:53Z

Thanks for the tips. Indeed, adding the word_timestamps keyword produces a precision of 10 milliseconds.

I tried the library you suggested, but it seems it does not work for more than two speakers. Or perhaps I am wrong. We will see:

JaesungHuh/SimpleDiarization#1

I have tried many diarization strategies but, so far, everything based on pyannote fails:

pyannote/pyannote-audio#1406

guillaumekln closed this as completed Jun 20, 2023
