Replies: 6 comments 8 replies
-
Could you describe which model you are using?
-
Could you try https://huggingface.co/spaces/k2-fsa/generate-subtitles-for-videos
-
It looks like the model above (csukuangfj/sherpa-onnx-telespeech-ctc-int8-zh-2024-06-04) has timestamp info with it, but the file here: https://github.com/k2-fsa/sherpa-onnx/blob/1a43d1e37f2a65a7326e75be4607b4996f9737a8/sherpa-onnx/python/sherpa_onnx/offline_recognizer.py
Q1: Is there a recipe in the icefall project that I can use to produce a model that outputs timestamp info? For example, the OpenAI endpoint lets you adjust the granularity down to the word level: https://platform.openai.com/docs/guides/speech-to-text/timestamps. My ultimate goal is to have an endpoint that lets the user change the granularity of the timestamps.
Q2: For the last few months I have been training zipformer models. Is there a flag I can activate, or a setting I can change in the code, so that a trained zipformer returns timestamps?
-
All models from icefall support timestamps. Just decode as usual and you will get the timestamps for each token from the recognition result. (Hint: s.result.tokens gives the list of decoded tokens.) You can inspect s.result to see what fields are available.
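A minimal sketch of reading timestamps with the sherpa-onnx Python API. The model filenames are placeholders for whatever transducer you exported from icefall, and soundfile is only used here to load a mono 16 kHz wav:

```python
import soundfile as sf
import sherpa_onnx

# Placeholder model files: point these at your own exported zipformer/transducer.
recognizer = sherpa_onnx.OfflineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
)

# Assumes a mono wav; soundfile returns float32 samples and the sample rate.
samples, sample_rate = sf.read("test.wav", dtype="float32")

s = recognizer.create_stream()
s.accept_waveform(sample_rate, samples)
recognizer.decode_stream(s)

print(s.result.text)        # full transcript
print(s.result.tokens)      # list of decoded tokens
print(s.result.timestamps)  # start time (in seconds) of each token

# List the public fields available on the result object.
print([f for f in dir(s.result) if not f.startswith("_")])
```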
-
Do you want timestamps at the token level, word level, or sentence level?
-
Yeah, I think this is what Nadira wants (she has an idea to highlight word by word, and sentence/chunk by sentence). I noticed that with the subtitles example above, it's super easy for Xin to get the chunk timestamps. I wonder if our models can output word-level timestamps directly so they can just use that.
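If the model is a BPE-based zipformer from icefall, the token-level timestamps can also be merged into word-level ones on the client side. A rough sketch, assuming the SentencePiece convention where a leading "▁" marks the start of a new word (this grouping is my own, not something sherpa-onnx does for you):

```python
def tokens_to_words(tokens, timestamps):
    """Merge BPE tokens and their start times into (word, start_time) pairs.

    Assumes SentencePiece-style tokens where a leading '▁' marks a new word;
    each word's timestamp is taken from its first token.
    """
    words = []
    for tok, ts in zip(tokens, timestamps):
        if tok.startswith("▁") or not words:
            words.append([tok.lstrip("▁"), ts])
        else:
            words[-1][0] += tok
    return [(w, t) for w, t in words]

# Example with the result from the sketch above:
# for word, start in tokens_to_words(s.result.tokens, s.result.timestamps):
#     print(f"{start:.2f}\t{word}")
```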
-
Hi,
Is there a way to produce timestamps in sherpa-onnx for offline models?
I am currently running the offline version of sherpa-onnx and my model produces decoded text: I send a wav file and get back the transcript for the audio.
I also want to output the timestamps per word or per sentence.
Here is the whisper output when I tested it:
[00:00.000 --> 00:12.280] Okay. All right. Well, good evening, everybody. Welcome. Elon, thanks for being here.
[00:12.280 --> 00:13.920] Thank you for having me.
I want to be able to do something similar.
Thanks
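For reference, a hedged sketch of turning word-level (or token-level) timestamps from the replies above into segment lines similar to the whisper output; the pause threshold and the use of the audio duration as the last segment's end time are my own assumptions:

```python
def format_ts(t: float) -> str:
    """Format seconds as MM:SS.mmm, matching the whisper-style lines above."""
    minutes, seconds = divmod(t, 60.0)
    return f"{int(minutes):02d}:{seconds:06.3f}"


def to_segments(words, times, audio_duration, max_gap=0.8):
    """Group (word, start_time) pairs into segments, splitting whenever the
    pause before a word exceeds max_gap seconds. A segment's end time is
    approximated by the start of the next segment (or the audio duration)."""
    segments = []
    cur_words, cur_start = [], None
    for i, (word, start) in enumerate(zip(words, times)):
        if cur_start is None:
            cur_words, cur_start = [word], start
        elif start - times[i - 1] > max_gap:
            segments.append((cur_start, start, " ".join(cur_words)))
            cur_words, cur_start = [word], start
        else:
            cur_words.append(word)
    if cur_words:
        segments.append((cur_start, audio_duration, " ".join(cur_words)))
    return segments


# Tiny made-up example; in practice words/times would come from the token
# timestamps above and audio_duration from len(samples) / sample_rate.
words = ["Okay.", "All", "right.", "Thank", "you"]
times = [0.0, 0.5, 0.9, 12.3, 12.6]
for seg_start, seg_end, text in to_segments(words, times, audio_duration=13.9):
    print(f"[{format_ts(seg_start)} --> {format_ts(seg_end)}] {text}")
```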