
# Vosk Colab Demo

Vosk is an open source offline speech recognition toolkit. It supports more than 20 languages and dialects, including English, German, Russian, Chinese, and Czech. Language models range in size from tens of megabytes to several gigabytes; larger models are generally more accurate. For more information, see https://alphacephei.com/vosk/.



This notebook demonstrates Vosk recognition capabilities.

# Install module and prepare the file

First, install the vosk module:

In [1]:
!pip3 install vosk



## Importing the necessary modules

Next, import the modules required for all the examples below:

In [2]:
from vosk import Model, KaldiRecognizer
import wave
import json

## Download example audio file

You can upload your own audio file and listen to it by replacing the path to our example with your own in the code below.

In [3]:
import IPython
IPython.display.Audio("/content/tmp.wav")

In [4]:
model = Model(lang="en-us")

Open the downloaded file in 'read bytes' mode as a wave object:

In [5]:
wf = wave.open('/content/tmp.wav', 'rb')
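Before recognition it can help to confirm that the file has the format the Vosk example scripts expect: mono, 16-bit, uncompressed PCM WAV. The following is a minimal sketch using only the standard `wave` module. The helper name `is_vosk_friendly` and the temporary demo file are assumptions made for illustration; they are not part of Vosk.

```python
import os
import tempfile
import wave


def is_vosk_friendly(path):
    """Return True if the file is mono, 16-bit, uncompressed PCM WAV."""
    with wave.open(path, "rb") as wf:
        return (
            wf.getnchannels() == 1            # mono
            and wf.getsampwidth() == 2        # 2 bytes per sample = 16-bit
            and wf.getcomptype() == "NONE"    # uncompressed PCM
        )


# Self-contained demonstration: write 0.1 s of silence at 16 kHz, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 1600)

print(is_vosk_friendly(demo_path))
```

In the notebook, the same check could be applied to `/content/tmp.wav`; a file that fails it would typically need to be converted first, for example with ffmpeg.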

The KaldiRecognizer class provides the methods used here, such as SetWords, SetPartialWords, and AcceptWaveform.

The model object is the first parameter of KaldiRecognizer. The second parameter is the sample rate in Hz. It can be passed directly as a number, such as 8000 or 16000, or read from the audio file with the getframerate method, as in the following code fragment.

Create a KaldiRecognizer object with the model and sample rate as arguments:

In [6]:
rec = KaldiRecognizer(model, wf.getframerate())

The commands above are the same for most of the examples; the ones that follow differ from example to example.

Enable timestamps for recognized words (in both the partial result and the result) with the `SetWords` and `SetPartialWords` methods:

In [7]:
rec.SetWords(True)
#rec.SetPartialWords(True)

In [8]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())

{
  "partial" : ""
}
{
  "partial" : "so it's"
}
{
  "partial" : "so it is"
}
{
  "partial" : "so it is so"
}
{
  "partial" : "so it is so nice"
}
{
  "partial" : "so it is so nice that"
}
{
  "partial" : "so it is so nice that you"
}
{
  "partial" : "so it is so nice that you came here"
}
{
  "partial" : "so it is so nice that you came here and"
}
{
  "partial" : "so it is so nice that you came here and i"
}
{
  "partial" : "so it is so nice that you came here and i just"
}
{
  "partial" : "so it is so nice that you came here and i just want to"
}
{
  "partial" : "so it is so nice that you came here and i just want to"
}
{
  "partial" : "so it is so nice that you came here and i just want to"
}
{
  "partial" : "so it is so nice that you came here and i just want to"
}
{
  "partial" : "so it is so nice that you came here and i just want to i just"
}
{
  "partial" : "so it is so nice that you came here and i just want to i just have to"
}
{
  "partial" : "so it is so nice that you came 

In [9]:
text = "so it is so nice that you came here and i just want to i just have to brag a bit"

In [40]:
wf = wave.open('/content/tmp.wav', "rb")
rec = KaldiRecognizer(model, wf.getframerate(), '["so it is so nice that you came here and i just want to i just have to brag a bit"]')
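The third argument passed to KaldiRecognizer in the cell above is a JSON-encoded list of phrases that restricts what the recognizer may output. Rather than writing that string by hand, it can be built with `json.dumps`, which takes care of quoting. The sketch below assumes a hypothetical helper named `build_grammar`; the optional `"[unk]"` entry is the token Vosk grammars conventionally use for out-of-grammar words.

```python
import json


def build_grammar(phrases, allow_unknown=False):
    """Encode a list of phrases as a JSON array string for use as a grammar."""
    items = list(phrases)
    if allow_unknown:
        # "[unk]" lets the recognizer mark words outside the listed phrases.
        items.append("[unk]")
    return json.dumps(items)


text = "so it is so nice that you came here and i just want to i just have to brag a bit"
grammar = build_grammar([text])
print(grammar)

# In the notebook, the string would then be passed as the third argument:
# rec = KaldiRecognizer(model, wf.getframerate(), grammar)
```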

In [41]:
rec.SetWords(True)
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result()))
    else:
        print(json.loads(rec.PartialResult()))

final = rec.FinalResult()
print(json.loads(final))

{'partial': ''}
{'partial': 'so it'}
{'partial': 'so it is'}
{'partial': 'so it is so nice'}
{'partial': 'so it is so nice'}
{'partial': 'so it is so nice that you'}
{'partial': 'so it is so nice that you came'}
{'partial': 'so it is so nice that you came here'}
{'partial': 'so it is so nice that you came here and'}
{'partial': 'so it is so nice that you came here and i just'}
{'partial': 'so it is so nice that you came here and i just'}
{'partial': 'so it is so nice that you came here and i just want to'}
{'partial': 'so it is so nice that you came here and i just want to'}
{'partial': 'so it is so nice that you came here and i just want to'}
{'partial': 'so it is so nice that you came here and i just want to'}
{'partial': 'so it is so nice that you came here and i just want to i just have'}
{'partial': 'so it is so nice that you came here and i just want to i just have to brag'}
{'partial': 'so it is so nice that you came here and i just want to i just have to brag a'}
{'partial': 's
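The loop above prints every message the recognizer returns. In practice it is often more useful to keep partial hypotheses and finalized utterances apart: `PartialResult()` messages carry a `partial` key, while `Result()` and `FinalResult()` messages carry a `text` key. A minimal sketch of that separation follows; the helper name `split_results` is hypothetical, and the sample strings are hand-written to resemble the output above, not actual recognizer output.

```python
import json


def split_results(outputs):
    """Separate partial hypotheses from finalized utterance texts."""
    partials, finals = [], []
    for raw in outputs:
        msg = json.loads(raw)
        if "partial" in msg:
            partials.append(msg["partial"])
        elif msg.get("text"):
            # Skip empty final results, which occur when nothing was recognized.
            finals.append(msg["text"])
    return partials, finals


# Illustrative messages only.
sample = [
    '{"partial": ""}',
    '{"partial": "so it"}',
    '{"partial": "so it is"}',
    '{"text": "so it is so nice"}',
]
partials, finals = split_results(sample)
print(partials[-1])
print(" ".join(finals))
```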

In [42]:
result = json.loads(final)
result["result"]

[{'conf': 1.0, 'end': 0.18, 'start': 0.06, 'word': 'so'},
 {'conf': 1.0, 'end': 0.24, 'start': 0.18, 'word': 'it'},
 {'conf': 1.0, 'end': 0.48, 'start': 0.3, 'word': 'is'},
 {'conf': 1.0, 'end': 0.69, 'start': 0.48, 'word': 'so'},
 {'conf': 1.0, 'end': 1.02, 'start': 0.69, 'word': 'nice'},
 {'conf': 1.0, 'end': 1.2, 'start': 1.05, 'word': 'that'},
 {'conf': 1.0, 'end': 1.29, 'start': 1.2, 'word': 'you'},
 {'conf': 1.0, 'end': 1.59, 'start': 1.29, 'word': 'came'},
 {'conf': 1.0, 'end': 1.71, 'start': 1.59, 'word': 'here'},
 {'conf': 1.0, 'end': 1.95, 'start': 1.71, 'word': 'and'},
 {'conf': 1.0, 'end': 2.13, 'start': 2.01, 'word': 'i'},
 {'conf': 1.0, 'end': 2.43, 'start': 2.13, 'word': 'just'},
 {'conf': 1.0, 'end': 2.55, 'start': 2.43, 'word': 'want'},
 {'conf': 1.0, 'end': 2.94, 'start': 2.55, 'word': 'to'},
 {'conf': 1.0, 'end': 3.51, 'start': 3.33, 'word': 'i'},
 {'conf': 1.0, 'end': 3.75, 'start': 3.51, 'word': 'just'},
 {'conf': 1.0, 'end': 3.9, 'start': 3.75, 'word': 'have'},
 {
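Each entry in the word list has `start` and `end` times in seconds, so simple timing analysis is possible without further recognizer calls. As one possible use, the sketch below lists pauses between consecutive words. The function name `find_pauses` and the 0.2 s threshold are assumptions chosen for illustration; the sample entries are copied from the output above.

```python
def find_pauses(words, min_gap=0.2):
    """Return (previous word, next word, gap in seconds) for gaps >= min_gap."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        # Round to centiseconds to hide floating-point noise.
        gap = round(cur["start"] - prev["end"], 2)
        if gap >= min_gap:
            pauses.append((prev["word"], cur["word"], gap))
    return pauses


words = [
    {"conf": 1.0, "end": 2.55, "start": 2.43, "word": "want"},
    {"conf": 1.0, "end": 2.94, "start": 2.55, "word": "to"},
    {"conf": 1.0, "end": 3.51, "start": 3.33, "word": "i"},
    {"conf": 1.0, "end": 3.75, "start": 3.51, "word": "just"},
]
print(find_pauses(words))
```

In the notebook, `find_pauses(result["result"])` would apply the same check to the full word list.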

In [43]:
def format_ctm(data, filename, offset):
    # Print one line per recognized word, shifting times by `offset` seconds.
    for res in data["result"]:
        start = res["start"] + offset
        end = res["end"] + offset
        word = res["word"]
        conf = res["conf"]
        print(f"{filename} 1 {start:.02f} {end:.02f} {conf} {word} 1.0 {word} cor")

In [44]:
format_ctm(result, "hsi_5_0718_209_001_main", 27.00)

hsi_5_0718_209_001_main 1 27.06 27.18 1.0 so 1.0 so cor
hsi_5_0718_209_001_main 1 27.18 27.24 1.0 it 1.0 it cor
hsi_5_0718_209_001_main 1 27.30 27.48 1.0 is 1.0 is cor
hsi_5_0718_209_001_main 1 27.48 27.69 1.0 so 1.0 so cor
hsi_5_0718_209_001_main 1 27.69 28.02 1.0 nice 1.0 nice cor
hsi_5_0718_209_001_main 1 28.05 28.20 1.0 that 1.0 that cor
hsi_5_0718_209_001_main 1 28.20 28.29 1.0 you 1.0 you cor
hsi_5_0718_209_001_main 1 28.29 28.59 1.0 came 1.0 came cor
hsi_5_0718_209_001_main 1 28.59 28.71 1.0 here 1.0 here cor
hsi_5_0718_209_001_main 1 28.71 28.95 1.0 and 1.0 and cor
hsi_5_0718_209_001_main 1 29.01 29.13 1.0 i 1.0 i cor
hsi_5_0718_209_001_main 1 29.13 29.43 1.0 just 1.0 just cor
hsi_5_0718_209_001_main 1 29.43 29.55 1.0 want 1.0 want cor
hsi_5_0718_209_001_main 1 29.55 29.94 1.0 to 1.0 to cor
hsi_5_0718_209_001_main 1 30.33 30.51 1.0 i 1.0 i cor
hsi_5_0718_209_001_main 1 30.51 30.75 1.0 just 1.0 just cor
hsi_5_0718_209_001_main 1 30.75 30.90 1.0 have 1.0 have cor
hsi_5_0718_209_0

In [46]:
INPUT = """
hsi_5_0718_209_001_main 1 32.47 31.68 ah 1.0 Ah, sub
hsi_5_0718_209_001_main 1 32.46 32.58 you 1.0 you cor
hsi_5_0718_209_001_main 1 32.58 32.76 see 1.0 see cor
hsi_5_0718_209_001_main 1 32.76 33.0 this 1.0 this cor
hsi_5_0718_209_001_main 1 33.0 33.42 couch 1.0 couch cor
hsi_5_0718_209_001_main 1 33.48 33.611924 and 0.582101 and cor
hsi_5_0718_209_001_main 1 33.611924 33.81 and 1.0 and cor
hsi_5_0718_209_001_main 1 33.81 34.23 the_what 1.0 the_what sub
hsi_5_0718_209_001_main 1 34.92 35.37 do_you_call 1.0 do_you_call sub
hsi_5_0718_209_001_main 1 35.37 35.58 this 1.0 this cor
hsi_5_0718_209_001_main 1 35.58 35.76 one 1.0 one cor
hsi_5_0718_209_001_main 1 35.76 35.85 in 1.0 in cor
hsi_5_0718_209_001_main 1 35.85 36.27 swedish 1.0 Swedish cor
hsi_5_0718_209_001_main 1 36.45 36.6 or 1.0 or cor
hsi_5_0718_209_001_main 1 36.6 36.780527 in 1.0 in cor
hsi_5_0718_209_001_main 1 36.780527 37.23 english 0.975392 English sub
hsi_5_0718_209_001_main 1 37.23 37.8 fåtölj 1.0 fåtölj sub
hsi_5_0718_209_001_main 1 38.94 39.33 chair 1.0 chair cor
hsi_5_0718_209_001_main 1 39.33 39.72 yes 1.0 yes cor
hsi_5_0718_209_001_main 1 39.72 39.96 these 1.0 these- ins
hsi_5_0718_209_001_main 1 39.99 40.56 armchair 0.419268 armchair, sub
hsi_5_0718_209_001_main 1 40.56 40.92 yes 1.0 yes, cor
hsi_5_0718_209_001_main 1 40.95 41.16 this 1.0 this cor
hsi_5_0718_209_001_main 1 41.16 41.73 armchair 1.0 armchair cor
hsi_5_0718_209_001_main 1 42.48 43.05 ah 1.0 ah sub
hsi_5_0718_209_001_main 1 43.32 43.83 is 1.0 is cor
hsi_5_0718_209_001_main 1 43.89 44.49 really 1.0 really cor
hsi_5_0718_209_001_main 1 44.49 45.18 really 1.0 really cor
hsi_5_0718_209_001_main 1 45.24 45.84 amazing 1.0 amazing cor
hsi_5_0718_209_001_main 1 45.84 46.2 because 1.0 because cor
hsi_5_0718_209_001_main 1 46.2 46.35 it's 1.0 it's cor
hsi_5_0718_209_001_main 1 46.35 46.41 a 1.0 a cor
hsi_5_0718_209_001_main 1 46.41 46.83 danish 1.0 Danish cor
hsi_5_0718_209_001_main 1 46.83 47.49 designer 1.0 designer cor
hsi_5_0718_209_001_main 1 48.0 48.81 who's 0.919197 who's cor
hsi_5_0718_209_001_main 1 49.14 49.44 really 0.997848 really cor
hsi_5_0718_209_001_main 1 49.44 49.98 famous 1.0 famous. cor
"""

In [49]:
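def get_ffmpeg_cmd(ctmeditlines):
    # Build an ffmpeg command that cuts the span covered by the given lines
    # and converts it to 16 kHz mono 16-bit PCM.
    startline = ctmeditlines[0]
    endline = ctmeditlines[-1]
    start = startline.split()[2]  # start time of the first line
    end = endline.split()[3]      # end time of the last line
    filename = startline.split()[0]
    return f"ffmpeg -i {filename}.wav -ss {start} -t {float(end) - float(start)} -acodec pcm_s16le -ac 1 -ar 16000 /tmp/tmp.wav"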

In [None]:
ctmlines = [x for x in INPUT.split("\n") if x != ""]

In [50]:
get_ffmpeg_cmd([x for x in INPUT.split("\n") if x != ""])

'ffmpeg -i hsi_5_0718_209_001_main.wav -ss 32.47 -t 17.509999999999998 -acodec pcm_s16le -ac 1 -ar 16000 /tmp/tmp.wav'
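The duration in the command above shows floating-point noise (`17.509999999999998`), and a single command string has to be passed through a shell. A possible variant, sketched below, rounds the duration and returns an argument list suitable for `subprocess.run`. The function name `get_ffmpeg_args` and the `out_path` parameter are assumptions for illustration; the two sample lines are the first and last lines of `INPUT`.

```python
import shlex


def get_ffmpeg_args(ctm_lines, out_path="/tmp/tmp.wav"):
    """Build an ffmpeg argument list covering the first to the last line."""
    first = ctm_lines[0].split()
    last = ctm_lines[-1].split()
    start = float(first[2])
    # Round to centiseconds to avoid floating-point noise in the command.
    duration = round(float(last[3]) - start, 2)
    return [
        "ffmpeg", "-i", f"{first[0]}.wav",
        "-ss", str(start), "-t", str(duration),
        "-acodec", "pcm_s16le", "-ac", "1", "-ar", "16000",
        out_path,
    ]


lines = [
    "hsi_5_0718_209_001_main 1 32.47 31.68 ah 1.0 Ah, sub",
    "hsi_5_0718_209_001_main 1 49.44 49.98 famous 1.0 famous. cor",
]
args = get_ffmpeg_args(lines)
print(shlex.join(args))

# To actually run the conversion (requires ffmpeg and the source file):
# import subprocess
# subprocess.run(args, check=True)
```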