# VOSK Speech Recognition
In this notebook we will use the [vosk](https://alphacephei.com/vosk/) speech recognition toolkit to run speech recognition on a folder of video or audio files.

## 1. Setup
First, let's install and import all of our dependencies.

In [3]:
!pip cache purge > /dev/null 2>&1
!pip install vosk > /dev/null 2>&1
!pip install librosa > /dev/null 2>&1
!pip uninstall -y gcu > /dev/null 2>&1
!pip install -q git+https://github.com/jdchart/gcu.git > /dev/null 2>&1
!apt install -y ffmpeg
import vosk
import gcu
import os
import wave
import json
import shutil
import librosa
import numpy as np
from scipy.io import wavfile
from scipy.signal import wiener

Next, let's get the vosk model we want to use. The model depends on the language of the content we wish to analyse; [here is a list](https://alphacephei.com/vosk/models) of the models that vosk has available. Change the `MODEL_NAME` variable to the model you wish to download and use.

The `MODEL_PATH` variable sets where the model is saved. If the folder doesn't exist, it will be created. Even if you have already downloaded the model, run this cell anyway, as it tells the rest of the notebook where to find the model.

In [None]:
# Change these variables if needed:
MODEL_NAME = "vosk-model-small-en-us-0.15"
MODEL_PATH = os.path.join(os.path.abspath('../..'), "models")

# Create folder if needed:
os.makedirs(MODEL_PATH, exist_ok=True)

# Download model if it doesn't already exist:
# (the helper package is imported as `gcu` above; this assumes it exposes `files.download_zip`)
if not os.path.isdir(os.path.join(MODEL_PATH, MODEL_NAME)):
    gcu.files.download_zip(f"https://alphacephei.com/vosk/models/{MODEL_NAME}.zip", MODEL_PATH)

# Load the model:
model = vosk.Model(os.path.join(MODEL_PATH, MODEL_NAME))
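
If the download helper used above is not available, the same step can be approximated with the standard library alone. This is a minimal sketch, not the helper's actual implementation: the function name `download_and_extract_zip` is hypothetical, and it assumes the archive unpacks into a folder named after the model.

```python
import os
import shutil
import tempfile
import urllib.request
import zipfile


def download_and_extract_zip(url, dest_folder):
    """Download a zip archive from `url` and unpack it into `dest_folder`.

    Hypothetical stand-in for the download helper used in the cell above.
    """
    os.makedirs(dest_folder, exist_ok=True)

    # Stream the archive to a temporary file so it never sits fully in memory.
    fd, tmp_path = tempfile.mkstemp(suffix=".zip")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)

        # Unpack the archive next to any models that are already there.
        with zipfile.ZipFile(tmp_path) as archive:
            archive.extractall(dest_folder)
    finally:
        # Remove the temporary archive whether or not extraction succeeded.
        os.remove(tmp_path)
    return dest_folder
```

It could then be called as `download_and_extract_zip(f"https://alphacephei.com/vosk/models/{MODEL_NAME}.zip", MODEL_PATH)`.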

## 2. Get media
Now we need to tell our code where to find the media we wish to process. Change the `SOURCE_FOLDER` variable to give the path to a folder of audio or video files.

In [9]:
# Give the path to a folder of audio or video files:
SOURCE_FOLDER = os.path.join(os.path.abspath('../..'), "sources")

We need to make sure that the files are in the right format for vosk's model. For this, we use `ffmpeg` to convert everything to mono wav files with a sample rate of 16000 Hz. The resulting audio is saved in a temporary folder created at `os.path.join(os.getcwd(), "temp_src")`.

Note that only files with an extension in the `ACCEPTED_FORMATS` list will be processed. Feel free to add other formats if needed.

Even if you have already converted the files, run this cell again, as it tells the rest of the notebook where to find the sources (without converting them again).

In [None]:
ACCEPTED_FORMATS = ["wav", "mp4", "mp3"]

# Create a temporary source folder:
os.makedirs(os.path.join(os.getcwd(), "temp_src"), exist_ok=True)

# Get a list of all the files to process:
# (assumes the `gcu` package imported above exposes `files.collect_files`)
file_list = gcu.files.collect_files(SOURCE_FOLDER, ACCEPTED_FORMATS)
source_list = []

# Convert each file:
for file in file_list:
    print(f"Converting \"{file}\"...")
    new_name = os.path.join(os.getcwd(), "temp_src", os.path.splitext(os.path.basename(file))[0] + '.wav')
    if not os.path.isfile(new_name):
        !ffmpeg -i "{file}" -ar 16000 -ac 1 "{new_name}" > /dev/null 2>&1
    source_list.append(new_name)
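
The `!ffmpeg` line above relies on IPython shell magic, so it only works inside a notebook. A rough equivalent for a plain Python script is sketched below. The names `build_ffmpeg_command` and `convert_to_vosk_wav` are hypothetical, and the sketch assumes `ffmpeg` is on the `PATH`. The flags mirror the cell above (`-ar` for the sample rate, `-ac 1` for mono), with `-y` added so that an existing output file is overwritten without a prompt.

```python
import subprocess


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Return the ffmpeg argument list for a mono WAV at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output without prompting
        "-i", src,                 # input audio or video file
        "-ar", str(sample_rate),   # output sample rate in Hz
        "-ac", "1",                # downmix to a single (mono) channel
        dst,
    ]


def convert_to_vosk_wav(src, dst, sample_rate=16000):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(
        build_ffmpeg_command(src, dst, sample_rate),
        check=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces or quotes.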

## 3. Audio pre-processing
Next, we clean up the audio so that it is easier for the vosk model to handle: each file is denoised with a Wiener filter, peak-normalized, and written back in place as 16-bit PCM.

In [None]:
for file in source_list:
    print(f"Processing {file}...")
    
    # Load the audio file with librosa:
    audio_data, sample_rate = librosa.load(file, sr = None)
    
    # Perform noise reduction and normalization:
    noise_reduction = wiener(audio_data)
    normalized = librosa.util.normalize(noise_reduction)
    
    # Scale to 16 bit depth for vosk:
    scaled = np.int16(normalized * 32767)
    
    # Output file:
    wavfile.write(file, sample_rate, scaled)
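
Since the recognizer expects mono 16-bit PCM, it can be worth checking a processed file before moving on. The sketch below uses only the standard `wave` module. `check_wav_format` is a hypothetical helper, and the expected values (1 channel, 2 bytes per sample, 16000 Hz) are taken from the conversion and scaling steps above.

```python
import wave


def check_wav_format(path, expected_rate=16000):
    """Return a list of problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, found {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wf.getsampwidth()}-bit")
        if wf.getframerate() != expected_rate:
            problems.append(f"expected {expected_rate} Hz, found {wf.getframerate()} Hz")
    return problems
```

Calling `check_wav_format(file)` for each entry of `source_list` should return an empty list when the file is ready for recognition.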

## 4. Speech recognition
Now we can run the speech recognition model on each audio file. First, let's choose where to save the results by updating the `PROCESS_OUTPUT` variable.

In [22]:
# Give a path to a folder to output the results:
PROCESS_OUTPUT = os.path.join(os.path.abspath('../..'), "output")

# Create folder if needed:
os.makedirs(PROCESS_OUTPUT, exist_ok=True)

Next, we run the recognition for each audio file. The audio is fed to the recognizer in chunks, and the result of each utterance is collected in a list that is saved as a json file.

In [None]:
for file in source_list:
   print(f"Processing {file}...")
   utterances = []

   # Open the audio file (source_list already holds full paths):
   with wave.open(file, 'rb') as wf:
      # Create the vosk recognizer at the file's sample rate:
      recognizer = vosk.KaldiRecognizer(model, wf.getframerate())
      recognizer.SetWords(True)

      # Feed the audio in chunks. AcceptWaveform returns True when an
      # utterance ends, not when recognition as a whole succeeds:
      while True:
         data = wf.readframes(4000)
         if len(data) == 0:
            break
         if recognizer.AcceptWaveform(data):
            utterances.append(json.loads(recognizer.Result()))

   # Collect whatever is still buffered in the recognizer:
   utterances.append(json.loads(recognizer.FinalResult()))

   # Export the results as a json file:
   result_file_name = os.path.join(PROCESS_OUTPUT, os.path.splitext(os.path.basename(file))[0] + "_SPEECH_RECOGNITION.json")
   with open(result_file_name, 'w') as json_file:
      json.dump(utterances, json_file, indent = 2)
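
With `SetWords(True)`, each vosk result carries a `text` field and a `result` list whose entries hold `word`, `start`, `end` and `conf` values. As a minimal sketch of how the exported json might be read back, the hypothetical helper `word_timeline` below flattens those entries into a word timeline. It accepts either a single result dictionary or a list of them, and the sample data is made up for illustration.

```python
import json


def word_timeline(data):
    """Flatten vosk output into a list of (start, end, word) tuples."""
    # Accept one result dictionary or a list of per-utterance results.
    segments = data if isinstance(data, list) else [data]
    timeline = []
    for segment in segments:
        # Segments without recognized speech have no "result" key.
        for entry in segment.get("result", []):
            timeline.append((entry["start"], entry["end"], entry["word"]))
    return timeline


# Made-up sample in the shape vosk produces when SetWords(True) is enabled:
sample = json.loads("""
{"result": [
    {"conf": 1.0, "start": 0.12, "end": 0.45, "word": "hello"},
    {"conf": 0.93, "start": 0.45, "end": 0.90, "word": "world"}],
 "text": "hello world"}
""")

for start, end, word in word_timeline(sample):
    print(f"{start:6.2f} - {end:6.2f}  {word}")
```

To use it on real output, load one of the `_SPEECH_RECOGNITION.json` files with `json.load` and pass the result to `word_timeline`.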

## 5. Cleanup
Finally, we can remove the temporary source folder we created now that we are finished.

In [27]:
shutil.rmtree(os.path.join(os.getcwd(), "temp_src"))