# Audio Processing

This notebook evaluates several speech-to-text (STT) models on the test dataset generated in the Audio Generation notebook (`audio_generation.ipynb`).

Transcription takes two steps:
1. Detect the language
2. Use the corresponding Vosk model for transcription
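
The two steps above can be sketched as a simple lookup from the detected language code to a Vosk model directory. This is only an illustration: `VOSK_MODEL_DIRS` and `select_model_dir` are hypothetical names, the directories mirror the ones used later in this notebook, and the fallback to English for unsupported languages is an assumption.

```python
# Hypothetical routing table: language code -> Vosk model directory.
# The directories mirror the ones used later in this notebook.
VOSK_MODEL_DIRS = {
    "en": "../models/audio_processing/vosk-model-small-en-us-0.15",
    "pl": "../models/audio_processing/vosk-model-small-pl-0.22",
    "ru": "../models/audio_processing/vosk-model-small-ru-0.22",
}


def select_model_dir(detected_lang: str, default_lang: str = "en") -> str:
    """Return the model directory for the detected language.

    Falls back to `default_lang` when no model is listed for the language
    (the fallback is an assumption, not something this notebook defines).
    """
    return VOSK_MODEL_DIRS.get(detected_lang, VOSK_MODEL_DIRS[default_lang])


print(select_model_dir("pl"))
print(select_model_dir("de"))  # not listed above, falls back to English
```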

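Metric: word error rate (`WER`), i.e. the number of word substitutions, deletions, and insertions divided by the number of words in the reference text. It is 0 for identical texts and has no upper bound, because insertions can outnumber the reference words. For the en, pl, and ru languages the corresponding Vosk models reach a `WER` in [0.18, 0.20]. These values can serve as baselines for future improvements.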
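
To make the range of the metric concrete, here is a minimal pure-Python sketch of WER based on word-level edit distance. It is not the `jiwer` implementation used later in this notebook; `word_error_rate` is a hypothetical helper written only to illustrate the definition.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the dog is limping", "the dog is limping"))  # identical texts
print(word_error_rate("the dog is limping", "the dog limping"))     # one deleted word
print(word_error_rate("hello", "hello there my friend"))            # three inserted words
```

The last call shows why the metric is not capped at 1: three insertions against a one-word reference give a WER of 3.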

## 1. Configure

* `faster-whisper` is used for language detection
* `vosk` is used for transcription
* `jiwer` provides the WER metric used to evaluate the transcription

In [None]:
!pip install faster-whisper -q

In [None]:
!pip install vosk -q
!pip install soundfile -q

In [None]:
!pip install jiwer

### 1.1 Consts

In [None]:
TESTSET_PATH = "../tests/audio_data/audio_testing_data.json"
AUDIO_ROOT = "../tests/audio_data/audio/"

# how many seconds of audio are used for language detection
HEAD_DURATION = 5.0

## 2. Language Detection

Language detection uses a Whisper model (via `faster-whisper`).


### 2.1 Model init

In [None]:
from faster_whisper import WhisperModel

model = WhisperModel("tiny", device="cpu", compute_type="int8")

In [None]:
# Quick check on a single test file
test_file = f"{AUDIO_ROOT}en_003.wav"
segments, info = model.transcribe(
    test_file,
    language=None,
    clip_timestamps=[0.0, 1.0],
    vad_filter=True,
)
print(info.language, info.language_probability)

### 2.2 Run on test data

In [None]:
import json

# Load the test data
with open(TESTSET_PATH, "r", encoding="utf-8") as f:
    test_set = json.load(f)["data"]

In [None]:
import tqdm

# Main loop
correct_count = 0
for entry in tqdm.tqdm(test_set):
    expected_lang = entry["language"]
    wav_path = f"{AUDIO_ROOT}{entry['id']}.wav"
    segments, info = model.transcribe(
        wav_path,
        language=None,
        clip_timestamps=[0.0, HEAD_DURATION],  # first HEAD_DURATION seconds
        vad_filter=True,
    )
    correct_count += (info.language == expected_lang)

In [None]:

total_count = len(test_set)
lang_detection_accuracy = correct_count / total_count
lang_detection_accuracy

The language is detected correctly for every test file, so the Whisper model is used for language detection in the rest of the pipeline.
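
An aggregate accuracy can hide a weakness in a single language, so a per-language breakdown may be useful once the test set grows. The sketch below is an assumption about how that could look: `accuracy_by_language` is a hypothetical helper, and the sample pairs are made up for illustration, not results from this notebook.

```python
from collections import Counter


def accuracy_by_language(pairs):
    """Compute accuracy per expected language.

    pairs: iterable of (expected_lang, detected_lang) tuples.
    """
    totals, hits = Counter(), Counter()
    for expected, detected in pairs:
        totals[expected] += 1
        hits[expected] += int(expected == detected)
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up pairs for illustration only, not results from this notebook
sample_pairs = [("en", "en"), ("en", "en"), ("pl", "pl"), ("pl", "ru"), ("ru", "ru")]
print(accuracy_by_language(sample_pairs))
```

In the main loop, such pairs could be collected as `(expected_lang, info.language)` instead of only incrementing `correct_count`.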

## 3. Transcription

### 3.1 Metrics for the transcription

The `WER` metric is used. It is 0 for identical texts and has no upper bound.

The Vosk API accepts a word list as a third argument: `rec = KaldiRecognizer(model, samplerate, json.dumps(custom_words))`. However, a run with this argument did not return a correct result (only an empty string). One possible explanation, not verified here, is that Vosk treats this argument as a grammar that restricts recognition to the listed phrases rather than extending the vocabulary.

For this reason, WER is calculated on normalized texts (lowercase, without punctuation).
This problem should be fixed in the future.
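
The effect of normalization can be seen on a small example. The sketch below uses a hypothetical `normalize` helper that follows the same steps as the normalization described above, and it assumes that the recognizer output is lowercase and unpunctuated, as the sample transcript later in this notebook suggests.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


reference = "Hello! What seems to be the problem?"
hypothesis = "hello what seems to be the problem"  # lowercase, no punctuation

# Without normalization, "Hello!", "What" and "problem?" all count as mismatches
raw_mismatches = sum(r != h for r, h in zip(reference.split(), hypothesis.split()))
norm_mismatches = sum(
    r != h for r, h in zip(normalize(reference).split(), normalize(hypothesis).split())
)
print(raw_mismatches, norm_mismatches)
```

Without normalization, a perfect transcription of this sentence would still be penalized for casing and punctuation alone.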

In [None]:
import jiwer  # WER metric
import numpy as np
import soundfile as sf
import re

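# read the audio and convert it to raw 16-bit PCM bytes, as expected by Vosk
def read_audio(file_path):
    # soundfile returns float samples in [-1.0, 1.0]; the file's sample rate is not used here
    data, _sample_rate = sf.read(file_path)
    data = np.int16(data * 32767)
    data = data.tobytes()
    return data


# transcribe one audio file with the given Vosk recognizer
def transcript_audio(file_path, recognizer):
    audio_data = read_audio(file_path)
    # feed the whole file in one call, then read the recognized text
    recognizer.AcceptWaveform(audio_data)
    result = recognizer.Result()
    res_text = json.loads(result).get("text", "")
    return res_text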


# text normalization for WER
def normalize_for_wer(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text) 
    text = re.sub(r"\s+", " ", text).strip()
    return text


# calculate WER over all texts at once (not a mean of per-text values)
def calculate_wer(actual_texts, expected_texts):
    # normalization
    actual_texts = [normalize_for_wer(actual) for actual in actual_texts]
    expected_texts = [normalize_for_wer(expected) for expected in expected_texts]

    # jiwer expects the reference first and the hypothesis second
    wer_value = jiwer.wer(expected_texts, actual_texts)

    return wer_value

def get_wer(test_df, language, rec, is_debug=False):
    test_df_lang = [item for item in test_df if item["language"] == language]
    actual_texts = [transcript_audio(f"{AUDIO_ROOT}{item['id']}.wav", rec) for item in test_df_lang]
    expected_texts = [item["text"] for item in test_df_lang]
    if is_debug:
        print(actual_texts[0], expected_texts[0])
    wer = calculate_wer(actual_texts, expected_texts)
    return wer

In [None]:
# test data
DATASET_PATH = "../tests/audio_data/audio_testing_data.json"
AUDIO_ROOT = "../tests/audio_data/audio/"

# Load the test data
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    test_df = json.load(f)["data"]


In [None]:
# test case 
test_id = "en_001"
test_data = next(item for item in test_df if item["id"] == test_id)
expected_text = test_data["text"]

print(expected_text)
actual_text = "you hello what seems to be the problem our dog is limping all right let me take a look have you given any medication just some iodine i see i recommend car profaned and a cooling gel keep the dog come thankyou goodbye"

wer = jiwer.wer(expected_text, actual_text)  # reference first, then hypothesis
wer

### 3.2 Vosk English

In [None]:
from vosk import Model, KaldiRecognizer, SetLogLevel

# Disable Vosk logs
SetLogLevel(-1)

MODEL_DIR = "../models/audio_processing/vosk-model-small-en-us-0.15"

sample_rate = 16000
language = "en"

# 1. Model loading
model = Model(MODEL_DIR)

# 2. Recognizer
rec = KaldiRecognizer(model, sample_rate)

# 3. Calculate WER on English data set
wer = get_wer(test_df, language, rec)
print(f"WER for {language} is {wer}")

A WER of 0.19 can serve as the baseline for English.

### 3.3 Vosk Polish

In [None]:
from vosk import Model, KaldiRecognizer, SetLogLevel

# Disable Vosk logs
SetLogLevel(-1)

MODEL_DIR = "../models/audio_processing/vosk-model-small-pl-0.22"

sample_rate = 16000
language = "pl"

# 1. Model loading
model = Model(MODEL_DIR)

# 2. Recognizer
rec = KaldiRecognizer(model, sample_rate)

# 3. Calculate WER on Polish data set
wer = get_wer(test_df, language, rec)
print(f"WER for {language} is {wer}")

### 3.4 Vosk Russian

In [None]:
from vosk import Model, KaldiRecognizer, SetLogLevel

# Disable Vosk logs
SetLogLevel(-1)

MODEL_DIR = "../models/audio_processing/vosk-model-small-ru-0.22"

sample_rate = 16000
language = "ru"

# 1. Model loading
model = Model(MODEL_DIR)

# 2. Recognizer
rec = KaldiRecognizer(model, sample_rate)

# 3. Calculate WER on Russian data set
wer = get_wer(test_df, language, rec, is_debug=False)
print(f"WER for {language} is {wer}")

## 4. Apply to a real recording
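
The recognizer in this section is created with `sample_rate = 16000`, and `read_audio` passes the samples through without resampling or channel mixing. For recordings from an arbitrary source it may therefore be worth checking the file format first. The sketch below uses only the standard-library `wave` module; `describe_wav` and `matches_recognizer` are hypothetical helpers, and the demo file is synthetic silence standing in for a real recording.

```python
import os
import tempfile
import wave


def describe_wav(path: str) -> dict:
    """Read basic format information from a PCM WAV file header."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate": wav.getframerate(),
            "channels": wav.getnchannels(),
        }


def matches_recognizer(info: dict, expected_rate: int = 16000) -> bool:
    # The sample rate must match the recognizer, and stereo data would be
    # interleaved, so only mono input is accepted in this sketch
    return info["sample_rate"] == expected_rate and info["channels"] == 1


# Demo on a synthetic one-second silent file (stands in for a real recording)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info, matches_recognizer(info))
```

If the check fails, the recording would need to be resampled or downmixed before transcription, or the recognizer created with the file's actual sample rate.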

In [None]:

MODEL_DIR = "../models/audio_processing/vosk-model-small-en-us-0.15"

recording_path = "../data/Recordings/recording_21_20251230_160503.wav"

sample_rate = 16000
language = "en"

# 1. Model loading
model = Model(MODEL_DIR)

# 2. Recognizer
rec = KaldiRecognizer(model, sample_rate)

text = transcript_audio(recording_path, rec)
text