# Audio Generation

The Speech-to-Text (STT) model is one of the components of the project. For testing purposes, it is useful to have a test dataset consisting of (audio, text) pairs. This allows us to feed audio samples into the model and compare the generated text with the expected output.
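
The comparison between generated and expected text could be scored with a word error rate (WER). The block below is a minimal sketch, not the project's evaluation code: the function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Hypothetical helper: normalization is limited to lowercasing and
    whitespace splitting, which is an assumption, not a project rule.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Empty reference: any hypothesis word counts as a full error.
        return 0.0 if not hyp else 1.0

    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("turn on the light", "turn off the light"))
```

A score of 0.0 means the transcript matches the expected text word for word; higher values mean more substitutions, insertions, or deletions.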

While the text data can be generated with ChatGPT, generating a large set of audio files with it is not currently feasible. This notebook therefore uses the `gTTS` (Google Text-to-Speech) library to generate audio files from the test texts produced by ChatGPT.

## 1. Configure

In [None]:
# !pip install gTTS pydub

`ffmpeg` is also required, because `pydub` relies on it to decode MP3 files. On macOS it can be installed with Homebrew:

In [None]:
# !brew install ffmpeg

## 2. Text to Speech

In [None]:
import json
import os
from pathlib import Path
from gtts import gTTS
from pydub import AudioSegment


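def text_to_wav(text, lang, wav_path):
    """Synthesize `text` with gTTS and save it as a 16 kHz mono WAV file."""
    # Swap only the file extension, so a ".wav" elsewhere in the path is left alone
    mp3_path = str(Path(wav_path).with_suffix(".mp3"))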

    # TTS to mp3
    tts = gTTS(text=text, lang=lang)
    tts.save(mp3_path)

    # MP3 -> WAV 16 kHz mono
    audio = AudioSegment.from_mp3(mp3_path)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(wav_path, format="wav")

    # remove tmp MP3
    # os.remove(mp3_path)
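
Since `text_to_wav` converts the audio to 16 kHz mono, a quick sanity check on a generated file can catch conversion problems early. The sketch below uses only the standard-library `wave` module; the helper name `check_wav_format` and its return shape are assumptions made for illustration.

```python
import wave


def check_wav_format(wav_path, expected_rate=16000, expected_channels=1):
    """Read a PCM WAV header and compare it with the expected format.

    Hypothetical helper, not part of the notebook's pipeline.
    """
    with wave.open(str(wav_path), "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        duration_s = wav_file.getnframes() / rate
    return {
        "ok": rate == expected_rate and channels == expected_channels,
        "rate": rate,
        "channels": channels,
        "duration_s": duration_s,
    }
```

Calling `check_wav_format(wav_path)` on a file produced by `text_to_wav` should report `"ok": True`; a very small `duration_s` may indicate that the synthesis produced almost no audio.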

In [None]:

DATASET_PATH = "../tests/audio_data/audio_testing_data.json"
AUDIO_ROOT = "../tests/audio_data/audio/"

# Load the test data
with open(DATASET_PATH, "r", encoding="utf-8") as f:
    entries = json.load(f)["data"]
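
The generation loop reads the `id`, `language`, and `text` fields of each entry and looks the language up in `LANG_MAP`, so a malformed entry would stop the run partway through. The following is a hedged sketch of a pre-check; the helper name `find_invalid_entries` and the `REQUIRED_FIELDS` constant are hypothetical, and the field list is inferred from how the loop uses each entry.

```python
# Fields the generation loop reads from each entry (inferred from the loop itself).
REQUIRED_FIELDS = ("id", "language", "text")


def find_invalid_entries(entries, supported_langs):
    """Return (index, reason) pairs for entries the loop could not process."""
    problems = []
    for index, entry in enumerate(entries):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            problems.append((index, f"missing fields: {', '.join(missing)}"))
        elif entry["language"] not in supported_langs:
            problems.append((index, f"unsupported language: {entry['language']}"))
        elif not str(entry["text"]).strip():
            problems.append((index, "empty text"))
    return problems


# Example with made-up entries; the second one uses an unsupported language.
sample = [
    {"id": "a1", "language": "en", "text": "Hello"},
    {"id": "a2", "language": "de", "text": "Hallo"},
]
print(find_invalid_entries(sample, {"ru", "en", "pl"}))
```

Passing the loaded `entries` and the set of supported language codes gives a list of problems to fix in the JSON file before any audio is generated.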

In [None]:

import tqdm 
LANG_MAP = {
    "ru": "ru",
    "en": "en",
    "pl": "pl"
}

# Main loop
for entry in tqdm.tqdm(entries):
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    wav_path = f"{AUDIO_ROOT}{entry['id']}.wav"
    lang = entry["language"]
    text = entry["text"]

    Path(os.path.dirname(wav_path)).mkdir(parents=True, exist_ok=True)

    if os.path.exists(wav_path):
        print(f"[SKIP] {wav_path}")
        continue

    print(f"[GEN] {wav_path}  lang={lang}")

    text_to_wav(text, LANG_MAP[lang], wav_path)

print("\n✔ Done! All WAV files generated.")