# [Speech to Text] Create Custom Speech Model for Vietnamese Language
This sample demonstrates how to create a Custom Speech model by calling the REST API.

> ✨ ***Note*** <br>
> Before you start, check whether custom speech supports your language: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt#:~:text=Custom%20speech%20support

## Prerequisites
Configure a Python virtual environment with Python 3.10 or later:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Use version 3.10 or later.

```bash
pip install -r requirements.txt
```

Create an .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.
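Before running the cells below, it can help to confirm that the .env file defines every setting the notebook reads. The following is a minimal sketch: the variable names are the ones read with `os.getenv()` later in this notebook, while `missing_env_vars` and `REQUIRED_VARS` are hypothetical names that are not part of `utils.common`.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Settings that this notebook reads with os.getenv()
REQUIRED_VARS = [
    "AZURE_AI_SPEECH_API_KEY",
    "AZURE_AI_SPEECH_REGION",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
    "AZURE_STORAGE_CONTAINER_NAME",
]


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]
```

Calling `missing_env_vars(REQUIRED_VARS)` after `load_dotenv()` returns an empty list when everything is configured.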

## 1. Test STT (Speech to Text) of Azure AI Speech with the synthetic dataset

In [1]:
import azure.cognitiveservices.speech as speechsdk
import os
import json
from openai import AzureOpenAI
import requests
from dotenv import load_dotenv
from utils.common import *

load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

# Restore the dataset path stored by the previous notebook.
# The name is left undefined when nothing was stored, so the check below can detect it.
%store -r train_dataset_path
try:
    train_dataset_path
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the previous notebook again.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

train_dataset_path

{'train_dataset/train_vi-VN_20241129075917.zip'}

In [3]:
import requests
import time
import json

# Base URL for the Speech Services REST API
base_url = f'https://{speech_region}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': speech_key,
    'Content-Type': 'application/json'
}

In [4]:
def speech_recognition_from_file(file_path: str, lang: str) -> str:
    """Transcribe a single utterance from an audio file with the base STT model."""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region, speech_recognition_language=lang)
    audio_config = speechsdk.AudioConfig(filename=file_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # recognize_once_async() returns after the first recognized utterance
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    return speech_recognition_result.text

### Get the sorted wav files from the dataset folder

In [5]:
import os
from IPython.display import Audio, display

output_folder = 'synthetic_data'
files = os.listdir(output_folder)
wav_files = [file for file in files if file.endswith('.wav')]

# Sort wav_files by their numeric file name prefix in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_vi-VN_20241129075815.wav',
 '2_vi-VN_20241129075815.wav',
 '3_vi-VN_20241129075816.wav',
 '4_vi-VN_20241129075816.wav',
 '5_vi-VN_20241129075816.wav',
 '6_vi-VN_20241129075817.wav',
 '7_vi-VN_20241129075817.wav',
 '8_vi-VN_20241129075817.wav',
 '9_vi-VN_20241129075818.wav',
 '10_vi-VN_20241129075818.wav']
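The integer sort key matters because a plain string sort would place `10_...` before `1_...` and `2_...`. A small illustration, using hypothetical file names that follow the same pattern:

```python
names = ["10_vi-VN_a.wav", "2_vi-VN_b.wav", "1_vi-VN_c.wav"]


def numeric_prefix(file_name: str) -> int:
    """Return the integer in front of the first underscore."""
    return int(file_name.split("_")[0])


# String comparison looks at characters one by one, so '10_' sorts before '1_'
lexicographic = sorted(names)
# Comparing the parsed integers gives the intended 1, 2, 10 order
numeric = sorted(names, key=numeric_prefix)

print(lexicographic)  # ['10_vi-VN_a.wav', '1_vi-VN_c.wav', '2_vi-VN_b.wav']
print(numeric)        # ['1_vi-VN_c.wav', '2_vi-VN_b.wav', '10_vi-VN_a.wav']
```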

In [6]:
for wav_file in wav_files[0:3]:
    print(speech_recognition_from_file(os.path.join(output_folder,wav_file), "vi-VN"))

Chào bộ phận chăm sóc khách hàng RJ, tôi muốn hỏi về sản phẩm của công ty.
Tôi gặp vấn đề với máy giặt lg, có thể bạn giúp tôi khắc phục không?
Làm thế nào để tôi tắt chế độ chạy tự động trên máy giặt lg?


## 2. Upload training datasets 
- You can upload datasets for training, qualitative inspection, and quantitative measurement.
- This lab covers two types of training and testing data for custom speech: acoustic and plain text.
- For the other data types, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train

### Create a project

In [7]:
display_name = "My Custom Model Training And Evaluation Project"
description = "Project for training and evaluating the Vietnamese base model"
locale = "vi-VN"
project_id = create_project(base_url, headers, display_name, description, locale)

Project created with ID: ddd89716-eb27-4e95-868e-91588348a699
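`create_project` is imported from `utils.common`, whose source is not shown here. As a rough sketch of what such a helper might do against the v3.2 REST API (the version that appears in the outputs of this notebook), the code below posts the project definition and reads the new ID from the `self` URL of the response. The names `create_project_sketch`, `id_from_self_url`, and the injectable `post` parameter are assumptions made for illustration, not the actual helper.

```python
from typing import Any, Callable, Dict, Optional


def id_from_self_url(self_url: str) -> str:
    """Return the last path segment of a resource's 'self' URL."""
    return self_url.rstrip("/").rsplit("/", 1)[-1]


def create_project_sketch(
    base_url: str,
    headers: Dict[str, str],
    display_name: str,
    description: str,
    locale: str,
    post: Optional[Callable[..., Any]] = None,
) -> str:
    """POST a project definition and return the ID of the created project."""
    if post is None:
        import requests  # imported lazily so a stub can be injected instead

        post = requests.post
    payload = {
        "displayName": display_name,
        "description": description,
        "locale": locale,
    }
    response = post(f"{base_url}/v3.2/projects", headers=headers, json=payload)
    response.raise_for_status()
    return id_from_self_url(response.json()["self"])
```

Passing `post` explicitly makes it possible to exercise the function with a stub instead of a live endpoint.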


In [8]:
# Store the project_id for later use
%store project_id

Stored 'project_id' (str)


### Upload the acoustic dataset to storage (zip files)

In [9]:
data_folder = "train_dataset"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['4-HD', '3-HD', '7-HD', 'train_vi-VN_20241111152003', '5-HY', '6-HD', 'train_vi-VN_20241129075917', '2-HY', '2-HD', '6-KA', '7-KA', '8-HD', '9-HD', '8-KA', '4-HY', '5-HD', '3-HY', '9-KA', '1-HY', '1-HD']
url: {'4-HD': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/4-HD.zip?se=2024-11-29T16%3A02%3A48Z&sp=r&sv=2025-01-05&sr=b&sig=CZ9pHu8ikbSyufIfNBrygBHzfEeifa3qo1dJmf0WWnQ%3D', '3-HD': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/3-HD.zip?se=2024-11-29T16%3A02%3A48Z&sp=r&sv=2025-01-05&sr=b&sig=AMt5hqyDAo9VuAqj9BEfKnMwOQaXhtKbdx/smZ0WJ88%3D', '7-HD': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/7-HD.zip?se=2024-11-29T16%3A02%3A48Z&sp=r&sv=2025-01-05&sr=b&sig=v4w%2B8uLEFeH/zizQ2m7yCdrFtlmo38QZW5tTCKbJncQ%3D', 'train_vi-VN_20241111152003': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/train_vi-VN_20241111152003.zip?se=2024-11-29T16%3A02%3A49Z&sp=r&sv=2025-01
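Each returned URL carries a read-only SAS token whose `se` query parameter is the expiry time, so a dataset has to be created before that time passes. The sketch below checks the expiry with the standard library only. It assumes the timestamp format seen in the output above, and `sas_expiry` and `sas_is_valid` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def sas_expiry(sas_url: str) -> datetime:
    """Parse the 'se' (signed expiry) query parameter of a SAS URL as UTC."""
    query = parse_qs(urlparse(sas_url).query)
    expiry_text = query["se"][0]  # URL-decoded, e.g. '2024-11-29T16:02:48Z'
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def sas_is_valid(
    sas_url: str,
    now: Optional[datetime] = None,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if the SAS URL stays valid for longer than the margin."""
    now = datetime.now(timezone.utc) if now is None else now
    return sas_expiry(sas_url) - now > margin


# Hypothetical URL with a placeholder account and signature
example = (
    "https://example.blob.core.windows.net/container/sample.zip"
    "?se=2024-11-29T16%3A02%3A48Z&sp=r&sr=b&sig=placeholder"
)
print(sas_expiry(example))  # 2024-11-29 16:02:48+00:00
```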

### Create datasets with the uploaded acoustic dataset

In [10]:
kind = "Acoustic"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"

zip_dataset_dict = {}

# Create one dataset per uploaded zip file, using the file name as the display name
for dataset_name in uploaded_files:
    zip_dataset_dict[dataset_name] = create_dataset(base_url, headers, project_id, url[dataset_name], kind, dataset_name, description, locale)

Dataset created with ID: adc1a790-9a33-4ec1-96db-fa8be194d5fe
Dataset created with ID: 6410655d-2d52-4f01-b281-fa3e35329493
Dataset created with ID: 40c292f6-6722-42d1-8d64-c0d5dbb68583
Dataset created with ID: 66b8215c-1044-4d13-b92c-ae1137298b0d
Dataset created with ID: f399f516-90d3-4656-b2ca-312a0d451056
Dataset created with ID: ada37ee3-5b0e-4dd6-a33d-7bfd137d0a11
Dataset created with ID: 72f422b1-7f0d-46b0-836c-9d980c296edc
Dataset created with ID: b054b3a9-0972-4996-89d4-84a02d9a4df1
Dataset created with ID: a9835575-ab58-4c4b-91df-e16166fb1028
Dataset created with ID: 43b7e8fc-0399-4404-a7fd-e0a834d1dc29
Dataset created with ID: 03f02dbf-c899-4cce-a90c-fb589f73a692
Dataset created with ID: cc2aa145-2152-4259-8680-d97a32d3fc78
Dataset created with ID: 2deaadab-52e4-42e2-aa36-37550b4ff663
Dataset created with ID: 691eb575-db84-4f5c-b93c-9da8fd894f0c
Dataset created with ID: 0045e41b-7016-4dc3-a26d-20b838aea3f0
Dataset created with ID: 9a5f2f05-a532-492f-b79b-6e4aa0ef7067
Dataset 
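`create_dataset` also lives in `utils.common`. A plausible request body for the v3.2 `datasets` endpoint is sketched below. The field names follow the v3.2 REST API, while `build_dataset_payload` is a hypothetical helper that only builds the dictionary, leaving the HTTP call to the caller.

```python
from typing import Any, Dict


def build_dataset_payload(
    base_url: str,
    project_id: str,
    content_url: str,
    kind: str,
    display_name: str,
    description: str,
    locale: str,
) -> Dict[str, Any]:
    """Build the JSON body for creating a dataset from a SAS content URL."""
    return {
        "kind": kind,  # "Acoustic" or "Language" in this notebook
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "contentUrl": content_url,
        # Link the dataset to the project through the project's 'self' URL
        "project": {"self": f"{base_url}/v3.2/projects/{project_id}"},
    }


# The body would then be posted, for example:
# requests.post(f"{base_url}/v3.2/datasets", headers=headers, json=payload)
```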

### Upload the plain text dataset to storage (text files)

In [11]:
data_folder = "plain_text"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['vi-VN_20241129075732']
url: {'vi-VN_20241129075732': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/vi-VN_20241129075732.txt?se=2024-11-29T16%3A02%3A57Z&sp=r&sv=2025-01-05&sr=b&sig=NJiR8wPcUyZ6130%2B8IS8OjTdFFciGo0sxEPB2/9xFvo%3D'}


### Create datasets with the uploaded plain text dataset

In [12]:
kind = "Language"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"

plain_dataset_dict = {}

# Create one dataset per uploaded text file, using the file name as the display name
for dataset_name in uploaded_files:
    plain_dataset_dict[dataset_name] = create_dataset(base_url, headers, project_id, url[dataset_name], kind, dataset_name, description, locale)

Dataset created with ID: 035d7e26-7698-40a3-aeee-8f692fd73be0


## 3. Train Custom Speech Models with the uploaded datasets

> ✨ ***Note*** <br>
> Check which adaptation types the base model supports in the base model information. <br>
> For example, the Italian base model 20230111 (2e5e70f1-960b-4509-a7c5-102b29227c0b) supports 'Language', 'LanguageMarkdown', 'Pronunciation', and 'OutputFormatting' adaptation. <br>
> See the supportsAdaptationsWith feature of the base_model object.

In [13]:
# Look up the base model ID on the "Train a new model" page in Azure Speech Studio.
# Base model IDs vary by language.
base_model_id = "8066b5fb-0114-4837-90b6-0c245928a896"  # Vietnamese base model id
base_model = get_base_model(base_url, headers, base_model_id)
base_model

{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896',
 'links': {'manifest': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896/manifest'},
 'properties': {'deprecationDates': {'adaptationDateTime': '2025-01-15T00:00:00Z',
   'transcriptionDateTime': '2025-04-15T00:00:00Z'},
  'features': {'supportsTranscriptions': True,
   'supportsEndpoints': True,
   'supportsTranscriptionsOnSpeechContainers': False,
   'supportsAdaptationsWith': ['Language'],
   'supportedOutputFormats': ['Display', 'Lexical']},
  'chargeForAdaptation': False},
 'lastActionDateTime': '2023-01-31T13:08:53Z',
 'status': 'Succeeded',
 'createdDateTime': '2023-01-31T12:16:46Z',
 'locale': 'vi-VN',
 'displayName': '20230111',
 'description': 'vi-VN base model'}
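The `supportsAdaptationsWith` list in this response can be checked in code before a training job is submitted. The sketch below reads it from a dictionary shaped like the output above. `supported_adaptations` and `supports_adaptation` are hypothetical helper names.

```python
from typing import Any, Dict, List


def supported_adaptations(base_model: Dict[str, Any]) -> List[str]:
    """Return the dataset kinds the base model accepts for adaptation."""
    features = base_model.get("properties", {}).get("features", {})
    return list(features.get("supportsAdaptationsWith", []))


def supports_adaptation(base_model: Dict[str, Any], kind: str) -> bool:
    """Return True if the base model can be adapted with the given kind."""
    return kind in supported_adaptations(base_model)


# Trimmed copy of the response shown above
example_model = {
    "properties": {"features": {"supportsAdaptationsWith": ["Language"]}},
    "locale": "vi-VN",
}
print(supports_adaptation(example_model, "Language"))  # True
```

A kind that is missing from the list makes `supports_adaptation` return `False`.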

### Train the custom speech model with plain text datasets (txt)

In [14]:
display_name = "vi_custom_model_with_plain_text"
description = "Custom model training with plain text dataset"
locale = "vi-VN"
custom_model_with_plain_id = create_custom_model(base_url, headers, project_id, base_model_id, list(plain_dataset_dict.values()), display_name, description, locale)

{'displayName': 'vi_custom_model_with_plain_text', 'description': 'Custom model training with plain text dataset', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/035d7e26-7698-40a3-aeee-8f692fd73be0'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/ddd89716-eb27-4e95-868e-91588348a699'}}


custom model job created with ID: cc932fe1-f7a2-49d4-b8e8-c7c836d67297
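The dictionary printed above is the request body sent to the `models` endpoint. The following sketch rebuilds that body from plain IDs. `build_model_payload` is a hypothetical helper; the structure is taken from the printed output, not from the source of `create_custom_model`.

```python
from typing import Any, Dict, List


def build_model_payload(
    base_url: str,
    project_id: str,
    base_model_id: str,
    dataset_ids: List[str],
    display_name: str,
    description: str,
    locale: str,
) -> Dict[str, Any]:
    """Build the JSON body for training a custom model from dataset IDs."""
    api = f"{base_url}/v3.2"
    return {
        "displayName": display_name,
        "description": description,
        "locale": locale,
        # Every related resource is referenced by its 'self' URL
        "baseModel": {"self": f"{api}/models/base/{base_model_id}"},
        "datasets": [{"self": f"{api}/datasets/{d}"} for d in dataset_ids],
        "project": {"self": f"{api}/projects/{project_id}"},
    }
```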


### Train the custom speech model with acoustic datasets (zip)

In [15]:
display_name = "vi_custom_model_with_aocustic_dataset"
description = "Custom model training with acoustic dataset"
locale = "vi-VN"
custom_model_with_acoustic_id = create_custom_model(base_url, headers, project_id, base_model_id, list(zip_dataset_dict.values()), display_name, description, locale)

{'displayName': 'vi_custom_model_with_aocustic_dataset', 'description': 'Custom model training with acoustic dataset', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/adc1a790-9a33-4ec1-96db-fa8be194d5fe'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/6410655d-2d52-4f01-b281-fa3e35329493'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/40c292f6-6722-42d1-8d64-c0d5dbb68583'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/66b8215c-1044-4d13-b92c-ae1137298b0d'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/f399f516-90d3-4656-b2ca-312a0d451056'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotex

In [16]:
from tqdm import tqdm

# Poll the training status until the model reaches a terminal state
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status not in ("Succeeded", "Failed"):
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        # Fill the progress bar once a terminal state is reached
        while pbar.n < 3:
            pbar.update(1)
        # Report a failure explicitly instead of always printing "Completed"
        if status == "Succeeded":
            print("Training Completed")
        else:
            print(f"Training {status}")
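The polling loop in this cell runs until the service reports a terminal state. If a job might stall, a timeout can be layered on top. The sketch below is one way to do that, with the status lookup, sleep function, and clock passed in as parameters. `wait_for_terminal_status` is a hypothetical name, and the one-hour default is an arbitrary assumption.

```python
import time
from typing import Callable

TERMINAL_STATES = ("Succeeded", "Failed")


def wait_for_terminal_status(
    get_status: Callable[[], str],
    timeout_s: float = 3600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it returns a terminal state or time runs out."""
    deadline = clock() + timeout_s
    status = get_status()
    while status not in TERMINAL_STATES:
        if clock() >= deadline:
            raise TimeoutError(f"Status is still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
        status = get_status()
    return status
```

In this notebook it could be called as `wait_for_terminal_status(lambda: get_custom_model_status(base_url, headers, custom_model_id))`.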

### Monitor training status for each job

In [17]:
monitor_training_status(custom_model_with_plain_id)
monitor_training_status(custom_model_with_acoustic_id)

Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Running


Running Status:  67%|██████▋   | 2/3 [00:10<00:05,  5.05s/step]

Current Status: Running
Current Status: Running
Current Status: Running


Running Status: 100%|██████████| 3/3 [00:40<00:00, 13.41s/step]


Training Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 100.03step/s]

Training Completed





In [19]:
%store custom_model_with_plain_id
%store custom_model_with_acoustic_id

Stored 'custom_model_with_plain_id' (str)
Stored 'custom_model_with_acoustic_id' (str)
