# [STT] Create Custom Speech Model for Vietnamese Language
This sample demonstrates how to create a Custom Speech model by calling the REST API.

> ✨ ***Note*** <br>
> Please check the custom speech support for each language before you get started - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt#:~:text=Custom%20speech%20support 

## Prerequisites
Clone the repository to your local machine.

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

* A subscription key for the Speech service. See [Try the speech service for free](https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started).
* Python 3.5 or later needs to be installed; the virtual environment configured for this sample uses 3.10 or later. Downloads are available [here](https://www.python.org/downloads/).
* The Python Speech SDK package is available for Windows (x64 or x86) and Linux (x64; Ubuntu 16.04 or Ubuntu 18.04).
* On Ubuntu 16.04 or 18.04, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.0 libasound2
  ```
* On Debian 9, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.2 libasound2
  ```
* On Windows you need the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads) for your platform.

Configure a Python virtual environment for Python 3.10 or later (these steps use the Visual Studio Code Command Palette):
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Create the environment with version 3.10 or later.

```bash
pip install -r requirements.txt
```

Create a .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.

## Set up the environment

In [32]:
import azure.cognitiveservices.speech as speechsdk
import os
import json
from openai import AzureOpenAI
import requests
from dotenv import load_dotenv
from utils.rest_common import *

load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")
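If the .env file is missing or incomplete, `os.getenv` silently returns `None` and the failure only shows up later as an authentication error. A minimal sketch of an early check follows; the two variable names come from the cell above, while the helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the setup cell above.
REQUIRED_VARS = ("AZURE_AI_SPEECH_API_KEY", "AZURE_AI_SPEECH_REGION")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report which settings still need to be filled in, if any.
missing = missing_env_vars()
if missing:
    print("Missing settings in .env:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Run it right after `load_dotenv()` so that a missing key is reported before any request is sent.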

## Test STT (speech to text) capabilities in Azure AI Speech with the synthetic dataset

In [3]:
def from_file(file_path: str, lang: str) -> str:
    """Transcribe one audio file in the given locale and return the recognized text."""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region, speech_recognition_language=lang)
    audio_config = speechsdk.AudioConfig(filename=file_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # recognize_once_async() returns after the first recognized utterance
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    return speech_recognition_result.text

In [4]:
import os
from IPython.display import Audio, display

output_folder = 'output'
files = os.listdir(output_folder)
wav_files = [file for file in files if file.endswith('.wav')]

# Sort wav_files by the leading number in each file name, in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_vi-VN_20241105153542.wav',
 '2_vi-VN_20241105153547.wav',
 '3_vi-VN_20241105153551.wav',
 '4_vi-VN_20241105153556.wav',
 '5_vi-VN_20241105153600.wav',
 '6_vi-VN_20241105153605.wav',
 '7_vi-VN_20241105153609.wav',
 '8_vi-VN_20241105153614.wav',
 '9_vi-VN_20241105153619.wav',
 '10_vi-VN_20241105153624.wav']
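The numeric sort key matters because a plain string sort would place file 10 before file 1. A small illustration with made-up file names (the names below are hypothetical and only mimic the `<number>_<locale>_<suffix>.wav` pattern seen above):

```python
# Hypothetical file names following the pattern of the listing above.
names = ["10_vi-VN_b.wav", "2_vi-VN_a.wav", "1_vi-VN_c.wav"]


def leading_number(file_name: str) -> int:
    """Return the integer before the first underscore in the file name."""
    return int(file_name.split("_")[0])


# Default sort compares characters, so "10_" sorts before "1_".
lexicographic = sorted(names)
# Sorting on the parsed integer gives the intended 1, 2, 10 order.
numeric = sorted(names, key=leading_number)

print(lexicographic)
print(numeric)
```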

In [5]:
for wav_file in wav_files[0:3]:
    print(from_file(os.path.join(output_folder,wav_file), "vi-VN"))

Khi nào trung tâm liên lạc của lg mở cửa?
Tôi cần giúp đở với sản phẩm lg của mình.
Làm thế nào để liên hệ với dịch vụ khách hàng của lg?


## Test accuracy of the base speech model
- To learn how to quantitatively measure and improve the accuracy of the base speech to text model or your own custom models, see the following link:
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=speech-cli#create-a-test

To evaluate the word error rate (WER) of a base model in Azure AI's Speech service, follow these steps:

 1. Sign in to the Speech Studio: go to the Azure Speech Studio.
 1. Create a test:
    - Navigate to Custom speech and select your project.
    - Go to Test models and click Create new test.
    - Select Evaluate accuracy and click Next.
    - Choose an audio + human-labeled transcription dataset. If you don't have any datasets, upload them in the Speech datasets menu.
    - Select up to two models to evaluate, then click Next.
    - Enter the test name and description, then click Next.
    - Review the test details and click Save and close.
 1. Get the test results: when the test is complete (its status is Succeeded), you will see the results, including the WER for each tested model.
 1. Evaluate WER: WER is the sum of insertion, deletion, and substitution errors divided by the total number of words in the reference transcript, multiplied by 100 to get a percentage.

For more detailed instructions, refer to https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=rest-api.
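To make the WER definition concrete, here is a minimal sketch that computes it locally with a word-level edit distance. It is an illustration only: the function name `word_error_rate` is hypothetical, it splits on whitespace, and it does not apply any text normalization the Speech service may perform, so its numbers need not match the service's.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (insertions + deletions + substitutions) / reference words * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))
```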


In [10]:
import requests
import time
import json

# Base URL for the Speech Services REST API
base_url = f'https://{speech_region}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': speech_key,
    'Content-Type': 'application/json'
}

In [11]:
display_name = "My Evaluation Project1"
description = "Project for evaluating the Vietnamese base model"
locale = "vi-VN"
project_id = create_project(base_url, headers, display_name, description, locale)

Project created with ID: 1e33b3db-3382-448c-a54f-791f35cd18b7
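The helpers in `utils.rest_common` are not shown in this notebook. The responses printed further down identify each resource by a `self` URL whose last path segment is the ID, so a helper such as `create_project` presumably derives the ID from that link. A minimal sketch of that extraction, with the hypothetical name `resource_id`:

```python
from urllib.parse import urlparse


def resource_id(self_url: str) -> str:
    """Return the last path segment of a resource's `self` URL."""
    path = urlparse(self_url).path.rstrip("/")
    if not path:
        raise ValueError(f"URL has no path: {self_url!r}")
    return path.split("/")[-1]


# The URL shape matches the `self` links printed by later cells.
example = (
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext/"
    "v3.2/projects/1e33b3db-3382-448c-a54f-791f35cd18b7"
)
print(resource_id(example))
```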


In [12]:
# Go to the storage account, upload the zip file, and generate a SAS (Shared Access Signature) URL
zip_content_url = "https://aoaihub1storageaccount.blob.core.windows.net/stt-container/output_files.zip?sp=r&st=2024-11-05T06:46:06Z&se=2024-11-05T14:46:06Z&spr=https&sv=2022-11-02&sr=b&sig=AMEvZvQC8bd5QpQEXIDgPnbZpYZkqsXxPy1%2FSvl3Fmw%3D"
kind="Acoustic"
display_name = "acoustic dataset(zip) for training"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"
zip_dataset_id = create_evaluation_dataset(base_url, headers, project_id, zip_content_url, kind, display_name, description, locale)

Dataset created with ID: 2adb4f6c-129b-43b3-869e-b2fffaf78149
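The SAS URLs used here are time-limited: the `se` query parameter carries the expiry time, and the service can no longer download the file once it has passed. A minimal sketch for checking this before creating a dataset follows. The helper names and the example URL are hypothetical, and the parser assumes the `YYYY-MM-DDTHH:MM:SSZ` form seen in the URLs of this notebook.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def sas_expiry(sas_url: str) -> datetime:
    """Return the expiry time encoded in the `se` parameter as a UTC datetime."""
    query = parse_qs(urlparse(sas_url).query)
    if "se" not in query:
        raise ValueError("URL has no 'se' (signed expiry) parameter")
    expiry = datetime.strptime(query["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def is_expired(sas_url: str, now=None) -> bool:
    """Return True if the SAS URL's expiry time is at or before `now` (UTC)."""
    now = now or datetime.now(timezone.utc)
    return now >= sas_expiry(sas_url)


# Placeholder URL: account, container, and signature are not real.
sample_url = (
    "https://example.blob.core.windows.net/container/data.zip"
    "?sp=r&se=2024-11-05T14:46:06Z&spr=https&sig=placeholder"
)
print(sas_expiry(sample_url), is_expired(sample_url))
```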


In [24]:
plain_text_url = "https://aoaihub1storageaccount.blob.core.windows.net/stt-container/vi-VN_20241105115739.txt?sp=r&st=2024-11-05T08:15:47Z&se=2024-11-05T16:15:47Z&spr=https&sv=2022-11-02&sr=b&sig=HWNkj1akLcKScGEiDDzSZwoz%2Bmudn6LMVeLa5sXgyOQ%3D"
kind="Language"
display_name = "plain text dataset for training"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"
plain_dataset_id = create_evaluation_dataset(base_url, headers, project_id, plain_text_url, kind, display_name, description, locale)

Dataset created with ID: f5fda48b-8467-40e1-89f9-1df20eea23ad


## Train Custom Speech Model with the plain text dataset 

In [14]:
base_model_id = "8066b5fb-0114-4837-90b6-0c245928a896"  # base model ID, as listed when you train a new model
base_model = get_base_model(base_url, headers, base_model_id)
base_model

{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896',
 'links': {'manifest': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896/manifest'},
 'properties': {'deprecationDates': {'adaptationDateTime': '2025-01-15T00:00:00Z',
   'transcriptionDateTime': '2025-04-15T00:00:00Z'},
  'features': {'supportsTranscriptions': True,
   'supportsEndpoints': True,
   'supportsTranscriptionsOnSpeechContainers': False,
   'supportsAdaptationsWith': ['Language'],
   'supportedOutputFormats': ['Display', 'Lexical']},
  'chargeForAdaptation': False},
 'lastActionDateTime': '2023-01-31T13:08:53Z',
 'status': 'Succeeded',
 'createdDateTime': '2023-01-31T12:16:46Z',
 'locale': 'vi-VN',
 'displayName': '20230111',
 'description': 'vi-VN base model'}
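Note that the response above lists only `'Language'` under `supportsAdaptationsWith`, while this notebook also creates an `Acoustic` dataset. A minimal sketch of a check that flags such a mismatch before training starts is shown below. The function name is hypothetical, the dictionary shape is copied from the output above, and how the service itself treats an unsupported kind is not shown here.

```python
def supports_adaptation(base_model: dict, kind: str) -> bool:
    """Return True if the base model lists `kind` under supportsAdaptationsWith."""
    features = base_model.get("properties", {}).get("features", {})
    return kind in features.get("supportsAdaptationsWith", [])


# Trimmed copy of the base model response printed above.
base_model_example = {
    "properties": {"features": {"supportsAdaptationsWith": ["Language"]}},
    "locale": "vi-VN",
}

for kind in ("Language", "Acoustic"):
    print(kind, supports_adaptation(base_model_example, kind))
```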

### Create custom model with plain text dataset

In [25]:
display_name = "vi_custom_model_with_plain_text"
description = "Custom model for evaluating the Vietnamese base model"
locale = "vi-VN"
custom_model_with_plain_id = create_custom_model(base_url, headers, project_id, base_model_id, plain_dataset_id, display_name, description, locale)
custom_model_with_plain_id

{'displayName': 'vi_custom_model_with_plain_text', 'description': 'Custom model for evaluating the Vietnamese base model', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/f5fda48b-8467-40e1-89f9-1df20eea23ad'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/1e33b3db-3382-448c-a54f-791f35cd18b7'}}
custom model job created with ID: 921627ab-fa4a-4581-9484-48ba1021da2d


'921627ab-fa4a-4581-9484-48ba1021da2d'
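The printed request body shows how the model is linked to its base model, dataset, and project through `self` URLs. As a rough sketch, a body of that shape could be assembled as follows. The function names are hypothetical and this is not necessarily how `create_custom_model` in `utils.rest_common` is written; only the field names and URL layout are taken from the output above.

```python
def self_link(base_url: str, collection: str, resource_id: str, api_version: str = "v3.2") -> dict:
    """Build a {'self': url} reference to another resource."""
    return {"self": f"{base_url}/{api_version}/{collection}/{resource_id}"}


def build_custom_model_payload(base_url, project_id, base_model_id, dataset_id,
                               display_name, description, locale) -> dict:
    """Assemble a request body with the same fields as the one printed above."""
    return {
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "baseModel": self_link(base_url, "models/base", base_model_id),
        "datasets": [self_link(base_url, "datasets", dataset_id)],
        "project": self_link(base_url, "projects", project_id),
    }


payload = build_custom_model_payload(
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    "1e33b3db-3382-448c-a54f-791f35cd18b7",
    "8066b5fb-0114-4837-90b6-0c245928a896",
    "f5fda48b-8467-40e1-89f9-1df20eea23ad",
    "vi_custom_model_with_plain_text",
    "Custom model for evaluating the Vietnamese base model",
    "vi-VN",
)
print(payload["baseModel"]["self"])
```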

### Create custom model with acoustic dataset

In [16]:
display_name = "vi_custom_model_with_aocustic_dataset"
description = "Custom model for evaluating the Vietnamese base model"
locale = "vi-VN"
custom_model_with_zip_id = create_custom_model(base_url, headers, project_id, base_model_id, zip_dataset_id, display_name, description, locale)
custom_model_with_zip_id

{'displayName': 'vi_custom_model_with_aocustic_dataset', 'description': 'Custom model for evaluating the Vietnamese base model', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/2adb4f6c-129b-43b3-869e-b2fffaf78149'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/1e33b3db-3382-448c-a54f-791f35cd18b7'}}
custom model job created with ID: dc07376d-84f0-495d-a3bf-4055e5646cd9


'dc07376d-84f0-495d-a3bf-4055e5646cd9'

## Test accuracy of the trained Custom Speech model

In [26]:
from tqdm import tqdm

# Poll the custom model's training status until it reaches Succeeded or Failed
def monitor_custom_model_training_status(model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, model_id)
        while(pbar.n < 3):
            pbar.update(1)
        print("Custom Model Training Completed")
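The loop above prints the same completion message whether the final status is `Succeeded` or `Failed`, and it has no upper time limit. A minimal alternative sketch that returns the final status and gives up after a timeout is shown below. The function name and its parameters are hypothetical; the status getter is passed in so the same loop can serve both model training and evaluation.

```python
import time

# Terminal states used by the polling loops in this notebook.
TERMINAL_STATES = {"Succeeded", "Failed"}


def wait_for_terminal_status(get_status, interval_s=10.0, timeout_s=3600.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns a terminal state, then return that state.

    Raises TimeoutError if no terminal state is reached within timeout_s seconds.
    """
    deadline = clock() + timeout_s
    status = get_status()
    while status not in TERMINAL_STATES:
        if clock() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout_s} s")
        sleep(interval_s)
        status = get_status()
    return status


# Demonstration with a fake status source and no real waiting.
fake_statuses = iter(["NotStarted", "Running", "Succeeded"])
print(wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda s: None))
```

In the notebook, the getter could be `lambda: get_custom_model_status(base_url, headers, model_id)`, and the caller can then decide what to do when the returned status is `Failed`.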

In [27]:
# Wait for both custom models (plain text and acoustic) to finish training before evaluating them
monitor_custom_model_training_status(custom_model_with_plain_id)
monitor_custom_model_training_status(custom_model_with_zip_id)

Running Status:  33%|███▎      | 1/3 [00:01<00:02,  1.15s/step]

Current Status: Running


Running Status: 100%|██████████| 3/3 [00:12<00:00,  4.11s/step]


Custom Model Training Completed


Running Status: 100%|██████████| 3/3 [00:01<00:00,  2.67step/s]

Custom Model Training Completed





In [20]:
# Go to the storage account, upload the zip file, and generate a SAS (Shared Access Signature) URL
zip_content_url = "https://aoaihub1storageaccount.blob.core.windows.net/stt-container/train_vi-VN_20241105200114.zip?sp=r&st=2024-11-05T11:05:22Z&se=2024-11-05T19:05:22Z&spr=https&sv=2022-11-02&sr=b&sig=03%2FWOVUE%2FufqNLhEOBi585XNQELEf5cNv8RbOVSUVMA%3D"
kind="Acoustic"
display_name = "acoustic dataset(zip) for evaluation"
description = "Dataset for evaluating the Vietnamese base model"
locale = "vi-VN"
evaluation_dataset_id = create_evaluation_dataset(base_url, headers, project_id, zip_content_url, kind, display_name, description, locale)


Dataset created with ID: 170acd56-0374-4ec0-8740-aa6ce9bce978


In [28]:
print(custom_model_with_plain_id, custom_model_with_zip_id)
display_name = "vi_evaluation_with_acoustic_dataset"
description = "evaluate the Vietnamese base model"
locale = "vi-VN"

# Create the evaluation job: compare the two custom models on the acoustic evaluation dataset
evaluation_id = create_evaluation(base_url, headers, project_id, evaluation_dataset_id, custom_model_with_plain_id, custom_model_with_zip_id, display_name, description, locale)

921627ab-fa4a-4581-9484-48ba1021da2d dc07376d-84f0-495d-a3bf-4055e5646cd9
https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/evaluations
1e33b3db-3382-448c-a54f-791f35cd18b7 170acd56-0374-4ec0-8740-aa6ce9bce978 921627ab-fa4a-4581-9484-48ba1021da2d dc07376d-84f0-495d-a3bf-4055e5646cd9
{'model1': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/921627ab-fa4a-4581-9484-48ba1021da2d'}, 'model2': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/dc07376d-84f0-495d-a3bf-4055e5646cd9'}, 'dataset': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/170acd56-0374-4ec0-8740-aa6ce9bce978'}, 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/1e33b3db-3382-448c-a54f-791f35cd18b7'}, 'displayName': 'vi_evaluation_with_acoustic_dataset', 'description': 'evaluate the Vietnamese base model', 'locale': 'vi-VN'}
Evaluation job created wi

In [29]:
from tqdm import tqdm

# Poll the evaluation status until it reaches Succeeded or Failed
def monitor_evaluation_status(evaluation_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_evaluation_status(base_url, headers, evaluation_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_evaluation_status(base_url, headers, evaluation_id)
        while(pbar.n < 3):
            pbar.update(1)
        print("Evaluation Completed")

In [30]:
monitor_evaluation_status(evaluation_id)

Running Status:  33%|███▎      | 1/3 [00:01<00:02,  1.16s/step]

Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted


Running Status:  67%|██████▋   | 2/3 [01:08<00:39, 39.86s/step]

Current Status: Running


Running Status: 100%|██████████| 3/3 [01:19<00:00, 26.51s/step]

Evaluation Completed





In [None]:
print_evaluation_results(base_url, headers, evaluation_id)

WER not available.
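When the helper reports that WER is not available, it can help to inspect the evaluation response directly. The sketch below assumes the evaluation JSON exposes the rates as `wordErrorRate1` and `wordErrorRate2` under `properties`; treat those key names, and the function name `extract_wer`, as assumptions to verify against the actual response. The numbers in the example are placeholders, not measurements from this notebook.

```python
def extract_wer(evaluation: dict) -> dict:
    """Return the WER of model1 and model2, or None where the value is absent."""
    # Assumed key names; check them against the real evaluation response.
    props = evaluation.get("properties") or {}
    return {
        "model1": props.get("wordErrorRate1"),
        "model2": props.get("wordErrorRate2"),
    }


# Placeholder responses for illustration only.
finished = {"status": "Succeeded", "properties": {"wordErrorRate1": 12.5, "wordErrorRate2": 10.0}}
unfinished = {"status": "Running"}

print(extract_wer(finished))
print(extract_wer(unfinished))
```

A `None` value for a model whose evaluation status is `Succeeded` suggests looking at the rest of the response, for example any error details it contains, rather than rerunning the evaluation blindly.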
