# [Speech to Text] Create Custom Speech Model for Vietnamese Language
This sample demonstrates how to create a Custom Speech model by calling the REST API.

> ✨ ***Note*** <br>
> Check the Custom Speech support for each language before you get started: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt#:~:text=Custom%20speech%20support 

## Prerequisites
Clone the repository to your local machine.

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

* A subscription key for the Speech service. See [Try the speech service for free](https://docs.microsoft.com/azure/cognitive-services/speech-service/get-started).
* Python 3.10 or later needs to be installed (the virtual environment described below is created with 3.10 or later). Downloads are available [here](https://www.python.org/downloads/).
* The Python Speech SDK package is available for Windows (x64 or x86) and Linux (x64; Ubuntu 16.04 or Ubuntu 18.04).
* On Ubuntu 16.04 or 18.04, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.0 libasound2
  ```
* On Debian 9, run the following commands for the installation of required packages:
  ```sh
  sudo apt-get update
  sudo apt-get install libssl1.0.2 libasound2
  ```
* On Windows you need the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads) for your platform.

Configure a Python virtual environment for 3.10 or later in Visual Studio Code:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv / Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Create with version 3.10 or later.

```bash
pip install -r requirements.txt
```

Create an .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.
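
The notebook cells in the following sections read their settings with `os.getenv`. As a minimal sketch (the helper name `missing_settings` is hypothetical and not part of the repository), a quick check like this can flag unset values before any API call is made:

```python
import os

# Variable names taken from the os.getenv calls used in this notebook.
REQUIRED_SETTINGS = [
    "AZURE_AI_SPEECH_API_KEY",
    "AZURE_AI_SPEECH_REGION",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
    "AZURE_STORAGE_CONTAINER_NAME",
]


def missing_settings(env=None):
    """Return the required names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example with an in-memory mapping instead of the real environment.
example_env = {"AZURE_AI_SPEECH_API_KEY": "dummy", "AZURE_AI_SPEECH_REGION": ""}
print(missing_settings(example_env))
```

Calling `missing_settings()` with no argument after `load_dotenv()` checks the real environment.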

## 1. Test STT (Speech to Text) of Azure AI Speech with the synthetic dataset

In [18]:
import azure.cognitiveservices.speech as speechsdk
import os
import json
from openai import AzureOpenAI
import requests
from dotenv import load_dotenv
from utils.common import *

load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

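# Restore the dataset path saved by the previous notebook.
# %store -r leaves this default untouched when nothing was stored,
# so check for an empty value rather than for a NameError.
train_dataset_path = ""
%store -r train_dataset_path
if not train_dataset_path:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the previous notebook again.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")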

train_dataset_path

{'train_dataset\\train_vi-VN_20241110220748.zip'}

In [19]:
import requests
import time
import json

# Base URL for the Speech Services REST API
base_url = f'https://{speech_region}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': speech_key,
    'Content-Type': 'application/json'
}

In [20]:
def speech_recognition_from_file(file_path: str, lang: str) -> str:
    """Transcribe a single utterance from an audio file in the given language."""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region, speech_recognition_language=lang)
    audio_config = speechsdk.AudioConfig(filename=file_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # recognize_once_async returns after a single utterance; .get() blocks until the result is ready.
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    return speech_recognition_result.text

### Get the sorted wav files from the dataset folder

In [21]:
import os
from IPython.display import Audio, display

output_folder = 'synthetic_data'
files = os.listdir(output_folder)
wav_files = [file for file in files if file.endswith('.wav')]

# Sort wav_files by their leading numeric index in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_vi-VN_20241110220608.wav',
 '2_vi-VN_20241110220612.wav',
 '3_vi-VN_20241110220616.wav',
 '4_vi-VN_20241110220620.wav',
 '5_vi-VN_20241110220624.wav',
 '6_vi-VN_20241110220629.wav',
 '7_vi-VN_20241110220632.wav',
 '8_vi-VN_20241110220636.wav',
 '9_vi-VN_20241110220640.wav',
 '10_vi-VN_20241110220644.wav',
 '11_vi-VN_20241110220648.wav',
 '12_vi-VN_20241110220652.wav',
 '13_vi-VN_20241110220655.wav',
 '14_vi-VN_20241110220659.wav',
 '15_vi-VN_20241110220704.wav',
 '16_vi-VN_20241110220708.wav',
 '17_vi-VN_20241110220711.wav',
 '18_vi-VN_20241110220715.wav',
 '19_vi-VN_20241110220719.wav',
 '20_vi-VN_20241110220722.wav']
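
The integer sort key matters because a plain string sort would place `10_...` before `2_...`. A small illustration with shortened, made-up file names that follow the same pattern:

```python
# Shortened, made-up names in the "<index>_<locale>_<suffix>.wav" pattern.
names = ["10_vi-VN_a.wav", "2_vi-VN_b.wav", "1_vi-VN_c.wav"]

# A plain string sort compares character by character, so "10" sorts before "2".
lexicographic = sorted(names)

# Sorting on the integer prefix restores the intended order.
numeric = sorted(names, key=lambda name: int(name.split("_")[0]))

print(lexicographic)
print(numeric)
```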

In [22]:
for wav_file in wav_files[0:3]:
    print(speech_recognition_from_file(os.path.join(output_folder,wav_file), "vi-VN"))

Tôi có thể được trợ giúp bằng tiếng anh không?
Máy giặt của tôi không hoạt động, tôi phải làm gì?
Lg có chính sách bảo hành bao lâu?


## 2. Upload training datasets 
- You can upload datasets for training, qualitative inspection, and quantitative measurement. 
- This lab covers two types of training and testing data that you can use for custom speech: Acoustic and Plain text. 
- For the other data types, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train

### Create a project

In [None]:
display_name = "My Custom Model Training And Evaluation Project"
description = "Project for training and evaluating the Vietnamese base model"
locale = "vi-VN"
project_id = create_project(base_url, headers, display_name, description, locale)

Project created with ID: 8b0eb25b-38f1-4f7b-8cfe-641753cd9cd1
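
`create_project` comes from `utils/common.py` and is not shown in this notebook. The following is a minimal sketch of the request such a helper might build, assuming the `v3.2` path that appears in the URLs printed later in this notebook. The names `build_project_request` and `id_from_self_url` are illustrative and may differ from the repository's implementation:

```python
API_VERSION = "v3.2"  # assumption: taken from the URLs printed later in this notebook


def build_project_request(base_url, display_name, description, locale):
    """Return (url, json_body) for creating a Custom Speech project."""
    url = f"{base_url}/{API_VERSION}/projects"
    body = {"displayName": display_name, "description": description, "locale": locale}
    return url, body


def id_from_self_url(self_url):
    """Return the last path segment of a resource's `self` URL."""
    return self_url.rstrip("/").rsplit("/", 1)[-1]


url, body = build_project_request(
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    "My Custom Model Training And Evaluation Project",
    "Project for training and evaluating the Vietnamese base model",
    "vi-VN",
)
print(url)
print(id_from_self_url(url + "/8b0eb25b-38f1-4f7b-8cfe-641753cd9cd1"))
```

The request could then be sent with `requests.post(url, headers=headers, json=body)`, and the project ID read from the `self` field of the JSON response.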


In [24]:
# Store the project_id for later use
%store project_id

Stored 'project_id' (str)


### Upload the acoustic dataset to storage (zip files)

In [25]:
data_folder = "train_dataset"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['train_vi-VN_20241110220748']
url: {'train_vi-VN_20241110220748': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/train_vi-VN_20241110220748.zip?se=2024-11-10T21%3A09%3A11Z&sp=r&sv=2024-11-04&sr=b&sig=UP3EYV%2BJ/1h4XrVy9rlTc0LCSEYwW4PQIu7RES/vhls%3D'}
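
The URLs returned above carry a SAS token, and the `se` query parameter is its expiry time, so the dataset should be created while the URL is still valid. The sketch below checks the remaining lifetime using only the standard library. The helper names and the example URL are made up for illustration, and the parser assumes the second-precision UTC format seen in the output:

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def sas_expiry(url):
    """Return the SAS expiry (`se` parameter) as an aware UTC datetime."""
    query = parse_qs(urlparse(url).query)
    expiry_text = query["se"][0]  # e.g. "2024-11-10T21:09:11Z" after URL decoding
    parsed = datetime.strptime(expiry_text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def seconds_until_expiry(url, now=None):
    """Return how many seconds remain before the SAS URL expires."""
    now = datetime.now(timezone.utc) if now is None else now
    return (sas_expiry(url) - now).total_seconds()


# Made-up URL with a placeholder signature.
example_url = (
    "https://example.blob.core.windows.net/container/train.zip"
    "?se=2024-11-10T21%3A09%3A11Z&sp=r&sr=b&sig=placeholder"
)
print(sas_expiry(example_url).isoformat())
```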


### Create datasets with the uploaded acoustic dataset

In [26]:
kind="Acoustic"
display_name = "acoustic dataset(zip) for training"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"

zip_dataset_dict = {}

for display_name in uploaded_files:
    zip_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, locale)

Dataset created with ID: 379fc89a-e577-466c-b34b-0eca564cdffb
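
`create_dataset` is another helper from `utils/common.py`. Assuming it registers the uploaded file through the `v3.2` `datasets` endpoint, the request might be assembled roughly as follows (`build_dataset_request` is a hypothetical name, not the repository's function):

```python
API_VERSION = "v3.2"  # assumption: taken from the URLs printed later in this notebook


def build_dataset_request(base_url, project_id, content_url, kind,
                          display_name, description, locale):
    """Return (url, json_body) for registering an uploaded file as a dataset."""
    url = f"{base_url}/{API_VERSION}/datasets"
    body = {
        "kind": kind,  # "Acoustic" or "Language" in this notebook
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "contentUrl": content_url,  # SAS URL produced by the upload step
        "project": {"self": f"{base_url}/{API_VERSION}/projects/{project_id}"},
    }
    return url, body


url, body = build_dataset_request(
    "https://example.test/speechtotext",      # placeholder base URL
    "project-id-placeholder",
    "https://example.test/train.zip?sig=placeholder",
    "Acoustic",
    "train_vi-VN_example",
    "Dataset for training the Vietnamese base model",
    "vi-VN",
)
print(url)
print(body["project"]["self"])
```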


### Upload the plain text dataset to storage (text files)

In [27]:
data_folder = "plain_text"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['vi-VN_20241108095916', 'vi-VN_20241110220522']
url: {'vi-VN_20241108095916': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/vi-VN_20241108095916.txt?se=2024-11-10T21%3A09%3A18Z&sp=r&sv=2024-11-04&sr=b&sig=2scFGsEA4pcxR2jsUVdH61LZ/Q95xAq1Nl2KRQyRnYw%3D', 'vi-VN_20241110220522': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/vi-VN_20241110220522.txt?se=2024-11-10T21%3A09%3A18Z&sp=r&sv=2024-11-04&sr=b&sig=bLoXwVzMzPq3k9onNnArTZesfpJwhkuPt/hOfVBlqbA%3D'}


### Create datasets with the uploaded plain text dataset

In [28]:
kind="Language"
display_name = "plain text dataset for training"
description = "Dataset for training the Vietnamese base model"
locale = "vi-VN"

plain_dataset_dict = {}

for display_name in uploaded_files:
    plain_dataset_dict[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, locale)

Dataset created with ID: 9b4d0d9d-2fbe-4f25-8a41-9ebc049aa6d9
Dataset created with ID: c9023a6b-fd40-48b3-b40c-7f99a592d83f


## 3. Train Custom Speech Models with the uploaded datasets

In [None]:
# Look up the base model ID on the "Train a new model" page in Azure Speech Studio.
# Base model IDs vary by language.
base_model_id = "8066b5fb-0114-4837-90b6-0c245928a896"  # Vietnamese base model id
base_model = get_base_model(base_url, headers, base_model_id)
base_model

{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896',
 'links': {'manifest': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896/manifest'},
 'properties': {'deprecationDates': {'adaptationDateTime': '2025-01-15T00:00:00Z',
   'transcriptionDateTime': '2025-04-15T00:00:00Z'},
  'features': {'supportsTranscriptions': True,
   'supportsEndpoints': True,
   'supportsTranscriptionsOnSpeechContainers': False,
   'supportsAdaptationsWith': ['Language'],
   'supportedOutputFormats': ['Display', 'Lexical']},
  'chargeForAdaptation': False},
 'lastActionDateTime': '2023-01-31T13:08:53Z',
 'status': 'Succeeded',
 'createdDateTime': '2023-01-31T12:16:46Z',
 'locale': 'vi-VN',
 'displayName': '20230111',
 'description': 'vi-VN base model'}
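
The `supportsAdaptationsWith` list in the response indicates which dataset kinds the base model accepts for training; in the output above it contains only `'Language'`. A minimal sketch of a pre-flight check follows. The helper name `supports_adaptation` is hypothetical, and comparing against the same `kind` string used when creating datasets is an assumption:

```python
def supports_adaptation(base_model, kind):
    """Return True if `kind` is listed in the base model's supportsAdaptationsWith."""
    features = base_model.get("properties", {}).get("features", {})
    return kind in features.get("supportsAdaptationsWith", [])


# Trimmed copy of the response printed above.
base_model_example = {
    "properties": {"features": {"supportsAdaptationsWith": ["Language"]}},
    "locale": "vi-VN",
}

print(supports_adaptation(base_model_example, "Language"))  # True
print(supports_adaptation(base_model_example, "Acoustic"))  # False
```

If a kind is missing from the list, a training job that uses it may not succeed, so it can be worth checking before submitting the job.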

### Train the custom speech model with plain text datasets (txt)

In [30]:
display_name = "vi_custom_model_with_plain_text"
description = "Custom model training with plain text dataset"
locale = "vi-VN"
custom_model_with_plain_id = create_custom_model(base_url, headers, project_id, base_model_id, list(plain_dataset_dict.values()), display_name, description, locale)

{'displayName': 'vi_custom_model_with_plain_text', 'description': 'Custom model training with plain text dataset', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/9b4d0d9d-2fbe-4f25-8a41-9ebc049aa6d9'}, {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/c9023a6b-fd40-48b3-b40c-7f99a592d83f'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/8b0eb25b-38f1-4f7b-8cfe-641753cd9cd1'}}
custom model job created with ID: 762281d5-7387-45cf-8a1b-13f2670b15f3


### Train the custom speech model with acoustic datasets (zip)

In [31]:
display_name = "vi_custom_model_with_aocustic_dataset"
description = "Custom model training with acoustic dataset"
locale = "vi-VN"
custom_model_with_acoustic_id = create_custom_model(base_url, headers, project_id, base_model_id, list(zip_dataset_dict.values()), display_name, description, locale)

{'displayName': 'vi_custom_model_with_aocustic_dataset', 'description': 'Custom model training with acoustic dataset', 'locale': 'vi-VN', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/379fc89a-e577-466c-b34b-0eca564cdffb'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/8b0eb25b-38f1-4f7b-8cfe-641753cd9cd1'}}
custom model job created with ID: 84bebf4d-b198-437f-bef5-619ee288bb36
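
The dictionary printed before each job ID appears to be the request body that `create_custom_model` sends. Under that assumption, the body could be assembled as in the sketch below (`build_model_body` is a hypothetical name; the real helper lives in `utils/common.py`):

```python
API_VERSION = "v3.2"  # taken from the URLs in the printed output


def build_model_body(base_url, project_id, base_model_id, dataset_ids,
                     display_name, description, locale):
    """Build a model-creation body that links the base model, datasets, and project."""
    root = f"{base_url}/{API_VERSION}"
    return {
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "baseModel": {"self": f"{root}/models/base/{base_model_id}"},
        "datasets": [{"self": f"{root}/datasets/{dataset_id}"} for dataset_id in dataset_ids],
        "project": {"self": f"{root}/projects/{project_id}"},
    }


body = build_model_body(
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    "8b0eb25b-38f1-4f7b-8cfe-641753cd9cd1",
    "8066b5fb-0114-4837-90b6-0c245928a896",
    ["379fc89a-e577-466c-b34b-0eca564cdffb"],
    "vi_custom_model_with_aocustic_dataset",
    "Custom model training with acoustic dataset",
    "vi-VN",
)
print(body["datasets"])
```

A body like this would then be sent with a POST request to `{base_url}/v3.2/models`.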


In [32]:
from tqdm import tqdm

# Poll the training job until it reaches a terminal state (Succeeded or Failed)
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        while(pbar.n < 3):
            pbar.update(1)
        print("Evaluation Completed")

### Monitor training status for each job

In [33]:
monitor_training_status(custom_model_with_plain_id)
monitor_training_status(custom_model_with_acoustic_id)

Running Status:  33%|███▎      | 1/3 [00:01<00:02,  1.03s/step]

Current Status: Running


Running Status:  67%|██████▋   | 2/3 [00:12<00:06,  6.91s/step]

Current Status: Running
Current Status: Running


Running Status: 100%|██████████| 3/3 [00:34<00:00, 11.37s/step]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:01<00:00,  2.96step/s]

Evaluation Completed
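
`monitor_training_status` prints the same completion message whether a job ends in `Succeeded` or `Failed`, and it keeps polling indefinitely if the job never reaches one of those states. The sketch below is a minimal variant that returns the final status and enforces a timeout. `wait_for_terminal_status` and its parameters are illustrative, and the status getter is passed in so the loop can be tried without calling the service:

```python
import time

TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_terminal_status(get_status, poll_seconds=10, timeout_seconds=3600,
                             sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated job that moves through three states; sleeping is disabled for the demo.
fake_statuses = iter(["NotStarted", "Running", "Succeeded"])
final_status = wait_for_terminal_status(lambda: next(fake_statuses), sleep=lambda _: None)
print(final_status)  # Succeeded
```

In this notebook the getter could be `lambda: get_custom_model_status(base_url, headers, custom_model_with_plain_id)`, and the returned status can be compared with `"Succeeded"` before the model is used.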





In [34]:
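# The acoustic model ID is stored under the name 'custom_model_with_zip_id'.
custom_model_with_zip_id = custom_model_with_acoustic_id
%store custom_model_with_plain_id
%store custom_model_with_zip_id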

Stored 'custom_model_with_plain_id' (str)
Stored 'custom_model_with_zip_id' (str)
