# [Speech to Text] Create Custom Speech Model for Italian Language
This sample demonstrates how to create a Custom Speech model by calling the REST API.

> ✨ ***Note*** <br>
> Check the custom speech support for your language before you get started: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=stt#:~:text=Custom%20speech%20support 

## Prerequisites
Configure a Python virtual environment with Python 3.10 or later:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for `Python: Create Environment`.
 1. Select Venv / Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Create with version 3.10 or later.

```bash
pip install -r requirements.txt
```

Create an .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.
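
The cells below read their settings with `os.getenv`, so a missing entry silently becomes `None` and only surfaces later inside an API call. A minimal sketch of an up-front check, assuming the five variable names that appear in this notebook; `missing_env_vars` is a hypothetical helper and not part of `utils.common`.

```python
import os

# Variable names read via os.getenv in this notebook
REQUIRED_ENV_VARS = [
    "AZURE_AI_SPEECH_API_KEY",
    "AZURE_AI_SPEECH_REGION",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
    "AZURE_STORAGE_CONTAINER_NAME",
]


def missing_env_vars(env=None, required=REQUIRED_ENV_VARS):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing settings in .env:", ", ".join(missing))
```

Run it after `load_dotenv()` so that values from the .env file are already in the process environment.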

## 1. Test Speech to Text (STT) of Azure AI Speech with the synthetic dataset

In [2]:
import azure.cognitiveservices.speech as speechsdk
import os
import json
from openai import AzureOpenAI
import requests
from dotenv import load_dotenv
from utils.common import *

load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

train_dataset_path = ""
%store -r train_dataset_path
# The default empty string stays in place when the previous notebook stored nothing
if not train_dataset_path:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the previous notebook again.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

train_dataset_path

{'train_dataset/train_it-IT_20241203083636.zip'}

In [3]:
import requests
import time
import json

# Base URL for the Speech Services REST API
base_url = f'https://{speech_region}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': speech_key,
    'Content-Type': 'application/json'
}

In [6]:
def speech_recognition_from_file(file_path: str, lang: str) -> str:
    """Transcribe a single utterance from an audio file and return the recognized text."""
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=speech_region, speech_recognition_language=lang)
    audio_config = speechsdk.AudioConfig(filename=file_path)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    # recognize_once_async() returns after the first recognized utterance
    speech_recognition_result = speech_recognizer.recognize_once_async().get()
    return speech_recognition_result.text

### Get the sorted wav files from the dataset folder

In [7]:
import os
from IPython.display import Audio, display

output_folder = 'synthetic_data'
files = os.listdir(output_folder)
wav_files = [file for file in files if file.endswith('.wav')]

# Sort wav_files by the leading numeric index in ascending order
wav_files.sort(key=lambda x: int(x.split('_')[0]))
wav_files

['1_it-IT_20241203083554.wav',
 '2_it-IT_20241203083555.wav',
 '3_it-IT_20241203083557.wav',
 '4_it-IT_20241203083558.wav',
 '5_it-IT_20241203083559.wav',
 '6_it-IT_20241203083600.wav',
 '7_it-IT_20241203083601.wav',
 '8_it-IT_20241203083602.wav',
 '9_it-IT_20241203083603.wav',
 '10_it-IT_20241203083604.wav']
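
The numeric sort key matters because a plain string sort compares character by character and would place file 10 before file 1. A small sketch of the difference, using shortened, illustrative file names rather than the real ones:

```python
# Shortened, illustrative file names (not the real dataset files)
names = ["10_it-IT.wav", "2_it-IT.wav", "1_it-IT.wav"]


def leading_index(file_name: str) -> int:
    """Return the integer before the first underscore."""
    return int(file_name.split("_")[0])


lexicographic = sorted(names)
numeric = sorted(names, key=leading_index)

print(lexicographic)  # ['10_it-IT.wav', '1_it-IT.wav', '2_it-IT.wav']
print(numeric)        # ['1_it-IT.wav', '2_it-IT.wav', '10_it-IT.wav']
```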

In [8]:
for wav_file in wav_files[0:3]:
    print(speech_recognition_from_file(os.path.join(output_folder,wav_file), "it-IT"))

Come posso attivare il servizio di assistenza clienti LG?
Qual è il numero di telefono per il servizio clienti LG?
Ho bisogno di assistenza per il mio televisore LG, chi posso contattare?


## 2. Upload training datasets 
- You can upload datasets for training, qualitative inspection, and quantitative measurement. 
- This lab covers two of the training and testing data types that you can use for custom speech: acoustic and plain text.
- For the other data types, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-test-and-train

### Create a project

In [9]:
display_name = "My Custom Model Training And Evaluation Project"
description = "Project for training and evaluating the Italian base model"
locale = "it-IT"
project_id = create_project(base_url, headers, display_name, description, locale)

Project created with ID: d5c175ce-1550-45dd-8132-c1c18fd5d3e2
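
`create_project` comes from `utils.common` and its source is not shown here. A minimal sketch of what such a helper plausibly does, assuming it posts a JSON body to the v3.2 `projects` endpoint and takes the ID from the last path segment of the `self` URL in the response. Both function names below are hypothetical.

```python
def build_project_payload(display_name: str, description: str, locale: str) -> dict:
    """Assemble the JSON body for a project creation request (assumed field names)."""
    return {"displayName": display_name, "description": description, "locale": locale}


def id_from_self_url(self_url: str) -> str:
    """Return the last path segment of an entity's `self` URL."""
    return self_url.rstrip("/").rsplit("/", 1)[-1]


payload = build_project_payload(
    "My Custom Model Training And Evaluation Project",
    "Project for training and evaluating the Italian base model",
    "it-IT",
)
# The body would then be sent along these lines (not executed here):
#   response = requests.post(f"{base_url}/v3.2/projects", headers=headers, json=payload)
#   response.raise_for_status()
#   project_id = id_from_self_url(response.json()["self"])
```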


In [10]:
# Store the project_id for later use
%store project_id

Stored 'project_id' (str)


### Upload the acoustic dataset to storage (zip files)

In [11]:
data_folder = "train_dataset"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['train_it-IT_20241203083636']
url: {'train_it-IT_20241203083636': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/train_it-IT_20241203083636.zip?se=2024-12-03T17%3A18%3A13Z&sp=r&sv=2025-01-05&sr=b&sig=%2BIkaXF/4CkDa2hZSWJRqQy3TuEiSywp7eMOXG%2B0oRxw%3D'}
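
The returned URL carries a shared access signature (SAS) in its query string: `sp=r` grants read-only access, `sr=b` scopes it to a single blob, and `se` is the expiry time. Because the Speech service presumably has to download the file through this URL, it can be worth confirming the token is still valid before creating a dataset. A sketch using only the standard library; the helper names are hypothetical, the sample URL is a placeholder, and the code assumes `se` uses the second-precision UTC format shown in the output above.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlparse


def sas_expiry(sas_url: str) -> datetime:
    """Parse the `se` (signed expiry) parameter of a SAS URL into an aware datetime."""
    query = parse_qs(urlparse(sas_url).query)
    expiry = query["se"][0]  # percent-decoding is handled by parse_qs
    return datetime.strptime(expiry, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def sas_is_expired(sas_url: str, now=None) -> bool:
    """Return True when the SAS token is no longer valid at `now` (default: current time)."""
    now = datetime.now(timezone.utc) if now is None else now
    return now >= sas_expiry(sas_url)


# Placeholder URL: account, blob name, and signature are not real
sample_url = (
    "https://example.blob.core.windows.net/stt-container/sample.zip"
    "?se=2024-12-03T17%3A18%3A13Z&sp=r&sv=2025-01-05&sr=b&sig=PLACEHOLDER"
)
print(sas_expiry(sample_url).isoformat())
```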


### Create datasets with the uploaded acoustic dataset

In [12]:
kind = "Acoustic"
description = "Dataset for training the Italian base model"

zip_dataset_dict = {}

# Create one dataset per uploaded file; the file name doubles as the dataset display name
for file_name in uploaded_files:
    zip_dataset_dict[file_name] = create_dataset(base_url, headers, project_id, url[file_name], kind, file_name, description, locale)

Dataset created with ID: 683c1a42-1530-481a-843b-d897093585d7
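
`create_dataset` is another `utils.common` helper. A minimal sketch of the request body it plausibly sends, assuming the v3.2 `datasets` endpoint accepts the field names below and links the project through its `self` URL, the same convention visible in the model request body printed later in this notebook. `build_dataset_payload` and the URLs in the example are hypothetical.

```python
def build_dataset_payload(base_url, project_id, content_url, kind, display_name, description, locale):
    """Assemble the JSON body for a dataset creation request (assumed field names)."""
    return {
        "kind": kind,                # "Acoustic" or "Language" in this lab
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "contentUrl": content_url,   # SAS URL of the uploaded file
        "project": {"self": f"{base_url}/v3.2/projects/{project_id}"},
    }


example = build_dataset_payload(
    "https://example.invalid/speechtotext",              # placeholder base URL
    "d5c175ce-1550-45dd-8132-c1c18fd5d3e2",
    "https://example.invalid/container/train.zip?sig=PLACEHOLDER",
    "Acoustic",
    "train_it-IT_20241203083636",
    "Dataset for training the Italian base model",
    "it-IT",
)
# Would be sent with: requests.post(f"{base_url}/v3.2/datasets", headers=headers, json=example)
```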


### Upload the plain text dataset to storage (text files)

In [13]:
data_folder = "plain_text"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['it-IT_20241203083540']
url: {'it-IT_20241203083540': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/it-IT_20241203083540.txt?se=2024-12-03T17%3A18%3A16Z&sp=r&sv=2025-01-05&sr=b&sig=HOsWjTC5Ggsuup38t8YKogDbI5kZ9zyfVrU7aE0dFPI%3D'}


### Create datasets with the uploaded plain text dataset

In [14]:
kind = "Language"
description = "Dataset for training the Italian base model"

plain_dataset_dict = {}

# Create one dataset per uploaded file; the file name doubles as the dataset display name
for file_name in uploaded_files:
    plain_dataset_dict[file_name] = create_dataset(base_url, headers, project_id, url[file_name], kind, file_name, description, locale)

Dataset created with ID: 5408be63-be37-4b33-b19d-49546b6ae4c1


## 3. Train Custom Speech Models with the uploaded datasets

> ✨ ***Note*** <br>
> Check the base model information to see which adaptation types that version of the base model supports. <br>
> For example, the Italian base model 20230111 (2e5e70f1-960b-4509-a7c5-102b29227c0b) supports 'Language', 'LanguageMarkdown', 'Pronunciation', and 'OutputFormatting' adaptation.<br>
> Check the `supportsAdaptationsWith` feature of the `base_model` object.

In [20]:
# Look up the model ID in the "Train a new model" dialog of Azure Speech Studio.
# Base model IDs vary by language.
base_model_id = "2e5e70f1-960b-4509-a7c5-102b29227c0b"  # Italian base model id d822ba77-3a5b-460d-a19c-fd1919026148
base_model = get_base_model(base_url, headers, base_model_id)
base_model

{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/2e5e70f1-960b-4509-a7c5-102b29227c0b',
 'links': {'manifest': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/2e5e70f1-960b-4509-a7c5-102b29227c0b/manifest'},
 'properties': {'deprecationDates': {'adaptationDateTime': '2025-01-15T00:00:00Z',
   'transcriptionDateTime': '2025-04-15T00:00:00Z'},
  'features': {'supportsTranscriptions': True,
   'supportsEndpoints': True,
   'supportsTranscriptionsOnSpeechContainers': False,
   'supportsAdaptationsWith': ['Language',
    'LanguageMarkdown',
    'Pronunciation',
    'OutputFormatting'],
   'supportedOutputFormats': ['Display', 'Lexical']},
  'chargeForAdaptation': False},
 'lastActionDateTime': '2023-02-06T10:32:49Z',
 'status': 'Succeeded',
 'createdDateTime': '2023-02-06T10:01:46Z',
 'locale': 'it-IT',
 'displayName': '20230111',
 'description': 'it-IT base model'}
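
The `supportsAdaptationsWith` list can be checked in code before a training request is sent. A minimal sketch, assuming the dictionary layout shown in the output above; `supports_adaptation` is a hypothetical helper. Note that 'Acoustic' is not in the list returned for this base model.

```python
def supports_adaptation(base_model: dict, kind: str) -> bool:
    """Return True if the base model lists `kind` under supportsAdaptationsWith."""
    features = base_model.get("properties", {}).get("features", {})
    return kind in features.get("supportsAdaptationsWith", [])


# Trimmed copy of the base model response shown above
sample_base_model = {
    "properties": {
        "features": {
            "supportsAdaptationsWith": [
                "Language",
                "LanguageMarkdown",
                "Pronunciation",
                "OutputFormatting",
            ]
        }
    },
    "locale": "it-IT",
}

for kind in ("Language", "Acoustic"):
    print(kind, supports_adaptation(sample_base_model, kind))
```

In the notebook, the same call would take the `base_model` object returned by `get_base_model`.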

### Train the custom speech model with plain text datasets (txt)

In [None]:
display_name = "it_custom_model_with_plain_text"
description = "Custom model training with plain text dataset"
try:
    custom_model_with_plain_id = create_custom_model(base_url, headers, project_id, base_model_id, list(plain_dataset_dict.values()), display_name, description, locale)
except requests.exceptions.HTTPError as e:
    print(f"HTTP error occurred: {e}")
    print(f"Response content: {e.response.content}")
    raise

{'displayName': 'it_custom_model_with_plain_text', 'description': 'Custom model training with plain text dataset', 'locale': 'it-IT', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/d822ba77-3a5b-460d-a19c-fd1919026148'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/5408be63-be37-4b33-b19d-49546b6ae4c1'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/d5c175ce-1550-45dd-8132-c1c18fd5d3e2'}}
HTTP error occurred: 400 Client Error: Bad Request for url: https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models
Response content: b'{\n  "code": "InvalidRequest",\n  "message": "The selected end-to-end model isn\'t valid for adaptation",\n  "innerError": {\n    "code": "InvalidPayload",\n    "message": "The selected end-to-end model isn\'t valid for adaptation"\n  }\n}'
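
The request body printed above shows how the training request references other entities: the base model, the datasets, and the project are each passed as an object holding a `self` URL. A sketch that rebuilds the same body from IDs; `entity_ref` and `build_model_payload` are hypothetical names, and how `create_custom_model` actually assembles the body is not shown in this notebook.

```python
def entity_ref(base_url: str, collection: str, entity_id: str) -> dict:
    """Reference an existing entity by its v3.2 `self` URL."""
    return {"self": f"{base_url}/v3.2/{collection}/{entity_id}"}


def build_model_payload(base_url, project_id, base_model_id, dataset_ids, display_name, description, locale):
    """Assemble a model training request body in the shape printed above."""
    return {
        "displayName": display_name,
        "description": description,
        "locale": locale,
        "baseModel": entity_ref(base_url, "models/base", base_model_id),
        "datasets": [entity_ref(base_url, "datasets", dataset_id) for dataset_id in dataset_ids],
        "project": entity_ref(base_url, "projects", project_id),
    }


model_payload = build_model_payload(
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    "d5c175ce-1550-45dd-8132-c1c18fd5d3e2",
    "d822ba77-3a5b-460d-a19c-fd1919026148",
    ["5408be63-be37-4b33-b19d-49546b6ae4c1"],
    "it_custom_model_with_plain_text",
    "Custom model training with plain text dataset",
    "it-IT",
)
```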


### Train the custom speech model with acoustic datasets (zip)

In [None]:
display_name = "it_custom_model_with_aocustic_dataset"
description = "Custom model training with acoustic dataset"
try:
    custom_model_with_acoustic_id = create_custom_model(base_url, headers, project_id, base_model_id, list(zip_dataset_dict.values()), display_name, description, locale)
except requests.exceptions.HTTPError as e:
    print(f"HTTP error occurred: {e}")
    print(f"Response content: {e.response.content}")
    raise

{'displayName': 'it_custom_model_with_aocustic_dataset', 'description': 'Custom model training with acoustic dataset', 'locale': 'it-IT', 'baseModel': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/d822ba77-3a5b-460d-a19c-fd1919026148'}, 'datasets': [{'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/f1f13383-f97e-480e-b603-d5b16f6ea6ac'}], 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/2d47dabb-6c19-4f57-bab3-85aeb806408b'}}


HTTPError: 400 Client Error: Bad Request for url: https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models
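
The handlers above print `e.response.content` as raw bytes, which is hard to read. A minimal sketch that turns such a body into a one-line summary, assuming the `code` / `message` / `innerError` layout seen in the plain text training error; `describe_api_error` is a hypothetical helper.

```python
import json


def describe_api_error(content: bytes) -> str:
    """Summarize a JSON error body; fall back to the decoded text if it is not JSON."""
    try:
        body = json.loads(content)
    except ValueError:
        return content.decode("utf-8", errors="replace")
    if not isinstance(body, dict):
        return str(body)
    code = body.get("code", "Unknown")
    message = body.get("message", "")
    inner_code = (body.get("innerError") or {}).get("code")
    return f"{code} ({inner_code}): {message}" if inner_code else f"{code}: {message}"


# Error body copied from the plain text training output
sample_error = (
    b'{\n  "code": "InvalidRequest",\n'
    b'  "message": "The selected end-to-end model isn\'t valid for adaptation",\n'
    b'  "innerError": {\n    "code": "InvalidPayload",\n'
    b'    "message": "The selected end-to-end model isn\'t valid for adaptation"\n  }\n}'
)
print(describe_api_error(sample_error))
```

Inside the `except` block, `print(describe_api_error(e.response.content))` could replace the raw print.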

In [19]:
from tqdm import tqdm

# Poll the training status of a custom model until it succeeds or fails
def monitor_training_status(custom_model_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_custom_model_status(base_url, headers, custom_model_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_custom_model_status(base_url, headers, custom_model_id)
        # Fill the progress bar once a terminal status is reached
        while pbar.n < 3:
            pbar.update(1)
        # Report failures explicitly instead of always printing "Training Completed"
        print("Training Completed" if status == "Succeeded" else "Training Failed")

### Monitor training status for each job

In [20]:
monitor_training_status(custom_model_with_plain_id)
monitor_training_status(custom_model_with_acoustic_id)

Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Running


Running Status:  67%|██████▋   | 2/3 [00:10<00:05,  5.04s/step]

Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running
Current Status: Running


Running Status: 100%|██████████| 3/3 [02:10<00:00, 43.53s/step]


Training Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 100.35step/s]

Training Completed
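
The polling loop above runs until the service reports a terminal status, so a job that never finishes would block the notebook indefinitely. A sketch of the same idea with a timeout and an injectable status function, which also makes it testable without calling the API. `wait_for_status` and its parameters are hypothetical and not part of `utils.common`.

```python
import time


def wait_for_status(get_status, poll_seconds=10.0, timeout_seconds=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns 'Succeeded' or 'Failed', or raise TimeoutError."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Status is still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated status sequence, with sleeping disabled for the demonstration
statuses = iter(["NotStarted", "Running", "Running", "Succeeded"])
final_status = wait_for_status(lambda: next(statuses), sleep=lambda seconds: None)
print(final_status)
```

With the notebook's helper, the call would look like `wait_for_status(lambda: get_custom_model_status(base_url, headers, custom_model_with_plain_id))`.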





In [21]:
%store custom_model_with_plain_id
%store custom_model_with_acoustic_id

Stored 'custom_model_with_plain_id' (str)
Stored 'custom_model_with_acoustic_id' (str)
