# [Eval] Evaluate Trained Custom Speech Models
This sample demonstrates how to evaluate trained Custom Speech models by calling the REST API.

> ✨ ***Note*** <br>
> You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech to text base model or another custom model. After you get the test results, evaluate the word error rate (WER) compared to speech recognition results. 

## Prerequisites
Configure a Python virtual environment with Python 3.10 or later:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv / Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Create the environment with version 3.10 or later.

```bash
pip install -r requirements.txt
```

Create an .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.
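
Before running the cells below, it can help to confirm that the variables were actually picked up. The following is a minimal sketch, assuming the variable names that the later cells read with `os.getenv`; `missing_env_vars` is a hypothetical helper for illustration and is not part of `utils.common`.

```python
import os

# Variable names read by the cells in this notebook.
REQUIRED_VARS = [
    "AZURE_AI_SPEECH_API_KEY",
    "AZURE_AI_SPEECH_REGION",
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
    "AZURE_STORAGE_CONTAINER_NAME",
]


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing variables:", ", ".join(missing))
else:
    print("All required variables are set.")
```

Call it after `load_dotenv()` so that values from the .env file are included in the check.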

## Set up the environment

In [1]:
import azure.cognitiveservices.speech as speechsdk
import os
import json
from openai import AzureOpenAI
import requests
from dotenv import load_dotenv
from utils.common import *

load_dotenv()

speech_key = os.getenv("AZURE_AI_SPEECH_API_KEY")
speech_region = os.getenv("AZURE_AI_SPEECH_REGION")

# Get the project and custom model IDs from the previous notebook
project_id = ""
custom_model_with_plain_id = ""
custom_model_with_acoustic_id = ""
%store -r project_id
%store -r custom_model_with_plain_id
%store -r custom_model_with_acoustic_id

# The IDs are initialized to "" above, so a NameError can never occur here.
# Check for values that are still empty after `%store -r` instead.
if not all([project_id, custom_model_with_plain_id, custom_model_with_acoustic_id]):
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please run the previous notebook again.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

print(project_id, custom_model_with_plain_id, custom_model_with_acoustic_id)

ddd89716-eb27-4e95-868e-91588348a699 cc932fe1-f7a2-49d4-b8e8-c7c836d67297 2845aa88-8eca-4142-8035-e3393ad7f29c


## 1. Test base speech models
- To learn how to quantitatively measure and improve the accuracy of the base speech to text model or your own custom models, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=speech-cli#create-a-test

To evaluate the word error rate (WER) of a base model in the Azure AI Speech service, follow these steps:

 1. Sign in to Speech Studio: go to the Azure Speech Studio.
 1. Create a test:
    - Navigate to Custom speech and select your project.
    - Go to Test models and click Create new test.
    - Select Evaluate accuracy and click Next.
    - Choose an audio + human-labeled transcription dataset. If you don't have any datasets, upload them in the Speech datasets menu.
    - Select up to two models to evaluate, then click Next.
    - Enter the test name and description, then click Next.
    - Review the test details and click Save and close.
 1. Get the test results: after the test completes, indicated by the status Succeeded, you will see the results, including the WER for each tested model.
 1. Evaluate WER: WER is calculated as the sum of insertion, deletion, and substitution errors divided by the total number of words in the reference transcript, multiplied by 100 to get a percentage.

For more detailed instructions, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=rest-api.
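
The WER definition above can be illustrated with a small, self-contained sketch. The Speech service computes WER for you; this code only demonstrates the formula, using a standard word-level edit distance. Whitespace tokenization with no text normalization is an assumption here and will likely differ from what the service does, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (insertions + deletions + substitutions) / reference words * 100."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev holds the edit distances for the previous reference prefix.
    # It starts as the cost of producing j hypothesis words from an empty reference.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref) * 100


# One substitution ("sat" -> "sit") and one deletion ("the") against 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 2))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.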


In [2]:
import requests
import time
import json

# Base URL for the Speech Services REST API
base_url = f'https://{speech_region}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': speech_key,
    'Content-Type': 'application/json'
}

### Check the custom speech model IDs to evaluate

In [3]:
# Look up the base model ID on the "Train a new model" page in Azure Speech Studio.
# Base model IDs vary by language.
base_model_id = "8066b5fb-0114-4837-90b6-0c245928a896"  # Vietnamese base model id
print("base_model_id: ", base_model_id)
print("custom_model_with_plain_id: ", custom_model_with_plain_id)
print("custom_model_with_acoustic_id: ", custom_model_with_acoustic_id)

base_model_id:  8066b5fb-0114-4837-90b6-0c245928a896
custom_model_with_plain_id:  cc932fe1-f7a2-49d4-b8e8-c7c836d67297
custom_model_with_acoustic_id:  2845aa88-8eca-4142-8035-e3393ad7f29c


### Upload zip files to a storage account and generate content URLs

In [5]:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, generate_blob_sas, BlobSasPermissions
import os
import datetime
data_folder = "eval_dataset"
account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

Files uploaded successfully.
uploaded_files: ['Call23', 'Call30', 'Call19', 'Call27', 'Call2', 'Call6', 'Call18', 'Call25', 'Call17']
url: {'Call23': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/Call23.zip?se=2024-11-29T16%3A07%3A49Z&sp=r&sv=2025-01-05&sr=b&sig=vWe6u%2BKhuAiq4H/yKys1tD22/E6X95Jcdy5Yr9fC07I%3D', 'Call30': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/Call30.zip?se=2024-11-29T16%3A07%3A49Z&sp=r&sv=2025-01-05&sr=b&sig=bcFRynnbV/bbPkpULkEM60SxdDMZztIhdhRUEWB80Z0%3D', 'Call19': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/Call19.zip?se=2024-11-29T16%3A07%3A49Z&sp=r&sv=2025-01-05&sr=b&sig=x7cxBH9tz%2ByajrweSTUl%2Bs3y6SjHOKjO2hmhxOWUbpU%3D', 'Call27': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container/Call27.zip?se=2024-11-29T16%3A07%3A49Z&sp=r&sv=2025-01-05&sr=b&sig=9yg5PstJH6anI6uD%2BJMiDTNLjmV39ZhOoY0bU4lMkFM%3D', 'Call2': 'https://aoaihub1storageaccount.blob.core.windows.net/stt-container

### Create datasets for evaluation

In [6]:
kind = "Acoustic"
description = "Evaluation Dataset for Vietnamese Speech Recognition"
locale = "vi-VN"
dataset_id = {}

for display_name in uploaded_files:
    dataset_id[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, locale)

Dataset created with ID: 4b398307-a1fb-4379-ad3c-a30690de4a43
Dataset created with ID: 5c857663-0749-4111-bb98-ca45f8fbe18f
Dataset created with ID: 52f09d11-668b-4953-a5c5-aa1b0bbaa7a9
Dataset created with ID: 16b0f54f-777f-484b-88a8-4fedeb710835
Dataset created with ID: f2240746-51ee-425e-bd90-21443595cf70
Dataset created with ID: d9a69f06-3c6e-4b68-8983-dfa306c69baa
Dataset created with ID: 1ca6619b-4603-49e0-8ab3-e4d867d74649
Dataset created with ID: b28a138f-8ce1-464f-b5fd-10796ad3213f
Dataset created with ID: 4b0d2071-f48d-4d15-a5ec-3a1f67424a70
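
The implementation of `create_dataset` lives in `utils.common` and is not shown in this notebook. As a rough sketch of what such a helper might send, the code below builds a request body under the assumption that the v3.2 `datasets` endpoint accepts `kind`, `displayName`, `description`, `locale`, `contentUrl`, and a `project` reference. `build_dataset_payload` is a hypothetical name, and the content URL is a placeholder rather than a real SAS URL.

```python
import json


def build_dataset_payload(base_url, project_id, content_url, kind,
                          display_name, description, locale):
    """Build a JSON-serializable body for a dataset creation request (assumed shape)."""
    return {
        "kind": kind,
        "displayName": display_name,
        "description": description,
        "locale": locale,
        # Read-only SAS URL of the uploaded zip file.
        "contentUrl": content_url,
        # Related resources are referenced by their full URL under a "self" key,
        # mirroring the evaluation request body printed later in this notebook.
        "project": {"self": f"{base_url}/v3.2/projects/{project_id}"},
    }


dataset_payload = build_dataset_payload(
    base_url="https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    project_id="ddd89716-eb27-4e95-868e-91588348a699",
    content_url="https://<account>.blob.core.windows.net/<container>/Call23.zip?<sas-token>",
    kind="Acoustic",
    display_name="Call23",
    description="Evaluation Dataset for Vietnamese Speech Recognition",
    locale="vi-VN",
)
print(json.dumps(dataset_payload, indent=2))
```

A body like this would typically be posted with `requests.post(f"{base_url}/v3.2/datasets", headers=headers, json=dataset_payload)`; check the REST API reference before relying on the exact field names.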


### Test the accuracy of the trained Custom Speech model by creating evaluations (tests)

In [16]:
description = "evaluate the Vietnamese models"
locale = "vi-VN"
evaluation_id = {}
for display_name in uploaded_files:
    evaluation_id[display_name] = create_evaluation(
        base_url, headers, project_id, dataset_id[display_name],
        base_model_id, custom_model_with_acoustic_id,
        f'vi_eval_base_vs_custom_{display_name}', description, locale,
    )

https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/evaluations
ddd89716-eb27-4e95-868e-91588348a699 4b398307-a1fb-4379-ad3c-a30690de4a43 8066b5fb-0114-4837-90b6-0c245928a896 2845aa88-8eca-4142-8035-e3393ad7f29c
{'model1': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/8066b5fb-0114-4837-90b6-0c245928a896'}, 'model2': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/2845aa88-8eca-4142-8035-e3393ad7f29c'}, 'dataset': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/4b398307-a1fb-4379-ad3c-a30690de4a43'}, 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/ddd89716-eb27-4e95-868e-91588348a699'}, 'displayName': 'vi_eval_base_vs_custom_Call23', 'description': 'evaluate the Vietnamese models', 'locale': 'vi-VN'}


Evaluation job created with ID: 9d3f0664-c32f-474c-a9d9-72a692d398cb
https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/evaluations
ddd89716-eb27-4e95-868e-91588348a699 5c857663-0749-4111-bb98-ca45f8fbe18f 8066b5fb-0114-4837-90b6-0c245928a896 2845aa88-8eca-4142-8035-e3393ad7f29c
{'model1': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/8066b5fb-0114-4837-90b6-0c245928a896'}, 'model2': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/2845aa88-8eca-4142-8035-e3393ad7f29c'}, 'dataset': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/datasets/5c857663-0749-4111-bb98-ca45f8fbe18f'}, 'project': {'self': 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/projects/ddd89716-eb27-4e95-868e-91588348a699'}, 'displayName': 'vi_eval_base_vs_custom_Call30', 'description': 'evaluate the Vietnamese models', 'locale': 'vi-VN'}
Evaluation job created with ID: 27a41bd6
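
The request bodies printed above show the shape of the payload sent to the `evaluations` endpoint: `model1`, `model2`, `dataset`, and `project` are each referenced by URL under a `self` key. The sketch below rebuilds that body from plain IDs. `build_evaluation_payload` is a hypothetical helper; the actual `create_evaluation` in `utils.common` is not shown here and may be written differently.

```python
def build_evaluation_payload(base_url, project_id, dataset_id, model1_id, model2_id,
                             display_name, description, locale, api_version="v3.2"):
    """Build an evaluation request body matching the shape printed in the cell output."""
    def ref(collection, resource_id):
        # Each related resource is referenced by its full URL under a "self" key.
        return {"self": f"{base_url}/{api_version}/{collection}/{resource_id}"}

    return {
        "model1": ref("models", model1_id),
        "model2": ref("models", model2_id),
        "dataset": ref("datasets", dataset_id),
        "project": ref("projects", project_id),
        "displayName": display_name,
        "description": description,
        "locale": locale,
    }


payload = build_evaluation_payload(
    base_url="https://swedencentral.api.cognitive.microsoft.com/speechtotext",
    project_id="ddd89716-eb27-4e95-868e-91588348a699",
    dataset_id="4b398307-a1fb-4379-ad3c-a30690de4a43",
    model1_id="8066b5fb-0114-4837-90b6-0c245928a896",
    model2_id="2845aa88-8eca-4142-8035-e3393ad7f29c",
    display_name="vi_eval_base_vs_custom_Call23",
    description="evaluate the Vietnamese models",
    locale="vi-VN",
)
print(payload["model1"]["self"])
```

Since `model1` is the base model and `model2` the custom model, `wordErrorRate1` and `wordErrorRate2` in the results correspond to the base and custom model respectively.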

In [17]:
from tqdm import tqdm

# Poll the evaluation status every 10 seconds until it is "Succeeded" or "Failed"
def monitor_evaluation_status(evaluation_id):
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_evaluation_status(base_url, headers, evaluation_id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_evaluation_status(base_url, headers, evaluation_id)
        while pbar.n < 3:
            pbar.update(1)
        print("Evaluation Completed")
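
The loop above polls indefinitely and prints "Evaluation Completed" even when the final status is "Failed". As one possible variation, the sketch below adds a timeout and returns the terminal status so the caller can react to it. `wait_for_terminal_status` and its parameters are hypothetical; the status getter is passed in as a callable so the logic can be exercised without calling the service.

```python
import time


def wait_for_terminal_status(get_status, poll_seconds=10.0, timeout_seconds=1800.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns "Succeeded" or "Failed", or raise TimeoutError."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Simulated status sequence; no real waiting because sleep is replaced with a no-op.
simulated = iter(["NotStarted", "Running", "Succeeded"])
print(wait_for_terminal_status(lambda: next(simulated), sleep=lambda seconds: None))
```

In this notebook it could be called as `wait_for_terminal_status(lambda: get_evaluation_status(base_url, headers, evaluation_id[display_name]))`, followed by a check that the returned status is "Succeeded" before reading the WER values.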

In [18]:
for display_name in uploaded_files:
    monitor_evaluation_status(evaluation_id[display_name])

Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted
Current Status: NotStarted


Running Status:  67%|██████▋   | 2/3 [01:30<00:45, 45.21s/step]

Current Status: Running


Running Status: 100%|██████████| 3/3 [01:40<00:00, 33.49s/step]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 99.53step/s]


Evaluation Completed


Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Running


Running Status: 100%|██████████| 3/3 [00:10<00:00,  3.36s/step]


Evaluation Completed


Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Running


Running Status: 100%|██████████| 3/3 [00:10<00:00,  3.36s/step]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 69.08step/s]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 107.70step/s]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 102.66step/s]


Evaluation Completed


Running Status:   0%|          | 0/3 [00:00<?, ?step/s]

Current Status: Running


Running Status: 100%|██████████| 3/3 [00:10<00:00,  3.35s/step]


Evaluation Completed


Running Status: 100%|██████████| 3/3 [00:00<00:00, 112.48step/s]

Evaluation Completed





In [19]:

import pandas as pd

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_id[display_name])
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })
# Create a DataFrame to display the results
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)




Evaluation Results for base model and custom model: Call23 Call30 Call19 Call27 Call2 Call6 Call18 Call25 Call17 
  Dataset  WER_base_model  WER_custom_model
0  Call23          0.0805            0.0738
1  Call30          0.0839            0.0872
2  Call19          0.1457            0.1809
3  Call27          0.1230            0.0802
4   Call2          0.1375            0.1417
5   Call6          0.1492            0.1694
6  Call18          0.0702            0.0726
7  Call25          0.2005            0.2137
8  Call17          0.1905            0.1769
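
To compare the two models across all datasets rather than row by row, the table can be summarized. The sketch below copies the values from the output above and computes the unweighted mean WER per model plus the datasets where the custom model scored lower. The variable names are illustrative, and only the standard library is used so the snippet runs on its own.

```python
from statistics import mean

# (dataset, base-model WER, custom-model WER), copied from the table above.
wer_rows = [
    ("Call23", 0.0805, 0.0738),
    ("Call30", 0.0839, 0.0872),
    ("Call19", 0.1457, 0.1809),
    ("Call27", 0.1230, 0.0802),
    ("Call2", 0.1375, 0.1417),
    ("Call6", 0.1492, 0.1694),
    ("Call18", 0.0702, 0.0726),
    ("Call25", 0.2005, 0.2137),
    ("Call17", 0.1905, 0.1769),
]

base_mean = mean(base for _, base, _ in wer_rows)
custom_mean = mean(custom for _, _, custom in wer_rows)
# Datasets where the custom model has a strictly lower WER than the base model.
improved = [name for name, base, custom in wer_rows if custom < base]

print(f"Mean WER, base model:   {base_mean:.4f}")
print(f"Mean WER, custom model: {custom_mean:.4f}")
print(f"Custom model is better on {len(improved)} of {len(wer_rows)} datasets: {improved}")
```

An unweighted mean treats every file equally regardless of its length, so it is not the same as the WER over the combined corpus. Weighting each file by its reference word count would be closer to that figure.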
