# [Eval] Evaluate the gpt-4o-transcribe model
This sample demonstrates how to evaluate the gpt-4o-transcribe model.

> ✨ ***Note*** <br>
> You can test the accuracy of your custom model by creating a test. A test requires a collection of audio files and their corresponding transcriptions. You can compare a custom model's accuracy with a speech-to-text base model or another custom model. After you get the test results, evaluate the word error rate (WER) against the speech recognition results.

## Prerequisites
Configure a Python virtual environment with Python 3.10 or later:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version. Use version 3.10 or later.
 1. ***Prepare a Hugging Face user access token and log in with your Hugging Face account.*** See the [reference document](https://huggingface.co/docs/hub/datasets-usage).
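
The prerequisites above can be checked before running the rest of the notebook. The following is a minimal sketch, not part of the original sample: `check_prerequisites` is a hypothetical helper, and it assumes the access token is supplied through the `HF_TOKEN` environment variable. If you logged in interactively instead, the token warning can be ignored.

```python
import os
import sys


def check_prerequisites(env=None, min_version=(3, 10)):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []

    # The notebook asks for Python 3.10 or later.
    if sys.version_info < min_version:
        problems.append(
            f"Python {min_version[0]}.{min_version[1]}+ is required, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )

    # Assumption: the Hugging Face token is provided via HF_TOKEN.
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; export a Hugging Face access token or log in")

    return problems


for problem in check_prerequisites():
    print("WARNING:", problem)
```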


## Set up the environment

In [None]:
import os
from openai import AzureOpenAI
    
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2025-03-01-preview",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)

deployment_id = "gpt-4o-mini-transcribe"  # The custom name you chose when you deployed the model
audio_test_file = "./gpt-4o-mini-tts-test-korean1.mp4"

# Open the audio file in a context manager so the handle is closed after the request
with open(audio_test_file, "rb") as audio_file:
    result = client.audio.transcriptions.create(
        file=audio_file,
        model=deployment_id
    )

print(result)

NotFoundError: Error code: 404 - {'error': {'code': 'DeploymentNotFound', 'message': 'The API deployment for this resource does not exist. If you created the deployment within the last 5 minutes, please wait a moment and try again.'}}
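
The error message above suggests waiting and trying again when the deployment was created only a few minutes ago. One way to do that is a small retry wrapper. This is a sketch under assumptions: `retry_call` is a hypothetical helper, and the attempt count and delay are arbitrary defaults. A 404 caused by a wrong deployment name will not go away with retries, so check the name first.

```python
import time


def retry_call(func, retry_on, attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call func(); when it raises `retry_on`, wait and try again.

    The last failure is re-raised once `attempts` calls have been made.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            sleep(delay_seconds)


# Possible usage with the client from the cell above (not executed here):
#   import openai
#   result = retry_call(lambda: transcribe_once(), retry_on=openai.NotFoundError)
# where transcribe_once is a hypothetical function wrapping the transcription request.
```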

In [None]:
import os
from azure.ai.inference.models import SystemMessage
from azure.ai.inference.models import UserMessage

NUM_SAMPLES = 2

topic = f"""
Contoso Electronics call center QnA related expected spoken utterances for {CUSTOM_SPEECH_LANG} and English languages.
"""
question = f"""
create {NUM_SAMPLES} lines of jsonl of the topic in {CUSTOM_SPEECH_LANG} and english. jsonl format is required. use 'no' as number and '{CUSTOM_SPEECH_LOCALE}', 'en-US' keys for the languages.
only include the lines as the result. Do not include ```jsonl, ``` and blank line in the result. 
"""

system_message = """
Generate plain text sentences of #topic# related text to improve the recognition of domain-specific words and phrases.
Domain-specific words can be uncommon or made-up words, but their pronunciation must be straightforward to be recognized. 
Use text data that's close to the expected spoken utterances. The number of utterances per line should be 1. 
Here are examples of the expected format:
{"no": 1, "string": "string", "string": "string"}
{"no": 2, "string": "string", "string": "string"}
"""

user_message = f"""
#topic#: {topic}
Question: {question}
"""

# Simple API Call
response = client.chat.completions.create(
    model=aoai_deployment_name,
    messages=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ],
    temperature=0.8,
    max_tokens=1024,
    top_p=0.1
)

content = response.choices[0].message.content
print(content)
print("Usage Information:")
#print(f"Cached Tokens: {response.usage.prompt_tokens_details.cached_tokens}") #only o1 models support this
print(f"Completion Tokens: {response.usage.completion_tokens}")
print(f"Prompt Tokens: {response.usage.prompt_tokens}")
print(f"Total Tokens: {response.usage.total_tokens}")
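
The prompt asks the model for JSONL, but nothing guarantees that the response follows the format. The sketch below parses `content` line by line and checks the expected keys. `parse_jsonl` is a hypothetical helper, and the sample text with the `ko-KR` locale is made up for illustration, not real model output.

```python
import json


def parse_jsonl(text, required_keys):
    """Parse JSONL text into a list of dicts and verify each record's keys."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Skip blank lines and stray code-fence markers the model may still emit.
        if not line or line.startswith("```"):
            continue
        record = json.loads(line)
        missing = [key for key in required_keys if key not in record]
        if missing:
            raise ValueError(f"line {line_number} is missing keys: {missing}")
        records.append(record)
    return records


# Illustrative input only; in the notebook this would be `content`.
sample = (
    '{"no": 1, "ko-KR": "안녕하세요", "en-US": "Hello"}\n'
    '{"no": 2, "ko-KR": "감사합니다", "en-US": "Thank you"}'
)
records = parse_jsonl(sample, required_keys=("no", "ko-KR", "en-US"))
print(f"Parsed {len(records)} records")
```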

## 🧪 Test base speech models
- To learn how to quantitatively measure and improve the accuracy of the base speech to text model or your own custom models, see:
- https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=speech-cli#create-a-test

To evaluate the word error rate (WER) of a base model in Azure AI’s Speech service, follow these steps:

 1. Sign in to Speech Studio:
    - Go to the Azure Speech Studio.
 1. Create a test:
    - Navigate to Custom speech and select your project.
    - Go to Test models and click on Create new test.
    - Select Evaluate accuracy and click Next.
    - Choose an audio + human-labeled transcription dataset. If you don’t have any datasets, upload them in the Speech datasets menu.
    - Select up to two models to evaluate, then click Next.
    - Enter the test name and description, then click Next.
    - Review the test details and click Save and close.
 1. Get the test results:
    - After the test is complete, indicated by the status set to Succeeded, you will see the results, including the WER for each tested model.
 1. Evaluate WER:
    - WER is calculated as the sum of insertion, deletion, and substitution errors divided by the total number of words in the reference transcript, multiplied by 100 to get a percentage.

For more detailed instructions, see https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-custom-speech-evaluate-data?pivots=rest-api.
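
The WER definition above can be reproduced locally with a word-level edit distance. This is a minimal sketch for intuition only: `word_error_rate` is a hypothetical helper that splits on whitespace and applies no text normalization, so its numbers may differ from what the Speech service reports.

```python
def word_error_rate(reference, hypothesis):
    """Return WER as a percentage: (insertions + deletions + substitutions) / reference words * 100."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from hypothesis
            insertion = current[j - 1] + 1   # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return previous[-1] / len(ref_words) * 100


# One substituted word out of four reference words.
print(word_error_rate("a b c d", "a x c d"))
```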


In [135]:
# Base URL for the Speech Services REST API
base_url = f'https://{SPEECH_REGION}.api.cognitive.microsoft.com/speechtotext'

# Headers for authentication
headers = {
    'Ocp-Apim-Subscription-Key': SPEECH_KEY,
    'Content-Type': 'application/json'
}

### Check the custom speech model ids to evaluate

In [None]:
# Option 1: copy the model ID from the "Train a new model" page in Azure Speech Studio.
base_model_id = "8066b5fb-0114-4837-90b6-0c245928a896"  # Vietnamese base model id

# Option 2: look up the model ID through the API.
base_model = get_latest_base_model(base_url, headers, f"locale eq '{CUSTOM_SPEECH_LOCALE}' and status eq 'Succeeded'")

# Keep the base models that support 'Language' adaptations, then pick the one with the most recent createdDateTime
filtered_models = [model for model in base_model['values'] if 'properties' in model and 'Language' in model['properties']['features'].get('supportsAdaptationsWith', [])]
if filtered_models:
    latest_model = max(filtered_models, key=lambda x: x['createdDateTime'])
    print("Latest model supporting 'Language' adaptations:")
    print(latest_model)
    # The model ID is the last segment of the self link, for example 8066b5fb-0114-4837-90b6-0c245928a896 in
    # 'https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/models/base/8066b5fb-0114-4837-90b6-0c245928a896'
    base_model_id = latest_model['self'].split('/')[-1]
else:
    # base_model_id keeps the value set earlier in this cell
    print("No models found that support 'Language' adaptations.")
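
Taking the last `/`-separated piece of the self link works for URLs like the example above. If a self link ever carries a query string or a trailing slash, that approach would return the wrong text. The sketch below is a more defensive variant; `resource_id_from_self_link` is a hypothetical helper, and the second URL in the usage example is an assumed shape, not something taken from an API response.

```python
from urllib.parse import urlparse


def resource_id_from_self_link(self_link):
    """Return the last path segment of a self link, ignoring any query string."""
    path = urlparse(self_link).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


link = (
    "https://swedencentral.api.cognitive.microsoft.com/speechtotext/v3.2/"
    "models/base/8066b5fb-0114-4837-90b6-0c245928a896"
)
print(resource_id_from_self_link(link))
# Assumed variant with a query string appended:
print(resource_id_from_self_link(link + "?api-version=2024-11-15"))
```

The same helper could replace the `split('/')[-1]` call used later when reading the evaluation ID.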

In [None]:
# Check the model IDs. You can also find them on the "Train a new model" page in Azure Speech Studio.
# Base model IDs vary by language.
print("Latest base model id:", base_model_id)
print("custom_model_with_plain_id: ", custom_model_with_plain_id)
print("custom_model_with_acoustic_id: ", custom_model_with_acoustic_id)

In [150]:
import os
import zipfile
from datetime import datetime
import tqdm as notebook_tqdm
import librosa
import numpy as np
import soundfile as sf
from datasets import load_dataset

def create_commonvoice_zip(language: str,
                           subset: str = "test",
                           output_sample_rate: int = 16000,
                           num_samples: int | None = None):
    """
    Create a ZIP file containing:
      - WAV files (PCM 16-bit, mono, `output_sample_rate` Hz)
      - A manifest.txt file listing filename + transcript.

    :param language:            Language code, e.g. "hi", "en", "vi", ...
    :param subset:              Subset of the Common Voice dataset, e.g. "test", "train", etc.
    :param output_sample_rate:  Target sample rate (Hz) for the output WAV files.
    :param num_samples:         Number of samples to process (useful for sampling). 
                                If None, process the entire subset.
    """

    # 1. Load the specified subset of the dataset
    dataset = load_dataset("mozilla-foundation/common_voice_15_0", language, split=subset)
    
    # 2. If sampling is requested, shuffle and select
    if num_samples is not None and num_samples < len(dataset):
        dataset = dataset.shuffle(seed=42)           # Shuffle for random selection
        dataset = dataset.select(range(num_samples)) # Take first `num_samples` items
    
    total_items = len(dataset)
    print(f"Processing {total_items} items from '{language}' subset='{subset}'.")

    # 3. Create a folder to store intermediate WAV files + manifest
    folder_name = f"{language}_commonvoice_wavs"
    os.makedirs(folder_name, exist_ok=True)

    manifest_path = os.path.join(folder_name, "manifest.txt")

    # 4. Convert each MP3 to WAV (resampled) and write the manifest
    timestamp_str = datetime.now().strftime("%Y%m%d%H%M%S")
    with open(manifest_path, "w", encoding="utf-8") as manifest_file:
        for idx, sample in enumerate(dataset):
            
            # get the filename
            wav_filename = f"{idx+1}_{language}_{timestamp_str}.wav"
            wav_path = os.path.join(folder_name, wav_filename)

            audio_array = sample["audio"]["array"]
            original_sr = sample["audio"]["sampling_rate"]

            # Convert to float32 if not already
            audio_array = audio_array.astype(np.float32)
            # Use librosa to resample from original_sr to output_sample_rate
            if original_sr != output_sample_rate:
                audio_array = librosa.resample(
                    audio_array, 
                    orig_sr=original_sr,
                    target_sr=output_sample_rate
                )
            # --------------------------------

            # 5. Write the audio to a mono, 16-bit WAV at the desired sample rate
            sf.write(
                wav_path,
                audio_array,
                samplerate=output_sample_rate,
                subtype='PCM_16'
            )

            # 6. Append the line to the manifest (filename <tab> sentence)
            text = sample.get("sentence", "").replace("\t", " ")
            manifest_file.write(f"{wav_filename}\t{text}\n")

    # 7. Create the ZIP archive
    timestamp_str = datetime.now().strftime("%Y%m%d%H%M%S")
    zip_filename = f"{language}_commonvoice_{subset}_{timestamp_str}.zip"
    
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        # Add manifest.txt
        zipf.write(manifest_path, arcname="manifest.txt")

        # Add all WAV files
        for root, dirs, files in os.walk(folder_name):
            for file in files:
                if file.endswith(".wav"):
                    full_path = os.path.join(root, file)
                    zipf.write(full_path, arcname=file)

    print(f"Created {zip_filename}, containing {total_items} samples from '{language}' subset='{subset}'.")
    return zip_filename
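
Before uploading, it can help to confirm that the archive matches the layout the function writes: a `manifest.txt` with one `filename<TAB>transcript` line per WAV file. The following sketch does that check with the standard library; `check_dataset_zip` is a hypothetical helper, not part of the original notebook.

```python
import zipfile


def check_dataset_zip(zip_path):
    """Return (entries, problems) for a ZIP laid out like create_commonvoice_zip's output.

    entries  -- list of (filename, transcript) pairs read from manifest.txt
    problems -- list of human-readable issues; empty when the archive is consistent
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = set(archive.namelist())
        manifest = archive.read("manifest.txt").decode("utf-8")

    entries, problems = [], []
    for line_number, line in enumerate(manifest.splitlines(), start=1):
        if not line.strip():
            continue
        filename, separator, transcript = line.partition("\t")
        if not separator:
            problems.append(f"line {line_number}: no tab separator")
            continue
        if filename not in names:
            problems.append(f"line {line_number}: {filename} is not in the archive")
        if not transcript.strip():
            problems.append(f"line {line_number}: empty transcript for {filename}")
        entries.append((filename, transcript))
    return entries, problems
```

For example, `entries, problems = check_dataset_zip(zip_filename)` after the archive is created, followed by printing `problems` if the list is not empty.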

In [None]:
import shutil

data_folder = "eval_dataset"

#create_commonvoice_zip("en", subset="test", output_sample_rate=16000, num_samples=100)
#create_commonvoice_zip("ko", subset="test", output_sample_rate=16000, num_samples=100)
zip_filename = create_commonvoice_zip("ko", subset="test", num_samples=100)

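# Make sure the target folder exists before moving the archive into it
os.makedirs(data_folder, exist_ok=True)
shutil.move(zip_filename, os.path.join(data_folder, zip_filename))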

### Upload ZIP files to a storage account and generate content URLs

In [None]:

account_name = os.getenv("AZURE_STORAGE_ACCOUNT_NAME")
account_key = os.getenv("AZURE_STORAGE_ACCOUNT_KEY")
container_name = os.getenv("AZURE_STORAGE_CONTAINER_NAME")

uploaded_files, url = upload_dataset_to_storage(data_folder, container_name, account_name, account_key)

In [None]:
uploaded_files

### Create datasets for evaluation

In [None]:
kind="Acoustic"
description = f"[eval] Dataset for evaluating the {CUSTOM_SPEECH_LANG} base model"
dataset_ids = {}

for display_name in uploaded_files:
    dataset_ids[display_name] = create_dataset(base_url, headers, project_id, url[display_name], kind, display_name, description, CUSTOM_SPEECH_LOCALE)

In [155]:
import time
from tqdm import tqdm

def dataset_upload_monitor_status(base_url, headers, get_method, id):
    """Poll `get_method` every 10 seconds until the status is Succeeded or Failed."""
    with tqdm(total=3, desc="Running Status", unit="step") as pbar:
        status = get_method(base_url, headers, id)
        if status == "NotStarted":
            pbar.update(1)
        while status != "Succeeded" and status != "Failed":
            if status == "Running" and pbar.n < 2:
                pbar.update(1)
            print(f"Current Status: {status}")
            time.sleep(10)
            status = get_method(base_url, headers, id)
        # Fill the remaining steps so the bar finishes at 100%
        while pbar.n < pbar.total:
            pbar.update(1)
        print(f"Current Status: {status}, Operation Completed")

In [None]:
for display_name in uploaded_files:
    dataset_upload_monitor_status(base_url, headers, get_dataset_status, dataset_ids[display_name])
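
The monitoring loop above keeps polling until the status becomes `Succeeded` or `Failed`, so a stuck job would block the notebook indefinitely. A variant with a time limit might look like the sketch below. `wait_for_terminal_status` is a hypothetical helper, and the 30-minute default is an arbitrary assumption.

```python
import time


def wait_for_terminal_status(get_status, timeout_seconds=1800, poll_seconds=10,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns 'Succeeded' or 'Failed', or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in ("Succeeded", "Failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"status still '{status}' after {timeout_seconds} seconds")
        sleep(poll_seconds)


# Possible usage with the notebook's helpers (not executed here):
#   status = wait_for_terminal_status(
#       lambda: get_dataset_status(base_url, headers, dataset_ids[display_name])
#   )
```

Returning the final status lets the caller decide what to do when a dataset ends in `Failed` instead of only printing it.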

### Test the accuracy of the trained Custom Speech model by creating evaluations (tests)

> ✨ ***Note*** <br>
> If the evaluation reports an error, check the status of the uploaded data files.

In [163]:
import requests

def new_create_evaluation(base_url, headers, project_id, dataset_id, model1_id, model2_id, display_name, description, locale):
    """
    Creates an evaluation job that compares two models on the given dataset.
    """
    evaluations_url = f'{base_url}/evaluations?api-version=2024-11-15'
    print(evaluations_url)
    print(project_id, dataset_id, model1_id, model2_id)

    evaluation_body = {
        "model1": {
            "self": f'{base_url}/models/{model1_id}'
        },
        "model2": {
            "self": f'{base_url}/models/{model2_id}'
        },
        "dataset": {
            "self": f'{base_url}/datasets/{dataset_id}'
        },
        "project": {
            "self": f'{base_url}/projects/{project_id}'
        },
        "displayName": display_name,
        "description": description,
        "locale": locale,
        
    }
    
    print(evaluation_body)

    response = requests.post(evaluations_url, headers=headers, json=evaluation_body)
    response.raise_for_status()
    evaluation = response.json()
    evaluation_id = evaluation['self'].split('/')[-1]
    print(f'Evaluation job created with ID: {evaluation_id}')
    return evaluation_id

In [None]:
description = f"{CUSTOM_SPEECH_LOCALE} Evaluation of the {CUSTOM_SPEECH_LANG} base and custom model"
evaluation_ids = {}
for display_name in uploaded_files:
    evaluation_ids[display_name] = new_create_evaluation(
        base_url=base_url,
        headers=headers,
        project_id=project_id,
        dataset_id=dataset_ids[display_name],
        model1_id=base_model_id,
        model2_id=custom_model_with_acoustic_id,
        display_name=f'{CUSTOM_SPEECH_LOCALE}_eval_base_vs_custom_{display_name}',
        description=description,
        locale=CUSTOM_SPEECH_LOCALE
    )

In [None]:
for display_name in uploaded_files:
    monitor_status(base_url, headers, get_evaluation_status, evaluation_ids[display_name])

## 🧪 Print evaluation result with WER

In [None]:
import pandas as pd

# Collect WER results for each dataset
wer_results = []
eval_title = "Evaluation Results for base model and custom model: "
for display_name in uploaded_files:
    eval_info = get_evaluation_results(base_url, headers, evaluation_ids[display_name])
    print(eval_info)
    eval_title = eval_title + display_name + " "
    wer_results.append({
        'Dataset': display_name,
        'WER_base_model': eval_info['properties']['wordErrorRate1'],
        'WER_custom_model': eval_info['properties']['wordErrorRate2'],
    })

# Create a DataFrame to display the results
wer_df = pd.DataFrame(wer_results)
print(eval_title)
print(wer_df)
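
To compare the two WER columns at a glance, one option is the relative WER reduction of the custom model against the base model. This metric is not computed by the original notebook; `relative_wer_reduction` is a hypothetical helper, and the numbers in the usage line are illustrative, not measured results.

```python
def relative_wer_reduction(wer_base, wer_custom):
    """Return the relative WER reduction in percent.

    Positive values mean the custom model makes fewer errors than the base
    model; negative values mean it makes more.
    """
    if wer_base <= 0:
        raise ValueError("base WER must be positive to compute a relative change")
    return (wer_base - wer_custom) / wer_base * 100


# Illustrative values only: a drop from 20% WER to 15% WER.
print(relative_wer_reduction(20.0, 15.0))
```

With the DataFrame above, the same formula could be applied row by row to `WER_base_model` and `WER_custom_model`.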

In [None]:
# Create a markdown file for table scoring results
md_table_scoring_result(base_url, headers, evaluation_ids, uploaded_files)