# Commercial STT Models Reference Notebook

This notebook provides working examples for calling commercial Speech-to-Text APIs available on Azure, GCP, and AWS.

## Models Covered (based on [Artificial Analysis STT Benchmark](https://artificialanalysis.ai/speech-to-text))

### Available on Cloud Platforms:
- **AWS**: Amazon Transcribe (WER: 14.01%)
- **GCP**: Google Chirp 2 (WER: 11.58%), Google Chirp 3 (WER: 14.97%)
- **Azure**: Azure Speech Services (not in benchmark but widely used)

### Third-Party APIs (accessible from any cloud):
- **AssemblyAI**: Universal-2 (WER: 14.49%)
- **Rev AI** (WER: 15.21%)
- **Speechmatics**: Enhanced (WER: 14.41%, $6.70/1000 min)
- **ElevenLabs**: Scribe (WER: 15.07%)
- **Deepgram**: Nova-2 (commonly used, check current benchmark)

## Setup

This notebook assumes you have credentials configured for:
1. AWS (via boto3/environment variables)
2. GCP (via service account JSON)
3. Azure (via environment variables)
4. API keys for third-party services

In [None]:
# Install required packages via terminal
# uv add boto3 google-cloud-storage google-cloud-speech azure-cognitiveservices-speech assemblyai deepgram-sdk python-dotenv

In [None]:
import os
import json
from pathlib import Path
from typing import Dict, Optional

# For testing - point to a sample audio file
SAMPLE_AUDIO_PATH = "../data/audio/testing/transcript_audio_sample.mp3"

In [None]:
import os
from dotenv import load_dotenv
load_dotenv(dotenv_path='../credentials/creds.env')

## Credentials Setup

Load credentials from environment variables or a secure config file.

In [None]:
# Load credentials (update paths as needed)
# For production, use environment variables or secret management services

# AWS - leave the keys unset (None) so boto3 falls back to its default credential chain
# (IAM role, ~/.aws/credentials, env vars)
AWS_REGION = os.getenv("AWS_REGION", "us-east-1")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")

# GCP - set path to service account JSON
GCP_CREDENTIALS_PATH = os.getenv("GCP_CREDENTIALS_PATH", "path/to/service-account.json")

# Azure
AZURE_SPEECH_KEY = os.getenv("AZURE_SPEECH_KEY", "your-key-here")
AZURE_REGION = os.getenv("AZURE_REGION", "eastus")

# Third-party API keys
ASSEMBLYAI_API_KEY = os.getenv("ASSEMBLYAI_API_KEY", "your-key-here")
DEEPGRAM_API_KEY = os.getenv("DEEPGRAM_API_KEY", "your-key-here")
REVAI_API_KEY = os.getenv("REVAI_API_KEY", "your-key-here")
SPEECHMATICS_API_KEY = os.getenv("SPEECHMATICS_API_KEY", "your-key-here")
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY", "your-key-here")
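
Because the `os.getenv` calls above fall back to placeholder strings, a missing key only surfaces later as an authentication error. A minimal sketch of an early check follows. `find_unconfigured`, `PLACEHOLDER_VALUES`, and `CHECKED_VARS` are hypothetical names introduced here, and the check assumes the credentials live in environment variables (for example after `load_dotenv`).

```python
import os

# Placeholder strings from the credentials cell; a variable holding one of
# these (or nothing at all) has not really been configured.
PLACEHOLDER_VALUES = {"your-key-here", "path/to/service-account.json"}

def find_unconfigured(names, env=None):
    """Return the names that are unset, empty, or still a placeholder."""
    env = os.environ if env is None else env
    return [
        name for name in names
        if not env.get(name) or env.get(name) in PLACEHOLDER_VALUES
    ]

# Variables this notebook reads for GCP, Azure, and the third-party APIs.
CHECKED_VARS = [
    "GCP_CREDENTIALS_PATH", "AZURE_SPEECH_KEY", "ASSEMBLYAI_API_KEY",
    "DEEPGRAM_API_KEY", "REVAI_API_KEY", "SPEECHMATICS_API_KEY",
    "ELEVENLABS_API_KEY",
]

missing = find_unconfigured(CHECKED_VARS)
if missing:
    print("Not configured:", ", ".join(missing))
```

Providers you do not plan to call can simply be dropped from `CHECKED_VARS`.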

---
# GCP - Google Cloud Speech-to-Text

In [None]:
import os
from google.cloud import storage
from google.cloud import speech_v2
from google.cloud.speech_v2 import SpeechClient
from google.cloud.speech_v2.types import cloud_speech
from google.api_core.client_options import ClientOptions

# in case no bucket is created
# storage_client = storage.Client()
# bucket = storage_client.create_bucket("memoirji-amia-2025-temp", location = "us-central1")
# helper to delete bucket
# bucket = storage_client.get_bucket("memoirji-amia-2025-temp")
# bucket.delete()

def upload_file(storage_client, bucket_name, source_file_path, target_filename):
    """Uploads a file to GCS and returns the GCS URI."""
    try:
        bucket = storage_client.bucket(bucket_name)
        print("Using bucket")
        blob = bucket.blob(target_filename)
        blob.upload_from_filename(source_file_path)
        print(f"File {source_file_path} uploaded.")
        gcs_uri = f"gs://{bucket_name}/{target_filename}"
        return gcs_uri
    except Exception as e:
        print(f"Upload error: {e}")
        return False
    
def delete_file(storage_client, bucket_name, target_filename):
    """Deletes a blob from the bucket."""
    try:
        bucket = storage_client.bucket(bucket_name)
        blob = bucket.blob(target_filename)
        blob.delete()
        print(f"Blob {target_filename} deleted from {bucket_name}.")
        return True
    except Exception as e:
        print(f"Delete error: {e}")
        return False

def transcribe_with_gcp_chirp(audio_uri: str, speech_client, model: str, recognizer_location: str):
    """
    Transcribe audio using Google Cloud Speech-to-Text with Chirp models.
    
    Args:
        audio_uri: GCS URI (gs://bucket/file.mp3)
        speech_client: SpeechClient instance
        model: Model ID passed to the API ('chirp', 'chirp_2', or 'chirp_3')
        recognizer_location: Recognizer location; must match the client's API endpoint (e.g. 'us-central1' or 'us')
    
    Returns:
        Dict with transcription results
    """
    try:
        print(f"Starting transcription with {model}...")
        
        config = cloud_speech.RecognitionConfig(
            auto_decoding_config=cloud_speech.AutoDetectDecodingConfig(),
            language_codes=["en-US"],
            model=model,
        )
        print("Config created")
        
        file_metadata = cloud_speech.BatchRecognizeFileMetadata(uri=audio_uri)
        print(f"Processing: {audio_uri}")
        
        request = cloud_speech.BatchRecognizeRequest(
            recognizer=f"projects/memoirji-amia-2025/locations/{recognizer_location}/recognizers/_",
            config=config,
            files=[file_metadata],
            recognition_output_config=cloud_speech.RecognitionOutputConfig(
                inline_response_config=cloud_speech.InlineOutputConfig(),
            ),
        )
        
        operation = speech_client.batch_recognize(request=request)
        print("Waiting for operation to complete...")
        response = operation.result(timeout=600)
        
        # Parse the response correctly
        # Structure: response.results[uri].transcript.results[...].alternatives[0].transcript
        transcript_segments = []
        
        # Get the result for our URI
        if audio_uri in response.results:
            uri_result = response.results[audio_uri]
            
            # Access the transcript results
            if hasattr(uri_result, 'transcript') and hasattr(uri_result.transcript, 'results'):
                for result in uri_result.transcript.results:
                    if result.alternatives:
                        transcript_segments.append(result.alternatives[0].transcript)
        
        full_transcript = ' '.join(transcript_segments)
        
        print(f"✓ Transcription complete! ({len(transcript_segments)} segments)")
        
        return {
            'provider': f'GCP Speech-to-Text ({model})',
            'text': full_transcript,
            'segments': transcript_segments,
            'full_response': response
        }
        
    except Exception as e:
        print(f"Transcription error: {e}")
        import traceback
        traceback.print_exc()
        return {
            'provider': f'GCP Speech-to-Text ({model})',
            'error': str(e)
        }

In [None]:
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = GCP_CREDENTIALS_PATH
bucket_name = "memoirji-amia-2025-temp"

source_file_path = SAMPLE_AUDIO_PATH
target_filename = os.path.basename(source_file_path)

# Model selection determines endpoint and recognizer location
google_stt_model = "chirp_3"  # one of: chirp, chirp_2, chirp_3

# IMPORTANT: Different models are available in different locations,
# and the API endpoint must match the recognizer location
if google_stt_model in ["chirp", "chirp_2"]:
    # Chirp / Chirp 2: available in us-central1
    api_endpoint = "us-central1-speech.googleapis.com"
    recognizer_location = "us-central1"
elif google_stt_model == "chirp_3":
    # Chirp 3: served from the "us" multi-region (NOT us-central1)
    api_endpoint = "us-speech.googleapis.com"
    recognizer_location = "us"
else:
    # Default fallback: the global endpoint pairs with the "global" location
    api_endpoint = "speech.googleapis.com"
    recognizer_location = "global"
print(f"Using {google_stt_model} with location: {recognizer_location}")

speech_client = SpeechClient(
    client_options=ClientOptions(
        api_endpoint=api_endpoint,
    )
)

storage_client = storage.Client()

# Upload and transcribe (skip transcription if the upload failed)
gcs_filepath = upload_file(storage_client, bucket_name, source_file_path, target_filename)

if gcs_filepath:
    result_gcp_chirp = transcribe_with_gcp_chirp(
        gcs_filepath,
        speech_client,
        model=google_stt_model,
        recognizer_location=recognizer_location
    )

    # Clean up
    delete_file(storage_client, bucket_name, target_filename)

    # Show result
    if 'text' in result_gcp_chirp:
        print("\n✓ Transcription successful!")
        print(f"Preview: {result_gcp_chirp['text'][:200]}...")
    else:
        print(f"\n✗ Error: {result_gcp_chirp.get('error', 'Unknown error')}")
else:
    print("Upload failed; skipping transcription.")

---
# AWS - Amazon Transcribe
AWS's own claim:

> Easily embed voice technologies in your applications with Amazon Transcribe, a fully managed, multi-billion parameter speech foundation model that instantly converts real-time or recorded speech into text. It is trained on millions of hours of audio data across a variety of languages.

**WER: 14.01%** (from Artificial Analysis)

Amazon Transcribe offers both batch and streaming transcription. This example shows batch transcription.

In [None]:
import boto3
import time
import urllib.request  # "import urllib" alone does not guarantee urllib.request is loaded

# define bucket/ target obj key
bucket_name = "amia2025-test-bucket"
# define aws transcribe job name
aws_transcribe_job_name = "transcribe-job-test"

source_file_path = SAMPLE_AUDIO_PATH
target_filename = os.path.basename(source_file_path)
s3_object_key = target_filename

# set up s3 client and test upload
s3 = boto3.client('s3', region_name=AWS_REGION)

s3_uri = f"s3://{bucket_name}/{s3_object_key}"
try:
    s3.upload_file(SAMPLE_AUDIO_PATH, bucket_name, s3_object_key)
    print(f"Successfully uploaded {SAMPLE_AUDIO_PATH} to {bucket_name}/{s3_object_key}")
    print(f"S3 URI: {s3_uri}")
except Exception as e:
    print(f"Error uploading file: {e}")
    raise  # stop here: the transcription job below needs the uploaded object

# set up AWS transcribe service client
aws_transcribe = boto3.client(
    'transcribe',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_REGION)
# start transcription job

l_media_objects = [s3_uri]

for media_uri in l_media_objects:
    try:
        job = aws_transcribe.start_transcription_job(
            TranscriptionJobName=aws_transcribe_job_name,
            Media={'MediaFileUri': media_uri},
            # MediaFormat is left unset: in prod the audio (mp3 or mp4) will be
            # normalized to 16 kHz mono WAV before transcription
            # MediaFormat='mp3',
            LanguageCode='en-US')
        while True:
            # job status monitoring and transcription results parsing
            job = aws_transcribe.get_transcription_job(TranscriptionJobName=aws_transcribe_job_name)
            status = job['TranscriptionJob']['TranscriptionJobStatus']
            if status == 'COMPLETED':
                print(f"Job {aws_transcribe_job_name} completed")
                with urllib.request.urlopen(job['TranscriptionJob']['Transcript']['TranscriptFileUri']) as r:
                    data = json.loads(r.read())
                print(data['results']['transcripts'][0]['transcript'][:50] + "...")
                response = s3.delete_object(Bucket=bucket_name, Key=s3_object_key)
                print(f"Object '{s3_object_key}' deleted successfully from bucket '{bucket_name}'.")
                break
            elif status == 'FAILED':
                print(f"Job {aws_transcribe_job_name} failed")
                print(job['TranscriptionJob'].get('FailureReason', 'No failure reason reported'))
                break
            else:
                print(f"Status of job {aws_transcribe_job_name}: {status}")
                time.sleep(10)
        # remove job
        aws_transcribe.delete_transcription_job(TranscriptionJobName=aws_transcribe_job_name)
        print(f"Job {aws_transcribe_job_name} removed after transcription completed")
    except Exception as e:
        print(e)
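
The polling loop above has no upper bound on how long it waits, and Transcribe rejects a job name that is already in use, so a leftover job from an earlier run blocks the fixed name. The sketch below shows one way to bound the wait and generate a fresh name per run. `unique_job_name` and `wait_for_terminal_status` are hypothetical helpers, not part of boto3; the status strings are the ones the loop above already checks, and the default interval and timeout are arbitrary.

```python
import time
import uuid

def unique_job_name(prefix: str = "transcribe-job") -> str:
    """Append a short random suffix so repeated runs do not reuse a name."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"

def wait_for_terminal_status(get_status, terminal=("COMPLETED", "FAILED"),
                             poll_interval_s=10.0, timeout_s=1800.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in terminal:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(poll_interval_s)
```

With the client from the cell above, `get_status` could be a lambda that calls `aws_transcribe.get_transcription_job(...)` and returns `['TranscriptionJob']['TranscriptionJobStatus']`. The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting.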

---
# Azure - Azure Speech Services

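Azure Speech Services is not covered by the Artificial Analysis benchmark, so no WER figure is listed for it here. It is included because it is widely used and well integrated with the Azure ecosystem.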
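
A minimal sketch of file transcription with the Speech SDK (`azure-cognitiveservices-speech`, already in the install list above). It assumes a WAV input, since the SDK relies on GStreamer for compressed formats such as MP3, and it uses `recognize_once()`, which returns after the first utterance; longer recordings would need continuous recognition instead. The function name `transcribe_with_azure` and the result-dict shape (mirroring the GCP helper) are choices made here, not part of the SDK.

```python
def transcribe_with_azure(audio_path: str, speech_key: str, region: str,
                          language: str = "en-US") -> dict:
    """Transcribe the first utterance of a WAV file with Azure Speech Services."""
    # Imported lazily so the rest of the notebook runs without the Azure SDK.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=region)
    speech_config.speech_recognition_language = language
    audio_config = speechsdk.audio.AudioConfig(filename=audio_path)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    # Single-shot recognition: stops at the first pause in speech.
    result = recognizer.recognize_once()
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return {"provider": "Azure Speech Services", "text": result.text}
    return {
        "provider": "Azure Speech Services",
        "error": f"Recognition ended with reason: {result.reason}",
    }
```

A call would look like `transcribe_with_azure("sample.wav", AZURE_SPEECH_KEY, AZURE_REGION)`, where `sample.wav` stands in for a WAV version of the sample audio.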

---
# Batch Testing & Comparison

Helper functions to test multiple providers and compare results.
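
Since the models above are ranked by word error rate (WER), a comparison helper needs a WER function and a reference transcript. The sketch below computes WER as word-level edit distance divided by the reference word count, after lowercasing and whitespace splitting only. Published benchmarks typically apply heavier text normalization, so these numbers should not be expected to match theirs. `word_error_rate` and `compare_providers` are hypothetical names, and the input is assumed to be result dicts shaped like the GCP helper's output (`'provider'` plus either `'text'` or `'error'`).

```python
from typing import Dict, List, Optional, Tuple

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)

def compare_providers(results: List[Dict],
                      reference: str) -> List[Tuple[str, Optional[float]]]:
    """Return (provider, WER) pairs, best first; failed providers get None and sort last."""
    scored = []
    for result in results:
        wer = word_error_rate(reference, result["text"]) if "text" in result else None
        scored.append((result["provider"], wer))
    return sorted(scored, key=lambda pair: (pair[1] is None, pair[1] or 0.0))
```

For example, `compare_providers([result_gcp_chirp], reference_text)` would score the GCP result from earlier in the notebook, given a human-verified `reference_text` for the sample audio.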

---
# Integration with Pipeline

When integrating these models into your benchmarking pipeline, consider:

1. **Standardized Interface**: Create a wrapper that normalizes outputs across providers
2. **Rate Limiting**: Implement backoff/retry for API limits
3. **Cost Tracking**: Log API calls and costs for budget management
4. **Error Handling**: Graceful fallbacks when services are unavailable
5. **Batch Processing**: Some providers offer batch discounts

Example standardized interface:
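
A minimal sketch of such a wrapper, covering points 1, 2, and 4 of the list above. `TranscriptionResult` and `run_provider` are hypothetical names. The wrapper assumes each provider function takes a single audio reference and returns a dict with `'text'` on success or `'error'` on failure, as `transcribe_with_gcp_chirp` does once its other arguments are bound (for example with `functools.partial`).

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

@dataclass
class TranscriptionResult:
    """Provider-independent view of one transcription request."""
    provider: str
    text: Optional[str] = None
    error: Optional[str] = None
    latency_s: float = 0.0
    attempts: int = 0
    raw: Any = None  # untouched provider payload, kept for debugging

    @property
    def ok(self) -> bool:
        return self.error is None

def run_provider(name: str,
                 transcribe_fn: Callable[[str], Dict[str, Any]],
                 audio_ref: str,
                 max_attempts: int = 3,
                 base_delay_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> TranscriptionResult:
    """Call transcribe_fn with exponential backoff and normalize its output."""
    error = "Unknown error"
    for attempt in range(1, max_attempts + 1):
        start = time.perf_counter()
        try:
            payload = transcribe_fn(audio_ref)
        except Exception as exc:  # treat a raised exception like a returned error
            payload = {"error": str(exc)}
        latency = time.perf_counter() - start
        if "text" in payload:
            return TranscriptionResult(name, text=payload["text"], latency_s=latency,
                                       attempts=attempt, raw=payload)
        error = payload.get("error", "Unknown error")
        if attempt < max_attempts:
            # Delay doubles after every failed attempt: base, 2x base, 4x base, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
    return TranscriptionResult(name, error=error, attempts=max_attempts)
```

Retrying on every error is a simplification; a real pipeline would likely retry only on rate-limit or transient network errors and fail fast on authentication or input problems.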

## Resources

- [Artificial Analysis STT Benchmark](https://artificialanalysis.ai/speech-to-text)
- [AWS Transcribe Docs](https://docs.aws.amazon.com/transcribe/)
- [GCP Speech-to-Text Docs](https://cloud.google.com/speech-to-text)
- [Azure Speech Services Docs](https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/)
- [AssemblyAI Docs](https://www.assemblyai.com/docs)
- [Deepgram Docs](https://developers.deepgram.com/)
- [Rev AI Docs](https://docs.rev.ai/)
- [Speechmatics Docs](https://docs.speechmatics.com/)
- [ElevenLabs Docs](https://elevenlabs.io/docs)