# Python Speech2Text Tutorial

This tutorial demonstrates how to use the Speech2Text Python API for running Speech-to-Text models on Hailo hardware.

The Speech2Text API provides audio transcription capabilities using Whisper-based models, supporting both transcription and translation tasks with configurable language settings.

**Key Features:**

- Audio transcription with timestamped segments
- Language translation capabilities
- Support for multiple output formats (segments vs complete text)
- Configurable task types (transcribe/translate)
- Language-specific processing

**Best Practice: Context Manager**
This tutorial does not use context managers, because the `VDevice` and `Speech2Text` objects are shared across notebook cells. In your own code, create `VDevice` and `Speech2Text` with `with` statements whenever possible. When not using `with`, call `Speech2Text.release()` and `VDevice.release()` to clean up resources.
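
As a minimal sketch of the recommended pattern, the function below wraps both objects in `with` statements so they are released when the block exits. The function name `transcribe_with_context_managers` is hypothetical, and the imports are placed inside the function only so the snippet can be loaded on a machine without HailoRT installed.

```python
def transcribe_with_context_managers(model_path, audio):
    """Transcribe `audio` and release the model and device on exit (sketch)."""
    # Lazy imports: hailo_platform is only needed when the function is called
    from hailo_platform import VDevice
    from hailo_platform.genai import Speech2Text, Speech2TextTask

    # The inner block exits first, so Speech2Text is released before VDevice
    with VDevice() as vdevice:
        with Speech2Text(vdevice, model_path) as speech2text:
            return speech2text.generate_all_text(
                audio, task=Speech2TextTask.TRANSCRIBE, language="en"
            )
```

Calling it requires Hailo hardware, a Speech2Text HEF file, and audio in the format described under the requirements.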

**Requirements:**

* Run the notebook inside the Python virtual environment: ```source hailo_virtualenv/bin/activate```
* A Speech2Text HEF file (Hailo Executable Format for Speech-to-Text models)
* Audio files in PCM float32 format (normalized to [-1.0, 1.0), mono, little-endian, 16 kHz)
* NumPy for audio processing: ```pip install numpy```

**Audio Format Requirements:**

The audio input must be in a specific format for proper processing:

- **Format**: PCM float32 normalized to [-1.0, 1.0)
- **Channels**: Mono (single channel)
- **Endianness**: Little-endian
- **Sample Rate**: 16 kHz
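
One way to produce a file in this format, sketched under some assumptions: the input is a 16-bit PCM WAV file that is already sampled at 16 kHz. The helper name `wav_to_float32_raw` is hypothetical and not part of the Hailo API; it uses only the standard-library `wave` module and NumPy, and it does not resample.

```python
import wave

import numpy as np

TARGET_SAMPLE_RATE = 16000  # 16 kHz, per the format requirements


def wav_to_float32_raw(wav_path, out_path):
    """Convert a 16-bit PCM WAV file to raw little-endian float32 mono samples."""
    with wave.open(wav_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("This sketch only handles 16-bit PCM input")
        if wav.getframerate() != TARGET_SAMPLE_RATE:
            raise ValueError("Resample the audio to 16 kHz first")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    # WAV stores 16-bit samples as little-endian signed integers,
    # interleaved by channel: one row per frame, one column per channel
    samples = np.frombuffer(frames, dtype="<i2").reshape(-1, channels)
    # Dividing by 32768 maps the int16 range [-32768, 32767] into [-1.0, 1.0)
    audio = samples.astype(np.float32) / 32768.0
    # Average the channels to obtain a mono signal, stored as little-endian float32
    mono = audio.mean(axis=1).astype("<f4")
    mono.tofile(out_path)
    return mono
```

The resulting file contains only raw samples with no header, so the sample rate and channel count must be tracked separately.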

**Tutorial Structure:**

* Basic Speech2Text initialization and audio loading
* Transcription with timestamped segments
* Complete text transcription
* Language translation capabilities
* Task configuration (transcribe vs translate)
* Language-specific processing

When inside the ```virtualenv```, use the command ``jupyter-notebook <tutorial-dir>`` to open a Jupyter server that contains the tutorials (default folder on GitHub: ``hailort/libhailort/bindings/python/platform/hailo_tutorials/notebooks/``).


In [None]:
# Speech2Text Tutorial: Setup and Configuration

from hailo_platform import VDevice
from hailo_platform.genai import Speech2Text, Speech2TextTask
import numpy as np
import os

# Configuration - Update these paths for your setup
MODEL_PATH = "/your/hef/path/speech2text.hef"  # Update this path
AUDIO_FILE_PATH = "/your/audio/file/path/audio.bin"  # Update this path

print("Model path: {}".format(MODEL_PATH))
print("Audio file path: {}".format(AUDIO_FILE_PATH))

vdevice = VDevice()
print("Initializing Speech2Text... this may take a moment...")
speech2text = Speech2Text(vdevice, MODEL_PATH)
print("Speech2Text initialized successfully!")


## Audio Loading and Format Validation

Load the raw audio file into a NumPy array. The loader assumes the file already matches the required format and does not verify it.


In [None]:
def load_audio_file(file_path):
    """
    Load an audio file as binary data and convert it to a NumPy array.
    Expected format: PCM float32 normalized to [-1.0, 1.0), mono, little-endian, 16 kHz
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"Audio file not found: {file_path}")
    
    with open(file_path, 'rb') as f:
        audio_data = f.read()
    
    print(f"Loaded audio file: {file_path}")
    print(f"Audio data size: {len(audio_data)} bytes")
    
    audio_array = np.frombuffer(audio_data, dtype='<f4').copy()
    return audio_array

# Load audio file
audio_data = load_audio_file(AUDIO_FILE_PATH)
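
Because a raw file carries no header, the loader cannot tell whether the bytes really are float32 samples at 16 kHz. The following is a sketch of basic sanity checks that could be run on the decoded array before inference. The helper name `validate_audio` is hypothetical, and the returned duration is only meaningful if the file was in fact recorded at 16 kHz.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz, per the format requirements


def validate_audio(audio_array, sample_rate=SAMPLE_RATE):
    """Sanity-check decoded samples and return the clip duration in seconds."""
    if audio_array.ndim != 1:
        raise ValueError("Expected a 1-D mono signal")
    if audio_array.dtype != np.float32:
        raise ValueError(f"Expected float32 samples, got {audio_array.dtype}")
    if audio_array.size == 0:
        raise ValueError("Audio is empty")
    if not np.all(np.isfinite(audio_array)):
        raise ValueError("Audio contains NaN or infinite values")
    if audio_array.min() < -1.0 or audio_array.max() >= 1.0:
        raise ValueError("Samples fall outside [-1.0, 1.0)")
    # Duration assumes the samples were recorded at `sample_rate`
    return audio_array.size / sample_rate
```

For example, `duration = validate_audio(audio_data)` would raise a `ValueError` for out-of-range or non-finite samples, which can indicate that the file was saved in a different sample format.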


## Basic Transcription with Segments

Generate transcription with timestamped segments for detailed analysis.


In [None]:
task = Speech2TextTask.TRANSCRIBE  # or Speech2TextTask.TRANSLATE
language = "en"  # ISO-639-1 language code

print("Generating transcription with segments...")
segments = speech2text.generate_all_segments(audio_data, task=task, language=language)

print("\nTranscription Results:")
print("=" * 50)
for i, segment in enumerate(segments):
    print(f"Segment {i+1}: {segment}")
    print("-" * 30)
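
For more readable output, the segments can be formatted as timestamped lines. The sketch below assumes each segment exposes `start_sec`, `end_sec`, and `text` attributes, as used elsewhere in this tutorial. `FakeSegment`, `format_timestamp`, and `format_segments` are hypothetical names introduced for illustration; `FakeSegment` only stands in for the real segment objects so the example runs without hardware.

```python
from dataclasses import dataclass


@dataclass
class FakeSegment:
    """Stand-in for a returned segment (hypothetical, for illustration only)."""
    start_sec: float
    end_sec: float
    text: str


def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments):
    """Return one '[start -> end] text' line per segment."""
    return [
        f"[{format_timestamp(s.start_sec)} -> {format_timestamp(s.end_sec)}] {s.text.strip()}"
        for s in segments
    ]


demo = [FakeSegment(0.0, 2.5, " Hello there."), FakeSegment(2.5, 4.0, " Goodbye.")]
for line in format_segments(demo):
    print(line)
```

With real results, `format_segments(segments)` could be called directly on the list returned by `generate_all_segments`.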


## Complete Text Transcription

Generate complete transcription as a single string without timestamps.


In [None]:
# Generate complete transcription
print("Generating complete transcription...")
complete_text = speech2text.generate_all_text(audio_data, task=task, language=language)

print("\nComplete Transcription:")
print("=" * 50)
print(complete_text)


## Language Translation Example

Demonstrate translation by transcribing speech in one language and translating it into English.


In [None]:
task = Speech2TextTask.TRANSLATE  # Translate to English
language = "es"  # Source language of the audio (Spanish in this example)

print("Generating translation...")
translation_segments = speech2text.generate_all_segments(audio_data, task=task, language=language)

print("\nTranslation Results:")
print("=" * 50)
for i, segment in enumerate(translation_segments):
    print(f"Segment {i+1}: [{segment.start_sec:.2f}s - {segment.end_sec:.2f}s]")
    print(f"Translation: {segment.text}")
    print("-" * 30)


## Language Detection Example

Demonstrates automatic language detection.
Instead of specifying a language, the model detects it from the audio input,
allowing seamless transcription or translation of audio in an unknown language.


In [None]:
# Automatically detect language by omitting the 'language' parameter
print("Generating transcription with automatic language detection...")
auto_detected_segments = speech2text.generate_all_segments(audio_data, task=Speech2TextTask.TRANSCRIBE)

print("\nAuto-Detected Language Transcription Results:")
print("=" * 50)
for i, segment in enumerate(auto_detected_segments):
    print(f"Segment {i+1}: [{segment.start_sec:.2f}s - {segment.end_sec:.2f}s]")
    print(f"Text: {segment.text}")
    print("-" * 30)


## Tokenization Example

The GenAI HEF comes with tokenization information, allowing the encoding of text into tokens.

In [None]:
tokens = speech2text.tokenize(complete_text)
print("The transcription has {} tokens: {}".format(len(tokens), tokens))

## Cleanup and Resource Management

Release resources when you are done. Best practice is to use context managers when possible.


In [None]:
# Clean up resources
print("Cleaning up resources...")
speech2text.release()
vdevice.release()
print("Resources cleaned up successfully!")
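
When `with` statements are not practical, cleanup can still be made robust against exceptions. The sketch below uses the standard-library `contextlib.ExitStack` to register each `release()` call as soon as the object is created. The helper `run_with_cleanup` and its factory arguments are hypothetical; the only assumption about the resources is that they expose a `release()` method, as `VDevice` and `Speech2Text` do in this tutorial.

```python
from contextlib import ExitStack


def run_with_cleanup(create_vdevice, create_model, work):
    """Create both resources, run `work(model)`, and always release them."""
    with ExitStack() as stack:
        vdevice = create_vdevice()
        stack.callback(vdevice.release)
        model = create_model(vdevice)
        stack.callback(model.release)
        # Callbacks run in reverse order of registration when the block exits,
        # even on error: the model is released first, then the device
        return work(model)
```

A call might look like `run_with_cleanup(VDevice, lambda vd: Speech2Text(vd, MODEL_PATH), lambda s2t: s2t.generate_all_text(audio_data, task=task, language=language))`, which mirrors the release order used in the cleanup cell.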
