# Speech-to-Text (STT) with OpenAI Whisper

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nontgcob/HKU-InnoWing-STT-TTS-Workshop/blob/main/InnoWing_Speech_to_Text.ipynb)

This notebook demonstrates how to use the OpenAI Whisper model for Speech-to-Text (STT) transcription. We will install the library, load one of the smaller, faster model sizes, and then transcribe audio from a file or from the browser microphone.

Note: Speech-to-Text (STT) is sometimes referred to as Automatic Speech Recognition (ASR). These two terms can be used interchangeably.

In [None]:
# The first step is to install the necessary library: `openai-whisper`.
# The `-q` flag ensures a quiet installation, meaning it won't print all the installation details.
!pip install -q openai-whisper

Whisper calls the `ffmpeg` command-line tool to decode audio files, so make sure it is installed in your Colab environment.

In [None]:
# First, we update the list of available packages.
!apt-get update
# Then, we install ffmpeg. The `-y` flag automatically confirms any prompts.
!apt-get install -y ffmpeg
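
Before loading a model, it can be worth confirming that `ffmpeg` is actually reachable from Python. The following is a minimal sketch using only the standard library; the helper name `ffmpeg_available` is ours for illustration and is not part of Whisper.

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None

if ffmpeg_available():
    print("ffmpeg found at:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; run the install cell above first.")
```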

### 1. Load the Whisper Model

We will load the `base` model, which offers a good balance between speed and accuracy among the smaller models. You can choose other sizes such as `tiny`, `small`, `medium`, or `large` depending on your needs: smaller models run faster and use less memory, but are less accurate.

In [None]:
import whisper # Import the Whisper library to use its functions

# Choose the model size. 'base' is chosen here for a good balance of speed and accuracy.
# You can try 'tiny', 'small', 'medium', or 'large' if you wish.
# For English-only transcription, the 'base.en' variant tends to be somewhat more accurate.
model_name = "base"

# Load the selected Whisper model into memory.
# This step downloads the model weights the first time you run it.
model = whisper.load_model(model_name)

print(f"Whisper model '{model_name}' loaded successfully.")
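
The English-only checkpoints follow a simple naming rule: the size name plus a `.en` suffix, which exists for `tiny`, `base`, `small`, and `medium` but not for `large`. The sketch below builds the name that would be passed to `whisper.load_model`; `pick_model_name` is a hypothetical helper written for this illustration, not a Whisper function.

```python
# Sizes that have an English-only (".en") checkpoint; "large" is multilingual only.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}

def pick_model_name(size: str, english_only: bool = False) -> str:
    """Build a Whisper model name such as 'base' or 'base.en'."""
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    return size

print(pick_model_name("base", english_only=True))   # base.en
print(pick_model_name("large", english_only=True))  # large
```

The returned string can then be assigned to `model_name` in the cell above.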

### 2.1. Perform Inference: Transcribe from Audio File

The code cell below transcribes an audio file with the loaded Whisper model. It is commented out because this workshop focuses on microphone input.

If you have your own audio file (e.g. `my_audio.mp3`) and wish to transcribe it, uncomment the entire code block below and replace `my_audio.mp3` with the path to your file.

In [None]:
# print("Transcribing audio from file...")
# user_audio_path = "my_audio.mp3" # <--- Change this to your audio file's path
# result = model.transcribe(user_audio_path)

# print("\n--- Transcription Result from File ---")
# print(result["text"])
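
Besides `result["text"]`, the dictionary returned by `model.transcribe` also carries a `"segments"` list whose entries include `"start"` and `"end"` times in seconds along with the segment `"text"`. Below is a minimal sketch of printing timestamped lines. `format_timestamp` and `format_segments` are hypothetical helpers, and `sample_result` is hand-written stand-in data, not real Whisper output.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"

def format_segments(result: dict) -> list[str]:
    """Turn a transcription result into one timestamped line per segment."""
    lines = []
    for seg in result.get("segments", []):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return lines

# Hand-written stand-in for a transcription result (illustrative only).
sample_result = {
    "text": " Hello there. Welcome to the workshop.",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": " Hello there."},
        {"start": 1.5, "end": 4.25, "text": " Welcome to the workshop."},
    ],
}

for line in format_segments(sample_result):
    print(line)
```

With a real transcription, `format_segments(result)` can be called on the `result` variable from the cell above once it is uncommented.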

### 2.2. Perform Inference: Transcribe from Browser Microphone

To use your browser's microphone, we run a small piece of JavaScript in the notebook via `IPython.display.Javascript`. It records for a fixed number of seconds and passes the audio back to Python through a callback registered with `google.colab.output`. The recorded audio is then saved and transcribed by the Whisper model.

In [None]:
from IPython.display import display, Javascript, Audio # Tools for displaying rich content in Colab
from google.colab import output # Used to register a Python function as a callback for JavaScript
import base64 # For encoding/decoding binary data (our audio) to/from text
import threading # Used to help Python 'wait' for the asynchronous JavaScript recording to finish

# This is a special tool (an 'event' object) that helps synchronize tasks.
# We use it to make sure our Python code waits until the browser has finished recording and saving the audio.
recording_done_event = threading.Event()

def record_audio(duration, filename='microphone_audio.wav', samplerate=16000):
    # Note: 'samplerate' is not used below; the browser chooses the recording format.
    global model # Allows us to use the Whisper 'model' loaded earlier in this function
    global recording_done_event # Allows us to use our synchronization event

    # Clear the event. This prepares it to 'wait' again for a new recording.
    recording_done_event.clear()

    # This is a block of JavaScript code that will run in your web browser.
    # It requests microphone access, records audio for a specified duration,
    # and then sends the recorded audio (as base64 encoded text) back to Python.
    JS_CODE = f'''
        async function recordAudio() {{
            const stream = await navigator.mediaDevices.getUserMedia({{ audio: true }}); // Request microphone access
            const mediaRecorder = new MediaRecorder(stream); // Create a recorder
            const audioChunks = []; // Array to store parts of the audio

            mediaRecorder.addEventListener('dataavailable', event => {{
                audioChunks.push(event.data); // Add audio data as it becomes available
            }});

            const audioPromise = new Promise(resolve => {{
                mediaRecorder.addEventListener('stop', () => {{
                    // When recording stops, combine the audio chunks into a Blob.
                    // The data is in the browser's own recording format (typically WebM or Ogg), not WAV.
                    const audioBlob = new Blob(audioChunks, {{ type: mediaRecorder.mimeType }});
                    const reader = new FileReader();
                    reader.onloadend = () => {{ resolve(reader.result.split(',')[1]); }}; // Convert Blob to base64
                    reader.readAsDataURL(audioBlob);
                }});
            }});

            mediaRecorder.start(); // Start recording
            await new Promise(resolve => setTimeout(resolve, {duration * 1000})); // Record for 'duration' seconds
            mediaRecorder.stop(); // Stop recording

            return await audioPromise; // Return the base64 encoded audio
        }}

        // Call the recordAudio function and then invoke our Python callback with the audio data
        recordAudio().then(audioBase64 => google.colab.kernel.invokeFunction('notebook_record_audio_callback', [audioBase64]));
    '''

    # Display the JavaScript code, which executes it in your browser.
    display(Javascript(JS_CODE))
    print(f"Recording for {duration} seconds...")

    # This nested function is a Python 'callback'. It's designed to be called by the JavaScript
    # code running in your browser, *after* the audio recording is complete.
    def _record_audio_callback(audio_base64):
        # Decode the base64 audio data received from JavaScript back into raw bytes
        audio_bytes = base64.b64decode(audio_base64)
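        # Save the bytes to disk. Despite the .wav file name, the data is in the browser's
        # recording format (typically WebM or Ogg); ffmpeg normally detects the real format
        # from the file contents, so Whisper can still read it.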
        with open(filename, 'wb') as f:
            f.write(audio_bytes)
        print(f"Audio saved to {filename}")

        # Now that the audio is saved, we can proceed with transcribing it using Whisper
        print("Transcribing recorded microphone audio...")
        result_mic = model.transcribe(filename)

        print("\n--- Transcription Result from Microphone ---")
        print(result_mic["text"])

        # As a bonus, we can play back the recorded audio directly in the notebook
        print("\n--- Playing back recorded audio ---")
        display(Audio(filename))

        # Signal that all recording and processing steps are now complete
        recording_done_event.set()

    # Register our Python callback function so that the JavaScript code can call it.
    # Registering on every call to `record_audio` ensures the callback is the one
    # defined above, which uses the `filename` passed to this call.
    output.register_callback('notebook_record_audio_callback', _record_audio_callback)

    # The Python code will now 'wait' here until the 'recording_done_event' is set
    # by our callback function (meaning the recording and transcription are done).
    # We add a generous timeout to ensure the process has enough time to complete.
    timeout_seconds = duration + 10 # Adding 10 seconds buffer for recording and processing
    if not recording_done_event.wait(timeout=timeout_seconds):
        print(f"Warning: Recording or transcription timed out after {timeout_seconds} seconds. This usually happens if microphone permissions were not granted or there's a browser issue. Please check your microphone and browser permissions and try again.")
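
The wait-with-timeout pattern at the end of `record_audio` can be tried in isolation. In this sketch a `threading.Timer` stands in for the browser callback (an assumption made purely for illustration), and `wait_for_signal` is a hypothetical helper. `Event.wait` returns `True` if the event was set in time and `False` if the timeout expired first.

```python
import threading

def wait_for_signal(delay_seconds: float, timeout_seconds: float) -> bool:
    """Wait up to `timeout_seconds` for an event that is set after `delay_seconds`."""
    done = threading.Event()
    # The timer plays the role of the callback that eventually calls done.set().
    timer = threading.Timer(delay_seconds, done.set)
    timer.start()
    finished = done.wait(timeout=timeout_seconds)
    timer.cancel()  # Stops a pending timer; harmless if it has already fired.
    return finished

print(wait_for_signal(0.05, 2.0))  # True: the signal arrives before the timeout
print(wait_for_signal(5.0, 0.05))  # False: the timeout expires first
```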

In [None]:
# Define the recording duration in seconds
recording_duration_seconds = 5

# Record audio for the specified duration
mic_audio_path = "microphone_audio.wav"
record_audio(duration=recording_duration_seconds, filename=mic_audio_path)
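
A browser's `MediaRecorder` typically produces WebM or Ogg data rather than WAV, whatever file name the data is saved under. One quick way to check what a recording actually contains is to look at its first bytes. The sketch below is a simplification that recognises only three well-known signatures; `sniff_container` is a hypothetical helper, and the byte strings in the demo are hand-made headers, not real recordings.

```python
def sniff_container(data: bytes) -> str:
    """Guess the audio container from the leading bytes of a file."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:4] == b"\x1a\x45\xdf\xa3":  # EBML header used by WebM/Matroska
        return "webm/matroska"
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"

# Hand-made headers for demonstration only.
print(sniff_container(b"RIFF\x24\x00\x00\x00WAVEfmt "))  # wav
print(sniff_container(b"\x1a\x45\xdf\xa3\x9f"))          # webm/matroska
print(sniff_container(b"OggS\x00\x02"))                  # ogg
```

To inspect the recording made above, read its first 12 bytes from `mic_audio_path` in binary mode and pass them to `sniff_container`.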