# Using Whisper API to Transcribe Text from Your Device Microphone

The [Audio Whisper API](https://platform.openai.com/docs/guides/speech-to-text) is capable of translating and transcribing speech into written form. It is powered by OpenAI's [`large-v2 Whisper model`](https://github.com/openai/whisper).

In this guide, we will record audio from your device's microphone and use the Audio Whisper API to transcribe it. This functionality is similar to clicking the microphone 🎙️ icon in ChatGPT (note that speech-to-text is not supported in [ChatGPT for Web](http://chatgpt.com)).

![whisper_onChatGPTApp_cvk](../images/whisper_onChatGPTApp_cvk.gif)

We'll be working with WAV files. Although larger than MP3, WAV files store audio uncompressed, so no detail is lost to compression artifacts, which can help the accuracy of transcription and translation.

We will go through the following steps:

1. **Recording:** Capture audio from your device microphone and store it in a temporary file.
2. **Transcribing or Translating:** Use OpenAI's Whisper API to convert the audio to text (either transcribing English or translating other languages to English).
3. **Copying:** Copy the transcribed/translated text to your clipboard.

## Table of Contents

1. [Microphone Permissions](#microphone-permissions)
2. [Setup](#setup)
3. [API Key Setup](#api-key-setup)
4. [Recording Audio](#recording-audio)
5. [Transcribing and Translating Audio](#transcribing-and-translating-audio)
6. [Copying to Clipboard](#copying-to-clipboard)
7. [Main Function and Demos](#main-function-and-demos)
8. [Troubleshooting](#troubleshooting)
9. [FAQ](#faq)
10. [Conclusion](#conclusion)

## Microphone Permissions

Before we start, make sure your Python environment has permission to access the microphone.

### For Windows

1. Open **Settings**.
2. Go to **Privacy > Microphone** (**Privacy & security > Microphone** on Windows 11).
3. Ensure that "Microphone access for this device" is turned on.
4. Ensure that your Python IDE is allowed to access the microphone.

### For macOS

1. Open **System Preferences** (**System Settings** on macOS Ventura and later).
2. Go to **Security & Privacy > Privacy** (**Privacy & Security** on newer versions).
3. Select **Microphone** from the left-hand menu.
4. Ensure that your Python IDE is checked.

## Setup

We need several libraries to record and process audio. `wave` and `tempfile` ship with Python's standard library; the others are installed in the next step:

-   **pyaudio:** To capture audio from the microphone.
-   **wave:** To handle .wav files (standard library).
-   **tempfile:** To create temporary files for storing recordings (standard library).
-   **simpleaudio:** To play back audio (for debugging).
-   **openai:** To access the Whisper API.
-   **pyperclip:** To copy text to the clipboard.
-   **python-dotenv:** To load environment variables (for API keys).

```python
# Install prerequisites (these commands assume macOS with Homebrew; adjust them for your OS):

!brew install ffmpeg -q       # For audio processing
!brew install portaudio -q    # For PyAudio support

# Install Python packages (wave and tempfile are part of the standard library):
%pip install -q simpleaudio pyaudio pyperclip openai python-dotenv
```

```
Note: you may need to restart the kernel to use updated packages.
```


## API Key Setup

1. **Obtain an API key:** Get your API key from the [OpenAI website](https://platform.openai.com/account/api-keys).
2. **Create a .env file:** In your project directory, create a file named `.env`.
3. **Store your API key:** Add the following line to your `.env` file, replacing `your_actual_api_key_here` with your key:

    ```
    OPENAI_API_KEY=your_actual_api_key_here
    ```

```python
from openai import OpenAI
from dotenv import load_dotenv
import os

# Load the API key from the .env file
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize the OpenAI client
client = OpenAI(api_key=openai_api_key)
```
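If the `.env` file is missing or misnamed, `os.getenv` returns `None` and the problem only surfaces later as an authentication error. Below is a minimal sketch of a fail-fast check. `require_api_key` is a hypothetical helper written for this guide (it is not part of `openai` or `python-dotenv`), and it accepts the environment as a parameter so it can be tried without a real key.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key found in `env`, or raise a descriptive error.

    Hypothetical helper for this guide; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Check that your .env file is in the "
            "working directory and that load_dotenv() ran first."
        )
    return key


# Usage with a stand-in environment (no real key needed):
print(require_api_key({"OPENAI_API_KEY": "sk-example"}))
```

In the notebook, you could call `require_api_key()` right after `load_dotenv()` and pass the result to `OpenAI(api_key=...)`.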

## Recording Audio

We'll create a `record_audio` function that handles audio recording. It will support both **timed recording** (for a specified duration) and **manual recording** (stopping when the user presses Enter).

Let's break down the steps involved in the `record_audio` function:

1. **Set Up Temporary File:**

    *   A temporary file is created to store the recorded audio.
    *   Because the file is created with `delete=False`, it is not removed automatically; we delete it explicitly once the audio has been transcribed.

    ```python
    temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_file_name = temp_file.name
    ```

2. **Callback Function:**

    *   The `callback` function is responsible for writing chunks of audio data to the temporary file as they are received from the microphone.

    ```python
    def callback(data_input, frame_count, time_info, status):
        wav_file.writeframes(data_input)
        return None, pyaudio.paContinue
    ```

3. **Record Audio:**

    *   **Open a `.wav` file:** A new WAV file is opened in write-binary ("wb") mode to store the audio data.
    *   **Set Audio Format:** The audio format is configured as follows:
        *   **1 channel (mono):** Using a single channel (mono) is sufficient for speech recognition and reduces processing overhead.
        *   **16-bit samples:** 16-bit samples offer a good balance between audio quality and file size.
        *   **16000 Hz sample rate:** A 16kHz sample rate is commonly used in speech recognition because it captures the relevant frequency range of human speech while keeping file sizes manageable.
    *   **Initialize PyAudio:** A PyAudio object is created to interface with the microphone.
    *   **Start Recording:** The `audio.open()` function starts an audio stream, which continuously receives audio data from the microphone and passes it to the `callback` function for writing to the temporary file.
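To make the file-size trade-off of this format concrete, here is a small arithmetic sketch. `wav_data_bytes` is a hypothetical helper, and the 25 MB figure is an assumption based on the upload limit described in OpenAI's speech-to-text guide; check the current documentation before relying on it.

```python
def wav_data_bytes(seconds, sample_rate=16000, sample_width=2, channels=1):
    """Size of the raw PCM payload (excluding the WAV header), in bytes."""
    return int(seconds * sample_rate * sample_width * channels)


five_seconds = wav_data_bytes(5)
print(five_seconds)  # 160000 bytes, roughly 156 KiB

# Assumption: a 25 MB upload limit, taken here as 25 * 1024 * 1024 bytes.
limit_bytes = 25 * 1024 * 1024
max_seconds = limit_bytes // wav_data_bytes(1)
print(max_seconds // 60)  # 13 (whole minutes of audio that fit)
```

At these settings, a short voice note stays far below the assumed limit, while recordings longer than about 13 minutes would need to be split or compressed.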

```python
import pyaudio
import wave
import tempfile
import time


def record_audio(timed_recording=False, record_seconds=5):
    """Records audio from the microphone.

    Args:
        timed_recording (bool): If True, record for a fixed duration.
        record_seconds (int): Duration of recording in seconds (if timed_recording is True).

    Returns:
        str: The path to the temporary audio file.
    """
    temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    temp_file_name = temp_file.name
    # Release the handle so wave.open can reopen the path (required on Windows)
    temp_file.close()

    def callback(data_input, frame_count, time_info, status):
        """Writes audio data to the temporary file."""
        wav_file.writeframes(data_input)
        return (None, pyaudio.paContinue)

    with wave.open(temp_file_name, "wb") as wav_file:
        wav_file.setnchannels(1)  # Mono channel
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(16000)  # 16kHz sample rate

        audio = pyaudio.PyAudio()
        stream = audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=16000,
            input=True,
            frames_per_buffer=1024,
            stream_callback=callback,
        )

        if timed_recording:
            print(f"Recording for {record_seconds} seconds...")
            time.sleep(record_seconds)
        else:
            input("Press Enter to stop recording...")

        stream.stop_stream()
        stream.close()
        audio.terminate()

    return temp_file_name
```
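Before sending a recording to the API, it can help to confirm that the file really has the expected format and a non-zero duration. The sketch below uses only the standard `wave` module. `describe_wav` is a hypothetical helper, and the one-second silent file stands in for a real recording so the example runs without a microphone.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, duration_seconds)."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return (
            wav_file.getnchannels(),
            wav_file.getsampwidth(),
            rate,
            frames / rate,
        )


# Stand-in for a real recording: one second of silence in the same format.
handle = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
handle.close()
with wave.open(handle.name, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(b"\x00\x00" * 16000)

print(describe_wav(handle.name))  # (1, 2, 16000, 1.0)
os.remove(handle.name)
```

A duration of `0.0` would suggest the callback never received data, which usually points to a permissions or device problem.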

## Transcribing and Translating Audio

We'll use separate functions for transcribing and translating to keep the code clear. Note that the `prompt` parameter is meant to be an example of the desired output (style, spelling, vocabulary) rather than an instruction for the model to follow.


```python
def transcribe_audio_file(file_name, prompt=""):
    """Transcribes an audio file using the Whisper API.

    Args:
        file_name (str): The path to the audio file.
        prompt (str): An optional prompt to guide the transcription.

    Returns:
        str: The transcribed text, or None if the request failed.
    """
    try:
        with open(file_name, "rb") as audio_file:
            response = client.audio.transcriptions.create(
                model="whisper-1", file=audio_file, prompt=prompt
            )
            return response.text.strip()
    except Exception as e:
        print(f"An error occurred during transcription: {e}")
        return None


def translate_audio_file(file_name, prompt=""):
    """Translates an audio file to English using the Whisper API.

    Args:
        file_name (str): The path to the audio file.
        prompt (str): An optional prompt to guide the translation.

    Returns:
        str: The translated text, or None if the request failed.
    """
    try:
        with open(file_name, "rb") as audio_file:
            response = client.audio.translations.create(
                model="whisper-1", file=audio_file, prompt=prompt
            )
            return response.text.strip()
    except Exception as e:
        print(f"An error occurred during translation: {e}")
        return None
```

**Note:** You can use the `prompt` parameter to *guide* the transcription or translation. This is useful for various reasons, such as:

*   **Spelling correction:** Providing correctly spelled words or names.
*   **Language specification:** Indicating the language of the audio.
*   **Acronym recognition:** Helping the model recognize specific acronyms.
*   **Filler word control:** Influencing whether filler words are included or excluded.
*   **Punctuation:** Guiding the model to use appropriate punctuation.

For more information, refer to the [Audio Whisper API's reference on prompting](https://platform.openai.com/docs/guides/speech-to-text/prompting) or [prestontuggle's AI Cookbook Recipe](https://cookbook.openai.com/examples/whisper_prompting_guide).
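As an illustration of the "example, not instruction" idea, the sketch below assembles a prompt that reads like transcript text and carries the spellings you care about. `build_style_prompt` is a hypothetical helper; the API only ever sees the resulting string. OpenAI's prompting guide notes that only roughly the final 224 tokens of a prompt are considered, so short prompts are preferable.

```python
def build_style_prompt(terms, example_sentence=""):
    """Build a prompt that reads like transcript text rather than an instruction.

    Hypothetical helper: `terms` are names or acronyms whose spelling matters,
    `example_sentence` shows the punctuation and tone you would like to see.
    """
    parts = []
    if terms:
        parts.append("Glossary: " + ", ".join(terms) + ".")
    if example_sentence:
        parts.append(example_sentence.strip())
    return " ".join(parts)


prompt = build_style_prompt(
    ["OpenAI", "PyAudio", "WAV"],
    "Hello, and welcome to today's session on speech recognition.",
)
print(prompt)
```

The resulting string can be passed as the `prompt` argument of `transcribe_audio_file` or `translate_audio_file`; how strongly it influences the output depends on the model and the audio.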

## Copying to Clipboard

We'll use the `pyperclip` library to copy the transcribed or translated text to the clipboard.

```python
import pyperclip


def copy_to_clipboard(text):
    """Copies text to the clipboard."""
    if text:
        pyperclip.copy(text)
        print("Result copied to clipboard!")
    else:
        print("Nothing to copy. Transcription/translation may have failed.")
```

## Main Function and Demos

Here's the main `transcribe_audio` function that combines all the steps:

1. Recording audio
2. Transcribing or translating
3. Copying the result to the clipboard

```python
import simpleaudio as sa
import os


def transcribe_audio(
    debug: bool = False,
    prompt: str = "",
    timed_recording: bool = False,
    record_seconds: int = 5,
    is_english: bool = True,
) -> str:
    """Records, transcribes/translates, and copies audio to clipboard.

    Args:
        debug (bool): If True, plays back the recorded audio.
        prompt (str): A prompt to guide the transcription/translation.
        timed_recording (bool): If True, record for a fixed duration.
        record_seconds (int): Duration of recording in seconds.
        is_english (bool): If True, transcribes; otherwise, translates to English.

    Returns:
        str: The transcribed or translated text.
    """
    temp_file_name = record_audio(timed_recording, record_seconds)

    if debug:
        print("Playing back recorded audio...")
        playback = sa.WaveObject.from_wave_file(temp_file_name)
        play_obj = playback.play()
        play_obj.wait_done()

    if is_english:
        result = transcribe_audio_file(temp_file_name, prompt)
    else:
        result = translate_audio_file(temp_file_name, prompt)

    os.remove(temp_file_name)
    copy_to_clipboard(result)
    return result
```
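In the function above, the temporary file is removed only if every earlier step succeeds; an exception during playback, for example, would leave it on disk. One possible way to guarantee cleanup is a small context manager, sketched here with the standard library only. `temporary_wav_path` is a hypothetical name, not something the original code defines.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_wav_path():
    """Yield a path for a temporary .wav file and delete it afterwards."""
    handle = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    handle.close()  # Release the handle so other code can reopen the path
    try:
        yield handle.name
    finally:
        if os.path.exists(handle.name):
            os.remove(handle.name)


# The file is removed even though the body raises.
try:
    with temporary_wav_path() as path:
        saved_path = path
        raise ValueError("simulated transcription failure")
except ValueError:
    pass

print(os.path.exists(saved_path))  # False
```

Adopting this would mean changing `record_audio` to write to a path it is given rather than creating its own file, which is a design choice left to the reader.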

### Demo 1: Transcribe English Speech

This demo records 5 seconds of spoken English and transcribes it. The prompt provides an example of the desired output format.

```python
print("Demo: Transcribe 5 seconds of spoken English")
result = transcribe_audio(
    debug=True,
    prompt="This is a sample transcription of speech in English. The speaker is discussing technology and AI. Ensure the output uses proper grammar, punctuation, and complete sentences, similar to this example.",
    timed_recording=True,
    record_seconds=5,
    is_english=True,
)
print("\nTranscription:", result)
```

```
Demo: Transcribe 5 seconds of spoken English
Recording for 5 seconds...
Playing back recorded audio...
Result copied to clipboard!

Transcription: Supercalifragilisticexpialidocious, I assume Whisper cannot transcribe this, but let's see how it performs.
```


### Demo 2: Translate Spanish Speech to English

This demo records 5 seconds of spoken Spanish and translates it into English. The prompt guides the translation, and, as in Demo 1, recording stops automatically after 5 seconds.

```python
print("Demo: Transcribe 5 seconds of spoken Spanish and translate into English")
result = transcribe_audio(
    debug=False,
    prompt="This is a translation of Spanish speech into English. The speaker is having a casual conversation. A sample translation would be: 'Hello, how are you doing today? I hope everything is going well.' Ensure the output uses proper grammar and punctuation.",
    timed_recording=True,
    record_seconds=5,
    is_english=False,
)
print("\nTranslation:", result)
```

```
Demo: Transcribe 5 seconds of spoken Spanish and translate into English
Recording for 5 seconds...
Result copied to clipboard!

Translation: Hello, my name is Carl, written with a C. I don't speak Spanish, but it's a pleasure to meet you.
```


## Troubleshooting

### Common Issues

-   **Microphone not working:**
    -   Check microphone connections and volume.
    -   Ensure your application has microphone access permissions.
-   **Audio quality issues:**
    -   Record in a quiet environment.
-   **Transcription/translation errors:**
    -   Ensure the audio is clear.
    -   Re-record if necessary.
    -   Set `is_english` correctly.
-   **API key issues:**
    -   Verify your `.env` file is in the correct location and has the correct API key.

### Advanced Troubleshooting

-   **Debugging audio playback:**
    -   Enable the `debug` parameter to listen to the recorded audio.
-   **Handling large audio files:**
    -   Consider splitting long recordings into smaller chunks.
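Splitting a long recording can be done with the standard `wave` module alone. The sketch below cuts a WAV file into fixed-length chunks; `split_wav` and the `chunk_NNN.wav` naming are hypothetical, and the 2.5-second silent file is only a stand-in for a real recording.

```python
import os
import tempfile
import wave


def split_wav(path, chunk_seconds, out_dir):
    """Split a WAV file into consecutive chunks of at most `chunk_seconds`.

    Returns the list of chunk file paths, in order.
    """
    chunk_paths = []
    with wave.open(path, "rb") as source:
        params = source.getparams()
        frames_per_chunk = int(chunk_seconds * source.getframerate())
        index = 0
        while True:
            frames = source.readframes(frames_per_chunk)
            if not frames:
                break
            chunk_path = os.path.join(out_dir, f"chunk_{index:03d}.wav")
            with wave.open(chunk_path, "wb") as chunk:
                chunk.setparams(params)  # Frame count in the header is corrected on close
                chunk.writeframes(frames)
            chunk_paths.append(chunk_path)
            index += 1
    return chunk_paths


with tempfile.TemporaryDirectory() as work_dir:
    source_path = os.path.join(work_dir, "recording.wav")
    with wave.open(source_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 40000)  # 2.5 seconds of silence

    chunks = split_wav(source_path, chunk_seconds=1, out_dir=work_dir)
    print(len(chunks))  # 3
```

Fixed-length cuts can land in the middle of a word, so in practice you may prefer to cut at pauses or to pass the end of the previous chunk's transcript as the `prompt` for the next one.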

## FAQ

See also the official [Whisper Audio API FAQ](https://help.openai.com/en/articles/7031512-whisper-audio-api-faq).

**Q: How can I improve the transcription accuracy?**

-   Ensure the recording environment is quiet.
-   Speak clearly and at a moderate pace.
-   Use a high-quality microphone if possible.
-   For English transcription, use the `prompt` parameter to provide context.

**Q: Can I use this method to transcribe audio in other languages?**

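-   Yes, the Whisper model supports [multiple languages](https://platform.openai.com/docs/guides/speech-to-text/supported-languages). The transcription endpoint returns text in the language that was spoken, so `is_english=True` also transcribes non-English audio in its original language. Set `is_english=False` to use the translation feature, which translates non-English audio into English text.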

**Q: How do I choose between manual and timed recording?**

-   Use `timed_recording=False` if you want to control the recording duration manually (press Enter to stop).
-   Use `timed_recording=True` and set `record_seconds` to automatically stop recording after a specific duration.

## Conclusion

Congratulations! You have now learned how to create an audio transcription and translation tool using OpenAI's Whisper API. You can record audio from your device's microphone, transcribe English speech, translate other languages to English, and copy the results to your clipboard. Feel free to experiment further with the [API reference](https://platform.openai.com/docs/api-reference/audio) and modify the code to suit your needs!