<a href="https://colab.research.google.com/github/jcv444/FAQ/blob/master/copy_of_2026_02_06_gemini_plished_working_openai_whisper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Whisper Notebook

## Section 1 - Notebook setup

The following command will pull and install the latest commit from [OpenAI's Whisper repository](https://github.com/openai/whisper) along with its Python dependencies.

In [None]:
pip install git+https://github.com/openai/whisper.git

You'll also want to set Colab's hardware accelerator to 'GPU'. You can do this by going to 'view resources' (available from the drop-down list next to the RAM/Disk bars) and then selecting 'change runtime type'.
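
Once the runtime has been switched, it can help to confirm that a GPU is actually visible before loading a model. The following is a minimal sketch that assumes the NVIDIA `nvidia-smi` tool is on the PATH whenever a GPU runtime is attached; `describe_accelerator` is a hypothetical helper name, not part of Whisper or Colab.

```python
import shutil
import subprocess


def describe_accelerator() -> str:
    """Return the GPU name and memory reported by nvidia-smi, or a fallback message."""
    # Assumption: a GPU runtime exposes the `nvidia-smi` command-line tool.
    if shutil.which("nvidia-smi") is None:
        return "No GPU detected (nvidia-smi not found)"
    proc = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0 or not proc.stdout.strip():
        return "No GPU detected (nvidia-smi reported an error)"
    return proc.stdout.strip()


print(describe_accelerator())
```

If the fallback message appears, the later cells should still run, only more slowly on the CPU.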

## Section 2 - High-level model access

### 2.1 - English to English Transcription

In this sub-section we'll upload one or more audio files containing English speech and transcribe them into English text. First we install ffmpeg, which Whisper uses to decode audio files, and then we upload the audio:

In [None]:
pip install ffmpeg-python

In [None]:
!apt-get update
!apt-get install -y ffmpeg
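
Whisper's `load_audio` decodes files by calling the `ffmpeg` executable, so a quick check that it is on the PATH can save a confusing error later. This is a minimal sketch; `ffmpeg_available` is a hypothetical helper name.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on the PATH."""
    return shutil.which("ffmpeg") is not None


if ffmpeg_available():
    print("ffmpeg found at:", shutil.which("ffmpeg"))
else:
    print("ffmpeg not found; re-run the install cell above.")
```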

In [None]:
from google.colab import files
uploaded = files.upload() # run this to get an upload widget
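
`files.upload()` returns a dictionary that maps each uploaded filename to the file's raw bytes, and Colab also writes the files to the current working directory, which is why later cells can pass the filenames straight to Whisper. As a small sketch, the hypothetical helper below lists the uploads with their sizes; the `sample` dictionary stands in for a real upload result.

```python
def summarize_uploads(uploaded: dict) -> list:
    """Return one 'name (size KiB)' line per uploaded file, sorted by name."""
    lines = []
    for name in sorted(uploaded):
        size_kib = len(uploaded[name]) / 1024
        lines.append(f"{name} ({size_kib:.1f} KiB)")
    return lines


# Stand-in for the dict returned by files.upload(); the filenames are made up.
sample = {"talk.mp3": b"\x00" * 2048, "memo.wav": b"\x00" * 512}
for line in summarize_uploads(sample):
    print(line)
```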


Next, we'll load Whisper and ask it to transcribe each audio file we just uploaded:


In [None]:
import whisper

# Load the Whisper model. You can choose 'tiny.en', 'base.en', 'small.en' or 'medium.en'
# for English-only transcription, or 'tiny', 'base', 'small', 'medium', 'large' for multilingual models
# (there is no English-only 'large' model).
# Larger models offer better accuracy but require more VRAM and are slower.
model = whisper.load_model("base.en")

# Dictionary to store transcription results
all_transcriptions = {}

# Iterate through each uploaded file and transcribe it
if uploaded:
    print(f"Processing {len(uploaded)} file(s)...")
    for filename in uploaded.keys():
        print(f"\nTranscribing '{filename}'...")
        try:
            # Transcribe the audio file
            # fp16=False runs inference in 32-bit precision, which is what a CPU needs
            # (with fp16=True on CPU, Whisper warns and falls back to FP32 anyway).
            # On a CUDA GPU, fp16=True (the default) is usually faster and uses less memory.
            result = model.transcribe(filename, language="en", fp16=False)
            transcribed_text = result["text"]
            print(f"Transcription for '{filename}':\n{transcribed_text}")
            all_transcriptions[filename] = transcribed_text
        except Exception as e:
            print(f"Error transcribing '{filename}': {e}")
    print("\nAll transcriptions complete.")
else:
    print("No files uploaded to transcribe.")

# You can now access the all_transcriptions dictionary for results, e.g.:
# for filename, text in all_transcriptions.items():
#     print(f"\n-- {filename} --\n{text}")
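
Besides `result["text"]`, the dictionary returned by `model.transcribe()` also has a `"segments"` list whose entries carry `start` and `end` times in seconds together with the segment `text`. The sketch below turns such a list into SRT-style subtitle text. `format_timestamp` and `segments_to_srt` are hypothetical helpers written for this notebook, and the sample segments are invented so the cell runs without a model.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (the SRT convention)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts that have 'start', 'end' and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Invented segments with the same keys Whisper uses; replace with result["segments"].
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
    {"start": 2.5, "end": 61.25, "text": " Let's get started."},
]
print(segments_to_srt(sample_segments))
```

To keep the timing information for every file, the loop above could store `result["segments"]` alongside the text.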


### 2.2 - French to English Translation

In this sub-section we'll upload one or more audio files containing French speech and translate the content of that audio into English text. Let's upload the audio:

In [None]:
from google.colab import files
uploaded_french = files.upload() # run this to get an upload widget for French audio


Let's first see how Whisper fares transcribing French speech to French text:

In [None]:
model = whisper.load_model("base")

# Assuming you uploaded one French audio file using the previous cell.
# If multiple files were uploaded, this will use the first one.
if uploaded_french:
    french_audio_filename = list(uploaded_french.keys())[0]
    print(f"Transcribing French audio '{french_audio_filename}' to French text...")
    result = model.transcribe(french_audio_filename, language='fr', fp16=False)
    print(result["text"])
else:
    print("No French audio file found for transcription. Please upload one in the cell above.")


Now let's see how well it translates French speech to English text:

In [None]:
model = whisper.load_model("base")

# Assuming you uploaded one French audio file using the upload cell in Section 2.2.
# If multiple files were uploaded, this will use the first one.
if uploaded_french:
    french_audio_filename = list(uploaded_french.keys())[0]
    print(f"Translating French audio '{french_audio_filename}' to English text using 'base' model...")
    result = model.transcribe(french_audio_filename, language='fr', task='translate', fp16=False)
    print(result["text"])
else:
    print("No French audio file found for translation. Please upload one in the cell above.")
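
The cells above only process the first uploaded file. If several French recordings were uploaded, a loop like the one in Section 2.1 can cover all of them. The sketch below assumes only that `model` has a `transcribe(path, **options)` method returning a dictionary with a `"text"` key, as Whisper models do; `translate_all` is a hypothetical helper, and `EchoModel` is a stand-in so the example runs without downloading weights.

```python
def translate_all(model, filenames, source_language="fr"):
    """Translate each audio file to English and return {filename: text}."""
    translations = {}
    for name in filenames:
        result = model.transcribe(
            name, language=source_language, task="translate", fp16=False
        )
        translations[name] = result["text"].strip()
    return translations


class EchoModel:
    """Stand-in for a Whisper model; it only echoes the options it receives."""

    def transcribe(self, path, **options):
        return {"text": f" {path}: {options['task']} from {options['language']}"}


print(translate_all(EchoModel(), ["a.mp3", "b.mp3"]))
# In the notebook: translate_all(model, uploaded_french.keys())
```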


Let's try the same translation with the larger, typically more accurate `small` model:

In [None]:
model = whisper.load_model("small")

# Assuming you uploaded one French audio file using the upload cell in Section 2.2.
# If multiple files were uploaded, this will use the first one.
if uploaded_french:
    french_audio_filename = list(uploaded_french.keys())[0]
    print(f"Translating French audio '{french_audio_filename}' to English text using 'small' model...")
    result = model.transcribe(french_audio_filename, language='fr', task='translate', fp16=False)
    print(result["text"])
else:
    print("No French audio file found for translation. Please upload one in the cell above.")
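
Which model to load is mostly a trade-off between accuracy, speed and GPU memory. The sketch below picks the largest multilingual model that fits a given amount of VRAM. The memory figures are the approximate values listed in the Whisper README and may change between releases; `pick_model` is a hypothetical helper, not part of the library.

```python
# Approximate VRAM needs in GB, as listed in the Whisper README (assumed current).
APPROX_VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}


def pick_model(vram_gb: float) -> str:
    """Return the largest multilingual model whose approximate VRAM need fits."""
    chosen = "tiny"  # fall back to the smallest model
    for name, needed in APPROX_VRAM_GB.items():  # dicts keep insertion order
        if needed <= vram_gb:
            chosen = name
    return chosen


for vram in (1, 4, 16):
    print(f"{vram} GB -> {pick_model(vram)}")
```

The English-only `.en` variants are left out on purpose, because translation needs a multilingual model.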


## Section 3 - Low-level model access

Below we'll look at some low-level Whisper access using `whisper.decode()` and `whisper.detect_language()`:

In [None]:
model = whisper.load_model('small')

# Assuming you uploaded one French audio file using the upload cell in Section 2.2.
# If multiple files were uploaded, this will use the first one.
if uploaded_french:
    french_audio_filename = list(uploaded_french.keys())[0]
    print(f"Processing audio '{french_audio_filename}' for low-level access...")
    # load audio and pad/trim it to fit 30 seconds
    audio = whisper.load_audio(french_audio_filename)
    audio = whisper.pad_or_trim(audio)

    # make log-Mel spectrogram and move to the same device as the model
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
else:
    print("No French audio file found for low-level access examples. Please upload one in the cell above.")
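
Whisper's encoder works on fixed 30-second windows of 16 kHz audio, which is why the clip is padded or trimmed before the spectrogram is computed. The pure-Python sketch below mirrors that idea on a plain list; it is an illustration only, since `whisper.pad_or_trim` itself operates on NumPy arrays and tensors. The constants are assumed to match the values in `whisper.audio`, and `pad_or_trim_list` is a hypothetical name.

```python
SAMPLE_RATE = 16_000      # samples per second expected by Whisper
CHUNK_SECONDS = 30        # length of one model input window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples
HOP_LENGTH = 160          # samples between spectrogram frames (assumed)


def pad_or_trim_list(samples, length=N_SAMPLES):
    """Cut `samples` to `length`, or zero-pad it at the end to reach `length`."""
    if len(samples) >= length:
        return list(samples[:length])
    return list(samples) + [0.0] * (length - len(samples))


short_clip = [0.1] * (5 * SAMPLE_RATE)    # 5 seconds of audio
long_clip = [0.1] * (45 * SAMPLE_RATE)    # 45 seconds of audio
print(len(pad_or_trim_list(short_clip)), len(pad_or_trim_list(long_clip)))
print("spectrogram frames per window:", N_SAMPLES // HOP_LENGTH)
```

Anything past the first 30 seconds is discarded here, so for longer recordings `model.transcribe()` is the better fit because it slides a window over the whole file.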


### 3.1 - Language detection

In [None]:
# detect the spoken language
_, probs = model.detect_language(mel)
lang = max(probs, key=probs.get)
prob = f"{probs[lang]:.0%}"

# print the language that scored the highest likelihood
print(f"Detected language (and probability): {lang} ({prob})")
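
`probs` is a dictionary that maps language codes to probabilities, so it can also be used to show the runners-up, which is handy when the top score is not decisive. This is a minimal sketch with a hypothetical `top_languages` helper; the sample values are invented.

```python
def top_languages(probs: dict, k: int = 3) -> list:
    """Return the k most likely (language, probability) pairs, best first."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Invented probabilities; in the notebook, pass the `probs` from detect_language.
sample_probs = {"en": 0.03, "fr": 0.91, "it": 0.02, "es": 0.04}
for code, p in top_languages(sample_probs):
    print(f"{code}: {p:.0%}")
```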

### 3.2 - French to English Translation

In [None]:
# decode the audio
options = whisper.DecodingOptions(language='fr', task='translate')
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
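
The object returned by `whisper.decode()` carries more than the text: fields such as `avg_logprob`, `no_speech_prob` and `compression_ratio` give a rough sense of how much to trust the output. The sketch below flags suspicious results using the default thresholds of `model.transcribe()` (-1.0, 0.6 and 2.4); treating them as simple pass/fail checks is a simplification made for this example, and `quality_flags` is a hypothetical helper. A `SimpleNamespace` stands in for a real result.

```python
from types import SimpleNamespace


def quality_flags(result, logprob_min=-1.0, no_speech_max=0.6, compression_max=2.4):
    """Return a list of warnings for a decoding result; empty means no flags."""
    flags = []
    if result.avg_logprob < logprob_min:
        flags.append("low average log-probability")
    if result.no_speech_prob > no_speech_max:
        flags.append("probably no speech")
    if result.compression_ratio > compression_max:
        flags.append("repetitive text")
    return flags


# Invented values; in the notebook, pass the `result` from whisper.decode().
fake = SimpleNamespace(avg_logprob=-1.3, no_speech_prob=0.1, compression_ratio=2.9)
print(quality_flags(fake))
```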