##### Copyright 2025 Google LLC.

In [1]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/examples/Podcast_and_Audio_Transcription.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

# Podcast and Audio Transcription with the Gemini API

## Advanced Audio-to-Text Conversion  
Gemini 2.0 transforms audio files (podcasts, interviews, call recordings) into **structured transcripts** with:  
- **Precision timestamps** ([00:00] format)  
- **Speaker identification** (labeled or auto-assigned as Speaker A/B)  
- **Audio event detection**:  
  - Background music (with song recognition, e.g., `[02:15] [Firework by Katy Perry]`)  
  - Sound effects (e.g., `[01:30] [Bell ringing]`)  
  - Named jingles (e.g., `[00:45] [The Sofa Shop jingle]`)  

## Implementation Notes  
1. **File Preparation**:  
   - Supports MP3/WAV formats  
   - Update `file_path` to your audio file location  
2. **Output Format**:  
   - Clean text without markdown  
   - Terminates with `[END]` marker  
3. **Accuracy**:  
   - Context-aware spelling (corrects names/titles)  
   - Uses English-alphabet characters by default (other characters only when the model judges them correct)  

Example output snippet:

```text
[00:00] Tom: Welcome to the podcast.
[00:03] Speaker A: Thanks for having me!
[00:06] [Coffee shop ambiance]
[01:30] [END]
```
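
Because every line follows the same `[MM:SS] Speaker: text` layout, the transcript can be turned into structured records after generation. The sketch below is one possible way to do that, assuming the model followed the format exactly; `Caption` and `parse_line` are illustrative names, not part of the Gemini SDK.

```python
import re
from typing import NamedTuple, Optional

# Matches "[MM:SS] rest of line"; the layout is taken from the example output.
LINE_RE = re.compile(r"^\[(\d+):(\d{2})\]\s*(.*)$")


class Caption(NamedTuple):
    seconds: int            # offset from the start of the audio
    speaker: Optional[str]  # None for audio events and the [END] marker
    text: str


def parse_line(line: str) -> Optional[Caption]:
    """Parse one transcript line; return None if it has no timestamp."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    minutes, secs, rest = match.groups()
    seconds = int(minutes) * 60 + int(secs)
    # Audio events and [END] are wrapped in square brackets.
    if rest.startswith("[") and rest.endswith("]"):
        return Caption(seconds, None, rest[1:-1])
    # Assumption: the first ": " separates the speaker from the caption.
    speaker, sep, text = rest.partition(": ")
    if not sep:
        return Caption(seconds, None, rest)
    return Caption(seconds, speaker, text)


sample = """[00:00] Tom: Welcome to the podcast.
[00:03] Speaker A: Thanks for having me!
[00:06] [Coffee shop ambiance]
[01:30] [END]"""

captions = [parse_line(line) for line in sample.splitlines()]
for caption in captions:
    print(caption)
```

A caption that contains `": "` but has no speaker label would be split incorrectly by this sketch, so treat the speaker field as a best guess.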


In [2]:
%pip install -U -q "google-genai>=1.0.0"

In [3]:
from google import genai

from IPython.display import Markdown

## Configure your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](https://github.com/google-gemini/cookbook/blob/main/quickstarts/Authentication.ipynb) for an example.

In [4]:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
client = genai.Client(api_key=GOOGLE_API_KEY)

In [5]:
MODEL_ID="gemini-2.0-flash" # @param ["gemini-2.0-flash-lite","gemini-2.0-flash","gemini-2.0-pro-exp-02-05"] {"allow-input":true, isTemplate: true}

## Audio Transcription Example

Now that you've configured your API key, let's walk through an example of how to use it for audio transcription.


In [6]:
%pip install jinja2



In [7]:
import requests

# URL of the audio file
file_path = "https://storage.googleapis.com/generativeai-downloads/data/State_of_the_Union_Address_30_January_1961.mp3"

# Download and save the file locally; fail early on HTTP errors
response = requests.get(file_path, timeout=60)
response.raise_for_status()
local_file_path = "/tmp/audio.mp3"
with open(local_file_path, "wb") as f:
    f.write(response.content)

In [8]:
import os

# Upload the file to the API
file = client.files.upload(file=local_file_path)

# Remove the local copy after upload
os.remove(local_file_path)
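
When you substitute your own audio, a quick local check before the upload call can catch a wrong path or an unexpected format early. The following is a minimal sketch; `check_audio_file` and `SUPPORTED_EXTENSIONS` are hypothetical helpers, and the extension list only mirrors the MP3/WAV note in this notebook rather than the full set the API may accept.

```python
from pathlib import Path

# Assumption: extensions taken from the "Supports MP3/WAV formats" note above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav"}


def check_audio_file(path: str) -> Path:
    """Return the path as a Path object, or raise if it looks unusable."""
    audio_path = Path(path)
    if audio_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f"Unsupported audio format: {audio_path.suffix or '(none)'}"
        )
    if not audio_path.is_file():
        raise FileNotFoundError(f"No such file: {audio_path}")
    if audio_path.stat().st_size == 0:
        raise ValueError(f"File is empty: {audio_path}")
    return audio_path
```

Calling `check_audio_file(local_file_path)` just before `client.files.upload(...)` would raise a descriptive error instead of sending an empty or missing file.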

In [9]:
from jinja2 import Template

# Define the prompt template
prompt_template = Template("""
  Generate a transcript of the episode. Include timestamps and identify speakers.

  Speakers are:
  {% for speaker in speakers %}- {{ speaker }}{% if not loop.last %}\n{% endif %}{% endfor %}

  eg:
  [00:00] Brady: Hello there.
  [00:02] Tim: Hi Brady.

  It is important to include the correct speaker names. Use the names you identified earlier.
  If you really don't know the speaker's name, identify them with a letter of the alphabet,
  e.g., there may be an unknown speaker 'A' and another unknown speaker 'B'.

  If there is music or a short jingle playing, signify like so:
  [01:02] [MUSIC] or [01:02] [JINGLE]

  If you can identify the name of the music or jingle playing then use that instead, e.g.:
  [01:02] [Firework by Katy Perry] or [01:02] [The Sofa Shop jingle]

  If there is some other sound playing try to identify the sound, e.g.:
  [01:02] [Bell ringing]

  Each individual caption should be quite short, a few short sentences at most.

  Signify the end of the episode with [END].

  Don't use any markdown formatting, like bolding or italics.

  Only use characters from the English alphabet, unless you genuinely believe
  foreign characters are correct.

  It is important that you use the correct words and spell everything correctly.
  Use the context of the podcast to help.

  If the hosts discuss something like a movie, book or celebrity,
  make sure the movie, book, or celebrity name is spelled correctly.
""")

# Define known speakers
speakers = ["Tom"]
prompt = prompt_template.render(speakers=speakers)
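
The Jinja loop in the template emits one `- name` bullet per known speaker, separated by newlines. To see what that part of the prompt looks like for several speakers, a plain-Python equivalent of just the loop can help; `render_speakers` is an illustrative name and the second speaker below is made up for the example.

```python
def render_speakers(speakers):
    """Build the same bullet list the template loop produces."""
    return "\n".join(f"- {name}" for name in speakers)


# "Ann" is a hypothetical second speaker, used only to show the separator.
print(render_speakers(["Tom", "Ann"]))
```

An empty list yields an empty string, so the "Speakers are:" heading in the prompt would be followed by nothing and the model would fall back to lettered speakers.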

In [10]:
# Generate content using Gemini API
response = client.models.generate_content(
    model=MODEL_ID,
    contents=[prompt, file],
)

# Display the transcript
print(response.text)

[00:00] [State of the Union Address by John F Kennedy]
[00:02] Tom: The President's State of the Union address to a joint session of the Congress from the rostrum of the House of Representatives, Washington D.C., January 30th, 1961.
[00:25] Tom: Mr. Vice President, Members of the Congress, it is a pleasure to return from whence I came.
[00:41] Tom: You are among my oldest friends in Washington and this house is my oldest home.
[00:47] Tom: It was here.
[00:53] Tom: It was here more than 14 years ago that I first took the oath of federal office.
[01:00] Tom: It was here for 14 years that I gained both knowledge and inspiration from members of both parties in both houses, from your wise and generous leaders and from the pronouncements which I can, vividly recall sitting where you now sit, including the programs of two great presidents, the undimmed eloquence of Churchill, the soaring idealism of Nero, the steadfast words of General De Gaulle.
[01:40] Tom: To speak from this same historic
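
The output shown above breaks off without the `[END]` marker that the prompt asks for, which may mean the response or its display was cut short. A simple completeness check can flag this case; the sketch below assumes the marker appears on the last non-empty line, and `is_complete` is an illustrative name rather than an SDK function.

```python
def is_complete(transcript: str, marker: str = "[END]") -> bool:
    """Return True if the last non-empty line ends with the end marker."""
    lines = [line.strip() for line in transcript.splitlines() if line.strip()]
    return bool(lines) and lines[-1].endswith(marker)


finished = "[00:00] Tom: Hello there.\n[00:05] [END]\n"
cut_off = "[00:00] Tom: Hello there.\n[01:40] Tom: To speak from this same historic"

print(is_complete(finished))
print(is_complete(cut_off))
```

If the check fails, one option is to rerun the request on a shorter audio segment and join the pieces.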

## Next Steps

Enhance the transcription with more robust speaker diarization, summarization, or sentiment analysis. You can also build a simple Streamlit app for interactive playback and transcript search. Try supporting multiple languages or refining the prompt for better formatting and accuracy.
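
As a starting point for transcript search, the timestamped format makes it straightforward to map a keyword hit back to a playback offset. This is a minimal sketch, assuming the `[MM:SS]` layout used throughout this notebook; `search_transcript` is a hypothetical helper, and the sample lines are copied from the output above.

```python
import re

TIMESTAMP_RE = re.compile(r"^\[(\d+):(\d{2})\]\s*(.*)$")


def search_transcript(transcript: str, query: str) -> list[tuple[int, str]]:
    """Return (offset_in_seconds, caption) pairs whose caption contains query."""
    hits = []
    needle = query.casefold()
    for line in transcript.splitlines():
        match = TIMESTAMP_RE.match(line.strip())
        if match is None:
            continue  # skip lines without a timestamp
        minutes, seconds, caption = match.groups()
        if needle in caption.casefold():
            hits.append((int(minutes) * 60 + int(seconds), caption))
    return hits


transcript = """[00:25] Tom: Mr. Vice President, Members of the Congress, it is a pleasure to return from whence I came.
[00:41] Tom: You are among my oldest friends in Washington and this house is my oldest home.
[00:53] Tom: It was here more than 14 years ago that I first took the oath of federal office."""

for offset, caption in search_transcript(transcript, "washington"):
    print(offset, caption)
```

The returned offsets are in seconds, so they could be passed directly to an audio player's seek control in an interactive app.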
