# Google Gemini 2.5 Pro for Podcast and YouTube Understanding & Transcription

Gemini **2.5 Pro** significantly enhances video and audio understanding through its advanced reasoning and exceptionally large context window (up to 1 million tokens). This massive capacity allows it to process substantial inputs, enabling tasks like analyzing a 30-minute video (which consumes roughly 500k tokens including audio and sampled frames) or transcribing hours of audio content in a single pass.
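
As a rough sanity check on those numbers, the sketch below works out how much video fits in the context window. It uses only the figures quoted above (about 500k tokens per 30 minutes, 1 million tokens of context). The function name `max_video_minutes` and the `reserved_tokens` parameter are hypothetical helpers for illustration, and actual token usage will vary with the video.

```python
def max_video_minutes(
    context_tokens: int = 1_000_000,
    tokens_per_30_min: int = 500_000,
    reserved_tokens: int = 0,
) -> float:
    """Estimate how many minutes of video fit in the input context.

    Assumes token usage grows linearly with video length, based on the
    rough figure of ~500k tokens per 30 minutes quoted in the text.
    `reserved_tokens` is a hypothetical allowance for the prompt itself.
    """
    tokens_per_minute = tokens_per_30_min / 30
    return (context_tokens - reserved_tokens) / tokens_per_minute


print(max_video_minutes())  # about 60 minutes with nothing reserved
```

Under these assumptions, a single request tops out at roughly one hour of video, before accounting for the prompt.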

**Audio & Podcast Transcription:**

*   Generates precise transcripts with second-level [MM:SS] timestamps, automatic speaker identification (diarization), and detection/labeling of audio events like background music (even identifying specific songs), sound effects (e.g., bells), and named jingles.
*   15 minutes of audio equals roughly 8k output tokens. With Gemini 2.5 Pro's 64k-token output limit, a single request can therefore transcribe about 2 hours of audio.
*   Its 1 million token context window capably handles extensive recordings, processing hours of audio content seamlessly.
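
The 2-hour figure follows from simple arithmetic on the output limit. Here is a minimal sketch using only the numbers above (about 8k output tokens per 15 minutes of audio, 64k output tokens per request). The helper name `max_transcript_minutes` is hypothetical, and real transcripts will vary with speech density.

```python
def max_transcript_minutes(
    output_limit_tokens: int = 64_000,
    tokens_per_15_min: int = 8_000,
) -> float:
    """Estimate the audio length one request can transcribe.

    Assumes transcript length scales linearly with audio duration,
    using the rough ~8k output tokens per 15 minutes from the text.
    """
    chunks_of_15_min = output_limit_tokens / tokens_per_15_min
    return chunks_of_15_min * 15


minutes = max_transcript_minutes()
print(f"{minutes:.0f} minutes (~{minutes / 60:.0f} hours)")
```

The output limit, not the 1 million token input window, is the binding constraint for transcription: the model can listen to more audio than it can write out in one response.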

**YouTube Video Understanding & Interaction:**

*   Accepts public YouTube URLs directly via the API and AI Studio, analyzing both the audio track and visual frames (sampled at ~1 frame per second).
*   Ask Gemini **2.5 Pro** to summarize, answer specific questions, translate spoken content, transcribe audio, or provide visual descriptions. Interact with specific moments using `MM:SS` timestamps in prompts (e.g., `"What examples are given at 01:05?"`) and request combined outputs like transcriptions alongside visual descriptions in one go.
*   Limitations: only public videos are supported (no private or unlisted videos), the daily processing limit is 8 hours of YouTube video content, and each API request accepts one video URL.
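
When building timestamped questions programmatically, it can help to format offsets into the `MM:SS` form the prompts use. The sketch below is one way to do that; `to_mmss` and `timestamp_question` are hypothetical helper names, not part of the Gemini SDK.

```python
def to_mmss(seconds: int) -> str:
    """Format a non-negative offset in seconds as MM:SS."""
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def timestamp_question(question: str, seconds: int) -> str:
    """Append an MM:SS timestamp to a question about a video moment."""
    return f"{question} at {to_mmss(seconds)}?"


print(timestamp_question("What examples are given", 65))
```

The resulting string can be passed as the text part of a request alongside the video.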

**Learn More & Get Started:**

Explore the documentation to implement these features with **Gemini 2.5 Pro**:

*   **Working with Audio Files:** [https://ai.google.dev/gemini-api/docs/audio?lang=python](https://ai.google.dev/gemini-api/docs/audio?lang=python)
*   **Working with Video (including YouTube URLs):** [https://ai.google.dev/gemini-api/docs/vision?lang=python#youtube](https://ai.google.dev/gemini-api/docs/vision?lang=python#youtube)


In [None]:
%pip install google-genai jinja2

In [1]:
import os
from google import genai


# The client reads the API key from the GEMINI_API_KEY (or GOOGLE_API_KEY) environment variable
client = genai.Client()

## Podcast Example


In [2]:
from jinja2 import Template


# path to the file to upload
file_path = "../assets/porsche.mp3" # Replace with your own file path

# Upload the file to the File API
file = client.files.upload(file=file_path)

# Generate a structured response using the Gemini API
prompt_template = Template("""Generate a transcript of the episode. Include timestamps and identify speakers.

Speakers are: 
{% for speaker in speakers %}- {{ speaker }}{% if not loop.last %}\n{% endif %}{% endfor %}

eg:
[00:00] Brady: Hello there.
[00:02] Tim: Hi Brady.

It is important to include the correct speaker names. Use the names you identified earlier. If you really don't know the speaker's name, identify them with a letter of the alphabet, eg there may be an unknown speaker 'A' and another unknown speaker 'B'.

If there is music or a short jingle playing, signify like so:
[01:02] [MUSIC] or [01:02] [JINGLE]

If you can identify the name of the music or jingle playing then use that instead, eg:
[01:02] [Firework by Katy Perry] or [01:02] [The Sofa Shop jingle]

If there is some other sound playing try to identify the sound, eg:
[01:02] [Bell ringing]

Each individual caption should be quite short, a few short sentences at most.

Signify the end of the episode with [END].

Don't use any markdown formatting, like bolding or italics.

Only use characters from the English alphabet, unless you genuinely believe foreign characters are correct.

It is important that you use the correct words and spell everything correctly. Use the context of the podcast to help.
If the hosts discuss something like a movie, book or celebrity, make sure the movie, book, or celebrity name is spelled correctly.""")

# Define the speakers and render the prompt
speakers = ["John"]
prompt = prompt_template.render(speakers=speakers)

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[prompt, file],
)

print(response.text)

[00:00] John: If the Porsche Macan has proven anything,
[00:02] John: it's that the days of sacrificing performance for practicality are gone.
[00:06] John: Long gone.
[00:08] John: Engineered to deliver a driving experience like no other,
[00:11] John: the Macan has demonstrated excellence in style and performance
[00:14] John: to become the leading sports car in its class.
[00:17] John: So don't let those five doors fool you.
[00:19] John: Once you're in the driver's seat,
[00:20] John: one thing will become immediately clear.
[00:23] [Car engine roaring]
[00:24] John: This is a Porsche.
[00:25] John: The Macan.
[00:26] John: Now leasing from 3.99%.
[00:29] John: Conditions apply.
[00:30] [Dial-up modem sound]
[00:32] [END]
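
Because the prompt asks for a fixed `[MM:SS] Speaker: text` layout, the transcript can be turned into structured records for search or subtitle generation. The following is a minimal sketch that assumes the model followed the requested format exactly. `Caption` and `parse_transcript` are hypothetical names introduced for illustration, and lines that do not match the pattern are skipped.

```python
import re
from typing import NamedTuple, Optional


class Caption(NamedTuple):
    seconds: int            # offset from the start of the audio
    speaker: Optional[str]  # None for sound events such as [MUSIC]
    text: str


# Matches lines such as "[00:23] John: Hello" or "[00:23] [Bell ringing]"
LINE_RE = re.compile(r"^\[(\d{1,3}):(\d{2})\]\s*(.*)$")


def parse_transcript(transcript: str) -> list[Caption]:
    """Parse '[MM:SS] Speaker: text' lines into Caption records."""
    captions = []
    for raw_line in transcript.splitlines():
        match = LINE_RE.match(raw_line.strip())
        if not match:
            continue  # skip blank or unexpected lines
        minutes, secs, rest = match.groups()
        offset = int(minutes) * 60 + int(secs)
        if rest.startswith("["):
            # Sound event or marker, e.g. [Car engine roaring] or [END]
            captions.append(Caption(offset, None, rest.strip("[]")))
            continue
        speaker, sep, text = rest.partition(": ")
        if sep:
            captions.append(Caption(offset, speaker, text))
        else:
            captions.append(Caption(offset, None, rest))
    return captions


sample = "[00:00] John: Long gone.\n[00:23] [Car engine roaring]\n[00:32] [END]"
for caption in parse_transcript(sample):
    print(caption)
```

In the notebook, `parse_transcript(response.text)` would apply the same parsing to the model output, though real responses may deviate from the format and need extra handling.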


## YouTube Example


In [3]:
from google.genai import types

youtube_url = "https://www.youtube.com/watch?v=RDOMKIw1aF4" # Replace with the YouTube URL you want to analyze

prompt = """Analyze the following YouTube video content. Provide a concise summary covering:

1.  **Main Thesis/Claim:** What is the central point the creator is making?
2.  **Key Topics:** List the main subjects discussed, referencing specific examples or technologies mentioned (e.g., AI models, programming languages, projects).
3.  **Call to Action:** Identify any explicit requests made to the viewer.
4.  **Summary:** Provide a concise summary of the video content.

Use the provided title, chapter timestamps/descriptions, and description text for your analysis."""

# Analyze the video
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=types.Content(
        parts=[
            types.Part(text=prompt),
            types.Part(
                file_data=types.FileData(file_uri=youtube_url)
            )
        ]
    )
)

print(response.text)

Based on the provided video content, here is a concise analysis:

### 1. Main Thesis/Claim
The creator argues that Google's Gemini 2.5 Pro is the best AI for *pure coding* he has ever used. While it struggles with front-end UI generation from images or high-level descriptions, its ability to handle complex logic, refactor code, and build functional applications from detailed prompts is superior to other models.

### 2. Key Topics
*   **AI Model Tested:** Google Gemini 2.5 Pro (Experimental 03-25 version).
*   **Coding Challenges & Projects:**
    *   **Ultimate Tic-Tac-Toe:** Created a fully functional game in **Java** using the **Swing** library with a single, detailed prompt.
    *   **"Kitten Cannon" Clone:** Generated a launch-style game using **p5.js**. The initial code had bugs, but the model successfully debugged and fixed them over two additional prompts.
    *   **Landing Page from Mockup:** Attempted to build a landing page using **React, Vite, and Tailwind CSS** based on a m