# Lesson: Audio Understanding with the Gemini API

Welcome to this lesson on understanding and analyzing audio with the Gemini API. Gemini can process audio files to perform a variety of tasks, from simple transcription to complex analysis.

In this notebook, we will cover:
1.  **Uploading Audio:** The correct way to prepare and upload audio files for the API.
2.  **Transcription & Summarization:** How to get a transcript and a high-level summary from an audio file.
3.  **Multi-file Analysis:** How to ask questions across multiple audio files at once.
4.  **Processing Audio from YouTube:** How to analyze audio from a YouTube video.

In [1]:
#@title 1. Setup
# Install the Google Gen AI Python SDK (provides `from google import genai`)
!pip install -q -U google-genai

In [4]:
#@title 2. Configure your API Key
# Use the "Secrets" tab in Colab (click the key icon on the left) to store your
# API key with the name "GOOGLE_API_KEY".
from google import genai
from google.colab import userdata

try:
    GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
    client = genai.Client(api_key=GOOGLE_API_KEY)
except userdata.SecretNotFoundError as e:
    print('Secret not found. Please add your GOOGLE_API_KEY to the Colab Secrets Manager.')

## Part 1: Uploading and Transcribing Audio

The first step in working with audio is to upload the file to the Gemini API. For audio files, particularly larger ones, use the `client.files.upload` method. It sends the file to the File API and returns a file object that you can reference in your prompts, including across several requests.
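Before uploading, it can help to confirm that a file looks like a supported audio format. The sketch below is a minimal, local check only. The extension-to-MIME-type table and the helper name `guess_audio_mime_type` are assumptions made for illustration, so verify the supported formats against the current Gemini documentation.

```python
from pathlib import Path

# Assumption: extension -> MIME type pairs for commonly accepted audio formats.
# This table is illustrative; check the Gemini docs for the authoritative list.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

def guess_audio_mime_type(path):
    """Return the assumed MIME type for an audio path, or None if unrecognized."""
    return AUDIO_MIME_TYPES.get(Path(path).suffix.lower())

# Quick check on a few file names (no files need to exist for this lookup)
for name in ["sample_audio.mp3", "interview.FLAC", "notes.txt"]:
    print(name, "->", guess_audio_mime_type(name))
```

A `None` result is a hint to convert the file or double-check its extension before spending time on an upload.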

Let's start by transcribing a short audio clip.

In [6]:
#@title Download and upload a sample audio file

from IPython.display import display, Markdown, Audio

# This is a short public domain recording of the Gettysburg Address.
URL = "https://raw.githubusercontent.com/gopidon/gemini-advanced-api-course/main/Section_2_Advanced_Multimodality/audio/gettysburg_address.mp3"
!wget -q $URL -O sample_audio.mp3

In [7]:
print("Uploading file to the File API...")
audio_file = client.files.upload(file='sample_audio.mp3')

print(f"File uploaded successfully: {audio_file.name}")

File uploaded successfully: files/soutnqzea4dt


In [13]:
print("\nPlaying sample audio:")
display(Audio('sample_audio.mp3'))



Playing sample audio:


In [14]:
prompt = 'Generate a transcript of the speech.'

response = client.models.generate_content(
  model='gemini-2.5-flash',
  contents=[prompt, audio_file]
)

print(response.text)

Four score and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal.

Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure.

We are met on a great battlefield of that war. We have come to dedicate a portion of that field as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this.

But in a larger sense, we cannot dedicate, we cannot consecrate, we cannot hallow this ground. The brave men, living and dead, who struggled here have consecrated it far above our poor power to add or detract.

The world will little note, nor long remember what we say here, but it can never forget what they did here.

It is for us the living, rather, to be dedicated here to the unfinished work which they, who fought here, have thus far so

## Part 2: Summarization and Analysis

Gemini does more than convert speech to text: it can also reason about what was said. That lets us ask higher-level questions, such as requesting a summary, which the model can only answer by working from the meaning of the speech rather than its literal words.

In [16]:
prompt = 'Please listen to this audio file and provide a one-sentence summary of what it is about.'

response = client.models.generate_content(
  model='gemini-2.5-flash',
  contents=[prompt, audio_file]
)

print(response.text)

The audio is a recitation of Abraham Lincoln's Gettysburg Address, a pivotal speech delivered during the American Civil War, which honors fallen soldiers and reaffirms the nation's founding principles of liberty, equality, and self-governance.
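When moving from short clips to longer recordings, a rough token estimate can be useful for budgeting. The sketch below assumes the rate given in the Gemini audio documentation, about 32 tokens per second of audio. Treat both that constant and the helper name `estimate_audio_tokens` as assumptions. For an exact figure, the SDK's `client.models.count_tokens` call is the better tool.

```python
# Assumption: each second of audio is counted as roughly 32 tokens.
TOKENS_PER_SECOND = 32

def estimate_audio_tokens(duration_seconds):
    """Rough token estimate for an audio clip of the given length in seconds."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    return int(duration_seconds * TOKENS_PER_SECOND)

# One minute of audio under the assumed rate
print(estimate_audio_tokens(60))  # 1920
```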


## Part 3: Multi-file Analysis (Android vs. iOS)

A powerful feature of Gemini is its ability to reason across multiple files at once. You can include several audio files in a single prompt to compare them or find information across them.

For this example, we'll simulate a scenario where we have two audio files: one detailing new features in "Android 16" and another for "iOS 26." We'll then ask the model a question that requires it to listen to both to determine the answer.
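One optional way to make multi-file prompts less ambiguous is to put a short text label before each file, so the model does not have to infer which recording is which. The helper below is a hypothetical sketch, not part of the SDK. It only builds the `contents` list, and it uses placeholder strings where the file objects returned by `client.files.upload` would normally go.

```python
def build_labeled_contents(prompt, labeled_files):
    """Interleave a text label before each file so the model can tell them apart."""
    contents = [prompt]
    for label, file_ref in labeled_files:
        contents.append(f"Audio file: {label}")
        contents.append(file_ref)
    return contents

# Placeholder strings stand in for real uploaded file objects here
contents = build_labeled_contents(
    "In which file are foldable phones mentioned?",
    [("Android 16", "<android_file>"), ("iOS 26", "<ios_file>")],
)
print(contents)
```

In a real request, the resulting list would be passed as the `contents` argument of `client.models.generate_content`, with the uploaded file objects in place of the placeholders.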

In [17]:
ANDROID_URL = "https://raw.githubusercontent.com/gopidon/gemini-advanced-api-course/main/Section_2_Advanced_Multimodality/audio/android_16.mp3"
IOS_URL = "https://raw.githubusercontent.com/gopidon/gemini-advanced-api-course/main/Section_2_Advanced_Multimodality/audio/ios_26.mp3"
!wget -q $ANDROID_URL -O android_16.mp3
!wget -q $IOS_URL -O ios_26.mp3

In [18]:
print("Uploading both the files to the File API...")
ios_file = client.files.upload(file='ios_26.mp3')
android_file = client.files.upload(file='android_16.mp3')

print("Files uploaded successfully")

Uploading both the files to the File API...
Files uploaded successfully


In [22]:
#@title Ask a comparative question across both files
# The prompt is framed around our scenario. We'll ask a question that requires
# the model to identify a specific feature mentioned in only one of the files.
prompt = "I have provided two audio files, one about new Android features and one about new iOS features. Please listen to both and tell me: in which of these audio files are 'foldable phones' mentioned?"

print("--- Analyzing multiple audio files to find the answer ---")

# Pass the prompt and both file objects to the model
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
      prompt,
      android_file,
      ios_file
    ])

display(Markdown(response.text))

--- Analyzing multiple audio files to find the answer ---


"Foldable phones" are mentioned in the **Android features** audio file.

Specifically, the speaker mentions the **Pixel Pro Fold** at around the **3:23 mark** and again at the **3:32 mark**, referencing it in the context of animation features for Android 16.

The iOS features audio file does not mention foldable phones.
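Timestamps reported by the model, like the ones above, are worth verifying, since they may be approximate. One way to follow up is to ask for a transcript of just the surrounding time range, written in MM:SS form. The helpers below are a hypothetical sketch that only builds the prompt text, and the prompt wording is an assumption.

```python
def to_mmss(total_seconds):
    """Format a number of seconds as MM:SS."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"

def transcript_range_prompt(start_seconds, end_seconds):
    """Build a prompt asking for the transcript of one time range."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    return (
        "Provide a transcript of the speech from "
        f"{to_mmss(start_seconds)} to {to_mmss(end_seconds)}."
    )

# A window around the 3:23 mark (195 s to 215 s)
print(transcript_range_prompt(195, 215))
```

The resulting string can be sent together with the relevant file object, in the same way as the other prompts in this lesson.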

In [23]:
prompt = "Listen to both the Android 16 and iOS 26 feature announcements. Are there any similar features or themes mentioned in both audio files? If so, please list them."

print("--- Analyzing multiple audio files to find the answer ---")

# Pass the prompt and both file objects to the model
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
      prompt,
      android_file,
      ios_file
    ])

display(Markdown(response.text))

--- Analyzing multiple audio files to find the answer ---


Based on the audio files for Android 16 and iOS 26, several similar themes and features are mentioned:

1.  **Dynamic and "Physical" UI/UX (Aesthetic Overhaul):**
    *   **Android 16:** The update is described as "almost purely aesthetic," focusing on "subtle tweaks and refinements" to make the UI feel "built different," "more playful," and like "real physical objects" (Material 3 Expressive). It features "playful and reactive physics" for windows and UI elements, making them bounce and feel tangible, and uses explicit haptics to enhance this feeling (e.g., "unstcking" notifications). It mentions a slightly transparent app drawer and glassy quick settings.
    *   **iOS 26:** The update is centered around "Liquid Glass," a "major visual overhaul" emphasizing "transparency, refractions, and reflections." The goal is to unify aesthetics across Apple's platforms with "frosted glass windows" and dynamic reflections of the environment. Elements like the quick settings and volume slider become "glassy," and the lock screen clock dynamically compresses or fades behind subjects, creating a sense of depth and physical interaction.
    *   **Similarity:** Both operating systems are undergoing significant aesthetic overhauls that prioritize a more dynamic, visually rich, and "physical" feeling user experience. They both use transparency, reflections, and sophisticated animations to make UI elements seem more like interactive, real-world objects, enhancing the overall user interaction.

2.  **Enhanced Lock Screen Customization with Dynamic Visuals:**
    *   **Android 16:** Introduces "lock screen customization" with more organized UI, clock font thickness control, notification display options, and shortcut customization. A key new feature is "wallpaper customization" that allows adding "effects" like live weather (fog, rain, snow, sun) that literally appear over the wallpaper, and a "cinematic" effect that adds subtle 3D motion to photos by separating the subject from the background.
    *   **iOS 26:** Features a redesigned lock screen with an "extended clock" that can be made taller, dynamically compresses based on notifications, and "fades behind subjects" in the background image. It introduces a "Spatial Scene Button" that transforms any suitable photo into a "responsive 3D-ish image" by separating foreground and background, allowing you to "peek around," and this can be set as the lock screen.
    *   **Similarity:** Both platforms are significantly enhancing lock screen customization, moving beyond static images to offer dynamic and interactive visual effects for wallpapers and clocks, allowing users to personalize their device's first point of interaction in more engaging ways.

3.  **Advanced AI-driven Visual Intelligence / Smart Features:**
    *   **Android 16:** The speaker mentions future "glanceable ongoing notifications" (similar to iOS Live Activities) and a "Find Hub" (similar to iOS's Find My app) suggesting an ongoing trend towards smart, context-aware features.
    *   **iOS 26:** Significantly updates "Visual Intelligence," integrating it into the screenshot UI. This includes an "image search button" and the ability to "circle with your finger to search something with Google," which is a direct parallel to Android's existing "Circle to Search" feature. It also uses AI to detect event posters in photos and automatically suggests adding them to the calendar.
    *   **Similarity:** iOS is notably adopting and improving upon visual intelligence features that Android has had or been working on. Both OSes are deepening their integration of AI to analyze on-screen content and images, providing contextual actions and information, indicating a convergence in smart utility.

4.  **Improved Call Management (iOS Adopting Android Features):**
    *   **Android 16:** While not explicitly a *new* feature in the described Android 16 beta, Google Pixel phones (which run Android) already have "Call Screening" and "Hold for Me."
    *   **iOS 26:** Introduces "Call Screening" (which transcribes calls from unknown numbers, allowing for canned responses) and "Hold Assist" (which detects when you're on hold and can hold for you, notifying you when a human returns). These are highly similar to existing Google Pixel/Android features.
    *   **Similarity:** iOS is incorporating advanced call management features that have been a hallmark of the Android (specifically Google Pixel) experience, making phone calls more manageable and user-friendly on both platforms.

These similarities suggest a common direction in mobile OS development, focusing on a more immersive visual experience, deeper personalization, and smarter, more convenient utility features, often with one platform influencing the other.

## Part 4: Analyzing Audio from a YouTube Video

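Gemini can also work directly from a YouTube URL. Instead of downloading and uploading the media yourself, you pass the link as `file_data` in the request, and the model analyzes the video's content, including its audio track. This opens up a vast source of content for analysis.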
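Since the request depends on a well-formed link, a quick local sanity check on the URL can catch typos before an API call is made. The function below is a minimal sketch using only the standard library. The name `extract_youtube_id` and the set of URL shapes it recognizes are assumptions, and it does not confirm that the video exists or is publicly accessible.

```python
from urllib.parse import urlparse, parse_qs

def extract_youtube_id(url):
    """Return the video ID from common YouTube URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("www.youtube.com", "youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard watch links carry the ID in the "v" query parameter
            return parse_qs(parsed.query).get("v", [None])[0]
        return None
    if host == "youtu.be":
        # Short links carry the ID as the path
        return parsed.path.lstrip("/") or None
    return None

print(extract_youtube_id("https://www.youtube.com/watch?v=RDOMKIw1aF4"))
```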

In [25]:
from google.genai import types

youtube_url = "https://www.youtube.com/watch?v=RDOMKIw1aF4" # Replace with the YouTube URL you want to analyze

prompt = """
    Analyze the following YouTube video content. Provide a concise summary covering:

    1.  **Main Thesis/Claim:** What is the central point the creator is making?
    2.  **Key Topics:** List the main subjects discussed, referencing specific examples or technologies mentioned (e.g., AI models, programming languages, projects).
    3.  **Call to Action:** Identify any explicit requests made to the viewer.
    4.  **Summary:** Provide a concise summary of the video content.

    Use the provided title, chapter timestamps/descriptions, and description text for your analysis.
"""
# Analyze the video
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=types.Content(
        parts=[
            types.Part(text=prompt),
            types.Part(
                file_data=types.FileData(file_uri=youtube_url)
            )
        ]
    )
)
display(Markdown(response.text))

Here's a concise analysis of the YouTube video content:

1.  **Main Thesis/Claim:** Google's Gemini 2.5 Pro Experimental is an exceptionally powerful and cost-effective AI model, particularly excelling in logical coding tasks, code refactoring, and general reasoning/knowledge benchmarks, often surpassing competitors. However, its frontend UI generation capabilities from visual mockups still need improvement.

2.  **Key Topics:**
    *   **AI Model Comparison:** Gemini 2.5 Pro is compared against OpenAI (o3-mini, GPT-4.5), Claude (3.7 Sonnet, 3.5 Sonnet), Grok 3 Beta, and DeepSeek R1 across various benchmarks.
    *   **Coding Capabilities (Strengths):**
        *   **Complex Game Generation:** Successfully generated an "Ultimate Tic-Tac-Toe" game in Java using Swing in a single prompt.
        *   **Code Refactoring:** Impressively refactored inefficient Rust code (replacing for loops with iterators) into more idiomatic and cleaner Rust, outperforming other models.
        *   **Debugging Assistance:** Capable of debugging its own generated code with minimal additional prompts (demonstrated with a p5.js "Kitten Cannon" game).
    *   **Coding Capabilities (Weaknesses):**
        *   **Frontend UI Generation:** Struggled to build a functional landing page using Vite, React, and Tailwind CSS from a mockup image, leading to a poorly formatted and incomplete result. It did better recreating a static HTML/CSS layout of the X (Twitter) website.
    *   **Benchmark Performance:**
        *   **Reasoning & Knowledge, Science, Mathematics:** Gemini 2.5 Pro showed significant leads or comparable performance in these areas, especially when measured by percentage increase over rivals.
        *   **Coding Benchmarks:** Led in "Code editing (Aider Polyglot)" and "Long context (MRCR)," but fell slightly behind in "Code generation (LiveCodeBench v5)" compared to OpenAI o3-mini. Grok 3 Beta showed strong "Agentic coding (SWE-bench verified)" performance.
    *   **Data Currency:** Gemini 2.5 Pro's training data extends to March 2025, allowing it to provide up-to-date information (e.g., correct React.js version 19.0.0 when "Grounding with Google Search" is enabled).
    *   **Cost-Effectiveness:** The creator emphasizes the model's affordability despite its powerful capabilities.

3.  **Call to Action:**
    *   Explicit: The creator asks viewers to "Subscribe," "like," and hit the "Notification bell."
    *   Implicit: Encourages viewers to share their thoughts and experiences with Gemini 2.5 Pro in the comments.

4.  **Summary:**
    The video evaluates Google's Gemini 2.5 Pro Experimental, concluding it's an "awesome" AI for coding. The creator tests it with several challenges, demonstrating its ability to "one-shot" a complex Java Tic-Tac-Toe game and expertly refactor Rust code. While it proved less adept at generating a complete frontend landing page from a visual mockup, it successfully replicated the static layout of the X (Twitter) website. Benchmark comparisons highlight Gemini 2.5 Pro's strong performance in reasoning, science, mathematics, code editing, and handling long contexts, with its training data current up to March 2025 (especially when augmented by Google Search). The creator praises its effectiveness and affordability, suggesting it could become his primary AI tool for coding, and invites viewer discussion and engagement.