<a href="https://colab.research.google.com/github/lakshmishivanik/lakshmi_google_ai_studio/blob/main/Audio_capabilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Copyright 2025 Google LLC.



In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# **Explore audio capabilities with the Gemini API**
Gemini can respond to prompts about audio. For example, Gemini can:

*   Describe, summarize, or answer questions about audio content.
*   Provide a transcription of the audio.
*   Provide answers or a transcription about a specific segment of the audio.

**Note: You can't generate audio output with the Gemini API.**

This guide demonstrates different ways to interact with audio files and audio content using the Gemini API.

# Supported audio formats
Gemini supports the following audio format MIME types:

*   WAV - audio/wav
*   MP3 - audio/mp3
*   AIFF - audio/aiff
*   AAC - audio/aac
*   OGG Vorbis - audio/ogg
*   FLAC - audio/flac
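
When audio is passed as inline data, the MIME type has to be supplied explicitly, so a small lookup keyed on file extension can help. The following is a minimal sketch, not part of the Gemini API: the `AUDIO_MIME_TYPES` table and the `audio_mime_type` helper are hypothetical names, and the extension spellings are assumptions based on the list above.

```python
from pathlib import Path

# MIME types taken from the list above, keyed by file extension.
# The extension spellings are assumptions made for illustration.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(path):
    """Return the MIME type for a supported audio file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    try:
        return AUDIO_MIME_TYPES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported audio format: {suffix or path!r}") from None


print(audio_mime_type("/content/sample.mp3"))  # audio/mp3
```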

# Technical details about audio
Gemini imposes the following rules on audio:

*   Gemini represents each second of audio as 25 tokens; for example, one minute of audio is represented as 1,500 tokens.
*   Gemini can only infer responses to English-language speech.
*   Gemini can "understand" non-speech components, such as birdsong or sirens.
*   The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini doesn't limit the number of audio files in a single prompt; however, the total combined length of all audio files in a single prompt cannot exceed 9.5 hours.
*   Gemini downsamples audio files to a 16 Kbps data resolution.
*   If the audio source contains multiple channels, Gemini combines those channels down to a single channel.
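
The token rate and the length limit above can be turned into a quick budget check before sending a prompt. This is a rough sketch under the figures stated in the list (25 tokens per second, 9.5 hours total); `estimate_audio_tokens` is a hypothetical helper, and the count reported by the API itself remains authoritative and may differ from this estimate.

```python
TOKENS_PER_SECOND = 25          # rate stated in the list above
MAX_AUDIO_SECONDS = 9.5 * 3600  # 9.5 hours, the combined limit stated above


def estimate_audio_tokens(*durations_seconds):
    """Estimate the token count for one or more audio clips in a single prompt."""
    total_seconds = sum(durations_seconds)
    if total_seconds > MAX_AUDIO_SECONDS:
        raise ValueError("Combined audio length exceeds 9.5 hours.")
    return round(total_seconds * TOKENS_PER_SECOND)


print(estimate_audio_tokens(60))      # 1500
print(estimate_audio_tokens(60, 30))  # 2250
```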








# Before you begin: Set up your project and API key
**Install the Gemini API library**

Using Python 3.9+, install the google-genai package using the following pip command:

In [1]:
pip install -q -U google-genai



# Get and secure your API key
You need an API key to call the Gemini API. If you don't already have one, create a key in Google AI Studio.

Using Colab Secrets, you can ensure that your Gemini API key is managed securely and persistently in your Google Colab notebooks.

In [2]:
from google import genai
import os
from google.colab import userdata  # Import for Colab Secrets

# Retrieve API key from Colab Secrets
try:
    api_key = userdata.get('GOOGLE_GENAI_API_KEY')
    print("API key successfully retrieved from Colab Secrets.")  # Confirmation
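except userdata.SecretNotFoundError as err:
    # userdata.get raises SecretNotFoundError (not KeyError) when the secret is missing
    raise ValueError(
        "GOOGLE_GENAI_API_KEY not found in Colab Secrets. Please configure your API key in Colab Secrets."
    ) from err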

API key successfully retrieved from Colab Secrets.
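
Outside Colab, `google.colab` cannot be imported, so the cell above fails. One possible way to make the same notebook run elsewhere is sketched below. It assumes the key is also exported as an environment variable with the same name as the Colab secret; `get_api_key` is a hypothetical helper, not part of any library.

```python
import os


def get_api_key(secret_name="GOOGLE_GENAI_API_KEY"):
    """Return the API key from Colab Secrets, else from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        userdata = None

    if userdata is not None:
        try:
            return userdata.get(secret_name)
        except Exception:
            # Secret missing or notebook access not granted: try the environment.
            pass

    api_key = os.environ.get(secret_name)
    if not api_key:
        raise ValueError(
            f"{secret_name} was not found in Colab Secrets or the environment."
        )
    return api_key
```

The result can then be passed to `genai.Client(api_key=get_api_key())`.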


# Upload an audio file and generate content
You can use the File API to upload an audio file of any size. Always use the File API when the total request size (including the files, text prompt, system instructions, etc.) is larger than 20 MB.
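
The 20 MB threshold can be checked before deciding how to send a file. The sketch below is an approximation: it assumes 20 MB means 20 × 1024 × 1024 bytes and ignores any encoding overhead added to the request, so it is worth leaving some headroom. `needs_file_api` is a hypothetical helper name.

```python
import os

# Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
MAX_INLINE_REQUEST_BYTES = 20 * 1024 * 1024


def needs_file_api(audio_path, prompt="", system_instruction=""):
    """Return True when the request looks too large to send the audio inline."""
    text_bytes = len(prompt.encode("utf-8")) + len(system_instruction.encode("utf-8"))
    total_bytes = os.path.getsize(audio_path) + text_bytes
    return total_bytes > MAX_INLINE_REQUEST_BYTES
```

For example, `needs_file_api('/content/sample.mp3', 'Describe this audio clip')` returns `False` for the 2 MB sample used in this notebook.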

Call client.files.upload to upload a file using the File API. The following code uploads an audio file and then uses the file in a call to client.models.generate_content.

(Note: To run this code in Google Colab, first upload the audio file to the Colab files folder. Files uploaded there are available only during the current runtime and are deleted when that runtime is terminated.)

In [3]:
#Upload an audio file and generate content
from google import genai

client = genai.Client(api_key=api_key)

myfile = client.files.upload(file='/content/sample.mp3')

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=['Describe this audio clip', myfile]
)

print(response.text)

This is a recording of a male singer performing an acoustic cover of Ed Sheeran's song "Perfect". The audio quality is decent, and the singer accompanies himself with guitar.


# Get metadata for a file
You can verify the API successfully stored the uploaded file and get its metadata by calling files.get.

In [4]:
#Get metadata for a file
myfile = client.files.upload(file='/content/sample.mp3')
file_name = myfile.name
myfile = client.files.get(name=file_name)
print(myfile)

name='files/tuz6eqqbcmgs' display_name=None mime_type='audio/mpeg' size_bytes=2097152 create_time=datetime.datetime(2025, 2, 26, 9, 19, 40, 670295, tzinfo=TzInfo(UTC)) expiration_time=datetime.datetime(2025, 2, 28, 9, 19, 40, 657852, tzinfo=TzInfo(UTC)) update_time=datetime.datetime(2025, 2, 26, 9, 19, 40, 670295, tzinfo=TzInfo(UTC)) sha256_hash='NjhkMThjNWUzMzdjYWY1NDAyOTkxMWM4MTdlNGFjNzQxNjVmZDBkN2IyNWJlYzI5NmUzMTNhOTk1NGYyZGFkYQ==' uri='https://generativelanguage.googleapis.com/v1beta/files/tuz6eqqbcmgs' download_uri=None state=<FileState.ACTIVE: 'ACTIVE'> source=<FileSource.UPLOADED: 'UPLOADED'> video_metadata=None error=None
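
The `create_time` and `expiration_time` fields in this output show how long the file is retained. As a small illustration, the snippet below copies the two timestamps from the output above and computes the difference; `hours_remaining` is a hypothetical helper for checking how much time an uploaded file has left.

```python
from datetime import datetime, timezone

# Timestamps copied from the metadata printed above.
create_time = datetime(2025, 2, 26, 9, 19, 40, 670295, tzinfo=timezone.utc)
expiration_time = datetime(2025, 2, 28, 9, 19, 40, 657852, tzinfo=timezone.utc)


def hours_remaining(expiration_time, now=None):
    """Return the hours left before a file expires (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    return (expiration_time - now).total_seconds() / 3600


lifetime = expiration_time - create_time
print(f"{lifetime.total_seconds() / 3600:.2f} hours")  # 48.00 hours
```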


# List uploaded files
You can upload multiple audio files (and other kinds of files). The following code generates a list of all the files uploaded:

In [5]:
#List uploaded files
print('My files:')
for f in client.files.list():
    print(' ', f.name)

My files:
  files/tuz6eqqbcmgs
  files/wf3yfqu6ogvx
  files/hl2ssbtkptlb
  files/byez0y7tnyx5
  files/befxceiov8sn
  files/tc9069jdvdc8
  files/jhyw18giw2jj
  files/kf9uw7yncesk
  files/9ko0mxy9ow87
  files/nif3ng8o9hmk
  files/oj2o1lvbr52l
  files/apu6m5r2kohr
  files/2i5bxrn7ehzk
  files/lcz2hqi0hxng
  files/soxm87ox36lx


# Delete uploaded files
Files are automatically deleted after 48 hours. Optionally, you can manually delete an uploaded file. For example:

In [6]:
#Delete uploaded files
myfile = client.files.upload(file='/content/sample.mp3')
client.files.delete(name=myfile.name)

DeleteFileResponse()
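
Because each cell in this notebook uploads the sample again, the file list grows quickly. The listing and deletion calls shown above can be combined into a cleanup helper. This is a sketch only: `delete_all_files` is a hypothetical name, it relies solely on the `client.files.list()` and `client.files.delete(name=...)` calls already used in this notebook, and it removes every file visible to the API key, so use it with care.

```python
def delete_all_files(client):
    """Delete every file returned by client.files.list() and return their names."""
    # Collect the names first so the listing is not consumed while deleting.
    names = [f.name for f in client.files.list()]
    for name in names:
        client.files.delete(name=name)
    return names
```

Calling `delete_all_files(client)` returns the names of the files it removed.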

# Provide the audio file as inline data in the request
Instead of uploading an audio file, you can pass audio data in the same call that contains the prompt.

Note the following about providing audio as inline data:

*   The maximum request size is 20 MB, which includes text prompts, system instructions, and files provided inline. If your file's size would make the total request size exceed 20 MB, use the File API to upload the file instead.
*   If you're using an audio sample multiple times, it is more efficient to use the File API.

The following code reads a small audio file from disk and passes it, along with the prompt, to Gemini:

In [7]:
#Provide the audio file as inline data in the request
from google.genai import types

# Read the raw audio bytes from disk
with open('/content/sample.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[
    'Describe this audio clip',
    types.Part.from_bytes(
      data=audio_bytes,
      mime_type='audio/mp3',
    )
  ]
)

print(response.text)

This is a recording of a male vocalist singing an acoustic song. It seems to be a cover, possibly of Ed Sheeran's "Perfect," or a similar song. There's a simple instrumental backing, likely just an acoustic guitar and perhaps some subtle production enhancements to give it a polished feel. The singer has a pleasant voice and the overall quality is decent.



# Get a transcript of the audio file
To get a transcript, just ask for it in the prompt. For example:

In [9]:
#To get a transcript, just ask for it in the prompt
myfile = client.files.upload(file='/content/sample.mp3')
prompt = 'Generate a transcript of the speech.'

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

I found a love for me
Darling just dive right in, and follow my lead
Well I found a girl, beautiful and sweet
I never knew you were the someone
waiting for me
'Cause we were just kids when we fell in love
Not knowing what it was
I will not give you up this time
Darling, just kiss me slow, your heart is all
I own
And in your eyes, you're holding mine
Baby, I'm dancing in the dark
With you between my arms
Barefoot on the grass
Listening to our favorite song

When you said you looked a mess,
I whispered underneath my breath
But you heard it, darling you look perfect tonight
Well I found a woman,
stronger than anyone I know
She shares my dreams,
I hope that someday I'll share her home
I found a love,
to carry more than just my secrets
To carry love, to carry children



# Get the duration of the audio file
To get the duration, just ask for it in the prompt. For example:

In [10]:
# Create a prompt to analyze the duration of the file.
myfile = client.files.upload(file='/content/sample.mp3')
prompt = "analyze and tell me how many minutes is the audio file"

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

The audio file is 3 minutes and 51 seconds long.


# Refer to timestamps in the audio file
A prompt can specify timestamps of the form MM:SS to refer to particular sections in an audio file. For example, the following prompt requests a transcript that:

*   Starts from the beginning of the file.
*   Ends at 3 minutes 9 seconds from the beginning of the file.

In [12]:
# Create a prompt containing timestamps.
myfile = client.files.upload(file='/content/sample.mp3')
prompt = 'Generate a transcript of the audio from the beginning of the audio file to 3 minutes 9 seconds of the file.'

response = client.models.generate_content(
  model='gemini-2.0-flash',
  contents=[prompt, myfile]
)

print(response.text)

I found a love for me. Darling just dive right in follow my lead. I found a girl beautiful and sweet. I never knew you were the someone waiting for me.
Cause we were just kids when we fell in love not knowing what it was. I will not give you up this time. Darling just kiss me slow.
Your heart is all I own.
And in your eyes you're holding mine. Baby, I dancing in the dark with you between my arms.
Barefoot on the grass listening to our favorite song. When you said you looked a mess, I whispered underneath my breath, but you heard it.
Darling, you look perfect tonight. Well I found a woman stronger than anyone I know. She shares my dreams, I hope that someday I'll share her home.
I found a love to carry more than just my secrets. To carry love, to carry children.
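
The prompt above spells the end point out in words. When segment boundaries come from code, for example as offsets in seconds, the MM:SS form can be generated programmatically. The helpers below are a hypothetical sketch; `to_mmss` and `transcript_prompt` are illustrative names, and the exact prompt wording is an assumption.

```python
def to_mmss(seconds):
    """Format a number of seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def transcript_prompt(start_seconds, end_seconds):
    """Build a transcript prompt for the segment between two offsets."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    return (
        "Generate a transcript of the audio from "
        f"{to_mmss(start_seconds)} to {to_mmss(end_seconds)}."
    )


print(transcript_prompt(0, 189))
# Generate a transcript of the audio from 00:00 to 03:09.
```

The resulting string can be passed as the prompt in `contents=[prompt, myfile]`, exactly as in the cell above.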


# Count tokens
Call the client.models.count_tokens method to get a count of the number of tokens in the audio file. For example:

In [13]:
# Call count_tokens to get the number of tokens in the audio file.
myfile = client.files.upload(file='/content/sample.mp3')
response = client.models.count_tokens(
  model='gemini-2.0-flash',
  contents=[myfile]
)

print(response)

total_tokens=4065 cached_content_token_count=None
