# Gemini API with images and video input

In [1]:
!pip install google-generativeai
!pip install rich



In [2]:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')

import google.generativeai as genai
genai.configure(api_key=GOOGLE_API_KEY)

from rich.console import Console
console = Console()

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 image files.

Images must be in one of the following image data MIME types:
*   PNG - image/png
*   JPEG - image/jpeg
*   WEBP - image/webp
*   HEIC - image/heic
*   HEIF - image/heif
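
A file can be checked against this list before it is uploaded. The sketch below maps file extensions to the MIME types listed above; the extension table and the `image_mime_type` helper are illustrative assumptions, not part of the SDK.

```python
from pathlib import Path

# Extension -> MIME type, limited to the image types listed above.
# The choice of extensions is an assumption made for this sketch.
SUPPORTED_IMAGE_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".webp": "image/webp",
    ".heic": "image/heic",
    ".heif": "image/heif",
}


def image_mime_type(path: str) -> str:
    """Return the MIME type for a supported image file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_IMAGE_TYPES:
        raise ValueError(f"Unsupported image type: {suffix or '(no extension)'}")
    return SUPPORTED_IMAGE_TYPES[suffix]


print(image_mime_type("animales_mexico.jpeg"))
```

The returned string could then be passed to `genai.upload_file` through its `mime_type` argument instead of relying on automatic detection.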

Each image is equivalent to 258 tokens.
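
As a rough budgeting aid, the two figures above (258 tokens per image, at most 3,600 image files) can be combined into a quick estimate. This is a back-of-the-envelope sketch: `estimate_image_tokens` is a hypothetical helper, and it ignores the tokens used by any text in the prompt.

```python
TOKENS_PER_IMAGE = 258   # figure quoted above
MAX_IMAGE_FILES = 3600   # limit quoted above for Gemini 1.5 Pro and 1.5 Flash


def estimate_image_tokens(num_images: int) -> int:
    """Estimate the tokens consumed by the images alone."""
    if not 0 <= num_images <= MAX_IMAGE_FILES:
        raise ValueError(f"num_images must be between 0 and {MAX_IMAGE_FILES}")
    return num_images * TOKENS_PER_IMAGE


print(estimate_image_tokens(1))                # 258
print(estimate_image_tokens(MAX_IMAGE_FILES))  # 928800
```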

For best results:

*   Rotate images to the correct orientation before uploading.
*   Avoid blurry images.
*   If using a single image, place the text prompt after the image.

In [3]:
!curl -o animales_mexico.jpeg https://storage.googleapis.com/questionsanswersproject/animales_mexico.jpeg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  283k  100  283k    0     0  1358k      0 --:--:-- --:--:-- --:--:-- 1360k


In [4]:
ref_image = genai.upload_file(path="animales_mexico.jpeg", display_name="animales_mexico")
console.print(f"{ref_image=}")

In [5]:
file = genai.get_file(name=ref_image.name)
console.print(f"{file=}")

In [6]:
model = genai.GenerativeModel('gemini-1.5-flash-001')
prompt = [ref_image, "Write a nature blog with the following Mexican animals:"]
response = model.generate_content(prompt, stream=True)
for chunk in response:
  console.print(chunk.text)

In [7]:
prompt = "Return a bounding box for the Axolotl. \n [ymin, xmin, ymax, xmax]"
response = model.generate_content([ref_image, prompt])

print(response.text)

The bounding box for the Axolotl is [61, 126, 412, 432].
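
The returned numbers are not necessarily pixel coordinates. Gemini's documentation describes bounding boxes as normalized to a 0-1000 range, so they would need to be rescaled to the image size before drawing or cropping. The sketch below assumes that normalization applies to this response; `parse_box` and `to_pixels` are hypothetical helpers, and the 1000x500 image size is a made-up example, not the real size of `animales_mexico.jpeg`.

```python
import re

BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_box(text: str) -> list[int]:
    """Extract the first [ymin, xmin, ymax, xmax] list found in the model's reply."""
    match = BOX_PATTERN.search(text)
    if match is None:
        raise ValueError("No bounding box found in response text")
    return [int(value) for value in match.groups()]


def to_pixels(box: list[int], width: int, height: int, scale: int = 1000):
    """Rescale a normalized box to pixels, returned as (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    return (
        xmin * width // scale,
        ymin * height // scale,
        xmax * width // scale,
        ymax * height // scale,
    )


reply = "The bounding box for the Axolotl is [61, 126, 412, 432]."
box = parse_box(reply)
print(box)                                    # [61, 126, 412, 432]
print(to_pixels(box, width=1000, height=500)) # (126, 30, 432, 206)
```

The `(left, top, right, bottom)` order matches what Pillow's `Image.crop` expects, which makes it easy to cut out the detected region.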


# Prompt with multiple images

In [8]:
!curl -o got.jpg https://storage.googleapis.com/questionsanswersproject/got.jpg
!curl -o trump.jpeg https://storage.googleapis.com/questionsanswersproject/trump.jpeg

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 67883  100 67883    0     0   368k      0 --:--:-- --:--:-- --:--:--  370k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  157k  100  157k    0     0  1240k      0 --:--:-- --:--:-- --:--:-- 1246k


In [12]:
model = genai.GenerativeModel('gemini-1.5-flash-001')
ref_image_got = genai.upload_file(path="got.jpg", display_name="got")
ref_image_trump = genai.upload_file(path="trump.jpeg", display_name="trump")

prompt = ["""
      image 1 <ref_image_got>
      image 2 <ref_image_trump>
      Create a funny story that combines images 1 and 2
    """,
    ref_image_got,
    ref_image_trump]

response = model.generate_content(prompt)
console.print(response.text)
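
Note that placeholders such as `<ref_image_got>` inside the prompt string are plain text: the model receives them as literal characters, and the files themselves follow as separate parts. One alternative, sketched here, is to interleave a short label before each file so that "image 1" and "image 2" sit next to the images they name. `labeled_prompt` is a hypothetical helper, and whether this improves the output is an assumption worth testing.

```python
from typing import Any


def labeled_prompt(images: list[Any], instruction: str) -> list[Any]:
    """Build a contents list of the form [label, image, label, image, ..., instruction]."""
    parts: list[Any] = []
    for index, image in enumerate(images, start=1):
        parts.append(f"image {index}")
        parts.append(image)
    parts.append(instruction)
    return parts


# Stand-in strings are used here so the sketch runs on its own.
# In the notebook, pass the uploaded files instead:
#   labeled_prompt([ref_image_got, ref_image_trump], "Create a funny story ...")
print(labeled_prompt(["<file A>", "<file B>"],
                     "Create a funny story that combines images 1 and 2"))
```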

# Prompting with video

Video must be in one of the following video format MIME types:

*  video/mp4
*  video/mpeg
*  video/mov
*  video/avi
*  video/x-flv
*  video/mpg
*  video/webm
*  video/wmv
*  video/3gpp

The File API service extracts image frames from videos at 1 frame per second (FPS) and audio at 1 Kbps, single channel, adding timestamps every second.

Individual frames are 258 tokens each, and audio is 32 tokens per second.
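
These figures give a rough per-second cost for video: one frame (258 tokens) plus one second of audio (32 tokens), or 290 tokens. The sketch below turns that into an estimate for a whole clip. `estimate_video_tokens` is a hypothetical helper, the 90-second duration is an arbitrary example, and the estimate leaves out timestamps, metadata, and the text prompt.

```python
TOKENS_PER_FRAME = 258        # one frame is sampled per second of video
AUDIO_TOKENS_PER_SECOND = 32


def estimate_video_tokens(duration_seconds: int) -> int:
    """Estimate frame + audio tokens for a video of the given length."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    per_second = TOKENS_PER_FRAME + AUDIO_TOKENS_PER_SECOND  # 290
    return duration_seconds * per_second


print(estimate_video_tokens(90))  # 26100
```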

In [13]:
!wget https://storage.googleapis.com/questionsanswersproject/simpsons-first-episode.mp4

--2024-07-26 04:04:53--  https://storage.googleapis.com/questionsanswersproject/simpsons-first-episode.mp4
Resolving storage.googleapis.com (storage.googleapis.com)... 64.233.181.207, 173.194.206.207, 173.194.193.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|64.233.181.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4205052 (4.0M) [video/mp4]
Saving to: ‘simpsons-first-episode.mp4’


2024-07-26 04:04:53 (75.4 MB/s) - ‘simpsons-first-episode.mp4’ saved [4205052/4205052]



In [14]:
import time

ref_video = genai.upload_file(path="simpsons-first-episode.mp4", display_name="simpsons-first-episode")

# Check whether the file is ready to be used.
while ref_video.state.name == "PROCESSING":
    print('.', end='')
    time.sleep(1)
    ref_video = genai.get_file(ref_video.name)

if ref_video.state.name == "FAILED":
    raise ValueError(f"Video processing failed: {ref_video.state.name}")

console.print(f"{ref_video=}")

..
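
The polling loop above runs until the file leaves the `PROCESSING` state, however long that takes. A variant with a timeout is sketched below. `wait_for_file` is a hypothetical helper that takes the lookup function as an argument, so in the notebook it would be called as `wait_for_file(genai.get_file, ref_video.name)`; the demo uses a fake lookup so the block runs without network access.

```python
import time
from types import SimpleNamespace
from typing import Any, Callable


def wait_for_file(get_file: Callable[[str], Any], name: str,
                  poll_seconds: float = 2.0, timeout_seconds: float = 300.0) -> Any:
    """Poll get_file(name) until the file is no longer PROCESSING, or time out."""
    deadline = time.monotonic() + timeout_seconds
    file = get_file(name)
    while file.state.name == "PROCESSING":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name} still processing after {timeout_seconds}s")
        time.sleep(poll_seconds)
        file = get_file(name)
    if file.state.name == "FAILED":
        raise ValueError(f"Processing failed for {name}")
    return file


# Demo with a fake lookup that reports PROCESSING twice, then ACTIVE.
_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])


def _fake_get_file(name: str) -> Any:
    return SimpleNamespace(name=name, state=SimpleNamespace(name=next(_states)))


ready = wait_for_file(_fake_get_file, "files/demo", poll_seconds=0)
print(ready.state.name)  # ACTIVE
```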

In [15]:
prompt = "Identify the characters in the video"
model = genai.GenerativeModel('gemini-1.5-flash-001')
response = model.generate_content([ref_video, prompt])
console.print(response.text)

In [17]:
prompt = "What happens between 01:05 and 01:19?"
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([ref_video, prompt],
                                  request_options={"timeout": 600})
print(response.text)

The scene is dark between 01:05 and 01:09, and there are four pairs of eyes blinking in the dark between 01:14 and 01:19. 


In [19]:
prompt = "Transcribe the video from 00:05 until 00:30"
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([ref_video, prompt],
                                  request_options={"timeout": 600})
print(response.text)

0:05 Um, Dad? Yeah?
0:06 What is the mind? Is it just a system of impulses or is it something tangible?
0:13 Relax.
0:14 What is mind? No matter.
0:17 What is matter? Never mind.
0:20 Thanks, Dad.
0:22 Good night, son.
0:26 Good night, Lisa.
0:28 Good night, Mom.
0:29 Sweet dreams.
0:31 Thanks, Mom. Sleep tight.
0:34 I will, Mom. Don't let the bedbugs bite.
0:40 Bedbugs?
0:43 Rock a bye baby, in the tree top.
0:48 When the wind blows, the cradle will rock.
0:55 When the bough breaks, the cradle will fall.
1:01 And down will come baby cradle and all.
1:09 Sweet dreams. 
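
The transcript above continues past the requested 00:30 end point. If only the requested window is needed, the lines can be filtered on the client side. This is a minimal sketch that assumes each line starts with an `M:SS` timestamp, as in the output above; `parse_transcript` and `clip` are hypothetical helper names.

```python
import re

LINE_PATTERN = re.compile(r"^(\d+):(\d{2})\s+(.*)$")


def parse_transcript(text: str) -> list[tuple[int, str]]:
    """Turn 'M:SS text' lines into (seconds, text) pairs, skipping other lines."""
    entries = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            minutes, seconds, spoken = match.groups()
            entries.append((int(minutes) * 60 + int(seconds), spoken.strip()))
    return entries


def clip(entries: list[tuple[int, str]], start: int, end: int) -> list[tuple[int, str]]:
    """Keep only the entries whose timestamp falls within [start, end] seconds."""
    return [(t, spoken) for t, spoken in entries if start <= t <= end]


sample = """0:05 Um, Dad? Yeah?
0:29 Sweet dreams.
0:31 Thanks, Mom. Sleep tight.
1:09 Sweet dreams."""

for seconds, spoken in clip(parse_transcript(sample), start=5, end=30):
    print(seconds, spoken)
```

In the notebook, `response.text` would take the place of `sample`.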

