# How To Use Vertex Gemini for Multimodal Embeddings

This notebook outlines how to interact with Vertex AI's Gemini Pro Vision generative AI model to inspect images and generate detailed information about their content. Visual Question Answering (VQA) lets you provide an image to the model and ask a question about the image's contents. In response to your question you get one or more natural language answers. The final cells pass the image and the generated text to the multimodalembedding model to produce embeddings.

## Prepare the Python development environment

First, let's identify any project-specific variables to customize this notebook to your GCP environment. Replace YOUR_PROJECT_ID with your own GCP project ID.

In [1]:
#PROJECT_ID = 'YOUR_PROJECT_ID'
PROJECT_ID = 'rkiles-demo-host-vpc'
LOCATION = 'us-central1'

Next, let's specify the name of the image file you want to inspect, such as "OJ.png" or "shoe.png".

In [2]:
image_filename = 'stuff_on_a_shelf.jpg'

Install any needed Python modules from our requirements.txt file. Most Vertex Workbench environments include all the packages we'll be using, but if you are using an external Jupyter Notebook or require additional packages for your own needs, you can add them to the included requirements.txt file and run the following command.

In [3]:
#pip install -r requirements.txt

Now we will import all required modules. For our purposes, we will use the following:

- google.auth - Provides authenticated access to Google APIs, such as gemini-pro-vision:streamGenerateContent
- base64 - The Gemini API accepts inline images as base64-encoded strings. This module will help us encode our image file into that form
- requests - This module will allow us to interact directly with Gemini over the REST API.
- json - Python module used to interact with JSON data. The API returns results in JSON format.

In [4]:
import google.auth.transport.requests
import google.auth
import base64
import requests
import json

## Authenticate to the Vertex AI API

Our Vertex Workbench instance is configured to use a specified service account that has IAM access to the Gemini Pro Vision API. The following two cells generate the access token that we will later pass as an authorization bearer token in the request header.

In [5]:
credentials, project_id = google.auth.default()
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)

In [6]:
access_token = credentials.token

## Prepare the HTTP POST request to the REST API

Define the header fields, including the access token we created in the last step.

In [7]:
headers = {
        'Authorization': 'Bearer ' + access_token,
        'Content-Type': 'application/json; charset=utf-8'
    }

You can uncomment the following line for troubleshooting if you want to see how the header will be passed to the API.

In [8]:
#print(headers)
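
Note that printing the full header also prints the bearer token. As a minimal sketch, a small helper can mask the token before display. The name `redact_headers` and the sample values are hypothetical and are not part of the original notebook.

```python
def redact_headers(headers: dict) -> dict:
    """Return a copy of the headers with any bearer token masked."""
    redacted = dict(headers)  # shallow copy so the real headers stay usable
    auth_value = redacted.get("Authorization", "")
    if auth_value.startswith("Bearer "):
        redacted["Authorization"] = "Bearer <redacted>"
    return redacted


# Sample headers with a made-up token, shaped like the ones defined above.
sample_headers = {
    "Authorization": "Bearer abc123",
    "Content-Type": "application/json; charset=utf-8",
}
print(redact_headers(sample_headers))
```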

Next we will specify the URL for the Gemini REST API. You should have already specified the correct project ID in the very first step of this notebook.

In [9]:
url = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/gemini-pro-vision:streamGenerateContent'
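
The same URL pattern is reused later in this notebook for the multimodalembedding:predict endpoint, so it can help to build it in one place. The sketch below assumes a hypothetical helper named `build_vertex_url`; only the URL layout itself comes from the cell above.

```python
def build_vertex_url(project_id: str, location: str, model: str, method: str) -> str:
    """Assemble a Vertex AI publisher-model endpoint URL (hypothetical helper)."""
    host = f"https://{location}-aiplatform.googleapis.com/v1"
    path = (
        f"projects/{project_id}/locations/{location}"
        f"/publishers/google/models/{model}:{method}"
    )
    return f"{host}/{path}"


# Placeholder project ID; substitute your own.
gemini_url = build_vertex_url(
    "YOUR_PROJECT_ID", "us-central1", "gemini-pro-vision", "streamGenerateContent"
)
print(gemini_url)
```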

To use Gemini Pro Vision on Vertex AI you must provide a text description of what you want to inspect, generate or edit. These descriptions are called prompts, and they are the primary way you communicate with generative AI. Here, we specify what we want the model to identify using a prompt. Experiment with this content and see what kind of details you can extract from an image. More information can be found at https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/overview

In this example, we will ask Gemini to inspect a picture of products on a shelf and provide its results in CSV format. The commented-out prompt requests JSON output instead.

In [10]:
#vqa_prompt = 'Briefly describe each product you see in this picture and provide your response in JSON format including the brand, description, price and size. If you can not determine the size, mark it as NA. Do not include the json prefix in your response.'

vqa_prompt = 'Briefly describe each product you see in this picture. Include the brand, description, price, size and item number. If you can not determine the size, mark it as NA. Format the output as a csv with each item on a different row'

Next we will specify the MIME type and location of the image file we want to inspect. The example below reads the local file named in image_filename. Make sure the MIME type matches the file's actual format. More information can be found at https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/gemini

In [11]:
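# Read the image and base64-encode it so it can travel inside the JSON body.
with open(image_filename, "rb") as f:
    encoded_base_image = base64.b64encode(f.read())
B64_BASE_IMAGE = encoded_base_image.decode('utf-8')

image_file = '"data": "'+ B64_BASE_IMAGE +'"'
# The MIME type must match the file format; stuff_on_a_shelf.jpg is a JPEG.
mime_type = '"mimeType": "image/jpeg"'

image_file_data = '{"inlineData": {' + mime_type +','+ image_file +'}}'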
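
As an alternative sketch, the inline image part can be built as a Python dictionary, with the MIME type guessed from the file extension rather than hard-coded. The helper name `build_inline_image_part` is hypothetical, and the three sample bytes stand in for real image data.

```python
import base64
import mimetypes


def build_inline_image_part(filename: str, raw_bytes: bytes) -> dict:
    """Build an inlineData part, guessing the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for {filename!r}")
    return {
        "inlineData": {
            "mimeType": mime_type,
            "data": base64.b64encode(raw_bytes).decode("utf-8"),
        }
    }


# Three placeholder bytes stand in for the contents of a real JPEG file.
part = build_inline_image_part("stuff_on_a_shelf.jpg", b"\xff\xd8\xff")
print(part["inlineData"]["mimeType"])
```

With a real file, you would pass the bytes returned by `f.read()` in the cell above.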

Instead of using a local image, you can optionally provide the location of an image stored in a GCS bucket as outlined below. Use the format gs://BUCKET_NAME/

In [12]:
#GCS_BUCKET = 'YOUR_BUCKET_PATH'
#image_file = '"fileUri": "'+ GCS_BUCKET + image_filename +'"'
#mime_type = '"mimeType": "image/png"'

#image_file_data = '{"fileData": {' + mime_type +','+ image_file +'}}'
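
The GCS variant can be sketched the same way with a dictionary. The helper `build_gcs_image_part` and the bucket name below are placeholders, not values from the original notebook.

```python
def build_gcs_image_part(bucket_uri: str, object_name: str, mime_type: str) -> dict:
    """Build a fileData part that points at an image stored in GCS."""
    if not bucket_uri.startswith("gs://"):
        raise ValueError("bucket_uri must use the format gs://BUCKET_NAME/")
    # Join bucket and object with exactly one slash between them.
    file_uri = bucket_uri.rstrip("/") + "/" + object_name.lstrip("/")
    return {"fileData": {"mimeType": mime_type, "fileUri": file_uri}}


gcs_part = build_gcs_image_part(
    "gs://YOUR_BUCKET_NAME/", "stuff_on_a_shelf.jpg", "image/jpeg"
)
print(gcs_part)
```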

We will now create the request body that will be passed to the REST API.

In [13]:
request_body = '{"contents": {"role": "user","parts": ['+ image_file_data +',{"text": "' + vqa_prompt + '"}]},"generation_config": {"maxOutputTokens": 2048,"temperature": 0.4,"topP": 1.0,"topK": 32}}'
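
Building the body by string concatenation produces invalid JSON if the prompt contains a double quote or a line break. A safer sketch, assuming the same field names as the cell above, is to assemble a dictionary and let `json.dumps` handle the escaping. The helper name `build_gemini_request` and the sample image part are hypothetical.

```python
import json


def build_gemini_request(image_part: dict, prompt: str) -> str:
    """Serialize a request body using the same fields as the cell above."""
    body = {
        "contents": {
            "role": "user",
            "parts": [image_part, {"text": prompt}],
        },
        "generation_config": {
            "maxOutputTokens": 2048,
            "temperature": 0.4,
            "topP": 1.0,
            "topK": 32,
        },
    }
    return json.dumps(body)


# Placeholder image part; the prompt deliberately contains double quotes.
sample_part = {"inlineData": {"mimeType": "image/jpeg", "data": "/9j/"}}
sample_body = build_gemini_request(sample_part, 'Mark unknown sizes as "NA".')
print(sample_body[:80])
```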

Lastly, we will post the request to the Gemini REST API and wait for the response to be generated and returned.

In [14]:
r = requests.post(url, data=request_body, headers=headers)

The following prints the returned status code for verification. You can optionally uncomment the second line to view the raw response for troubleshooting.

In [15]:
print(r.status_code)
#print(r.content)

200


## Process the returned response

The Gemini API returns its response as JSON. We will start by parsing it into a Python object and then extract the generated text.

In [16]:
img_data = r.json()

You can optionally uncomment the following to view the returned json payload

In [17]:
#print(img_data)

As you can see from the full JSON response (uncomment the previous line to display it), the Gemini Pro Vision model can return the text in multiple elements of the array, depending on the image and input prompt used. To process the returned text, we use a simple for loop to iterate through the elements and concatenate their text.

In [29]:
# Each element of the streamed response holds one fragment of the answer.
# Join the fragments in order, then drop any leading whitespace.
qa_response = ''
for chunk in img_data:
    qa_response += chunk['candidates'][0]['content']['parts'][0]['text']
qa_response = qa_response.lstrip()
print(qa_response)

# Save the combined CSV text to a local file.
with open('info.csv', 'w') as file:
    file.write(qa_response)

Brand,Description,Price,Size,Item Number
Pledge,Furniture Polish, $10.98, 14.2 OZ, 1460444
Old English, Scratch Cover for Light Woods, $8.28, 8 OZ, 317394
Old English, Scratch Cover for Dark Woods, $8.28, 8 OZ, 271565
Method, Daily Wood Cleaner, $6.98, 28 OZ, 696426
Weiman, Leather Conditioning Wipes, $6.78, 30 Count, 111359
Weiman, Leather Conditioner, $8.98, 22 OZ, 1314425
Resolve, Easy Clean Brushing Kit, $22.98, NA, 900751
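
If you want to work with the rows rather than the raw text, one possible approach is sketched below. It joins the streamed fragments defensively (skipping elements that carry no text) and parses the result with Python's `csv` module. The helper names and the sample chunks are hypothetical; the chunks are made up to resemble the response shape used above.

```python
import csv
import io


def collect_text(chunks: list) -> str:
    """Join the text parts of the first candidate in each streamed chunk."""
    pieces = []
    for chunk in chunks:
        candidates = chunk.get("candidates") or []
        if not candidates:
            continue
        parts = candidates[0].get("content", {}).get("parts", [])
        pieces.extend(part["text"] for part in parts if "text" in part)
    return "".join(pieces).strip()


def parse_product_csv(text: str) -> list:
    """Parse the model's CSV answer into a list of dicts keyed by header."""
    # skipinitialspace drops the blank that follows each comma in the output.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [
        {key: (value or "").strip() for key, value in row.items() if key is not None}
        for row in reader
    ]


# Made-up chunks, split mid-line on purpose; the last one carries no text.
sample_chunks = [
    {"candidates": [{"content": {"parts": [
        {"text": " Brand,Description,Price,Size,Item Number\nPledge,Furni"}]}}]},
    {"candidates": [{"content": {"parts": [
        {"text": "ture Polish, $10.98, 14.2 OZ, 1460444\n"}]}}]},
    {"candidates": [{"finishReason": "STOP"}]},
]
rows = parse_product_csv(collect_text(sample_chunks))
print(rows[0]["Brand"], rows[0]["Price"])
```

Because the model's output is free-form, a field that itself contains an unquoted comma would shift the columns, so treat the parsed rows as a best effort and validate them before use.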


That's it! Congratulations on defining your first visual Q&A with Gemini!

In [19]:
emb_url = f'https://{LOCATION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{LOCATION}/publishers/google/models/multimodalembedding:predict'

In [20]:
text_body = f'Here is an example of the product details - {qa_response}'

req_body = '{"instances": [{"text": "'+text_body+'", "image": {"bytesBase64Encoded": "'+ B64_BASE_IMAGE +'"}}],"parameters": {"dimension": 128}}'

#--WORKS--#req_body = '{"instances": [{"image": {"bytesBase64Encoded": "'+ B64_BASE_IMAGE +'"}}],"parameters": {"dimension": 128}}'
#print(req_body)
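
The generated `qa_response` contains line breaks and may contain quotes, both of which must be escaped inside a JSON string. As a hedged alternative to the concatenation above, the sketch below builds the same fields as a dictionary and serializes them with `json.dumps`. The helper name `build_embedding_request` and the sample values are hypothetical.

```python
import json


def build_embedding_request(text, image_b64: str, dimension: int = 128) -> str:
    """Serialize an embedding request with the same fields as the cell above."""
    instance = {"image": {"bytesBase64Encoded": image_b64}}
    if text:
        # Text is optional; the image-only variant simply omits this key.
        instance["text"] = text
    body = {"instances": [instance], "parameters": {"dimension": dimension}}
    return json.dumps(body)


# Placeholder base64 data and a text value with a line break and quotes.
sample_req = build_embedding_request('Brand,Price\n"Pledge", $10.98', "QUJD")
print(sample_req)
```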

In [21]:
r_emb = requests.post(emb_url, data=req_body, headers=headers)

In [22]:
print(r_emb.status_code)
#print(r_emb.content)

200


In [23]:
emb_data = r_emb.json()

In [24]:
print(emb_data)

{'predictions': [{'textEmbedding': [0.228069067, -0.144761205, 0.00823902898, 0.00407459075, -0.0589863695, 0.00988769811, 0.150284916, 0.0631608218, 0.0150525263, 0.235682562, -0.118112102, -0.0520278, -0.00307456241, 0.130200252, 0.0537960306, 0.0503928289, -0.0534048043, 0.0361501239, -0.0317128859, -0.0998976156, 0.0890983716, -0.0682249814, -0.028494684, 0.0508125499, 0.0211343244, -0.0636638254, -0.0296866242, -0.0650721937, -0.0205081683, -0.0208172165, 0.0427054912, 0.175412402, -0.0409029052, -0.0355198681, 0.0526674204, 0.0578075089, -0.0512161143, 0.0838210136, -0.0674320534, -0.0749670342, -0.0880455524, 0.0194353666, -0.0238316823, -0.00649636192, -0.0238625705, 0.054662019, -0.0413860977, -0.00781233702, 0.0659946874, -0.0286152679, -0.0230994634, 0.0128300851, -0.071731396, 0.000149143598, -0.0782638043, 0.0305278581, 0.0408225693, 0.544384301, -0.0388627574, -0.0119128758, 0.0747102574, 0.0058103879, 0.00922250468, 0.0716734082, -0.0374136791, -0.00361539098, 0.15679253
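
One common use of these vectors is to compare them. The sketch below computes the cosine similarity between a text embedding and an image embedding. It assumes the prediction also contains an `imageEmbedding` key alongside `textEmbedding`, which is not visible in the truncated output above, and it uses tiny made-up vectors rather than real results.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up response; the 'imageEmbedding' key is an assumption (see above).
sample_response = {
    "predictions": [
        {"textEmbedding": [1.0, 0.0, 0.0], "imageEmbedding": [1.0, 1.0, 0.0]}
    ]
}
prediction = sample_response["predictions"][0]
similarity = cosine_similarity(
    prediction["textEmbedding"], prediction["imageEmbedding"]
)
print(round(similarity, 4))
```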

In [25]:
# generate a unique id for this session
from datetime import datetime
UID = datetime.now().strftime("%m%d%H%M")

In [26]:
BUCKET_URI = f"gs://{PROJECT_ID}-vs-quickstart-{UID}"