# How To Use Vertex Gemini Vision Pro Generative AI To Inspect Image Details

This notebook outlines how to interact with Vertex AI's Gemini Vision Pro GenAI model to inspect images and generate detailed information about their content. Visual Question Answering (VQA) lets you provide an image to the model and ask a question about the image's contents. In response to your question, you get one or more natural language answers.

## Prepare the Python development environment

First, let's identify any project-specific variables to customize this notebook to your GCP environment. Replace YOUR_PROJECT_ID with your own GCP project ID.

In [19]:
PROJECT_ID = 'YOUR_PROJECT_ID'
LOCATION = 'us-central1'

Next, let's specify the name of the image file you want to inspect, such as "OJ.png" or "shoe.png".

In [20]:
image_filename = 'stuff_on_a_shelf.jpg'

Install any needed Python modules from our requirements.txt file. Most Vertex Workbench environments include all the packages we'll be using, but if you are using an external Jupyter Notebook or require additional packages for your own needs, you can add them to the included requirements.txt file and run the following command.

In [21]:
#pip install -r requirements.txt

Now we will import all required modules. For our purpose, we will be utilizing the following:

- vertexai - The Vertex AI SDK for Python. vertexai.init() sets the project and location used for API requests
- vertexai.preview.generative_models - Interact with the new multimodal models
- base64 - Encodes the local image file as a base64 string and decodes it back to bytes before it is sent to the model
- json - Python module used to interact with JSON data. We ask Gemini to return its results in JSON format.

In [22]:
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
import base64
import json

## Instantiate Vertex AI object

To use Gemini Vision Pro on Vertex AI, you must provide a text description of what you want to inspect, generate or edit. These descriptions are called prompts, and they are the primary way you communicate with Generative AI. Here, we specify what we want the model to identify using a prompt. Play around with this content and see what kind of details you can extract from an image. More information can be found at https://cloud.google.com/vertex-ai/docs/generative-ai/multimodal/overview

In this example, we will ask Gemini to inspect a picture of products on a store shelf and provide its results in JSON format.

In [23]:
vqa_prompt = 'Briefly describe each product you see in this picture. Provide your response in JSON format including the brand, description, price, size and bin location. If you can not determine the size, mark it as NA. Do not include the json prefix in your response.'

Next, we define a function to build the request to be sent to the multimodal model. Two examples are provided. The first creates a base64-encoded string from a local image and uses the from_data function. The second shows how to reference an image stored in a GCS bucket with the from_uri function. When using a GCS bucket as the source, you can specify up to 14 different files and do not have to manually create the base64 string, which makes it convenient for processing multiple files at a time.

In [24]:
with open(image_filename, "rb") as f:
    encoded_base_image = base64.b64encode(f.read())
B64_BASE_IMAGE = encoded_base_image.decode('utf-8')
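
The from_data function needs a MIME type that matches the image bytes, so it can help to derive it from the file name instead of hard-coding it. The following is a minimal sketch and not part of the original notebook; `guess_image_mime_type` is a hypothetical helper built on the standard-library `mimetypes` module, and the `image/jpeg` fallback is an assumption.

```python
import mimetypes


def guess_image_mime_type(filename: str, default: str = "image/jpeg") -> str:
    """Guess an image MIME type from a file name (hypothetical helper).

    Falls back to `default` when the extension is unknown or is not an image type.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        return default
    return mime_type


# Example file names taken from this notebook
print(guess_image_mime_type("stuff_on_a_shelf.jpg"))
print(guess_image_mime_type("OJ.png"))
```

The returned string could then be passed as the mime_type argument when building the image part.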

In [38]:
def generate_text(project_id: str, location: str, b64_image: str, prompt: str) -> str:
    # Initialize Vertex AI
    vertexai.init(project=project_id, location=location)
    # Load the model
    #multimodal_model = GenerativeModel("gemini-pro-vision")
    multimodal_model = GenerativeModel("gemini-1.5-pro-preview-0409")
    # Query the model with the image followed by the text prompt
    response = multimodal_model.generate_content(
        [
            # Decode the base64 string back to raw bytes.
            # The MIME type must match the image file (a JPEG in this notebook).
            Part.from_data(
                data=base64.b64decode(b64_image), mime_type="image/jpeg"
            ),
            # Use the prompt passed in by the caller rather than the global vqa_prompt
            prompt,
        ]
    )
    #print(response)

    return response.text

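Uncomment this section if you would prefer to use an image located in a GCS bucket. Note that this version takes only project_id and location and reads the global vqa_prompt, so the call in the final cell must be updated to match.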

In [39]:
#def generate_text(project_id: str, location: str) -> str:
#    # Initialize Vertex AI
#    vertexai.init(project=project_id, location=location)
#    # Load the model
#    multimodal_model = GenerativeModel("gemini-pro-vision")
#    # Query the model
#    response = multimodal_model.generate_content(
#        [
#            # Add an example image
#            Part.from_uri(
#                "gs://generativeai-downloads/images/scones.jpg", mime_type="image/jpeg"
#            ),
#            # Add an example query
#            vqa_prompt,
#        ]
#    )
#    #print(response)
#    return response.text
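
Since the GCS approach accepts several files in one request, the request list can be built from a list of URIs. The following is a rough sketch under stated assumptions: `validate_gcs_uris` and `build_request` are hypothetical helpers, the limit of 14 is the figure quoted in this notebook (check the current model documentation), and every image is assumed to share one MIME type.

```python
from typing import List

MAX_IMAGES = 14  # limit quoted in this notebook; verify against current model docs


def validate_gcs_uris(uris: List[str], max_images: int = MAX_IMAGES) -> List[str]:
    """Check that every URI is a gs:// path and that the count is within the limit."""
    if len(uris) > max_images:
        raise ValueError(f"At most {max_images} images are allowed, got {len(uris)}")
    for uri in uris:
        if not uri.startswith("gs://"):
            raise ValueError(f"Not a GCS URI: {uri}")
    return list(uris)


def build_request(uris: List[str], prompt: str, mime_type: str = "image/jpeg") -> list:
    """Return the content list for generate_content: one Part per image, then the prompt."""
    # Imported here so the validation helper can be used without the SDK installed.
    from vertexai.preview.generative_models import Part

    parts = [Part.from_uri(uri, mime_type=mime_type) for uri in validate_gcs_uris(uris)]
    return parts + [prompt]


# Validation alone needs no SDK or credentials
print(validate_gcs_uris(["gs://generativeai-downloads/images/scones.jpg"]))
```

With the model loaded as in the function above, the list returned by `build_request` would be passed directly to `multimodal_model.generate_content(...)`.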

## Send the request and display the response

Call the above generate_text function and display the result.

In [40]:
qa_response = json.loads(generate_text(PROJECT_ID, LOCATION, B64_BASE_IMAGE, vqa_prompt))
print(qa_response)

[{'brand': 'Pledge', 'description': 'Expert Care Wood Oil, Amber & Argan Scent', 'price': 10.98, 'size': '14.2 OZ', 'bin_location': '16 57'}, {'brand': 'Old English', 'description': 'Scratch Cover for Light Woods', 'price': 8.28, 'size': '8 OZ', 'bin_location': '16 59'}, {'brand': 'Old English', 'description': 'Scratch Cover for Dark Woods', 'price': 8.28, 'size': '8 OZ', 'bin_location': '16 59'}, {'brand': 'method', 'description': 'Daily Wood Cleaner, Almond', 'price': 6.98, 'size': '28 OZ', 'bin_location': '16 61'}, {'brand': 'Weiman', 'description': 'Leather Conditioning Wipes', 'price': 6.78, 'size': '30 count (7" x 8")', 'bin_location': '16 63'}, {'brand': 'Weiman', 'description': 'Leather Conditioner', 'price': 8.98, 'size': '22 FL. OZ.', 'bin_location': '16 65'}, {'brand': 'Resolve', 'description': 'Pet Expert Heavy Traffic Foam Carpet Cleaner', 'price': 22.98, 'size': '22 OZ. (1 LB. 6 OZ.)', 'bin_location': '16 67'}, {'brand': 'Resolve', 'description': 'Easy Clean Brushing Kit'
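
Calling json.loads directly on the model's text assumes the reply is bare, complete JSON. Models sometimes wrap their answer in a Markdown code fence despite the prompt, so a slightly more tolerant parser can be useful. The following is a minimal sketch; `parse_model_json` is a hypothetical helper, and the sample reply is made up for illustration.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag such as "json".
        newline = cleaned.find("\n")
        cleaned = cleaned[newline + 1:] if newline != -1 else cleaned[len(FENCE):]
    if cleaned.rstrip().endswith(FENCE):
        # Drop the closing fence.
        cleaned = cleaned.rstrip()[: -len(FENCE)]
    # Raises json.JSONDecodeError if the reply is truncated or otherwise invalid.
    return json.loads(cleaned)


# Illustrative reply only; not real model output
sample_reply = FENCE + 'json\n[{"brand": "Pledge", "price": 10.98}]\n' + FENCE
products = parse_model_json(sample_reply)
print(products[0]["brand"])
```

A reply that is cut off partway through still raises json.JSONDecodeError, so callers may want to catch that exception and retry or log the raw text.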

That's it! Congratulations on defining your first visual Q&A with Gemini!