# Vertex AI Gemini Pro (Vision)

Many LLMs and embedding models are already available, but few of them support multimodal (image / video) input.

This example shows how to use Gemini Pro (Vision) for RPA (robotic process automation) use cases.

In [1]:
#! pip3 install --upgrade google-cloud-aiplatform
#! pip3 install google-cloud-storage

In [2]:
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

Wait for the kernel to restart.

In [8]:
from vertexai.preview.generative_models import (
    GenerationConfig,
    GenerativeModel,
    Image,
    Part,
    HarmBlockThreshold,
    HarmCategory,
)

In [9]:
multimodal_model = GenerativeModel("gemini-pro-vision")

### Video Captioning Test

The sample video (wooribank_login.mp4) was created from a screen recording. It was converted to 10 fps to keep the file size under 8 MB, the limit for a Gemini Pro video clip.
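The size limit mentioned above can be checked locally before uploading. The following is a minimal sketch, not part of the original notebook: the helper name `is_within_video_limit` and the constant `MAX_VIDEO_BYTES` are hypothetical, and the 8 MB value is simply taken from the note above and interpreted as 8 * 1024 * 1024 bytes.

```python
import os

# Assumption: the 8 MB figure from the note above, read as 8 * 1024 * 1024 bytes.
MAX_VIDEO_BYTES = 8 * 1024 * 1024


def is_within_video_limit(path, max_bytes=MAX_VIDEO_BYTES):
    """Return True if the file at `path` is smaller than `max_bytes`."""
    return os.path.getsize(path) < max_bytes


# Example usage (path as used later in this notebook):
# if not is_within_video_limit("resources/wooribank_login.mp4"):
#     raise ValueError("Video is too large; lower the frame rate or resolution.")
```

Running a check like this before the upload avoids a round trip to Cloud Storage for a clip that the model would not accept.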

In [10]:
import os

#os.environ["BUCKET_UPLOAD_TEMP"] = "<your-bucket>"

In [7]:
from google.cloud import storage

bucket_name = os.environ.get("BUCKET_UPLOAD_TEMP")
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)

blob = bucket.blob("wooribank_login.mp4")
blob.upload_from_filename("resources/wooribank_login.mp4")




In [19]:
prompt = "Please describe the user's actions in this video in detail, including the URL sites they accessed and the buttons they clicked."

generation_config = GenerationConfig(
    temperature=0.1,
    max_output_tokens=2048,
)

safety_config = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}

video = Part.from_uri(
    uri=f"gs://{bucket_name}/wooribank_login.mp4",
    mime_type="video/mp4",
)
contents = [prompt, video]

responses = multimodal_model.generate_content(contents, generation_config=generation_config, 
    safety_settings=safety_config, stream=False)


In [20]:
print(responses)

candidates {
  content {
    role: "model"
    parts {
      text: " The user opens Google Chrome and goes to google.com. They then search for \"wooribank.\" The user clicks on the first result, which is the Woori Bank website.\n\nOn the Woori Bank website, the user clicks on the \"Personal Login\" button. A login page appears, and the user enters their login information. The user then clicks on the \"Login\" button.\n\nThe user is now logged into their Woori Bank account. They click on the \"My Profile\" tab, and then click on the \"Change Password\" link. The user enters their current password and new password, and then clicks on the \"Change Password\" button.\n\nThe user\'s password has now been changed. They click on the \"Logout\" button to log out of their account."
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  s

### Snapshot analysis

Gemini Pro (Vision) can analyze multiple snapshots at once. 

I will provide Gemini Pro with snapshots of investing.com and a UTHY chart, and ask it to interpret them.

In [11]:
def upload_file_to_temp_bucket(file_name):
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)

    blob = bucket.blob(file_name)
    blob.upload_from_filename(file_name)
    
    # Build the gs:// URI directly instead of rewriting the public HTTPS URL.
    return f"gs://{bucket_name}/{file_name}"

In [12]:
image1_uri = upload_file_to_temp_bucket("resources/investing-1.png")
image2_uri = upload_file_to_temp_bucket("resources/investing-2.png")
image3_uri = upload_file_to_temp_bucket("resources/investing-3.png")
image4_uri = upload_file_to_temp_bucket("resources/investing-4.png")

image_uris = [image1_uri, image2_uri, image3_uri, image4_uri]

prompt = "Please explain what the user is doing in the provided screenshots."

contents = [prompt]

generation_config = GenerationConfig(
    temperature=0.1,
    max_output_tokens=2048,
)

safety_config = {
    HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: HarmBlockThreshold.BLOCK_NONE,
    HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
}

for image_uri in image_uris:
    image = Part.from_uri(
        uri=image_uri,
        mime_type="image/png",
    )
    contents.append(image)

responses = multimodal_model.generate_content(contents, generation_config=generation_config, safety_settings=safety_config, stream=False)
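The loop above hardcodes `image/png` as the MIME type. If the screenshots might come in mixed formats, the type could instead be derived from the file name. This is a small sketch using only the standard library; the helper name `guess_mime_type` and its fallback value are hypothetical and not part of the original notebook.

```python
import mimetypes


def guess_mime_type(file_name, default="application/octet-stream"):
    """Guess a MIME type from the file extension, falling back to `default`."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type or default


# File names used elsewhere in this notebook.
print(guess_mime_type("resources/investing-1.png"))
print(guess_mime_type("resources/wooribank_login.mp4"))
```

The result could then be passed as the `mime_type` argument of `Part.from_uri` in place of the literal string.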




In [14]:
print(responses)

candidates {
  content {
    role: "model"
    parts {
      text: " The user is viewing a stock chart for the US Treasury 30 Year Bond ETF (UTHY). They are looking at the price history of the ETF, and they are using the interactive chart to zoom in and out and to view different time periods. The user is also looking at the news and other information about the ETF."
    }
  }
  finish_reason: STOP
  safety_ratings {
    category: HARM_CATEGORY_HARASSMENT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_HATE_SPEECH
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_SEXUALLY_EXPLICIT
    probability: NEGLIGIBLE
  }
  safety_ratings {
    category: HARM_CATEGORY_DANGEROUS_CONTENT
    probability: NEGLIGIBLE
  }
}
usage_metadata {
  prompt_token_count: 1045
  candidates_token_count: 66
  total_token_count: 1111
}
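For RPA use, the generated description is usually more useful than the full printed response. The sketch below pulls out the text and the total token count, using only the field names visible in the printed output above (`candidates`, `content`, `parts`, `text`, `usage_metadata`, `total_token_count`). The helper `summarize_response` is hypothetical, and `fake_response` is a stand-in object built for illustration so the snippet runs without calling the API; in the notebook, `responses` would be passed instead.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Return (text, total_tokens) from the first candidate of a response."""
    candidate = response.candidates[0]
    # Join all text parts and trim the leading space the model tends to emit.
    text = "".join(part.text for part in candidate.content.parts).strip()
    total_tokens = response.usage_metadata.total_token_count
    return text, total_tokens


# Stand-in mirroring the printed structure above (illustrative values only).
fake_response = SimpleNamespace(
    candidates=[
        SimpleNamespace(
            content=SimpleNamespace(
                parts=[SimpleNamespace(text=" The user is viewing a stock chart.")]
            )
        )
    ],
    usage_metadata=SimpleNamespace(total_token_count=1111),
)

text, total_tokens = summarize_response(fake_response)
print(text)
print(total_tokens)
```

Only the first candidate is read here, which matches the single-candidate outputs shown in this notebook.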

