##### Copyright 2024 Google LLC.

In [None]:
# @title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Image captioning using PaliGemma
In this notebook, we'll explore image captioning using PaliGemma, a state-of-the-art vision-language model developed by Google. PaliGemma is designed to understand both images and text, making it ideal for generating accurate and descriptive captions for a wide range of images.

Image captioning plays a crucial role in making the web accessible to everyone, particularly individuals who are blind or visually impaired. While alternative text (alt text) provides a concise description of an image, captions offer a more comprehensive explanation, conveying the context, details, and nuances that might be missed in a brief alt text. This ensures that all users, regardless of their visual abilities, can fully understand and appreciate the content of images on websites, contributing to a more inclusive and equitable online experience.


## Setup

### Select the Colab runtime
To complete this tutorial, you'll need a Colab runtime with enough resources to run the PaliGemma model. Use an L4 GPU or an A100 GPU, as a T4 will be insufficient:

1. In the upper-right of the Colab window, select **▾ (Additional connection options)**.
2. Select **Change runtime type**.
3. Under **Hardware accelerator**, select **L4 GPU** or **A100 GPU**.
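
After switching runtimes, it can help to confirm which GPU the notebook actually received. The following is a minimal sketch, not part of the original tutorial: it assumes the `nvidia-smi` command-line tool is available on the runtime, and the helper names (`parse_gpu_line`, `is_recommended`, `detect_gpus`) are hypothetical. The name check is a simple substring match against the two GPU families recommended above, so treat its verdict as a hint rather than a guarantee.

```python
import shutil
import subprocess

# GPU families this tutorial recommends (taken from the text above).
RECOMMENDED_GPUS = ("L4", "A100")


def parse_gpu_line(line):
    """Split one 'name, memory' CSV line from nvidia-smi into a tuple."""
    name, _, memory = line.partition(",")
    return name.strip(), memory.strip()


def is_recommended(gpu_name):
    """Return True if the GPU name mentions a recommended family (substring match)."""
    return any(family in gpu_name for family in RECOMMENDED_GPUS)


def detect_gpus():
    """Return (name, memory) tuples for visible GPUs, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]


# Report each visible GPU; prints nothing on a CPU-only runtime.
for name, memory in detect_gpus():
    status = "recommended" if is_recommended(name) else "not recommended for this tutorial"
    print(f"{name} ({memory}): {status}")
```

If the loop prints nothing, the runtime most likely has no GPU attached, and the runtime type should be changed as described above.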


### Gemma setup on Kaggle
To complete this tutorial, you'll first need to complete the setup instructions at [Gemma setup](https://ai.google.dev/gemma/docs/setup), as PaliGemma is a Gemma variant.

In brief, you will need to:

* Get access to Gemma on kaggle.com.
* Generate and configure a Kaggle username and API key.

After you've completed the Gemma setup, move on to the next section, where you'll set your username and API key as environment variables for your Colab environment.

## Accessing Kaggle Credentials

We will need to provide our Kaggle username and API key in order to download the PaliGemma model from Kaggle.

The code below reads these credentials from Colab secrets (exposed in code as `google.colab.userdata`), so they never appear in the notebook itself.

If you haven't already, add `KAGGLE_USERNAME` and `KAGGLE_KEY` as secrets in your Colab environment.

In [4]:
import os
from google.colab import userdata

os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
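
As a small sanity check after the cell above, the sketch below verifies that both environment variables are present and non-empty before any download is attempted. The helper name `missing_kaggle_vars` is hypothetical and not part of any Kaggle or Colab API. It only inspects the environment and does not validate the credentials against Kaggle.

```python
import os

# Environment variables the tutorial sets from the Colab user data.
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_kaggle_vars()
if missing:
    print("Missing Kaggle credentials:", ", ".join(missing))
else:
    print("Kaggle credentials are set.")
```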

## Installing Required Libraries

Before we dive into using PaliGemma, let's make sure we have all the necessary libraries installed. The following commands will upgrade `keras-cv`, `keras-nlp`, and `keras` to their latest versions, ensuring we have access to the most up-to-date features and improvements for working with vision and language models.

In [1]:
!pip install --upgrade keras-cv
!pip install --upgrade keras-nlp
!pip install --upgrade keras

Collecting keras-cv
  Downloading keras_cv-0.9.0-py3-none-any.whl (650 kB)
Collecting keras-core (from keras-cv)
  Downloading keras_core-0.1.7-py3-none-any.whl (950 kB)
Collecting namex (from keras-core->keras-cv)
  Downloading namex-0.0.8-py3-none-any.whl (5.8 kB)
Installing collected packages: namex, keras-core, keras-cv
Successfully installed keras-core-0.1.7 keras-cv-0.9.0 na

## Loading PaliGemma and Configuring Image Dimensions

Now we'll load the PaliGemma model itself. We'll use a preset configuration to streamline the process and ensure we have a compatible model for image captioning.

We'll use the **pali_gemma_3b_mix_448** preset, which expects images of 448x448 pixels. We can request that size when we load our image later.

>⚠️ Matching the image size to the preset is crucial, as PaliGemma expects images in a specific format for accurate caption generation.

For future reference, the various presets primarily differ in three aspects:

1. **Image Size:**
  - `_224`: trained on, and expects, input images of 224x224 pixels. This suits smaller images and is the least computationally demanding.
  - `_448`: trained on, and expects, input images of 448x448 pixels. This offers a balance between detail and computational cost.
  - `_896`: trained on, and expects, input images of 896x896 pixels. This provides the highest level of detail, but is the most computationally intensive.
2. **Training Type:**
  - `_pt`: *pre-trained* on a large dataset of image-text pairs. It's a good starting point for general image captioning tasks. In the preset names listed below, pre-trained models carry no training tag.
  - `_mix`: *mix fine-tuned* on a diverse set of vision-language tasks. It's expected to perform well on a wider variety of tasks, but is generally intended for research purposes only.
3. **Text Sequence Length:** \
This refers to the maximum length of the generated caption. Presets with higher image sizes usually have longer text sequence lengths as they can potentially provide more detailed descriptions.

At the time of writing (2024/05/28), the available presets are as follows.

| Preset name | Parameters | Description |
|-------------|------------|-------------|
| pali_gemma_3b_mix_224 | 2.92B | image size 224, mix fine-tuned, text sequence length is 256 |
| pali_gemma_3b_mix_448 | 2.92B | image size 448, mix fine-tuned, text sequence length is 512 |
| pali_gemma_3b_224 | 2.92B | image size 224, pre-trained, text sequence length is 128 |
| pali_gemma_3b_448 | 2.92B | image size 448, pre-trained, text sequence length is 512 |
| pali_gemma_3b_896 | 2.93B | image size 896, pre-trained, text sequence length is 512 |

You can always find an up-to-date list in the [Keras documentation for `PaliGemmaCausalLM.from_preset`](https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method).
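
The naming scheme in the table above can be read mechanically. The sketch below is illustrative only: the `PresetInfo` type and `parse_preset` helper are hypothetical, and the regular expression is an assumption derived from the five preset names listed above, so it may not cover presets added later. It treats a name without `mix` as pre-trained.

```python
import re
from typing import NamedTuple


class PresetInfo(NamedTuple):
    image_size: int  # edge length in pixels, e.g. 448
    training: str    # "mix" (mix fine-tuned) or "pt" (pre-trained)


# Assumed pattern, based only on the preset names in the table above.
_PRESET_RE = re.compile(r"^pali_gemma_3b_(?:(mix)_)?(\d+)$")


def parse_preset(name):
    """Extract image size and training type from a preset name."""
    match = _PRESET_RE.match(name)
    if match is None:
        raise ValueError(f"Unrecognized preset name: {name!r}")
    training = "mix" if match.group(1) else "pt"
    return PresetInfo(image_size=int(match.group(2)), training=training)


info = parse_preset("pali_gemma_3b_mix_448")
print(info.image_size, info.training)
```

Raising an error on an unrecognized name is a deliberate choice here: it fails early rather than silently producing a wrong image size.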

In [5]:
import keras_nlp

# load paligemma from a preset
#
# for more info and options to use, see the docs:
# https://keras.io/api/keras_nlp/models/pali_gemma/pali_gemma_causal_lm/#frompreset-method
model_name = "pali_gemma_3b_mix_448"
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(model_name)

# we need to resize the image to the size expected by the model
# this assumes the preset name ends with _<size>, e.g. "_448"
target_size_x = int(model_name.rsplit("_", 1)[-1])
target_size = (target_size_x, target_size_x)

Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/metadata.json...
100%|██████████| 143/143 [00:00<00:00, 191kB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/task.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/config.json...
100%|██████████| 861/861 [00:00<00:00, 1.02MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/model.weights.h5...
100%|██████████| 5.45G/5.45G [07:10<00:00, 13.6MB/s]
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/preprocessor.json...
Downloading from https://www.kaggle.com/api/v1/models/keras/paligemma/keras/pali_gemma_3b_mix_448/1/download/tokenizer.json...
100%|██████████| 410/410 [00:00<00:00, 494kB/s]
Downloading from https://www.kaggle.com/api/

## Loading and Preparing the Image

Let's load our image and get it ready for PaliGemma. We'll use a sample image of a cat (my cat!) in this example.

The code below will load the image from a URL, resize it to the dimensions expected by the PaliGemma model, and convert it into a Tensor object, which is the format required for model input.
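
One detail worth keeping in mind: resizing directly to a square target stretches any image that is not already square. If that distortion matters for your images, one option is to crop to the largest centered square before resizing. The helper below is a hypothetical sketch of the arithmetic only. It returns a box in the `(left, top, right, bottom)` order used by Pillow's `Image.crop`, and it is not something the tutorial's own code does. Keras's `load_img` also accepts a `keep_aspect_ratio` argument in recent versions, which is worth checking in the documentation for your installed version.

```python
def center_square_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)


# A 1600x1200 landscape image keeps its full height and loses 200 px per side.
print(center_square_box(1600, 1200))  # → (200, 0, 1400, 1200)
```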

In [7]:
from keras.preprocessing.image import load_img, img_to_array
import tensorflow as tf

# here we're loading an image of my cat because that's easier than finding a
# creative commons image
image_path = tf.keras.utils.get_file('juice.jpg', 'https://jethac.github.io/assets/juice.jpg')
keras_img = load_img(image_path, target_size=target_size)

# convert image to NumPy array
img_array = img_to_array(keras_img)

# convert NumPy array to Tensor object
img_tensor = tf.convert_to_tensor(img_array)


Downloading data from https://jethac.github.io/assets/juice.jpg
251543/251543 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


## Generating the Image Caption

Finally, we'll use PaliGemma to generate a caption for our image. We'll provide the model with the image tensor and a prompt that instructs it to describe the image.

Since we're not using an instruction-tuned model, we need to manually remove the prompt from the model's output to get a clean caption.

In [8]:
# define prompt separately so we can measure its length later
prompt = "Caption the image:"

# pass images and prompts to paligemma
response = pali_gemma_lm.generate(
  {
    "images": [img_tensor],
    "prompts": [prompt]
  }
)

# we're not using an instruction-tuned model, so the output starts with the
# prompt; slice it off the front to keep only the caption
filtered = response[0][len(prompt):]
print(filtered)

A black and white cat sits comfortably on a black backpack, its eyes open and its paw resting on the bag. The cat's white fur and black nose are prominent features in the image. The backpack is open, revealing the cat's black and white paws and the black strap on the side. The cat's eyes are green, and its whiskers are white. The cat's head is tilted slightly towards the camera, and its ears are perked up. The cat's black and white coat is contrasted by its white chest and paws. The cat's eyes are bright and alert, and its nose is wrinkled in concentration.
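
The slicing used above assumes the output always begins with the exact prompt. A slightly more defensive variant, sketched below, removes the prompt only when it is actually present and trims surrounding whitespace. The `strip_prompt` helper is hypothetical, and the sample string is made up for illustration rather than taken from real model output.

```python
def strip_prompt(response, prompt):
    """Remove a leading prompt from generated text, if present, and trim whitespace."""
    if response.startswith(prompt):
        response = response[len(prompt):]
    return response.strip()


# Illustrative input only; real output comes from pali_gemma_lm.generate(...).
sample = "Caption the image: A cat sits on a backpack."
print(strip_prompt(sample, "Caption the image:"))
```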
