<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# <a id='toc1_'></a>[Build an Image Captioning System with IBM watsonx and Granite](#toc0_)


## <a id='Setup'></a>[Setup](#toc0_)

For this lab, you will be using the following libraries:


*   [`ibm-watsonx-ai`](https://pypi.org/project/ibm-watsonx-ai/): `ibm-watsonx-ai` is a Python library for working with the watsonx.ai service on IBM Cloud and IBM Cloud Pak for Data. Use it to train, test, and deploy your models as APIs for application development, and to share them with colleagues.

* `image`: installing `image` brings in Pillow, the fork of the Python Imaging Library (PIL), which provides easy-to-use methods for opening, manipulating, and saving image files in various formats. It’s commonly used for preprocessing images before feeding them into machine learning models or APIs.

* `requests`: `requests` is a simple and intuitive HTTP library for Python. It sends all kinds of HTTP/1.1 requests with methods like GET and POST. In this lab, it downloads images from the web for analysis by the multimodal AI model.


### [Installing required libraries](#toc0_)

The following required libraries are __not__ pre-installed in the Skills Network Labs environment. __You must run the following cell__ to install them. Please wait until it completes.

This step could take **several minutes**; please be patient.

**NOTE**: If you encounter any issues, please restart the kernel and run the cell again.  You can do that by clicking the **Restart the kernel** icon.

<img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/crvBKBOkg9aBzXZiwGEXbw/Restarting-the-Kernel.png" width="50%" alt="Restart kernel">


In [None]:
%%capture
%pip install ibm-watsonx-ai==1.1.20 image==1.5.33 requests==2.32.0

## <a id='toc1_6_'></a>[watsonx API credentials and project_id](#toc0_)


In [None]:
from ibm_watsonx_ai import Credentials, APIClient

credentials = Credentials(
    url="https://us-south.ml.cloud.ibm.com",
)

project_id = "skills-network"
client = APIClient(credentials)

# Get the enum of available text models
client.foundation_models.TextModels

# Print the available models and their IDs
client.foundation_models.TextModels.show()
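
Outside the Skills Network environment, the watsonx.ai service generally expects an API key in addition to the URL. The following is a minimal sketch of reading these settings from environment variables. The helper `load_watsonx_settings` and the variable names `WATSONX_URL`, `WATSONX_APIKEY`, and `WATSONX_PROJECT_ID` are assumptions chosen for illustration; they are not part of this lab.

```python
import os

def load_watsonx_settings(env=os.environ):
    """Collect connection settings from environment variables.

    The variable names are hypothetical; adapt them to your own setup.
    """
    settings = {
        "url": env.get("WATSONX_URL", "https://us-south.ml.cloud.ibm.com"),
        "project_id": env.get("WATSONX_PROJECT_ID", "skills-network"),
    }
    api_key = env.get("WATSONX_APIKEY")
    if api_key:
        # Only include the key when one is actually configured
        settings["api_key"] = api_key
    return settings

settings = load_watsonx_settings()
# The values could then be passed on, for example:
# credentials = Credentials(url=settings["url"], api_key=settings.get("api_key"))
```

Keeping the key in the environment rather than in the notebook avoids committing secrets alongside the code.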

## <a id='Image-preparation'></a>[Image preparation](#toc0_)

- Define the image URLs
- Display the images


In [None]:
url_image_1 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/5uo16pKhdB1f2Vz7H8Utkg/image-1.png'
url_image_2 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/fsuegY1q_OxKIxNhf6zeYg/image-2.png'
url_image_3 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KCh_pM9BVWq_ZdzIBIA9Fw/image-3.png'
url_image_4 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VaaYLw52RaykwrE3jpFv7g/image-4.png'

image_urls = [url_image_1, url_image_2, url_image_3, url_image_4]

To gain a better understanding of our data input, let's display the images.


![Image 1](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/5uo16pKhdB1f2Vz7H8Utkg/image-1.png)<figcaption>Image 1</figcaption>

![Image 2](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/fsuegY1q_OxKIxNhf6zeYg/image-2.png)<figcaption>Image 2</figcaption>

![Image 3](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/KCh_pM9BVWq_ZdzIBIA9Fw/image-3.png)<figcaption>Image 3</figcaption>

![Image 4](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/VaaYLw52RaykwrE3jpFv7g/image-4.png)<figcaption>Image 4</figcaption>


## <a id='Work-with-large-language-models-on-watsonx.ai'></a>[Work with large language models on watsonx.ai](#toc0_)

Specify the `model_id` of the model that you will use to chat about images.


In [None]:
model_id = 'ibm/granite-vision-3-2-2b'

### <a id='toc1_8_1_'></a>[Check the model parameters](#toc0_)

In [None]:
from ibm_watsonx_ai.foundation_models.schema import TextChatParameters

TextChatParameters.show()

params = TextChatParameters(
    temperature=0.2,
    top_p=0.5,

)

params

## <a id='toc1_9_'></a>[Initialize the model](#toc0_)


Initialize the `ModelInference` class with the previously specified parameters.


In [None]:
from ibm_watsonx_ai.foundation_models import ModelInference

model = ModelInference(
    model_id=model_id,
    credentials=credentials,
    project_id=project_id,
    params=params
)

## <a id='toc1_10_'></a>[Encode the image](#toc0_)

Encode the image as a Base64 string using `base64.b64encode`. Why is this step needed? JSON is a text-based format and does not support binary data. Encoding the image as a Base64 string lets you embed the image data directly within the JSON structure.
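
As a small illustration of this point, the sketch below uses a few stand-in bytes instead of a real image (the sample data is made up for the example). It shows that raw bytes cannot be serialized to JSON, while their Base64 form can, and that decoding recovers the original bytes.

```python
import base64
import json

# Stand-in for downloaded image bytes (the first 8 bytes are the PNG signature)
raw_bytes = b"\x89PNG\r\n\x1a\n" + bytes(range(16))

# json.dumps cannot serialize raw bytes directly
try:
    json.dumps({"image": raw_bytes})
    bytes_are_serializable = True
except TypeError:
    bytes_are_serializable = False

# Base64 turns the bytes into plain ASCII text that JSON accepts
encoded = base64.b64encode(raw_bytes).decode("utf-8")
payload = json.dumps({"image": encoded})

# Decoding recovers the original bytes exactly
decoded = base64.b64decode(json.loads(payload)["image"])

print(bytes_are_serializable)        # False
print(decoded == raw_bytes)          # True
print(len(raw_bytes), len(encoded))  # 24 32
```

The last line shows the cost of this approach: Base64 output is about one third larger than the binary input.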


In [None]:
import base64
import requests

def encode_images_to_base64(image_urls):
    """
    Downloads and encodes a list of image URLs to base64 strings.

    Parameters:
    - image_urls (list): A list of image URLs.

    Returns:
    - list: Base64-encoded image strings, in the same order as image_urls.
      An entry is None when the download fails.
    """
    encoded_images = []
    for url in image_urls:
        # A timeout prevents the cell from hanging on an unresponsive server
        response = requests.get(url, timeout=30)
        if response.status_code == 200:
            # b64encode returns bytes; decode to str so it can be embedded in JSON
            encoded_image = base64.b64encode(response.content).decode("utf-8")
            encoded_images.append(encoded_image)
        else:
            print(f"Warning: Failed to fetch image from {url} (Status code: {response.status_code})")
            encoded_images.append(None)
    return encoded_images

In [None]:
encoded_images = encode_images_to_base64(image_urls)
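
Because `encode_images_to_base64` appends `None` for any URL that fails to download, it can help to filter the results before sending them to the model. The following is a minimal sketch; `select_valid_images` is a hypothetical helper and the sample data is illustrative only.

```python
def select_valid_images(image_urls, encoded_images):
    """Pair each URL with its encoded image, dropping failed downloads (None)."""
    return [
        (url, encoded)
        for url, encoded in zip(image_urls, encoded_images)
        if encoded is not None
    ]

# Illustrative data only: the second download is assumed to have failed
example_urls = ["https://example.com/a.png", "https://example.com/b.png"]
example_encoded = ["aGVsbG8=", None]

valid = select_valid_images(example_urls, example_encoded)
print(valid)  # [('https://example.com/a.png', 'aGVsbG8=')]
```

Keeping the URL next to each encoded string makes it easier to report which image a later response belongs to.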

## <a id='Multimodal-inference-function'></a>[Multimodal inference function](#toc0_)

Next, define a function to generate responses from the model.

The `generate_model_response` function is designed to interact with a multimodal AI model that accepts both text and image inputs. This function takes an image, along with a user’s query, and generates a response from the model.


In [None]:
def generate_model_response(encoded_image, user_query, assistant_prompt="You are a helpful assistant. Answer the following user query in 1 or 2 sentences: "):
    """
    Sends an image and a query to the model and retrieves the description or answer.

    Parameters:
    - encoded_image (str): Base64-encoded image string.
    - user_query (str): The user's question about the image.
    - assistant_prompt (str): Optional prompt to guide the model's response.

    Returns:
    - str: The model's response for the given image and query.
    """

    # Create the messages object
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": assistant_prompt + user_query
                },
                {
                    "type": "image_url",
                    "image_url": {
                        # The lab images are PNG files, so declare the matching MIME type
                        "url": "data:image/png;base64," + encoded_image,
                    }
                }
            ]
        }
    ]

    # Send the request to the model
    response = model.chat(messages=messages)

    # Return the model's response
    return response['choices'][0]['message']['content']
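
The image part of the message is a data URI: a MIME type followed by the Base64 payload. The sketch below shows one possible way to choose the MIME type from the image's leading bytes instead of hard-coding it. The helpers `guess_mime_type` and `to_data_uri` are hypothetical and not part of the lab, and whether the service relies on the declared MIME type is not something this lab states.

```python
import base64

# Leading "magic" bytes for two common image formats
_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
}

def guess_mime_type(image_bytes, default="application/octet-stream"):
    """Return a MIME type based on the first bytes of the image."""
    for signature, mime_type in _SIGNATURES.items():
        if image_bytes.startswith(signature):
            return mime_type
    return default

def to_data_uri(encoded_image):
    """Build a data URI from a Base64 string, inspecting only its header."""
    # 12 Base64 characters decode to 9 bytes, enough for both signatures
    header = base64.b64decode(encoded_image[:12])
    return f"data:{guess_mime_type(header)};base64,{encoded_image}"

# Stand-in payloads that carry only the format signatures
png_like = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("utf-8")
jpeg_like = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("utf-8")

print(to_data_uri(png_like).split(",")[0])   # data:image/png;base64
print(to_data_uri(jpeg_like).split(",")[0])  # data:image/jpeg;base64
```

A helper like `to_data_uri` could replace the string concatenation inside `generate_model_response` if the inputs are not all in one format.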

## <a id='Exercises'></a>[Exercises](#toc0_)

Now, let's practice by exploring some other capabilities of this model. Try asking "How much cholesterol is in this product?" about the 4th image.


In [None]:
image = encoded_images[3]
user_query = "How much cholesterol is in this product?"
print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))

Try asking "What is the color of the woman's jacket?" about the 2nd image.


In [None]:
image = encoded_images[1]
user_query = "What is the color of the woman's jacket?"
print("User Query: ", user_query)
print("Model Response: ", generate_model_response(image, user_query))
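
The same pattern extends to asking one question about every image, which is the core of a captioning loop. The following is a minimal sketch: `caption_all` is a hypothetical helper that takes the response function as a parameter, so it can be tried offline with a stub. In this notebook, you would pass `generate_model_response` instead of the stub.

```python
def caption_all(encoded_images, user_query, respond):
    """Ask the same question about each image.

    Entries that failed to download (None) are recorded as None.
    """
    results = {}
    for index, encoded_image in enumerate(encoded_images, start=1):
        if encoded_image is None:
            results[index] = None
            continue
        results[index] = respond(encoded_image, user_query)
    return results

# Stub standing in for generate_model_response so the sketch runs offline
def fake_respond(encoded_image, user_query):
    return f"({len(encoded_image)} chars) {user_query}"

captions = caption_all(["QUJD", None, "QUJDRA=="], "Describe this image.", fake_respond)
for index, caption in captions.items():
    print(f"Image {index}: {caption}")
```

With the real model, the call would look like `caption_all(encoded_images, "Describe this image.", generate_model_response)`.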

Copyright © IBM Corporation. All rights reserved.
