# Programmatic querying of large (vision) language models for research, with a focus on scientific images

> Author: Prateek Verma  
> Notebook created for Data Science Core Workshop Series, AIMRC @ U Arkansas  
> This Notebook is designed to run entirely in Google Colab.

Welcome to the Large (Vision) Language Models Workshop! In this workshop, we will learn about some state-of-the-art large language models and large vision language models, and how to use them programmatically. Like the previous workshops, this one is hands-on: we will go through a variety of models to perform tasks such as captioning, classification, and segmentation. We hope that after this workshop you will be able to come up with ideas for incorporating the ever-increasing power of LLMs into your own research. This workshop is beginner friendly.

## Before you begin

While I highly recommend using your own images during the exercises, you are welcome to (also) use the ones I am using. You can download them from our [GitHub repo](https://github.com/pv-is-nrt/aimrc-data-science-core/blob/main/workshops/vlms/sample_images.zip). Be sure to unzip the file after downloading.

## Exercise 1: Image Captioning with BLIP

Learn how to use the BLIP (Bootstrapped Language-Image Pretraining) model to automatically generate captions for images. Image captioning is the task of generating descriptive text for a given image, which can be used in various applications such as:

- Enhancing accessibility for visually impaired users.
- Automatic description generation for images on social media.
- Supporting content creators by generating captions for images quickly.

We will use a pre-trained version of the BLIP model from Hugging Face, which is designed to understand the visual content of an image and generate meaningful descriptions based on that content. You will also have the opportunity to experiment with adding prompts to influence the generated captions.

### 1. Install and import necessary libraries

The necessary libraries come preinstalled in Google Colab, but if you want to run this on your own computer or a server, you will need to install these libraries first. A good approach is to create a new virtual environment (whether Python or Conda) for each model.

In [None]:
# these should come preinstalled in a Google Colab environment
#!pip install transformers
#!pip install torch torchvision

In [None]:
# import necessary libraries
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import requests
import torch
from google.colab import files

### 2. Load the pre-trained BLIP model and processor from Hugging Face

This will temporarily download about 1 GB of data to your disk, including a copy of the pretrained language model. Always check a model's website or GitHub page to learn how to use it.

In [None]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

### 3. Create some functions

The first function uses Colab's `files` module (imported above) to let you upload an image from your computer. The second function relies on the `processor` and `model` objects created above.

In [None]:
# Function to upload and preprocess image
def upload_image():
    uploaded = files.upload()
    for fn in uploaded.keys():
        image = Image.open(fn).convert('RGB')
        return image

In [None]:
# Run the caption generation
def generate_caption(image):
    # Preprocess the image
    inputs = processor(images=image, return_tensors="pt")
    # Generate caption
    output = model.generate(**inputs, max_length=100)
    # Decode and return the generated caption
    caption = processor.decode(output[0], skip_special_tokens=True)
    return caption

### 4. Upload the image and generate caption

In [None]:
print("Please upload an image:")
image = upload_image()

In [None]:
# Generate and display the caption
caption = generate_caption(image)
print("Generated Caption:", caption)
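The introduction to this exercise mentions steering captions with a prompt. Here is a minimal sketch of that idea, assuming the captioning checkpoint loaded above: BLIP's processor accepts an optional `text` argument, and the model continues that text as the start of the caption. The function name `generate_prompted_caption` and its explicit `processor` and `model` parameters are illustrative choices, not part of the original notebook.

```python
def generate_prompted_caption(image, prompt, processor, model, max_length=100):
    """Caption `image`, asking BLIP to continue the text in `prompt`.

    `processor` and `model` are expected to be the BLIP captioning objects
    loaded earlier; they are passed in explicitly so the function does not
    depend on global variables.
    """
    # Preprocess the image together with the text prefix
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    # Generate token ids; the result typically begins with the prompt itself
    output = model.generate(**inputs, max_length=max_length)
    # Decode the first (and only) sequence back into a string
    return processor.decode(output[0], skip_special_tokens=True)
```

For example, `generate_prompted_caption(image, "a microscopy image of", processor, model)` should nudge the caption toward that phrasing. How well this works depends on the image and the prompt, so treat it as something to experiment with.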

## Very fun homework:


1. Modify the code so that, instead of asking the user to upload an image, the user can specify a path to an image stored on their Google Drive.
2. Use a for loop to loop through all images in a directory and generate captions for all of them.
3. Save a CSV file with the file path and caption for each image.
4. BLIP can also answer questions about an image: you send a prompt along with the image and generate an output. You will need the BLIP model variant that handles visual question answering (VQA) tasks (note that in this exercise you used the captioning variant). Try it out.
```Python
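  # Load the pre-trained BLIP model and processor for VQA from Hugging Face
  # (the VQA checkpoint needs the question-answering class, not the captioning one)
  from transformers import BlipProcessor, BlipForQuestionAnswering
  processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
  model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")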
  # BLIP with VQA prompt
  prompt = "Is this picture of outdoors or indoors?"
  # Preprocess the image and prompt for VQA mode
  inputs = processor(images=image, text=prompt, return_tensors="pt")
  # Generate the answer with a limited output length
  output = model.generate(**inputs, max_length=20)
  # Decode and return the generated answer
  response = processor.decode(output[0], skip_special_tokens=True)
  print(f"Generated Response: {response}")
```
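As one possible starting point for homework items 2 and 3, the sketch below loops over a directory and writes the captions to a CSV file. It is deliberately independent of BLIP: `caption_fn` is a hypothetical parameter standing in for any callable that maps a file path to a caption, and the list of image extensions is an assumption you may want to adjust.

```python
import csv
from pathlib import Path

# Assumed set of image file extensions; extend as needed
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}

def caption_directory(image_dir, caption_fn, csv_path):
    """Caption every image in `image_dir` and write the results to `csv_path`.

    `caption_fn` is any callable that takes a file path and returns a caption
    string, for example a small wrapper around `generate_caption`.
    """
    rows = []
    for path in sorted(Path(image_dir).iterdir()):
        # Skip sub-directories and files that are not images
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            rows.append((str(path), caption_fn(path)))

    # Write one row per image, preceded by a header line
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_path", "caption"])
        writer.writerows(rows)
    return rows
```

In the notebook, the wrapper could look like `lambda p: generate_caption(Image.open(p).convert("RGB"))`, passed as `caption_fn` together with the directory that holds your images.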



## Exercise 2: Image Classification with CLIP

In this exercise, you'll learn how to use the CLIP (Contrastive Language-Image Pretraining) model to classify images based on provided text labels. CLIP is a powerful model that aligns visual data (images) with textual data (text prompts). By comparing an image with multiple possible labels, CLIP determines which label best matches the content of the image.

You'll upload an image and provide a list of possible answers. CLIP will return the label that best describes the image, showcasing its ability to perform zero-shot classification without prior training on specific categories.
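To make the idea of "comparing an image with labels" concrete, here is a toy sketch. CLIP encodes the image and each label into vectors in a shared space and scores each pair by cosine similarity. The three-dimensional vectors below are made up for illustration; they are not real CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real CLIP embeddings have hundreds of dimensions
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "outdoors": [0.8, 0.2, 0.1],
    "indoors": [0.1, 0.9, 0.3],
}

# Score every label against the image and keep the most similar one
scores = {label: cosine_similarity(image_embedding, emb)
          for label, emb in label_embeddings.items()}
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

The label whose vector points in nearly the same direction as the image vector gets the highest score, which is the essence of the zero-shot classification used in this exercise.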

### 1. Install and import necessary libraries

We do not need to install the libraries again: they are already available in Google Colab and were taken care of in the previous exercise.

In [None]:
# Import necessary libraries
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
from google.colab import files

### 2. Load the pre-trained CLIP model and processor from Hugging Face

This will temporarily download about 1 GB of data (roughly 600 MB plus 400 MB) to your disk, including a copy of the pretrained language model.

In [None]:
# Load the CLIP model and processor from Hugging Face
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

### 3. Create some functions

The output generation function can differ greatly from model to model, and we usually have to refer to the maker's website to learn how to implement it. However, the general patterns are similar across language models, and you'll get the hang of them with experience.

In [None]:
# Function to upload and preprocess image
def upload_image():
    uploaded = files.upload()
    for fn in uploaded.keys():
        image = Image.open(fn).convert('RGB')
        return image

# Function to generate answers based on image and text prompts
def generate_answer(image, possible_answers):
    inputs = processor(text=possible_answers, images=image, return_tensors="pt", padding=True)
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)  # Convert logits to probabilities
    best_idx = probs.argmax().item()  # Get the index of the highest probability
    return possible_answers[best_idx]  # Return the answer with the highest score

### 4. Upload image, ask a question, provide possible answers, and generate the right answer

In the CLIP classification step, you upload an image and provide a list of possible answers for the model to compare against it. CLIP then chooses the best match based on visual-text alignment. Note that the `question` variable in the cell below is only there for the human reader; it is not passed to the model. This matters because CLIP doesn’t generate answers on its own but works by ranking pre-defined options.

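Beyond classification, the same image-text matching supports tasks such as visual search (retrieving the images that best match a text query). Because CLIP works zero-shot, it can generalize to unseen categories without additional training. CLIP does not generate captions by itself, although it can rank candidate captions for an image.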
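Since CLIP works by ranking pre-defined options, it is often useful to look at the whole ranking rather than only the winner. The sketch below reimplements the softmax step in plain Python and pairs each label with its probability. The logit values are hypothetical, chosen only to illustrate the computation.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(labels, logits):
    """Pair each label with its probability, highest first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical logits for one image, in the same order as the labels
ranking = rank_labels(["outdoors", "indoors", "unclear"], [24.1, 19.3, 17.8])
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

If the top two probabilities are close, the model is effectively undecided, which is worth knowing before trusting the label. Keep in mind that the probabilities are relative to the options you supplied: they always sum to 1, even if none of the labels fits the image.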

In [None]:
# Upload an image and ask a question
print("Please upload an image:")
image = upload_image()

# The question we want answered (for the human reader only; CLIP never sees it)
question = "Is this image taken indoors or outdoors?"

# Provide possible answers
possible_answers = ["outdoors", "indoors", "unclear"]
answer = generate_answer(image, possible_answers)

print(f"Generated Answer: {answer}")

## Exercise 3: Object isolation using Segment Anything Model

In this exercise, you will use the Segment Anything Model (SAM) to perform image segmentation, a task where objects within an image are identified and separated into different regions. SAM is a versatile model designed to handle a wide range of images without requiring task-specific training. You'll upload an image, and SAM will automatically generate segmentation masks that highlight the various objects present in the image.

By the end of this exercise, you will understand how SAM processes images and generates segmentation masks, making it a powerful tool for tasks like object detection, medical imaging, and visual analysis.

NOTE: I advise you to switch to a GPU kernel for this exercise.  
`Runtime` > `Change runtime type` > `T4 GPU`

### 1. Clone SAM repository and install dependencies

Rather than loading SAM through Hugging Face, we clone its official repository and install the `segment-anything` library from it so that we can import it.

In [None]:
!git clone https://github.com/facebookresearch/segment-anything.git
%cd segment-anything
!pip install -e .

### 2. Install and import other necessary libraries

In [None]:
# Install additional dependencies (these too should already be installed in Colab)
# !pip install opencv-python-headless matplotlib

In [None]:
import torch
import matplotlib.pyplot as plt
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from google.colab import files

### 3. Download the pretrained SAM model

Then tell SAM to use this checkpoint. Note that the ViT-H checkpoint downloaded here is large, roughly 2.5 GB.

In [None]:
# Download the pre-trained SAM model
!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -O sam_vit_h.pth

In [None]:
# check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the SAM checkpoint and move the model to the selected device
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

### 4. Upload image and run SAM

We create a function for uploading an image, just like before. Because SAM runs slower on larger images, we also include code to automatically resize the image. We also copy a function from SAM's website that helps us show an overlay of the generated masks on top of the input image.
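The resize arithmetic mentioned above can be checked on its own, without OpenCV. This is a minimal sketch; the helper name `scaled_size` is hypothetical and not part of the notebook.

```python
def scaled_size(width, height, target_width=512):
    """Return (new_width, new_height) for a downscale that keeps the aspect ratio.

    Images that are already at most `target_width` wide are left unchanged.
    """
    if width <= target_width:
        return width, height
    aspect_ratio = height / width
    return target_width, int(target_width * aspect_ratio)

print(scaled_size(2048, 1536))  # (512, 384)
print(scaled_size(400, 300))    # (400, 300)
```

One detail that is easy to get wrong: `cv2.resize` expects the target size as `(width, height)`, whereas `image.shape` reports `(height, width, channels)`.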

In [None]:
# Upload and resize the image if needed
def upload_and_resize_image(target_width=512):
    uploaded = files.upload()
    for fn in uploaded.keys():
        image = cv2.imread(fn)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # Convert to RGB for correct visualization
        # Get the original dimensions of the image
        height, width = image.shape[:2]
        if width > target_width:
            # Calculate the new height while maintaining the aspect ratio
            aspect_ratio = height / width
            new_height = int(target_width * aspect_ratio)
            # Resize the image while maintaining the aspect ratio
            image = cv2.resize(image, (target_width, new_height))
        return image

# Show mask overlays on top of images (copied from SAM's website)
def show_anns(anns):
    if len(anns) == 0:
        return
    sorted_anns = sorted(anns, key=(lambda x: x['area']), reverse=True)
    ax = plt.gca()
    ax.set_autoscale_on(False)

    img = np.ones((sorted_anns[0]['segmentation'].shape[0], sorted_anns[0]['segmentation'].shape[1], 4))
    img[:,:,3] = 0
    for ann in sorted_anns:
        m = ann['segmentation']
        color_mask = np.concatenate([np.random.random(3), [0.35]])
        img[m] = color_mask
    ax.imshow(img)

Generating masks can take up to 10 minutes on a CPU and about 20 s on a GPU runtime.

In [None]:
print("Please upload an image:")
image = upload_and_resize_image()

# Use SAM to generate segmentation masks
print("Generating segmentation mask...")
masks = mask_generator.generate(image)

In [None]:
# Display the original image and segmentation mask
plt.figure()
plt.imshow(image)
show_anns(masks)
plt.axis('off')
plt.show()

### 5. Some analysis on masks

You can analyze the masks with the help of the properties of the generated mask objects. Use SAM's website for reference. Here we access the area and bounding box properties of the masks.

In [None]:
plt.figure()
plt.imshow(image)
show_anns(masks)

image_area = image.shape[0] * image.shape[1]
print(f"Image area: {image_area}")

# add text with mask number on top of the image
for i in range(len(masks)):
    plt.text(
        masks[i]['bbox'][0] + masks[i]['bbox'][2]/2, # x-coordinate: x_min + width/2 (bbox is in XYWH format)
        masks[i]['bbox'][1] + masks[i]['bbox'][3]/2, # y-coordinate: y_min + height/2
        str(i),
        fontsize=8,
        color='white',
        va='center',
        ha='center',
        bbox=dict(boxstyle='round', facecolor='black', alpha=0.5)
    )

    # you can also print the fraction of the image area that the mask occupies
    mask_area = masks[i]['area']
    fraction_area = np.round(mask_area * 100 / image_area, 2)
    print("mask", str(i), "area:", mask_area, "\tfraction:", fraction_area)

plt.axis('off')
plt.show()
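Building on the `area` and `bbox` properties used above, the sketch below shows two small helpers you might find useful: one computes the center of a bounding box, and one keeps only masks within a chosen size range (for example, to discard tiny specks and the background). SAM reports each bounding box in XYWH format. The helper names and the default thresholds are hypothetical, and the toy dictionaries only mimic the two keys used here.

```python
def bbox_center(bbox):
    """Center (x, y) of a box given in XYWH format: [x_min, y_min, width, height]."""
    x, y, w, h = bbox
    return x + w / 2, y + h / 2

def filter_masks_by_area(masks, image_area, min_fraction=0.01, max_fraction=0.5):
    """Keep masks whose area is within the given fraction range of the image."""
    return [m for m in masks
            if min_fraction <= m["area"] / image_area <= max_fraction]

# Toy stand-ins for SAM's output dictionaries (only the keys used here)
toy_masks = [
    {"area": 50, "bbox": [0, 0, 10, 5]},        # too small: 0.5% of the image
    {"area": 4000, "bbox": [10, 20, 80, 50]},   # kept: 40% of the image
    {"area": 9000, "bbox": [0, 0, 100, 90]},    # too large: 90% of the image
]
kept = filter_masks_by_area(toy_masks, image_area=100 * 100)
centers = [bbox_center(m["bbox"]) for m in kept]
print(len(kept), centers)
```

With the real output, you would call `filter_masks_by_area(masks, image_area)` and pass the result to `show_anns` to overlay only the masks of interest.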