# **Image Captioning with Different Models in Google Colab**

This notebook generates image captions with several models in Google Colab so that you can compare their outputs, analyze how they differ, and evaluate caption quality. The input images come from the Hugging Face dataset `Hagarsh/OpenCHAIR_verb_exp_2`. The code runs on a GPU when one is available, which speeds up processing.

In [None]:
!pip install datasets
!pip install transformers==4.44.2

# BLIP2

In [2]:
import torch
from tqdm import tqdm  # For displaying progress bars
import pandas as pd  # For saving the output in a CSV file
from transformers import Blip2Processor, Blip2ForConditionalGeneration  # BLIP-2 processor and model
from datasets import load_dataset  # For loading the dataset

In [3]:
# Define the run parameters directly in the notebook
class Args:
    # Parameters (model checkpoint, prompt, device, batch size, etc.)
    model_ckpt = "Salesforce/blip2-flan-t5-xl"  # Model checkpoint
    prompt = "a photography of "  # Text prompt supplied to the model alongside each image
    device = 'cuda' if torch.cuda.is_available() else 'cpu'  # Use the GPU (CUDA) if available, otherwise the CPU
    dtype = torch.float16 if device == 'cuda' else torch.float32  # Half precision on GPU, full precision on CPU
    batch_size = 100  # Number of images to process in each batch
    num_beams = 1  # Number of beams for generation; 1 disables beam search
    cache_dir = './model_cache'  # Directory for cached model and dataset files
    output_dir = "./out.csv"  # Path of the output CSV file (a file path, despite the name)
    max_new_tokens = 50  # Maximum number of new tokens to generate per caption
    num_images_to_caption = 400  # Upper limit on how many images to caption

args = Args()  # Create an instance of the `Args` class to hold these parameter values

def generate(och_dataset, args):
    # Load the processor and the model
    processor = Blip2Processor.from_pretrained(args.model_ckpt, cache_dir=args.cache_dir)
    model = Blip2ForConditionalGeneration.from_pretrained(
        args.model_ckpt, torch_dtype=args.dtype, cache_dir=args.cache_dir
    )
    model.to(args.device)

    generated_captions = []

    # Cap the number of images at the requested limit
    total_images = min(len(och_dataset), args.num_images_to_caption)

    # Generate captions in batches; the progress bar counts images, not batches
    with tqdm(total=total_images) as pbar:
        for i in range(0, total_images, args.batch_size):
            # Clamp the slice so the last batch never runs past total_images
            batch_data = och_dataset[i:min(i + args.batch_size, total_images)]
            images = batch_data['image']

            # One copy of the prompt per image in the batch
            inputs = processor(text=[args.prompt] * len(images),
                               images=images, return_tensors="pt",
                               padding=True, truncation=True).to(args.device, args.dtype)

            with torch.no_grad():
                generated_ids = model.generate(**inputs, num_beams=args.num_beams,
                                               max_new_tokens=args.max_new_tokens)
            generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
            generated_captions += generated_texts
            pbar.update(len(images))  # Advance the progress bar by the number of images processed

    return generated_captions

def run(args):
    print("Loading Dataset\n")
    och_dataset = load_dataset("Hagarsh/OpenCHAIR_verb_exp_2", cache_dir=args.cache_dir)['test']

    print("\nGenerating Captions\n")
    generated_captions = generate(och_dataset, args)

    # Save the generated captions to a CSV file
    df = pd.DataFrame()
    df['generated_caption'] = generated_captions
    df.to_csv(args.output_dir)
    print(f"Captions saved to {args.output_dir}")

# Now run the function
run(args)


Loading Dataset



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/328 [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/234M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/1377 [00:00<?, ? examples/s]


Generating Captions



preprocessor_config.json:   0%|          | 0.00/432 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]



config.json:   0%|          | 0.00/7.68k [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/133k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.44G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/6.33G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 400/400 [00:34<00:00, 11.53it/s]

Captions saved to ./out.csv
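
The generation loop above walks the dataset in slices of `batch_size`. The index arithmetic can be sketched on its own, without loading a model. `batch_ranges` is a hypothetical helper introduced here for illustration, not part of the notebook; it clamps the final slice so that no more than the requested number of images is covered, even when that number is not a multiple of the batch size.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in chunks of batch_size."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # Clamp the end of the last chunk to `total`
        yield start, min(start + batch_size, total)


# Example: 250 images in batches of 100 gives two full batches and one partial batch
print(list(batch_ranges(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Each `(start, stop)` pair could then be used as `och_dataset[start:stop]` inside the loop.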





# BLIP-L

In [5]:
import torch
from tqdm import tqdm  # For displaying progress bars
import pandas as pd  # For saving the output in a CSV file
from transformers import BlipProcessor, BlipForConditionalGeneration  # BLIP processor and model for BLIP-L
from datasets import load_dataset  # For loading the dataset

In [6]:
# Define the run parameters directly in the notebook
class Args:
    # Parameters (model checkpoint, prompt, device, batch size, etc.)
    model_ckpt = "Salesforce/blip-image-captioning-large"  # BLIP-Large model checkpoint
    prompt = "a photography of "  # Text prompt that conditions each generated caption
    device = 'cuda' if torch.cuda.is_available() else 'cpu'  # Use the GPU (CUDA) if available, otherwise the CPU
    dtype = torch.float16 if device == 'cuda' else torch.float32  # Half precision on GPU, full precision on CPU
    batch_size = 100  # Number of images to process in each batch
    num_beams = 1  # Number of beams for generation; 1 disables beam search
    cache_dir = './model_cache'  # Local directory for cached model and dataset files
    output_dir = "./out_blip_l.csv"  # Path of the output CSV file (a file path, despite the name)
    max_new_tokens = 50  # Maximum number of new tokens to generate per caption
    num_images_to_caption = 400  # Upper limit on how many images to caption

args = Args()  # Create an instance of the `Args` class to hold these parameter values

def generate(och_dataset, args):
    # Load the BLIP processor, which tokenizes text and preprocesses images
    processor = BlipProcessor.from_pretrained(args.model_ckpt, cache_dir=args.cache_dir)
    # Load the BLIP-L model in the precision chosen above
    model = BlipForConditionalGeneration.from_pretrained(
        args.model_ckpt, torch_dtype=args.dtype, cache_dir=args.cache_dir
    )
    model.to(args.device)  # Move the model to the selected device (GPU or CPU)

    generated_captions = []  # List to store generated captions

    # Cap the number of images at the requested limit
    total_images = min(len(och_dataset), args.num_images_to_caption)

    # Generate captions in batches; the progress bar counts images, not batches
    with tqdm(total=total_images) as pbar:
        for i in range(0, total_images, args.batch_size):
            # Clamp the slice so the last batch never runs past total_images
            batch_data = och_dataset[i:min(i + args.batch_size, total_images)]
            images = batch_data['image']

            # Preprocess the images with one copy of the prompt per image,
            # return PyTorch tensors, and move them to the device in the chosen precision
            inputs = processor(text=[args.prompt] * len(images),
                               images=images, return_tensors="pt",
                               padding=True, truncation=True).to(args.device, args.dtype)

            with torch.no_grad():  # Disable gradient tracking for inference (faster, less memory)
                generated_ids = model.generate(**inputs, num_beams=args.num_beams,
                                               max_new_tokens=args.max_new_tokens)
            generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)  # Decode token IDs into text captions
            generated_captions += generated_texts  # Append this batch's captions to the list
            pbar.update(len(images))  # Advance the progress bar by the number of images processed

    return generated_captions  # Return the list of generated captions

def run(args):
    print("Loading Dataset\n")
    # Load the dataset from Hugging Face. Replace this with your dataset path or name.
    och_dataset = load_dataset("Hagarsh/OpenCHAIR_verb_exp_2", cache_dir=args.cache_dir)['test']

    print("\nGenerating Captions\n")
    # Call the generate function to produce captions for the dataset
    generated_captions = generate(och_dataset, args)

    # Save the generated captions to a CSV file
    df = pd.DataFrame()  # Create a new dataframe
    df['generated_caption'] = generated_captions  # Add the generated captions to the dataframe
    df.to_csv(args.output_dir)  # Save the dataframe as a CSV file to the specified output directory
    print(f"Captions saved to {args.output_dir}")  # Notify that the CSV file has been saved

# Now run the function
run(args)  # Call the run function to execute the script


Loading Dataset


Generating Captions



preprocessor_config.json:   0%|          | 0.00/445 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/527 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/4.60k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.88G [00:00<?, ?B/s]

100%|██████████| 400/400 [00:39<00:00, 10.18it/s]

Captions saved to ./out_blip_l.csv
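
With conditional captioning, BLIP usually returns the prompt as the beginning of the decoded caption, so rows in `out_blip_l.csv` may start with "a photography of". When comparing against models that were not given a prompt, it can help to drop that prefix first. The following is a minimal sketch under that assumption; `strip_prompt` is a hypothetical helper, not part of the notebook or of `transformers`.

```python
def strip_prompt(caption, prompt):
    """Remove a leading prompt from a decoded caption.

    Matching ignores case and surrounding whitespace, and only strips when the
    prompt ends at a word boundary, so unrelated captions are left unchanged.
    """
    caption = caption.strip()
    prefix = prompt.strip()
    if not prefix or not caption.lower().startswith(prefix.lower()):
        return caption
    rest = caption[len(prefix):]
    # Require the prompt to be followed by whitespace or the end of the caption
    if rest and not rest[0].isspace():
        return caption
    return rest.lstrip()


print(strip_prompt("a photography of a dog on a beach", "a photography of "))  # a dog on a beach
```

Applied to a whole list, this would be `[strip_prompt(c, args.prompt) for c in generated_captions]`.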





# GIT BASE

In [2]:
import torch
from tqdm import tqdm  # For displaying progress bars
import pandas as pd  # For saving the output in a CSV file
from transformers import GitProcessor, GitForCausalLM  # GIT processor and model
from datasets import load_dataset  # For loading the dataset

In [3]:
# Define your parameters directly in the script
class Args:
    model_ckpt = "microsoft/git-base"  # GIT-B model checkpoint for image captioning
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # Use GPU if available, otherwise use CPU
    batch_size = 1  # Kept for consistency with the other models; the loop below captions one image at a time
    cache_dir = './model_cache'  # Cache directory for model files
    output_dir = "./out_git_b.csv"  # Output CSV file for saving captions
    max_length = 50  # Maximum number of tokens for the generated captions
    num_images_to_caption = 400  # New parameter: Limit how many images to caption

args = Args()  # Create an instance of the `Args` class to hold these parameter values

# Function to generate captions
def generate(och_dataset, args):
    # Load the processor and model
    processor = GitProcessor.from_pretrained(args.model_ckpt, cache_dir=args.cache_dir)
    model = GitForCausalLM.from_pretrained(args.model_ckpt, cache_dir=args.cache_dir).to(args.device)

    generated_captions = []  # List to store generated captions

    # Limit the dataset to the number of images specified
    total_images = min(len(och_dataset), args.num_images_to_caption)

    # Generate captions one image at a time, showing progress with a progress bar
    with tqdm(total=total_images) as pbar:
        for idx, data in enumerate(och_dataset):
            if idx >= total_images:  # Stop once the requested number of images is reached
                break

            # Preprocess the image; the dataset stores each image under the 'image' key
            image = data['image']
            inputs = processor(images=image, return_tensors="pt").to(args.device)

            # Generate the caption without tracking gradients (inference only)
            with torch.no_grad():
                output = model.generate(**inputs, max_length=args.max_length)

            # Decode the generated caption
            caption = processor.decode(output[0], skip_special_tokens=True)
            generated_captions.append(caption)  # Append the caption to the list

            pbar.update(1)  # Update the progress bar
    return generated_captions

# Function to run the script
def run(args):
    print("Loading Dataset\n")
    # Load the dataset from Hugging Face. Replace this with your dataset path or name.
    och_dataset = load_dataset("Hagarsh/OpenCHAIR_verb_exp_2", cache_dir=args.cache_dir)['test']

    print("\nGenerating Captions\n")
    # Call the generate function to produce captions for the dataset
    generated_captions = generate(och_dataset, args)

    # Save the generated captions to a CSV file
    df = pd.DataFrame()  # Create a new dataframe
    df['generated_caption'] = generated_captions  # Add the generated captions to the dataframe
    df.to_csv(args.output_dir)  # Save the dataframe as a CSV file to the specified output directory
    print(f"Captions saved to {args.output_dir}")  # Notify that the CSV file has been saved

# Now run the function
run(args)


Loading Dataset



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/328 [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/234M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/1377 [00:00<?, ? examples/s]


Generating Captions



preprocessor_config.json:   0%|          | 0.00/503 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/453 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/2.82k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/707M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/141 [00:00<?, ?B/s]

100%|██████████| 400/400 [01:20<00:00,  4.94it/s]


Captions saved to ./out_git_b.csv
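
To compare the three models as described in the introduction, the saved CSV files can be lined up row by row. The sketch below assumes each file keeps the `generated_caption` column and the row order written by the cells above; `load_captions` and `combine_captions` are hypothetical helpers, and the model labels are arbitrary. It uses only the standard-library `csv` module so that it runs independently of the earlier cells.

```python
import csv
import os


def load_captions(path, column="generated_caption"):
    """Read one caption column from a CSV file written by the cells above."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]


def combine_captions(paths):
    """Align captions from several models by row index.

    `paths` maps a model label to its CSV path. Rows beyond the shortest
    file are dropped so that every row holds one caption per model.
    """
    columns = {name: load_captions(path) for name, path in paths.items()}
    n_rows = min((len(caps) for caps in columns.values()), default=0)
    return [{name: caps[i] for name, caps in columns.items()} for i in range(n_rows)]


# Output paths used by the three sections of this notebook
paths = {
    "blip2": "./out.csv",
    "blip_l": "./out_blip_l.csv",
    "git_b": "./out_git_b.csv",
}

# Print the first few aligned rows only if all three files have been generated
if all(os.path.exists(p) for p in paths.values()):
    for row in combine_captions(paths)[:5]:
        print(row)
```

Each printed row is a dictionary with one caption per model for the same image, which makes differences between models easy to spot by eye.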
