## Image Captioning using BLIP

- BLIP uses a Vision Transformer (ViT) to encode the image.

- It uses a Transformer text decoder to generate the caption, attending to the image features through cross-attention.

- In conditional mode, the text prompt is fed to the decoder as a prefix, so the generated caption continues the prompt.

- In unconditional mode, no prompt is given and the decoder starts from scratch (free captioning).
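
The difference between the two modes can be illustrated with a toy decoder. This is only a sketch: the hand-written `NEXT_TOKEN` table and the `greedy_caption` helper are hypothetical stand-ins for the real model, which also conditions every step on the image features. It shows only how a prompt seeds generation as a prefix, while the unconditional case starts from the begin-of-sequence token alone.

```python
# Toy illustration only: a hand-written next-token table stands in for the
# real BLIP decoder, which also attends to image features at every step.
NEXT_TOKEN = {
    "[BOS]": "a",
    "a": "woman",
    "woman": "with",
    "with": "her",
    "her": "dog",
    "dog": "[EOS]",
    "photography": "of",
    "of": "a",
}


def greedy_caption(prompt=None, max_new_tokens=10):
    """Greedy decoding; an optional prompt seeds the decoder as a prefix."""
    tokens = ["[BOS]"] + (prompt.split() if prompt else [])
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "[EOS]")
        if next_token == "[EOS]":
            break
        tokens.append(next_token)
    return " ".join(tokens[1:])  # Drop the [BOS] marker


print(greedy_caption("a photography of"))  # conditional: continues the prompt
print(greedy_caption())                    # unconditional: starts from [BOS]
```

In both calls the decoding loop is identical; only the starting tokens differ.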

In [None]:
'''
BLIP captioning example, taken as-is from the Hugging Face model card:
https://huggingface.co/Salesforce/blip-image-captioning-base
'''

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog


a photography of a woman and her dog on the beach
a woman sitting on the beach with her dog


In [None]:
'''
Save the pretrained processor files locally so that they do not need to be downloaded every time
'''
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
processor.save_pretrained("./SavedModels/BLIP")

[]

In [None]:
'''
Save the pretrained weights of the image captioning model locally so that they do not need to be downloaded every time
'''
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
model.save_pretrained("./SavedModels/BLIP/CondiGen")
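
One way to avoid re-downloading is to prefer the saved folder when it exists and fall back to the Hub ID otherwise. The sketch below is a minimal helper under stated assumptions: `resolve_source` is a hypothetical name, and the `config.json` marker is an assumption based on the file that `save_pretrained` writes for models; the processor folder uses other file names, so check the saved folder and pass a different `marker` there.

```python
from pathlib import Path


def resolve_source(local_dir, hub_id, marker="config.json"):
    """Return local_dir if it contains the marker file, else the Hub model ID.

    The marker file name is an assumption; adjust it to whatever
    save_pretrained actually wrote into your folder.
    """
    path = Path(local_dir)
    if (path / marker).is_file():
        return str(path)
    return hub_id


HUB_ID = "Salesforce/blip-image-captioning-base"
source = resolve_source("./SavedModels/BLIP/CondiGen", HUB_ID)
print(source)  # Local folder if already saved, otherwise the Hub ID
```

The returned string can be passed directly to `BlipForConditionalGeneration.from_pretrained`, which accepts either a local directory or a Hub model ID.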

In [10]:
import os

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("./SavedModels/BLIP")  # Load the saved processor
model = BlipForConditionalGeneration.from_pretrained("./SavedModels/BLIP/CondiGen").to("cuda")  # Load the saved model

folder_dir = "./InputImages/"  # Folder containing the input images

# Prompt for conditional image captioning; the model continues this text
text = "a good captioning"

with open('image_captioned.txt', 'w') as wf:
    for image in sorted(os.listdir(folder_dir)):
        if image.lower().endswith((".jpg", ".png")):  # Only process image files (case-insensitive)
            raw_image = Image.open(os.path.join(folder_dir, image)).convert('RGB')  # Open the image as RGB
            print(f"Processing {image}...")  # Name of the image being processed

            inputs = processor(raw_image, text, return_tensors="pt").to("cuda")  # Preprocess image and prompt
            out = model.generate(**inputs)  # Generate the caption token IDs
            caption = processor.decode(out[0], skip_special_tokens=True)  # Decode once and reuse
            print(caption)
            wf.write(f"{image}: {caption}\n")


Processing 20240622_164021.jpg...
a good captioning picture of a man and a woman
Processing dennis.jpg...
a good captioning view of the beach and the fog
Processing keyur_borad.jpg...
a good captioning photo of a man ' s face
Processing mine.jpg...
a good captioning picture of a man standing on a bridge
Processing my_graduation.jpg...
a good captioning man in a graduation gown
Processing risha.jpg...
a good captioning man
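
Every caption above begins with the prompt, because in conditional mode the model continues the prompt text literally. If only the generated part is wanted, the prompt can be removed afterwards. The following is a minimal sketch; `strip_prompt` is a hypothetical helper, and the case-insensitive comparison is an assumption made because the decoded captions here come out in lowercase.

```python
def strip_prompt(caption, prompt):
    """Remove a leading prompt from a generated caption, if present."""
    # Compare case-insensitively, since decoded captions may be lowercased.
    if prompt and caption.lower().startswith(prompt.lower()):
        return caption[len(prompt):].strip()
    return caption


prompt = "a good captioning"
print(strip_prompt("a good captioning picture of a man and a woman", prompt))
# prints: picture of a man and a woman
```

A prompt that reads as the natural start of a caption, such as the "a photography of" example earlier, leaves less to clean up than a prompt phrased as an instruction.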


In [21]:
import os