# Using BLIP for Image-to-Text and Visual Question Answering

This notebook shows how to use the BLIP model for image-to-text (image captioning) and visual question answering. The code is taken from here: https://huggingface.co/Salesforce/blip-image-captioning-large

## Requirements

### For running on Google Colab

To run this notebook on Google Colab, you only need to install `transformers`:

`!pip install transformers`

In [1]:
!pip install transformers



### For running on your local machine

To run it on your local machine, I suggest creating a conda environment with Python 3.10.12 and then installing the following dependencies into it:

`pip install transformers==4.35.2` \
`pip install Pillow==9.4.0` \
`pip install torch==2.1.0`
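
As a quick sanity check after installing, the snippet below compares what is installed against the versions pinned above. It is a minimal sketch using only the standard library; the names `PINNED` and `check_versions` are made up for this example and are not part of any of the packages.

```python
from importlib import metadata

# Versions suggested above for a local environment.
PINNED = {"transformers": "4.35.2", "Pillow": "9.4.0", "torch": "2.1.0"}


def check_versions(pinned):
    """Return {package: (installed_version_or_None, expected_version)}."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, expected)
    return report


for name, (installed, expected) in check_versions(PINNED).items():
    print(f"{name}: installed={installed}, expected={expected}")
```

An installed `torch` version may carry a build suffix (for example a CUDA tag), so the two strings are printed side by side rather than compared for strict equality.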

## Image-to-Text

In the code cell below, set `img_url` to a link to any image you want to send to the model. (I instead used the commented-out lines to load a picture from my Google Drive.)
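
The two ways of getting an image (a URL or a file on Google Drive) could be folded into one helper, sketched below. The function names `is_url` and `load_rgb_image` are hypothetical and chosen for this example; the 30-second timeout is likewise an assumption, not something the original code sets.

```python
from io import BytesIO
from urllib.parse import urlparse


def is_url(source: str) -> bool:
    """True when `source` looks like an http(s) URL rather than a file path."""
    return urlparse(source).scheme in ("http", "https")


def load_rgb_image(source: str):
    """Open an image from a URL or a local path and convert it to RGB."""
    # Imported lazily so that is_url() works even without these packages.
    import requests
    from PIL import Image

    if is_url(source):
        response = requests.get(source, timeout=30)
        response.raise_for_status()  # fail early on 404 and similar errors
        return Image.open(BytesIO(response.content)).convert("RGB")
    return Image.open(source).convert("RGB")
```

With this in place, `raw_image = load_rgb_image(img_url)` and `raw_image = load_rgb_image('/content/drive/MyDrive/kitchen.png')` would both yield an RGB image that can be passed to the processor.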

### GPU Version

In [6]:
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

#png_image_path = '/content/drive/MyDrive/kitchen.png' # path to image on Google Drive
#png_image = Image.open(png_image_path)
#raw_image = png_image.convert('RGB')

# conditional image captioning
text = "an image of "
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(f"\n\nOutput of conditional image captioning:\n\n{processor.decode(out[0], skip_special_tokens=True)}\n\n")
# e.g. an image of a woman sitting on the beach with a dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(f"\n\nOutput of unconditional image captioning:\n\n{processor.decode(out[0], skip_special_tokens=True)}")



Output of conditional image captioning:

an image of a woman sitting on the beach with a dog




Output of unconditional image captioning:

woman sitting on the beach with her dog and a cell phone


### CPU Version

In [3]:
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

#png_image_path = '/content/drive/MyDrive/kitchen.png' # path to image on Google Drive
#png_image = Image.open(png_image_path)
#raw_image = png_image.convert('RGB')

# conditional image captioning
text = "an image of "
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(f"\nOutput of conditional image captioning:\n\n{processor.decode(out[0], skip_special_tokens=True)}\n\n")
# e.g. an image of a woman sitting on the beach with a dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(f"Output of unconditional image captioning:\n\n{processor.decode(out[0], skip_special_tokens=True)}")


Output of conditional image captioning:

an image of a woman sitting on the beach with a dog


Output of unconditional image captioning:

woman sitting on the beach with her dog and a cell phone
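
The GPU and CPU cells above differ only in the device and the dtype, so they could be merged into a single cell that picks both at runtime. The sketch below assumes that behaviour is wanted; `pick_device_and_dtype` and `load_captioner` are illustrative names, not part of `transformers`.

```python
def pick_device_and_dtype(cuda_available: bool):
    """Mirror the two cells above: float16 on GPU, float32 on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_captioner(model_id="Salesforce/blip-image-captioning-large"):
    """Load the processor and model on whichever device is available."""
    # Imported lazily so the helper above can be used without torch installed.
    import torch
    from transformers import BlipProcessor, BlipForConditionalGeneration

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    dtype = getattr(torch, dtype_name)  # e.g. torch.float16
    processor = BlipProcessor.from_pretrained(model_id)
    model = BlipForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=dtype
    ).to(device)
    return processor, model, device, dtype
```

After `processor, model, device, dtype = load_captioner()`, the inputs would be moved the same way as in the GPU cell: `processor(raw_image, text, return_tensors="pt").to(device, dtype)`.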


## Visual Question Answering

### GPU Version

In [4]:
import requests
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

#png_image_path = '/content/drive/MyDrive/plant.png' # path to image on Google Drive
#png_image = Image.open(png_image_path)
#raw_image = png_image.convert('RGB')

question = "What animal can you see in the picture?"
inputs = processor(raw_image, question, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(f"\n\nOutput of visual question answering: {processor.decode(out[0], skip_special_tokens=True)}\n")



Output of visual question answering: dog



### CPU Version

In [5]:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

#png_image_path = '/content/drive/MyDrive/plant.png' # path to image on Google Drive
#png_image = Image.open(png_image_path)
#raw_image = png_image.convert('RGB')

question = "What animal can you see in the picture?"
inputs = processor(raw_image, question, return_tensors="pt")

out = model.generate(**inputs)
print(f"\n\nOutput of visual question answering: {processor.decode(out[0], skip_special_tokens=True)}\n")



Output of visual question answering: dog
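
To ask several questions about the same image, the CPU cell above can be wrapped in a small loop. This is a minimal sketch: `answer_questions` is a hypothetical helper, and it expects the `processor` and `model` objects created in the CPU cell to be passed in.

```python
def answer_questions(image, questions, processor, model):
    """Return {question: answer}, running one generate() call per question."""
    answers = {}
    for question in questions:
        # Same three steps as the CPU cell: preprocess, generate, decode.
        inputs = processor(image, question, return_tensors="pt")
        out = model.generate(**inputs)
        answers[question] = processor.decode(out[0], skip_special_tokens=True)
    return answers
```

For example, `answer_questions(raw_image, ["What animal can you see in the picture?"], processor, model)` would return a dictionary with one entry per question. For the GPU version, the inputs would additionally need `.to("cuda", torch.float16)` as in the GPU cell.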

