<a href="https://colab.research.google.com/github/peaceful-1/peaceful-1/blob/main/VQA_using_BLIP_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Visual Question Answering With BLIP**

This notebook demonstrates Visual Question Answering (VQA) using BLIP (Bootstrapping Language-Image Pretraining), a model that can answer text-based questions about images.

**1. Install dependencies:** Install the Hugging Face transformers library, which provides pre-trained models and utilities for natural language processing (NLP) and computer vision tasks. The leading ! means the line is a shell command (not Python) executed inside the notebook environment.

In [1]:
!pip install transformers
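Once the install cell has run, it can help to confirm which version was actually picked up, since behavior can differ between releases. The following is a minimal sketch that uses only the Python standard library; `installed_version` is a hypothetical helper name introduced here, not part of transformers.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the install succeeded, otherwise None.
print("transformers:", installed_version("transformers"))
```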



**2. Suppress excessive logging:** Import Hugging Face’s logging utility and set the verbosity level to ERROR, so that only error messages are shown. This hides warnings and info messages and keeps the output clean.

In [2]:
from transformers.utils import logging
logging.set_verbosity("ERROR")
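Hugging Face's logging helper is built on Python's standard logging module, so the effect of an ERROR threshold can be illustrated with the standard library alone. The sketch below is only an illustration of how level filtering works; the logger name `vqa_demo` is made up, and the module is imported under an alias so it does not replace the `logging` name bound in the cell above.

```python
import logging as std_logging

# Hypothetical logger name, used only for this illustration.
demo_logger = std_logging.getLogger("vqa_demo")
demo_logger.setLevel("ERROR")  # level names are accepted as strings

# Messages below the threshold are dropped; ERROR and above pass.
print(demo_logger.isEnabledFor(std_logging.INFO))     # False
print(demo_logger.isEnabledFor(std_logging.WARNING))  # False
print(demo_logger.isEnabledFor(std_logging.ERROR))    # True
```

The transformers logging utility also offers `set_verbosity_error()` as a shorthand for the same setting.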

**3. Suppress specific warnings:** Import Python’s warnings module and ignore one specific warning about the default 'max_length' value. The warning is not critical to this notebook, and filtering it keeps the output clean.

In [3]:
import warnings
warnings.filterwarnings("ignore", message = "using the model agnostic default 'max_length'")
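The `message` argument of `warnings.filterwarnings` is a regular expression that is matched, case-insensitively, against the start of the warning text. A filter therefore only takes effect if its pattern agrees with the beginning of the real message, punctuation included. The sketch below demonstrates this with a made-up warning text; `still_emitted` is a hypothetical helper written for this illustration.

```python
import warnings


def still_emitted(pattern: str, text: str) -> bool:
    """Return True if a UserWarning with `text` gets past an "ignore" filter built from `pattern`."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")                     # show everything by default
        warnings.filterwarnings("ignore", message=pattern)  # then hide matching messages
        warnings.warn(text, UserWarning)
    return len(caught) > 0


# Made-up warning text, for illustration only.
text = "Using the default length limit for generation"

print(still_emitted("using the default", text))  # False: matches the start, ignoring case
print(still_emitted("default length", text))     # True: pattern does not match at the start
print(still_emitted("using the-default", text))  # True: punctuation must match too
```

If a warning keeps appearing despite a filter, comparing the pattern character by character with the start of the printed message is a reasonable first check.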

**4. Model and Processor Initialization:** Load the BLIP model for question answering.

Import BlipForQuestionAnswering, the BLIP model class designed for Visual Question Answering (VQA), and load the pre-trained "blip-vqa-base" checkpoint from Salesforce's Hugging Face repository.

The model can now take an image and a question and generate an answer.

In [4]:
from transformers import BlipForQuestionAnswering
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



**5. Import and load the processor:** The processor (tokenizer + feature extractor) prepares inputs for the model.

Import the AutoProcessor class, which handles both image and text preprocessing, and load the processor that matches the BLIP checkpoint. The processor:

Converts images → tensors.

Tokenizes text (questions) → tensors.

This ensures the image and the question are in the format BLIP expects.
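The image half of this preprocessing typically involves resizing, rescaling pixel values from 0-255 to the 0-1 range, and normalizing each color channel. The sketch below shows the last two steps for a single pixel in plain Python. It is an illustration under assumptions, not the library's implementation: `normalize_pixel` is a hypothetical helper, and the mean and standard deviation are placeholders (the real values come from the checkpoint's preprocessor_config.json).

```python
from typing import Sequence, Tuple


def normalize_pixel(
    rgb: Sequence[float],
    mean: Sequence[float],
    std: Sequence[float],
) -> Tuple[float, ...]:
    """Rescale 0-255 channel values to 0-1, then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


# Placeholder statistics, NOT the values from the BLIP checkpoint.
mean = (0.5, 0.5, 0.5)
std = (0.5, 0.5, 0.5)

print(normalize_pixel((255, 0, 127.5), mean, std))  # (1.0, -1.0, 0.0)
```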

In [5]:
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
    "Salesforce/blip-vqa-base"
)


**6. Data Loading:** Mount Google Drive to access files stored in the user's Google Drive account. This step is specific to the Google Colab environment.

In [7]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
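Because the mount step only works inside Colab, a notebook that may also run elsewhere can check the environment first. The sketch below is one common way to do that, under the assumption that the `google.colab` package is importable only inside Colab; `in_colab` and the fallback directory are hypothetical choices made for this example.

```python
from pathlib import Path


def in_colab() -> bool:
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (normally present only inside Colab)
        return True
    except ImportError:
        return False


# Hypothetical fallback: outside Colab, look for files in the working directory.
data_root = Path("/content/drive/MyDrive") if in_colab() else Path.cwd()
print(data_root)
```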


**7. Load the input image:** Import the Image module from the Python Imaging Library (PIL) and open the image file ("city_people.jpg") from the mounted Google Drive path. This image will be used for visual question answering.

In [8]:
from PIL import Image
image = Image.open("/content/drive/MyDrive/Simplilearn_Tasks/city_people.jpg")
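A wrong path or an unmounted drive is the most likely failure at this step. The sketch below adds a small check that fails with a more descriptive message before the image is opened; `require_image_file` is a hypothetical helper, not part of PIL.

```python
from pathlib import Path


def require_image_file(path: str) -> Path:
    """Return `path` as a Path, or raise a descriptive error if it is not a file."""
    candidate = Path(path)
    if not candidate.is_file():
        raise FileNotFoundError(
            f"No image at {candidate}. Is Google Drive mounted and the path spelled correctly?"
        )
    return candidate
```

With this in place, the load could be written as `Image.open(require_image_file("/content/drive/MyDrive/Simplilearn_Tasks/city_people.jpg"))`.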

**8. Visual Question Answering Execution: First question and input preparation**
This cell defines the question (the text input).

The processor converts the image and the question into PyTorch tensors (return_tensors="pt") that BLIP can consume.

inputs is a dictionary-like object containing the processed data.
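To see what the processor produced, it can help to list each entry of inputs with its tensor shape. The sketch below defines a hypothetical helper, `summarize_inputs`, and exercises it with stand-in objects so it runs without the model; the entry names and shapes in `fake_inputs` are illustrative only and may differ from what the real processor returns.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping, Tuple


def summarize_inputs(inputs: Mapping[str, Any]) -> Dict[str, Tuple[int, ...]]:
    """Map each entry name to the shape of its tensor."""
    return {name: tuple(tensor.shape) for name, tensor in inputs.items()}


# Stand-in objects with a .shape attribute; names and shapes are illustrative only.
fake_inputs = {
    "pixel_values": SimpleNamespace(shape=(1, 3, 384, 384)),
    "input_ids": SimpleNamespace(shape=(1, 10)),
}
print(summarize_inputs(fake_inputs))
```

In the notebook itself, the same helper could be called as `summarize_inputs(inputs)` once inputs has been created.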

**9. Generate answer and decode:** Generate a textual answer from the model given the inputs, then decode the output tokens into a readable string and print it.

In [9]:
question = "how many people are in the picture?"
inputs = processor(image, question, return_tensors="pt")

# 9. Generate answer and decode:

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

20




✅ Summary of workflow:

Install and set up the Hugging Face transformers library.

Load BLIP VQA model + processor.

Load an image.

Provide a question about the image.

Convert (image + question) → tensors.

Model generates an answer.

Decode and print the final text answer.
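The last four steps of this workflow can be wrapped in one function so that several questions can be asked about the same image. This is a minimal sketch, not the notebook's original code: `answer_question` is a hypothetical helper, and passing `max_new_tokens` explicitly is an assumption of this sketch (the original cells call `generate` without it).

```python
from typing import Any


def answer_question(model: Any, processor: Any, image: Any, question: str,
                    max_new_tokens: int = 20) -> str:
    """Run one image + question pair through the processor and model, return decoded text."""
    # Convert (image + question) into tensors.
    inputs = processor(image, question, return_tensors="pt")
    # Generate answer token ids, capped at max_new_tokens.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode the first (and only) sequence into a readable string.
    return processor.decode(output_ids[0], skip_special_tokens=True)


# Usage with the objects created earlier in this notebook
# (the second question is an invented example):
# for q in ["how many people are in the picture?", "is it daytime?"]:
#     print(q, "->", answer_question(model, processor, image, q))
```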