# Inference with Fine-Tuned BLIP VQA Model

## Notebook Overview

**Purpose:** This notebook demonstrates how to perform inference using the fine-tuned BLIP Visual Question Answering (VQA) model. The fine-tuning process (presumably done in a separate notebook like `MedGPT_Finetuning.ipynb`) saved PEFT LoRA adapters, and this notebook loads them to answer questions based on user-provided images.

**Functionality:**
- Loads the pre-trained BLIP model and applies the fine-tuned PEFT LoRA adapters from the `Model/blip-saved-model/` directory.
- Provides interactive widgets for users to:
  - Upload an image.
  - Type a question related to the uploaded image.
- Upon image upload, the notebook processes the image and question, feeds them to the fine-tuned model, and displays the model's generated answer.

**Key Libraries:**
- `transformers`: For loading the BLIP model (`BlipForQuestionAnswering`) and its processor (`BlipProcessor`).
- `peft` (Hugging Face PEFT): For loading the LoRA adapter configuration (`PeftConfig`) and, when the adapters are applied explicitly, for wrapping the base model with the adapter weights (`PeftModel`).
- `PIL` (Pillow): For opening and handling images.
- `ipywidgets`: To create interactive file upload and text input widgets in the Jupyter environment.
- `IPython.display`: To display images and widgets within the notebook.
- `torch`: Core PyTorch library.

**Prerequisite Model:**
This notebook expects the fine-tuned BLIP model components to be present in the `Model/blip-saved-model/` directory. Specifically, it relies on:
- `adapter_config.json`: The PEFT LoRA adapter configuration file.
- `adapter_model.safetensors` (or `adapter_model.bin`): The saved PEFT LoRA adapter weights.
- Optionally, the full base-model weights. If only the adapters were saved, the base model named in `adapter_config.json` (`base_model_name_or_path`) is downloaded from the Hugging Face Hub and the adapters are applied on top of it.
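
Before running the rest of the notebook, it can help to confirm that the adapter files listed above are actually present. The sketch below is one way to do that using only the standard library. The helper name `check_adapter_dir` is hypothetical and not part of the original notebook; the file names are the ones listed above.

```python
from pathlib import Path


def check_adapter_dir(adapter_dir):
    """Return a list of problems found in a PEFT adapter directory (empty if none)."""
    root = Path(adapter_dir)
    problems = []
    if not (root / "adapter_config.json").is_file():
        problems.append("missing adapter_config.json")
    # Either weight format is acceptable.
    weight_files = ("adapter_model.safetensors", "adapter_model.bin")
    if not any((root / name).is_file() for name in weight_files):
        problems.append("missing adapter weights (" + " or ".join(weight_files) + ")")
    return problems
```

Calling `check_adapter_dir("Model/blip-saved-model")` and printing the result gives a quick sanity check; an empty list means both the configuration and a weights file were found.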

### 1. Imports

This cell imports all the necessary libraries:
- `BlipProcessor`, `BlipForQuestionAnswering` from `transformers` for the VQA model and its preprocessing.
- `requests` for fetching remote data (not used in this notebook's main flow).
- `Image` from `PIL` for image manipulation.
- `json`, `os`, `csv` for general utility and data handling (`csv` and `json` are not used in this notebook's main flow).
- `logging` for setting up logging (not used in this notebook).
- `tqdm` for progress bars (not used in this notebook's interactive flow).
- `torch` for PyTorch functionalities.
- `PeftModel`, `PeftConfig` from `peft` for loading PEFT adapters (LoRA in this case).
- `io` for handling byte streams (used with image upload).
- `display` from `IPython.display` for showing images and widgets in the notebook.
- `widgets` from `ipywidgets` for creating the interactive UI components.

In [3]:
from transformers import BlipProcessor, BlipForQuestionAnswering
import requests
from PIL import Image
import json, os, csv
import logging
from tqdm import tqdm
import torch
from peft import PeftModel, PeftConfig
import io
from IPython.display import display
from ipywidgets import widgets

### 2. PEFT Configuration Loading

This cell loads the PEFT configuration from the directory where the fine-tuned adapter model was saved.
- `peft_model_id = "Model/blip-saved-model"`: Specifies the path to the saved PEFT adapters and configuration.
- `config = PeftConfig.from_pretrained(peft_model_id)`: Loads `adapter_config.json` from that directory. The configuration records the base model (`base_model_name_or_path`) and the PEFT setup used during fine-tuning, such as the LoRA parameters. The next cell does not use `config` directly, because it passes the directory path straight to `from_pretrained`. Loading it here still verifies the path and keeps the configuration available in case the base model and adapters are loaded separately.

In [None]:
peft_model_id = "Model/blip-saved-model"
config = PeftConfig.from_pretrained(peft_model_id)
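
As a cross-check, the same file can be read with the standard library to see which base model the adapters expect. This is a minimal sketch: `summarize_adapter_config` is a hypothetical helper, and it assumes the usual PEFT LoRA keys (`base_model_name_or_path`, `peft_type`, `r`, `lora_alpha`, `target_modules`), any of which may be absent for other adapter types.

```python
import json
from pathlib import Path


def summarize_adapter_config(adapter_dir):
    """Read adapter_config.json and return the fields most relevant for loading."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with open(config_path, encoding="utf-8") as f:
        raw = json.load(f)
    keys = ("base_model_name_or_path", "peft_type", "r", "lora_alpha", "target_modules")
    # .get() keeps the helper usable for configs that lack LoRA-specific keys.
    return {key: raw.get(key) for key in keys}
```

For this notebook, `summarize_adapter_config("Model/blip-saved-model")` would be expected to report the base checkpoint that was fine-tuned, which is worth comparing against the processor checkpoint used in the next cell.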

### 3. Loading Model and Processor

This cell is responsible for loading the actual model and processor needed for inference:
- `processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")`: Loads the standard BLIP processor associated with the `Salesforce/blip-vqa-base` model. This processor handles the text tokenization and image preprocessing steps to prepare the data in the format the BLIP model expects.
- `model = BlipForQuestionAnswering.from_pretrained("Model/blip-saved-model").to("cuda")`: This is the key step. It loads the `BlipForQuestionAnswering` model.
    - Instead of loading the base `Salesforce/blip-vqa-base` model, it loads directly from `"Model/blip-saved-model"`.
    - Recent `transformers` releases, with `peft` installed, can detect `adapter_config.json` in the directory, load the base model named there, and then load the adapter weights on top. Older releases, or a checkpoint saved in a different layout, do not take this path, so whether the fine-tuned weights were actually loaded should be verified rather than assumed.
    - `.to("cuda")` moves the loaded model to the GPU for faster inference. It fails if no CUDA-enabled GPU is available.

The `stderr` output below ("Some weights of BlipForQuestionAnswering were not initialized from the model checkpoint ... and are newly initialized") should not be dismissed as routine. It means that `from_pretrained` found no values for the listed parameters in `Model/blip-saved-model` and initialized them randomly. The listed names are ordinary base-model weights (attention projections, LayerNorm parameters, and so on), not LoRA adapter weights, so in this run at least part of the base model was not restored from the checkpoint. A likely cause is that the directory holds only adapter weights, or weights saved under PEFT-wrapped parameter names that `BlipForQuestionAnswering` does not recognize. If the answers look poor, load the base `Salesforce/blip-vqa-base` checkpoint first and apply the adapters explicitly with `PeftModel.from_pretrained`, as in the commented-out cell in the next section.

In [4]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Model/blip-saved-model").to("cuda")

Some weights of BlipForQuestionAnswering were not initialized from the model checkpoint at Model/blip-saved-model and are newly initialized: ['text_encoder.encoder.layer.5.crossattention.self.value.weight', 'text_decoder.bert.encoder.layer.3.attention.self.value.bias', 'text_decoder.bert.encoder.layer.11.crossattention.self.query.bias', 'text_encoder.encoder.layer.5.output.LayerNorm.weight', 'text_encoder.encoder.layer.4.crossattention.output.LayerNorm.bias', 'text_encoder.encoder.layer.4.crossattention.output.LayerNorm.weight', 'text_encoder.encoder.layer.2.attention.output.LayerNorm.bias', 'text_encoder.encoder.layer.6.attention.output.dense.weight', 'text_encoder.encoder.layer.8.output.LayerNorm.bias', 'vision_model.encoder.layers.10.self_attn.projection.bias', 'text_encoder.encoder.layer.10.crossattention.output.dense.bias', 'text_decoder.bert.encoder.layer.5.attention.self.value.bias', 'text_encoder.encoder.layer.6.crossattention.self.value.bias', 'text_encoder.encoder.layer.5.att
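
To see how much of the model the warning above covers, the missing parameter names can be grouped by top-level module. The sketch below assumes the names are available as a list, for example from `from_pretrained(..., output_loading_info=True)`, which returns the model together with a dictionary that includes a `missing_keys` entry. `group_missing_keys` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def group_missing_keys(missing_keys):
    """Count missing parameter names per top-level module (e.g. 'text_encoder')."""
    return Counter(name.split(".", 1)[0] for name in missing_keys)


# Three names copied from the warning above.
example = [
    "text_encoder.encoder.layer.5.crossattention.self.value.weight",
    "text_decoder.bert.encoder.layer.3.attention.self.value.bias",
    "vision_model.encoder.layers.10.self_attn.projection.bias",
]
print(group_missing_keys(example))
```

An empty result would mean every parameter was found in the checkpoint. Large counts spread across `text_encoder`, `text_decoder`, and `vision_model`, as the warning suggests, indicate that the base weights were not restored.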

### 4. Loading PEFT Model (Commented Out)

This cell contains code that would explicitly load the PEFT adapters and apply them to a base model. However, it's commented out.

```python
# model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
# model.eval()
# print("Peft model loaded")
```

**Explanation:**
- `PeftModel.from_pretrained(model, peft_model_id, ...)`: This function is used to load LoRA (or other PEFT) adapters from the `peft_model_id` path and apply them to an existing base `model` (which would have been loaded as, e.g., `BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")`).
- `model.eval()`: Sets the model to evaluation mode, which is important for inference as it disables layers like dropout.

**Reason for being commented out:**
This cell is likely commented out because the previous cell already loads from `Model/blip-saved-model`, on the assumption that `from_pretrained` picks up the adapters by itself. That assumption holds only if the directory contains a fully merged model, or if the installed `transformers` version detects and loads the PEFT adapter files. Given the "newly initialized" warning printed by the previous cell, the assumption is worth checking. If it fails, this explicit `PeftModel.from_pretrained` step, applied to the base `Salesforce/blip-vqa-base` model, is the more reliable path.

In [None]:
# model = PeftModel.from_pretrained(model, peft_model_id, device_map={"":0})
# model.eval()
#
# print("Peft model loaded")
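
For reference, the explicit path that this cell describes could look like the sketch below: load the base checkpoint, wrap it with the saved adapters, then move it to a device and switch to evaluation mode. `load_finetuned_blip` and `pick_device` are hypothetical helpers. The base checkpoint `Salesforce/blip-vqa-base` is assumed to be the one used for fine-tuning, which `adapter_config.json` should confirm. The heavy imports sit inside the function so that defining it does not require `torch`, `transformers`, or `peft` to be importable.

```python
def pick_device(cuda_available):
    """Choose a device string, falling back to CPU when CUDA is unavailable."""
    return "cuda" if cuda_available else "cpu"


def load_finetuned_blip(adapter_dir, base_id="Salesforce/blip-vqa-base"):
    """Load the base BLIP VQA model and apply the LoRA adapters from adapter_dir."""
    import torch
    from peft import PeftModel
    from transformers import BlipForQuestionAnswering

    # Assumption: base_id matches the checkpoint the adapters were trained on.
    base = BlipForQuestionAnswering.from_pretrained(base_id)
    # Wrap the base model with the adapter weights saved during fine-tuning.
    model = PeftModel.from_pretrained(base, adapter_dir)
    model = model.to(pick_device(torch.cuda.is_available()))
    return model.eval()  # evaluation mode disables dropout
```

With this in place, `model = load_finetuned_blip("Model/blip-saved-model")` could stand in for the `from_pretrained` call in the previous cell. Unlike the hard-coded `.to("cuda")`, the device fallback lets the notebook run, slowly, on a machine without a GPU.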

### 5. Interactive Inference Function (`process_image_and_question`)

This cell defines the core logic for the interactive VQA demo.

**Function `process_image_and_question(change)`:**
- This function is designed to be called when the value of the `upload_button` widget changes (i.e., when a file is uploaded).
- **Image Handling:**
  - It retrieves the uploaded image data from `upload_button.value`, which is a tuple of dicts in ipywidgets 8 and a dict keyed by file name in ipywidgets 7.
  - `Image.open(io.BytesIO(image_data)).convert("RGB")`: Opens the image from its byte content and ensures it's in RGB format.
  - `display(image)`: Shows the uploaded image in the notebook output.
- **Question Handling:**
  - `question = question_input.value`: Gets the text entered by the user in the `question_input` widget.
- **Inference:**
  - `inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, dtype=...)`:
    - The BLIP `processor` tokenizes the question and preprocesses the image into the tensors the model expects.
    - `return_tensors="pt"` ensures PyTorch tensors are returned.
    - `.to(model.device, dtype=...)` moves the inputs to the same device as the model and casts the floating-point tensors (the pixel values) to the model's precision, for example `float16` for a half-precision model. Integer tensors such as the token IDs keep their dtype.
    - `model.eval()` is called once at the top of the cell. `from_pretrained` already returns the model in evaluation mode, so the call is a safeguard that keeps dropout disabled during inference.
  - `out = model.generate(**inputs)`: The model generates an answer sequence from the processed image and question; `**inputs` unpacks the dictionary of inputs. The call runs inside `torch.no_grad()` so that no gradients are tracked.
  - `generated_text = processor.decode(out[0], skip_special_tokens=True)`: The generated output tokens (from `out[0]`) are decoded back into human-readable text by the `processor`. `skip_special_tokens=True` removes any special tokens like padding or end-of-sequence tokens.
- **Display Answer:**
  - `print("Answer:", generated_text)`: Prints the model's answer.

**Widget Setup:**
- `upload_button = widgets.FileUpload(...)`: Creates a file upload button. `accept='image/*'` restricts uploads to image files, and `multiple=False` allows only one file at a time.
- `question_input = widgets.Text(...)`: Creates a text input field for the user to type their question.

**Event Handling:**
- `upload_button.observe(process_image_and_question, names='value')`: This is crucial. It links the `process_image_and_question` function to the `upload_button`. Whenever a file is successfully uploaded (i.e., the `value` of the widget changes), the function will be executed.

In [6]:
import torch
from PIL import Image
import io
from IPython.display import display
from ipywidgets import widgets

# Assuming 'processor' and 'model' are already defined and loaded
# Ensure model is in evaluation mode
model.eval()

def process_image_and_question(change):
    if upload_button.value:
        # Get the image file
        # Correctly access the uploaded file's content
        # ipywidgets 8: upload_button.value is a tuple of dicts, one per uploaded file
        if isinstance(upload_button.value, tuple) and len(upload_button.value) > 0:
            uploaded_file_dict = upload_button.value[0]  # Dict for the first (only) file
            image_data = uploaded_file_dict['content']
        elif isinstance(upload_button.value, dict):  # ipywidgets 7: dict keyed by file name
            uploaded_file = next(iter(upload_button.value.values()))
            image_data = uploaded_file['content']
        else:
            print("Could not retrieve uploaded file data.")
            return
        
        image = Image.open(io.BytesIO(image_data)).convert("RGB")

        # Display the uploaded image
        display(image)

        # Read the question from the text widget
        question = question_input.value

        # Prepare inputs
        # Ensure inputs are on the same device as the model and correct dtype
        inputs = processor(images=image, text=question, return_tensors="pt").to(model.device, dtype=model.dtype)

        # Generate output from the model
        with torch.no_grad(): # Ensure no gradients are computed during inference
            out = model.generate(**inputs)
        generated_text = processor.decode(out[0], skip_special_tokens=True)

        # Display the generated text
        print("Answer:", generated_text)
        
        # Reset the widget so that uploading the same file again re-triggers the handler.
        # ipywidgets 8 stores the value as a tuple, which has no .clear(); ipywidgets 7 uses a dict.
        if isinstance(upload_button.value, tuple):
            upload_button.value = ()
        else:
            upload_button.value.clear()

# Create widgets
upload_button = widgets.FileUpload(
    accept='image/*',  # Restrict the file picker to image files
    multiple=False,  # Allow only single file upload
    description='Upload Image'
)

question_input = widgets.Text(
    value='What is in this image?', # Default question
    description='Question:',
    placeholder='Type your question here...',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

# Run the handler whenever the upload widget's 'value' trait changes.
# ('data' was a separate trait in ipywidgets 7 only; 'value' exists in both 7 and 8.)
upload_button.observe(process_image_and_question, names='value')

# Display widgets
display(upload_button, question_input)

FileUpload(value=(), accept='image/*', description='Upload')

Text(value='', description='Question:', placeholder='Type your question here...')
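
The version-dependent handling of `upload_button.value` can be isolated in a small function, which also makes it testable without a running notebook. This sketch assumes the two layouts handled in the cell above: a tuple of dicts (ipywidgets 8) and a dict keyed by file name (ipywidgets 7). `extract_upload_bytes` is a hypothetical name.

```python
def extract_upload_bytes(value):
    """Return the first uploaded file's content as bytes, or None if nothing was uploaded."""
    if not value:
        return None
    if isinstance(value, (tuple, list)):  # ipywidgets 8: sequence of dicts
        first = value[0]
    elif isinstance(value, dict):  # ipywidgets 7: {file name: dict}
        first = next(iter(value.values()))
    else:
        return None
    # The content may be a memoryview (ipywidgets 8) or bytes; bytes() handles both.
    return bytes(first["content"])
```

Inside the handler, `image_data = extract_upload_bytes(upload_button.value)` followed by an early return when the result is `None` would replace the `isinstance` branches.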

### 6. Displaying Widgets

The last line of the cell above is:
`display(upload_button, question_input)`

It uses `IPython.display.display` to render the `FileUpload` button and the `Text` input field in the notebook's output area. Once the user uploads an image, the `process_image_and_question` function observed by the upload button runs VQA inference on it, using whatever question is in the text field at that moment.
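
The inference steps themselves do not depend on the widgets, so they can be factored into a plain function and called from a script or a test. The sketch below is one possible shape. `answer_question` is a hypothetical name, and the function relies only on the calls the notebook already makes (`processor(...)`, `.to(...)`, `model.generate(...)`, `processor.decode(...)`). The `torch.no_grad()` wrapper used in the notebook is left to the caller so that the function itself has no imports.

```python
def answer_question(model, processor, image, question):
    """Run one VQA query and return the decoded answer text."""
    # Tokenize the question and preprocess the image into model inputs.
    inputs = processor(images=image, text=question, return_tensors="pt")
    # Keep the inputs on the same device as the model.
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs)
    # Decode the first (only) generated sequence, dropping special tokens.
    return processor.decode(output_ids[0], skip_special_tokens=True)
```

The widget callback would then reduce to opening the image, calling `answer_question(model, processor, image, question_input.value)`, and printing the result.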