<a href="https://colab.research.google.com/github/notzabir/EBL_IT_INTERNSHIP_2025/blob/main/Vision_RAg/vision_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Build a vision RAG pipeline to process and analyze images from a PDF using the "Salesforce/blip-image-captioning-base" model. The pipeline should extract all images, allow selection by number of images or by page number, and retrieve and describe the images relevant to a user's question about the PDF.

## Setup

### Subtask:
Install necessary libraries for PDF processing, image handling, and the BLIP model.


**Reasoning**:
Install the required libraries for PDF processing, image handling, and the BLIP model.



In [2]:
%pip install PyMuPDF Pillow torch transformers faiss-cpu sentence-transformers
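
Once the install finishes, a quick check can confirm that each package is importable under the module name the later cells use. The sketch below relies only on the standard library. The helper name `missing_modules` is hypothetical, and the list assumes the import names used in this notebook (`fitz` for PyMuPDF, `PIL` for Pillow, `faiss` for faiss-cpu).

```python
from importlib.util import find_spec

def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names used by the cells in this notebook (not the pip package names)
REQUIRED_MODULES = ["fitz", "PIL", "torch", "transformers", "faiss", "sentence_transformers"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", missing if missing else "none")
```

An empty result means every import in the following cells should resolve; any name that is listed needs to be installed before continuing.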



## PDF image extraction

### Subtask:
Extract images from the specified PDF document.


**Reasoning**:
Import PyMuPDF (the `fitz` module) and define a function that extracts images from a PDF, saves each one to disk, and records its page number, image index, and file path.



In [18]:
import io
import os

import fitz  # PyMuPDF
from PIL import Image

def extract_images_from_pdf(pdf_path, output_folder="extracted_images"):
    """Extracts embedded images from a PDF document and saves them to a folder.

    Returns a list of dicts holding the 0-based page number, the image's index
    on that page, the saved file path, and a caption placeholder.
    """
    extracted_images = []
    doc = fitz.open(pdf_path)

    # Create the output folder if it doesn't exist
    os.makedirs(output_folder, exist_ok=True)

    for page_number in range(len(doc)):
        page = doc.load_page(page_number)
        images = page.get_images(full=True)

        for img_index, img in enumerate(images):
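            xref = img[0]  # PDF cross-reference number of the image object

            try:
                # Extract inside the try block so one unreadable image does not stop the run
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]
                image_ext = base_image["ext"]

                # Verify that the bytes decode as an image
                img_pil = Image.open(io.BytesIO(image_bytes))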

                # Define the output path for the image
                image_filename = f"page_{page_number}_img_{img_index}.{image_ext}"
                image_path = os.path.join(output_folder, image_filename)

                # Save the image to the output folder
                img_pil.save(image_path)

                extracted_images.append({
                    'page_number': page_number,
                    'image_index': img_index,
                    'image_path': image_path, # Store the image path instead of data
                    'caption': None # Initialize caption as None
                })
            except Exception as e:
                print(f"Could not process or save image on page {page_number}, index {img_index}: {e}")

    doc.close()
    return extracted_images

# Replace with the actual path to your PDF file
pdf_path = "ebl_annual_report_2024.pdf"
extracted_images_list = extract_images_from_pdf(pdf_path)

# Report how many images were extracted and saved
print(f"Extracted {len(extracted_images_list)} images.")

Extracted 137 images.
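
Because each record keeps its `page_number`, the extraction result can be summarized per page before captioning. The following is a minimal sketch, assuming records shaped like those returned by `extract_images_from_pdf`. The helper name `group_by_page` is hypothetical, and the sample records are illustrative stand-ins with the same keys.

```python
from collections import defaultdict

def group_by_page(records):
    """Group image records by their 0-based page number, preserving order."""
    pages = defaultdict(list)
    for record in records:
        pages[record["page_number"]].append(record)
    return dict(pages)

# Illustrative records with the same keys as the extraction output
sample_records = [
    {"page_number": 0, "image_index": 0, "image_path": "extracted_images/page_0_img_0.jpeg", "caption": None},
    {"page_number": 2, "image_index": 0, "image_path": "extracted_images/page_2_img_0.jpeg", "caption": None},
    {"page_number": 2, "image_index": 1, "image_path": "extracted_images/page_2_img_1.jpeg", "caption": None},
]

by_page = group_by_page(sample_records)
for page, items in sorted(by_page.items()):
    print(f"page {page}: {len(items)} image(s)")
```

Pages without embedded images simply do not appear as keys, which is why page numbers in the extraction output can skip values.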


## Image captioning

### Subtask:
Use the "Salesforce/blip-image-captioning-base" model to generate captions for the extracted images.


**Reasoning**:
Import the necessary classes from the transformers library, load the model and processor, and iterate through the extracted images to generate captions.



In [20]:
import os  # Needed by the dummy-data fallback below

import torch
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Load the pre-trained BLIP model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Assume extracted_images_list is available from the previous step
# If not, create a dummy list for demonstration purposes
if 'extracted_images_list' not in locals() or not extracted_images_list:
    print("extracted_images_list is not available or empty. Creating dummy data.")
    # Create a dummy image file
    dummy_image = Image.new('RGB', (60, 30), color = 'red')
    dummy_folder = "dummy_images"
    if not os.path.exists(dummy_folder):
        os.makedirs(dummy_folder)
    dummy_image_path_1 = os.path.join(dummy_folder, "dummy_img_1.png")
    dummy_image.save(dummy_image_path_1)
    dummy_image_path_2 = os.path.join(dummy_folder, "dummy_img_2.png")
    dummy_image.save(dummy_image_path_2)


    extracted_images_list = [
        {'page_number': 0, 'image_index': 0, 'image_path': dummy_image_path_1, 'caption': None},
        {'page_number': 1, 'image_index': 0, 'image_path': dummy_image_path_2, 'caption': None}
    ]


# Iterate through the extracted images and generate captions
for image_info in extracted_images_list:
    try:
        # Open the image from the saved file path. Image.open raises on failure,
        # so unreadable files are handled by the except block below.
        raw_image = Image.open(image_info['image_path']).convert('RGB')

        # Preprocess the image
        inputs = processor(raw_image, return_tensors="pt")

        # Generate a caption
        out = model.generate(**inputs)

        # Decode the generated caption
        caption = processor.decode(out[0], skip_special_tokens=True)

        # Add the caption to the dictionary
        image_info['caption'] = caption
        print(f"Generated caption for image on page {image_info['page_number']}, index {image_info['image_index']}: {caption}")

    except Exception as e:
        print(f"Could not generate caption for image on page {image_info['page_number']}, index {image_info['image_index']}: {e}")
        image_info['caption'] = "Error generating caption" # Add an error placeholder

# extracted_images_list was updated in place, so captioned_images_list is an
# alias for the same list object rather than a copy
captioned_images_list = extracted_images_list

# Display the first few items in the updated list to verify
print("\nUpdated extracted_images_list with captions:")
display(captioned_images_list[:3])

Generated caption for image on page 0, index 0: a white background with a blue and yellow text that reads reflect
Generated caption for image on page 2, index 0: a white and black photo of a lake
Generated caption for image on page 2, index 1: a flock of birds flying in the sky
Generated caption for image on page 2, index 2: the logo for the university of north carolina
Generated caption for image on page 2, index 3: the logo for the university of north carolina
Generated caption for image on page 2, index 4: the logo for the new youtube app
Generated caption for image on page 5, index 0: a foggy lake
Generated caption for image on page 5, index 1: a square frame with a white background
Generated caption for image on page 6, index 0: a lone tree in the fog on a lake
Generated caption for image on page 7, index 0: a body of water
Generated caption for image on page 8, index 0: a large body of water with mountains in the background
Generated caption for image on page 9, index 0: a mounta

[{'page_number': 0,
  'image_index': 0,
  'image_path': 'extracted_images/page_0_img_0.jpeg',
  'caption': 'a white background with a blue and yellow text that reads reflect'},
 {'page_number': 2,
  'image_index': 0,
  'image_path': 'extracted_images/page_2_img_0.jpeg',
  'caption': 'a white and black photo of a lake'},
 {'page_number': 2,
  'image_index': 1,
  'image_path': 'extracted_images/page_2_img_1.jpeg',
  'caption': 'a flock of birds flying in the sky'}]
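
The captioning loop writes a placeholder string when a caption cannot be produced, so it can be useful to separate usable records from failed ones before indexing. The following is a small sketch under that assumption. `split_by_caption_status` is a hypothetical helper, and the set of placeholder strings is an assumption that should be adjusted to match the captioning cell.

```python
# Placeholder strings treated as failures (an assumption; adjust to match the captioning cell)
ERROR_CAPTIONS = {"Error generating caption", "Error opening image file"}

def split_by_caption_status(records):
    """Return (usable, failed) lists based on each record's caption."""
    usable, failed = [], []
    for record in records:
        caption = record.get("caption")
        if caption and caption not in ERROR_CAPTIONS:
            usable.append(record)
        else:
            failed.append(record)
    return usable, failed

# Illustrative records, not taken from the real run
sample = [
    {"page_number": 0, "image_index": 0, "caption": "a foggy lake"},
    {"page_number": 1, "image_index": 0, "caption": "Error generating caption"},
    {"page_number": 1, "image_index": 1, "caption": None},
]
usable, failed = split_by_caption_status(sample)
print(f"{len(usable)} usable, {len(failed)} failed")
```

Keeping the failed list around makes it possible to retry those images later instead of silently dropping them.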

## Indexing

### Subtask:
Create an index (e.g., using a vector database or similar structure) to store the image captions and their corresponding image/page information.


**Reasoning**:
Initialize the FAISS index and a sentence transformer model, then iterate through the captioned images to generate embeddings and add them to the index while storing the image information.



In [21]:
import faiss
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize a sentence transformer model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Assume captioned_images_list is available from the previous step
# If not, create a dummy list for demonstration purposes
if 'captioned_images_list' not in locals() or not captioned_images_list:
    print("captioned_images_list is not available or empty. Creating dummy data.")
    captioned_images_list = [
        {'page_number': 0, 'image_index': 0, 'caption': 'a red square'},
        {'page_number': 1, 'image_index': 0, 'caption': 'a blue circle'},
        {'page_number': 2, 'image_index': 0, 'caption': 'a green triangle'}
    ]

# Get the dimensionality of the embeddings
embedding_dim = embedding_model.get_sentence_embedding_dimension()

# Initialize a FAISS index
index = faiss.IndexFlatL2(embedding_dim)

# List to store the mapping from index ID to image information
index_to_image_info = []

# Iterate through the captioned images and add embeddings to the index
for image_info in captioned_images_list:
    caption = image_info['caption']
    if caption and caption != "Error generating caption": # Only process if a valid caption exists
        # Generate the caption embedding as a float32 array of shape (1, embedding_dim),
        # the 2D layout that FAISS expects
        embedding = np.asarray([embedding_model.encode(caption)], dtype=np.float32)

        # Add the embedding to the FAISS index
        index.add(embedding)

        # Store the image information, mapping the index ID to the info
        index_to_image_info.append({
            'page_number': image_info['page_number'],
            'image_index': image_info['image_index']
        })
    else:
        print(f"Skipping image on page {image_info.get('page_number')}, index {image_info.get('image_index')} due to missing or error caption.")


# Verify the number of items in the index
print(f"Number of items in FAISS index: {index.ntotal}")
print(f"Number of items in index_to_image_info list: {len(index_to_image_info)}")

# Check if the counts match
if index.ntotal == len(index_to_image_info):
    print("Index count matches the number of successfully captioned images.")
else:
    print("Warning: Index count does not match the number of successfully captioned images.")


Number of items in FAISS index: 137
Number of items in index_to_image_info list: 137
Index count matches the number of successfully captioned images.
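
`faiss.IndexFlatL2` performs an exact search: it compares the query against every stored vector and returns the `k` smallest squared L2 distances. The pure-Python sketch below reproduces that computation on toy 2-D vectors to make the behavior concrete. It is an illustration rather than FAISS code, and `flat_l2_search` is a hypothetical name.

```python
def flat_l2_search(vectors, query, k):
    """Exhaustive nearest-neighbor search by squared L2 distance.

    Returns (distances, indices) for the k closest vectors, nearest first.
    """
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # Ascending distance; ties are broken by the lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy "embeddings" standing in for caption vectors
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
distances, indices = flat_l2_search(vectors, query=[0.9, 0.1], k=2)
print(indices, distances)
```

In the notebook, each returned index is a position in `index_to_image_info`, which is why that list must be appended to in the same order as vectors are added to the index.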


## Retrieval

### Subtask:
Implement a mechanism to retrieve relevant image captions based on user queries (e.g., questions, page numbers, or number of images).


**Reasoning**:
Implement the function to retrieve images based on user queries, handling text questions, page numbers, and specific numbers of images.



In [24]:
import re # Import regex for checking if the query is a number

def retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model, N=5):
    """
    Retrieves relevant image information based on a user query.

    Args:
        query (str): The user query (text question, page number, or number of images).
        captioned_images_list (list): List of dictionaries with image info and captions.
        index (faiss.Index): The FAISS index containing caption embeddings.
        index_to_image_info (list): List mapping index ID to image info.
        embedding_model: The sentence transformer model for generating query embeddings.
        N (int): The number of top results to retrieve for text queries.

    Returns:
        list: A list of dictionaries containing the retrieved image information.
    """
    retrieved_images = []

    # Check if the query is a page number (digits only; page numbers are 0-based)
    if re.fullmatch(r'\d+', query):
        page_number = int(query)
        print(f"Query recognized as page number: {page_number}")
        for image_info in captioned_images_list:
            if image_info.get('page_number') == page_number:
                retrieved_images.append(image_info)
        print(f"Found {len(retrieved_images)} images on page {page_number}.")

    # Check if the query is a request for a specific number of images (e.g., "first 3 images").
    # The pattern is matched once and the match object is reused.
    elif (match := re.fullmatch(r'(first|top)\s+(\d+)\s+images', query.lower())):
        num_images = int(match.group(2))
        print(f"Query recognized as request for first {num_images} images.")
        retrieved_images = captioned_images_list[:num_images]
        print(f"Retrieving the first {len(retrieved_images)} images.")

    # Otherwise, treat the query as a text question
    else:
        print(f"Query recognized as text question: \"{query}\"")
        try:
            # Generate embedding for the query
            query_embedding = embedding_model.encode(query)
            query_embedding = np.array([query_embedding]) # Reshape for FAISS

            # Search the FAISS index
            # Ensure N does not exceed the number of indexed items
            k = min(N, index.ntotal)
            if k > 0:
                distances, indices = index.search(query_embedding, k)

                # Retrieve the corresponding image information
                for i in indices[0]:
                    # FAISS pads missing results with -1, so require a non-negative, in-range ID
                    if 0 <= i < len(index_to_image_info):
                        img_info = index_to_image_info[i]
                        # Find the full image info from captioned_images_list using page_number and image_index
                        for full_img_info in captioned_images_list:
                            if full_img_info.get('page_number') == img_info['page_number'] and \
                               full_img_info.get('image_index') == img_info['image_index']:
                                retrieved_images.append(full_img_info)
                                break # Found the image, move to the next index
                    else:
                        print(f"Warning: Invalid index {i} retrieved from FAISS.")

                print(f"Retrieved {len(retrieved_images)} images based on text query.")
            else:
                print("FAISS index is empty. Cannot perform text search.")


        except Exception as e:
            print(f"Error during text query retrieval: {e}")


    return retrieved_images

# Example Usage (Assuming captioned_images_list, index, index_to_image_info, embedding_model are defined)
# Example 1: Text query
query = "images about trophy"
retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model, N=3)
print("\nResults for text query:")
display(retrieved_results)

# Example 2: Page number query (assuming page 0 has images)
query = "0"
retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model)
print("\nResults for page number query:")
display(retrieved_results)

# Example 3: Number of images query
# query = "first 2 images"
# retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model)
# print("\nResults for number of images query:")
# display(retrieved_results)

Query recognized as text question: "images about trophy"
Retrieved 3 images based on text query.

Results for text query:


[{'page_number': 12,
  'image_index': 0,
  'image_path': 'extracted_images/page_12_img_0.png',
  'caption': 'the trophy trophy'},
 {'page_number': 12,
  'image_index': 28,
  'image_path': 'extracted_images/page_12_img_28.png',
  'caption': 'the trophy trophy'},
 {'page_number': 12,
  'image_index': 3,
  'image_path': 'extracted_images/page_12_img_3.png',
  'caption': 'a trophy with a green shirt on it'}]

Query recognized as page number: 0
Found 1 images on page 0.

Results for page number query:


[{'page_number': 0,
  'image_index': 0,
  'image_path': 'extracted_images/page_0_img_0.jpeg',
  'caption': 'a white background with a blue and yellow text that reads reflect'}]
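
The routing inside `retrieve_images` can be isolated into a small function, which makes the three query types easy to test without loading any model. The sketch below mirrors the patterns used above. The name `classify_query` and the shape of the returned tuple are assumptions for illustration.

```python
import re

COUNT_PATTERN = re.compile(r"(first|top)\s+(\d+)\s+images")

def classify_query(query):
    """Classify a query as ('page', n), ('count', n), or ('text', query)."""
    if re.fullmatch(r"\d+", query):
        return ("page", int(query))
    match = COUNT_PATTERN.fullmatch(query.lower())
    if match:
        return ("count", int(match.group(2)))
    return ("text", query)

for example in ["0", "first 3 images", "images about trophy"]:
    print(example, "->", classify_query(example))
```

One consequence of this routing is that any digits-only query is treated as a page number, so a question consisting of a bare number can never reach the text search.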

## Response generation

### Subtask:
Based on the retrieved captions, provide the user with the relevant image descriptions.


**Reasoning**:
Define the `generate_response` function to format the retrieved image information and then call it with an example query to demonstrate its usage.



In [25]:
def generate_response(retrieved_images):
    """
    Generates a human-readable response string from a list of retrieved images.

    Args:
        retrieved_images (list): A list of dictionaries containing retrieved image information.

    Returns:
        str: A formatted string describing the retrieved images.
    """
    if not retrieved_images:
        return "No relevant images were found."

    response = "Retrieved Images:\n"
    for image_info in retrieved_images:
        page_number = image_info.get('page_number', 'N/A')
        image_index = image_info.get('image_index', 'N/A')
        # `or` also covers records whose caption is stored as None
        caption = image_info.get('caption') or 'No caption available'
        response += f"- Image on Page {page_number}, Index {image_index}: {caption}\n"

    return response

# Example usage 1: text query
# The phrase comes from the dummy captions; on the real index it returns the nearest captions
query = "a red square"
retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model, N=3)

# Generate and print the response
response_string = generate_response(retrieved_results)
print(response_string)

# Example usage 2: page number query (page numbers are 0-based)
query = "0"
retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model, N=3)

# Generate and print the response
response_string = generate_response(retrieved_results)
print(response_string)

# Example usage 3: "first N images" query (the pattern requires the plural "images")
query = "first 1 images"
retrieved_results = retrieve_images(query, captioned_images_list, index, index_to_image_info, embedding_model, N=3)

# Generate and print the response
response_string = generate_response(retrieved_results)
print(response_string)

Query recognized as text question: "a red square"
Retrieved 3 images based on text query.
Retrieved Images:
- Image on Page 12, Index 19: a yellow square with a white background
- Image on Page 43, Index 0: a square with a blue background
- Image on Page 12, Index 18: a brown square with a white border

Query recognized as page number: 0
Found 1 images on page 0.
Retrieved Images:
- Image on Page 0, Index 0: a white background with a blue and yellow text that reads reflect

Query recognized as request for first 1 images.
Retrieving the first 1 images.
Retrieved Images:
- Image on Page 0, Index 0: a white background with a blue and yellow text that reads reflect



## Summary

### Data Analysis Key Findings

*   The pipeline extracted 137 embedded images from the PDF using PyMuPDF (imported as `fitz`) and saved them to the `extracted_images` folder.
*   The "Salesforce/blip-image-captioning-base" model produced a caption for every extracted image, since all 137 captions passed the indexing filter. Caption quality varies: some captions in the output are generic or repetitive (for example, "the trophy trophy").
*   A FAISS `IndexFlatL2` index stores the `all-MiniLM-L6-v2` embeddings of the 137 captions and supports exact L2 nearest-neighbor search over them.
*   A retrieval mechanism was implemented that handles text queries (nearest captions by embedding distance), page number queries (all images on a given 0-based page), and requests for a specific number of images (the first N images in extraction order).
*   A response generation function was created to present the retrieved image information, including page number, index, and caption, in a human-readable format.

### Insights or Next Steps

*   Integrate the pipeline components into a single, cohesive application or function that takes a PDF path and user query as input and returns the relevant image descriptions.
*   Enhance the retrieval mechanism to handle more complex queries or combinations of query types, potentially incorporating re-ranking or other advanced retrieval techniques.
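
As a starting point for the first next step listed above, the routing, retrieval, and formatting stages could be combined behind one function. The sketch below is hypothetical: `answer_query` and the injected `text_search` callable are assumed names, and the stand-in `keyword_search` function simply matches words so that the example runs without FAISS or an embedding model.

```python
import re

def answer_query(query, records, text_search, top_k=3):
    """Route a query, collect matching image records, and format a response."""
    count_match = re.fullmatch(r"(first|top)\s+(\d+)\s+images", query.lower())
    if re.fullmatch(r"\d+", query):
        hits = [r for r in records if r["page_number"] == int(query)]
    elif count_match:
        hits = records[: int(count_match.group(2))]
    else:
        hits = text_search(query, records, top_k)

    if not hits:
        return "No relevant images were found."
    lines = [
        f"- Image on Page {r['page_number']}, Index {r['image_index']}: {r['caption']}"
        for r in hits
    ]
    return "Retrieved Images:\n" + "\n".join(lines)

def keyword_search(query, records, top_k):
    """Stand-in for the embedding search: keep records sharing a word with the query."""
    words = set(query.lower().split())
    return [r for r in records if words & set(r["caption"].lower().split())][:top_k]

# Illustrative records, not the full extraction output
records = [
    {"page_number": 0, "image_index": 0, "caption": "a foggy lake"},
    {"page_number": 12, "image_index": 3, "caption": "a trophy with a green shirt on it"},
]
print(answer_query("images about trophy", records, keyword_search))
```

Passing the search function as an argument keeps the wrapper independent of the index, so the FAISS-backed search from the retrieval cell could be swapped in without changing the routing or formatting code.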
