# BLIP (Bootstrapping Language-Image Pretraining)

## Introduction to BLIP

BLIP is a vision-language model that sits at the intersection of natural language processing (NLP) and computer vision. It learns to associate images with relevant text, which allows it to generate captions, answer image-related questions, and support image-based search queries.

## Why BLIP Matters

BLIP is crucial for several reasons:

- Enhanced understanding: It provides a more nuanced understanding of the content within images, going beyond object recognition to comprehend scenes, actions, and interactions.
- Multimodal learning: By integrating text and image data, BLIP facilitates multimodal learning, which is closer to how humans perceive the world.
- Accessibility: Generating accurate image descriptions can make content more accessible to people with visual impairments.
- Content creation: It supports creative and marketing endeavors by generating descriptive texts for visual content, saving time and enhancing creativity.

## Real-World Use Case: Automated Photo Captioning

A practical application of BLIP is in developing an automated photo captioning system. Such a system can be used in diverse domains. It enhances social media platforms by suggesting captions for uploaded photos automatically. It also aids digital asset management systems by offering searchable descriptions for stored images.

# Image Captioning

In [32]:
# Install the dependencies (uncomment the lines you need)
#!pip install transformers
#!pip install pillow
#!pip install torch
#!pip install requests
#!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
import os
import glob
from pathlib import Path
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

In [2]:
# Initialize the processor and model from Hugging Face
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

In [3]:
# Load an image
image = Image.open("./resources/BaysanSoft.png")
# Prepare the image
inputs = processor(image, return_tensors="pt")
# Generate captions
outputs = model.generate(**inputs)
caption = processor.decode(outputs[0], skip_special_tokens=True)



In [4]:
print("Generated Caption:", caption)

Generated Caption: the logo for the future of software


# Image Captioning with the Large Model

In [6]:
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

In [11]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")
# Local file path (not a URL)
img_path = './resources/logo.png'
raw_image = Image.open(img_path).convert('RGB')

## Conditional Image Captioning

In [13]:
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))



a photography of a white snake with a blue tail and a purple tail


## Unconditional Image Captioning

In [None]:
inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
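The two cells above differ only in whether a text prefix is passed to the processor. As a minimal sketch, they could be folded into one helper. The name `describe_image` is hypothetical, and the generation settings in the usage comment (`max_new_tokens`, `num_beams`) are illustrative values, not settings taken from this notebook.

```python
def describe_image(image, model, processor, prompt=None, **generate_kwargs):
    """Caption one PIL image; pass `prompt` for conditional captioning."""
    if prompt is None:
        # Unconditional: the model writes the whole caption.
        inputs = processor(image, return_tensors="pt")
    else:
        # Conditional: the caption continues the given text prefix.
        inputs = processor(image, prompt, return_tensors="pt")
    # Extra keyword arguments are forwarded unchanged to model.generate().
    out = model.generate(**inputs, **generate_kwargs)
    return processor.decode(out[0], skip_special_tokens=True)


# Usage with the objects loaded above (illustrative settings):
# describe_image(raw_image, model, processor)
# describe_image(raw_image, model, processor, prompt="a photography of",
#                max_new_tokens=30, num_beams=3)
```

Passing the model and processor as arguments keeps the helper independent of which checkpoint is loaded.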

## My Demos

In [61]:
def summary_of_image(image_path, model, processor):
    raw_image = Image.open(image_path).convert('RGB')
    inputs = processor(raw_image, return_tensors="pt")
    out = model.generate(**inputs, max_length=150)
    summary = processor.decode(out[0], skip_special_tokens=True)
    return summary

In [62]:
images_folder = Path(os.path.abspath(".")) / "resources"

In [63]:
images = glob.glob(str(images_folder / "*"))
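The `"*"` pattern matches every entry in the folder, including non-image files that `Image.open` would reject. A minimal sketch of a stricter listing follows. The helper name `list_images` and the extension set are assumptions, so adjust them to the files you actually store.

```python
from pathlib import Path

# Assumed set of extensions; extend it to match your own files.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".webp"}


def list_images(folder):
    """Return sorted paths of files in `folder` whose extension looks like an image."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


# Usage with the folder defined above:
# images = list_images(images_folder)
```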

In [64]:
for image_path in images:
    summary = summary_of_image(image_path, model, processor)
    print(summary)

arafed building with a clock tower in the background
a large explosion of smoke and smoke is seen from a building
zebras and elephants are grazing in a field with a lion
a drawing of a lion and a dog are standing together
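Captions like these can back the searchable descriptions mentioned in the use case earlier. The sketch below does a plain keyword match over a caption dictionary. The file names are hypothetical placeholders, since the notebook output does not show which file produced which caption.

```python
def search_captions(captions, query):
    """Return file names whose caption contains every word in `query`."""
    words = query.lower().split()
    return [
        name for name, caption in captions.items()
        if all(word in caption.lower().split() for word in words)
    ]


# Hypothetical file names paired with the captions printed above.
captions = {
    "tower.jpg": "arafed building with a clock tower in the background",
    "smoke.jpg": "a large explosion of smoke and smoke is seen from a building",
    "savanna.jpg": "zebras and elephants are grazing in a field with a lion",
    "drawing.jpg": "a drawing of a lion and a dog are standing together",
}

print(search_captions(captions, "lion"))            # ['savanna.jpg', 'drawing.jpg']
print(search_captions(captions, "building smoke"))  # ['smoke.jpg']
```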


# Gradio

## Why Use Gradio?

Gradio is useful for several reasons:
- Ease of use: Gradio enables the creation of interfaces for models with just a few lines of code.
- Flexibility: Gradio supports various inputs and outputs, such as text, images, files, and more.
- Sharing and collaboration: Interfaces can be shared with others through unique URLs, facilitating easy collaboration and feedback collection.

In [None]:
!pip install gradio

In [None]:
import gradio as gr

In [None]:
def greet(name, intensity):
  return "Hello, " + name + "!" * int(intensity)

In [None]:
demo = gr.Interface(
  fn=greet,
  inputs=["text", "slider"],
  outputs=["text"],
)

In [None]:
demo.launch() # http://localhost:7860

## Understanding the Interface class

Note that to make your first demo, you created an instance of the gr.Interface class. The Interface class is designed to create demos for machine learning models that accept one or more inputs and return one or more outputs.

The Interface class has three core arguments:

- fn: The function to wrap a user interface (UI) around
- inputs: The Gradio component(s) to use for the input. The number of components should match the number of arguments in your function.
- outputs: The Gradio component(s) to use for the output. The number of components should match the number of return values from your function.
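The matching rule for `inputs` can be checked before building an interface. The helper below is a hypothetical convenience, not part of Gradio, and it is deliberately simple: it counts every parameter in the signature and does not account for defaults or `*args`.

```python
import inspect


def check_component_count(fn, inputs):
    """Raise ValueError if `fn` does not take exactly len(inputs) parameters."""
    n_params = len(inspect.signature(fn).parameters)
    if n_params != len(inputs):
        raise ValueError(
            f"{fn.__name__} takes {n_params} argument(s) "
            f"but {len(inputs)} input component(s) were given"
        )


def greet(name, intensity):
    return "Hello, " + name + "!" * int(intensity)


# Two parameters, two components: passes silently.
check_component_count(greet, ["text", "slider"])
```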

The fn argument is flexible: you can pass any Python function you want to wrap with a UI. In the example above, you saw a relatively simple function, but the function could be anything from a music generator to a tax calculator to the prediction function of a pretrained machine learning model.

The inputs and outputs arguments take one or more Gradio components. As we'll see, Gradio includes more than 30 built-in components (such as the gr.Textbox(), gr.Image(), and gr.HTML() components) that are designed for machine learning applications.

If your function accepts more than one argument, as is the case above, pass a list of input components to inputs, with each input component corresponding to one of the function's arguments in order. The same applies if your function returns more than one value: simply pass a list of components to outputs. This flexibility makes the Interface class a very powerful way to create demos.
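As an illustration of the multi-output case, here is a sketch in which the function returns two values and `outputs` lists two components in the same order. The function name `greet_with_length`, the labels, and the slider range are assumptions chosen for the example.

```python
def greet_with_length(name, intensity):
    """Return a greeting and its character count (two outputs)."""
    greeting = "Hello, " + name + "!" * int(intensity)
    return greeting, len(greeting)


def build_demo():
    """Build the interface; gradio is imported here so the function above works without it."""
    import gradio as gr

    return gr.Interface(
        fn=greet_with_length,
        # One input component per function argument, in order.
        inputs=[
            gr.Textbox(label="Name"),
            gr.Slider(minimum=1, maximum=10, step=1, value=2, label="Intensity"),
        ],
        # One output component per returned value, in order.
        outputs=[gr.Textbox(label="Greeting"), gr.Number(label="Length")],
    )


# To try it in the notebook:
# build_demo().launch()
```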

# BLIP + Gradio

In [None]:
import gradio as gr
from transformers import BlipProcessor, BlipForConditionalGeneration

In [None]:
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

In [None]:
def generate_caption(image):
    # `image` is already a PIL image (the component below uses type="pil"),
    # so it must not be passed through Image.open again
    inputs = processor(image.convert("RGB"), return_tensors="pt")
    outputs = model.generate(**inputs)
    caption = processor.decode(outputs[0], skip_special_tokens=True)
    return caption

def caption_image(image):
    """
    Takes a PIL image and returns a caption.
    """
    caption = generate_caption(image)
    return caption

In [None]:
iface = gr.Interface(
    fn=caption_image,
    # gr.Image replaces the older gr.inputs.Image; the BLIP processor handles resizing
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Image Captioning with BLIP",
    description="Upload an image to generate a caption."
)

In [None]:
iface.launch()