# BLIP Image Captioning
### Ashish Kumar

## Motivation
I chose this topic because it combines two powerful modalities, vision and language, in a single model. Studying how models like BLIP generate text from visual input helped me appreciate recent progress in multimodal learning.

## Historical Perspective on Multimodal Learning
Multimodal learning has advanced rapidly with models such as ViLBERT, CLIP, and BLIP, which learn aligned image and text representations. Trained jointly on large image-text datasets, these models perform well on tasks such as image captioning, visual question answering (VQA), and image-text retrieval.
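
To make "aligned image and text representations" more concrete, here is a minimal sketch of retrieval in a shared embedding space. The vectors are invented toy values, not outputs of CLIP or BLIP, and `cosine_similarity` and `retrieve_caption` are hypothetical helper names chosen for illustration. The sketch assumes that an aligned model places an image close to its matching caption, so retrieval becomes a nearest-neighbor search by cosine similarity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_caption(image_embedding, text_embeddings):
    """Return the caption whose embedding is most similar to the image."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_embedding, text_embeddings[caption]),
    )


# Toy 3-dimensional embeddings (invented values, for illustration only).
text_embeddings = {
    "a dog on a beach": [0.9, 0.1, 0.0],
    "a plate of pasta": [0.0, 0.2, 0.9],
}
dog_image_embedding = [0.8, 0.3, 0.1]

best = retrieve_caption(dog_image_embedding, text_embeddings)
print(best)
```

Real models use embeddings with hundreds of dimensions, but the comparison step works in the same way.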

## Learning from the Project
This project helped me understand how pretrained models work, how image and text embeddings are processed, and how to apply transformer-based architectures in practical applications like caption generation.

## Code Example
The following Python snippet uses BLIP to generate a caption from a local image.

```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# Load model and processor
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Load the image (update path accordingly)
image_path = r"C:\Users\ashish\Downloads\image.jpeg"
image = Image.open(image_path).convert("RGB")

# Preprocess and generate caption
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_length=50)
caption = processor.decode(out[0], skip_special_tokens=True)

print("Generated Caption:", caption)
```
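
The same steps can be wrapped in a small helper that captions several images in one call. This is a sketch rather than part of the original project: `caption_batch` is a hypothetical name, and it assumes `processor` and `model` are the `BlipProcessor` and `BlipForConditionalGeneration` objects loaded in the snippet above and that `images` is a list of RGB PIL images.

```python
def caption_batch(images, processor, model, max_new_tokens=50):
    """Caption a list of PIL images in a single generate() call.

    `processor` and `model` are assumed to be the BLIP processor and
    captioning model loaded earlier; they are passed in explicitly so
    the helper does not reload them on every call.
    """
    if not images:
        return []
    # The processor resizes every image to the same shape, so they batch cleanly.
    inputs = processor(images=images, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # batch_decode converts each generated token sequence back into text.
    captions = processor.batch_decode(output_ids, skip_special_tokens=True)
    return [caption.strip() for caption in captions]
```

With the objects from the snippet above, `caption_batch([image], processor, model)` should return a one-element list containing the caption.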

## Reflections
**What surprised me?** The quality and coherence of the captions the model generated without any fine-tuning.

**Scope for improvement:** Fine-tuning BLIP on a domain-specific dataset could further improve caption quality for images from that domain.
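
As a rough illustration of what fine-tuning could look like, the sketch below runs one gradient update. It was not part of this project and is untested on real data. `fine_tune_step` is a hypothetical name, and it assumes `model` is a `BlipForConditionalGeneration` in training mode, `optimizer` is a PyTorch optimizer over its parameters, and `batch` was built with something like `processor(images=..., text=captions, padding=True, return_tensors="pt")`.

```python
def fine_tune_step(model, optimizer, batch):
    """Run one gradient update on a batch of (image, caption) pairs.

    Assumes `batch` holds "pixel_values", "input_ids" and
    "attention_mask" tensors produced by the BLIP processor.
    """
    outputs = model(
        pixel_values=batch["pixel_values"],
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        labels=batch["input_ids"],  # the caption tokens double as the targets
    )
    loss = outputs.loss
    loss.backward()        # compute gradients of the captioning loss
    optimizer.step()       # update the model weights
    optimizer.zero_grad()  # clear gradients before the next batch
    return loss.item()
```

A full training run would call this in a loop over a data loader and track the returned loss; choices such as learning rate and number of epochs would depend on the dataset.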

## References
- [BLIP GitHub](https://github.com/salesforce/BLIP)
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [Visual Haystacks Blog](https://bair.berkeley.edu/blog/2024/07/20/visual-haystacks/)