In [None]:
!pip install torch transformers peft safetensors accelerate pillow

In [None]:
# Not possible to run on Colab
"""
This script generates a caption for an image using the **LLaVA (Large Language and Vision Assistant)** model.
LLaVA is a vision-language model designed for image understanding and multimodal tasks.

### Overall Process:
1. **Load the LLaVA processor and model** (`llava-hf/llava-1.5-7b-hf`), a pretrained vision-language model.
2. **Open and preprocess the image** using the PIL library.
3. **Convert the image and a text prompt into tensors** using the processor; the prompt must contain the `<image>` placeholder.
4. **Pass the processed inputs to LLaVA** to generate a caption.
5. **Decode and print the generated caption**.

### Libraries Used:
- `transformers`: Provides the LLaVA processor and model for image captioning.
- `PIL (Pillow)`: Handles image loading and processing.
- `torch`: Supports tensor operations required for model input and output.

### Notes:
- **LLaVA** is an instruction-tuned vision-language model; captioning is one of several tasks it can be prompted to perform.
- This script uses **LLaVA-1.5-7B**, a 7-billion-parameter model whose weights alone take roughly 14 GB in float16.
- The model expects a text prompt containing an `<image>` placeholder inside a `USER: ... ASSISTANT:` template; the instruction in that prompt determines what the model says about the image.

"""

from transformers import LlavaProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

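# Load LLaVA Processor & Model
model_id = "llava-hf/llava-1.5-7b-hf"
processor = LlavaProcessor.from_pretrained(model_id)
# float16 halves memory use versus the float32 default;
# device_map="auto" (provided by accelerate) places the weights on the available device(s).
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load Image (convert to RGB so grayscale or RGBA files are handled too)
image = Image.open("/content/20250326_232107.jpg").convert("RGB")

# Build the prompt: LLaVA-1.5 needs an <image> placeholder in a USER/ASSISTANT template
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

# Preprocess Image and Prompt, then move the tensors to the model's device
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

# Generate Caption (greedy decoding, capped at 64 new tokens)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens so the prompt is not echoed back
caption_ids = output_ids[0][inputs["input_ids"].shape[1]:]
caption = processor.decode(caption_ids, skip_special_tokens=True).strip()

print("Generated Caption:", caption)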
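
LLaVA-1.5 checkpoints are prompted with a `USER: <image>` ... `ASSISTANT:` template, and the decoded output can echo that prompt before the answer. The sketch below shows one way to build the prompt and to pull out only the answer text. The helper names `build_llava_prompt` and `extract_answer` are hypothetical and not part of `transformers`; the example question is an assumption as well.

```python
def build_llava_prompt(question: str) -> str:
    """Format a single-turn prompt in the LLaVA-1.5 USER/ASSISTANT layout.

    The <image> placeholder marks where the processor inserts the image tokens.
    """
    return f"USER: <image>\n{question.strip()} ASSISTANT:"


def extract_answer(decoded: str) -> str:
    """Return the text after the last 'ASSISTANT:' marker.

    If the marker is absent, the whole string is returned, stripped.
    """
    _, found, tail = decoded.rpartition("ASSISTANT:")
    return tail.strip() if found else decoded.strip()


# Hypothetical usage: the question text is only an example.
prompt = build_llava_prompt("Describe this image in one sentence.")
```

Keeping the template in one helper makes it easy to swap the question (for example, asking for a detailed description) without touching the rest of the script.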
