[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oviya-raja/ist-402/blob/main/learning-path/W08/W8_image_caption.ipynb)

---

# Image Caption Generator (BLIP)

## Overview
This notebook implements an image captioning system using the BLIP (Bootstrapping Language-Image Pre-training) model to generate accurate, descriptive captions for uploaded images.

## Architecture
BLIP uses a vision-language transformer architecture:
- **Vision Encoder**: Processes image into visual features
- **Multimodal Fusion**: Combines visual and textual representations  
- **Language Decoder**: Generates caption text autoregressively

## Features
- **Automatic Captioning**: Generate descriptive captions for any image
- **Multiple Format Support**: JPG, JPEG, PNG
- **Fast Processing**: ~2-5 seconds per image
- **User-Friendly Interface**: Simple upload and view workflow
- **Robust Handling**: Works with various image sizes and aspect ratios

## Usage
1. Run the cell below to install dependencies and launch the app
2. Upload an image using the file uploader
3. View the automatically generated caption
4. Upload additional images to test with different types

## Technical Stack
- **Model**: Salesforce BLIP (blip-image-captioning-base)
- **Image Processing**: PIL (Pillow)
- **UI Framework**: Streamlit
- **Deep Learning**: Transformers (Hugging Face)

In [1]:
# =====================================================
#  BLIP Image Caption Generator ‚Äî Local Version (FIXED)
# =====================================================
# This cell installs dependencies and launches the Streamlit app
# type: ignore

# Install dependencies
# - streamlit: Web interface framework
# - transformers: Hugging Face library for BLIP model
# - pillow: Image processing
# - pyngrok: For public URL tunneling
%pip install -q streamlit transformers pillow torch torchvision pyngrok

# Save Streamlit app as a Python script
# The app includes:
# - Image upload and processing
# - BLIP model loading (cached for efficiency)
# - Caption generation and display
app_code: str = """
import streamlit as st
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

@st.cache_resource
def load_model():
    '''
    Load BLIP model and processor.
    Cached to avoid reloading on every interaction.
    First run downloads the model (~1GB).
    '''
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    return processor, model

processor, model = load_model()

st.title("üñºÔ∏è Image to Caption Generator (BLIP Model)")
st.markdown("Upload an image to generate an automatic caption using AI.")

uploaded_file = st.file_uploader("Upload an Image", type=["jpg", "jpeg", "png"], 
                                 help="Supported formats: JPG, JPEG, PNG")

if uploaded_file is not None:
    # Load and display image
    image = Image.open(uploaded_file).convert("RGB")  # Convert to RGB for compatibility
    st.image(image, caption="Uploaded Image")
    
    # Generate caption
    with st.spinner("Generating caption..."):
        inputs = processor(image, return_tensors="pt")  # Preprocess image
        out = model.generate(**inputs)  # Generate caption tokens
        caption = processor.decode(out[0], skip_special_tokens=True)  # Decode to text

    st.subheader("üìù Generated Caption:")
    st.success(caption)
    
    # Additional info
    st.caption("üí° Tip: Try different types of images (nature, objects, people, scenes) to see the model's capabilities!")
"""

# Write app.py with error handling
try:
    with open("app.py", "w", encoding="utf-8") as f:
        f.write(app_code)
    print("‚úÖ app.py generated successfully")
except Exception as e:
    print(f"‚ùå Failed to write app.py: {e}")
    raise

# Setup ngrok for public URL
from pyngrok import ngrok

# ‚ö†Ô∏è IMPORTANT: Set your ngrok token here
# Get it from: https://dashboard.ngrok.com/get-started/your-authtoken
# Replace with your own token for public access
NGROK_TOKEN = "3443vHI71ODZeUY6WQUeBW45KG7_HL7SDdKFz6uty9yqd8Cg"  # ‚ö†Ô∏è CHANGE THIS!

if NGROK_TOKEN == "YOUR_TOKEN_HERE":
    print("\n‚ùå ERROR: Please set your ngrok token!")
    print("   1. Go to: https://dashboard.ngrok.com/get-started/your-authtoken")
    print("   2. Copy your token")
    print("   3. Replace 'YOUR_TOKEN_HERE' in the code above")
    raise SystemExit

try:
    ngrok.set_auth_token(NGROK_TOKEN)
    print("‚úÖ ngrok token configured")
except Exception as e:
    print(f"‚ö†Ô∏è Warning: Could not set ngrok token: {e}")
    print("   Continuing without ngrok (local access only)")

# Kill existing tunnels
try:
    for tunnel in ngrok.get_tunnels():
        ngrok.disconnect(tunnel.public_url)
except:
    pass

# Start Streamlit locally
import subprocess
import time
import os
import sys

# Kill any existing streamlit on port 8501
try:
    if os.name == 'nt':  # Windows
        os.system('netstat -ano | findstr :8501')
    else:  # macOS/Linux
        os.system('lsof -ti:8501 | xargs kill -9 2>/dev/null || true')
except:
    pass

# Start Streamlit
print("\nüöÄ Starting Streamlit...")
try:
    if sys.platform.startswith('win'):
        subprocess.Popen(
            [sys.executable, "-m", "streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            creationflags=subprocess.CREATE_NEW_CONSOLE
        )
    else:
        subprocess.Popen(
            ["streamlit", "run", "app.py", "--server.port", "8501", "--server.headless", "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            start_new_session=True
        )
    
    time.sleep(5)  # Give Streamlit time to start
    print("‚úÖ Streamlit started!")
    
except Exception as e:
    print(f"‚ö†Ô∏è Error starting Streamlit: {e}")
    print("   You can start it manually with: streamlit run app.py")

# Create ngrok tunnel
print("\nüåê Creating public URL with ngrok...")
try:
    public_url = ngrok.connect(8501)
    print("\n" + "="*60)
    print("‚úÖ SUCCESS! Your app is running!")
    print("="*60)
    print(f"\nüåê Public URL (share this):")
    print(f"   {public_url}")
    print(f"\nüè† Local URL:")
    print(f"   http://localhost:8501")
    print(f"\nüìå Tips:")
    print(f"   ‚Ä¢ Keep this notebook running")
    print(f"   ‚Ä¢ Upload images to test the caption generator")
    print(f"   ‚Ä¢ Try different image types (nature, objects, people, scenes)")
    print("\n" + "="*60)
    
except Exception as e:
    print(f"\n‚ö†Ô∏è Could not create ngrok tunnel: {e}")
    print("\nüìå App is running locally at: http://localhost:8501")
    print("   (ngrok tunnel failed, but local access works)")
    print("\nüîß Troubleshooting:")
    print("   1. Check your ngrok token is correct")
    print("   2. Make sure you replaced 'YOUR_TOKEN_HERE'")
    print("   3. Try restarting the kernel and running again")


Note: you may need to restart the kernel to use updated packages.
‚úÖ app.py generated successfully
üöÄ Starting Streamlit...
   This will open in your default browser
   If it doesn't open automatically, go to: http://localhost:8501

‚úÖ Streamlit should be running!
   üåê Open: http://localhost:8501

   üí° To stop Streamlit, run: pkill -f streamlit

‚úÖ Setup complete! Streamlit is running locally.


## Example Usage

### Step 1: Upload Image
Click "Upload an Image" and select a JPG, JPEG, or PNG file.

### Step 2: View Caption
The system automatically:
1. Processes the image
2. Generates a descriptive caption
3. Displays the result

### Testing with Different Image Types

**Nature/Landscape Images:**
- Mountain scenes, beaches, forests
- Expected: Captions describing scenery, natural elements

**People/Portrait Images:**
- Portraits, group photos, candid shots
- Expected: Captions identifying people and activities

**Object Images:**
- Products, everyday items, food
- Expected: Precise object identification

**Scene/Activity Images:**
- Street scenes, indoor settings, activities
- Expected: Captions describing setting and context

### Understanding the Output
- **Accuracy**: BLIP excels at identifying common objects and scenes
- **Detail Level**: Captions are generally accurate but may be somewhat generic
- **Limitations**: May miss fine details or abstract/artistic elements

### Testing Limitations

To demonstrate BLIP's limitations (missing fine details or abstract elements), try these images from `data/`:

**Best images to show limitations:**
1. **`scene_street.jpg`** - Complex scenes with many details
   - Look for: Missing specific objects, people, or activities in the background
   - BLIP may give a generic "street scene" description rather than detailed elements

2. **`scene_activity.jpg`** - Activity scenes with multiple elements
   - Look for: Generic activity description vs. specific actions or interactions
   - May miss subtle interactions or specific activities happening

3. **`people_group.jpg`** - Group photos with multiple people
   - Look for: Generic "group of people" vs. specific number, poses, or interactions
   - May miss individual details or relationships between people

4. **`object_everyday.jpg`** - Everyday objects with fine details
   - Look for: Generic object identification vs. specific brand, model, or detailed features
   - May miss subtle design elements or specific characteristics

**How to test:**
1. Upload one of these images in the Streamlit app
2. Compare the generated caption with what you actually see
3. Note what specific details, objects, or elements are missing
4. The caption will likely be accurate but generic, missing fine-grained details

### Tips
- Use clear, well-lit images for best results
- The model works with various image sizes (auto-resized internally)
- Processing time: ~2-5 seconds per image
- For limitation testing: Choose complex scenes or images with many fine details