[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/marketing-and-customer-insight/AI_For_Marketing/blob/main/GenAI%20for%20Influencer%20Marketing/Phi_Captioning.ipynb)

In Google Colab, select a GPU runtime before running the notebook: Runtime -> Change runtime type -> T4 GPU.

# Phi-4 Multimodal Image Captioning for Influencer Content Analysis

This notebook demonstrates how to use Phi-4-multimodal-instruct, a Vision Language Model (VLM) from Microsoft, to automatically generate captions for images in a local folder.

## What is Phi-4 Multimodal?

Phi-4 Multimodal is a Microsoft model that combines vision and language understanding: given an image and a text prompt, it generates a text description that follows the prompt. The model is well suited to:

1. **Detailed Image Understanding**: Analyzing visual content with nuanced descriptions
2. **Instruction-Based Captioning**: Responding to specific prompts about images with tailored output
3. **Rich Context Generation**: Creating comprehensive descriptions that capture objects, styles, and semantic meaning

## Use Case for Influencer Marketing

Automatically generating captions for influencer content with Phi-4 helps you:
- Extract precise descriptions of visual elements in influencer posts
- Understand brand visibility and product placement in images
- Analyze whether images convey intended messaging with detailed context
- Process large volumes of influencer content programmatically
- Create comprehensive alternative text descriptions for accessibility
- Generate marketing insights from visual content analysis

## Workflow

This notebook will:
1. Load the Phi-4 multimodal model
2. Find all images in a local folder (`./Images/`)
3. Generate detailed captions for each image using Phi-4
4. Save results to a CSV file for analysis
5. Display sample results


## 1. Import Required Libraries and Setup

Import the necessary libraries for image processing, model loading, and data management.

In [None]:
!pip install transformers==4.48.2 accelerate evaluate backoff --quiet
import torch
import glob
import os
import pandas as pd
from PIL import Image
from tqdm import tqdm


# Check if GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")

Download the sample images from GitHub.

In [None]:
!git clone --depth 1 --filter=blob:none --sparse \
https://github.com/marketing-and-customer-insight/AI_For_Marketing.git
%cd AI_For_Marketing
!git sparse-checkout set "GenAI for Influencer Marketing/Images"
!mv "GenAI for Influencer Marketing/Images" /content

## 2. Configuration

Set the paths and parameters for the captioning task.

In [None]:
"""
Configuration Settings
"""
import os
os.chdir('/content')

# Path to folder containing images
IMAGE_FOLDER = './Images'

# File extensions to look for
IMAGE_EXTENSIONS = ['*.jpg', '*.jpeg', '*.png', '*.gif', '*.webp']

# Find all image files. With recursive=True, '**' matches zero or more
# directories, so one pattern per extension covers IMAGE_FOLDER and its subfolders.
image_paths = []
for extension in IMAGE_EXTENSIONS:
    image_paths.extend(glob.glob(os.path.join(IMAGE_FOLDER, '**', extension), recursive=True))

# Remove duplicates and sort for a reproducible processing order
image_paths = sorted(set(image_paths))

print(f"Found {len(image_paths)} images in {IMAGE_FOLDER}")
if len(image_paths) > 0:
    for img_path in image_paths:
        print(f"  - {img_path}")
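One limitation of the glob patterns above is that they are case-sensitive on Linux (which Colab runs on), so a file named `photo.JPG` would be skipped. The following is a minimal sketch of a case-insensitive alternative; `find_images` and `IMAGE_SUFFIXES` are hypothetical names introduced here, not part of the original notebook.

```python
from pathlib import Path

# Assumed set of suffixes, mirroring IMAGE_EXTENSIONS in the configuration cell
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def find_images(folder):
    """Return sorted paths of image files under `folder`, including subfolders.

    Suffixes are compared in lowercase, so 'photo.JPG' and 'photo.jpg' both match.
    """
    root = Path(folder)
    return sorted(
        str(path)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

If this variant suits your data, `image_paths = find_images(IMAGE_FOLDER)` could stand in for the glob loop.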

## 3. Load the Phi-4 Multimodal Vision Language Model

Download and load the pre-trained Phi-4 multimodal model. The first run may take several minutes because the weights must be downloaded (roughly 11 GB at 16-bit precision for the 5.6B-parameter model). The model is loaded in float16, which halves GPU memory use relative to float32.
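The download size and GPU footprint of the weights can be sanity-checked from the parameter count and the bytes used per parameter. The sketch below does that arithmetic; the 5.6 billion parameter figure is an assumption taken from the public model card, and `estimate_weight_gb` is a hypothetical helper. It covers weights only, not activations or the generation cache.

```python
# Bytes needed to store one parameter in each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def estimate_weight_gb(num_params, dtype="float16"):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# ASSUMPTION: parameter count as reported on the model card, not measured here
NUM_PARAMS = 5.6e9

for dtype in ("float32", "float16"):
    print(f"{dtype}: ~{estimate_weight_gb(NUM_PARAMS, dtype):.1f} GB")
```

Comparing the float16 estimate with the memory of your GPU gives a quick indication of whether the model will fit before you start the download.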


In [None]:
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
import torch

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    _attn_implementation='eager'
)
model.eval()
model.requires_grad_(False)

generation_config = GenerationConfig.from_pretrained(model_path)

def predict_phi(inference_img_path: str, prompt: str, max_tokens: int = 128) -> str:
    """Caption one image and return the model's text response to `prompt`."""
    # Load as RGB and resize to a fixed 224x224 input (aspect ratio is not preserved)
    with Image.open(inference_img_path) as img:
        image_predict = img.convert('RGB').resize((224, 224))

    inputs = processor(text=prompt, images=image_predict, return_tensors='pt').to(device)
    with torch.inference_mode():
        generate_ids = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            generation_config=generation_config,
        )
    # Drop the prompt tokens so that only the newly generated text is decoded
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response
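`predict_phi` resizes every image to 224x224, which stretches or squashes anything that is not square. If that distortion affects caption quality for your images, one option is to compute a target size that keeps the aspect ratio. The sketch below is an illustration only: `fit_within` and the default of 448 pixels are hypothetical choices, and larger inputs generally cost more memory and time.

```python
def fit_within(width, height, max_side=448):
    """Scale (width, height) so the longer side equals max_side.

    The aspect ratio is preserved; each side is at least 1 pixel.
    Note that images smaller than max_side are scaled up as well.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = max_side / max(width, height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With a PIL image `img`, the call would look like `img.resize(fit_within(*img.size))` in place of `resize((224, 224))`.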

## 5. Generate Captions for All Images

Process each image through the model and generate captions. This may take several minutes depending on the number of images and your GPU.

In [None]:
# Output file to save results
OUTPUT_CSV = 'image_captions.csv'

PROMPT = "<|user|><|image_1|>Describe all objects and visual style of the image in detail.<|end|><|assistant|>"
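The prompt wraps the instruction in the model's chat markers, with `<|image_1|>` standing in for the image. To try other instructions without retyping those markers, a small helper can build the string. This is a sketch: `build_prompt` and the extra instructions in `PROMPTS` are hypothetical examples tied to the use cases listed earlier, not prompts from the original notebook.

```python
def build_prompt(instruction):
    """Wrap an instruction in the single-image chat format used by PROMPT."""
    return f"<|user|><|image_1|>{instruction.strip()}<|end|><|assistant|>"


# Hypothetical task-specific prompts; adjust the wording to your own analysis
PROMPTS = {
    "describe": build_prompt("Describe all objects and visual style of the image in detail."),
    "brands": build_prompt("List any brand names, logos, or products visible in the image."),
    "alt_text": build_prompt("Write a one-sentence alternative text description of the image."),
}
```

`PROMPTS["describe"]` is identical to `PROMPT`, so passing it to `predict_phi` should reproduce the default behaviour.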

In [None]:
results = []
num_failed = 0

print(f"\nGenerating captions for {len(image_paths)} images...")

for image_path in tqdm(image_paths):
    try:
        caption = predict_phi(image_path, PROMPT, max_tokens=128)

        # Store result
        results.append({
            'image_path': image_path,
            'caption': caption
        })

    except Exception as e:
        # Record the failure in the caption column so every image keeps a row in the CSV
        num_failed += 1
        print(f"\nError processing {image_path}: {e}")
        results.append({
            'image_path': image_path,
            'caption': f"Error: {e}"
        })

print(f"\nGenerated captions for {len(results) - num_failed} of {len(results)} images ({num_failed} failed)")

# Convert to DataFrame
df_results = pd.DataFrame(results)

# Save to CSV
df_results.to_csv(OUTPUT_CSV, index=False)
print(f"Results saved to: {OUTPUT_CSV}")

# Display summary
print("\nFirst 5 results:")
df_results.head()
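Once the captions are saved, a first pass at the brand-visibility use case is to set aside failed rows and flag captions that mention terms of interest. The sketch below works on plain dictionaries with the same `image_path` and `caption` keys as the results above. The function names and keywords are hypothetical, and keyword matching on captions is only a rough proxy for what is actually visible in an image.

```python
import re


def split_results(rows):
    """Separate successful captions from rows whose caption holds an error message."""
    ok = [r for r in rows if not r["caption"].startswith("Error: ")]
    failed = [r for r in rows if r["caption"].startswith("Error: ")]
    return ok, failed


def flag_keywords(rows, keywords):
    """Return copies of rows with one boolean column per keyword.

    A keyword counts as mentioned if it appears as a whole word, ignoring case.
    """
    flagged = []
    for row in rows:
        entry = dict(row)
        for kw in keywords:
            pattern = r"\b" + re.escape(kw) + r"\b"
            entry[f"mentions_{kw}"] = re.search(pattern, row["caption"], re.IGNORECASE) is not None
        flagged.append(entry)
    return flagged
```

In the notebook, `rows = df_results.to_dict("records")` provides the input, and `pd.DataFrame(flag_keywords(ok, ["logo", "bottle"]))` turns the flags back into a table.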

## 7. Display Sample Results

View captions alongside their images to verify quality.

In [None]:
from IPython.display import Image as IPImage, display

# Display first 5 images with their captions
num_samples = min(5, len(df_results))

print(f"Displaying first {num_samples} results:\n")

for idx in range(num_samples):
    row = df_results.iloc[idx]
    image_path = row['image_path']
    caption = row['caption']

    print(f"\n{'='*60}")
    print(f"Image {idx+1}: {os.path.basename(image_path)}")
    print(f"Caption: {caption}")
    print(f"Full path: {image_path}")

    # Display image if running in Jupyter
    try:
        display(IPImage(filename=image_path, width=400))
    except Exception:
        # Catch only ordinary errors; a bare except would also swallow KeyboardInterrupt
        print("(Image preview not available in terminal)")