[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/marketing-and-customer-insight/AI_For_Marketing/blob/main/AI%20for%20Image%20Classification/Run_VisionLanguageModel.ipynb)

In Google Colab, select a GPU runtime first: Runtime -> Change runtime type -> T4 GPU.

# Vision Language Model Image Classification

This notebook demonstrates how to use a Vision Language Model (Phi-4-multimodal) for image classification. The workflow includes:
- Loading a pre-trained multimodal model that can process both images and text
- Using in-context learning with example images to classify new images
- Evaluating model performance on labeled data
- Making predictions on new, unlabeled images

The model uses few-shot learning: the prompt includes one labeled example image per class to guide its predictions.

## 1. Import Required Libraries

Import essential libraries for image processing, data manipulation, and model evaluation.

In [None]:
!pip install transformers==4.48.2 accelerate evaluate backoff --quiet
import os
import pandas as pd
from PIL import Image
from tqdm import tqdm
from evaluate import load

## 2. Configuration Settings

Set the paths for your dataset and where to save the evaluation results:
- **DATASET_PATH**: Path to your CSV file containing image paths and labels
- **PERFORMANCE_SUMMARY_PATH**: Where to save the performance metrics

In [None]:
"""

SETTINGS

"""

DATASET_PATH = './Datasets/Brand_Selfies/dataset.csv'
PERFORMANCE_SUMMARY_PATH = './evaluation_summary_vlm.csv'

Download the datasets from the GitHub repository with a sparse checkout, then move them to `/content`:

In [None]:
!git clone --depth 1 --filter=blob:none --sparse \
https://github.com/marketing-and-customer-insight/AI_For_Marketing.git
%cd AI_For_Marketing
!git sparse-checkout set "AI for Image Classification/Datasets"
!mv "AI for Image Classification/Datasets" /content
os.chdir('/content')

## 3. Load Vision Language Model and Define Helper Functions

This cell performs several important tasks:

**Metrics Computation** (`compute_weighted_metrics`):
- Computes precision, recall, F1, and accuracy metrics for predictions
- Drops predictions that do not match a known label before scoring, so the metrics cover only valid predictions
- Returns weighted metrics across all classes
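
Because predictions outside the known label set are dropped before scoring, the reported accuracy only reflects images for which the model returned a valid label. The following sketch, in plain Python with made-up labels, contrasts that behavior with a stricter variant that counts invalid predictions as errors. Both helper names (`accuracy_dropping_unknown`, `accuracy_strict`) are hypothetical and not part of the notebook.

```python
def accuracy_dropping_unknown(y_true, y_pred):
    """Accuracy over rows whose prediction is a known label (mirrors the drop step)."""
    known = set(y_true)
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if p in known]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


def accuracy_strict(y_true, y_pred):
    """Accuracy over all rows; an unknown prediction simply counts as wrong."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only.
y_true = ["cup", "cup", "logo", "none"]
y_pred = ["cup", "a coffee cup", "logo", "cup"]

lenient = accuracy_dropping_unknown(y_true, y_pred)
strict = accuracy_strict(y_true, y_pred)
print(f"dropping unknown: {lenient:.2f}, strict: {strict:.2f}")
```

If the two numbers differ noticeably on your data, the model is often answering outside the class list, and it may be worth reporting the share of dropped rows alongside the metrics.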

**Load Phi-4-Multimodal Model**:
- Loads the Phi-4-multimodal model from Hugging Face
- Uses GPU if available, falls back to CPU otherwise
- Phi-4-multimodal is a vision language model that accepts both images and text as input

**Few-Shot Learning Utilities**:
- `get_prompt_with_examples_phi()`: Creates a prompt that includes example images and their labels, enabling few-shot learning
- `predict_phi()`: Runs inference on a new image using the model with the example images as context
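
To make the prompt layout concrete, here is a minimal, dependency-free sketch of the same idea: one solved user/assistant turn per example image, followed by an open turn for the image to classify. The function name `build_few_shot_prompt` and the sample labels are assumptions for illustration; the chat tokens are the ones used in the notebook's own prompt builder.

```python
def build_few_shot_prompt(base_prompt, example_labels):
    """Build a Phi-style chat prompt: one solved turn per example, then an open turn."""
    turns = []
    for n, label in enumerate(example_labels, start=1):
        turns.append(
            f"<|user|><|image_{n}|>{base_prompt}<|end|>"
            f"<|assistant|>'{label}'<|end|>"
        )
    # The final turn refers to the image to classify and leaves the answer open.
    query_index = len(example_labels) + 1
    turns.append(f"<|user|><|image_{query_index}|>{base_prompt}<|end|><|assistant|>")
    return "".join(turns)


# Made-up prompt and labels for illustration only.
prompt = build_few_shot_prompt("Which class?", ["cup", "logo"])
print(prompt)
```

The placeholder numbers are positional: `<|image_1|>` refers to the first image passed to the processor, so the example images must come first and the query image last, which is the order `predict_phi()` uses.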

In [None]:
def compute_weighted_metrics(y_true: pd.Series, y_pred: pd.Series) -> dict:
    precision_metric = load("precision")
    recall_metric = load("recall")
    f1_metric = load("f1")
    accuracy_metric = load("accuracy")

    y_true = y_true.astype(str)
    y_pred = y_pred.astype(str)

    labels = sorted(y_true.unique().tolist())
    label2id = {label: i for i, label in enumerate(labels)}

    y_true = y_true.map(label2id)
    y_pred = y_pred.map(label2id)

    # Drop rows where prediction is not a known label
    mask = y_pred.notna()
    y_true = y_true[mask].astype(int)
    y_pred = y_pred[mask].astype(int)

    precision = precision_metric.compute(
        predictions=y_pred, references=y_true, average="weighted"
    )["precision"]
    recall = recall_metric.compute(
        predictions=y_pred, references=y_true, average="weighted"
    )["recall"]
    f1 = f1_metric.compute(
        predictions=y_pred, references=y_true, average="weighted"
    )["f1"]
    accuracy = accuracy_metric.compute(
        predictions=y_pred, references=y_true
    )["accuracy"]

    return {
        "accuracy": accuracy,
        "precision_weighted": precision,
        "recall_weighted": recall,
        "f1_weighted": f1,
    }

from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
    _attn_implementation='eager'
)
model.eval()
model.requires_grad_(False)

generation_config = GenerationConfig.from_pretrained(model_path)

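def get_prompt_with_examples_phi(base_prompt: str, df_example: pd.DataFrame) -> str:
    # One solved user/assistant turn per example image, then an open turn for the query image.
    # Placeholders are numbered by position (not by DataFrame index) so they always
    # match the order of the image list passed to the processor.
    prompt = ''
    for n, label in enumerate(df_example['label'], start=1):
        prompt = f"{prompt}<|user|><|image_{n}|>{base_prompt}<|end|><|assistant|>'{label}'<|end|>"
    prompt = f"{prompt}<|user|><|image_{len(df_example) + 1}|>{base_prompt}<|end|><|assistant|>"
    return prompt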

def predict_phi(inference_img_path:str, prompt:str, example_images:list):
    images = example_images.copy()
    image_predict = Image.open(inference_img_path).convert('RGB').resize((224, 224))
    images.append(image_predict)

    inputs = processor(text=prompt, images=images, return_tensors='pt').to(device)
    with torch.inference_mode():
        generate_ids = model.generate(
            **inputs,
            max_new_tokens=16,
            generation_config=generation_config,
        )
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
    return response

## 4. Load Your Dataset

Load the CSV file containing your labeled images. Your dataset CSV must contain:
- **image_path**: Path to each image file
- **label**: The correct class label for that image

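The cell checks that both columns are present, draws a random sample of 23 rows (3 few-shot examples plus 20 images for evaluation), and displays the first rows to confirm the data loaded correctly.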
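
Since one image per class is later set aside as a few-shot example, each class needs at least two rows for anything to remain for evaluation. The sketch below shows one way to check this with the standard library; the helper name `classes_with_too_few_rows`, its `min_per_class` parameter, and the sample labels are assumptions for illustration.

```python
from collections import Counter


def classes_with_too_few_rows(labels, min_per_class=2):
    """Return the classes that appear fewer than `min_per_class` times, sorted by name."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < min_per_class)


# Made-up labels for illustration only.
labels = ["cup", "cup", "logo", "none", "none"]
too_few = classes_with_too_few_rows(labels)
print(too_few)
```

In the notebook, the same helper could be called with the sampled `df['label']` column; a non-empty result suggests drawing a larger sample or a different `random_state`.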

In [None]:
"""

Loading your dataset

"""

df = pd.read_csv(DATASET_PATH)

required_cols = {'image_path', 'label'}
assert required_cols.issubset(df.columns), \
    'Please make sure that your dataset contains both columns: image_path and label.'

df = df.sample(23, random_state=1) # sample 23, 3 few-shot examples, 20 for inference

df.head()

## 5. Few-Shot Learning and Model Evaluation

This cell performs the main classification and evaluation:

1. **Extract Examples**: Selects one example image from each class to use for few-shot learning
2. **Create Prompt**: Builds a prompt that includes the example images and their labels
3. **Remove Examples from Evaluation Set**: Ensures examples don't appear in the test set
4. **Inference**: Runs the model on each remaining image, passing the example images as context
5. **Post-Processing**: Cleans up predictions by removing extra quotes and whitespace
6. **Save Predictions**: Saves predictions to a CSV file
7. **Compute Metrics**: Calculates performance metrics (accuracy, precision, recall, F1) and saves them

No model weights are updated in this step: the examples are supplied in the prompt at inference time, and the model infers the classification task from them.
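
The post-processing step can be sketched in isolation. The version below strips whitespace and quotes as described, and additionally matches the result against the class list case-insensitively; that extra matching is an assumption added for illustration, and `clean_prediction` is a hypothetical helper rather than a function from the notebook.

```python
def clean_prediction(raw, classes):
    """Strip whitespace and quotes, then match the known classes case-insensitively.

    Returns the canonical class name, or None if the text is not a known class.
    """
    if not isinstance(raw, str):
        return None
    text = raw.strip().strip("'\"").strip()
    lookup = {c.lower(): c for c in classes}
    return lookup.get(text.lower())


# Made-up classes and model outputs for illustration only.
classes = ["Cup", "Logo", "Other"]
matched = clean_prediction(" 'cup' ", classes)
unmatched = clean_prediction("a coffee cup", classes)
print(matched, unmatched)
```

Returning `None` for unmatched text keeps invalid answers visible, so they can be counted or inspected instead of silently disappearing during metric computation.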

In [None]:
classes = list(df.label.unique())

prompt = f'Assign the best fitting class to describe the image by choosing one of the following classes: {classes}'


# Draw one example per class; groupby().sample() keeps the 'label' column across pandas versions
df_examples = df.groupby('label').sample(n=1, random_state=1).reset_index(drop=True)
df = df[~df["image_path"].isin(df_examples["image_path"])].reset_index(drop=True)
assert not set(df_examples.image_path).intersection(set(df.image_path)), "The example images should not be part of the evaluation dataset."
n_examples = len(df_examples.label.unique())


prompt_phi = get_prompt_with_examples_phi(prompt, df_examples)
example_images = []
for index in df_examples.index:
    example_images.append(Image.open(df_examples.at[index, 'image_path']).convert('RGB').resize((224, 224)))

for i in tqdm(df.index):
    inference_image_path = df.at[i, 'image_path']
    df.at[i, 'Pred_Phi'] = predict_phi(inference_image_path, prompt_phi, example_images)
    torch.cuda.empty_cache()
    if i % 10 == 0:
        df.to_csv('dataset_vlm_predictions.csv', index=False)

df['Pred_Phi'] = df['Pred_Phi'].apply(lambda x: x.strip().strip("'\"") if isinstance(x, str) else x)
df.to_csv('./dataset_vlm_predictions.csv', index=False)

metrics_rows = []
for pred_col in [col for col in df.columns if 'Pred_' in col]:
    # Score only the rows that actually received a prediction
    mask = df[pred_col].notna()
    metrics = compute_weighted_metrics(df.loc[mask, "label"], df.loc[mask, pred_col])
    metrics_rows.append({"model": pred_col, **metrics})

df_metrics = pd.DataFrame(metrics_rows)
df_metrics.to_csv(PERFORMANCE_SUMMARY_PATH, index=False)

## 6. Prepare Unlabeled Images for Prediction

Collect the paths of all files in the directory containing new, unseen images that you want to classify. The paths are stored in a DataFrame for inference; the images themselves are opened later, one at a time.
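
A `*` glob pattern returns every file in the folder, including any non-image files that happen to be there. As a hedged alternative, the sketch below keeps only files whose extension looks like an image. The helper `list_image_paths` and the `IMAGE_SUFFIXES` set are assumptions for illustration; adjust the set to the formats in your own data.

```python
import tempfile
from pathlib import Path

# Assumed set of image extensions; extend or shrink as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def list_image_paths(folder):
    """Return sorted paths of files in `folder` whose suffix looks like an image."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.JPG", "b.png", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [Path(p).name for p in list_image_paths(tmp)]

print(found)
```

Sorting the paths also makes the order of rows in the prediction file reproducible across runs.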

In [None]:
import glob
prediction_image_paths = glob.glob('./Datasets/Brand_Selfies/Unseen_Samples/*')
df_predictions = pd.DataFrame(prediction_image_paths, columns=['image_path'])

## 7. Run Inference on Unlabeled Images

Run each unlabeled image through the pre-trained vision language model to generate predictions. The cell:
1. Reuses the example images and prompt from the evaluation step
2. Runs the model in inference mode on each new image
3. Cleans the predicted class label by stripping quotes and whitespace
4. Saves all predictions to a CSV file

The output CSV contains the image paths and their corresponding predicted labels.
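
When the folder contains many files, a single unreadable image would otherwise stop the whole loop. The sketch below shows one possible way to keep going and record a missing prediction instead. `predict_all` is a hypothetical wrapper and `fake_predict` is a stand-in for the real model call, so the example runs without a GPU; catching `OSError` is meant to cover unreadable or unidentifiable image files.

```python
def predict_all(image_paths, predict_fn):
    """Apply `predict_fn` to each path; record None when a single image fails."""
    results = {}
    for path in image_paths:
        try:
            results[path] = predict_fn(path)
        except (OSError, ValueError) as err:
            # For example an unreadable or corrupt file: report it and continue.
            print(f"Skipping {path}: {err}")
            results[path] = None
    return results


def fake_predict(path):
    """Stand-in for the model call; fails on non-image files for demonstration."""
    if path.endswith(".txt"):
        raise OSError("cannot identify image file")
    return "'cup'"


# Made-up file names for illustration only.
results = predict_all(["a.jpg", "notes.txt"], fake_predict)
print(results)
```

In the notebook, `predict_fn` could be a small lambda that calls `predict_phi` with the existing prompt and example images.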

In [None]:
for i in tqdm(df_predictions.index):
    df_predictions.at[i, 'Pred'] = predict_phi(df_predictions.at[i, 'image_path'], prompt_phi, example_images)
    torch.cuda.empty_cache()

df_predictions['Pred'] = df_predictions['Pred'].apply(lambda x: x.strip().strip("'\"") if isinstance(x, str) else x)
df_predictions.to_csv('./prediction_unseen_data_vlm.csv', index=False)