# Junk Food Detection with CLIP using ViT

This notebook implements a zero-shot, multi-label junk food image classifier using CLIP (OpenCLIP ViT-B/16). It downloads a COCO-format dataset from Roboflow, derives per-image labels from the annotations, scores every image against text prompts for each class, and evaluates the predictions with subset accuracy, micro F1, and macro F1.

## Before you start

Make sure you have access to a GPU. If the `nvidia-smi` cell below fails, navigate to `Edit` -> `Notebook settings` -> `Hardware accelerator`, set it to `GPU`, click `Save`, and try again.

In [None]:
!nvidia-smi

Sun Dec 28 08:45:17 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   65C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
import os
HOME = os.getcwd()
print("HOME:", HOME)
!mkdir -p {HOME}/datasets
%cd {HOME}/datasets

HOME: /content
/content/datasets


## Install packages using pip


In [None]:
!pip install roboflow==1.2.11 open-clip-torch==3.2.0 pillow==11.3.0 torch==2.9.0 torchvision==0.24.0


## Download dataset from Roboflow

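The cell below reads the Roboflow API key, workspace ID, project ID, and dataset version from Colab secrets (`ROBOFLOW_API_KEY`, `ROBOFLOW_WORKSPACE_ID`, `ROBOFLOW_PROJECT_ID`, `ROBOFLOW_DATASET_VERSION`). Add those secrets with your own values before running it.

The dataset is downloaded from Roboflow in COCO format.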
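
For orientation, a COCO annotation file is a JSON object whose `images`, `annotations`, and `categories` lists are the parts this notebook relies on. The sketch below uses a tiny hand-written example (the file names and IDs are made up, not taken from the dataset) and a hypothetical helper, `labels_per_image`, to show how a set of labels per image can be derived from those three lists.

```python
from collections import defaultdict

# Hypothetical, hand-written COCO snippet (not taken from the real dataset).
coco_example = {
    "categories": [
        {"id": 1, "name": "pizza"},
        {"id": 2, "name": "soda"},
    ],
    "images": [
        {"id": 10, "file_name": "a.jpg"},
        {"id": 11, "file_name": "b.jpg"},
    ],
    "annotations": [
        {"image_id": 10, "category_id": 1},
        {"image_id": 10, "category_id": 2},
        {"image_id": 10, "category_id": 1},  # a second pizza in the same image
        {"image_id": 11, "category_id": 2},
    ],
}


def labels_per_image(coco):
    """Return {file_name: sorted list of unique category names}."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    id_to_file = {i["id"]: i["file_name"] for i in coco["images"]}
    labels = defaultdict(set)
    for ann in coco["annotations"]:
        labels[id_to_file[ann["image_id"]]].add(id_to_name[ann["category_id"]])
    return {name: sorted(cats) for name, cats in labels.items()}


example_labels = labels_per_image(coco_example)
print(example_labels)
```

Duplicate annotations of the same category collapse into one label, which is why the classification task here is multi-label per image rather than per bounding box.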

In [None]:
from roboflow import Roboflow
from google.colab import userdata

rf = Roboflow(api_key=userdata.get('ROBOFLOW_API_KEY'))
project = rf.workspace(userdata.get('ROBOFLOW_WORKSPACE_ID')).project(userdata.get('ROBOFLOW_PROJECT_ID'))
version = project.version(userdata.get('ROBOFLOW_DATASET_VERSION'))
dataset = version.download("coco")

DATASET_PATH = "datasets/Junk-Food-Detection-10/"

loading Roboflow workspace...
loading Roboflow project...


Downloading Dataset Version Zip in Junk-Food-Detection-10 to coco:: 100%|██████████| 293482/293482 [00:17<00:00, 17259.73it/s]

Extracting Dataset Version Zip to Junk-Food-Detection-10 in coco:: 100%|██████████| 5280/5280 [00:01<00:00, 4384.40it/s]


In [None]:
%cd {HOME}

/content


## Collect classes


In [None]:
import json

folders = ['train', 'test', 'valid']
all_categories = {}

for folder in folders:
    json_path = DATASET_PATH + folder + "/_annotations.coco.json"

    try:
        with open(json_path, 'r') as f:
            data = json.load(f)
            categories = data.get('categories', [])

            for category in categories:
                cat_id = category['id']
                cat_name = category['name']
                cat_supercategory = category.get('supercategory', 'none')

                # Skip the catch-all 'junk-food' category; it carries no specific label in this dataset
                if cat_name == 'junk-food':
                    continue

                if cat_name not in all_categories:
                    all_categories[cat_name] = {
                        'id': cat_id,
                        'name': cat_name,
                        'supercategory': cat_supercategory
                    }

    except FileNotFoundError:
        print(f"Warning: {json_path} not found")
    except json.JSONDecodeError:
        print(f"Error: {json_path} is not a valid JSON file")

NATURAL_LANGUAGE_TO_CLASS_MAP = {
  'junk food': 'junk-food',
  'french fries': 'french_fries',
  'fried chicken': 'fried_chicken',
  'hamburger': 'hamburger',
  'ice cream': 'ice_cream',
  'junk food logo': 'junk_food_logo',
  'pizza': 'pizza',
  'soda': 'soda'
}
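CLASS_TO_NATURAL_LANGUAGE_MAP = {v: k for k, v in NATURAL_LANGUAGE_TO_CLASS_MAP.items()}

# The class list comes from the hand-written map above, not from all_categories,
# so it still contains 'junk-food' even though the loop skips that category.
classes_from_dataset = list(CLASS_TO_NATURAL_LANGUAGE_MAP)

print("Classes:", classes_from_dataset)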

Classes: ['junk-food', 'french_fries', 'fried_chicken', 'hamburger', 'ice_cream', 'junk_food_logo', 'pizza', 'soda']
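
The cell above collects `all_categories` from the annotation files but builds the class list from the hand-written map, so the two can drift apart unnoticed. A consistency check along the following lines may help. This is only a sketch: `find_mapping_gaps` is a hypothetical helper, and the inputs are illustrative rather than the real dataset categories.

```python
def find_mapping_gaps(dataset_names, class_to_prompt):
    """Compare dataset category names against a class -> prompt map.

    Returns (unmapped, unused): names present in the dataset but missing
    from the map, and names in the map that the dataset never uses.
    """
    dataset_names = set(dataset_names)
    mapped_names = set(class_to_prompt)
    unmapped = sorted(dataset_names - mapped_names)
    unused = sorted(mapped_names - dataset_names)
    return unmapped, unused


# Illustrative inputs only; in the notebook you would pass
# all_categories.keys() and CLASS_TO_NATURAL_LANGUAGE_MAP instead.
example_dataset_names = ["pizza", "soda", "hot_dog"]
example_map = {"pizza": "pizza", "soda": "soda", "ice_cream": "ice cream"}

unmapped, unused = find_mapping_gaps(example_dataset_names, example_map)
print("In dataset but not in map:", unmapped)
print("In map but not in dataset:", unused)
```

Applied to the notebook's own variables, `junk-food` would be reported as unused, because the collection loop skips it while the map still lists it.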


## Prepare OpenCLIP model

In [None]:
import torch
import open_clip
from PIL import Image
from collections import defaultdict
import json

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16",
    pretrained="laion2b_s34b_b88k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model = model.to(device)
model.eval()

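def classify_images_with_clip(dataset_part, classes):
    """Score every image in one dataset split against each class prompt.

    Returns (results, ground_truth, image_paths), where results maps each
    file name to its predicted labels and per-class cosine similarities.
    """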
    with open(DATASET_PATH + dataset_part + "/" + "_annotations.coco.json", "r") as f:
        coco_data = json.load(f)

    # Build mappings
    category_id_to_name = {cat["id"]: cat["name"] for cat in coco_data["categories"]}
    image_id_to_filename = {img["id"]: img["file_name"] for img in coco_data["images"]}

    # Group annotations by image_id
    image_annotations = defaultdict(list)
    for ann in coco_data["annotations"]:
        image_id = ann["image_id"]
        category_name = category_id_to_name[ann["category_id"]]
        image_annotations[image_id].append(category_name)

    # Get ground truth labels per image (unique categories)
    ground_truth = {}
    for image_id, categories in image_annotations.items():
        filename = image_id_to_filename[image_id]
        unique_categories = list(set(categories))
        ground_truth[filename] = unique_categories

    # Prompt ensembling
    templates = [
        "a photo of {}",
        "a picture of {}",
        "an image containing {}",
        "a close-up photo of {}",
        "an ad containing {}"
    ]

    # Encode text features once
    with torch.no_grad():
        text_features = []

        for cls in classes:
            prompts = [t.format(CLASS_TO_NATURAL_LANGUAGE_MAP[cls]) for t in templates]
            tokens = tokenizer(prompts).to(device)

            embeddings = model.encode_text(tokens)
            embeddings /= embeddings.norm(dim=-1, keepdim=True)

            class_embedding = embeddings.mean(dim=0)
            class_embedding /= class_embedding.norm()

            text_features.append(class_embedding)

        text_features = torch.stack(text_features)  # [num_classes, embed_dim]

    # Multi-label threshold (cosine similarity space)
    classification_threshold = 0.2  # Adjust this based on your data

    results = {}
    image_paths = [img["file_name"] for img in coco_data["images"]]

    for image_path in image_paths:
        try:
            image = Image.open(DATASET_PATH + dataset_part + "/" + image_path).convert("RGB")
            image_tensor = preprocess(image).unsqueeze(0).to(device)

            with torch.no_grad():
                image_features = model.encode_image(image_tensor)
                image_features /= image_features.norm(dim=-1, keepdim=True)

                # Cosine similarity per class
                similarities = (image_features @ text_features.T).squeeze(0).cpu()

            # Independent multi-label decision
            predicted_labels = []
            predicted_scores = {}

            for idx, score in enumerate(similarities):
                score_value = score.item()
                predicted_scores[classes[idx]] = score_value

                if score_value >= classification_threshold:
                    predicted_labels.append(classes[idx])

            # Store all predicted labels and scores
            results[image_path] = {
                "labels": predicted_labels,
                "scores": predicted_scores
            }

        except FileNotFoundError as err:
            # Abort on the first missing image and keep the original traceback
            raise FileNotFoundError(
                f"Image file not found: {image_path}. Cannot continue processing."
            ) from err

    return results, ground_truth, image_paths
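
To make the prompt-ensembling and thresholding steps concrete, here is a minimal pure-Python sketch of the same arithmetic: normalize each prompt embedding, average them, re-normalize, then compare with the image embedding by cosine similarity. The 2-D vectors and the 0.5 threshold are made up for illustration; real CLIP embeddings are much higher-dimensional and the function above uses 0.2.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble(prompt_embeddings):
    """Average unit-normalized prompt embeddings, then re-normalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def cosine(a, b):
    """For unit vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy "embeddings", two prompts per class (values are illustrative only).
class_prompts = {
    "pizza": [[1.0, 0.0], [0.8, 0.6]],
    "soda": [[0.0, 1.0], [0.0, 2.0]],
}
class_embeddings = {name: ensemble(vecs) for name, vecs in class_prompts.items()}

image_embedding = normalize([3.0, 1.0])
threshold = 0.5  # illustrative value, not the 0.2 used above

scores = {name: cosine(image_embedding, emb) for name, emb in class_embeddings.items()}
predicted = [name for name, score in scores.items() if score >= threshold]
print(scores, predicted)
```

Each class is thresholded independently, so an image can receive zero, one, or several labels.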



## Metrics

In [None]:
def evaluate_predictions(image_paths, ground_truth, results, classes):
    """Compute multi-label metrics; all returned values are percentages (0-100)."""
    # Per-class metrics
    class_metrics = {cls: {'tp': 0, 'fp': 0, 'fn': 0} for cls in classes}

    # Overall metrics (micro-averaged)
    total_tp = 0
    total_fp = 0
    total_fn = 0

    for image_path in image_paths:
        true_labels = set(ground_truth.get(image_path, []))
        pred_entry = results.get(image_path, {})
        pred_labels = set(pred_entry.get("labels", []))

        # Evaluate each class independently
        for cls in classes:
            is_actual = cls in true_labels
            is_predicted = cls in pred_labels

            if is_actual and is_predicted:
                class_metrics[cls]['tp'] += 1
                total_tp += 1
            elif not is_actual and is_predicted:
                class_metrics[cls]['fp'] += 1
                total_fp += 1
            elif is_actual and not is_predicted:
                class_metrics[cls]['fn'] += 1
                total_fn += 1

    # Calculate macro-F1
    macro_f1 = 0
    for cls in classes:
        tp = class_metrics[cls]['tp']
        fp = class_metrics[cls]['fp']
        fn = class_metrics[cls]['fn']

        precision = (tp / (tp + fp)) if (tp + fp) > 0 else 0
        recall = (tp / (tp + fn)) if (tp + fn) > 0 else 0
        f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0
        macro_f1 += f1

    macro_f1 = (macro_f1 / len(classes) * 100) if len(classes) > 0 else 0

    # Calculate micro-F1
    micro_precision = (total_tp / (total_tp + total_fp)) if (total_tp + total_fp) > 0 else 0
    micro_recall = (total_tp / (total_tp + total_fn)) if (total_tp + total_fn) > 0 else 0
    micro_f1 = (2 * micro_precision * micro_recall / (micro_precision + micro_recall)) if (micro_precision + micro_recall) > 0 else 0

    # Calculate subset accuracy (exact match)
    exact_matches = sum(
        1 for path in image_paths
        if set(ground_truth.get(path, [])) == set(results.get(path, {}).get("labels", []))
    )
    subset_accuracy = (exact_matches / len(image_paths) * 100) if len(image_paths) > 0 else 0

    return {
        'micro_f1_score': micro_f1 * 100,
        'macro_f1_score': macro_f1,
        'subset_accuracy': subset_accuracy
    }
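
As a worked example of how the two averages differ, the sketch below computes macro and micro F1 from made-up counts for one frequent and one rare class. The helper `f1` and the numbers are illustrative and independent of the function above.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both 0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up per-class counts: one frequent class, one rare class.
counts = {
    "pizza": {"tp": 8, "fp": 2, "fn": 2},
    "soda": {"tp": 1, "fp": 0, "fn": 3},
}

# Macro: average the per-class F1 scores, so every class weighs the same.
macro = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, so frequent classes dominate.
micro = f1(
    sum(c["tp"] for c in counts.values()),
    sum(c["fp"] for c in counts.values()),
    sum(c["fn"] for c in counts.values()),
)
print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

Here the per-class scores are 0.8 and 0.4, giving a macro F1 of 0.6, while pooling the counts gives a micro F1 of 0.72, because the frequent class contributes most of the pooled counts.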

## Prediction on validation set

In [None]:
valid_results, valid_ground_truth, valid_image_paths = classify_images_with_clip(
    dataset_part="valid",
    classes=classes_from_dataset,
)

valid_metrics = evaluate_predictions(
    image_paths=valid_image_paths,
    ground_truth=valid_ground_truth,
    results=valid_results,
    classes=classes_from_dataset,
)

print(f"Subset Accuracy: {valid_metrics['subset_accuracy']:.2f}%")
print(f"Micro F1: {valid_metrics['micro_f1_score']:.2f}%")
print(f"Macro F1: {valid_metrics['macro_f1_score']:.2f}%")

Subset Accuracy: 43.18%
Micro F1: 44.60%
Macro F1: 48.56%
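
The fixed threshold of 0.2 is marked as something to adjust for your data. Because the results keep the raw per-class scores, one possible approach is to sweep candidate thresholds on the validation split without re-running CLIP. The sketch below assumes a single threshold shared by all classes; `micro_f1_at` and `sweep_thresholds` are hypothetical helpers, and the toy scores are made up.

```python
def micro_f1_at(threshold, scores_by_image, truth_by_image, classes):
    """Micro-averaged F1 when every class uses the same score threshold."""
    tp = fp = fn = 0
    for image, scores in scores_by_image.items():
        truth = set(truth_by_image.get(image, []))
        for cls in classes:
            predicted = scores[cls] >= threshold
            actual = cls in truth
            tp += predicted and actual
            fp += predicted and not actual
            fn += actual and not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(candidates, scores_by_image, truth_by_image, classes):
    """Return (best_threshold, best_micro_f1) over the candidate list."""
    scored = [
        (micro_f1_at(t, scores_by_image, truth_by_image, classes), t)
        for t in candidates
    ]
    best_f1, best_t = max(scored, key=lambda pair: pair[0])
    return best_t, best_f1


# Toy data (illustrative only).
toy_scores = {
    "a.jpg": {"pizza": 0.30, "soda": 0.22},
    "b.jpg": {"pizza": 0.21, "soda": 0.28},
}
toy_truth = {"a.jpg": ["pizza"], "b.jpg": ["soda"]}

best_t, best_f1 = sweep_thresholds(
    [0.20, 0.25, 0.35], toy_scores, toy_truth, ["pizza", "soda"]
)
print(f"best threshold = {best_t}, micro F1 = {best_f1:.2f}")
```

With the notebook's variables, the inputs would be `{path: r["scores"] for path, r in valid_results.items()}` and `valid_ground_truth`. Picking the threshold on the validation split and only then scoring the test split keeps the test numbers unbiased.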


## Prediction on test set

In [None]:
test_results, test_ground_truth, test_image_paths = classify_images_with_clip(
    dataset_part="test",
    classes=classes_from_dataset,
)

test_metrics = evaluate_predictions(
    image_paths=test_image_paths,
    ground_truth=test_ground_truth,
    results=test_results,
    classes=classes_from_dataset,
)

print(f"Subset Accuracy: {test_metrics['subset_accuracy']:.2f}%")
print(f"Micro F1: {test_metrics['micro_f1_score']:.2f}%")
print(f"Macro F1: {test_metrics['macro_f1_score']:.2f}%")

Subset Accuracy: 42.20%
Micro F1: 43.24%
Macro F1: 48.99%
