# Accelerating CLIP Inference: From CPU to GPU Preprocessing

This notebook demonstrates how to optimize the inference pipeline for OpenAI's **CLIP** model, loaded through the Hugging Face `transformers` library. Libraries like `transformers` often default to CPU-based preprocessing, which can become a bottleneck when deploying high-performance models using **NVIDIA Triton Inference Server** or **TensorRT**.

### Key Technical Areas:
1.  **Standard Pipeline:** Using `CLIPProcessor` with CPU-based preprocessing.
2.  **GPU Acceleration:** Implementing high-speed preprocessing using `torchvision.transforms` on the GPU.
3.  **Benchmarking:** Measuring the performance gains between CPU and GPU-based pipelines.

## 1. Setup and Installation
First, we'll install necessary libraries. While this notebook is compatible with Google Colab, these tools are commonly used in production environments for optimized model serving.

In [None]:
# Install core libraries for CLIP and image processing
!pip install -q transformers pillow torch torchvision

## 2. Dataset Preparation
We'll use a subset of the **CIFAR-10** test set to measure CLIP zero-shot classification performance.

In [None]:
import torch
import torchvision
from PIL import Image

total_images = 50
# Load CIFAR-10 test set
dataset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True)

# Extract a subset for benchmarking
subset_images = [dataset[i][0] for i in range(total_images)]
subset_label_ids = [dataset[i][1] for i in range(total_images)]
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

## 3. Loading the CLIP Model
We use the `clip-vit-large-patch14` model from OpenAI via the Hugging Face `transformers` library.

In [None]:
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-large-patch14"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = CLIPModel.from_pretrained(model_id).to(device)
processor = CLIPProcessor.from_pretrained(model_id)
model.eval()

print(f"Model loaded on {device} successfully.")

## 4. Standard Case: CPU-Based Preprocessing
In a typical pipeline, the `CLIPProcessor` handles image resizing and normalization on the CPU before passing tensors to the GPU.

In [None]:
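import time

def get_ms(start, end):
    """Convert a perf_counter interval (seconds) to milliseconds."""
    return (end - start) * 1000

def sync():
    """Wait for queued GPU work so wall-clock timings are accurate (no-op on CPU)."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()

candidate_queries = [f"a photo of a {name}" for name in class_names]
pre_proc_times, model_times, post_proc_times = [], [], []
num_correct = 0

for image, label_id in zip(subset_images, subset_label_ids):
    sync()
    t0 = time.perf_counter()
    # Standard CPU preprocessing.
    # Note: this interval also covers text tokenization and the host-to-device copy.
    inputs = processor(text=candidate_queries, images=image, return_tensors="pt", padding=True).to(device)

    sync()
    t1 = time.perf_counter()
    pre_proc_times.append(get_ms(t0, t1))

    with torch.no_grad():
        t2 = time.perf_counter()
        outputs = model(**inputs)
        sync()
        t3 = time.perf_counter()
        model_times.append(get_ms(t2, t3))

    # Post-processing: softmax over the candidate prompts, then pick the best class
    t4 = time.perf_counter()
    probs = outputs.logits_per_image.softmax(dim=1)
    predicted_id = probs.argmax().item()
    t5 = time.perf_counter()
    post_proc_times.append(get_ms(t4, t5))
    num_correct += int(predicted_id == label_id)

n = len(subset_images)
print(f"Average Latency (CPU Pre-proc): {sum(pre_proc_times)/n:.2f} ms")
print(f"Average Latency (Model): {sum(model_times)/n:.2f} ms")
print(f"Average Latency (Post-proc): {sum(post_proc_times)/n:.2f} ms")
print(f"Zero-shot accuracy: {num_correct}/{n}")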
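
The loop above records every iteration, including the first one, which often pays one-time costs such as CUDA initialization. A minimal sketch of a reusable timing helper that discards warm-up runs and reports the median alongside the mean follows. `time_calls` and `summarize_ms` are hypothetical helper names, not part of the notebook, and the `sync` argument is assumed to be something like `torch.cuda.synchronize` when running on a GPU.

```python
import statistics
import time


def time_calls(fn, warmup=3, iters=20, sync=None):
    """Return per-call latencies in milliseconds for fn().

    sync is an optional callable (assumed: torch.cuda.synchronize on GPU)
    invoked before starting and after stopping the clock.
    """
    def _sync():
        if sync is not None:
            sync()

    # Warm-up runs are executed but not recorded.
    for _ in range(warmup):
        fn()
    _sync()

    latencies_ms = []
    for _ in range(iters):
        _sync()
        start = time.perf_counter()
        fn()
        _sync()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


def summarize_ms(latencies_ms):
    """Summarize a list of latencies (ms) as mean, median, and max."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }


# Stand-in workload so the sketch runs without a GPU or a model.
calls = []
latencies = time_calls(lambda: calls.append(sum(range(1000))), warmup=2, iters=5)
stats = summarize_ms(latencies)
print({name: round(value, 4) for name, value in stats.items()})
```

In this notebook, `fn` could wrap the `processor(...)` call or `model(**inputs)`, with `sync=torch.cuda.synchronize` passed only when `device` is `"cuda"`.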

## 5. Optimized Case: GPU-Accelerated Preprocessing
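By applying `torchvision.transforms` directly to tensors on the GPU, we can reduce the latency of the preprocessing step, especially when images are processed as a batch. The transforms below are intended to mirror the resize, center-crop, and normalization steps that `CLIPProcessor` applies to images; text tokenization is not part of this path.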

In [None]:
import torchvision.transforms as T

# CLIP-specific normalization
mean = (0.48145466, 0.4578275, 0.40821073)
std = (0.26862954, 0.26130258, 0.27577711)

gpu_preprocess = T.Compose([
    # Resize the shorter side to 224 (preserving aspect ratio), then crop to 224x224
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC, antialias=True),
    T.CenterCrop(224),
    T.ConvertImageDtype(torch.float32),
    T.Normalize(mean=mean, std=std)
])

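def benchmark_gpu_preproc():
    with torch.no_grad():
        # Warm up: run once untimed so one-time initialization is excluded
        warmup = T.functional.to_tensor(subset_images[0]).unsqueeze(0).to(device)
        _ = gpu_preprocess(warmup)
        if device == "cuda":
            torch.cuda.synchronize()

        start_gpu = time.perf_counter()
        # Convert images to tensors (on the CPU), stack them, and move the batch to the device
        raw_tensors = torch.stack([T.functional.to_tensor(img) for img in subset_images]).to(device)
        # Batched preprocessing on the device
        _ = gpu_preprocess(raw_tensors)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed_ms = (time.perf_counter() - start_gpu) * 1000
    return elapsed_ms

gpu_time = benchmark_gpu_preproc()
print(f"GPU Preprocessing (batch of {len(subset_images)} images): {gpu_time:.2f} ms")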
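
To make the `T.Normalize` step concrete: for each channel it computes `(x - mean) / std` on values already scaled to [0, 1]. The plain-Python sketch below applies that formula to a single RGB pixel using the CLIP constants from the cell above. `normalize_pixel` is a hypothetical helper written for illustration, not a torchvision function.

```python
# CLIP normalization constants, copied from the cell above.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb_uint8, mean=CLIP_MEAN, std=CLIP_STD):
    """Scale an 8-bit RGB pixel to [0, 1], then apply (x - mean) / std per channel."""
    scaled = [channel / 255.0 for channel in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print("black:", [round(v, 3) for v in black])
print("white:", [round(v, 3) for v in white])
```

With these constants, a black pixel maps to negative values and a white pixel to positive ones, so the model sees inputs centered roughly around zero rather than in [0, 1].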

## Conclusion and Next Steps
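Moving preprocessing to the GPU can lead to substantial speedups, particularly for batched inputs. Keep in mind that the two benchmarks above are not measured the same way: Section 4 times one image per call (including text tokenization), while Section 5 times a single batch of all images, so compare them on a per-image basis.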
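
Because one benchmark reports an average per-image latency and the other a total for one batch, the numbers are easier to compare once both are expressed per image. A minimal sketch follows. `per_image_ms` and `speedup` are hypothetical helpers, and the latency values are placeholders chosen only to exercise the arithmetic, not measurements from this notebook.

```python
def per_image_ms(total_ms, num_images):
    """Convert the latency of one batch into an average latency per image."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return total_ms / num_images


def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized path is than the baseline."""
    if optimized_ms <= 0:
        raise ValueError("optimized_ms must be positive")
    return baseline_ms / optimized_ms


# Placeholder values for illustration only (NOT measured results).
cpu_avg_ms_per_image = 8.0   # stands in for the Section 4 average
gpu_batch_total_ms = 100.0   # stands in for the Section 5 batch total
batch_size = 50

gpu_ms_per_image = per_image_ms(gpu_batch_total_ms, batch_size)
ratio = speedup(cpu_avg_ms_per_image, gpu_ms_per_image)
print(f"GPU per image: {gpu_ms_per_image:.2f} ms, speedup: {ratio:.1f}x")
```

In the notebook, the placeholders would be replaced by `sum(pre_proc_times) / len(pre_proc_times)` and `gpu_time`.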

### Future Optimizations:
- **NVIDIA DALI:** For even more advanced pipeline parallelization.
- **TensorRT Deployment:** Compiling the model into a high-performance engine for production use with Triton Inference Server.