# CLIP Inference Pipelines With DeepSparse


[CLIP](https://github.com/mlfoundations/open_clip/tree/main) models can be used for zero-shot image classification and for generating image captions. This notebook illustrates how to perform these two tasks on a CPU using CLIP and [DeepSparse](https://github.com/neuralmagic/deepsparse).

To run this notebook successfully, you need `open_clip_torch==2.20.0`, which you can get by installing `sparseml[clip]`. The other required installations are:

- `deepsparse[clip]`
- `torch-nightly`

In [None]:
pip install sparseml-nightly[clip]

In [None]:
# Set this to today's torch nightly version
import os
os.environ["MAX_TORCH"] = "2.2.0.dev20230911+cpu"

In [None]:
pip install deepsparse-nightly[clip]

In [None]:
pip uninstall -y torch

In [None]:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/

## Download Test Images

In [None]:
%%bash
wget -O basilica.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg
wget -O buddy.jpeg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg
wget -O thailand.jpg https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg

![basilica](https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolo/sample_images/basilica.jpg)
![dog](https://raw.githubusercontent.com/neuralmagic/deepsparse/main/tests/deepsparse/pipelines/sample_images/buddy.jpeg)
![elephants](https://raw.githubusercontent.com/neuralmagic/deepsparse/main/src/deepsparse/yolact/sample_images/thailand.jpg)

## Zero-shot Image Classification

DeepSparse requires the CLIP models in the ONNX format. You can obtain these files by exporting the CLIP models with the provided export scripts.

### Export Models Using SparseML

First, download the export scripts:

In [None]:
%%bash
wget https://raw.githubusercontent.com/neuralmagic/sparseml/main/integrations/clip/clip_models.py
wget https://raw.githubusercontent.com/neuralmagic/sparseml/main/integrations/clip/clip_onnx_export.py

Export the CLIP models for zero-shot classification.

The pre-trained models can be found on the [OpenCLIP GitHub repository](https://github.com/mlfoundations/open_clip/tree/main).


The export script produces a visual model and a text model, which are then passed to the DeepSparse Pipeline for inference.

In [None]:
%%bash
python clip_onnx_export.py --model convnext_base_w_320 \
            --pretrained laion_aesthetic_s13b_b82k --export-path convnext_onnx
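
Before building the pipeline, it can help to confirm that the export directory contains the ONNX files that the later cells load. The following is a minimal sketch: `missing_onnx_files` is a hypothetical helper, not part of SparseML or DeepSparse, and the default file names are simply the ones used by the zero-shot cell in this notebook.

```python
from pathlib import Path


def missing_onnx_files(export_dir, expected=("clip_visual.onnx", "clip_text.onnx")):
    """Return the expected file names that are absent from export_dir.

    Hypothetical helper for this notebook; the default names match the
    paths passed to the zero-shot pipeline.
    """
    root = Path(export_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_onnx_files("convnext_onnx")
if missing:
    print(f"Export incomplete, missing: {missing}")
else:
    print("All expected ONNX files are present")
```

For the captioning export, the same check can be run with `expected=("clip_visual.onnx", "clip_text.onnx", "clip_text_decoder.onnx")`.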



To perform zero-shot image classification with the CLIP models and DeepSparse, provide the images, the candidate classes, and the paths to the CLIP models, and specify the task as `clip_zeroshot`.

In [None]:
import numpy as np

from deepsparse import BasePipeline
from deepsparse.clip import (
    CLIPTextInput,
    CLIPVisualInput,
    CLIPZeroShotInput
)

possible_classes = ["ice cream", "an elephant", "a dog", "a building", "a church"]
images = ["basilica.jpg", "buddy.jpeg", "thailand.jpg"]

model_path_text = "convnext_onnx/clip_text.onnx"
model_path_visual = "convnext_onnx/clip_visual.onnx"

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
}
pipeline = BasePipeline.create(task="clip_zeroshot", **kwargs)

pipeline_input = CLIPZeroShotInput(
    image=CLIPVisualInput(images=images),
    text=CLIPTextInput(text=possible_classes),
)

output = pipeline(pipeline_input).text_scores
for i in range(len(output)):
    prediction = possible_classes[np.argmax(output[i])]
    print(f"Image {images[i]} is a picture of {prediction}")

DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20230906 COMMUNITY | (f5e597bf) (release) (optimized) (system=avx2, binary=avx2)


Image basilica.jpg is a picture of a church
Image buddy.jpeg is a picture of a dog
Image thailand.jpg is a picture of an elephant
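
Conceptually, CLIP-style zero-shot classification compares each image embedding with the embedding of every candidate label and picks the closest one. The sketch below reproduces that step in NumPy on made-up embeddings. It illustrates the general technique, not DeepSparse's internal implementation: `zero_shot_scores`, the `scale` value, and the toy vectors are all assumptions, and this notebook does not state whether `text_scores` is normalized in this way.

```python
import numpy as np


def zero_shot_scores(image_emb, text_emb, scale=100.0):
    """Softmax over scaled cosine similarities, one row per image.

    image_emb has shape (num_images, dim); text_emb has shape
    (num_classes, dim). The scale value is an assumption for illustration.
    """
    # Normalize so that the dot product equals the cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = scale * image_emb @ text_emb.T
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy 3-dimensional embeddings, invented purely for this example.
classes = ["a dog", "a church", "an elephant"]
text_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
image_emb = np.array([[0.1, 0.9, 0.2], [0.8, 0.1, 0.3]])

scores = zero_shot_scores(image_emb, text_emb)
for row in scores:
    print(classes[int(np.argmax(row))])
```

Each row of `scores` sums to 1, so taking `np.argmax` per row, as the cell above does with the pipeline output, selects the most likely class for that image.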


## Image Caption Generation


Image caption generation works in much the same way as zero-shot image classification.

### Export Models Using SparseML

The first step is to export the CLIP models. The provided script exports the visual, text, and text decoder models.


In [None]:
%%bash
python clip_onnx_export.py --model coca_ViT-B-32 \
            --pretrained mscoco_finetuned_laion2b_s13b_b90k --export-path caption_models

Downloading (…)ip_pytorch_model.bin: 100%|██████████| 1.01G/1.01G [01:25<00:00, 11.8MB/s]
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode


Next, run inference by providing the paths of the exported CLIP models to the DeepSparse Pipeline and specifying the task as `clip_caption`. Then specify the image you'd like to run inference on.

In [None]:
from deepsparse import BasePipeline
from deepsparse.clip import CLIPCaptionInput, CLIPVisualInput

root = "caption_models"
model_path_visual = f"{root}/clip_visual.onnx"
model_path_text = f"{root}/clip_text.onnx"
model_path_decoder = f"{root}/clip_text_decoder.onnx"
engine_args = {"num_cores": 8}

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
    "decoder_model_path": model_path_decoder,
    "pipeline_engine_args": engine_args
}
pipeline = BasePipeline.create(task="clip_caption", **kwargs)

pipeline_input = CLIPCaptionInput(image=CLIPVisualInput(images="thailand.jpg"))
output = pipeline(pipeline_input).caption
print(output[0])

an adult elephant and a baby elephant 
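
Both captioning cells in this section build the same keyword arguments. A small helper can assemble them from the export directory. This is only a sketch: `caption_pipeline_kwargs` is a hypothetical name, and the dictionary keys and file names are copied from the notebook cells rather than from any documented DeepSparse contract.

```python
from pathlib import Path


def caption_pipeline_kwargs(root, num_cores=8):
    """Build the keyword arguments used by the captioning cells.

    Hypothetical convenience helper; keys and file names mirror the
    cells in this notebook.
    """
    root = Path(root)
    return {
        "visual_model_path": (root / "clip_visual.onnx").as_posix(),
        "text_model_path": (root / "clip_text.onnx").as_posix(),
        "decoder_model_path": (root / "clip_text_decoder.onnx").as_posix(),
        "pipeline_engine_args": {"num_cores": num_cores},
    }


kwargs = caption_pipeline_kwargs("caption_models")
print(kwargs["decoder_model_path"])
```

The resulting dictionary can be passed as `BasePipeline.create(task="clip_caption", **kwargs)`, in the same way as the hand-written dictionary in the captioning cells.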


In [None]:
from deepsparse import BasePipeline
from deepsparse.clip import CLIPCaptionInput, CLIPVisualInput

root = "caption_models"
model_path_visual = f"{root}/clip_visual.onnx"
model_path_text = f"{root}/clip_text.onnx"
model_path_decoder = f"{root}/clip_text_decoder.onnx"
engine_args = {"num_cores": 8}

kwargs = {
    "visual_model_path": model_path_visual,
    "text_model_path": model_path_text,
    "decoder_model_path": model_path_decoder,
    "pipeline_engine_args": engine_args
}
pipeline = BasePipeline.create(task="clip_caption", **kwargs)

pipeline_input = CLIPCaptionInput(image=CLIPVisualInput(images="buddy.jpeg"))
output = pipeline(pipeline_input).caption
print(output[0])

a close up of the dog 's mouth is very happy 
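
The generated captions keep some tokenization spacing, such as the space before `'s` and the trailing space in the output above. If cleaner text is needed for display, a light post-processing step along these lines may help. `tidy_caption` is a hypothetical helper, not part of DeepSparse, and the list of suffixes it reattaches is an assumption that may need adjusting for other captions.

```python
import re


def tidy_caption(caption):
    """Trim whitespace and reattach suffixes split off by tokenization."""
    # Collapse runs of whitespace and drop leading/trailing spaces.
    text = " ".join(caption.split())
    # Reattach common English clitics, e.g. "dog 's" -> "dog's".
    text = re.sub(r"\s+('s|n't|'re|'ve|'ll|'d|'m)\b", r"\1", text)
    # Remove spaces that precede punctuation marks.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    return text


print(tidy_caption("a close up of the dog 's mouth is very happy "))
# prints: a close up of the dog's mouth is very happy
```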


## Where to Go From Here

If you have questions, join us on [Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ) or open an issue on [GitHub](https://github.com/neuralmagic).