Zero-shot image classification is the task of classifying images into categories that the model was not explicitly trained on. Given an image, the model predicts which of the candidate classes the image belongs to. This is useful when you have little labeled data, or when you want to add image classification to an application quickly: instead of training a custom model, you can use an existing pre-trained one.

These models are usually multimodal and have been trained on a large dataset of images paired with text descriptions, which lets them be reused for many different tasks. You may need to give the model extra information about the classes it hasn't seen; this auxiliary information can be textual descriptions or attributes. Zero-shot classification is a subfield of transfer learning.

In zero-shot image classification, you supply your own labels at inference time. For example, you can pass the image you want to classify together with a list of labels such as plane, car, dog, and bird, and the model picks the most likely label. For a photo of a dog, it should choose dog. Contrastive Language-Image Pretraining (CLIP) is one of the most popular models for zero-shot classification. It can classify images by common objects or other characteristics and doesn't need to be fine-tuned for each new use case.
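To make the label-selection step concrete, here is a minimal sketch of how an image embedding can be scored against one embedding per candidate label. It assumes the embeddings already exist; the 3-dimensional vectors, the `zero_shot_scores` helper, and the `temperature` value are made up for illustration and do not come from a real model.

```python
import numpy as np

def zero_shot_scores(image_embedding, label_embeddings, temperature=100.0):
    """Return one probability per label: cosine similarity followed by softmax.

    `temperature` is an illustrative scaling factor, not a value read from a model.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    image = image_embedding / np.linalg.norm(image_embedding)
    labels = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
    logits = temperature * (labels @ image)
    # Numerically stable softmax over the candidate labels.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

labels = ["plane", "car", "dog", "bird"]
# Hypothetical embeddings, one row per label (toy values for illustration only).
label_embeddings = np.array([
    [1.0, 0.0, 0.0],  # plane
    [0.0, 1.0, 0.0],  # car
    [0.0, 0.0, 1.0],  # dog
    [0.6, 0.0, 0.8],  # bird
])
# Pretend this vector came from encoding a photo of a dog.
image_embedding = np.array([0.1, 0.1, 0.9])

probs = zero_shot_scores(image_embedding, label_embeddings)
predicted = labels[int(np.argmax(probs))]
print(predicted)  # dog
```

In a real setup, the label embeddings would come from a text encoder applied to prompts built from the labels, and the image embedding from an image encoder.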


Model used in this notebook: https://huggingface.co/openai/clip-vit-large-patch14

**CLIP** is a neural network that learns visual concepts from natural language supervision. It's trained on pairs of images and texts and learns to predict the text corresponding to a given image. It can then be used for zero-shot classification of new images.

CLIP is flexible and can be applied to a wide range of visual classification benchmarks without being directly optimized for any of them. It has been shown to achieve state-of-the-art performance and distributional robustness on these benchmarks, and it outperforms existing ImageNet-trained models in representation learning evaluations that use linear probes.

The network consists of an image encoder and a text encoder, which are jointly trained to predict the correct image-text pairings. During training, the two encoders learn to maximize the cosine similarity between the image and text embeddings of the real pairs while minimizing the cosine similarity of the incorrect pairings.

**CyCLIP** is a framework that builds on CLIP by formalizing consistency. It optimizes the learned representations to be geometrically consistent in the image and text spaces and has been shown to improve on the performance of CLIP.
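The CLIP training objective described above can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a simplified NumPy illustration, not the original training code: the `clip_style_loss` name, the random embeddings, and the fixed `temperature` are assumptions, and CyCLIP's additional consistency terms are not included.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch where row i of each input is a real pair."""
    # Normalize so that the matrix product holds cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    idx = np.arange(len(logits))
    # Real pairs sit on the diagonal; every other entry is an incorrect pairing.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))  # toy text embeddings
# Image embeddings that nearly match their paired texts give a low loss.
aligned = clip_style_loss(text_emb + 0.01 * rng.normal(size=(4, 8)), text_emb)
# Reversing the batch order pairs every image with the wrong text, so the loss rises.
shuffled = clip_style_loss(text_emb[::-1], text_emb)
print(aligned < shuffled)  # True
```

Minimizing this loss pushes the diagonal similarities up and the off-diagonal ones down, which is the behavior the paragraph above describes.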

In [1]:
# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

processor = AutoProcessor.from_pretrained("openai/clip-vit-large-patch14")
model = AutoModelForZeroShotImageClassification.from_pretrained("openai/clip-vit-large-patch14")
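Once the processor and model are loaded, classification could look roughly like the sketch below. The `classify_image` helper and the "a photo of a ..." prompt template are illustrative assumptions rather than part of the original notebook; the helper expects a PIL image plus the `processor` and `model` objects created above.

```python
def classify_image(image, candidate_labels, processor, model):
    """Return a {label: probability} dict for one image (illustrative helper)."""
    # Imported lazily so that torch is only required when the function is called.
    import torch

    # Assumed prompt template; other phrasings may work better for some label sets.
    prompts = [f"a photo of a {label}" for label in candidate_labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_images, num_prompts); softmax turns it into probabilities.
    probs = outputs.logits_per_image.softmax(dim=1)[0]
    return dict(zip(candidate_labels, probs.tolist()))

# Example call, assuming a hypothetical local file "dog.jpg":
# from PIL import Image
# scores = classify_image(Image.open("dog.jpg"), ["plane", "car", "dog", "bird"], processor, model)
# print(max(scores, key=scores.get))
```

Passing the processor and model in as arguments keeps the helper independent of any particular checkpoint.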


Notebook link: https://www.kaggle.com/code/youssef19/zero-shot-image-classification-using-clip-model