## Zero-shot / zero-data 

In 2021, OpenAI released the CLIP (Contrastive Language-Image Pre-training) model, which was trained on text-image pairs.

This allowed the model to understand the relationships between visual concepts and their natural language descriptions.

This was a major breakthrough for building vision applications that need no task-specific training or labeled data, which dramatically lowers the cost of building them.

In [None]:
import cv2
import matplotlib.pyplot as plt
import torch
from transformers import AutoProcessor, CLIPModel
from PIL import Image

In [None]:
image_paths = [ "./assets/clip-text-image-encoding.jpg", "./assets/clip-text-image-prediction.jpg" ]
for image_path in image_paths:
    image = Image.open(image_path)
    plt.imshow(image)
    plt.show()

In [None]:
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")

In [None]:
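def evaluate(image, prompts):
    """Print the probability CLIP assigns to each prompt for the given image."""
    # Tokenize the prompts and preprocess the image into model-ready tensors
    inputs = processor(
        text=prompts,
        images=image,
        return_tensors="pt",
        padding=True,
    )

    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)

    logits_per_image = outputs.logits_per_image  # image-text similarity scores
    probs = logits_per_image.softmax(dim=1)  # probabilities across the prompts

    for prompt, prob in zip(prompts, probs[0]):
        print(f"Probability of {prompt} is {(prob.item() * 100):.2f}%")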
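
The softmax step in `evaluate` is worth a closer look: it turns similarity scores into probabilities that are relative to the prompts supplied, so they always sum to 100% even when no prompt fits the image well. Below is a minimal pure-Python sketch of that step. The logit values are made-up numbers for illustration, not actual CLIP output, and `softmax` here is a hand-written helper rather than the PyTorch method used above.

```python
import math


def softmax(logits):
    """Convert raw similarity scores into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical similarity scores for one image against three prompts
prompts = [
    "people playing cricket",
    "people playing pickleball",
    "people playing football",
]
logits = [21.0, 27.0, 20.0]

probs = softmax(logits)
for prompt, prob in zip(prompts, probs):
    print(f"{prompt}: {prob * 100:.2f}%")

# Dropping the best-matching prompt still yields probabilities that sum to 1,
# so a high probability only means "best of the options given"
probs_without_best = softmax([21.0, 20.0])
print(f"Without pickleball: {[round(p, 2) for p in probs_without_best]}")
```

In practice this means the prompt list should include plausible alternatives, since the scores say nothing about options that were never offered.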

In [None]:
image_paths = [
    "./assets/pickleball.jpg",
    "./assets/woman-apple.jpg",
    "./assets/football-on-beach.jpg"
]

for image_path in image_paths:
    image = Image.open(image_path)
    plt.imshow(image)
    plt.show()

In [None]:
image = Image.open("./assets/pickleball.jpg")
plt.imshow(image)
plt.show()

prompts = [
    "people playing cricket",
    "people playing pickleball",
    "people playing football"
]

evaluate(image, prompts)

In [None]:
image = Image.open("./assets/woman-apple.jpg")
plt.imshow(image)
plt.show()

prompts = [
    "woman eating apple",
    "man eating apple",
    "woman eating banana",
    "man cooking fish",
]

evaluate(image, prompts)

In [None]:
image = Image.open("./assets/football-on-beach.jpg")
plt.imshow(image)
plt.show()

prompts = [
    "people playing football on a beach",
    "people playing football in a ground",
    "people playing hockey on a beach",
    "people running on a beach",
]

evaluate(image, prompts)

# Text Association -> Generation

CLIP is fundamentally a pair of encoders that map text and images into a shared embedding space, enabling powerful associations between visual and textual content.
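
To make the idea of a shared embedding space concrete, here is a small sketch that ranks captions against an image by cosine similarity. The three-dimensional vectors are hand-picked toy values and `cosine_similarity` is an illustrative helper. Real CLIP embeddings have hundreds of dimensions and come from the model's image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings (hypothetical values): in a shared space, an image and the
# text that describes it should point in similar directions
image_embedding = [0.9, 0.1, 0.3]
text_embeddings = {
    "woman eating apple": [0.8, 0.2, 0.4],
    "man cooking fish": [0.1, 0.9, 0.2],
}

# Score every caption against the image and keep the closest one
scores = {
    text: cosine_similarity(image_embedding, vector)
    for text, vector in text_embeddings.items()
}
best_match = max(scores, key=scores.get)
print(best_match)
```

The same comparison works in either direction: one image against many captions (classification) or one caption against many images (search).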

However, encoders like CLIP become far more useful when paired with generative language models.

This combination moves beyond simple image-text association to actually generating contextual descriptions based on visual content and additional user context.
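
One way to supply the "additional user context" is to pass it as plain text alongside the image. The sketch below assumes the model accepts a list mixing an image handle and strings, as the `generate_content` call later in this notebook does. `build_contents` is a hypothetical helper, not part of any SDK, and the image is a placeholder string so the example runs without an API key.

```python
def build_contents(image_file, user_context, question):
    """Combine an image handle, optional user context, and a question."""
    parts = [image_file]
    # Only add the context line when the caller actually provides one
    if user_context:
        parts.append(f"Context: {user_context}")
    parts.append(question)
    return parts


# Placeholder standing in for an uploaded image handle (assumption)
image_placeholder = "<uploaded image handle>"

contents = build_contents(
    image_placeholder,
    user_context="The viewer is a football coach scouting players",
    question="Explain what is happening in the image",
)
print(contents)
```

The resulting list could then be passed as the `contents` argument in place of the hard-coded list used in the cells below.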


In [None]:
import os
from google import genai

In [None]:
gemini_api_key = os.environ['GEMINI_API_KEY']

In [None]:
client = genai.Client(api_key=gemini_api_key)

In [None]:
image_file = client.files.upload(file="./assets/football-on-beach.jpg")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[image_file, "Explain what is happening in the image"]
)

print(response.text)