# Document Classification with CLIP\n\nThis notebook demonstrates how to use CLIP for zero-shot document classification. We specifically address the common challenge of distinguishing between similar document types, such as **Receipts** and **Invoices**.\n\nWe will show how providing more descriptive prompts (Prompt Engineering) helps the model differentiate between semantically similar categories.

In [None]:
# Install dependencies if they are not already installed\n! pip install ftfy regex tqdm\n! pip install git+https://github.com/openai/CLIP.git

In [None]:
import torch\nimport clip\nfrom PIL import Image\nimport urllib.request\nimport io\n\n# Set up the device to use GPU if available, otherwise fallback to CPU\ndevice = "cuda" if torch.cuda.is_available() else "cpu"\nprint(f"Using device: {device}")\n\nmodel, preprocess = clip.load("ViT-B/32", device=device)

## 1. Loading a Sample Document\nWe will use a public image of a receipt to test the model. This image serves as a good example of a business document that can be ambiguous to classify.

In [None]:
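url = "https://upload.wikimedia.org/wikipedia/commons/thumb/0/0b/ReceiptSwiss.jpg/800px-ReceiptSwiss.jpg"\n\n# Wikimedia may reject requests that lack a User-Agent header, so send one explicitly\n# (the User-Agent string below is an arbitrary placeholder)\nrequest = urllib.request.Request(url, headers={"User-Agent": "clip-document-demo/0.1"})\nwith urllib.request.urlopen(request) as response:\n    image_data = response.read()\n\n# Preprocess the image to match CLIP's expected input format, then add a batch dimension\nimage = preprocess(Image.open(io.BytesIO(image_data))).unsqueeze(0).to(device)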

## 2. The Baseline: Simple Labels\nFirst, we classify the image using single-word labels, each wrapped in the generic template "a photo of a ...". This often confuses the model because terms like "receipt" and "invoice" overlap significantly in meaning, and a one-word label gives the model little else to tell them apart.

In [None]:
basic_categories = ["receipt", "invoice", "form", "document"]\ntext_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in basic_categories]).to(device)\n\nwith torch.no_grad():\n    # Encode the image and the candidate labels into the shared embedding space\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text_inputs)\n\n# Normalize to unit length so the dot product below is a cosine similarity\nimage_features /= image_features.norm(dim=-1, keepdim=True)\ntext_features /= text_features.norm(dim=-1, keepdim=True)\n\n# Scale the cosine similarities (100.0 acts as a temperature) and apply softmax\n# to get probabilities that are relative to the labels listed above\nsimilarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)\nvalues, indices = similarity[0].topk(len(basic_categories))\n\nprint("Baseline Results (Simple Prompts):")\nfor value, index in zip(values, indices):\n    print(f"{basic_categories[index]:>16s}: {100 * value.item():.2f}%")
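To see why the baseline scores can swing on small differences, it may help to reproduce the scaled softmax by hand. The sketch below uses only the standard library. The similarity values are invented for illustration and are not output from CLIP, and `scaled_softmax` is a hypothetical helper name, not part of the `clip` package.

```python
import math

def scaled_softmax(similarities, scale=100.0):
    """Turn cosine similarities into probabilities via softmax(scale * sims)."""
    logits = [scale * s for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities, invented for illustration only
labels = ["receipt", "invoice", "form", "document"]
example_sims = [0.26, 0.25, 0.22, 0.21]

probs = scaled_softmax(example_sims)
for label, p in zip(labels, probs):
    print(f"{label:>16s}: {100 * p:.2f}%")
```

With a scale of 100, a gap of 0.01 in cosine similarity becomes a gap of 1 in the logits, so the first label ends up roughly e (about 2.7) times as likely as the second. The percentages therefore describe how the labels rank against each other, not an absolute confidence that the image is a receipt.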

## 3. The Solution: Detailed Prompts\nBy adding descriptive details to our categories, we guide CLIP to focus on specific visual or structural characteristics. For example, specifying that an invoice is a "full page" document with "billing details" helps distinguish it from a small transaction receipt.

In [None]:
refined_categories = [\n    "a small sales receipt from a store transaction",\n    "a full page commercial invoice with billing details",\n    "a blank bureaucratic form",\n    "a general text document"\n]\n\n# The prompts are already full descriptions, so no "a photo of a ..." template is added\ntext_inputs = torch.cat([clip.tokenize(c) for c in refined_categories]).to(device)\n\nwith torch.no_grad():\n    image_features = model.encode_image(image)\n    text_features = model.encode_text(text_inputs)\n\n# Normalize to unit length so the dot product below is a cosine similarity\nimage_features /= image_features.norm(dim=-1, keepdim=True)\ntext_features /= text_features.norm(dim=-1, keepdim=True)\n\n# Same scaled softmax as in the baseline, so the two result tables are comparable\nsimilarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)\nvalues, indices = similarity[0].topk(len(refined_categories))\n\nprint("\\nImproved Results (Detailed Prompts):")\nfor value, index in zip(values, indices):\n    print(f"{refined_categories[index]:>45s}: {100 * value.item():.2f}%")
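One possible way to compare the two prompt sets is to look at the margin between the top label and the runner-up, rather than at the top score alone. The helper below is a minimal sketch of that idea. `top_margin` is a hypothetical name introduced here, and the probabilities in the example are made up rather than taken from the model.

```python
def top_margin(labels, probs):
    """Return (best_label, best_prob, margin over the runner-up)."""
    if len(labels) != len(probs) or len(probs) < 2:
        raise ValueError("need at least two labels with matching probabilities")
    # Sort label/probability pairs from most to least likely
    ranked = sorted(zip(probs, labels), reverse=True)
    (best_p, best_label), (second_p, _) = ranked[0], ranked[1]
    return best_label, best_p, best_p - second_p

# Hypothetical probabilities, invented for illustration (not model output)
example = top_margin(["receipt", "invoice", "form"], [0.5, 0.25, 0.25])
print(example)
```

In this notebook, the helper could be called as `top_margin(refined_categories, similarity[0].tolist())` after the cell above, and likewise with `basic_categories` after the baseline cell. A larger margin for the detailed prompts would suggest they separate the receipt from the invoice more cleanly on this image.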