# How to use the SigLIP2 Model for Embeddings and Text Similarity Search

In [None]:
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset("quickstart")

# Set up the zoo model

In [None]:
import fiftyone.zoo as foz

foz.register_zoo_model_source("https://github.com/harpreetsahota204/siglip2")

## Selecting a checkpoint


- Size matters: retrieval quality generally ranks Giant > So400m > Large > Base.
- Resolution matters: higher resolutions (384, 512) consistently outperform lower ones (224, 256).

#### Document vs. Natural Images:

- For natural images (photos, etc.): Standard fixed-resolution models perform best
- For document-like, text-heavy, or screen images: NaFlex variants perform better

#### Specific Use Case Recommendations:

- General image-text retrieval: Use larger models (So400m, Giant) with higher resolutions
- Document/OCR/screen content: Use NaFlex variants, especially at lower resolutions
- Multilingual applications: SigLIP 2 works well across languages

You can choose from one of the available checkpoints:

##### Base:

- `google/siglip2-base-patch16-224`
- `google/siglip2-base-patch16-256`  
- `google/siglip2-base-patch16-384`
- `google/siglip2-base-patch16-512`
- `google/siglip2-base-patch32-256`
- `google/siglip2-base-patch16-naflex`

##### Large:

- `google/siglip2-large-patch16-256`
- `google/siglip2-large-patch16-384`
- `google/siglip2-large-patch16-512`

##### Giant:

- `google/siglip2-giant-opt-patch16-256`
- `google/siglip2-giant-opt-patch16-384`

##### Shape optimized:

So400m variants generally achieve higher retrieval performance (Recall@1) on benchmarks like COCO and Flickr compared to the Base and Large models.

- `google/siglip2-so400m-patch14-224`
- `google/siglip2-so400m-patch14-384`
- `google/siglip2-so400m-patch16-256`
- `google/siglip2-so400m-patch16-384`
- `google/siglip2-so400m-patch16-512`
- `google/siglip2-so400m-patch16-naflex`
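
The guidance above can be condensed into a small lookup. The sketch below is a hypothetical helper (`pick_checkpoint` is not part of FiftyOne or the SigLIP2 zoo source). It only uses checkpoint names from the list above, and it assumes you want the highest listed resolution of a family for natural images and a NaFlex variant for document-like images.

```python
# Hypothetical helper: encodes the checkpoint guidance from this notebook.
# The names come from the checkpoint list above; the selection policy is an assumption.

# NaFlex variants are only listed for the Base and So400m families.
NAFLEX = {
    "base": "google/siglip2-base-patch16-naflex",
    "so400m": "google/siglip2-so400m-patch16-naflex",
}

# Highest fixed resolution listed for each family.
HIGHEST_RESOLUTION = {
    "base": "google/siglip2-base-patch16-512",
    "large": "google/siglip2-large-patch16-512",
    "so400m": "google/siglip2-so400m-patch16-512",
    "giant": "google/siglip2-giant-opt-patch16-384",
}


def pick_checkpoint(content="natural", size="so400m"):
    """Return a checkpoint name for the given content type and model family."""
    if content == "document":
        if size not in NAFLEX:
            raise ValueError(f"No NaFlex variant is listed for size {size!r}")
        return NAFLEX[size]
    if content == "natural":
        if size not in HIGHEST_RESOLUTION:
            raise ValueError(f"Unknown model size {size!r}")
        return HIGHEST_RESOLUTION[size]
    raise ValueError(f"Unknown content type {content!r}")


print(pick_checkpoint("document"))
print(pick_checkpoint("natural", size="giant"))
```

The returned string can be passed as `model_name` when downloading the model and as the name when loading it.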


In [None]:
foz.download_zoo_model(
    "https://github.com/harpreetsahota204/siglip2",
    model_name="google/siglip2-so400m-patch16-naflex",
)

In [None]:
import fiftyone.zoo as foz
model = foz.load_zoo_model(
    "google/siglip2-so400m-patch16-naflex"
)

# Compute embeddings

In [None]:
dataset.compute_embeddings(
    model=model,
    embeddings_field="siglip2_embeddings",
)

# Compute visualization of embeddings

Note: this step requires that `umap-learn` is installed. Currently, `umap-learn` only supports `numpy<=2.1.0`.
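
Before running the visualization, it can help to check the environment. The following is a minimal sketch that uses only the standard library; the helper names are hypothetical, and the `(2, 1, 0)` bound is simply the one quoted in the note above, so adjust it if a newer `umap-learn` release relaxes the constraint.

```python
import importlib.util
import re
from importlib.metadata import PackageNotFoundError, version

# Upper bound quoted in the note above (an assumption that may go stale).
NUMPY_MAX = (2, 1, 0)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def numpy_within_bound(numpy_version, bound=NUMPY_MAX):
    """Check whether a numpy version string is at or below the bound."""
    return parse_version(numpy_version)[: len(bound)] <= bound


def check_umap_environment():
    """Report whether umap-learn is importable and which numpy is installed."""
    # umap-learn is imported under the module name `umap`.
    has_umap = importlib.util.find_spec("umap") is not None
    try:
        numpy_version = version("numpy")
    except PackageNotFoundError:
        numpy_version = None
    numpy_ok = numpy_version is not None and numpy_within_bound(numpy_version)
    return has_umap, numpy_version, numpy_ok


has_umap, numpy_version, numpy_ok = check_umap_environment()
print(f"umap-learn installed: {has_umap}")
print(f"numpy version: {numpy_version} (within bound: {numpy_ok})")
```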

In [9]:
import fiftyone.brain as fob

results = fob.compute_visualization(
    dataset,
    embeddings="siglip2_embeddings",
    method="umap",
    brain_key="siglip2_viz",
    num_dims=2,
)

Generating visualization...
UMAP( verbose=True)
Mon Apr 21 13:27:55 2025 Construct fuzzy simplicial set
Mon Apr 21 13:27:55 2025 Finding Nearest Neighbors
Mon Apr 21 13:27:58 2025 Finished Nearest Neighbor Search
Mon Apr 21 13:28:00 2025 Construct embedding



	completed  0  /  500 epochs
	completed  50  /  500 epochs
	completed  100  /  500 epochs
	completed  150  /  500 epochs
	completed  200  /  500 epochs
	completed  250  /  500 epochs
	completed  300  /  500 epochs
	completed  350  /  500 epochs
	completed  400  /  500 epochs
	completed  450  /  500 epochs
Mon Apr 21 13:28:02 2025 Finished embedding


# Build a similarity index for natural language search

You can [visit the docs](https://docs.voxel51.com/api/fiftyone.brain.html?highlight=compute_similarity#fiftyone.brain.compute_similarity) for more information on similarity search.
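
Conceptually, text similarity search embeds the prompt with the model's text encoder and ranks images by how close their embeddings are to it. The sketch below illustrates that ranking step in plain Python, assuming a cosine metric; the three-dimensional vectors and sample names are made up for illustration and are not real SigLIP2 embeddings, and `rank_by_similarity` is a hypothetical helper rather than a FiftyOne function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, embeddings, k=None):
    """Return sample IDs sorted from most to least similar to the query."""
    scored = sorted(
        (
            (cosine_similarity(query, embedding), sample_id)
            for sample_id, embedding in embeddings.items()
        ),
        reverse=True,
    )
    # k=None keeps every sample; otherwise keep only the top k.
    return [sample_id for _, sample_id in scored[:k]]


# Toy image embeddings keyed by a made-up sample name.
toy_embeddings = {
    "horse_rider": [0.9, 0.1, 0.0],
    "city_street": [0.1, 0.9, 0.1],
    "beach": [0.0, 0.2, 0.9],
}

# Toy stand-in for the embedding of a text prompt.
toy_query = [1.0, 0.2, 0.0]

print(rank_by_similarity(toy_query, toy_embeddings, k=2))
```

In the notebook itself, the similarity index built below handles both the text embedding and the ranking.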

In [10]:
import fiftyone.brain as fob

text_img_index = fob.compute_similarity(
    dataset,
    model="google/siglip2-so400m-patch16-naflex",  # or pass the already instantiated `model`
    brain_key="siglip2_sim",
)

Computing embeddings...
 100% |█████████████████| 200/200 [6.2s elapsed, 0s remaining, 54.6 samples/s]      


Verify that the index supports text prompts:

In [11]:
print(text_img_index.config.supports_prompts)  # True

True


In [12]:
sims = text_img_index.sort_by_similarity(
    "a dude on a horse"
)

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.


In [13]:
sims

Dataset:     quickstart
Media type:  image
Num samples: 200
Sample fields:
    id:                 fiftyone.core.fields.ObjectIdField
    filepath:           fiftyone.core.fields.StringField
    tags:               fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:           fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    created_at:         fiftyone.core.fields.DateTimeField
    last_modified_at:   fiftyone.core.fields.DateTimeField
    ground_truth:       fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    uniqueness:         fiftyone.core.fields.FloatField
    predictions:        fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    siglip2_embeddings: fiftyone.core.fields.VectorField
View stages:
    1. Select(sample_ids=['68068dcdb62578c64484be40', '68068dccb62578c64484bdd4', '68068dceb62578c64484be5e', ...], ordered=True)

After launching the App (final cell below), select your dataset from the dropdown menu, open the Embeddings panel by clicking the `+` next to the Samples panel, and choose the embeddings to display from the dropdown menu in that panel (the visualization above was stored under the brain key `siglip2_viz`).

To search with natural language in the App, click the `🔎` button and type your query. The samples most similar to the query are shown in decreasing order of similarity.

In [None]:
fo.launch_app(dataset)