


# <img src="assets/voxel51_logo.png" alt="Image2" width="40"/> FiftyOne + Vector Search
This notebook demonstrates how to build a complete visual search workflow using **FiftyOne** and **Vector Search**.

You will learn how to:
- Load and index embeddings using FiftyOne
- Query by image and text
- Visualize results in the FiftyOne App

🧠 This integration helps you scale visual search over large datasets with a cloud-native vector database.

👉 For a reference on vector search with FiftyOne, see the official [FiftyOne + Mosaic AI docs](https://docs.voxel51.com/integrations/mosaic.html).


<img src="assets/mosaic_fiftyone_recipe.png" alt="Image2" width="600"/>

https://github.com/user-attachments/assets/2f5f21b3-5f42-4ab5-8e29-e1cac3e8eeb1

In [None]:
# Install necessary packages
#!pip install fiftyone torch torchvision python-dotenv mlflow umap-learn


Wait until your vector search endpoint is ready. Any action before that can produce a 400 or 500 HTTP error.
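
One way to avoid those early errors is to poll until the endpoint reports ready before continuing. The sketch below is a generic helper, not part of FiftyOne or Databricks: `wait_until_ready` and the `is_ready` callable are hypothetical names, and you would supply your own readiness check (for example, a call to your vector search client that inspects the endpoint state).

```python
import time


def wait_until_ready(is_ready, timeout=600.0, interval=10.0, sleep=time.sleep):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse.

    `is_ready` is a hypothetical, user-supplied callable that returns True
    once the endpoint can serve requests. Returns the number of polls made.
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        if is_ready():
            return attempts
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint not ready after {timeout} seconds")
        sleep(interval)
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting; in the notebook you would leave it at its default.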

## 📁 Load the BDD100K Dataset and Launch FiftyOne
We will use the `BDD100K` dataset from the Hugging Face Hub.

In [None]:
# import fiftyone as fo

# # Replace with your actual dataset name
# dataset_name = "BDD100K"

# # Check first if it exists
# if dataset_name in fo.list_datasets():
#     fo.delete_dataset(dataset_name)
#     print(f"✅ Dataset '{dataset_name}' deleted successfully.")
# else:
#     print(f"⚠️ Dataset '{dataset_name}' does not exist.")

In [None]:
import os

import fiftyone as fo  # base library and App
import fiftyone.brain as fob
import fiftyone.zoo as foz
import fiftyone.utils.huggingface as fouh  # Hugging Face integration

# Increase both connection and read timeout values (in seconds)
os.environ["HF_HUB_DOWNLOAD_TIMEOUT"] = "60"  # default is 10
os.environ["HF_HUB_ETAG_TIMEOUT"] = "30"      # metadata fetch timeout

# Hub repo ID, also used as the local dataset name
dataset_name = "dgural/bdd100k"

# Reuse the local copy if it exists; otherwise download it from the Hub
if dataset_name in fo.list_datasets():
    print(f"Dataset '{dataset_name}' exists. Loading...")
    dataset = fo.load_dataset(dataset_name)
else:
    print(f"Dataset '{dataset_name}' does not exist. Downloading from the Hub...")
    dataset = fouh.load_from_hub(dataset_name, name=dataset_name, persistent=True)



In [None]:
print(fo.list_datasets())

In [None]:
session = fo.launch_app(dataset, port=5151, auto=False)

![Image](assets/fiftyone_APP.png)

## Using the sklearn Backend (Default)
By default, calling ```compute_similarity()``` or ```sort_by_similarity()``` uses the sklearn backend.
To use the Mosaic backend instead, set the optional ```backend``` parameter of ```compute_similarity()``` to ```"mosaic"```, as done in the Mosaic cell later in this notebook.

## 🧠 Compute Embeddings and Index with SKLearn
Now we compute a similarity index using the default sklearn backend. This will:
- Use a CLIP model to generate embeddings
- Compute a visualization
- Compute a similarity index
- Query the dataset with a text prompt, create a view, and find mistakes
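
Conceptually, a similarity query ranks samples by how close their embedding vectors are to the query vector. The block below is a minimal sketch of that idea using cosine similarity on toy vectors. It illustrates the concept only and is not how FiftyOne or any backend implements it; `cosine_similarity`, `top_k_similar`, and the toy data are hypothetical names introduced for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_similar(query, embeddings, k=3):
    """Return the IDs of the `k` embeddings most similar to `query`.

    `embeddings` maps a sample ID to its embedding vector.
    """
    ranked = sorted(
        embeddings,
        key=lambda sample_id: cosine_similarity(query, embeddings[sample_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2D "embeddings"; real CLIP vectors are much higher-dimensional
toy_embeddings = {
    "street": [1.0, 0.1],
    "highway": [0.9, 0.3],
    "tunnel": [0.0, 1.0],
}
nearest = top_k_similar([1.0, 0.0], toy_embeddings, k=2)
print(nearest)  # ['street', 'highway']
```

A text query works the same way once the prompt has been embedded into the same vector space as the images, which is what a CLIP-style model provides.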

In [None]:
model = foz.load_zoo_model("clip-vit-base32-torch")

# Embeddings are stored on each sample in the "embedding_key" field
dataset.compute_embeddings(model, embeddings_field="embedding_key")

In [None]:

# Compute visualization from the stored embeddings field
results = fob.compute_visualization(
    dataset, embeddings="embedding_key", seed=51, brain_key="bdd100k_key",
)

In [None]:
# Create a similarity index from the precomputed embeddings
sklearn_idx = fob.compute_similarity(
    dataset,
    brain_key="test_idx",
    model="clip-vit-base32-torch",
    embeddings="embedding_key",
)

In [None]:
session = fo.launch_app(dataset, port=5151, auto=False)


In [None]:
# Query by first image sample
query = dataset.first().id
view = dataset.sort_by_similarity(query, brain_key="test_idx", k=10)
session.view = view

In [None]:
dataset.reload()

print(dataset)
print(dataset.first())

In [None]:
# Query by text prompt
# DETECTIONS: bike  bus  car  motor  person  rider  traffic light  traffic sign  train  truck
# WEATHER: overcast  foggy  rainy  snowy  undefined  partly cloudy  clear
# SCENE: city street  gas stations  highway  parking lot  residential  tunnel 
# TIME OF DAY: daytime  night  dawn/dusk

query_txt = "bike"
# brain_key must name the similarity index, not the embeddings field
view_txt = dataset.sort_by_similarity(query_txt, k=50, brain_key="test_idx")
session.view = view_txt

In [None]:
mosaic_index = fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    backend="mosaic",
    brain_key="mosaic_index_5",
    index_name="fiftyone_index",
)

When you run ```compute_similarity()```, FiftyOne computes the embeddings on the fly, and you can see the vector values in the Databricks schema that we set up previously.

![Image](assets/databricks_view.png)

https://github.com/user-attachments/assets/89ad39c8-baef-420b-a3a2-ccb074046b51

In [None]:
# Retrieve embeddings for a view
ids = dataset.take(10).values("id")
embeddings, sample_ids, _ = mosaic_index.get_embeddings(sample_ids=ids)
print(embeddings.shape)  # (10, 512)
print(sample_ids.shape)  # (10,)

In [None]:
# Get all embeddings from the Mosaic similarity index created above
embeddings, sample_ids, _ = mosaic_index.get_embeddings()

# Confirm shape
print("Embeddings shape:", embeddings.shape)  # (N, D) => N samples, D dimensions
print("Sample IDs shape:", sample_ids.shape)
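
Embeddings retrieved from an index come back paired with their sample IDs, and that order may not match the order of the dataset or view you want to plot. A minimal sketch of realigning rows by ID is shown below; `align_rows` is a hypothetical helper written for this example, not a FiftyOne function.

```python
def align_rows(target_ids, source_ids, rows):
    """Reorder `rows` (aligned with `source_ids`) to follow `target_ids`.

    Raises KeyError if a target ID is missing from `source_ids`.
    """
    row_by_id = dict(zip(source_ids, rows))
    return [row_by_id[sample_id] for sample_id in target_ids]


# Toy example: rows arrive in index order "a", "b" but are wanted as "b", "a"
aligned = align_rows(["b", "a"], ["a", "b"], [[0.1, 0.2], [0.3, 0.4]])
print(aligned)  # [[0.3, 0.4], [0.1, 0.2]]
```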

### 📦 Install `umap-learn`
`umap-learn` is required to visualize high-dimensional embeddings in 2D or 3D.

```bash
pip install umap-learn
```
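
Before computing the visualization, it can help to confirm that the package is importable in the current kernel. The check below is a small sketch using only the standard library; `is_installed` is a hypothetical helper name. Note that the `umap-learn` distribution is imported under the module name `umap`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


# The `umap-learn` package is imported as `umap`
if not is_installed("umap"):
    print("umap-learn is missing; run `pip install umap-learn` first")
```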

In [None]:
# Compute the visualization
fob.compute_visualization(
    dataset,                      # your FiftyOne dataset
    embeddings=embeddings,        # the N x D matrix
    brain_key="mosaic_viz",       # identifier for visualization (name it!)
    sample_ids=sample_ids         # make sure this matches the dataset
)
session = fo.launch_app(dataset)


![Image](assets/emb.png)

## Query the Similarity Index

In [None]:
# Query by first image sample
query = dataset.first().id
view = dataset.sort_by_similarity(query, brain_key="mosaic_index_5", k=10)
session.view = view

![Image](assets/similarity.png)

In [None]:
# Query by text prompt
query_txt = "a beach"
view_txt = dataset.sort_by_similarity(query_txt, k=50, brain_key="mosaic_index_5")
session.view = view_txt

![Image](assets/beach.png)

## Cleanup (Optional)

In [None]:
# Delete Mosaic index and run record
mosaic_index.cleanup()
dataset.delete_brain_run("mosaic_index_5")  # must match the brain_key used above
#dataset.delete_brain_runs()