You'll use the reverse image search plugin in this notebook. You can install it like so:

In [None]:
!fiftyone plugins download https://github.com/jacobmarks/reverse-image-search-plugin

In [None]:
import os

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

For this example, you'll use a version of the [Stanford Cars dataset](https://ai.stanford.edu/~jkrause/papers/fgvc13.pdf) that a Hugging Face community member uploaded. 

FiftyOne integrates with Hugging Face, so you can load datasets directly from the Hub! Learn more about the integration [here](https://docs.voxel51.com/integrations/huggingface.html) and about loading datasets from the Hub [here](https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub).

In [None]:
import fiftyone.utils.huggingface as fouh

stanford_cars_dataset = fouh.load_from_hub(
    "Multimodal-Fatima/StanfordCars_train",
    split="train",
    format="ParquetFilesDataset",
    max_samples=2551,
    name="stanford-cars",
    persist=True,
    overwrite=True,
)

In [None]:
stanford_cars_dataset

Note that the dataset above is persisted to disk (via the `persist=True` argument). The next time you load the dataset, all you have to do is run the following:

```python
stanford_cars_dataset = fo.load_dataset("stanford-cars")
```

The Hugging Face version of the dataset includes several fields that aren't needed for this example, so you can remove them:

In [None]:
stanford_cars_dataset.delete_sample_fields(
    [
        "clip_tags_ViT_L_14",
        "LLM_Description_gpt3_downstream_tasks_ViT_L_14",
        "LLM_Description_gpt3_downstream_tasks_visual_genome_ViT_L_14",
        "blip_caption_beam_5",
        "Attributes_ViT_L_14_text_davinci_003_full",
        "Attributes_ViT_L_14_text_davinci_003_stanfordcars",
        "clip_tags_ViT_L_14_with_openai_classes",
        "clip_tags_ViT_L_14_wo_openai_classes",
        "clip_tags_ViT_L_14_simple_specific",
        "clip_tags_ViT_L_14_ensemble_specific",
        "clip_tags_ViT_B_16_simple_specific",
        "clip_tags_ViT_B_32_ensemble_specific",
        "Attributes_ViT_B_16_descriptors_text_davinci_003_full",
        "Attributes_LAION_ViT_H_14_2B_descriptors_text_davinci_003_full",
        "clip_tags_LAION_ViT_H_14_2B_simple_specific",
        "clip_tags_LAION_ViT_H_14_2B_ensemble_specific",
        "Attributes_ViT_L_14_descriptors_text_davinci_003_full",
        "clip_tags_ViT_B_16_ensemble_specific"
        ]
)

## **🦒 FiftyOne Model Zoo**

The [FiftyOne Model Zoo 🦒](https://docs.voxel51.com/user_guide/model_zoo/models.html) is a collection of pre-trained models that can be easily downloaded and run on FiftyOne Datasets. 

📂 It provides a convenient and consistent interface for a wide variety of models, making it simple to integrate pre-trained models into your workflow.


In [None]:
clip_model = foz.load_zoo_model(
    name="clip-vit-base32-torch",
    install_requirements=True,
)

mobilenet_model = foz.load_zoo_model(
    name="mobilenet-v2-imagenet-torch",
    install_requirements=True,
)

densenet_model = foz.load_zoo_model(
    name="densenet121-imagenet-torch",
    install_requirements=True,
)

[FiftyOne Brain](https://docs.voxel51.com/user_guide/brain.html) can generate embeddings and build indexes for images, objects, or patches within images, which you can then use for visualization and similarity search. It is compatible with various embedding models, dimensionality reduction techniques, and similarity backends.
    
🧠 With the Brain you can:

- 👁️ Visualize your dataset in a low-dimensional embedding space to reveal patterns and clusters

- 🗂️ Index your data by similarity to easily find similar examples

- 🦄 Compute uniqueness measures for images to identify the most valuable unlabeled data to annotate

- ⚠️ Identify possible label mistakes in your annotations

- 💡 Find the hardest samples for your model to learn from

Brain runs are tracked and can be listed, loaded, renamed, and deleted via `Dataset` methods such as `list_brain_runs()`, `load_brain_results()`, and `rename_brain_run()`.

In [None]:
stanford_cars_dataset.compute_embeddings(
    model=clip_model,
    embeddings_field="clip_embeddings",
    progress=True,  
)

stanford_cars_dataset.compute_embeddings(
    model=mobilenet_model,
    embeddings_field="mobilenet_embeddings",
    progress=True,
)

stanford_cars_dataset.compute_embeddings(
    model=densenet_model,
    embeddings_field="densenet_embeddings",
    progress=True,  
)

### **📊 Compute Visualization**

The `compute_visualization()` method 📊 generates interactive visualizations of your dataset or patches in a low-dimensional space, using a dimensionality reduction technique such as UMAP, t-SNE, or PCA.

Visualizing datasets in low-dimensional embedding spaces helps reveal:

🔹 Patterns and clusters that can help identify failure modes

🔹 Similar examples and outliers

🔹 New samples to add to your training set, helping you improve model performance
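To build some intuition for what dimensionality reduction does, here is a minimal pure-Python sketch of PCA, the simplest of the three techniques named above. It is only an illustration under stated assumptions: the function name `pca_2d`, the power-iteration approach, and the toy data are all made up for this example, and this is not how `compute_visualization()` is implemented internally.

```python
import math


def pca_2d(rows, iterations=200):
    """Project rows onto their top two principal components (illustrative sketch)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d)
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    components = []
    for _component in range(2):
        vec = [1.0 / math.sqrt(d)] * d  # deterministic starting vector
        for _step in range(iterations):
            # Power iteration: repeatedly multiply by the covariance and renormalize
            nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0:
                break
            vec = [x / norm for x in nxt]
        eigenvalue = sum(
            vec[a] * sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)
        )
        components.append(vec)
        # Deflation: subtract the direction just found before searching for the next one
        cov = [
            [cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(d)]
            for a in range(d)
        ]

    # Coordinates of each centered row along the two components
    return [[sum(r[j] * c[j] for j in range(d)) for c in components] for r in centered]


# Toy 3-D "embeddings" (hypothetical values)
rows = [
    (-2.0, -1.0, 0.1),
    (-1.0, 1.0, -0.1),
    (0.0, 0.0, 0.0),
    (1.0, -1.0, 0.1),
    (2.0, 1.0, -0.1),
]
projected = pca_2d(rows)
```

Each row of `projected` is a 2D point that could be drawn on a scatter plot. The first coordinate captures the direction of greatest variance and the second captures the next-largest direction orthogonal to it.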

In [None]:
import fiftyone.brain as fob

fob.compute_visualization(
    stanford_cars_dataset,
    embeddings="clip_embeddings",
    method="umap",
    brain_key="umap_2d_clip",
    num_dims=2,
    num_workers=os.cpu_count(),
    progress=True,
)

fob.compute_visualization(
    stanford_cars_dataset,
    embeddings="densenet_embeddings",
    method="umap",
    brain_key="umap_2d_densenet",
    num_dims=2,
    num_workers=os.cpu_count(),
    progress=True,
)

fob.compute_visualization(
    stanford_cars_dataset,
    embeddings="mobilenet_embeddings",
    method="umap",
    brain_key="umap_2d_mobilenet",
    num_dims=2,
    num_workers=os.cpu_count(),
    progress=True,
)

### **🦄 Compute Uniqueness**

The `compute_uniqueness()` method 🦄 computes a uniqueness score for each image by comparing its content to all other images in the dataset. In FiftyOne, uniqueness is calculated using a classification model-based approach:

1. **Embedding**: Each sample is embedded into a vector space using the model.

2. **k-Nearest Neighbors (kNN)**: For each sample, find its k nearest neighbors in the embedding space.

3. **Distance-based Scoring**: Uniqueness is proportional to the distances from a sample to its kNNs.

4. **Intuition**: Samples far from others in the embedding space are considered more unique.

Key points:

- `k` is a parameter of the uniqueness function
- This method prioritizes outliers rather than cluster representatives
- It contrasts with "representativeness," which emphasizes samples central to dense clusters
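The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not FiftyOne's actual implementation: the function name `knn_uniqueness`, the choice of mean Euclidean distance, the max-scaling to `[0, 1]`, and the toy 2D points are all assumptions made for this example.

```python
import math


def knn_uniqueness(embeddings, k=3):
    """Score each vector by its mean distance to its k nearest neighbors,
    then scale the scores so that the largest one equals 1.0."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings")
    k = min(k, n - 1)

    raw = []
    for i, vec in enumerate(embeddings):
        # Sorted distances from sample i to every other sample
        dists = sorted(
            math.dist(vec, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        raw.append(sum(dists[:k]) / k)

    top = max(raw)
    return [score / top if top > 0 else 0.0 for score in raw]


# Toy 2D "embeddings": a tight cluster plus one outlier (hypothetical values)
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
scores = knn_uniqueness(points, k=2)
most_unique = max(range(len(points)), key=scores.__getitem__)
```

Here the outlier at `(5.0, 5.0)` receives the highest score because its nearest neighbors are far away, while the three clustered points score close to zero.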

In [None]:
import fiftyone.brain as fob

fob.compute_uniqueness(
    stanford_cars_dataset,
    embeddings = "clip_embeddings",
    uniqueness_field="clip_uniqueness",
    num_workers = os.cpu_count(),
    progress=True,
)

fob.compute_uniqueness(
    stanford_cars_dataset,
    embeddings = "mobilenet_embeddings",
    uniqueness_field="mobilenet_uniqueness",
    num_workers = os.cpu_count(),
    progress=True,
)

fob.compute_uniqueness(
    stanford_cars_dataset,
    embeddings = "densenet_embeddings",
    uniqueness_field="densenet_uniqueness",
    num_workers = os.cpu_count(),
    progress=True,
)

### **🔍 Compute Similarity**

The `compute_similarity()` method 🔍 indexes your data by similarity, allowing you to efficiently search for similar samples or objects both programmatically and via point-and-click in the App.

Choose from multiple backends to power your similarity index, including:

🔹 LanceDB

🔹 Milvus

🔹 MongoDB

🔹 Qdrant (used in this notebook)

...[and more](https://docs.voxel51.com/user_guide/brain.html#similarity-backends)!
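Conceptually, a similarity index answers the question "which stored vectors are closest to this query vector?". The sketch below shows a brute-force version of that lookup using cosine similarity. It is an illustration only: the helper names `cosine_similarity` and `top_k_similar` and the toy vectors are hypothetical, and real backends use far more efficient approximate search structures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query, index, k=2):
    """Return (sample_id, similarity) pairs for the k most similar vectors."""
    ranked = sorted(
        ((sample_id, cosine_similarity(query, vec)) for sample_id, vec in index.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Toy index mapping sample IDs to 3-D "embeddings" (hypothetical values)
index = {
    "sedan": (1.0, 0.1, 0.0),
    "coupe": (0.9, 0.2, 0.1),
    "truck": (0.0, 1.0, 0.2),
}
results = top_k_similar((1.0, 0.0, 0.0), index, k=2)
```

With this toy data, `results` ranks `"sedan"` first and `"coupe"` second, since their vectors point in nearly the same direction as the query.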

In [None]:
# Pass `model` so that the similarity index has a model attached to it,
# which it needs in order to be used for search.
# The Qdrant backend needs a running Qdrant server, for example:
#   docker run -p "6333:6333" -p "6334:6334" -d qdrant/qdrant
import fiftyone.brain as fob

clip_sim_index = fob.compute_similarity(
    stanford_cars_dataset,
    model="clip-vit-base32-torch",
    brain_key="clip_sim_index",
    embeddings="clip_embeddings",
    backend="qdrant",
    metric="cosine",
    progress=True,
)

mobilenet_sim_index = fob.compute_similarity(
    stanford_cars_dataset,
    model="mobilenet-v2-imagenet-torch",
    brain_key="mbnet_sim_index",
    embeddings="mobilenet_embeddings",
    backend="qdrant",
    metric="cosine",
    progress=True,
)

densenet_sim_index = fob.compute_similarity(
    stanford_cars_dataset,
    model="densenet121-imagenet-torch",
    brain_key="densenet_sim_index",
    embeddings="densenet_embeddings",
    backend="qdrant",
    metric="cosine",
    progress=True,
)

### **🎯 Compute Representativeness**

You can use the [`compute_representativeness()`](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_representativeness) method from the FiftyOne Brain to find representative samples in your dataset.

The core idea is that samples closer to cluster centers are considered more representative of the dataset, while the optional downweighting step helps ensure a diverse selection of representative samples.

1. **Embedding Generation**:
   - Samples are embedded into a vector space using a specified model or pre-computed embeddings.
   - For ROI-based tasks, embeddings of multiple regions are aggregated (e.g., by taking the mean).

2. **Clustering**:
   - The algorithm uses either K-means (default) or MeanShift clustering on the embeddings.
   - For K-means, the number of clusters (N) is set to 20 by default.

3. **Distance Calculation**:
   - For each sample, the distance to its nearest cluster center is computed.

4. **Representativeness Scoring**:
   - The initial representativeness score is calculated as: 1 / (1 + distance_to_cluster_center)
   - This gives higher scores to samples closer to cluster centers.

5. **Normalization**:
   - Scores are normalized, either globally or per-cluster (default is per-cluster or "local" normalization).

6. **Optional Redundancy Reduction**:
   - If the "cluster-center-downweight" method is used, an additional step reduces redundancy.
   - It iteratively downweights similar samples within a specified radius to promote diversity.
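Steps 3 to 5 above can be sketched in plain Python as follows. This is a simplified illustration rather than the library's implementation: it assumes the cluster centers are already known (the clustering step is skipped), and it assumes that per-cluster normalization means dividing by the best score in each cluster. The function name `representativeness` and the toy data are hypothetical.

```python
import math


def representativeness(embeddings, centers):
    """Score = 1 / (1 + distance to nearest center), normalized per cluster."""
    assigned = []
    for vec in embeddings:
        dists = [math.dist(vec, center) for center in centers]
        nearest = min(range(len(centers)), key=dists.__getitem__)
        assigned.append((nearest, 1.0 / (1.0 + dists[nearest])))

    # Per-cluster ("local") normalization: the best sample in each cluster gets 1.0
    best = {}
    for cluster, score in assigned:
        best[cluster] = max(best.get(cluster, 0.0), score)
    return [score / best[cluster] for cluster, score in assigned]


# Two assumed cluster centers and four toy 2D "embeddings" (hypothetical values)
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.0, 0.0), (3.0, 4.0), (10.0, 11.0), (10.0, 13.0)]
rep_scores = representativeness(points, centers)
```

In this toy example the sample closest to each center ends up with a score of 1.0, and samples farther from their center score lower within that cluster.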



In [None]:
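import fiftyone.brain as fob

# Give each run its own field; otherwise every call writes to the default
# "representativeness" field and overwrites the previous model's scores
fob.compute_representativeness(
    stanford_cars_dataset,
    representativeness_field="clip_representativeness",
    embeddings="clip_embeddings",
)

fob.compute_representativeness(
    stanford_cars_dataset,
    representativeness_field="mobilenet_representativeness",
    embeddings="mobilenet_embeddings",
)

fob.compute_representativeness(
    stanford_cars_dataset,
    representativeness_field="densenet_representativeness",
    embeddings="densenet_embeddings",
)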

In [None]:
session = fo.launch_app(stanford_cars_dataset)