In [1]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load the dataset from Hugging Face if it's your first time using it

# dataset = fouh.load_from_hub(
#     "Voxel51/Coursera_lecture_dataset_train", 
#     dataset_name="lecture_dataset_train", 
#     persistent=True
#     )

In [2]:
# The dataset is already saved locally, so load it by name
cloned_dataset = fo.load_dataset("lecture_dataset_train_clone")

### Similar images (near duplicates)

Removing duplicates and near-duplicates can improve model training by avoiding accidental concept imbalance. Duplicated data is a common problem in dataset creation and can be challenging to identify, especially when small data manipulations have occurred. For model training workflows, it's crucial to maximize the value of each data sample. Near-duplicates, which are very similar samples, are inherently less valuable for training models.

The FiftyOne Brain's [`compute_similarity`](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_similarity) method indexes images or object patches by similarity. This allows you to:

1. Find similar examples to diagnose model failures
2. Mine data to augment training sets
3. Use `sort_by_similarity` to programmatically sort datasets by similarity to chosen images or patches

In [None]:
import fiftyone.brain as fob

similarity_index = fob.compute_similarity(
    samples=cloned_dataset,
    embeddings="mobilenet_v2_embeddings",
    backend="sklearn",
    brain_key="mobilenet_similarity",
    metric="cosine"
)

In [None]:
fo.launch_app(cloned_dataset)

You can also use [`sort_by_similarity`](https://docs.voxel51.com/api/fiftyone.brain.html#fiftyone.brain.compute_similarity) in the SDK:

In [None]:
# Pick a random sample ID from the dataset to use as the query

query_id = cloned_dataset.take(1).first().id

similarity_view = similarity_index.sort_by_similarity(query_id, k=10)
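
To make the idea behind similarity sorting concrete, here is a minimal pure-Python sketch that ranks samples by cosine distance to a query embedding. It is not FiftyOne's implementation; the helper names `cosine_distance` and `rank_by_similarity`, the sample IDs, and the 3-dimensional embeddings are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_similarity(embeddings, query_id, k):
    """Return the k sample IDs closest to the query, nearest first."""
    query = embeddings[query_id]
    ranked = sorted(
        embeddings,
        key=lambda sample_id: cosine_distance(query, embeddings[sample_id]),
    )
    return ranked[:k]


# Hypothetical sample IDs and toy embeddings (assumptions for illustration)
toy_embeddings = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
    "d": [0.0, 0.0, 1.0],
}

nearest = rank_by_similarity(toy_embeddings, "a", k=2)
print(nearest)
```

In this sketch the query ranks first because its distance to itself is zero, followed by the sample whose embedding points in almost the same direction.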

In [None]:
fo.launch_app(similarity_view)

We can also use the similarity index to detect near-duplicate images in the dataset. For example, let's use the [`find_duplicates`](https://docs.voxel51.com/api/fiftyone.brain.similarity.html#fiftyone.brain.similarity.DuplicatesMixin.find_duplicates) method to tag the most similar images in our dataset as duplicates:

In [None]:
similarity_index.find_duplicates(fraction=0.01)

The `neighbors_map` property of `similarity_index` summarizes the results. Its keys are the sample IDs of the nearest non-duplicate images, and each value is a list of `(id, distance)` tuples for the duplicates assigned to that image.

In [None]:
print(similarity_index.neighbors_map)
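
A dictionary of this shape is easier to read once flattened into one row per duplicate. The sketch below does that with a hand-written stand-in for `neighbors_map`; the IDs, the distances, and the helper `flatten_neighbors_map` are hypothetical and only illustrate the structure described above.

```python
# Hypothetical data in the shape {kept_id: [(duplicate_id, distance), ...]}
toy_neighbors_map = {
    "keep_1": [("dup_1", 0.02), ("dup_2", 0.05)],
    "keep_2": [("dup_3", 0.01)],
}


def flatten_neighbors_map(neighbors_map):
    """Return (duplicate_id, kept_id, distance) rows, closest pairs first."""
    rows = []
    for kept_id, duplicates in neighbors_map.items():
        for duplicate_id, distance in duplicates:
            rows.append((duplicate_id, kept_id, distance))
    return sorted(rows, key=lambda row: row[2])


rows = flatten_neighbors_map(toy_neighbors_map)
duplicate_ids = [duplicate_id for duplicate_id, _, _ in rows]

for duplicate_id, kept_id, distance in rows:
    print(f"{duplicate_id} duplicates {kept_id} (distance {distance})")
```

Sorting by distance puts the most confident duplicate pairs at the top, which is a convenient order for manual review.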

The `fraction` argument is the desired fraction of images/patches to tag as duplicates, in [0, 1]. With `fraction=0.01`, roughly 1% of the samples are tagged:

In [None]:
similarity_index.find_duplicates(fraction=0.01)

You can view this information in the app using the [`duplicates_view`](https://docs.voxel51.com/api/fiftyone.brain.similarity.html#fiftyone.brain.similarity.DuplicatesMixin.duplicates_view) method, which arranges duplicate images next to their corresponding nearest in-sample image, along with additional fields for image type and nearest in-sample ID/distance.

In [None]:
duplicates_view = similarity_index.duplicates_view(
    type_field="dup_type",
    id_field="dup_id",
    dist_field="dup_dist",
)

In [None]:
fo.launch_app(duplicates_view)

Alternatively, pass an embedding distance threshold to `find_duplicates` via the `thresh` parameter. The non-duplicate set is then (approximately) the largest set in which all pairwise distances between non-duplicate images are greater than this threshold:

In [None]:
similarity_index.find_duplicates(thresh=0.51)
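
One simple way to obtain a set with this pairwise-distance property is a greedy pass over the embeddings, sketched below. This is an illustrative stand-in and not necessarily how FiftyOne computes its result; the function `greedy_threshold_dedup`, the Euclidean metric (the notebook uses cosine), and the toy points are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_threshold_dedup(embeddings, thresh, dist_fn=euclidean):
    """Keep a sample only if it is farther than thresh from every kept sample.

    Returns (kept_ids, neighbors_map), where neighbors_map maps each kept ID
    to a list of (duplicate_id, distance) tuples.
    """
    kept_ids = []
    neighbors_map = {}
    for sample_id, vector in embeddings.items():
        # Find the nearest sample that has already been kept
        nearest_id, nearest_dist = None, float("inf")
        for kept_id in kept_ids:
            dist = dist_fn(vector, embeddings[kept_id])
            if dist < nearest_dist:
                nearest_id, nearest_dist = kept_id, dist
        if nearest_id is not None and nearest_dist <= thresh:
            # Too close to a kept sample: record it as a duplicate
            neighbors_map.setdefault(nearest_id, []).append((sample_id, nearest_dist))
        else:
            kept_ids.append(sample_id)
    return kept_ids, neighbors_map


# Hypothetical 2-dimensional embeddings forming two tight clusters
toy_points = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [1.0, 0.0],
    "d": [1.05, 0.0],
}

kept, dup_map = greedy_threshold_dedup(toy_points, thresh=0.2)
print(kept)
print(dup_map)
```

Because a sample is kept only when its nearest kept neighbor is farther than `thresh`, every pair of kept samples ends up more than `thresh` apart. The result depends on iteration order, which is one reason such a set is only approximately the largest possible.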

### Exact duplicates

`compute_exact_duplicates` identifies exact duplicates by comparing file hashes: samples whose files hash to the same value are duplicates. It returns a dictionary in which the first instance of each duplicated file is the key and the IDs of its subsequent duplicates form the corresponding list value.

In [None]:
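import fiftyone.brain as fob
import os

# Keep the returned dictionary so the duplicate IDs can be inspected later
exact_duplicates = fob.compute_exact_duplicates(
    samples=cloned_dataset,
    num_workers=os.cpu_count(),
    progress=True,
)
print(exact_duplicates)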
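
The hash-and-group idea can be sketched with the standard library alone. The example below is not FiftyOne's implementation: the choice of SHA-256, the helper names, and the temporary files are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def file_hash(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_exact_duplicates(paths):
    """Map the first file seen with each hash to its later duplicates."""
    first_seen = {}
    duplicates = {}
    for path in paths:
        digest = file_hash(path)
        if digest in first_seen:
            duplicates.setdefault(first_seen[digest], []).append(path)
        else:
            first_seen[digest] = path
    return duplicates


# Write three small hypothetical files, two of which are byte-identical
with tempfile.TemporaryDirectory() as tmp_dir:
    contents = {
        "a.bin": b"lecture-frame",
        "b.bin": b"lecture-frame",
        "c.bin": b"other-frame",
    }
    paths = []
    for name, data in contents.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)

    grouped = group_exact_duplicates(paths)
    demo_result = {
        os.path.basename(first): [os.path.basename(p) for p in dups]
        for first, dups in grouped.items()
    }

print(demo_result)
```

Files without duplicates never appear in the result, which mirrors the dictionary shape described above: first instance as key, later copies as list values.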

In [None]:
# Select a pair of samples by ID (these IDs are specific to this copy of the dataset)
exact_dups = cloned_dataset.select(['66a2f304ce2f9d11d9a17adc','66a2f315ce2f9d11d9a1a706'])

In [None]:
fo.launch_app(exact_dups)

### Unique images

Identifying the most unique samples helps in creating a high-quality, diverse training set.

`compute_uniqueness` adds a uniqueness score to each sample, stored in the `uniqueness` field by default, based on how distinct the sample is from the others. It works from pixel data, so it handles both labeled and unlabeled samples. If neither embeddings nor a model is provided, a default model generates the embeddings.

In [None]:
import fiftyone.brain as fob
import os 

fob.compute_uniqueness(
    samples=cloned_dataset,
    embeddings="mobilenet_v2_embeddings",
    num_workers=os.cpu_count(),
    progress=True
)
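
For intuition about what a uniqueness score measures, the sketch below scores each sample by its mean distance to its `k` nearest neighbors and scales the result to [0, 1]. This is a simplified stand-in, not FiftyOne's actual scoring; the function `knn_uniqueness`, the Euclidean metric, and the toy points are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_uniqueness(embeddings, k=2):
    """Score each sample by mean distance to its k nearest neighbors.

    Scores are divided by the largest raw score so they fall in [0, 1].
    Requires at least two samples.
    """
    raw_scores = {}
    for sample_id, vector in embeddings.items():
        distances = sorted(
            euclidean(vector, other)
            for other_id, other in embeddings.items()
            if other_id != sample_id
        )
        nearest = distances[:k]
        raw_scores[sample_id] = sum(nearest) / len(nearest)

    max_score = max(raw_scores.values())
    if max_score == 0:
        # All samples are identical, so none is unique
        return {sample_id: 0.0 for sample_id in raw_scores}
    return {sample_id: score / max_score for sample_id, score in raw_scores.items()}


# Hypothetical embeddings: three clustered points and one outlier
toy_points = {
    "a": [0.0, 0.0],
    "b": [0.1, 0.0],
    "c": [0.0, 0.1],
    "d": [5.0, 5.0],
}

scores = knn_uniqueness(toy_points, k=2)
most_unique_first = sorted(scores, key=scores.get, reverse=True)
print(most_unique_first[0])
```

The outlier receives the highest score, while the clustered points score close to zero because each has near neighbors.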

In [None]:
# Sort by uniqueness (most unique first)
unique_view = cloned_dataset.sort_by("uniqueness", reverse=True)

In [None]:
fo.launch_app(unique_view)

### Using the deduplication plugin

As discussed in this lesson, duplicate and near-duplicate data make it hard to build a high-quality dataset for training machine learning models. Duplicates come in two flavours:

1. **Exact duplicates:** pixel-perfect matches, where one image is a down-to-the-bit copy of another

2. **Approximate duplicates:** images that are not identical but are close enough, under a chosen similarity metric, to fall within a set threshold

Deduplication is the task of removing these exact and approximate duplicates from a dataset. 

With the [Image Deduplication Plugin](https://github.com/jacobmarks/image-deduplication-plugin), you can deduplicate your entire dataset from within the FiftyOne App, without writing any code. This plugin makes it easy to:

- Identify exact duplicates using hash functions

- Detect near-duplicates using embedding models and similarity thresholds

- Interactively view duplicate images in the FiftyOne App

- Remove all duplicates or retain representative images

#### Main operators

**Duplicate Detection:**

- `find_approximate_duplicate_images`: Locates near-duplicate images using similarity indices

- `find_exact_duplicate_images`: Identifies exact duplicates using hash functions

**Visualization:**

- `display_approximate_duplicate_groups`: Shows groups of near-duplicate images

- `display_exact_duplicate_groups`: Presents groups of exact duplicate images

**Duplicate Removal:**

- `remove_all_approximate_duplicates`: Eliminates all near-duplicate images

- `remove_all_exact_duplicates`: Removes all exact duplicate images

**Selective Deduplication:**

- `deduplicate_approximate_duplicates`: Removes near-duplicates while keeping representative images

- `deduplicate_exact_duplicates`: Eliminates exact duplicates while retaining representative images



In [None]:
from fiftyone import plugins

plugins.download_plugin(
    url_or_gh_repo="https://github.com/jacobmarks/image-deduplication-plugin"
)

plugins.install_plugin_requirements(
    plugin_name="@jacobmarks/image_deduplication"
)

In [None]:
fo.launch_app(cloned_dataset)

Everything discussed here can also be done at the patch level, for example by passing a `patches_field` to `compute_similarity`.