In [None]:
# In case you haven't prepared your environment, install the following packages
# pip install fiftyone

FiftyOne has an integration with Hugging Face. You can read the documentation for details [here](https://docs.voxel51.com/integrations/huggingface.html).

In [None]:
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load the dataset from Hugging Face if it's your first time using it

dataset = fouh.load_from_hub(
    "Voxel51/Coursera_lecture_dataset_train", 
    dataset_name="lecture_dataset_train", 
    persistent=True
    )

test_dataset = fouh.load_from_hub(
    "Voxel51/Coursera_lecture_dataset_test", 
    dataset_name="lecture_dataset_test", 
    persistent=True
    )

dataset.compute_metadata()

test_dataset.compute_metadata()

Note that downloading the datasets in Google Colab tends to take a very long time. In my tests, local downloads took about 10 minutes, while the same downloads on Google Colab consistently took 1-2 hours. My recommendation is to do everything locally if you can.

# Brief introduction to the FiftyOne App

In [None]:
# Clone the dataset to avoid modifying the original dataset
cloned_dataset = dataset.clone(name="lecture_dataset_train_clone", persistent=True)

test_dataset = test_dataset.clone(name="lecture_dataset_test_clone", persistent=True)

In [None]:
# Launch the App
session = fo.launch_app(cloned_dataset)

## Profiling Your FiftyOne Dataset: A Quick Overview

This profiling strategy should give you valuable insights into your FiftyOne dataset, helping you understand its characteristics and spot potential challenges or opportunities for further analysis or model training.

Dataset profiling has several important aspects, particularly for a dataset of image samples with object detections. Here's what you'll cover:

1. Basic dataset information: Getting an overview of the dataset size and structure.

2. Sample examination: Looking at individual samples to understand their fields and content.

3. Detection statistics: Analyzing the number and distribution of detections across the dataset.

4. Label distribution: Examining the frequency and variety of labels in your dataset.

5. Image resolution analysis: Understanding the typical sizes of images in your dataset.

6. Advanced profiling: Using FiftyOne's built-in tools for comprehensive dataset analysis.

By the end of this section, you'll have a solid grasp on your dataset's composition, which will inform your subsequent analysis and model development steps.

#### Metadata about the dataset


In [None]:
cloned_dataset

#### Get the first sample of a dataset, what fields do you see?


In [None]:
first_sample = cloned_dataset.first()

In [None]:
first_sample

In [None]:
first_sample.field_names

#### Metadata about the sample


In [None]:
first_sample.filepath

In [None]:
first_sample.metadata

#### Count of detections on sample level

In [None]:
first_sample.ground_truth.detections

In [None]:
len(first_sample.ground_truth.detections)

#### Count of unique labels on sample level


In [None]:
label_counts = {}

for detection in first_sample.ground_truth.detections:
    label = detection.label
    if label in label_counts:
        label_counts[label] += 1
    else:
        label_counts[label] = 1

In [None]:
label_counts
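The manual tally above can also be written with `collections.Counter` from the Python standard library. This is a minimal sketch, not FiftyOne code: the `labels` list is a hypothetical stand-in for `[d.label for d in first_sample.ground_truth.detections]`, and the label values are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for:
#   [d.label for d in first_sample.ground_truth.detections]
labels = ["tomato", "carrot", "tomato", "onion", "tomato"]

# Counter tallies how many times each label appears
label_counts = Counter(labels)

print(dict(label_counts))
```

A `Counter` behaves like a dictionary, so `label_counts["tomato"]` works the same way as with the hand-built dict.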

Or, you can create a `DatasetView` by selecting the ID of the first sample.

Dataset views are ordered collections of sample subsets from a dataset.
 
You can chain operations on dataset views to get the subset you want. Then, you can iterate over this subset to access the sample views directly. Each step in creating a dataset view is represented by a `fiftyone.core.stages.ViewStage` instance.

The stages of a dataset view define:
 
 - Which samples (and their order) to include
 - Which fields of each sample to include, possibly filtered
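To make the idea of chained stages concrete, here is a toy model in plain Python. It is only a sketch of the concept and not how FiftyOne implements views: the samples are ordinary dicts with made-up values, and the helpers `match`, `sort_by`, `select_fields`, and `apply_stages` are hypothetical functions that merely echo the names of FiftyOne stages.

```python
# Toy model only: plain functions stand in for FiftyOne's ViewStage objects.
samples = [
    {"id": "a", "filepath": "a.jpg", "num_detections": 3},
    {"id": "b", "filepath": "b.jpg", "num_detections": 12},
    {"id": "c", "filepath": "c.jpg", "num_detections": 7},
]


def match(predicate):
    """Stage that keeps only the samples satisfying `predicate`."""
    return lambda items: [s for s in items if predicate(s)]


def sort_by(key, reverse=False):
    """Stage that reorders samples by the given field."""
    return lambda items: sorted(items, key=lambda s: s[key], reverse=reverse)


def select_fields(*names):
    """Stage that keeps only the named fields of each sample."""
    return lambda items: [{n: s[n] for n in names} for s in items]


def apply_stages(items, stages):
    """Apply each stage in order; the original list is never modified."""
    for stage in stages:
        items = stage(items)
    return items


stages = [
    match(lambda s: s["num_detections"] > 5),   # which samples to include
    sort_by("num_detections", reverse=True),    # in which order
    select_fields("id", "num_detections"),      # which fields to include
]

view = apply_stages(samples, stages)
print(view)
```

The point to notice is that the underlying `samples` list is untouched; the "view" is just the result of applying the stages in sequence.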

You'll learn A LOT about [`DatasetView`](https://docs.voxel51.com/api/fiftyone.core.view.html?highlight=view#module-fiftyone.core.view) and [`ViewExpressions`](https://docs.voxel51.com/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression) throughout the lessons.

In [None]:
# Create a view with just this sample
sample_view = cloned_dataset.select(first_sample.id)

In [None]:
# Count detections in this sample
sample_detection_counts = sample_view.count("ground_truth.detections")
print(f"Detection counts for sample {first_sample.id}:")
print(sample_detection_counts)

In [None]:
# Count labels in this sample
sample_label_counts = sample_view.count_values("ground_truth.detections.label")

print(f"Label counts for sample {first_sample.id}:")
print(sample_label_counts)

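Note that if you try any of the above on just a `Sample` object, you'll encounter errors, because aggregations such as `count()` and `count_values()` are methods of sample collections, not of individual samples. Whatever you can do on a `Dataset`, you can also do on a `View`.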



#### How many samples are in the dataset


In [None]:
len(cloned_dataset)

In [None]:
cloned_dataset.count()

In [None]:
%%timeit
len(cloned_dataset)

In [None]:
%%timeit
cloned_dataset.count()

#### How many labels in the whole dataset


In [None]:
# This will return a list of distinct labels
distinct_labels = cloned_dataset.distinct("ground_truth.detections.label")

print(f"Number of distinct labels: {len(distinct_labels)}")
print("\n")
print(f"The distinct labels are: {distinct_labels}")

#### How many detections in the whole dataset


In [None]:
count_of_detections = cloned_dataset.count("ground_truth.detections")

print(f"Total number of detections: {count_of_detections}") 

In [None]:
# First, get the unsorted counts
unsorted_counts = cloned_dataset.count_values("ground_truth.detections.label")

# Then, sort the dictionary by count (values), highest first
sorted_counts = dict(sorted(unsorted_counts.items(), key=lambda item: item[1], reverse=True))

print("Sorted label counts:")
sorted_counts
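The sort above uses `item[1]`, the count, as its key. The sketch below contrasts that with sorting by `item[0]`, the label. It assumes a hypothetical `unsorted_counts` dict with made-up numbers in place of the real output of `count_values()`.

```python
# Hypothetical counts standing in for the dict returned by count_values()
unsorted_counts = {"carrot": 4, "tomato": 9, "onion": 1}

# Sort by count (item[1]), highest first, as in the cell above
by_count = dict(
    sorted(unsorted_counts.items(), key=lambda item: item[1], reverse=True)
)

# Sort alphabetically by label (item[0]) instead
by_label = dict(sorted(unsorted_counts.items(), key=lambda item: item[0]))

print(by_count)
print(by_label)
```

Both results hold the same pairs; only the iteration order differs, which is what matters when you print the dictionary or plot it.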

You can also explore the dataset with a variety of interactive plots. [Learn more in the docs](https://docs.voxel51.com/user_guide/plots.html).

Here's a [categorical histogram](https://docs.voxel51.com/api/fiftyone.core.plots.views.html#fiftyone.core.plots.views.CategoricalHistogram) of the number of detections per label.

In [None]:
from fiftyone.core.plots.views import CategoricalHistogram

CategoricalHistogram(
    init_view=cloned_dataset,
    field_or_expr="ground_truth.detections.label",
    xlabel="Label",
    title="Count of detections per label",
    order="frequency"
)

You can also get the distribution of detection counts per image.

In [None]:
from fiftyone import ViewField as F

CategoricalHistogram(
    init_view=cloned_dataset,
    field_or_expr="ground_truth",
    expr=F("detections").length(),
    title="Count of Images by Number of Detections",
    xlabel="Number of Detections per image",
    template={
        "layout": {
            "xaxis": {
                "range": [0, 30]  # This sets the x-axis range from 0 to 30
            }
        }
    }
)

Maybe now you're curious what those images are that have so many detections in them. You can make use of a [`ViewStage`](https://docs.voxel51.com/api/fiftyone.core.stages.html#fiftyone.core.stages.Match) to create a [`View`](https://docs.voxel51.com/api/fiftyone.core.view.html#fiftyone.core.view.DatasetView) and inspect that.

In [None]:
lots_of_detections_stage = fo.Match(F("ground_truth.detections").length() > 15) 
lots_of_detections_view = cloned_dataset.add_stage(lots_of_detections_stage)

# Equivalent to the two lines above, but in one line
# lots_of_detections_view = cloned_dataset.match(F("ground_truth.detections").length() > 15)

fo.launch_app(lots_of_detections_view)

And perhaps your exploration leads you to wonder what contexts the tomato class occurs in. For that, you can filter the dataset by label.

In [None]:
select_tomato_stage = fo.SelectBy("ground_truth.detections.label", "tomato")
tomato_view = cloned_dataset.add_stage(select_tomato_stage)
tomato_view

In [None]:
tomato_view.count_values("ground_truth.detections.label")

In [None]:
fo.launch_app(tomato_view)

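You can do something similar with `FilterLabels`, which behaves differently: it keeps only the matching detections inside each sample rather than leaving each sample's labels untouched.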

In [None]:
filter_tomato_stage = fo.FilterLabels("ground_truth.detections", F("label") == "tomato")

filter_tomato_view = cloned_dataset.add_stage(filter_tomato_stage)

filter_tomato_view

In [None]:
filter_tomato_view.count_values("ground_truth.detections.label")

Alternatively, you can do the following:

In [None]:
from fiftyone import ViewField as F
tomato_view_using_filter_label = cloned_dataset.filter_labels("ground_truth.detections", F("label")=="tomato")
tomato_view_using_filter_label

In [None]:
tomato_view_using_filter_label.count_values("ground_truth.detections.label")
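The difference between selecting samples and filtering labels is easiest to see on a tiny example. The sketch below is a toy model in plain Python, not FiftyOne code: the samples are dicts with made-up labels, and `select_samples_with` and `filter_labels_to` are hypothetical helpers. It assumes the default behaviour of `filter_labels()`, which omits samples that are left with no matching labels.

```python
from collections import Counter

# Toy data: each sample is a dict holding a list of detection labels
samples = [
    {"id": "a", "labels": ["tomato", "carrot"]},
    {"id": "b", "labels": ["onion"]},
    {"id": "c", "labels": ["tomato", "tomato", "onion"]},
]


def select_samples_with(items, label):
    """Sample-level: keep whole samples containing `label`; labels untouched."""
    return [s for s in items if label in s["labels"]]


def filter_labels_to(items, label):
    """Label-level: keep only matching labels, then drop emptied samples."""
    trimmed = [
        {"id": s["id"], "labels": [l for l in s["labels"] if l == label]}
        for s in items
    ]
    return [s for s in trimmed if s["labels"]]


def count_labels(items):
    """Tally every label across the given samples."""
    return Counter(l for s in items for l in s["labels"])


selected = select_samples_with(samples, "tomato")
filtered = filter_labels_to(samples, "tomato")

print(count_labels(selected))  # other labels from the same images remain
print(count_labels(filtered))  # only tomato remains
```

Both results contain the same two samples here, but the label counts differ: the sample-level selection still reports carrot and onion, while the label-level filter reports tomato only.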

Now you might be wondering how many images you have per label. The easiest way to find out is:

In [None]:
all_label_in_dataset = cloned_dataset.distinct("ground_truth.detections.label")

counts_of_images_with_label = {}

for _label in all_label_in_dataset:
    _label_view = cloned_dataset.filter_labels("ground_truth.detections", F("label")==_label)
    counts_of_images_with_label[_label] = _label_view.count()

counts_of_images_with_label
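The loop above counts images, not detections: each filtered view only contains the samples that have at least one detection with the given label, so `count()` returns a number of images. The toy sketch below, in plain Python with made-up data rather than FiftyOne objects, shows how the two numbers can differ when an image contains the same label more than once.

```python
from collections import Counter

# Toy data: each sample is a dict holding a list of detection labels
samples = [
    {"id": "a", "labels": ["tomato", "carrot"]},
    {"id": "b", "labels": ["onion"]},
    {"id": "c", "labels": ["tomato", "tomato", "onion"]},
]

# Every detection contributes one count
detections_per_label = Counter(l for s in samples for l in s["labels"])

# Each image contributes at most one count per label (set() removes repeats)
images_per_label = Counter(l for s in samples for l in set(s["labels"]))

print(dict(detections_per_label))
print(dict(images_per_label))
```

In this toy data, tomato has three detections but appears in only two images.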

We'll build on this foundation as we begin to explore the data further.

Take some time to review the following documentation (you can expect questions about it on the quiz):

- [`Stages`](https://docs.voxel51.com/api/fiftyone.core.stages.html)

- [`DatasetView`](https://docs.voxel51.com/api/fiftyone.core.view.html)

- [`ViewExpression`](https://docs.voxel51.com/api/fiftyone.core.expressions.html?highlight=viewexpression#fiftyone.core.expressions.ViewExpression)

- [`ViewField`](https://docs.voxel51.com/api/fiftyone.core.expressions.html#fiftyone.core.expressions.ViewField)

- Blog: [pandas v FiftyOne](https://docs.voxel51.com/cheat_sheets/pandas_vs_fiftyone.html)

- Blog: [pandas-style queries in FiftyOne](https://docs.voxel51.com/tutorials/pandas_comparison.html)

- Blog: [Filtering Cheat Sheet](https://docs.voxel51.com/cheat_sheets/filtering_cheat_sheet.html)

- Blog: [Views Cheat Sheet](https://docs.voxel51.com/cheat_sheets/views_cheat_sheet.html)

If you ever need assistance, have more complex questions, or want to keep in touch, feel free to join the Voxel51 community Discord server [here](https://discord.gg/QAyfnUhfpw).