# Preliminaries
## Imports

In [None]:
import tempfile
from pathlib import Path

import holoviews as hv
import hvplot.pandas  # noqa
import pandas as pd
import panel as pn

hv.extension("bokeh")
pn.extension()
# If running in Google Colab, apply a workaround so that HoloViews renders properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path(tempfile.mkdtemp()) / "basics" / "coco_eda_demo"

## Loading a dataset

The recommended way to create BridgeDS Dataset objects is through a **DatasetProvider**. Here we'll use the `Coco2017Detection` provider:

In [None]:
from bridge.display.vision import Holoviews
from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir)
ds = provider.build_dataset()
ds

# Real-life example: Exploratory Data Analysis on COCO

In the previous tutorials we briefly introduced the Sample and Table APIs. In this demo we'll perform a short, step-by-step analysis of COCO using the different tools available in BridgeDS, with an emphasis on ease of use.

## Assigning a column
Let's take a brief look at our samples and annotations:

In [None]:
ds.samples.head()

In [None]:
ds.annotations.head()

In the annotations table, class names are represented by numerical labels, which may impede readability during dataset analysis. To address this, we may choose to use a third-party file that maps these integer labels to their corresponding text labels.

In [None]:
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/amikelive/coco-labels/master/coco-labels-paper.txt"

classnames = urlopen(url).read().decode("utf-8").splitlines()
classnames = {i + 1: c for i, c in enumerate(classnames)}
print(classnames)
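
The dictionary comprehension above shifts list positions by one, so that the first line of the file is mapped to label 1. A minimal sketch of the same mapping on a made-up three-entry list (the names below are placeholders, not read from the file), using `enumerate(..., start=1)` as an equivalent spelling:

```python
# Hypothetical stand-in for the lines read from the label file.
toy_classnames = ["person", "bicycle", "car"]

# Equivalent to {i + 1: c for i, c in enumerate(toy_classnames)}
toy_mapping = dict(enumerate(toy_classnames, start=1))
print(toy_mapping)  # {1: 'person', 2: 'bicycle', 3: 'car'}
```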

As we've seen in the Table API tutorial, we can use `ds.assign_annotations` to replace our bounding box class labels with new ones:

In [None]:
from bridge.utils.data_objects import BoundingBox, ClassLabel


def map_bbox_class_names(bbox, classnames):
    coords = bbox.coords
    class_idx = bbox.class_label.class_idx
    class_name = classnames[class_idx]
    return BoundingBox(coords, ClassLabel(class_idx, class_name))


ds = ds.assign_annotations(
    data=lambda samples, anns: anns.data.apply(lambda bbox: map_bbox_class_names(bbox, classnames))
)
ds.annotations.head()

Another issue is that `ds.samples.date_captured` actually holds strings rather than `pd.Timestamp` values. Let's fix that:

In [None]:
print(ds.samples.date_captured.dtype)
ds = ds.assign_samples(date_captured=lambda samples, anns: pd.to_datetime(samples.date_captured))
print(ds.samples.date_captured.dtype)
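
To see why the conversion matters, here is a small pandas-only sketch on a toy frame. The frame and its timestamps are made up for illustration and only mimic the `date_captured` column:

```python
import pandas as pd

# Toy stand-in for ds.samples; the timestamps are invented.
toy_samples = pd.DataFrame(
    {"date_captured": ["2013-11-14 17:02:52", "2013-11-15 02:41:42"]}
)

# Same pattern as above: assign a converted column in place of the string one.
converted = toy_samples.assign(
    date_captured=lambda df: pd.to_datetime(df.date_captured)
)

# Datetime accessors such as .dt.day are only available after the conversion.
print(converted.date_captured.dt.day.tolist())  # [14, 15]
```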

This is a short example of where the Table API shines. Most frameworks and libraries implement some variant of our Sample API, which in practice means that these assignment operations would require iterating through the dataset with a nested loop:

```
for sample in samples:
    for annotation in sample.annotations:
        <do...>
```

This is both slow and verbose.
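
The contrast can be sketched on toy data with plain Python and pandas. The label mapping, sample layout, and column names below are hypothetical and do not come from BridgeDS or any particular framework:

```python
import pandas as pd

# Hypothetical label mapping, for illustration only.
names = {1: "person", 2: "bicycle"}

# Sample-style layout: each sample owns a list of annotations.
samples = [
    {"id": 10, "annotations": [{"class_idx": 1}, {"class_idx": 2}]},
    {"id": 11, "annotations": [{"class_idx": 1}]},
]
looped = []
for sample in samples:
    for annotation in sample["annotations"]:
        looped.append(names[annotation["class_idx"]])

# Table-style layout: one row per annotation, one expression for the whole column.
anns = pd.DataFrame({"sample_id": [10, 10, 11], "class_idx": [1, 2, 1]})
anns = anns.assign(class_name=anns.class_idx.map(names))

print(looped)
print(anns.class_name.tolist())
```

Both approaches produce the same labels; the table version just expresses the operation once for the entire column.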

## Plotting some graphs
With our dataframes now in appropriate formats, let's generate some basic plots to gain insights into our data.

Note: While our preferred plotting API is [hvplot](https://hvplot.holoviz.org/), [Pandas Plotting](https://pandas.pydata.org/docs/user_guide/visualization.html) remains a viable option.

In [None]:
plot = ds.annotations.data.apply(lambda bb: str(bb.class_label)).value_counts().hvplot.bar()

plot.opts(
    title="Class-histogram, COCO Train",
    width=900,
    xrotation=90,
    xlabel="class",
    ylabel="n_bboxes",
)

In [None]:
ds.samples.license.value_counts().hvplot.bar().opts(title="Image Licenses, Histogram")

In [None]:
(ds.samples.groupby(pd.Grouper(freq="D", key="date_captured")).size()).hvplot.bar().opts(
    xrotation=45, title="Date Captured Histogram, COCO Train"
)
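
For readers unfamiliar with `pd.Grouper`, the following sketch shows what the daily grouping computes before it is plotted. The toy frame and its dates are invented:

```python
import pandas as pd

# Toy stand-in for ds.samples with an already-converted datetime column.
toy = pd.DataFrame(
    {
        "date_captured": pd.to_datetime(
            ["2013-11-14 17:02:52", "2013-11-14 23:10:00", "2013-11-16 08:00:00"]
        )
    }
)

# One bin per calendar day between the first and last timestamp;
# days without any samples still appear, with a count of 0.
per_day = toy.groupby(pd.Grouper(freq="D", key="date_captured")).size()
print(per_day)
```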

In [None]:
ds.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)

## Investigating a bbox with abnormally large area

Observing the KDE plot, we notice an unnatural leftward squeezing. This behavior is likely due to `hvplot` setting the x-axis limits based on the minimum and maximum values present in the data. Could this suggest that one of our annotations has an area on the order of 8.0e+5 px^2?

In [None]:
large_ann = ds.annotations.loc[ds.annotations.area.idxmax()]
large_ann

We can see that the area of this annotation is 787151, so it is indeed on the order of 8.0e+5 px^2.

We've now identified a specific sample, `id=400410`, that warrants further examination. The `ds.get` and `sample.show()` methods from the Sample API let us visualize it.

(Reminder: `ds.get` and `ds.iget` serve as equivalents to `df.loc` and `df.iloc`, respectively, _for single samples_).

In [None]:
sample_id = large_ann.name[0]  # MultiIndex loc causes the name to be tuples (<sample_id>,<element_id>)
ds.get(sample_id).show()
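
The tuple-valued `name` is plain pandas behavior and can be reproduced on a toy table. The sketch below assumes an annotations table indexed by `(sample_id, element_id)`, as the comment above describes; the values are made up:

```python
import pandas as pd

# Toy annotations table indexed by (sample_id, element_id).
index = pd.MultiIndex.from_tuples(
    [(7, 0), (7, 1), (9, 0)], names=["sample_id", "element_id"]
)
toy_anns = pd.DataFrame({"area": [120.0, 5000.0, 30.5]}, index=index)

key = toy_anns.area.idxmax()  # index label of the largest area: a tuple
row = toy_anns.loc[key]       # a Series whose .name is that same tuple
toy_sample_id = row.name[0]   # first level of the label is the sample id
print(toy_sample_id)
```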

To better understand the image and the size of the dining table annotation in question, we introduce DisplayEngines, which you've seen briefly in the Sample API tutorial. These objects are injected into Datasets, Samples, and Elements, and let us control the behavior of the `ds.show()` / `sample.show()` / `element.show()` methods.

By default, the **SimplePrints** engine is utilized. Let's switch to the **Holoviews** engine for enhanced visualization:

In [None]:
# Datasets are immutable, so we'll build a new dataset from the existing provider
# with a new rendering engine, and then re-run the assignments we made.

ds = provider.build_dataset(display_engine=Holoviews(bbox_format="xywh"))
ds = ds.assign_annotations(
    data=lambda samples, anns: anns.data.apply(lambda bbox: map_bbox_class_names(bbox, classnames))
)
ds = ds.assign_samples(date_captured=lambda samples, anns: pd.to_datetime(samples.date_captured))

ds.show()

Now we have a more user-friendly way to observe our data. You can freely scroll through the slider and visualize different samples from COCO, right in your notebook.

Next up, let's visualize the specific sample (400410) that piqued our interest:

In [None]:
ds.get(sample_id).show()

## Sorting COCO dataset by bbox sizes
Upon inspection, it's evident that the `dining table` annotation encompasses the entire image.

To assess the frequency of such occurrences, let's render the samples in our dataset in descending order of annotation size.

To achieve this:
1. Assign a new column to `ds.samples` representing the area value of its largest annotation.
2. Sort the samples by this column.
3. Run `ds.show()`.

In [None]:
def get_largest_area_annotation_per_sample(samples, anns):
    return (
        anns.sort_values("area", ascending=False)
        .groupby("sample_id")
        .area.first()
        .reindex(
            samples.index.get_level_values("sample_id")
        )  # without reindex, the areas may have a different sample order than our `ds.samples` index
        .values
    )


ds = ds.assign_samples(top_ann_area=get_largest_area_annotation_per_sample)
ds.samples.head()
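
The role of `reindex` is easier to see on toy tables. The sketch below is pandas-only and assumes the index layout used above (`sample_id` on samples, `(sample_id, element_id)` on annotations); all values are made up. It uses `groupby(...).max()`, which gives the same result as sorting and taking the first row per group when no areas are missing:

```python
import pandas as pd

# Samples deliberately listed in a non-sorted order; sample 2 has no annotations.
toy_samples = pd.DataFrame(
    {"file": ["a.jpg", "b.jpg", "c.jpg"]},
    index=pd.Index([3, 1, 2], name="sample_id"),
)
toy_anns = pd.DataFrame(
    {"area": [10.0, 50.0, 40.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0), (1, 1), (3, 0)], names=["sample_id", "element_id"]
    ),
)

# Largest area per sample; the result is ordered by group key, not by sample order.
top_area = toy_anns.groupby("sample_id").area.max()
print(top_area.index.tolist())  # [1, 3]

# Align to the samples' own order; samples without annotations become NaN.
aligned = top_area.reindex(toy_samples.index.get_level_values("sample_id"))
print(aligned.tolist())  # [40.0, 50.0, nan]
```

Without the `reindex` step, passing `.values` would silently attach the areas to the wrong rows, and the lengths would not even match here.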

In [None]:
ds.sort_samples("top_ann_area", ascending=False).show()

By scrolling the slider, we observe images with very large annotations on the left, followed by images with very small annotations, and then images without annotations on the right.

## Filtering out images with large bboxes
An alternative approach is to remove samples with bounding boxes that cover the majority of the image. We can accomplish this using `ds.select_samples` and `ds.select_annotations`, which, like `ds.assign_samples` and `ds.assign_annotations`, work with a Pandas-like API:

In [None]:
print("Original dataset:", ds)
ds_smaller = ds.select_samples(lambda samples, anns: samples.top_ann_area < 1e5)
print("Filtered dataset:", ds_smaller)
ds_smaller.sort_samples("top_ann_area", ascending=False).show()
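
One detail worth keeping in mind when writing such a condition: in plain pandas, comparing `NaN` to a number yields `False`. The sketch below shows this on toy tables with made-up values. Whether `ds.select_samples` applies the mask in exactly this way is an assumption, so it is worth checking the printed dataset sizes:

```python
import pandas as pd

# Toy samples table; sample 3 has no annotations, so its top area is NaN.
toy_samples = pd.DataFrame(
    {"top_ann_area": [2.0e5, 3.0e4, float("nan")]},
    index=pd.Index([1, 2, 3], name="sample_id"),
)
toy_anns = pd.DataFrame(
    {"area": [2.0e5, 3.0e4, 1.0e3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 0), (2, 0), (2, 1)], names=["sample_id", "element_id"]
    ),
)

# NaN < 1e5 evaluates to False, so sample 3 is dropped along with sample 1.
mask = toy_samples.top_ann_area < 1e5
kept_samples = toy_samples[mask]

# Keep only the annotations that belong to the remaining samples.
sample_ids = toy_anns.index.get_level_values("sample_id")
kept_anns = toy_anns[sample_ids.isin(kept_samples.index)]

# Variant that also keeps samples without annotations.
mask_keep_empty = mask | toy_samples.top_ann_area.isna()
print(kept_samples.index.tolist(), toy_samples[mask_keep_empty].index.tolist())
```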

For completeness, let's plot the KDE from before on `ds_smaller`:

In [None]:
ds_smaller.annotations.area.hvplot.density().opts(
    title="KDE of annotation area, COCO Train",
    xlabel="area (px^2)",
    ylabel="density",
    tools=[],
)

As we can see, the plot is still squeezed to the left, although significantly less than before. We've gained some insight into the distribution of our bbox sizes, but there's always more to do. Feel free to lower the bbox area threshold further, plot this KDE for individual classes rather than all of them together, and so on.
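
As a starting point for a per-class view, the areas can be summarized by class before plotting. The sketch below uses a toy table with a hypothetical `class_name` column and invented values. In the dataset above the label lives inside the `data` column, so it would first need to be extracted, as in the class histogram cell:

```python
import pandas as pd

# Toy annotations with a flat class_name column (hypothetical layout).
toy_anns = pd.DataFrame(
    {
        "class_name": ["person", "person", "dining table", "dining table"],
        "area": [1000.0, 3000.0, 50000.0, 90000.0],
    }
)

# Per-class summary of annotation areas.
per_class = toy_anns.groupby("class_name").area.agg(["count", "median", "max"])
print(per_class)
```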