# Preliminaries

## Imports

In [None]:
import tempfile
from pathlib import Path

import holoviews as hv
import hvplot.pandas  # noqa
import panel as pn

hv.extension("bokeh")
pn.extension()
# If running in Google Colab, apply a workaround that lets HoloViews render properly
try:
    import google.colab  # noqa

    def _render(self, **kwargs):
        hv.extension("bokeh")
        return hv.Store.render(self)

    hv.core.Dimensioned._repr_mimebundle_ = _render
except ModuleNotFoundError:
    pass

TMP_NOTEBOOK_ROOT = Path(tempfile.mkdtemp()) / "processing_data" / "cache_mechanism"

# CacheMechanisms

## Motivation
Consider the following Dataset:

In [None]:
from bridge.display.vision import Holoviews
from bridge.providers.vision import Coco2017Detection

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="val", img_source="stream")
stream_ds = provider.build_dataset(display_engine=Holoviews(bbox_format="xywh"))
stream_ds

In [None]:
stream_ds.samples.head(3)

This Dataset's samples have URL sources, which means every `sample.data` call has to request the image again, and that takes a long time:

In [None]:
%%timeit
stream_ds.iget(0).data

One way to speed this up is to use a `CacheMechanism`: an object that, the first time `image_element.data` is called, stores the data in a different location (e.g. a local file or in memory). This is transparent to the user and makes subsequent `.data` calls significantly faster.
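To make the idea concrete, here is a minimal standalone sketch of caching a slow load to the local filesystem. It does not use the library's `CacheMechanism`; the names `cached_load` and `fake_download` are hypothetical, and hashing the key to build a file name is an assumption made only for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_load(key: str, slow_load: Callable[[str], bytes], cache_dir: Path) -> bytes:
    """Return the bytes for `key`, calling `slow_load` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so that URLs become safe, fixed-length file names.
    cache_file = cache_dir / hashlib.sha256(key.encode()).hexdigest()
    if cache_file.exists():
        return cache_file.read_bytes()  # fast path: read from local disk
    data = slow_load(key)  # slow path: e.g. an HTTP request
    cache_file.write_bytes(data)
    return data


calls: List[str] = []


def fake_download(url: str) -> bytes:
    # Hypothetical stand-in for a network request; records every invocation.
    calls.append(url)
    return b"pixels-of-" + url.encode()


cache_dir = Path(tempfile.mkdtemp()) / "demo_cache"
first = cached_load("http://example.com/img.jpg", fake_download, cache_dir)
second = cached_load("http://example.com/img.jpg", fake_download, cache_dir)
print(first == second, len(calls))
```

The second call never reaches `fake_download`, which is the effect the timing cells in this notebook measure on real images.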

In our scenario, we can assign a cache mechanism for every `etype`. The Dataset has two etypes:
1. `'bbox'` - already stored in-memory, no need to re-cache them
2. `'image'` - we want to cache them in the filesystem.

In [None]:
from bridge.primitives.element.data.cache_mechanism import CacheMechanism
from bridge.primitives.element.data.uri_components import URIComponents

root_dir = TMP_NOTEBOOK_ROOT / "coco"

provider = Coco2017Detection(root_dir, split="val", img_source="stream")
stream_ds = provider.build_dataset(
    display_engine=Holoviews(bbox_format="xywh"),
    cache_mechanisms={
        "image": CacheMechanism(
            root_uri=URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "my_local_cache")),
        ),
        "bbox": None,
    },
)
stream_ds

NOTE: `cache_mechanism == None` means we don't cache anything and keep the original LoadMechanism. `cache_mechanism == CacheMechanism()` means we cache in memory. The bboxes are already in memory, so there's no point in storing them again.

In [None]:
stream_ds.samples.head(3)

In [None]:
stream_ds.iget(0).data
stream_ds.samples.head(3)

See how the first sample's `data` column has changed to a local path?

In [None]:
%%timeit
stream_ds.iget(0).data

So now, subsequent loads of the data take a fraction of the time of the original download-from-URL scenario.

## CacheMechanism Roles
The `CacheMechanism` object has two responsibilities:

1. Use a `CacheMethod` to store the data to a certain location (disk, RAM, etc.) and to return a `LoadMechanism` which can load this data back:

```python
def store(
    self,
    element,
    data,
    as_category: str | None = None,
    should_update_elements: bool = False,
) -> LoadMechanism:
    ...
```

2. Update the `ds.elements` table (from which `ds.samples` and `ds.annotations` are derived) when we call `element.data`, using the new LoadMechanism returned by `cache_mechanism.store()`, so that the **TableAPI** stays aligned with the new source.
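The two roles can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyFileLoader` and `ToyCacheMechanism` are hypothetical classes, the "elements table" is a plain dict from element id to source, and `store()` takes an id instead of an element object.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class ToyFileLoader:
    """Hypothetical stand-in for a LoadMechanism that reads bytes from a file."""

    path: Path

    def load_data(self) -> bytes:
        return self.path.read_bytes()


class ToyCacheMechanism:
    """Hypothetical stand-in for a CacheMechanism that caches to a directory."""

    def __init__(self, root: Path, elements_table: Dict[str, str]):
        self.root = root
        # Toy "elements table": element id -> where its data currently lives.
        self.elements_table = elements_table

    def store(self, element_id: str, data: bytes, should_update_elements: bool = False) -> ToyFileLoader:
        # Role 1: persist the data and return a loader that can read it back.
        self.root.mkdir(parents=True, exist_ok=True)
        path = self.root / f"{element_id}.bin"
        path.write_bytes(data)
        # Role 2: point the table at the new source so table and data agree.
        if should_update_elements:
            self.elements_table[element_id] = str(path)
        return ToyFileLoader(path)


table = {"img_0": "http://example.com/img_0.jpg"}
cache = ToyCacheMechanism(Path(tempfile.mkdtemp()) / "cache", table)
loader = cache.store("img_0", b"raw-bytes", should_update_elements=True)
print(loader.load_data())
print(table["img_0"] == str(loader.path))
```

After `store()` returns, the table row for `img_0` holds the local path rather than the URL, which is the same change visible in `stream_ds.samples` earlier in this notebook.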

In fact, every element holds a reference to a CacheMechanism just like it holds a LoadMechanism. Using this knowledge, here is the actual code for `element.data`:

```python
@property
def data(self) -> Any:
    data = self._load_mechanism.load_data()
    if self._cache_mechanism:
        new_load_mechanism = self._cache_mechanism.store_image(self.id, self.type, data)
        self._load_mechanism = new_load_mechanism
    return data
```
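To see the effect of swapping the load mechanism, here is a runnable toy version of the same property. All classes (`ToyLoader`, `ToyMemoryCache`, `ToyElement`) are hypothetical stand-ins written for this illustration, and the cache here simply keeps the object in memory.

```python
from typing import Any, Callable, List, Optional


class ToyLoader:
    """Hypothetical LoadMechanism: wraps any zero-argument callable."""

    def __init__(self, fn: Callable[[], Any]):
        self._fn = fn

    def load_data(self) -> Any:
        return self._fn()


class ToyMemoryCache:
    """Hypothetical in-memory cache: keeps the object and returns a loader for it."""

    def store(self, data: Any) -> ToyLoader:
        return ToyLoader(lambda: data)


class ToyElement:
    def __init__(self, load_mechanism: ToyLoader, cache_mechanism: Optional[ToyMemoryCache] = None):
        self._load_mechanism = load_mechanism
        self._cache_mechanism = cache_mechanism

    @property
    def data(self) -> Any:
        data = self._load_mechanism.load_data()
        if self._cache_mechanism:
            # Swap in the loader returned by the cache; later calls skip the slow source.
            self._load_mechanism = self._cache_mechanism.store(data)
        return data


slow_calls: List[int] = []


def slow_source() -> str:
    # Stand-in for an expensive load; counts how often it is hit.
    slow_calls.append(1)
    return "image-bytes"


cached = ToyElement(ToyLoader(slow_source), ToyMemoryCache())
uncached = ToyElement(ToyLoader(slow_source))

for _ in range(3):
    _ = cached.data
n_after_cached = len(slow_calls)

for _ in range(2):
    _ = uncached.data
n_total = len(slow_calls)

print(n_after_cached, n_total)
```

The cached element hits `slow_source` once across three reads, while the element without a cache mechanism hits it on every read.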

## CacheMechanisms and Transforms
How does this relate back to transforms? Well, when we execute `sample.transform()`, here's what happens:
1. We apply the transform to each element to get new data
2. We _store_ this new data using a CacheMechanism
3. We create a _new sample_ from the old one, but replace the LoadMechanisms for every element with the ones returned from this CacheMechanism.

By default, `sample.transform()` keeps its outputs in memory. However, this doesn't scale to large datasets, so it's better to cache to a path, as we did above. This way, when we call `ds.transform_samples()`, the method iterates over all samples, transforms them, and saves them, while still allowing us to treat the newly created Dataset just like the original one.
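The three steps can be sketched with plain Python objects. This is a toy model, not the library's implementation: `ToyLoader`, `ToySample`, `hflip`, and `transform_sample` are hypothetical names, the "image" is a nested list, and the transform is applied per element type rather than to the sample as a whole.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

WIDTH = 4  # width of the toy image below, needed to mirror the box


@dataclass(frozen=True)
class ToyLoader:
    """Hypothetical in-memory LoadMechanism: hands back a stored value."""

    value: Any

    def load_data(self) -> Any:
        return self.value


@dataclass(frozen=True)
class ToySample:
    """Hypothetical sample: one loader per element type."""

    loaders: Dict[str, ToyLoader]


def hflip(etype: str, data: Any) -> Any:
    if etype == "image":
        return [row[::-1] for row in data]  # reverse every pixel row
    x, y, w, h = data  # "xywh" box: mirror its left edge
    return [WIDTH - x - w, y, w, h]


def transform_sample(
    sample: ToySample,
    transform: Callable[[str, Any], Any],
    store: Callable[[Any], ToyLoader],
) -> ToySample:
    new_loaders = {}
    for etype, loader in sample.loaders.items():
        new_data = transform(etype, loader.load_data())  # step 1: apply the transform
        new_loaders[etype] = store(new_data)  # step 2: store, get a loader back
    return ToySample(new_loaders)  # step 3: new sample, old one untouched


original = ToySample(
    {
        "image": ToyLoader([[1, 2, 3, 4], [5, 6, 7, 8]]),
        "bbox": ToyLoader([0, 0, 1, 2]),
    }
)
flipped = transform_sample(original, hflip, store=ToyLoader)
print(flipped.loaders["image"].load_data())
print(flipped.loaders["bbox"].load_data())
```

Passing a different `store` callable, for example one that writes to disk and returns a file-backed loader, would change where the outputs live without touching the transform logic.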

In the following snippet, we transform samples from COCO. We limit the Dataset to a few samples because the images are remote, so most of the time is spent just downloading them:

In [None]:
import albumentations as A

from bridge.primitives.sample.transform.vision import AlbumentationsCompose

transform = AlbumentationsCompose(albm_transforms=[A.HorizontalFlip(always_apply=True)], bbox_format="coco")
# cache = LocalCache(TMP_NOTEBOOK_ROOT / "flipped", extension=".jpg")

flipped_ds = stream_ds.select_samples(lambda samples, anns: samples.index[:20]).transform_samples(
    transform=transform,
    cache_mechanisms={
        "image": CacheMechanism(
            URIComponents.from_str(str(TMP_NOTEBOOK_ROOT / "flipped")),
        )
    },
    display_engine=Holoviews(bbox_format="xywh"),
)

In [None]:
flipped_ds.show()

In [None]:
list((TMP_NOTEBOOK_ROOT / "flipped").iterdir())