# Preliminaries
## Installation
To run this tutorial, install the following libraries:

In [None]:
!pip install bridge-ds
!pip install pycocotools

## Downloading the demo dataset
In this tutorial, we'll integrate the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/), a text classification dataset, with Bridge.

In [None]:
import tempfile
from pathlib import Path

TMP_NOTEBOOK_ROOT = Path(tempfile.mkdtemp()) / "custom_data" / "dataset_provider"

In [None]:
from bridge.utils import download_and_extract_archive

download_and_extract_archive(
    "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", TMP_NOTEBOOK_ROOT / "imdb"
)

### File Tree

After extracting, we can observe the following file structure:

```
├── README
├── imdb.vocab
├── imdbEr.txt
├── test
│   ├── labeledBow.feat
│   ├── neg  [12500 entries]
│   ├── pos  [12500 entries]
│   ├── urls_neg.txt
│   └── urls_pos.txt
└── train
    ├── labeledBow.feat
    ├── neg  [12500 entries]
    ├── pos  [12500 entries]
    ├── unsup  [50000 entries]
    ├── unsupBow.feat
    ├── urls_neg.txt
    ├── urls_pos.txt
    └── urls_unsup.txt
```

In the next steps, we will learn how to load this dataset into Bridge.
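
Before wiring the data into Bridge, it can help to confirm the extracted layout programmatically. The following is a minimal sketch using only the standard library. `count_entries` is a hypothetical helper (not part of Bridge), and it runs against a tiny mock tree rather than the real archive, so the counts here are illustrative only.

```python
import tempfile
from pathlib import Path


def count_entries(split_root: Path) -> dict[str, int]:
    """Map each subdirectory of a split to the number of entries it holds."""
    return {
        d.name: sum(1 for _ in d.iterdir())
        for d in sorted(split_root.iterdir())
        if d.is_dir()
    }


# Build a miniature stand-in for aclImdb/train (the real dirs hold thousands of files).
mock_train = Path(tempfile.mkdtemp()) / "train"
for class_name, n_files in [("neg", 2), ("pos", 3), ("unsup", 1)]:
    (mock_train / class_name).mkdir(parents=True)
    for i in range(n_files):
        (mock_train / class_name / f"{i}_0.txt").write_text("placeholder review")
(mock_train / "urls_pos.txt").write_text("")  # plain files are ignored by count_entries

counts = count_entries(mock_train)
print(counts)
```

On the real dataset, the same helper could be pointed at `TMP_NOTEBOOK_ROOT / "imdb" / "aclImdb" / "train"`, where the counts should match the entries listed in the tree above.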

# DatasetProvider
The recommended way to create Bridge Datasets is by using DatasetProviders. They implement a single method, `provider.build_dataset()`.

Here is the outline:

```python
class YourDatasetProvider(DatasetProvider):
    def __init__(self, dataset_dir, download=False):
        """
        Load the original dataset. This usually means downloading the dataset from a source, storing samples in a list, etc.
        Remember that in Bridge it's enough to store references to your data, not necessarily the actual data.
        """
        super().__init__(dataset_dir, download)

    def build_dataset(self, display_engine=None, cache_mechanisms=None):
        """
        Convert the dataset from raw format into our own Dataset type.

        Parameters:
        - display_engine (DisplayEngine): The display engine to use for visualization.
        - cache_mechanisms (Dict[str, CacheMechanism | None] | None): Cache mechanisms for different types of elements.
        NOTE: Learn more about cache mechanisms and display engines in more advanced tutorials.
        """
        # Implement dataset building logic here
        pass
```

Let's start by writing the basic layout of the class, and the `__init__`:

In [None]:
import os

from bridge.primitives.dataset import SingularDataset
from bridge.providers import DatasetProvider


class LargeMovieReviewDataset(DatasetProvider):
    dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

    def __init__(self, root: str | os.PathLike, split: str = "train", download: bool = False):
        root = Path(root)

        if download:
            if (root / "aclImdb_v1.tar.gz").exists():
                print("Archive file aclImdb_v1.tar.gz already exists, skipping download.")
            else:
                download_and_extract_archive(self.dataset_url, str(root))
        self._split_root = root / "aclImdb" / split

    def build_dataset(
        self,
        display_engine=None,
        cache_mechanisms=None,
    ) -> SingularDataset:
        pass

Now we can instantiate this provider and verify that it points to the right directory:

In [None]:
provider = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / "imdb", split="train", download=False)
provider._split_root

In [None]:
os.listdir(provider._split_root)

The next step will be to implement `build_dataset()`, which will load the relevant metadata from this directory into a Bridge Dataset.

Concretely, we will iterate over the class directories and convert every text file into **two elements**: a text element and a class label element. To get the convenient API where the text elements are _samples_ and the class elements are _annotations_, we will collect the elements in two separate lists, while ensuring that elements from the same sample _share a sample id_:

In [None]:
import os
from pathlib import Path

from bridge.primitives.dataset.singular_dataset import SingularDataset
from bridge.primitives.element.data.load_mechanism import LoadMechanism
from bridge.primitives.element.element import Element
from bridge.utils.data_objects import ClassLabel


class LargeMovieReviewDataset(DatasetProvider):
    dataset_url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

    def __init__(self, root: str | os.PathLike, split: str = "train", download: bool = False):
        root = Path(root)

        if download:
            if (root / "aclImdb_v1.tar.gz").exists():
                print("Archive file aclImdb_v1.tar.gz already exists, skipping download.")
            else:
                download_and_extract_archive(self.dataset_url, str(root))
        self._split_root = root / "aclImdb" / split

    def build_dataset(
        self,
        display_engine=None,
        cache_mechanisms=None,
    ) -> SingularDataset:
        samples = []
        annotations = []

        class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()]
        for class_idx, class_dir in enumerate(sorted(class_dir_list)):
            for textfile in class_dir.iterdir():
                load_mechanism = LoadMechanism.from_url_string(str(textfile), "text")
                text_element = Element(
                    element_id=f"text_{textfile.stem}",
                    sample_id=textfile.stem,
                    etype="text",
                    load_mechanism=load_mechanism,
                )
                load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category="obj")
                label_element = Element(
                    element_id=f"label_{textfile.stem}",
                    sample_id=textfile.stem,
                    etype="class_label",
                    load_mechanism=load_mechanism,
                )
                samples.append(text_element)
                annotations.append(label_element)

        return SingularDataset.from_lists(
            samples, annotations, display_engine=display_engine, cache_mechanisms=cache_mechanisms
        )

There's quite a bit of code here, so let's break it down a little:

#### Iterating Class Dirs

```python
    class_dir_list = [d for d in list(self._split_root.iterdir()) if d.is_dir()] 
    for class_idx, class_dir in enumerate(sorted(class_dir_list)):
        for textfile in class_dir.iterdir():
```
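This sets up a nested loop: for every class directory, we iterate over all the text files of that class. Sorting the directories by name makes the class indices deterministic, and the `is_dir()` filter skips plain files such as `labeledBow.feat`.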
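
To see which index each class ends up with, here is a small standalone sketch of the same filter-and-sort pattern. It uses a mock directory built in a temp folder instead of the real dataset, and `class_to_idx` is a name introduced here purely for illustration.

```python
import tempfile
from pathlib import Path

# Mock split directory: three class dirs plus a stray file, as in train/.
split_root = Path(tempfile.mkdtemp())
for name in ["unsup", "pos", "neg"]:
    (split_root / name).mkdir()
(split_root / "labeledBow.feat").write_text("")

# Keep directories only, mirroring the filter in build_dataset().
class_dirs = [d for d in split_root.iterdir() if d.is_dir()]

# Sorting makes the index assignment independent of filesystem listing order.
class_to_idx = {d.name: idx for idx, d in enumerate(sorted(class_dirs))}
print(class_to_idx)
```

Because the directory names sort alphabetically, `neg` comes before `pos`, and `unsup` (present only in the training split) comes last.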


#### Create Text Element
```python
load_mechanism = LoadMechanism.from_url_string(str(textfile), 'text')
text_element = Element(
    element_id=f"text_{textfile.stem}",
    sample_id=textfile.stem,
    etype='text',
    load_mechanism=load_mechanism,
)
```

* Create a LoadMechanism that points at our text file. It stores a reference to the file rather than its contents.
* Create the text element. This means defining a unique element id, a sample id, and the element type, and passing in the load mechanism we just defined.
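
Both ids are derived from the file name through `Path.stem`. The short sketch below shows what that yields for one example path. The path is illustrative, and only standard-library `pathlib` is used.

```python
from pathlib import PurePosixPath

# Example path; review files in this dataset are named <id>_<rating>.txt.
textfile = PurePosixPath("aclImdb/train/pos/0_9.txt")

sample_id = textfile.stem              # file name without the .txt suffix
text_element_id = f"text_{sample_id}"  # prefix keeps it distinct from other elements of the sample

print(sample_id, text_element_id)
```

Prefixing the stem with `text_` leaves the bare stem free to serve as the sample id shared by every element of that sample.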

#### Create Class Element

```python
load_mechanism = LoadMechanism(ClassLabel(class_idx, class_dir.name), category='obj')
label_element = Element(
    element_id=f"label_{textfile.stem}",
    sample_id=textfile.stem,
    etype='class_label',
    load_mechanism=load_mechanism,
)
```

* The LoadMechanism in this case simply stores the class label object in memory, rather than a URL.
* The element gets a different element id than the text element above, but the same sample id, which marks the two as belonging to the same sample.

NOTE: Each sample consists of one text element and one label element, because we're doing a classification task. In the SingularDataset regime, these will be separated into `samples` (text) and `annotations` (class labels).
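
The id scheme can be sanity-checked without Bridge at all. The sketch below is only an illustration under stated assumptions: `ElementStub` is a hypothetical stand-in that models just the id fields, not the real `Element` class.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementStub:
    """Hypothetical stand-in for a Bridge Element; only the id fields are modeled."""

    element_id: str
    sample_id: str
    etype: str


samples = [
    ElementStub("text_0_9", "0_9", "text"),
    ElementStub("text_1_3", "1_3", "text"),
]
annotations = [
    ElementStub("label_0_9", "0_9", "class_label"),
    ElementStub("label_1_3", "1_3", "class_label"),
]

# Element ids must be unique across both lists.
all_ids = [e.element_id for e in samples + annotations]
ids_are_unique = len(all_ids) == len(set(all_ids))

# Group element ids by sample id: each sample should own one text and one label.
by_sample = defaultdict(list)
for element in samples + annotations:
    by_sample[element.sample_id].append(element.element_id)

print(ids_are_unique, dict(by_sample))
```

Each sample id maps to exactly two element ids here, one per list, which is the pairing the two-list layout relies on.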

### Wrapping up

Let's create a Bridge Dataset and see what we've got:

In [None]:
from bridge.display.basic import SimplePrints

ds = LargeMovieReviewDataset(TMP_NOTEBOOK_ROOT / "imdb", split="train", download=False).build_dataset(
    display_engine=SimplePrints()
)
ds

In [None]:
ds.samples.head(3)

In [None]:
ds.annotations.head(3)

In [None]:
ds.annotations.data.value_counts()

In [None]:
sample = ds.iget(0)
sample.data  # SingularSample exposes the sample element data directly

In [None]:
ds.select_samples(lambda samples, anns: samples.index[:2]).show()

We have an operational Bridge Dataset, which we can manipulate as we see fit. 

For example, observe the [tree](#File-Tree) above, and note that the `unsup` dir only exists in the training set, and not in the test set. This is because this "class" is not a class per se, but rather unlabeled data which is included in our archive.

Let's use a simple selection to clear out all samples of this class from our dataset:

In [None]:
ds = ds.select_samples(lambda samples, anns: anns[anns.data != "unsup"].index.get_level_values("sample_id"))

ds.annotations.data.value_counts()
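
The selector above combines a boolean mask with `index.get_level_values`. The pandas-only sketch below mimics that pattern on a hypothetical annotations table. The `element_id` level name and the plain-string labels are assumptions made for illustration, not Bridge's actual schema.

```python
import pandas as pd

# Hypothetical stand-in for an annotations table with a two-level index.
index = pd.MultiIndex.from_tuples(
    [("label_0_9", "0_9"), ("label_1_3", "1_3"), ("label_2_0", "2_0")],
    names=["element_id", "sample_id"],
)
anns = pd.DataFrame({"data": ["pos", "neg", "unsup"]}, index=index)

# The boolean mask keeps labeled rows only.
labeled = anns[anns["data"] != "unsup"]

# The surviving sample ids are what a selector would hand back.
kept_sample_ids = labeled.index.get_level_values("sample_id")

print(list(kept_sample_ids))
```

Returning sample ids rather than annotation rows means the selection applies to whole samples, so the text element and its label are kept or dropped together.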

## In Summary
1. To create our own custom datasets, it's recommended to use a **DatasetProvider**.
2. For SingularDatasets, we create two lists of elements: one for _samples_ and one for _annotations_. For any other kind of Dataset, we will create a single list of elements.
3. Elements have unique IDs across the Dataset, and share Sample IDs with Elements of the same Sample.
4. Elements are the low-level objects that hold or reference the raw data through a **LoadMechanism**.

## Up Next
In this tutorial, we've used a primitive DisplayEngine called **SimplePrints**. If you would prefer a more sophisticated one, like the Holoviews engine used in previous tutorials, continue to the next tutorial, where we learn how to create our own **DisplayEngine** for a text dataset.