# Building a Dataset

In this notebook we will explore how to build a `Dataset` class that provides access to a dataset stored on disk.

## 00. Getting Started

Let's download the Fashion MNIST dataset from https://github.com/zalandoresearch/fashion-mnist.

In [26]:
import cv2
import gzip
import urllib.request
import numpy as np
from tqdm import tqdm
from pathlib import Path
from typing import Any, Iterator, List, Tuple

# URLs for the Fashion MNIST dataset
base_url = "http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/"
files = {
    "test_images": "t10k-images-idx3-ubyte.gz",
    "test_labels": "t10k-labels-idx1-ubyte.gz"
}

# Specify the output directory
dataset_dir = Path("../data/datasets")
dataset_dir.mkdir(parents=True, exist_ok=True)

# Download the files
for key, filename in files.items():
    src = base_url + filename # source url
    dst = dataset_dir.joinpath(filename) # destination file
    if not dst.exists():
        urllib.request.urlretrieve(src, dst)

# extract the images into an array of shape (N, 28, 28), skipping the 16-byte header
imgs = dataset_dir.joinpath(files["test_images"])
with gzip.open(imgs, "rb") as file:
    imgs = np.frombuffer(file.read(), np.uint8, offset=16)
    imgs = imgs.reshape(-1,28,28)

# extract the labels into an array of shape (N,), skipping the 8-byte header
lbls = dataset_dir.joinpath(files["test_labels"])
with gzip.open(lbls, "rb") as file:
    lbls = np.frombuffer(file.read(), np.uint8, offset=8)
    
# lets export them into a specific structure : just use the first 10 for the sake of it
for idx in range(10):
    # create a subdirectory
    idx_dir = dataset_dir.joinpath(str(idx))
    idx_dir.mkdir(exist_ok=True, parents=True)

    # save the image using cv2 (pass the path as a string)
    cv2.imwrite(str(idx_dir.joinpath("img.png")), imgs[idx])

    # save the label in plain text
    with open(idx_dir.joinpath("label.txt"), "w") as f:
        f.writelines([str(lbls[idx])])
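
The `offset=16` and `offset=8` arguments above skip the IDX file headers: an image file starts with four big-endian 32-bit integers (magic number 2051, image count, rows, columns) and a label file with two (magic number 2049, label count). The block below is a minimal sketch of reading those headers with `struct`. The helper names `parse_image_header` and `parse_label_header` are hypothetical, and the demo uses synthetic in-memory bytes rather than the downloaded files.

```python
import gzip
import io
import struct
from typing import Tuple


def parse_image_header(raw: bytes) -> Tuple[int, int, int, int]:
    """Return (magic, count, rows, cols) from the first 16 bytes of an IDX3 file."""
    # ">IIII" = four big-endian unsigned 32-bit integers
    return struct.unpack(">IIII", raw[:16])


def parse_label_header(raw: bytes) -> Tuple[int, int]:
    """Return (magic, count) from the first 8 bytes of an IDX1 file."""
    return struct.unpack(">II", raw[:8])


# Synthetic stand-in for a gzipped image file: header + two blank 28x28 images.
payload = struct.pack(">IIII", 2051, 2, 28, 28) + bytes(2 * 28 * 28)
with gzip.open(io.BytesIO(gzip.compress(payload)), "rb") as file:
    raw = file.read()

magic, count, rows, cols = parse_image_header(raw)
# The pixel data after the header should hold count * rows * cols bytes.
assert len(raw) - 16 == count * rows * cols
print(magic, count, rows, cols)
```

Checking the header against the payload size is a cheap way to catch a truncated download before reshaping the array.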

## 01. Introduction

Spend some time exploring the structure of the dataset using `pathlib`.

In [None]:
#
... 
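
One possible way to explore the structure is to walk the tree with `Path.rglob` and print each entry relative to the root. This is a sketch only: `describe_tree` is a hypothetical helper, and the block builds a small temporary copy of the `<idx>/img.png`, `<idx>/label.txt` layout so that it runs without the real data.

```python
import tempfile
from pathlib import Path
from typing import List


def describe_tree(root: Path) -> List[str]:
    """List every entry below `root` relative to it, marking directories with '/'."""
    lines = []
    for path in sorted(root.rglob("*"), key=lambda p: p.as_posix()):
        rel = path.relative_to(root).as_posix()
        lines.append(rel + "/" if path.is_dir() else rel)
    return lines


# Stand-in for ../data/datasets with two sample directories.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for idx in range(2):
        sample_dir = root / str(idx)
        sample_dir.mkdir()
        (sample_dir / "img.png").write_bytes(b"")
        (sample_dir / "label.txt").write_text(str(idx))

    tree = describe_tree(root)
    for line in tree:
        print(line)
```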

### Activity

Create a glob pattern that allows you to get all of the sample directories, i.e. `<idx>/`, in the dataset.

In [None]:
# define the root directory
...

# create a glob pattern and use `.glob` to access the sample directories i.e. <idx>/
...
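
`Path.glob` takes shell-style wildcard patterns rather than full regular expressions, so one approach is to match names that start with a digit and then filter the results. The sketch below assumes the sample directories are named with plain integers; `find_sample_dirs` is a hypothetical helper, and the demo runs against a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import List


def find_sample_dirs(root: Path) -> List[Path]:
    """Return the numerically named sub-directories of `root`, sorted by index."""
    # "[0-9]*" matches any name starting with a digit; is_dir() drops files and
    # isdigit() drops names such as "1_backup".
    candidates = [p for p in root.glob("[0-9]*") if p.is_dir() and p.name.isdigit()]
    # Sort by integer value so that "10" comes after "9" rather than after "1".
    return sorted(candidates, key=lambda p: int(p.name))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["0", "1", "2", "10"]:
        (root / name).mkdir()
    (root / "t10k-labels-idx1-ubyte.gz").write_bytes(b"")  # a file, not a sample

    names = [p.name for p in find_sample_dirs(root)]
    print(names)
```

Sorting numerically matters once there are more than ten samples, because a plain string sort would place `10` before `2`.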

## 02. Building an Index

We're now going to build our own dataset class that we can use to iterate over the items in the dataset. First, let's load all of the sample directories into a `List`; we will call this our index.

In [18]:
# store the samples in a list
...

In [30]:
# custom generator
class FashionMnist:
    def __init__(self, samples: List[Path]) -> None:
        super(FashionMnist, self).__init__()
        self.samples = samples
        assert isinstance(self.samples, list)
        assert len(self.samples) > 0
        assert isinstance(self.samples[0], Path)

    def __iter__(self) -> Iterator[Tuple[Path, Path]]:
        samples = self.samples
        for sample in samples:
            # construct the filepaths of the data
            img_filepath = sample.joinpath("img.png")
            lbl_filepath = sample.joinpath("label.txt")

            yield img_filepath, lbl_filepath

In [None]:
# initialize the dataset with the files you provided
dataset = FashionMnist(...)

# iterate over the dataset
...
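
As a rough illustration of how the index and the class fit together, the sketch below builds an index from a temporary directory and iterates over it twice. `FashionMnistPaths` is a condensed, hypothetical stand-in for the class above so that the block runs on its own, and it assumes the `label.txt` filename written in Getting Started.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


class FashionMnistPaths:
    """Condensed stand-in for the FashionMnist class above: yields file paths only."""

    def __init__(self, samples: List[Path]) -> None:
        self.samples = samples

    def __iter__(self) -> Iterator[Tuple[Path, Path]]:
        for sample in self.samples:
            yield sample / "img.png", sample / "label.txt"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for idx in range(3):
        (root / str(idx)).mkdir()

    # The index is the list of sample directories, sorted by their integer name.
    index = sorted((p for p in root.iterdir() if p.is_dir()), key=lambda p: int(p.name))
    dataset = FashionMnistPaths(index)

    pairs = [
        (img.relative_to(root).as_posix(), lbl.relative_to(root).as_posix())
        for img, lbl in dataset
    ]
    # __iter__ is a generator function, so each new loop starts from the first sample.
    second_pass = list(dataset)

print(pairs[0])
```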

## 03. Accessing the Dataset

Now that we have a way to programmatically access each set of files in the dataset, let's write some functions that read the data from the files and return it.

In [None]:
# create a function to read the image
img_filepath = Path("../data/files/wave.png")

def read_image(path: str) -> Any:
    ...


# read the image using the function
img = read_image(...)

# plot the image
...

In [None]:
# create a function to read the label
lbl_filepath = Path("../data/files/label.txt")

def read_label(path: str) -> Any:
    ...

# read the label using the function
lbl = read_label(...)

# print the label
...
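
A minimal sketch of one way `read_label` could look, assuming each label file holds a single integer in plain text as written in Getting Started. The demo writes its own temporary `label.txt` rather than relying on the path used above.

```python
import tempfile
from pathlib import Path
from typing import Union


def read_label(path: Union[str, Path]) -> int:
    """Read a label file containing a single integer written as plain text."""
    text = Path(path).read_text(encoding="utf-8")
    # strip() tolerates a trailing newline or surrounding spaces
    return int(text.strip())


with tempfile.TemporaryDirectory() as tmp:
    lbl_filepath = Path(tmp) / "label.txt"
    lbl_filepath.write_text("9\n", encoding="utf-8")
    lbl = read_label(lbl_filepath)
    print(lbl, type(lbl).__name__)
```

For the image, `cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)` is one option; note that `cv2.imread` returns `None` instead of raising when a file cannot be read, so a check on the result is worth adding.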

## 04. Putting it Together

Now that we have a way to retrieve the directories in the dataset and to read each type of file, let's pull this together into a single class we can use.

In [31]:
# custom generator
class FashionMnist:
    def __init__(self, samples: List[Path]) -> None:
        super(FashionMnist, self).__init__()
        self.samples = samples
        assert isinstance(self.samples, list)
        assert len(self.samples) > 0
        assert isinstance(self.samples[0], Path)

    def __iter__(self) -> Iterator[Tuple[Any, int]]:
        samples = self.samples
        for sample in samples:
            # construct the filepaths of the data
            img_filepath = sample.joinpath("img.png")
            lbl_filepath = sample.joinpath("label.txt")

            # read the data
            ...

            # yield the data
            yield ...
    
    def read_image(self, path: str) -> Any:
        ...

    def read_label(self, path: str) -> int:
        ...

In [None]:
# initialize the dataset
dataset = FashionMnist(...)

# iterate over the dataset
for idx, (img, lbl) in enumerate(dataset):
    # create a plot to display the image and label
    ...    
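
For reference, here is one possible shape for the completed class, offered as a sketch rather than the intended solution. `FashionMnistSketch` is a hypothetical name, and `read_image` returns the raw file bytes as a stand-in so the block runs with the standard library alone; in the notebook you would likely decode the pixels with `cv2` instead.

```python
import tempfile
from pathlib import Path
from typing import Iterator, List, Tuple


class FashionMnistSketch:
    """Hypothetical completed dataset: yields (image, label) pairs."""

    def __init__(self, samples: List[Path]) -> None:
        if not samples:
            raise ValueError("samples must contain at least one directory")
        self.samples = samples

    def __iter__(self) -> Iterator[Tuple[bytes, int]]:
        for sample in self.samples:
            img = self.read_image(sample / "img.png")
            lbl = self.read_label(sample / "label.txt")
            yield img, lbl

    def read_image(self, path: Path) -> bytes:
        # Stand-in: returns the undecoded file contents. A real version might use
        # cv2.imread(str(path), cv2.IMREAD_GRAYSCALE) to get a pixel array.
        return path.read_bytes()

    def read_label(self, path: Path) -> int:
        return int(path.read_text().strip())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    samples = []
    for idx, label in enumerate([9, 2]):
        sample_dir = root / str(idx)
        sample_dir.mkdir()
        (sample_dir / "img.png").write_bytes(b"fake-image-%d" % idx)
        (sample_dir / "label.txt").write_text(str(label))
        samples.append(sample_dir)

    results = list(FashionMnistSketch(samples))
    for idx, (img, lbl) in enumerate(results):
        print(idx, len(img), lbl)
```

Keeping the readers as methods means a subclass can override `read_image` or `read_label` without touching the iteration logic.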