# Simple Tar Dataset - examples

This notebook walks through a few common use cases. The Tar files it needs are small and ship with the library.


### Just load the images

The default `TarDataset` simply loads all PNG, JPG and JPEG images from a Tar file and lets you iterate over them.

Each image is returned as a `Tensor`. The cell below prints the RGB value of the top-left pixel of each image.

In [1]:
from tardataset import TarDataset

dataset = TarDataset('example-data/colors.tar')

for (idx, image) in enumerate(dataset):
    print(f"Image #{idx}, color: {image[:,0,0]}")

Image #0, color: tensor([0., 0., 1.])
Image #1, color: tensor([0., 1., 0.])
Image #2, color: tensor([1., 0., 0.])




### Folders as class labels (like torchvision's ImageFolder)

Similarly to [`ImageFolder`](https://pytorch.org/vision/stable/datasets.html#imagefolder), `TarImageFolder` assumes that each top-level folder contains all samples of a different class.

In this example, the Tar archive has this structure:
- `red/a.png`
- `green/b.png`
- `blue/c.png`

In [2]:
from tarimagefolder import TarImageFolder

dataset = TarImageFolder('example-data/colors.tar')

for (idx, (image, label)) in enumerate(dataset):
  print(f"Image #{idx}, label: {label} "
        f"({dataset.idx_to_class[label]}), color: {image[:,0,0]}")

Image #0, label: 0 (blue), color: tensor([0., 0., 1.])
Image #1, label: 1 (green), color: tensor([0., 1., 0.])
Image #2, label: 2 (red), color: tensor([1., 0., 0.])
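
To make the folder-to-label mapping concrete, here is a minimal standard-library sketch. It is not `TarImageFolder`'s actual implementation: it assumes labels are assigned by sorting the top-level folder names, which matches the output above (`blue` is 0, `green` is 1, `red` is 2) and the convention `ImageFolder` uses. The helpers `make_tar` and `classes_from_tar` are hypothetical names introduced only for this illustration.

```python
import io
import tarfile


def make_tar(paths):
    """Build an in-memory tar archive with a tiny placeholder file per path."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for path in paths:
            payload = b"placeholder"
            info = tarfile.TarInfo(name=path)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def classes_from_tar(fileobj):
    """Map each top-level folder name to an integer label, in sorted order."""
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        folders = {
            member.name.split("/")[0]
            for member in tar.getmembers()
            if member.isfile() and "/" in member.name
        }
    return {name: idx for idx, name in enumerate(sorted(folders))}


# Same layout as the example archive described above.
archive = make_tar(["red/a.png", "green/b.png", "blue/c.png"])
class_to_idx = classes_from_tar(archive)
print(class_to_idx)  # {'blue': 0, 'green': 1, 'red': 2}
```

Sorting the folder names keeps the label assignment deterministic, regardless of the order in which files were added to the archive.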




### Use a DataLoader (multiple processes) and return a mini-batch

Using a `DataLoader` is the same as with a standard `Dataset`. The library supports various multiprocessing configurations without extra code.

In [4]:
from torch.utils.data import DataLoader

if __name__ == '__main__':  # guard required when DataLoader starts worker processes
  dataset = TarImageFolder('example-data/colors.tar')
  loader = DataLoader(dataset, batch_size=3, num_workers=2, shuffle=True)

  for (image, label) in loader:
    print(f"Dimensions of image batch: {image.shape}")
    print(f"Labels in batch: {label}")

Dimensions of image batch: torch.Size([3, 3, 8, 8])
Labels in batch: tensor([2, 1, 0])




### Load videos as stacks of frames (custom Tar structures)

To have more control over how files in the Tar archive are related to iterated samples, you can subclass `TarDataset`.

Here we treat each folder whose name starts with `'vid'` as one sample, load 3 sequentially named frames from it, and return the frames stacked into a single tensor.

In [3]:
import torch

class VideoDataset(TarDataset):
  """Example video dataset, each folder has the frames of a video"""
  def __init__(self, archive):
    super().__init__(archive=archive,
      is_valid_file=lambda m: m.isdir() and m.name.startswith('vid'))

  def __getitem__(self, index):
    """Load and return a stack of 3 frames from this folder"""
    folder = self.samples[index]
    images = [self.get_image(f"{folder}/{frame:02}.png")
      for frame in range(3)]
    return torch.stack(images)


dataset = VideoDataset('example-data/videos.tar')

for (idx, video) in enumerate(dataset):
  print(f"Video #{idx}, stack of frames with dims: {video.shape}")

Video #0, stack of frames with dims: torch.Size([3, 3, 8, 8])
Video #1, stack of frames with dims: torch.Size([3, 3, 8, 8])
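
The sketch below shows how such a folder predicate behaves on its own, using only the standard library. It assumes the predicate receives `tarfile.TarInfo` objects, which is consistent with the `m.isdir()` and `m.name` calls above but is not confirmed here. The archive contents (`vid0`, `vid1`, `extras`) are made up for the illustration.

```python
import io
import tarfile

# Build a small in-memory archive: two video folders and one unrelated folder.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for folder in ("vid0", "vid1", "extras"):
        dir_info = tarfile.TarInfo(name=folder)
        dir_info.type = tarfile.DIRTYPE  # explicit directory entry
        tar.addfile(dir_info)
        for frame in range(3):
            data = b"frame"  # placeholder bytes, not a real PNG
            file_info = tarfile.TarInfo(name=f"{folder}/{frame:02}.png")
            file_info.size = len(data)
            tar.addfile(file_info, io.BytesIO(data))
buffer.seek(0)


def is_valid_file(member):
    """Same predicate as in VideoDataset: keep directories named 'vid*'."""
    return member.isdir() and member.name.startswith("vid")


with tarfile.open(fileobj=buffer, mode="r") as tar:
    members = tar.getmembers()
    names = {member.name for member in members}
    samples = sorted(m.name for m in members if is_valid_file(m))

# Frame paths that a __getitem__ like the one above would request.
frames = {
    folder: [f"{folder}/{frame:02}.png" for frame in range(3)]
    for folder in samples
}
print(samples)         # ['vid0', 'vid1']
print(frames["vid0"])  # ['vid0/00.png', 'vid0/01.png', 'vid0/02.png']
```

Note that the predicate only matches archives that store explicit directory entries; an archive created from a list of file paths alone may contain no directory members at all.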




### Load non-image files, such as pickled Python objects

You can choose the loaded file types with `extensions` (or the more advanced `is_valid_file`, as above).

You can also use `get_file` to load arbitrary files as data streams, entirely in memory (nothing is written to disk). The resulting stream can be passed directly to modules such as `pickle` or `json`.

In [6]:
import pickle

class PickleDataset(TarDataset):
  """Example non-image dataset"""
  def __init__(self, archive):
    super().__init__(archive=archive, extensions=('.pickle',))  # one-element tuple

  def __getitem__(self, index):
    """Return a pickled Python object"""
    filename = self.samples[index]
    return pickle.load(self.get_file(filename))


dataset = PickleDataset('example-data/objects.tar')

for (idx, obj) in enumerate(dataset):
  print(f"Sample #{idx}, object: {obj}")

Sample #0, object: {'id': 0, 'content': 'one sample'}
Sample #1, object: {'id': 1, 'content': 'another sample'}
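
For comparison, the same in-memory idea can be sketched with the standard library alone. This is not how `get_file` is implemented; it only illustrates reading a pickled object from a Tar archive as a stream, using `tarfile.extractfile`. The file name `sample0.pickle` is hypothetical.

```python
import io
import pickle
import tarfile

# Build a small tar archive in memory holding one pickled object.
payload = pickle.dumps({"id": 0, "content": "one sample"})
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    info = tarfile.TarInfo(name="sample0.pickle")  # hypothetical file name
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

# Read it back as a stream; nothing is extracted to disk.
with tarfile.open(fileobj=buffer, mode="r") as tar:
    stream = tar.extractfile("sample0.pickle")
    obj = pickle.load(stream)

print(obj)  # {'id': 0, 'content': 'one sample'}
```

As with any use of `pickle`, only unpickle archives from sources you trust, since loading a pickle can execute arbitrary code.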




### Load custom meta-data files (e.g. ground truth information)

Often datasets come with various pieces of information in different files. You can easily read a text file from the Tar archive into a string with `get_text_file`, either at initialisation or during iteration. For more general binary files, use `get_file` as above.

In this example we read a text file from the archive, which contains the file name of each image and its label `'red'` or `'not-red'` (one per line). When the dataset is iterated, `__getitem__` then returns the image and this custom label as a boolean.

In [7]:
class RedDataset(TarDataset):
  """Example dataset, which loads from a text file a binary label of
  whether each image is red or not."""
  def __init__(self, archive):
    super().__init__(archive=archive)
    
    self.image_is_red = {}
    for line in self.get_text_file('custom-data.txt').splitlines():
      (name, redness) = line.split(',')
      self.image_is_red[name] = (redness == 'red')

  def __getitem__(self, index):
    """Return the image and the binary label"""
    filename = self.samples[index]
    image = self.get_image(filename)
    is_red = self.image_is_red[filename]
    return (image, is_red)


dataset = RedDataset('example-data/colors.tar')

for (idx, (image, label)) in enumerate(dataset):
  print(f"Image #{idx}, redness: {label}, color: {image[:,0,0]}")

Image #0, redness: False, color: tensor([0., 0., 1.])
Image #1, redness: False, color: tensor([0., 1., 0.])
Image #2, redness: True, color: tensor([1., 0., 0.])
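
The parsing step in `__init__` can be tried in isolation. The sketch below assumes one comma-separated `name,label` pair per line, as described above. The file contents are invented for the illustration, and `parse_redness` is a hypothetical helper that also skips blank lines, which the loop in `RedDataset` does not do.

```python
def parse_redness(text):
    """Parse 'name,label' lines into a dict of name -> bool (True if red)."""
    image_is_red = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        name, redness = line.split(",", maxsplit=1)
        image_is_red[name.strip()] = (redness.strip() == "red")
    return image_is_red


# Hypothetical contents of a label file such as 'custom-data.txt'.
sample_text = "red/a.png,red\ngreen/b.png,not-red\nblue/c.png,not-red\n\n"
labels = parse_redness(sample_text)
print(labels)  # {'red/a.png': True, 'green/b.png': False, 'blue/c.png': False}
```

Parsing the file once at initialisation keeps `__getitem__` cheap: each lookup is a single dictionary access.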


That's it! For more information, refer to the documentation of the classes.