- Title: Tips on Dataset in PyTorch
- Slug: python-pytorch-dataset
- Date: 2020-02-26 22:40:06
- Category: Programming
- Tags: programming, Python, AI, data science, machine learning, deep learning, PyTorch, Dataset
- Author: Ben Du

1. It is good practice to save your data (e.g., images) into a single pickle file
    (or another format that you know how to deserialize).
    This has several advantages.
    First, reading from a single big file is easier and faster than reading from many small files.
    Second, it avoids the system error caused by opening too many files.
    Some example datasets (e.g., MNIST)
    have separate training and testing files (i.e., 2 pickle files)
    so that research based on them can be reproduced easily.
    I personally suggest that you keep only 1 file containing all the data
    when implementing your own Dataset class.
    You can always use the function `torch.utils.data.random_split`
    to split your dataset into training and testing datasets later.
    For more details, 
    please refer to 
    [http://www.legendu.net/misc/blog/python-ai-split-dataset/](http://www.legendu.net/misc/blog/python-ai-split-dataset/#PyTorch).
    
1. When you implement your own Dataset class,
    you need to inherit from 
    [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)
    (or one of its subclasses).
    You must override the 2 methods `__len__` and `__getitem__`.

2. When you implement your own Dataset class for image classification,
    it is best to inherit from 
    [torchvision.datasets.vision.VisionDataset](https://github.com/pytorch/vision/blob/master/torchvision/datasets/vision.py#L6)
    .
    For example, 
    [torchvision.datasets.MNIST](https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py)
    subclasses 
    [torchvision.datasets.vision.VisionDataset](https://github.com/pytorch/vision/blob/master/torchvision/datasets/vision.py#L6)
    . 
    You can use it as a template.
    Notice that you still only have to override the 2 methods `__len__` and `__getitem__`
    (even though the implementation of 
    [torchvision.datasets.MNIST](https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py)
    is much more complicated than that).
    [torchvision.datasets.MNIST](https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py)
    downloads data into the directory `MNIST/raw` 
    and makes a ready-to-use copy of the data in the directory `MNIST/processed`. 
    It doesn't matter whether you follow this convention or not
    as long as you override the 2 methods `__len__` and `__getitem__`.
    What's more, the parameter `root` for the constructor of 
    [torchvision.datasets.vision.VisionDataset](https://github.com/pytorch/vision/blob/master/torchvision/datasets/vision.py#L6)
    is not critical 
    as long as your Dataset subclass knows where and how to load the data
    (e.g., you can pass the full path of the data file as a parameter to your Dataset subclass). 
    You can set it to `None` if you like. 
  

3. When you implement a Dataset class for image classification,
    it is best to have the method `__getitem__` return `(PIL.Image, target)`
    and then use `torchvision.transforms.ToTensor` to convert the `PIL.Image` to a tensor
    (e.g., by passing it as the `transform` of the Dataset)
    before the DataLoader collates the samples into a batch.
    The reason is that the transforms in `torchvision.transforms` 
    behave differently on a `PIL.Image` 
    and on its equivalent numpy array. 
    You might get surprises if you have `__getitem__` return `(torch.Tensor, target)`.
    If you do have `__getitem__` return `(torch.Tensor, target)`,
    make sure to double check that the tensors are as expected 
    before feeding them into your model for training/prediction.

4. `torchvision.transforms.ToTensor` (referred to as `ToTensor` in the following) 
    converts a `PIL.Image` to a numeric tensor with each value in the range [0, 1].
    `ToTensor` on a boolean numpy array (representing a black/white image) 
    returns a boolean tensor (instead of converting it to a numeric tensor). 
    This is one reason that you should return `(PIL.Image, target)` 
    and avoid returning `(numpy.array, target)`
    when implementing your own Dataset class for image classification.
        
5. There is no need to return the target as a `torch.Tensor` (even though you can)
    when you implement the method `__getitem__` of your own Dataset class.
    The DataLoader will convert the batch of target values to `torch.Tensor` automatically.

```python
import numpy as np
import torch
import torchvision
```

```python
trans = torchvision.transforms.ToTensor()
```

```python
arr = np.array([[True, True, False], [True, False, True]])
arr
```

```
array([[ True,  True, False],
       [ True, False,  True]])
```

```python
x = trans(arr)
x
```

```
tensor([[[ True,  True, False],
         [ True, False,  True]]])
```

## References

[torchvision.datasets.MNIST](https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py)

[VisionDataset](https://github.com/pytorch/vision/blob/master/torchvision/datasets/vision.py#L6)

[torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset)

[torch.utils.data.DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)

[torch.utils.data](https://pytorch.org/docs/stable/data.html)