## KMNIST Dataset
- <https://github.com/rois-codh/kmnist>
- <https://pytorch.org/vision/stable/datasets.html#kmnist>
- <https://www.tensorflow.org/datasets/catalog/kmnist>

In [1]:
from pathlib import Path
from torchvision.datasets import KMNIST
from torchvision.transforms import ToTensor

In [2]:
type(KMNIST)

type

```python
KMNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor(),
)
```

This call does the following:

- It calls the `__init__()` method of the class `KMNIST`, creates the folder given by the `root` parameter if it does not already exist, and downloads the KMNIST dataset into that folder.
- `train=True` makes the returned object the training set.
- **(?)** What does `transform=ToTensor()` do? **(R)** See the next cell.
- A single download fetches both the training and the test sets.
  - In other words, when you later access the test set by calling the same class with `train=False`, you can set `download=False`.
  - Even if you set `download=True` by mistake, no second download happens as long as the `root` path already holds a copy of the dataset.


```
help(ToTensor)

Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8.

In the other cases, tensors are returned without scaling.
```

In [3]:
path_pytorch_dataset = Path.home() / "datasets/pytorch"
train_data = KMNIST(
    root=path_pytorch_dataset,
    train=True,
    download=True,
    transform=ToTensor(),
)
train_data, type(train_data)

(Dataset KMNIST
     Number of datapoints: 60000
     Root location: /home/phunc20/datasets/pytorch
     Split: Train
     StandardTransform
 Transform: ToTensor(),
 torchvision.datasets.mnist.KMNIST)

In [4]:
test_data = KMNIST(
    root=path_pytorch_dataset,
    train=False,
    download=False,
    transform=ToTensor(),
)
test_data, type(test_data)

(Dataset KMNIST
     Number of datapoints: 10000
     Root location: /home/phunc20/datasets/pytorch
     Split: Test
     StandardTransform
 Transform: ToTensor(),
 torchvision.datasets.mnist.KMNIST)

In [5]:
def non_dunder(obj):
    return [s for s in dir(obj) if not s.startswith("__")]

In [7]:
non_dunder(train_data)

['_check_exists',
 '_format_transform_repr',
 '_repr_indent',
 'class_to_idx',
 'classes',
 'data',
 'download',
 'extra_repr',
 'processed_folder',
 'raw_folder',
 'resources',
 'root',
 'target_transform',
 'targets',
 'test_data',
 'test_file',
 'test_labels',
 'train',
 'train_data',
 'train_labels',
 'training_file',
 'transform',
 'transforms']

In [8]:
train_data.data

tensor([[[  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         ...,
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]],

        [[  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         ...,
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]],

        [[  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         ...,
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]],

        ...,

        [[  0,   0,   0,  ...,   0,   0,   0],
         [  0,   0,   0,  ...,   0,   0,   0]

In [9]:
train_data.data.shape, test_data.data.shape

(torch.Size([60000, 28, 28]), torch.Size([10000, 28, 28]))

In [10]:
len(train_data), len(test_data)

(60000, 10000)

In [15]:
batch_size = 64

In [11]:
# We will further divide train_data into training and validation sets
import torch
from torch.utils.data import random_split

train_size = 0.75
n_train = int(len(train_data) * train_size) if isinstance(train_size, float) else train_size
n_val = len(train_data) - n_train
train_data, val_data = random_split(
    train_data,
    [n_train, n_val],
    generator=torch.Generator().manual_seed(42),
)
type(train_data), type(val_data)

(torch.utils.data.dataset.Subset, torch.utils.data.dataset.Subset)

In [12]:
len(train_data), len(val_data)

(45000, 15000)
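
The seeded `generator` passed to `random_split` above is what makes the split repeatable. The sketch below checks this on a small stand-in dataset; `toy_dataset`, the helper `split`, and the 7/3 sizes are assumptions chosen for illustration, not part of the notebook.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Stand-in for the real dataset: ten items are enough to inspect the indices.
toy_dataset = TensorDataset(torch.arange(10))


def split(seed):
    # Same call pattern as in the cell above, with a hypothetical 7/3 split.
    return random_split(
        toy_dataset,
        [7, 3],
        generator=torch.Generator().manual_seed(seed),
    )


first_train, first_val = split(42)
second_train, second_val = split(42)

# The same seed yields the same partition.
print(first_train.indices == second_train.indices)
# The two subsets share no index and together cover the whole dataset.
print(sorted(first_train.indices + first_val.indices) == list(range(10)))
```

Without a fixed seed, each run of the notebook would likely put different images into the validation set, which makes validation scores harder to compare across runs.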

In [16]:
from torch.utils.data import DataLoader

loader_train = DataLoader(
    train_data,
    shuffle=True,
    batch_size=batch_size,
)
loader_val = DataLoader(
    val_data,
    batch_size=batch_size,
)
loader_test = DataLoader(
    test_data,
    batch_size=batch_size,
)
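
As a rough illustration of what such a loader yields, the sketch below iterates over a synthetic dataset shaped like KMNIST after `ToTensor()`. The 100 blank images, the labels, and the name `toy_loader` are made up so that the example runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 100 blank 1 x 28 x 28 images with integer labels.
images = torch.zeros(100, 1, 28, 28)
labels = torch.zeros(100, dtype=torch.long)
toy_loader = DataLoader(TensorDataset(images, labels), batch_size=64)

# Each iteration yields one batch of images and the matching batch of labels.
batch_shapes = []
for batch_images, batch_labels in toy_loader:
    batch_shapes.append((tuple(batch_images.shape), tuple(batch_labels.shape)))
    print(batch_images.shape, batch_labels.shape)
```

The first batch holds 64 items and the second holds the remaining 36, because the last batch is kept even when it is smaller than `batch_size`.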

In [19]:
len(loader_train), len(loader_train.dataset)

(704, 45000)

Note that

- the first length, `len(loader_train)`, is the number of batches
- the second length, `len(loader_train.dataset)`, is the number of instances

In [22]:
len(loader_train), len(loader_val), len(loader_test)

(704, 235, 157)

In [21]:
len(loader_train.dataset) // batch_size, \
len(loader_val.dataset) // batch_size, \
len(loader_test.dataset) // batch_size

(703, 234, 156)
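
The two outputs above differ by one because none of the dataset sizes is a multiple of 64: floor division counts only the full batches, while `DataLoader` with its default `drop_last=False` also keeps the final, smaller batch. A minimal sketch of the arithmetic in plain Python, using the sizes printed earlier (the dictionary names are chosen here for illustration):

```python
import math

batch_size = 64
dataset_sizes = {"train": 45000, "val": 15000, "test": 10000}

batch_counts = {}
for name, n in dataset_sizes.items():
    # full_batches is what n // batch_size gives; remainder is the size of the last batch.
    full_batches, remainder = divmod(n, batch_size)
    # Rounding up instead of down counts the partial batch as well.
    batch_counts[name] = math.ceil(n / batch_size)
    print(name, full_batches, remainder, batch_counts[name])
```

Passing `drop_last=True` to `DataLoader` would discard the partial batch, so the loader lengths would then match the floor-division counts.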