# Dataset preparation

The following examples require the *MNIST* and *CIFAR-10* datasets.

For the later steps to run properly, both datasets must exist in the `../data` directory.

You are currently in this directory:

In [None]:
!echo "Current Directory: $PWD" 
!echo "Dataset Directory: $PWD/../data"

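Move your dataset directory to the expected location (`{dataset_directory}` stands for the path of the directory that holds your copy of the datasets):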

In [None]:
!mv {dataset_directory} $PWD/../data

Afterwards, the checks below should find both datasets.

### If you run the code below, you should see `cifar-10-batches-py`

In [None]:
!ls $PWD/../data | grep cifar-10-batches-py

### If you run the code below, you should see these four files:

- t10k-images-idx3-ubyte
- t10k-labels-idx1-ubyte
- train-images-idx3-ubyte
- train-labels-idx1-ubyte

In [None]:
!ls $PWD/../data/MNIST/raw
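
The same checks can be sketched in plain Python, which may be handy where shell commands are unavailable. This is a minimal sketch: the helper name `missing_dataset_paths` and the `EXPECTED` list are hypothetical and simply collect the paths named above.

```python
from pathlib import Path

# Paths this notebook expects, relative to the dataset directory.
EXPECTED = [
    "cifar-10-batches-py",
    "MNIST/raw/t10k-images-idx3-ubyte",
    "MNIST/raw/t10k-labels-idx1-ubyte",
    "MNIST/raw/train-images-idx3-ubyte",
    "MNIST/raw/train-labels-idx1-ubyte",
]

def missing_dataset_paths(data_dir):
    """Return the expected paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]

# An empty list means every expected path is in place.
print(missing_dataset_paths("../data"))
```

If the list is not empty, the names it contains show which part of the dataset directory still has to be moved.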

## Download dataset (if possible)

Many machine learning frameworks provide scripts that automatically download popular datasets such as MNIST and CIFAR.
However, some environments block internet connections, so you have to move the datasets manually, as in the code above.

Once the `../data` directory is in place, the code below finds the existing files and skips the download.

In [None]:
data_dir = '../data'

In [None]:
from torchvision import datasets

In [None]:
_ = datasets.MNIST(data_dir, train=True, download=True)
_ = datasets.MNIST(data_dir, train=False, download=True)
_ = datasets.CIFAR10(data_dir, train=True, download=True)
_ = datasets.CIFAR10(data_dir, train=False, download=True)

## MNIST Dataset

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It contains 70000 28x28 grayscale images in 10 classes (the digits 0 to 9).

![](../assets/mnist.png)

## Load dataset

The following files must exist in `../data/MNIST/raw` to load the dataset.

- t10k-images-idx3-ubyte
- t10k-labels-idx1-ubyte
- train-images-idx3-ubyte
- train-labels-idx1-ubyte
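
To sanity-check that a raw file really is an uncompressed IDX file, its header can be inspected directly. The sketch below assumes the standard IDX layout: two zero bytes, one byte for the data type (`0x08` for unsigned byte), one byte for the number of dimensions, then one big-endian 32-bit size per dimension. `read_idx_header` is a hypothetical helper, not part of torchvision, and the example uses a synthetic header rather than a real file.

```python
import struct

def read_idx_header(raw):
    """Parse an IDX header from the leading bytes of a file.

    Returns (data_type_code, dims), where dims is a tuple of sizes.
    """
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file: the first two bytes must be zero")
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return dtype_code, dims

# Synthetic header shaped like train-images-idx3-ubyte:
# unsigned bytes, 3 dimensions, 60000 images of 28x28 pixels.
header = struct.pack(">HBBIII", 0, 0x08, 3, 60000, 28, 28)
print(read_idx_header(header))  # (8, (60000, 28, 28))
```

For a real file, passing the first 16 bytes (for example `open(path, "rb").read(16)`) is enough for both the image and the label files.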

In [None]:
from pathlib import Path
import numpy as np

In [None]:
train_dataset = datasets.MNIST(data_dir, train=True, download=True)
test_dataset = datasets.MNIST(data_dir, train=False, download=True)

In [None]:
from PIL import Image
from IPython.display import display

def show(ary):
    display(Image.fromarray(ary))

In [None]:
for _, (image, label) in zip(range(3), train_dataset):
    display(image)
    print(label)

## CIFAR-10 Dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.
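
In the Python version of the dataset, each batch file is a pickled dictionary whose `data` entry holds one row of 3072 values per image: 1024 red values, then 1024 green, then 1024 blue, each colour plane stored row by row. The sketch below assumes that layout and works on a synthetic row instead of a real batch file; `row_to_pixels` is a hypothetical helper written only for illustration.

```python
def row_to_pixels(row, size=32):
    """Convert a flat CIFAR-10 row (R plane, G plane, B plane) into
    a size x size grid of (r, g, b) tuples."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError("expected %d values, got %d" % (3 * plane, len(row)))
    red, green, blue = row[:plane], row[plane:2 * plane], row[2 * plane:]
    return [
        [(red[y * size + x], green[y * size + x], blue[y * size + x])
         for x in range(size)]
        for y in range(size)
    ]

# Synthetic row: red plane all 255, green plane all 128, blue plane all 0.
row = [255] * 1024 + [128] * 1024 + [0] * 1024
pixels = row_to_pixels(row)
print(len(pixels), len(pixels[0]), pixels[0][0])  # 32 32 (255, 128, 0)
```

With a real batch, a row could be obtained with something like `pickle.load(f, encoding="bytes")[b"data"][0]`; torchvision performs an equivalent conversion internally, so this is only needed when reading the files by hand.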

![](../assets/cifar-10.png)

## Load dataset

The following directory must exist in `../data` to load the dataset.

- cifar-10-batches-py

In [None]:
train_dataset = datasets.CIFAR10(data_dir, download=True, train=True)
test_dataset = datasets.CIFAR10(data_dir, download=True, train=False)

In [None]:
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

In [None]:
for _, (image, label) in zip(range(3), train_dataset):
    display(image)
    print(classes[label])