## MNIST Dataset

The MNIST database (Modified National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It contains 70000 28x28 grayscale images in 10 classes (the digits 0 to 9), split into 60000 training images and 10000 test images.

![](../assets/mnist.png)

Many machine learning frameworks support scripts that automatically download popular datasets such as MNIST.

## Download dataset (if possible)

Some environments block internet connections. In that case, skip this step and place the dataset files under `../data` manually.

In [None]:
from torchvision import datasets

In [None]:
_ = datasets.MNIST('../data', train=True, download=True)
_ = datasets.MNIST('../data', train=False, download=True)
_ = datasets.CIFAR10('../data', train=True, download=True)
_ = datasets.CIFAR10('../data', train=False, download=True)

## Load dataset

The following files must exist in `../data/MNIST/raw` to load the dataset.

- t10k-images-idx3-ubyte
- t10k-labels-idx1-ubyte
- train-images-idx3-ubyte
- train-labels-idx1-ubyte
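
Because the download step may not be possible everywhere, it can help to check that the raw files are in place before loading them. The sketch below uses only the standard library. The helper name `missing_files` and the constant `MNIST_RAW_FILES` are hypothetical and not part of torchvision or idx2numpy.

```python
from pathlib import Path

# File names as listed above (the uncompressed MNIST IDX files).
MNIST_RAW_FILES = [
    't10k-images-idx3-ubyte',
    't10k-labels-idx1-ubyte',
    'train-images-idx3-ubyte',
    'train-labels-idx1-ubyte',
]

def missing_files(directory, names):
    """Return the names that do not exist as regular files in `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]
```

For example, `missing_files('../data/MNIST/raw', MNIST_RAW_FILES)` returns an empty list when all four files are present.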

In [None]:
from pathlib import Path

In [None]:
import idx2numpy
import numpy as np

In [None]:
data_dir = Path('../data/MNIST/raw')

In [None]:
train_images = idx2numpy.convert_from_file(str(data_dir.joinpath('train-images-idx3-ubyte')))
train_labels = idx2numpy.convert_from_file(str(data_dir.joinpath('train-labels-idx1-ubyte')))
test_images = idx2numpy.convert_from_file(str(data_dir.joinpath('t10k-images-idx3-ubyte')))
test_labels = idx2numpy.convert_from_file(str(data_dir.joinpath('t10k-labels-idx1-ubyte')))
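
The file names hint at the IDX layout: `idx3` files hold a 3-dimensional array (images, rows, columns), `idx1` files hold a 1-dimensional array of labels, and `ubyte` means unsigned 8-bit values. As a rough sketch of what a reader such as `idx2numpy` has to parse, the header can be inspected with the standard library alone. The function name `read_idx_header` is hypothetical, and the layout assumed here is the usual IDX one: two zero bytes, a type code, a dimension count, then one big-endian 32-bit size per dimension.

```python
import struct

def read_idx_header(path):
    """Return (type_code, dims) read from the header of an IDX file."""
    with open(path, 'rb') as f:
        # Magic number: two zero bytes, a data type code, the number of dimensions.
        zero1, zero2, type_code, ndim = struct.unpack('>BBBB', f.read(4))
        if zero1 != 0 or zero2 != 0:
            raise ValueError('not an IDX file: first two bytes must be zero')
        # One big-endian unsigned 32-bit integer per dimension.
        dims = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
    return type_code, dims
```

Under these assumptions, `train-images-idx3-ubyte` should report type code `0x08` (unsigned byte) and dimensions `(60000, 28, 28)`.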

In [None]:
from PIL import Image
from IPython.display import display

def show(ary):
    display(Image.fromarray(ary))

In [None]:
# Show the first three training images with their labels.
for image, label in zip(train_images[:3], train_labels[:3]):
    show(image)
    print(label)

## CIFAR Dataset

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

![](../assets/cifar-10.png)

## Load dataset

The following directory must exist in `../data` to load the dataset.

- cifar-10-batches-py

In [None]:
import pickle

In [None]:
data_dir = Path('../data/cifar-10-batches-py')

In [None]:
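# Load only the first training batch to inspect its structure.
# Sorting makes the choice deterministic, since glob order is not guaranteed.
batch = sorted(data_dir.glob('data_batch_*'))[0]
with open(str(batch), 'rb') as f:
    v = pickle.load(f, encoding='bytes')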

In [None]:
len(v[b'labels'])

In [None]:
v.keys()

In [None]:
v[b'data'].shape
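
Each row of `data` holds one image flattened to 3072 values. Assuming the layout described for the Python version of CIFAR-10 (1024 red values, then 1024 green, then 1024 blue, each plane in row-major order), a row can be turned back into a 32x32 RGB array as sketched below. The helper name `row_to_image` is hypothetical.

```python
import numpy as np

def row_to_image(row):
    """Convert one flat CIFAR-10 row (3072 values) to a (32, 32, 3) uint8 array."""
    row = np.asarray(row, dtype=np.uint8)
    if row.shape != (3072,):
        raise ValueError('expected a flat row of 3072 values')
    # (channel, height, width) -> (height, width, channel), the order PIL expects.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)
```

With a batch loaded as above, the result of `row_to_image(v[b'data'][0])` can be passed to the `show` helper defined earlier.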

In [None]:

# Load the first training batch by name, relative to data_dir.
with open(str(data_dir.joinpath('data_batch_1')), 'rb') as f:
    dict1 = pickle.load(f, encoding='bytes')