# NIST Dataset

This notebook explores the following datasets:

- NIST Special Database 19
- The MNIST Dataset
- The EMNIST Dataset
- The EMNIST Dataset from Kaggle

## NIST Special Database 19

Dataset homepage: https://www.nist.gov/srd/nist-special-database-19

This is the source collection from which derived datasets such as EMNIST are built.


## The MNIST Dataset

Dataset homepage: http://yann.lecun.com/exdb/mnist

The MNIST database of handwritten digits has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.


I got help from this link to read the dataset files: https://stackoverflow.com/questions/39969045/parsing-yann-lecuns-mnist-idx-file-format

In [None]:
# make directory
!mkdir -p dataset/mnist
%cd dataset/mnist

In [None]:
# download MNIST dataset from web.archive.org because main site didn't work
!wget "https://web.archive.org/web/20240424102229/https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
!wget "https://web.archive.org/web/20240424102229/https://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz"
!wget "https://web.archive.org/web/20240424102229/https://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz"
!wget "https://web.archive.org/web/20240424102229/https://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz"

In [None]:
# decompress .gz files
!gzip -d *.gz

In [None]:
# check
!pwd
!ls -l

In [None]:
import struct
import numpy as np

# TRAINING SET IMAGE
with open("train-images-idx3-ubyte","rb") as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    x_train = data.reshape((size, nrows, ncols))


# TRAINING SET LABEL
with open("train-labels-idx1-ubyte","rb") as f:
    magic, size = struct.unpack(">II", f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    y_train = data.reshape((size,)) # or reshape to (1, size)


# TEST SET IMAGE
with open("t10k-images-idx3-ubyte","rb") as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    x_test = data.reshape((size, nrows, ncols))

# TEST SET LABEL
with open("t10k-labels-idx1-ubyte","rb") as f:
    magic, size = struct.unpack(">II", f.read(8))
    data = np.fromfile(f, dtype=np.dtype(np.uint8).newbyteorder('>'))
    y_test = data.reshape((size,)) # or reshape to (1, size)

In [None]:
print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_test: ", y_test.shape)

**Note:** The code above assumes the `.gz` files were decompressed first. You can also read a compressed file directly:

```python
import gzip
import struct
import numpy as np

with gzip.open('t10k-images-idx3-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    data = data.reshape((size, nrows, ncols))
```
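
The header parsing above can be folded into a single helper. The following is a minimal sketch, assuming the standard IDX layout (two zero bytes, a data type code, a dimension count, then one big-endian 32-bit size per dimension) and unsigned-byte data as in MNIST. The name `read_idx` is hypothetical and not part of any library.

```python
import gzip
import struct

import numpy as np


def read_idx(path):
    """Read an IDX file (plain or gzip-compressed) into a NumPy array.

    Hypothetical helper; supports only unsigned-byte data (type code 0x08).
    """
    # Pick the opener from the file extension.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as f:
        # Magic number: two zero bytes, data type code, number of dimensions.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code != 0x08:
            raise ValueError(f"unsupported IDX header in {path}")
        # One big-endian unsigned 32-bit integer per dimension.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)
```

With this helper, a call such as `read_idx("t10k-images-idx3-ubyte.gz")` would return the images and `read_idx("t10k-labels-idx1-ubyte.gz")` the labels, because the number of dimensions is read from the header instead of being hard-coded.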

In [None]:
import matplotlib.pyplot as plt

# plot train dataset
fig,axes = plt.subplots(5,5,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.set_title(y_train[i])

In [None]:
# plot test dataset
fig,axes = plt.subplots(5,5,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.imshow(x_test[i], cmap='gray')
    ax.set_title(y_test[i])

In [None]:
# delete mnist dataset files and directory
%cd ../..
!rm -rf dataset/mnist

## The EMNIST Dataset

Dataset homepage: https://www.nist.gov/itl/products-and-services/emnist-dataset

The EMNIST dataset is a set of handwritten character digits derived from the NIST Special Database 19 and converted to a 28x28 pixel image format and dataset structure that directly matches the MNIST dataset.

This dataset is provided in the same binary format as the original MNIST dataset, so the loading code used for MNIST above works here as well.


In [None]:
# make directory
!mkdir -p dataset/emnist
%cd dataset/emnist

In [None]:
# download the dataset
!wget "https://biometrics.nist.gov/cs_links/EMNIST/gzip.zip"

In [None]:
# decompress
!unzip gzip.zip

In [None]:
!ls -l . gzip

In [None]:
import gzip

with gzip.open('gzip/emnist-letters-train-images-idx3-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    nrows, ncols = struct.unpack(">II", f.read(8))
    data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    x_train = data.reshape((size, nrows, ncols))

with gzip.open('gzip/emnist-letters-train-labels-idx1-ubyte.gz','rb') as f:
    magic, size = struct.unpack(">II", f.read(8))
    data = np.frombuffer(f.read(), dtype=np.dtype(np.uint8).newbyteorder('>'))
    y_train = data.reshape((size,)) # or reshape to (1, size)


In [None]:
print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)

In [None]:
fig,axes = plt.subplots(5,5,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.imshow(x_train[i], cmap='gray')
    ax.set_title(chr(y_train[i]+96))

In [None]:
# delete dataset directory
%cd ../..
!rm -rf dataset/emnist

<p style="text-align: center; font-size: 30px;">NOTE</p>
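EMNIST dataset images are <b>transposed</b>: rows and columns are swapped relative to MNIST, so the plots above show each character flipped until the image is transposed back (for example with <code>x_train[i].T</code>).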
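
One way to undo the transposition for a whole batch is to swap the row and column axes of the image array. This is a minimal sketch, assuming the images are stored with shape `(N, rows, cols)` like `x_train` above. The small array is made-up data for illustration, not real EMNIST samples.

```python
import numpy as np

# Stand-in for EMNIST images: 2 made-up "images" with 2 rows and 3 columns each.
images = np.array([[[1, 2, 3],
                    [4, 5, 6]],
                   [[7, 8, 9],
                    [10, 11, 12]]], dtype=np.uint8)

# Swap the last two axes so every image is transposed; the sample axis stays first.
upright = np.transpose(images, (0, 2, 1))

print(upright.shape)  # (2, 3, 2)
```

For a single image, `image.T` gives the same result as swapping its two axes.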

## The EMNIST Dataset from Kaggle

Dataset homepage: https://www.kaggle.com/datasets/crawford/emnist

This dataset has the same content as the EMNIST dataset but is distributed as CSV files, which are easier to read.
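
As a rough illustration of the CSV layout, the sketch below assumes each row holds the label in the first column followed by 784 pixel values (a flattened 28x28 image), which is how the loading code in this section slices the data. The two rows are synthetic, not real EMNIST samples.

```python
import io

import numpy as np

# Two synthetic rows: a label, then 28 * 28 = 784 pixel values.
rows = [
    [3] + [0] * 784,
    [10] + [255] * 784,
]
csv_text = "\n".join(",".join(map(str, r)) for r in rows)

# Parse the text as if it were a CSV file on disk.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int64)

labels = table[:, 0]                       # first column: class labels
images = table[:, 1:].reshape(-1, 28, 28)  # remaining columns: one image per row

print(labels.shape, images.shape)  # (2,) (2, 28, 28)
```

Reshaping once up front avoids calling `reshape((28, 28))` on every image at plot time.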


In [None]:
# make directory
!mkdir -p dataset/kaggleemnist
%cd dataset/kaggleemnist

In [None]:
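# download the balanced and bymerge CSV archives; these signed URLs expire (X-Goog-Expires=259200 s, i.e. 3 days), so copy fresh links from the dataset page if they fail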
!wget "https://storage.googleapis.com/kaggle-data-sets/7160/10705/compressed/emnist-balanced-train.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240904%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240904T141525Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=9f9f910cca73ec6ebf6c213d9a372a6245bafdb840b37ac28889adbacb8ba740f9462ffb2a212cd8c8ef5df907388523641f6b825d449de1ff718cd76127b6911ac014b0fefc3c47b0f0860f78b920724fe078cfcdcc71d426a69cb182b3492fd7afb9f8cdd614d27c79013d283eb11fc5515924b03eeb61a8864cb491dacfc9608363e04824e293f71f2de79ea539b0ebfdc8fe51673073579c7d1d2d57471dc9348c8dbd722f3dc96ca4eded034fe9928047484081b6e2608466eb71c17f6d4d48f2cab69d2ff036e487b4866131f408e0ce31227ac3361e536fe3a4c30605b072b8398d4bc0e5e7c9faed3b3d94b087a7fe9f063f4d813360b927ba4bde05"
!wget "https://storage.googleapis.com/kaggle-data-sets/7160/10705/compressed/emnist-balanced-test.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240904%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240904T141554Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=8901a0ea41b9455ce79a038d5311f014af6546e6aefc2d8cbd525d566f8c1d1bee14aed00b42595f3c39b5cbb7f9aa9e985598f23eaa2baf0b5e16fe81ded26580a74fac17a76aeab810801d7c24380865f6a01eb4b2581c919d382f3da622b0b870c6f1be7a44601df5262f2cdc88acf1b39c08613f548870b7971377a1ddf82d2ee42f38191dcee9d0a60ae3718920f0e11e51822976d253e9a1ae00bdc92467fea648f34643ff3adef960625e719e34341636b6d9960231fcb0be693d26e380e53337085dc47e4cf43e146bebe6ba251f7a10a03bc18c615ef4521753c337fe32bec19fe30a3b01b84e06567b4130a9d11455b58e094a7193e281194eedee"

!wget "https://storage.googleapis.com/kaggle-data-sets/7160/10705/compressed/emnist-bymerge-test.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240904%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240904T141309Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=d030687dc6811f21fe9ba0bddf5dfea97f915c685a653bec71f1602cbbd0cd13301ee376985cfabaac29952a43a1692bacc29660052ee6852be2a01399a01bec454947a0835e3d957748541bbf9f13897c5fc625f2eeccf562ed9986d34a32751725e90c4d9fcbe130d68fc7cb5811e928564ed0d58abd84439409b42afbd362b8679f4d9ac0297b9466ad12698afd774a652878e2ab934f409753976bd30d7af111da80950ffa6111b7d6891cc4dac186973db9004423672732f6a77a0fcbfac7ce0f5690c21a007d9b15f5c12164693f14a2f88303ec133a15f5c2600abb3b4eb3dc4e1aa699735d846539982eec650b9f750b321fd232a497befe1ee500bd"
!wget "https://storage.googleapis.com/kaggle-data-sets/7160/10705/compressed/emnist-bymerge-train.csv.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240904%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240904T141109Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=031987f8bc5c84a2a58e29204c989087d5ffab904469405e2228cb58c98ff1f616af878c5a5c148788f31430df750374298a7acdb9e88c78ce93acafbd40a54c997c8ee64420e7df5016ce67e8555469b1aa5d6f3ec99916744e45583f925eda77a124e291edf28f2dddfa267448f2410e8c0733ff36d039410559baa53319d78464e2acfb4f96eb5a427fe845161624a4507e87c32e7fe69a05387157272558db9b5641ca8872fe2670ff74cd6c9acbfd477223ab8200be97f118ba71012ffdee6641125bf9060520ebea1ae8a6d53049346cba6c70c5eb1f4aa64feee77448fa5e82462c2baf108dfb91beb2ced0726445582bcd6e7784c010b0d35c53a193"

In [None]:
# run this cell only once
!for file in *; do unzip "$file"; rm "$file"; done

In [None]:
!ls -ltrha

In [None]:
trainset =  np.loadtxt('emnist-balanced-train.csv', delimiter=",", dtype=np.int64)
y_train = trainset[:,0]
x_train = trainset[:,1:]

testset =  np.loadtxt('emnist-balanced-test.csv', delimiter=",", dtype=np.int64)
y_test = testset[:,0]
x_test = testset[:,1:]

In [None]:
print("Shape of x_train: ", x_train.shape)
print("Shape of y_train: ", y_train.shape)
print("Shape of x_test: ", x_test.shape)
print("Shape of y_test: ", y_test.shape)

In [None]:
# plot train dataset
fig,axes = plt.subplots(5,5,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.imshow(x_train[i].reshape((28,28)), cmap='gray')
    ax.set_title(y_train[i])

In [None]:
# plot test dataset
fig,axes = plt.subplots(5,5,figsize=(10,10))
for i,ax in enumerate(axes.flat):
    ax.imshow(x_test[i].reshape((28,28)), cmap='gray')
    ax.set_title(y_test[i])

In [None]:
%cd ../..