# What is the MNIST Dataset?

It is a dataset of images of handwritten digits. The dataset contains a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels), with pixel values stored as integers from 0 to 255 (often rescaled to the range 0 to 1 before training).

The digits were written by high school students and employees of the United States Census Bureau, collected in NIST's Special Database 1 and Special Database 3 respectively. MNIST is formed by combining these two NIST databases.

# Reading the MNIST Dataset 

In this section I'll show you how to read the MNIST dataset into memory from scratch. With a library like TensorFlow or Keras this is often done in one or two lines, e.g.:

```
from keras.datasets import mnist
# get the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```

Pre-configured data files are fantastic and save a lot of time; however, it's important that we know how to prepare and load datasets ourselves.

I tried to organize this notebook into easy-to-digest sections:

1. Download and extract the four gzipped MNIST files
2. Process the encoded images
3. Save the images to a dictionary
4. Store the dictionary


# 1. Downloading and Extracting the Dataset

In [28]:
# Adapted from https://github.com/Ghosh4AI/Data-Processors/blob/master/MNIST/MNIST_Loader.ipynb

import os,urllib.request

# The path where the necessary files are downloaded to
datapath = './Data/MNISTData/'  

# Create the directory if it doesn't already exist
os.makedirs(datapath, exist_ok=True)

# The necessary download URLS for the training and testing images/labels
urls = ['http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz']

# Loop through each URL 
for url in urls:
    # Get individual filenames
    filename = url.split('/')[-1]
    
    # Check if the file already exists
    if os.path.exists(datapath + filename):
        print(filename, ' already exists')
    else:
        print('Downloading ', filename)
        # Copy the object specified in the URL to the local file
        urllib.request.urlretrieve(url, datapath + filename)
        
print('All files are available')

Downloading  train-images-idx3-ubyte.gz
Downloading  train-labels-idx1-ubyte.gz
Downloading  t10k-images-idx3-ubyte.gz
Downloading  t10k-labels-idx1-ubyte.gz
All files are available
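
Because the download loop stores whatever the server sends back, it can be worth checking that each saved file really is a gzip archive before extracting it. The sketch below is one way to do that, relying on the fact that every gzip stream starts with the two bytes `1f 8b`. The helper name `looks_like_gzip` and the throwaway files are assumptions made for illustration; they are not part of the original notebook.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream

def looks_like_gzip(path):
    """Return True if the file at `path` starts with the gzip signature."""
    with open(path, 'rb') as f:
        return f.read(2) == GZIP_MAGIC

# Demonstration on throwaway files rather than the real MNIST archives
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.gz')
    bad = os.path.join(tmp, 'bad.gz')
    # A genuine gzip file
    with gzip.open(good, 'wb') as f:
        f.write(b'some payload')
    # A file that only has the .gz name, e.g. an HTML page saved by mistake
    with open(bad, 'wb') as f:
        f.write(b'<html>not an archive</html>')
    good_ok = looks_like_gzip(good)
    bad_ok = looks_like_gzip(bad)

print(good_ok, bad_ok)
```

In the notebook, the same check could be applied to `datapath + filename` after each download. It only inspects the signature, so it would not catch a truncated archive.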


In [29]:
import os,gzip,shutil

# The path containing the MNIST data
datapath = './Data/MNISTData/'  

# List all files in the directory
files = os.listdir(datapath)

# Read each file
for file in files:
    if file.endswith('.gz'):
        print('Extracting ',file)
        # Open the archive and read the decompressed data to copy from
        with gzip.open(datapath + file, 'rb') as file_in:
            # Define the new file to copy to (same name without the .gz extension)
            with open(datapath + file.split('.')[0], 'wb') as file_out:
                # Copy the decompressed contents to the new file
                shutil.copyfileobj(file_in, file_out)
print('Extraction Complete\n')

# Clean up: remove only the .gz archives and keep the extracted files
for file in files:
    if file.endswith('.gz'):
        print('Removing ', file)
        os.remove(datapath + file)
print('All archives removed')

Extracting  t10k-images-idx3-ubyte.gz
Extracting  train-images-idx3-ubyte.gz
Extracting  train-labels-idx1-ubyte.gz
Extracting  t10k-labels-idx1-ubyte.gz
Extraction Complete

Removing  t10k-images-idx3-ubyte.gz
Removing  train-images-idx3-ubyte.gz
Removing  train-labels-idx1-ubyte.gz
Removing  t10k-labels-idx1-ubyte.gz
All archives removed
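
Extracting to disk is not strictly required: `gzip.open` can hand back the decompressed bytes directly, which may be convenient if you would rather keep only the archives. The following is a minimal sketch of that alternative, not the approach this notebook takes. The helper `read_gz_bytes` and the toy label file are hypothetical and exist only for the demonstration.

```python
import gzip
import os
import tempfile

def read_gz_bytes(path):
    """Return the fully decompressed contents of a gzip file as bytes."""
    with gzip.open(path, 'rb') as f:
        return f.read()

# Build a tiny stand-in archive: an IDX label header
# (magic number 2049, 3 items) followed by 3 one-byte labels
payload = bytes([0, 0, 8, 1]) + (3).to_bytes(4, 'big') + bytes([7, 2, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy-labels-idx1-ubyte.gz')
    with gzip.open(path, 'wb') as f:
        f.write(payload)
    # Read the archive back without writing an extracted copy
    data = read_gz_bytes(path)

print(len(data), data == payload)
```

The returned `bytes` object can be parsed in the same way as the contents of an extracted file.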


# 2. Processing the Encoded Images

The next step is processing the images. The primary challenge is telling image files apart from label files, which we can do using the _magic number_ explained below.

Each file begins with a _magic number_, stored as a 32-bit (4-byte) integer. Integers in the files are stored MSB first (big-endian), so the first four bytes must be read in that order and converted to an integer, which will be 2049 for labels or 2051 for images.

<img src="./images/image_label_layout_1.png" style="width: 500px; height: 342px; float: left;"/>
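
To make the magic number less mysterious, here is a small sketch that decodes it with the standard `struct` module. In the IDX format the first two bytes are zero, the third byte encodes the data type (`0x08` means unsigned byte) and the fourth byte gives the number of dimensions, which is why image files (3 dimensions) read as 2051 and label files (1 dimension) read as 2049. The function name `describe_magic` is an assumption made for illustration.

```python
import struct

def describe_magic(header):
    """Split the 4-byte IDX magic number into its parts."""
    # '>I' reads one big-endian unsigned 32-bit integer
    (magic,) = struct.unpack('>I', header[:4])
    # '>HBB' reads the same 4 bytes as: 2 zero bytes, data-type byte, dimension count
    zeros, dtype_code, ndim = struct.unpack('>HBB', header[:4])
    return magic, zeros, dtype_code, ndim

images_magic = describe_magic(bytes([0x00, 0x00, 0x08, 0x03]))
labels_magic = describe_magic(bytes([0x00, 0x00, 0x08, 0x01]))

print(images_magic)  # (2051, 0, 8, 3)
print(labels_magic)  # (2049, 0, 8, 1)
```

The built-in `int.from_bytes(header[:4], 'big')` returns the same integer as the `'>I'` unpack and is another option for the conversion.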


In [34]:
import os,codecs,numpy

# The path containing the extracted MNIST data
datapath = './Data/MNISTData/'

# List all files in the directory
files = os.listdir(datapath)

# Convert 4 bytes to an integer
def get_int(b):   
    return int(codecs.encode(b, 'hex'), 16)

# Dictionary to store the images and labels
data_dict = {}

for file in files:
    # Traverse all 'ubyte' files
    if file.endswith('ubyte'):  
        print('Reading ', file)
        
        # Define file path to read from
        with open (datapath + file,'rb') as f:
            # Store read data
            data = f.read() 
            # Bytes 0-3 are the 'magic number': it identifies the file as images or labels
            magic = get_int(data[:4])
            # Bytes 4-7 define the length of the array (Dimension 0)
            length = get_int(data[4:8])
            
            if magic == 2051:
                # Set the category as images
                category = 'images'
                # Bytes 8-11 define the number of rows (Dimension 1)
                num_rows = get_int(data[8:12])
                # Bytes 12-15 define the number of columns (Dimension 2)
                num_cols = get_int(data[12:16])
                # Read the pixel values as unsigned 8-bit integers
                parsed = numpy.frombuffer(data, dtype=numpy.uint8, offset=16)
                # Reshape the array as [number of samples, height, width]
                parsed = parsed.reshape(length, num_rows, num_cols)
            
            elif magic == 2049:
                # Set the category as labels
                category = 'labels'
                # Read the label values as unsigned 8-bit integers
                parsed = numpy.frombuffer(data, dtype=numpy.uint8, offset=8)
                # Reshape the array as the number of samples (length)
                parsed = parsed.reshape(length)
            
            else:
                # Not an MNIST image or label file
                print('Skipping ', file, ': unexpected magic number ', magic)
                continue
            
            # The test files hold 10,000 items and the training files 60,000
            if length == 10000:
                subset = 'test'
            elif length == 60000:
                subset = 'train'
            else:
                print('Skipping ', file, ': unexpected length ', length)
                continue
                
            # Save the NumPy array to the corresponding key
            data_dict[subset + '_' + category] = parsed

Reading  t10k-images-idx3-ubyte
Reading  t10k-labels-idx1-ubyte
Reading  train-images-idx3-ubyte
Reading  train-labels-idx1-ubyte
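
Two properties of the parsed arrays are worth knowing before using them. Arrays created by `numpy.frombuffer` over a `bytes` object are read-only, and the pixels are raw `uint8` values from 0 to 255 rather than floats. The sketch below illustrates both on a tiny stand-in array (two 3x3 "images" invented for the example, not real MNIST data) and shows one common way to rescale to the range 0 to 1.

```python
import numpy

# Stand-in for data_dict['train_images']: 2 images of 3x3 pixels (hypothetical)
raw = bytes([0, 255] * 9)
images = numpy.frombuffer(raw, dtype=numpy.uint8).reshape(2, 3, 3)

# Arrays backed by an immutable bytes object cannot be modified in place
is_writeable = images.flags.writeable

# Converting to float32 and dividing by 255 gives a new, writable array in [0, 1]
scaled = images.astype(numpy.float32) / 255.0

print(is_writeable, scaled.dtype, scaled.min(), scaled.max())
```

If in-place edits of the original integer values are needed, `images.copy()` returns a writable copy.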
