# What is the MNIST Dataset

MNIST is a dataset of images of handwritten digits. It contains a training set of 60,000 examples and a test set of 10,000 examples. The digits have been size-normalized and centered in a fixed-size image (28x28 pixels). The raw files store each pixel as a grayscale byte from 0 to 255, which is commonly rescaled to the range 0 to 1.
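
As a rough illustration of the pixel range, here is a minimal sketch that rescales byte-valued pixels (0 to 255, as stored in the raw files) to floats between 0 and 1. The array `fake_images` is made-up stand-in data, not real MNIST, and the name is purely illustrative.

```python
import numpy as np

# Stand-in for loaded MNIST images: 2 images of 28x28 uint8 pixels (made-up data)
fake_images = np.zeros((2, 28, 28), dtype=np.uint8)
fake_images[0, 14, 14] = 255   # one fully "inked" pixel
fake_images[1, 0, 0] = 128     # one mid-gray pixel

# Rescale from the stored byte range [0, 255] to floats in [0, 1]
scaled = fake_images.astype(np.float32) / 255.0

print(scaled.dtype, scaled.min(), scaled.max())
```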

The digits were written by high school students and employees of the United States Census Bureau, collected in NIST's Special Database 1 and Special Database 3 respectively. The combination of these two NIST databases is what forms the MNIST dataset.

# Reading the MNIST Dataset 

In this section I'm going to show you how to read the MNIST dataset into memory from scratch. This is often done in one or two lines when using a library like TensorFlow or Keras, e.g.

```
from keras.datasets import mnist
# get the dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()
```

Pre-configured data files are fantastic and save a lot of time. However, it's important that we know how to prepare and load datasets ourselves.

I have organized this notebook into four easy-to-digest sections:

1. Download and extract the four MNIST gzipped files
2. Process the encoded images
3. Save the images to a dictionary
4. Store the dictionary

# 1. Downloading and Extracting the Dataset

Our first task is to download the necessary files. When run, the code below creates a **Data** folder containing an **MNISTData** folder in the current directory. To fetch the files we can use the `urllib` package for working with URLs, specifically `urllib.request` for opening and reading URLs.

Now that we have a way to download the files, we need an easy way to name each one. Each filename is the part of its URL after the final forward slash. The `split` method coupled with the negative index `[-1]` lets us take exactly that, i.e. _"train-images-idx3-ubyte.gz"_
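
The filename extraction can be tried on its own. This small sketch uses one of the MNIST URLs from the download cell and also shows `os.path.basename` as an alternative that gives the same result for this URL.

```python
import os

url = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'

# Split on '/' and take the last piece: everything after the final slash
filename = url.split('/')[-1]
print(filename)

# os.path.basename gives the same answer for this URL
alt = os.path.basename(url)
print(alt == filename)
```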

The final step is to download the data and copy it to our named file in the directory. 

In [4]:
# Adapted from https://github.com/Ghosh4AI/Data-Processors/blob/master/MNIST/MNIST_Loader.ipynb

import os
import urllib.request

# The path where the necessary files are downloaded to
datapath = './Data/MNISTData/'  

# Create the directory if it doesn't already exist
if not os.path.exists(datapath):
    os.makedirs(datapath)

# The necessary download URLS for the training and testing images/labels
urls = ['http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz',
       'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz']

# Loop through each URL 
for url in urls:
    # Get individual filenames from the list
    filename = url.split('/')[-1]
    
    # Check if the file already exists
    if os.path.exists(datapath + filename):
        print(filename, ' already exists')
    else:
        print('Downloading ', filename)
        # Copy object specified in the URL to the local file
        urllib.request.urlretrieve(url, datapath + filename)
        
print('All files are available')

Downloading  train-images-idx3-ubyte.gz
Downloading  train-labels-idx1-ubyte.gz
Downloading  t10k-images-idx3-ubyte.gz
Downloading  t10k-labels-idx1-ubyte.gz
All files are available


The second part of preparing our downloaded dataset is to extract the files. The idx files we need are all stored in gzip-compressed (.gz) form. Using the `open` function provided by the `gzip` package, we can open each archive as a file object, `file_in`, and copy its decompressed contents to a new `file_out` object whose name drops the .gz extension.

Lastly, we can clean up the leftover .gz archives so that only our meaningful data remains.

In [5]:
import os
import gzip
import shutil

# The path containing the MNIST data
datapath = './Data/MNISTData/'  

# List all files in the directory
files = os.listdir(datapath)

# Extract each .gz archive
for file in files:
    if file.endswith('.gz'):
        print('Extracting ',file)
        # Open the compressed archive to read from
        with gzip.open(datapath + file, 'rb') as file_in:
            # Define the new file to copy to (same name without the .gz extension)
            with open(datapath + file.split('.')[0], 'wb') as file_out:
                # Copy the decompressed contents to the new file
                shutil.copyfileobj(file_in, file_out)
print('Extraction Complete\n')

# Clean up: remove only the .gz archives so the extracted files are kept,
# even if this cell is run more than once
for file in files:
    if file.endswith('.gz'):
        print('Removing ', file)
        os.remove(datapath + file)
print('All archives removed')

Extracting  t10k-images-idx3-ubyte.gz
Extracting  train-images-idx3-ubyte.gz
Extracting  train-labels-idx1-ubyte.gz
Extracting  t10k-labels-idx1-ubyte.gz
Extraction Complete

Removing  t10k-images-idx3-ubyte.gz
Removing  train-images-idx3-ubyte.gz
Removing  train-labels-idx1-ubyte.gz
Removing  t10k-labels-idx1-ubyte.gz
All archives removed
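
The extraction step can also be exercised in isolation, without touching the real downloads. The sketch below is one possible variant, not the cell above: the helper name `extract_gz` and the demo file are hypothetical, and it uses `os.path.splitext` so that only the final `.gz` extension is dropped.

```python
import gzip
import os
import shutil
import tempfile

def extract_gz(src_path):
    """Decompress 'name.gz' next to itself and return the new path (hypothetical helper)."""
    # os.path.splitext drops only the final extension ('.gz')
    dst_path, ext = os.path.splitext(src_path)
    if ext != '.gz':
        raise ValueError('expected a .gz file, got: ' + src_path)
    with gzip.open(src_path, 'rb') as file_in, open(dst_path, 'wb') as file_out:
        shutil.copyfileobj(file_in, file_out)
    return dst_path

# Demo on a throwaway archive so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'demo-idx1-ubyte.gz')
    payload = b'\x00\x00\x08\x01' + b'\x00\x00\x00\x02' + b'\x07\x03'
    with gzip.open(archive, 'wb') as f:
        f.write(payload)

    extracted = extract_gz(archive)
    with open(extracted, 'rb') as f:
        restored = f.read()

print(os.path.basename(extracted), restored == payload)
```

Compared with `file.split('.')[0]`, `splitext` would still behave sensibly if a filename ever contained an extra dot, though for the four MNIST files both give the same names.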


# 2. Processing the Encoded Images

Now that we have our extracted files, we need a way to process the images and labels encoded within. The primary challenge here is to isolate the images from the labels and separate them into their respective sets of either **test** or **train**.

- - -

There are several components of the idx format that are important to understand. Firstly, the _magic number_.

* **bytes 0-3:** Each file starts with a _magic number_, which is an integer. Integers in the files are stored in MSB-first (big-endian) format. This integer will be either 2049, representing labels, or 2051, representing images. The first 32 bits (4 bytes) must therefore be converted to an integer to find out whether we're dealing with images or labels.


* **bytes 4-7:** The next 4 bytes give us another 32-bit integer, this time representing the number of items (images or labels), e.g. for t10k-labels-idx1-ubyte this will be 10,000 labels.


* **bytes 8-11:** In image files, these give us the number of rows. (Label files have no row or column fields; their label values start at byte 8.)


* **bytes 12-15:** In image files, these give us the number of columns.

- - -
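
To make the header layout concrete, here is a minimal sketch that builds a made-up 16-byte image-file header and reads the four fields back with the standard `struct` module. This is an alternative to the byte-slicing approach used in this notebook, and the header values are illustrative rather than read from a real file.

```python
import struct

# Build a made-up 16-byte image-file header: magic, count, rows, cols
header = struct.pack('>IIII', 2051, 10000, 28, 28)

# '>' means big-endian (MSB first); each 'I' is one unsigned 32-bit integer
magic, count, rows, cols = struct.unpack('>IIII', header[:16])
print(magic, count, rows, cols)

# The same magic number, byte by byte: 00 00 08 03
print(header[:4].hex())
```

In the idx format the third byte of the magic number encodes the data type (`0x08` for unsigned byte) and the fourth byte gives the number of dimensions, which is why label files (1 dimension) use 2049 and image files (3 dimensions) use 2051.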

The function `get_int` is used to help convert these important bytes to the integers we need.
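
As a quick check of what this conversion does, the sketch below applies the notebook's `get_int` (reproduced from the code cell further down) to four made-up bytes and compares it with Python's built-in `int.from_bytes`, which performs the same big-endian conversion.

```python
import codecs

def get_int(b):
    # Approach used in this notebook: hex-encode the bytes, then parse as base 16
    return int(codecs.encode(b, 'hex'), 16)

raw = b'\x00\x00\x27\x10'  # made-up example: 10,000 stored MSB first

via_codecs = get_int(raw)
via_builtin = int.from_bytes(raw, byteorder='big')
print(via_codecs, via_builtin)
```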


To handle the data from each file we read in, we can use NumPy's `frombuffer` function to parse the data as an array.

In an image file, every byte after the 16-byte header (bytes 0-15) represents a pixel value, so we set an offset of 16; in a label file the values start right after the 8-byte header, so the offset is 8. We read each value as uint8 (an unsigned 8-bit integer), so each byte gives us one pixel intensity at a specific location.

After that, all that's left to do is reshape the array according to whichever type we're dealing with. For example, if we're dealing with images, the array structure will look something like 

    test_images = [10,000, 28, 28]
    i.e. 10,000 images of 28x28 pixels
    These dimensions were read from bytes 4-15
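
The offset-and-reshape step can be sketched on a tiny in-memory buffer. The data here is made up (2 images of 3x4 pixels rather than real MNIST), but the layout follows the same 16-byte header plus one byte per pixel.

```python
import struct
import numpy as np

# Made-up idx3 buffer: 16-byte header followed by 2 images of 3x4 pixels
n, rows, cols = 2, 3, 4
header = struct.pack('>IIII', 2051, n, rows, cols)
pixels = bytes(range(n * rows * cols))   # pixel bytes 0..23
data = header + pixels

# Skip the header and read one byte per pixel
parsed = np.frombuffer(data, dtype=np.uint8, offset=16)
images = parsed.reshape(n, rows, cols)

print(images.shape)
print(images[1, 0])   # first row of the second image
```

Note that `frombuffer` on a `bytes` object returns a read-only view of that memory; calling `.copy()` on the result gives a writable array if the pixels need to be modified later.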
    
Below is a helpful graphic from the official MNIST [site](http://yann.lecun.com/exdb/mnist/) to help solidify the concepts we've discussed already.

<img src="./images/image_label_layout_1.png" style="width: 500px; height: 342px; float: left;"/>


Now that we can isolate images and labels successfully, we need to separate them into their respective sets of **train** or **test**. For this, all we need to do is look at the length of each array. If the length is 10,000, we know the data belongs to the test set. If the length is 60,000, we know it belongs to the training set. We can then store each array under its own key in our data dictionary for later use.
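
Relying on array length works for MNIST, but it would break for any idx dataset with different set sizes. As an alternative sketch (not the approach used in this notebook), the set can be inferred from the filename prefix instead. The helper name `subset_from_filename` is hypothetical.

```python
def subset_from_filename(filename):
    """Map an MNIST filename to 'train' or 'test' based on its prefix."""
    if filename.startswith('train-'):
        return 'train'
    if filename.startswith('t10k-'):
        return 'test'
    raise ValueError('unrecognized MNIST filename: ' + filename)

names = ['train-images-idx3-ubyte', 'train-labels-idx1-ubyte',
         't10k-images-idx3-ubyte', 't10k-labels-idx1-ubyte']
subsets = [subset_from_filename(n) for n in names]
print(subsets)
```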

In [6]:
import os
import codecs
import numpy

# The path containing the extracted MNIST data
datapath = './Data/MNISTData/'

# List all files in the directory
files = os.listdir(datapath)

# Convert 4 bytes to an integer
def get_int(b):   
    return int(codecs.encode(b, 'hex'), 16)

# Dictionary to store the images and labels
data_dict = {}

# Loop through all files in the directory
for file in files:
    # Make sure we only read files ending with 'ubyte'
    if file.endswith('ubyte'):  
        # Verify we are reading the correct file
        print('Reading ', file)
        
        # Define file path to read from
        with open(datapath + file, 'rb') as f:
            # Store read data
            data = f.read()
            # bytes 0-3 are the 'Magic Number': defines whether the file holds images or labels
            magic = get_int(data[:4])
            # bytes 4-7 define the length of the array (Dimension 0)
            length = get_int(data[4:8])

            if magic == 2051:
                # Set the category as images
                category = 'images'
                # Bytes 8-11 define the number of rows (Dimension 1)
                num_rows = get_int(data[8:12])
                # Bytes 12-15 define the number of columns (Dimension 2)
                num_cols = get_int(data[12:16])
                # Read the pixel values as integers
                parsed = numpy.frombuffer(data, dtype=numpy.uint8, offset=16)
                # Reshape the array as (number of samples, height, width)
                parsed = parsed.reshape(length, num_rows, num_cols)

            elif magic == 2049:
                # Set the category as labels
                category = 'labels'
                # Read the label values as integers
                parsed = numpy.frombuffer(data, dtype=numpy.uint8, offset=8)
                # Reshape the array as the number of samples (length)
                parsed = parsed.reshape(length)

            else:
                # Skip anything that is not an MNIST image or label file
                print('Skipping ', file, ': unexpected magic number', magic)
                continue

            # Separate images/labels into their respective sets
            # The test set contains 10,000 examples
            if length == 10000:
                subset = 'test'
            # The training set contains 60,000 examples
            elif length == 60000:
                subset = 'train'

            # Save the NumPy array to the corresponding key
            # (named magic/subset to avoid shadowing the built-ins type and set)
            data_dict[subset + '_' + category] = parsed

Reading  t10k-images-idx3-ubyte
Reading  t10k-labels-idx1-ubyte
Reading  train-images-idx3-ubyte
Reading  train-labels-idx1-ubyte
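
The same parsing logic can be wrapped in a reusable function that also checks the file size against the header. The sketch below is one way to do that under the layout described above; the function name `read_idx` and the demo file are hypothetical, and it only handles the two magic numbers used by MNIST.

```python
import os
import struct
import tempfile
import numpy as np

def read_idx(path):
    """Read an idx1 (labels) or idx3 (images) ubyte file into a NumPy array."""
    with open(path, 'rb') as f:
        data = f.read()
    magic, length = struct.unpack('>II', data[:8])
    if magic == 2051:                       # images: two more header integers
        rows, cols = struct.unpack('>II', data[8:16])
        shape, offset = (length, rows, cols), 16
    elif magic == 2049:                     # labels: values start at byte 8
        shape, offset = (length,), 8
    else:
        raise ValueError('unexpected magic number: %d' % magic)
    # The file should hold exactly the header plus one byte per value
    expected = offset + int(np.prod(shape))
    if len(data) != expected:
        raise ValueError('expected %d bytes, found %d' % (expected, len(data)))
    return np.frombuffer(data, dtype=np.uint8, offset=offset).reshape(shape)

# Demo on a made-up label file with three labels
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo-labels-idx1-ubyte')
    with open(path, 'wb') as f:
        f.write(struct.pack('>II', 2049, 3) + bytes([7, 2, 1]))
    labels = read_idx(path)

print(labels.shape, labels.tolist())
```

By the same arithmetic, an extracted training image file should be 16 + 60,000 x 28 x 28 = 47,040,016 bytes, which makes a truncated download easy to spot.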


Let's make sure our data dictionary was saved correctly.

In [7]:
data_dict.keys()

dict_keys(['test_images', 'test_labels', 'train_images', 'train_labels'])

In [8]:
data_dict['test_images'].shape

(10000, 28, 28)

In [9]:
data_dict['test_labels'].shape

(10000,)

In [10]:
data_dict['train_images'].shape

(60000, 28, 28)

In [11]:
data_dict['train_labels'].shape

(60000,)

Excellent, as we can see, all of our data is now nicely packaged into our dictionary data structure, which is a simple key:value pair store of 