## THE MNIST DATABASE

![Mnistdata.png](img/Mnistdata.png)

### 1. Description
This notebook contains information about the MNIST dataset. Its purpose is to explain how to
read the MNIST dataset efficiently into memory in Python.

### 2. What is the MNIST dataset?
MNIST is a dataset developed by Yann LeCun, Corinna Cortes and Christopher Burges for evaluating machine learning models on the handwritten digit classification problem. Images of digits were taken from a variety of scanned documents, normalized in size and centered. MNIST contains 70,000 images of handwritten digits: 60,000 for training and 10,000 for testing.

#### The dataset consists of pairs of a “handwritten digit image” and a “label”. Digits range from 0 to 9, meaning 10 classes in total.
- __handwritten digit image:__ A grayscale image of 28 x 28 pixels (784 pixels in total). A standard split of the dataset is used to evaluate and compare models: 60,000 images are used to train a model and a separate set of 10,000 images is used to test it.
- __label:__ The actual digit that the handwritten digit image represents, an integer from 0 to 9. Results are reported as prediction error, which is simply 1 minus the classification accuracy.
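
To make the "prediction error" mentioned above concrete, here is a minimal sketch that computes it for a handful of labels. The function name `error_rate` and the sample values are made up for illustration; they are not real MNIST results.

```python
def error_rate(true_labels, predicted_labels):
    """Return the fraction of predictions that do not match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


# Hypothetical labels and predictions (illustrative values only)
true_labels = [7, 2, 1, 0, 4]
predicted_labels = [7, 2, 1, 0, 9]

error = error_rate(true_labels, predicted_labels)
accuracy = 1 - error
print("accuracy: {:.2f}, error: {:.2f}".format(accuracy, error))
```

With one wrong prediction out of five, the accuracy is 0.80 and the prediction error is 0.20.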

![mnist_plot.png](img/mnist_plot.png)

## 3. Prepare the data
Remember to download the four files which are required for this notebook:

- train-images-idx3-ubyte: training set images
- train-labels-idx1-ubyte: training set labels
- t10k-images-idx3-ubyte: test set images
- t10k-labels-idx1-ubyte: test set labels

You can find them under this [link](http://yann.lecun.com/exdb/mnist/). Next, create a folder called __data__ and save the files there. The training and test data of the MNIST database are stored in gzip-compressed (`.gz`) IDX files. Information about the __IDX file format__ can be found [here](http://www.fon.hum.uva.nl/praat/manual/IDX_file_format.html).



According to the IDX documentation linked above, reading the uncompressed file train-images-idx3-ubyte with 60000 images of 28×28 pixel data results in a matrix with 60000 rows and 784 (=28×28) columns. Each cell contains a number in the interval from 0 to 255.

Reading the uncompressed file train-labels-idx1-ubyte with 60000 labels results in a matrix with 1 row and 60000 columns. Each cell contains a number in the interval from 0 to 9.
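
The sizes quoted above can be turned into a quick sanity check. The sketch below assumes the header layout used later in this notebook (16 bytes before the first pixel in an image file, 8 bytes before the first label in a label file); the function names are hypothetical helpers, not part of any library.

```python
IMAGE_HEADER_BYTES = 16  # magic number, image count, rows, columns (4 bytes each)
LABEL_HEADER_BYTES = 8   # magic number, label count (4 bytes each)


def expected_image_file_size(num_images, rows=28, cols=28):
    """Size in bytes of an uncompressed image file: header plus one byte per pixel."""
    return IMAGE_HEADER_BYTES + num_images * rows * cols


def expected_label_file_size(num_labels):
    """Size in bytes of an uncompressed label file: header plus one byte per label."""
    return LABEL_HEADER_BYTES + num_labels


print(expected_image_file_size(60000))  # training images
print(expected_label_file_size(60000))  # training labels
print(expected_image_file_size(10000))  # test images
print(expected_label_file_size(10000))  # test labels
```

Comparing these numbers with the length of the decompressed bytes is one way to spot a truncated download.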

###  3.1 Unzip the files 
To unzip the files we can use the Python standard library module __gzip__. Documentation and a link to its source code can be found [here](https://docs.python.org/3/library/gzip.html).
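
If the data files are not at hand yet, the behaviour of `gzip.open` can be tried on a small throwaway file first. The sketch below writes a few made-up bytes to a temporary `.gz` file and reads them back; the file name and payload are placeholders, not real MNIST content.

```python
import gzip
import os
import tempfile

# Made-up payload: four header-like bytes followed by the values 0..9
payload = b'\x00\x00\x08\x01' + bytes(range(10))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example-idx1-ubyte.gz')

    # Write the payload gzip-compressed
    with gzip.open(path, 'wb') as f:
        f.write(payload)

    # Read it back; f.read() returns the decompressed bytes
    with gzip.open(path, 'rb') as f:
        restored = f.read()

print(restored == payload)
print(len(restored))
```

The same `gzip.open(..., 'rb')` call is what the cells below use on the downloaded files.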

#### 3.1.1 TEST SET IMAGE FILE (t10k-images-idx3-ubyte)

In [1]:
# Source code adapted from https://docs.python.org/3/library/gzip.html
# Import gzip
import gzip 

# Read bytes from files
with gzip.open('data/t10k-images-idx3-ubyte.gz', 'rb') as f:
    file_content = f.read()
    

#### 3.1.2 TEST SET LABEL FILE (t10k-labels-idx1-ubyte)

In [2]:
# Source code adapted from https://docs.python.org/3/library/gzip.html
# Import gzip
import gzip 

# Read bytes from files
with gzip.open('data/t10k-labels-idx1-ubyte.gz', 'rb') as f:
    labels = f.read()

###  3.2 Determine file type
Python supports a range of types to store sequences. There are six sequence types: strings, byte sequences (bytes objects), byte arrays (bytearray objects), lists, tuples, and range objects.

#### 3.2.1 TEST SET IMAGE FILE (t10k-images-idx3-ubyte)

In [3]:
# Checks type of file content
type(file_content)

bytes

In [4]:
# Find out what the first 4 bytes are
file_content[0:4]

b'\x00\x00\x08\x03'

#### 3.2.2 TEST SET LABEL FILE (t10k-labels-idx1-ubyte)

In [5]:
# Checks type of file content
type(labels)

bytes

In [6]:
# Find out what the first 4 bytes are
labels[0:4]

b'\x00\x00\x08\x01'

### 3.3. Convert Bytes to Integer
The bytes type in Python is immutable and stores a sequence of values ranging from 0 to 255 (8 bits each). You can get the value of a single byte by indexing, as with an array, but the values cannot be modified.
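
A short sketch of what this means in practice, using the four magic-number bytes seen above as sample data:

```python
data = b'\x00\x00\x08\x03'

third = data[2]        # indexing a bytes object returns an int
first_two = data[0:2]  # slicing returns a new bytes object

# Assigning to an element fails because bytes objects are immutable
try:
    data[0] = 1
    modified = True
except TypeError:
    modified = False

# A bytearray holds the same values but can be changed in place
editable = bytearray(data)
editable[0] = 1

print(third, first_two, modified, bytes(editable))
```

This is why the cells below slice the file contents and convert the slices, rather than editing the bytes in place.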

#### 3.3.1 TEST SET IMAGE FILE (t10k-images-idx3-ubyte)

In [7]:
# Source code adapted from: https://docs.python.org/3/library/stdtypes.html
# First four bytes to int
int.from_bytes(file_content[0:4], byteorder='big')

2051

#### 3.3.2 TEST SET LABEL FILE (t10k-labels-idx1-ubyte)

In [19]:
# Source code adapted from: https://docs.python.org/3/library/stdtypes.html
# First four bytes to int
int.from_bytes(labels[0:4], byteorder='big')

2049

### OR

In [8]:
# Source code adapted from: https://docs.python.org/3/library/stdtypes.html
# First four bytes to int
int.from_bytes(b'\x00\x00\x08\x03', byteorder='big')

2051

In [20]:
# Source code adapted from: https://docs.python.org/3/library/stdtypes.html
# First four bytes to int
int.from_bytes(b'\x00\x00\x08\x01', byteorder='big')

2049

The __byteorder__ argument determines the byte order used to represent the integer. If byteorder is "big", the most significant byte is at the beginning of the byte array. If byteorder is "little", the most significant byte is at the end of the byte array. Check this [link](https://en.wikipedia.org/wiki/Endianness) if you want to find out more about little and big endian.  
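
To see the effect of __byteorder__, the sketch below decodes the same four bytes both ways. It also picks apart the magic number following the IDX format description, under which the third byte encodes the data type (0x08 means unsigned byte) and the fourth byte the number of dimensions; the variable names are only illustrative.

```python
magic = b'\x00\x00\x08\x03'

# Same bytes, two interpretations
big = int.from_bytes(magic, byteorder='big')
little = int.from_bytes(magic, byteorder='little')
print(big, little)

# Per the IDX format description: byte 2 is the data type code,
# byte 3 is the number of dimensions (3 for images, 1 for labels)
data_type_code = magic[2]
num_dimensions = magic[3]
print(hex(data_type_code), num_dimensions)
```

Only the big-endian reading gives 2051, which is why `byteorder='big'` is used throughout this notebook.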

In [18]:
# TEST SET IMAGE FILE (t10k-images-idx3-ubyte)
print("Number of images in the file: {}. ".format(int.from_bytes(file_content[4:8], byteorder='big')))
print("Number of rows per image: {}. ".format(int.from_bytes(file_content[8:12], byteorder='big')))
print("Number of columns per image: {}. ".format(int.from_bytes(file_content[12:16], byteorder='big')))

Number of images in the file: 10000. 
Number of rows per image: 28. 
Number of columns per image: 28. 
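
The four header fields read one by one above can also be unpacked in a single call with the standard library `struct` module. This is an alternative sketch, not the approach used in the rest of the notebook; it runs on a synthetic header built with `struct.pack` so that it does not depend on the downloaded file.

```python
import struct

# Synthetic 16-byte header with the values reported for the test image file
header = struct.pack('>IIII', 2051, 10000, 28, 28)

# '>' means big-endian, 'I' is a 4-byte unsigned integer
magic, num_images, num_rows, num_cols = struct.unpack('>IIII', header)
bytes_per_image = num_rows * num_cols

print(magic, num_images, num_rows, num_cols, bytes_per_image)
```

On the real data, `struct.unpack('>IIII', file_content[0:16])` should give the same four values.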


In [32]:
# TEST SET LABEL FILE (t10k-labels-idx1-ubyte)
# Bytes 4-7 hold the number of labels; the labels follow from byte 8, one byte each
print("Number of items in a file: {}. ".format(int.from_bytes(labels[4:8], byteorder='big')))
# First label (byte 8)
print("Integer : {} ".format(int.from_bytes(labels[8:9], byteorder='big')))
# Eighth label (byte 15)
print("Integer : {}  ".format(int.from_bytes(labels[15:16], byteorder='big')))

Number of items in a file: 10000. 
Integer : 7 
Integer : 9  
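
Since every label occupies exactly one byte after the 8-byte header, all labels can be decoded at once. The sketch below uses a hypothetical helper `read_labels` on a tiny synthetic label file; the sample values are placeholders chosen so that positions 0 and 7 agree with the output above.

```python
LABEL_HEADER_BYTES = 8  # magic number and label count, 4 bytes each


def read_labels(raw):
    """Decode an uncompressed IDX label file into a list of ints."""
    count = int.from_bytes(raw[4:8], byteorder='big')
    values = list(raw[LABEL_HEADER_BYTES:LABEL_HEADER_BYTES + count])
    if len(values) != count:
        raise ValueError("file is shorter than its header claims")
    return values


# Synthetic label file: magic number 2049, label count, then the labels
sample_values = [7, 2, 1, 0, 4, 1, 4, 9]  # placeholder values for illustration
raw = (2049).to_bytes(4, 'big') + len(sample_values).to_bytes(4, 'big') + bytes(sample_values)

decoded = read_labels(raw)
print(decoded[0], decoded[7])
```

Label `i` sits at byte `8 + i`, which is why `labels[8:9]` is the first label and `labels[15:16]` the eighth.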


### 3.4 Read and display the first image

In [13]:
# Bytes 16-799 hold the 784 pixels of the first image (the header takes bytes 0-15)
l = file_content[16:800]

In [14]:
type(l)

bytes

In [15]:
import numpy as np

In [16]:
# First image as a 28x28 uint8 array; ~ inverts the pixel values (255 - value)
image = ~np.frombuffer(file_content[16:800], dtype=np.uint8).reshape(28, 28)
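
The slice `16:800` is the special case for the first image. As a sketch of the general rule, image `i` starts at byte `16 + i * rows * cols`. The helpers `image_rows` and `invert` below are hypothetical, use plain Python lists instead of NumPy, and run on a synthetic file with two 2×2 images to keep the example small.

```python
IMAGE_HEADER_BYTES = 16  # magic number, image count, rows, columns (4 bytes each)


def image_rows(raw, index):
    """Return image number `index` as a list of rows of pixel values."""
    rows = int.from_bytes(raw[8:12], byteorder='big')
    cols = int.from_bytes(raw[12:16], byteorder='big')
    start = IMAGE_HEADER_BYTES + index * rows * cols
    pixels = raw[start:start + rows * cols]
    return [list(pixels[r * cols:(r + 1) * cols]) for r in range(rows)]


def invert(rows):
    """Invert 8-bit pixel values, the same effect ~ has on a uint8 array."""
    return [[255 - value for value in row] for row in rows]


# Synthetic image file: header for two 2x2 images, then 4 bytes per image
header = b''.join(n.to_bytes(4, 'big') for n in (2051, 2, 2, 2))
raw = header + bytes([0, 255, 10, 20]) + bytes([1, 2, 3, 4])

first = image_rows(raw, 0)
second = image_rows(raw, 1)
print(first, second)
print(invert(first))
```

For a 28×28 file and `index` 0 the computed slice is exactly `16:800`, matching the cell above.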

In [17]:
import matplotlib.pyplot as plt

plt.imshow(image,cmap='gray')




<matplotlib.image.AxesImage at 0x1ed7e8a0048>

### References:

https://corochann.com/mnist-dataset-introduction-1138.html

https://machinelearningmastery.com/handwritten-digit-recognition-using-convolutional-neural-networks-python-keras/

https://www.w3resource.com/python/python-bytes.php

https://docs.python.org/3/library/gzip.html

https://docs.python.org/3/library/stdtypes.html