# Data Manipulation
In this section we convert the MNIST image dataset into NumPy arrays suitable for machine learning.

### Data Structure
The training set contains 60000 examples, and the test set 10000 examples. The first 5000 examples of the test set are taken from the original NIST training set. The last 5000 are taken from the original NIST test set. The first 5000 are cleaner and easier than the last 5000.

This data is available from http://yann.lecun.com/exdb/mnist/ and is distributed as gzipped IDX files with the following structure.


##### Train/Test Images

|Offset  |Type           |Value |Description       |
|:------:|:-------------:|:----:|:----------------:|
|0000    |32 bit integer |2051  |magic number      |
|0004    |32 bit integer |60000 (train) / 10000 (test) |number of images  |
|0008    |32 bit integer |28    |number of rows    |
|0012    |32 bit integer |28    |number of columns |
|0016    |unsigned byte  |??    |pixel             |
|....... |               |      |                  |
|xxxx    |unsigned byte  |??    |pixel             |

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
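
Because the header is 16 bytes and pixels are stored row-wise, pixel (row, col) of image i sits at byte offset 16 + i * rows * cols + row * cols + col in the uncompressed file. The snippet below is a minimal sketch of that arithmetic on a tiny synthetic buffer (two 2x3 images, not real MNIST data); the helper name `pixel_offset` is hypothetical and is not part of this notebook's code.

```python
import struct

HEADER_SIZE = 16  # four big-endian 32-bit integers


def pixel_offset(index, row, col, rows, cols):
    """Byte offset of one pixel in an uncompressed IDX image file."""
    return HEADER_SIZE + index * rows * cols + row * cols + col


# Synthetic buffer: magic number 2051, then 2 images of 2 rows x 3 columns.
header = struct.pack('>IIII', 2051, 2, 2, 3)
pixels = bytes(range(12))  # pixel values 0..11, stored row-wise
buf = header + pixels

# Read the four header fields back as big-endian unsigned integers.
magic, num_images, rows, cols = struct.unpack('>IIII', buf[:HEADER_SIZE])
print(magic, num_images, rows, cols)

# Pixel (row 1, col 2) of image 1 is the last byte of this buffer.
offset = pixel_offset(1, 1, 2, rows, cols)
print(offset, buf[offset])
```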

##### Train/Test Labels

|Offset  |Type           |Value |Description       |
|:------:|:-------------:|:----:|:----------------:|
|0000    |32 bit integer |2049  |magic number      |
|0004    |32 bit integer |60000 (train) / 10000 (test) |number of items   |
|0008    |unsigned byte  |??    |label             |
|....... |               |      |                  |
|xxxx    |unsigned byte  |??    |label             |

Label values are 0 to 9.
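
The two magic numbers in the tables above are not arbitrary. In the IDX format, the first two bytes are zero, the third byte encodes the element type (0x08 for unsigned byte), and the fourth byte gives the number of dimensions. The following sketch decodes them; `decode_magic` is a hypothetical helper name used only for illustration.

```python
import struct


def decode_magic(magic):
    """Split an IDX magic number into (type_code, num_dims)."""
    # Pack as a big-endian 32-bit integer, then inspect the four bytes.
    b0, b1, type_code, num_dims = struct.pack('>I', magic)
    if b0 != 0 or b1 != 0:
        raise ValueError('Not an IDX magic number: %d' % magic)
    return type_code, num_dims


print(decode_magic(2051))  # image files: (8, 3)
print(decode_magic(2049))  # label files: (8, 1)
```

This is why the image files carry `idx3` in their names (images, rows, columns) and the label files carry `idx1`.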

### Code
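First, we set up the required imports and define the location and filenames for the data. Note that `tensorflow.contrib` exists only in TensorFlow 1.x, so the imports below require a 1.x installation.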

In [None]:
import gzip
import numpy
from tensorflow.contrib.learn.python.learn.datasets import base
from tensorflow.python.platform import gfile

TRAIN_IMAGES = 'train-images-idx3-ubyte.gz'
TRAIN_LABELS = 'train-labels-idx1-ubyte.gz'
TEST_IMAGES = 't10k-images-idx3-ubyte.gz'
TEST_LABELS = 't10k-labels-idx1-ubyte.gz'

source_url = 'https://storage.googleapis.com/cvdf-datasets/mnist/'
train_dir = "../data/"

Define a helper that reads the next 4 bytes of the stream as a single big-endian unsigned 32-bit integer, the type used by the IDX header fields.

In [None]:
def _read32(bytestream):
    """Read one big-endian unsigned 32-bit integer from the stream.

    Args:
    bytestream: A file object opened with a gzip reader.
    Returns:
    data: A numpy uint32 scalar decoded from the next 4 bytes.
    """
    dt = numpy.dtype(numpy.uint32).newbyteorder('>')
    data = numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
    return data
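
To see why the byte order matters, the sketch below reads two header fields from an in-memory stream. It uses the dtype string `'>u4'`, which should be equivalent to the `newbyteorder('>')` call above; `read_be_uint32` is a hypothetical stand-in for `_read32` so that the snippet runs on its own.

```python
import io

import numpy


def read_be_uint32(stream):
    """Read the next 4 bytes as one big-endian unsigned 32-bit integer."""
    return numpy.frombuffer(stream.read(4), dtype=numpy.dtype('>u4'))[0]


# Bytes for 2051 (0x00000803) followed by 60000 (0x0000EA60).
stream = io.BytesIO(b'\x00\x00\x08\x03' + b'\x00\x00\xea\x60')
first = read_be_uint32(stream)
second = read_be_uint32(stream)
print(first, second)

# Decoding the same bytes as little-endian gives a very different value.
wrong = numpy.frombuffer(b'\x00\x00\x08\x03', dtype=numpy.dtype('<u4'))[0]
print(wrong)
```

Each call advances the stream by 4 bytes, which is why the header fields must be read in the order listed in the tables.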

Define a function to extract all images from the gzipped IDX file and return a 4D uint8 numpy array [index, y, x, depth].

In [None]:
def extract_images(f):
    """
    Args:
    f: A file object that can be passed into a gzip reader.
    Returns:
    data: A 4D uint8 numpy array [index, y, x, depth].
    Raises:
    ValueError: If the bytestream does not start with 2051.
    """
    print('Extracting', f.name)
    with gzip.GzipFile(fileobj=f) as bytestream:
        magic = _read32(bytestream)
        # Check IDX magic number to ensure we have the correct data:
        if magic != 2051:
            raise ValueError('Invalid magic number %d in MNIST image file: %s' %
                           (magic, f.name))
        num_images = _read32(bytestream)
        rows = _read32(bytestream)
        cols = _read32(bytestream)
        buf = bytestream.read(rows * cols * num_images)
        data = numpy.frombuffer(buf, dtype=numpy.uint8)
        # Convert the 1D buffer read into 4D image array
        data = data.reshape(num_images, rows, cols, 1)
        return data
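
The reshape step can be checked on a small example. This sketch uses a synthetic 12-byte buffer (two 2x3 images) in place of the MNIST pixel data and shows how the flat buffer maps onto the [index, y, x, depth] axes.

```python
import numpy

num_images, rows, cols = 2, 2, 3

# Synthetic pixel data: 12 bytes with values 0..11, stored row-wise.
buf = bytes(range(num_images * rows * cols))
flat = numpy.frombuffer(buf, dtype=numpy.uint8)

# Same reshape as in extract_images: a trailing depth axis of size 1.
images = flat.reshape(num_images, rows, cols, 1)
print(images.shape)

# First row of the second image.
print(images[1, 0, :, 0])

# frombuffer on a bytes object yields a read-only array; copy before editing.
print(images.flags.writeable)
editable = images.copy()
editable[0, 0, 0, 0] = 255
```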

Define a function to extract all labels from the gzipped IDX file and return a 1D uint8 numpy array, or a 2D one-hot array when `one_hot` is set.

In [None]:
def extract_labels(f, one_hot=False, num_classes=10):
    """
    Args:
    f: A file object that can be passed into a gzip reader.
    one_hot: If True, return one-hot encoded labels.
    num_classes: Number of classes for the one hot encoding.
    Returns:
    labels: A 1D uint8 numpy array, or a 2D float array of shape
        [num_items, num_classes] if one_hot is True.
    Raises:
    ValueError: If the bytestream doesn't start with 2049.
    """
    print('Extracting', f.name)
    with gzip.GzipFile(fileobj=f) as bytestream:
        magic = _read32(bytestream)
        # Check IDX magic number to ensure we have the correct data:
        if magic != 2049:
            raise ValueError('Invalid magic number %d in MNIST label file: %s' %
                           (magic, f.name))
        num_items = _read32(bytestream)
        buf = bytestream.read(num_items)
        labels = numpy.frombuffer(buf, dtype=numpy.uint8)
        if one_hot:
            return dense_to_one_hot(labels, num_classes)
        return labels
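
One way to exercise this parsing logic without downloading anything is a round trip through an in-memory gzip stream. The sketch below assumes the label layout from the table above; `make_idx_labels` and `parse_idx_labels` are hypothetical helpers written for this illustration, and they use `struct` rather than `_read32` so the snippet is self-contained.

```python
import gzip
import io
import struct

import numpy


def make_idx_labels(labels, magic=2049):
    """Build a gzipped IDX label file in memory from a list of ints."""
    raw = struct.pack('>II', magic, len(labels)) + bytes(labels)
    return gzip.compress(raw)


def parse_idx_labels(gz_bytes):
    """Parse gzipped IDX label bytes into a 1D uint8 array."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as bytestream:
        magic, num_items = struct.unpack('>II', bytestream.read(8))
        # Same magic-number check as in extract_labels.
        if magic != 2049:
            raise ValueError('Invalid magic number %d' % magic)
        return numpy.frombuffer(bytestream.read(num_items), dtype=numpy.uint8)


labels = parse_idx_labels(make_idx_labels([5, 0, 4, 1]))
print(labels)
```

Passing an image-style magic number, for example `make_idx_labels([1], magic=2051)`, makes the parser raise `ValueError`, which mirrors the check in `extract_labels`.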

Convert class labels from scalars to one-hot encoded vectors.

In [None]:
def dense_to_one_hot(labels_dense, num_classes):
    print("Processing one_hot...")
    print("Original label structure (first 3):")
    print(labels_dense[:3])
    num_labels = labels_dense.shape[0]
    index_offset = numpy.arange(num_labels) * num_classes
    labels_one_hot = numpy.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    print("One-hot encoded label structure (first 3):")
    print(labels_one_hot[:3])
    return labels_one_hot
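
The flat-index trick in `dense_to_one_hot` is easier to follow on three labels. Row i of the one-hot matrix starts at flat position i * num_classes, so adding the label value gives the position of the single 1 in that row. The sketch below repeats that arithmetic and compares it with indexing into an identity matrix, an alternative that is not used in this notebook.

```python
import numpy

labels_dense = numpy.array([5, 0, 4], dtype=numpy.uint8)
num_classes = 10
num_labels = labels_dense.shape[0]

# Start of each row in the flattened (num_labels x num_classes) matrix.
index_offset = numpy.arange(num_labels) * num_classes

# Flat position of the 1 in each row: row start plus label value.
flat_index = index_offset + labels_dense
print(flat_index)

one_hot = numpy.zeros((num_labels, num_classes))
one_hot.flat[flat_index] = 1

# Alternative: pick rows of the identity matrix by label value.
alt = numpy.eye(num_classes)[labels_dense]
print(numpy.array_equal(one_hot, alt))
```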

Download (and extract) data from the MNIST dataset.

The following calls to `base.maybe_download` download each file if it is not already present in `train_dir` and return the path of the resulting file.

In [None]:
# Specify use of one-hot encoding
one_hot = True

train_images_file = base.maybe_download(TRAIN_IMAGES, train_dir, source_url + TRAIN_IMAGES)
train_labels_file = base.maybe_download(TRAIN_LABELS, train_dir, source_url + TRAIN_LABELS)
test_images_file = base.maybe_download(TEST_IMAGES, train_dir, source_url + TEST_IMAGES)
test_labels_file = base.maybe_download(TEST_LABELS, train_dir, source_url + TEST_LABELS)

# Open each downloaded file and parse it into numpy arrays.
with gfile.Open(train_images_file, 'rb') as f:
    train_images = extract_images(f)

with gfile.Open(train_labels_file, 'rb') as f:
    train_labels = extract_labels(f, one_hot=one_hot)

with gfile.Open(test_images_file, 'rb') as f:
    test_images = extract_images(f)

with gfile.Open(test_labels_file, 'rb') as f:
    test_labels = extract_labels(f, one_hot=one_hot)
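
Since `tensorflow.contrib` is not available in TensorFlow 2.x, the download step may need a replacement there. The following is a minimal sketch using only the standard library; `maybe_download_sketch` is a hypothetical name, and it only approximates the download-if-missing behaviour described above rather than reproducing the TensorFlow helper.

```python
import os
import urllib.request


def maybe_download_sketch(filename, work_directory, url):
    """Download url into work_directory unless the file already exists.

    Returns the path of the local file.
    """
    os.makedirs(work_directory, exist_ok=True)
    filepath = os.path.join(work_directory, filename)
    if not os.path.exists(filepath):
        # Only hit the network when the file is missing locally.
        urllib.request.urlretrieve(url, filepath)
    return filepath


# Example call, left commented out to avoid a download on import:
# path = maybe_download_sketch(
#     'train-images-idx3-ubyte.gz', '../data/',
#     'https://storage.googleapis.com/cvdf-datasets/mnist/'
#     'train-images-idx3-ubyte.gz')
```

The returned path can then be opened with the built-in `open(path, 'rb')` in place of `gfile.Open`, since `extract_images` and `extract_labels` only need a binary file object with a `name` attribute.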

## End of Data Manipulation Notebook