In [25]:
import numpy as np
import pickle

# Fetch Data

Each of the batch files in the dataset contains a **dictionary** with the following elements:

* **data**: a 10,000 x 3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.<br><br>

* **labels**: a list of 10,000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

The dataset also includes a `batches.meta` file, which contains:

* **label_names**: a 10-element list which gives meaningful names to the numeric labels in the labels array described above. For example, `label_names[0]=="airplane"`, `label_names[1]=="automobile"`, etc.
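
To make the row layout concrete, here is a minimal sketch that turns one 3072-value row back into a 32x32 image with 3 channels. It uses a synthetic row rather than the real batch files, and `row_to_image` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def row_to_image(row):
    """Convert one flat 3072-value row into a (32, 32, 3) height x width x channel array."""
    # The row holds 1024 red, then 1024 green, then 1024 blue values,
    # each channel stored row by row, so reshape to (channel, height, width) first.
    channels_first = np.asarray(row).reshape(3, 32, 32)
    # Move the channel axis last, the layout most plotting libraries expect.
    return channels_first.transpose(1, 2, 0)

# Synthetic row for illustration: red = 10, green = 20, blue = 30 everywhere.
row = np.concatenate([
    np.full(1024, 10, dtype=np.uint8),
    np.full(1024, 20, dtype=np.uint8),
    np.full(1024, 30, dtype=np.uint8),
])
image = row_to_image(row)
print(image.shape)   # (32, 32, 3)
print(image[0, 0])   # [10 20 30]
```

The same reshape applied to the whole `data` array would use `data.reshape(-1, 3, 32, 32)`, which keeps one image per row.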

# Clean Data

Data cleaning is important for minimizing error, especially when it comes to overfitting. Some strategies are:

1. If you have an image in color, convert it to grayscale to lower the dimensionality of the input data, and consequently lower the number of parameters.<br><br>

2. Also, consider center-cropping the image, since edges of an image may not provide useful information.<br><br>

3. The input should also be normalized by subtracting the mean and dividing by the standard deviation of each data sample so that the gradients during back-propagation don't change too dramatically.
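
As a rough illustration of step 3, the sketch below standardizes each sample of a small random array and checks the result. The `normalize_per_sample` name and the `eps` guard against constant images are assumptions added for this example, not part of the original notebook.

```python
import numpy as np

def normalize_per_sample(samples, eps=1e-8):
    """Standardize each row to zero mean and unit standard deviation."""
    samples = np.asarray(samples, dtype=np.float64)
    mean = samples.mean(axis=1, keepdims=True)
    std = samples.std(axis=1, keepdims=True)
    # eps keeps the division finite for a constant image (std == 0).
    return (samples - mean) / np.maximum(std, eps)

rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 576))  # 4 flattened 24x24 images
normalized = normalize_per_sample(fake_images)
print(normalized.mean(axis=1))  # each entry is approximately 0
print(normalized.std(axis=1))   # each entry is approximately 1
```

Combined with steps 1 and 2, each image shrinks from 3072 input values to 24 x 24 = 576 before it is normalized.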

In [30]:
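def clean_data(data):
    
    print("Before resizing: " + str(data.shape))
    
    # reshape each 3072-value row into 3 channels of 32x32 pixels
    all_images = data.reshape(data.shape[0], 3, 32, 32)
    
    print("After resizing: " + str(all_images.shape))
    
    # grayscale the image by averaging the color intensities over the channel axis
    grayscale_images = all_images.mean(1)
    
    # crop the 32x32 image to a 24x24 image
    cropped_images = grayscale_images[:, 4:28, 4:28]
    
    # NOTE: the steps below are an assumed completion, based on step 3 above
    # and the printed shape (50000, 576); flatten each 24x24 image to 576 values
    img_data = cropped_images.reshape(data.shape[0], -1)
    print(img_data.shape)
    
    # normalize each sample: subtract its mean, divide by its standard deviation
    mean = img_data.mean(axis=1, keepdims=True)
    std = img_data.std(axis=1, keepdims=True)
    return (img_data - mean) / np.maximum(std, 1e-8)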

In [27]:
def unpickle(file):
    
    # with encoding='bytes', the dictionary keys come back as bytes (e.g. b"data")
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')
    return batch

In [28]:
def read_data(directory):
    
    meta_path = "{}/batches.meta".format(directory)
    names = unpickle(meta_path)[b"label_names"]

    # collect all batches into a single data matrix and a single label array
    data, labels = [], []

    # 5 = number of batches
    for i in range(1, 6): # iterate through them

        filename = "{}/data_batch_{}".format(directory, i)

        # for each data batch, unpickle it. we get a dictionary back.
        batch_data = unpickle(filename)

        # if there's already content in the data array, append to it
        if len(data) > 0:
            data = np.vstack((data, batch_data[b"data"]))
            labels = np.hstack((labels, batch_data[b"labels"]))
        else:
            data = batch_data[b"data"]
            labels = batch_data[b"labels"]

    return names, data, labels
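
Because the files are loaded with `encoding='bytes'`, the label names come back as `bytes` objects rather than `str`. A minimal sketch of converting them, using a hand-written stand-in for the `label_names` list instead of the real `batches.meta` file:

```python
# Stand-in for unpickle(...)[b"label_names"]; the real list has 10 entries.
raw_names = [b"airplane", b"automobile"]

# Decode each bytes object so labels print without the b'' prefix.
label_names = [name.decode("utf-8") for name in raw_names]
print(label_names)  # ['airplane', 'automobile']
```

Decoding once right after loading keeps later lookups such as `label_names[labels[i]]` readable.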

# Implementation

In [31]:
directory = "./cfar10"
names, data, labels = read_data(directory)
# use a separate name so the clean_data function is not overwritten by its result
cleaned_data = clean_data(data)

Before resizing: (50000, 3072)
After resizing: (50000, 3, 32, 32)
(50000, 576)
576
