## Step 4: Toward a Realistic Data Set - SVHN Pre-Processing

Once you have settled on a good architecture, you can train your model on real data. In particular, the [Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/) dataset is a good large-scale dataset collected from house numbers in Google Street View. Training on this more challenging dataset, where the digits are not neatly lined-up and have various skews, fonts and colors, likely means you have to do some hyperparameter exploration to perform well.

In [1]:
from os import chdir; chdir('..')

### Retrieve the SVHN dataset

Define code and/or use the predefined code in the `lib.retrieval` module in order to download the SVHN dataset if you do not already have a local copy. You can use the urls, http://ufldl.stanford.edu/housenumbers/train.tar.gz, http://ufldl.stanford.edu/housenumbers/test.tar.gz, and http://ufldl.stanford.edu/housenumbers/extra.tar.gz, to download the test, train, and extra datasets, respectively.

#### Data Overview
10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10.

|      set | digits | expected bytes |
|:--------:|:------:|:--------------:|
| training |  73257 |      404141560 |
| testing  |  26032 |      276555967 |
| extra    | 531131 |     1955489752 | 

### Prepare the SVHN dataset

Define code in the `lib.preprocess` module that will prepare a data dictionary containing your shuffled and split SVHN dataset as well as do any additional preprocessing you find necessary.

In [2]:
from lib.preprocess import download_and_extract_svhn_datasets, svhn_meta_from_mat, prepare_svhn_dataset

Using TensorFlow backend.


In [3]:
download_and_extract_svhn_datasets()

train already present - Skipping extraction of train.tar.gz.
[]
test already present - Skipping extraction of test.tar.gz.
[]


In [4]:
from lib.data import pickle_data_dictionary

In [5]:
# test_images_info = svhn_meta_from_mat('test/digitStruct.mat')
# train_images_info = svhn_meta_from_mat('train/digitStruct.mat')
test_images_info = svhn_meta_from_mat('d-trn.mat')
train_images_info = svhn_meta_from_mat('d-tst.mat')

In [13]:
import numpy as np
import os
from scipy import misc
def prepare_svhn_dataset(path, info):
    images_list = os.listdir(path)
    #images_list.remove('digitStruct.mat')
    #images_list.remove('see_bboxes.m')
    png = []
    labels = []
    
    for i, image in enumerate(images_list):
        read_image = misc.imread(path+image)
        image_np = np.asarray(read_image)
        image_np = np.concatenate(np.concatenate(image_np))
        png.append(image_np)   
        
        image_num = int(image[:-4]) - 1
        image_label = info[image_num]['labels']
        image_length = info[image_num]['length']
        if image_length >5:
            image_label = image_label[:5]
        for i in range(image_length, 5):
            image_label = np.append(image_label,0)
        image_label = np.append(image_label,image_length)
        labels.append(image_label)
        
    #images = np.asarray(png,dtype=np.uint16)
    labels =  np.asarray(labels)
    return png, labels

In [15]:
test_set[0]

array([121, 131, 156, ..., 236, 236, 236], dtype=uint8)

In [14]:
test_set, test_labels = prepare_svhn_dataset('test/', test_images_info)
train_set, train_labels = prepare_svhn_dataset('train/', train_images_info)

IndexError: list index out of range

In [None]:
len(test_labels), len(train_labels)

In [None]:
data_dictionary = {
    'test_set' : test_set,
    'train_set' : train_set,
    'test_labels' : test_labels,
    'train_labels' : train_labels
}

In [None]:
pickle_data_dictionary(data_dictionary, 'data/svhn.pickle')

### Prepare a few plots of samples of your Synthetic Dataset

### Pickle your preprocessed SVHN dataset