# Street View House Numbers

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images. 

These are the original, variable-resolution, color house-number images with character-level bounding boxes, as shown in the example images above. (The blue bounding boxes are for illustration only; the bounding box information is stored in digitStruct.mat rather than drawn directly on the images in the dataset.) Each tar.gz file contains the original images in PNG format, together with a digitStruct.mat file, which can be loaded using MATLAB. The digitStruct.mat file contains a struct called digitStruct with the same length as the number of original images. Each element in digitStruct has the following fields:

* name: a string containing the filename of the corresponding image.
* bbox: a struct array that contains the position, size and label of each digit bounding box in the image.

For example, digitStruct(300).bbox(2).height gives the height of the 2nd digit bounding box in the 300th image.

* Dataset link:
http://ufldl.stanford.edu/housenumbers/

Objective: to correctly detect a series of numbers in an image of house numbers by training a convolutional neural network with multiple layers.


## Data Pre Processing

### 1. Import Libraries

A future statement is intended to ease migration to future versions of Python that introduce incompatible changes to the language. It allows use of the new features on a per-module basis before the release in which the feature becomes standard.
`__future__` is a pseudo-module which programmers can use to enable new language features which are not compatible with the current interpreter. For example, in Python 2 the expression `11/4` evaluates to `2`. If the module in which it is executed had enabled true division by executing

`from __future__ import division`

the expression `11/4` would evaluate to `2.75`. Likewise, `from __future__ import print_function` turns `print` into a function in Python 2. In Python 3 both behaviours are the default, so neither import is needed in this notebook.
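
As a quick illustration of the division behaviour described above, here is a minimal sketch that assumes a Python 3 interpreter, where true division is already the default and `//` gives the floor division that Python 2 applied to integers.

```python
# Python 3: "/" is true division, "//" is floor division.
true_quotient = 11 / 4
floor_quotient = 11 // 4

print(true_quotient)   # 2.75
print(floor_quotient)  # 2
```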


* six
Six is a Python 2 and 3 compatibility library. It provides utility functions for smoothing over the differences between the Python versions with the goal of writing Python code that is compatible on both Python versions. 

In [13]:
from urllib.request import urlretrieve
import pickle
from IPython.display import display, Image
from scipy import ndimage
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import h5py

In [3]:
tf.__version__

'1.6.0'

In [9]:
train_folder = '/train'
test_folder ='/test'
extra_folder = '/extra'

In [10]:
pwd

'F:\\Aegis\\Python\\Jupyter Notebook\\Projects\\SVHM'

### 2. Create a dictionary of bounding box information


* h5 file
An H5 file is a data file saved in the Hierarchical Data Format (HDF). It contains multidimensional arrays of scientific data. ... Two commonly used versions of HDF include HDF4 and HDF5 

* HDF5 for Python
The h5py package is a Pythonic interface to the HDF5 binary data format.

It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.

H5py uses straightforward NumPy and Python metaphors, like dictionary and NumPy array syntax.

MATLAB .mat files saved in the v7.3 format are HDF5 files, which is why digitStruct.mat can be opened with h5py.
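
One way to check whether a given .mat file is HDF5-based before opening it with h5py is to look for the HDF5 signature. The sketch below uses a hypothetical helper (`looks_like_hdf5` and its `max_offset` parameter are illustrative names, not part of h5py). It relies on the HDF5 format placing its 8-byte signature at offset 0 or at 512, 1024, 2048, ... bytes, which leaves room for a leading header block. The two temporary files are made up for the demonstration.

```python
import os
import tempfile

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(path, max_offset=4096):
    """Return True if the HDF5 signature is found at offset 0, 512, 1024, ..."""
    with open(path, "rb") as f:
        offset = 0
        while offset <= max_offset:
            f.seek(offset)
            if f.read(8) == HDF5_SIGNATURE:
                return True
            # Candidate offsets are 0, then 512 and its doublings
            offset = 512 if offset == 0 else offset * 2
    return False

with tempfile.TemporaryDirectory() as tmp:
    # Made-up file with a 512-byte header followed by the HDF5 signature
    hdf5_like = os.path.join(tmp, "hdf5_like.mat")
    with open(hdf5_like, "wb") as f:
        f.write(b" " * 512 + HDF5_SIGNATURE)
    # Made-up file without any HDF5 signature
    plain = os.path.join(tmp, "plain.mat")
    with open(plain, "wb") as f:
        f.write(b"not an HDF5 file" + b"\0" * 1000)
    found_in_hdf5_like = looks_like_hdf5(hdf5_like)
    found_in_plain = looks_like_hdf5(plain)

print(found_in_hdf5_like, found_in_plain)
```

When h5py is installed, `h5py.is_hdf5(path)` serves the same purpose.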

In [34]:
import numpy as np
import h5py
struct_file = './train/digitStruct.mat'
file = h5py.File(struct_file,'r')
data = file['/digitStruct/name'][0].item()  # Dataset.value was removed in h5py 3; index the dataset directly
d2 = np.array(data)
d2

array(<HDF5 object reference>, dtype=object)

In [51]:
import numpy as np
import h5py

with h5py.File('./train/digitStruct.mat','r') as hdf:
    ls = list(hdf.keys())
    print(" List of dataset in this file is ",ls)
    data = hdf.get('/digitStruct/name')
    data1 = hdf.get('/digitStruct/bbox')
    d1 = np.array(data)
    d2 = np.array(data1)
    print(" Shape of file is ",d1.shape)
    print(" Shape of file is ",d2.shape)

 List of dataset in this file is  ['#refs#', 'digitStruct']
 Shape of file is  (33402, 1)
 Shape of file is  (33402, 1)


In [55]:
import h5py
struct_file = './train/digitStruct.mat'
file = h5py.File(struct_file, 'r')
name = file['/digitStruct/name'][1][0] # <HDF5 object reference>
bbox = file['/digitStruct/bbox'][1][0]
print(file[name][()])  # [()] reads the whole dataset (replaces the removed .value)
print(file[bbox])

[[ 50]
 [ 46]
 [112]
 [110]
 [103]]
<HDF5 group "/#refs#/7Qi" (5 members)>
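
The column of integers printed above holds character codes, one per row. As a minimal sketch (the helper name `decode_name` is an illustrative choice), the codes can be joined back into the file name like this:

```python
def decode_name(char_codes):
    """Join a column of character codes (one code per row) into a string."""
    return "".join(chr(int(row[0])) for row in char_codes)

codes = [[50], [46], [112], [110], [103]]  # the values printed above
filename = decode_name(codes)
print(filename)  # 2.png
```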


In [66]:
class DigitStructFile:
    '''Wraps the digitStruct.mat file and keeps the following attributes:
    file:             The input h5 MATLAB file
    digitStructName:  The h5 reference to all the file names
    digitStructBox:   The h5 reference to all struct data'''
    def __init__(self,file):
        self.file = h5py.File(file, 'r') #Create a new file object.
        self.digitStructName = self.file['digitStruct']['name']
        self.digitStructBox = self.file['digitStruct']['bbox']
        
    def getName(self,n):
        '''Returns name string for the nth digitStruct '''
        return ''.join([chr(c[0]) for c in self.file[self.digitStructName[n][0]][()]])
    
    def bboxHelper(self,attr):
        if len(attr) > 1:
            # Several digits: each entry is a reference to a 1x1 dataset
            attr = [self.file[attr[j].item()][0][0] for j in range(len(attr))]
        else:
            # Single digit: the value is stored inline
            attr = [attr[0][0]]
        return attr
    
    def getBbox(self,n):
        '''Returns a dict of data for the n(th) bbox '''
        bbox ={}
        bb = self.digitStructBox[n].item()    
        bbox['height'] = self.bboxHelper(self.file[bb]['height'])
        bbox['label'] = self.bboxHelper(self.file[bb]['label'])
        bbox['left'] = self.bboxHelper(self.file[bb]['left'])
        bbox['top'] = self.bboxHelper(self.file[bb]['top'])
        bbox['width'] = self.bboxHelper(self.file[bb]['width'])
        return bbox
    
    def getDigitStruct(self,n):
        '''Returns the structure of the digitStruct'''
        struct = self.getBbox(n)
        struct['name'] = self.getName(n)
        return struct
    
    def getAllDigitStruct(self):
        '''Returns all the digitStruct from the input file'''
        return [self.getDigitStruct(i) for i in range(len(self.digitStructName))]
    
    def getAllDigitStruct_Digit(self):
        digit_pic = self.getAllDigitStruct()
        result=[]
        structCnt = 1
        for i in range(len(digit_pic)):
            item = {'filename':digit_pic[i]['name']}
            figures=[]
            for j in range(len(digit_pic[i]['height'])):
                figure = {}
                figure['height'] = digit_pic[i]['height'][j]
                figure['label'] = digit_pic[i]['label'][j]
                figure['left'] = digit_pic[i]['left'][j]
                figure['top'] = digit_pic[i]['top'][j]
                figure['width'] = digit_pic[i]['width'][j]
                figures.append(figure)
            structCnt+=1
            item['boxes'] = figures
            result.append(item)
        return result

In [67]:
import datetime

digitStructFileTrain = DigitStructFile(os.path.join('train','digitStruct.mat'))
digitStructFileTest = DigitStructFile(os.path.join('test','digitStruct.mat'))
digitStructFileExtra = DigitStructFile(os.path.join('extra','digitStruct.mat'))

print("Start of Processing Data ",datetime.datetime.now())
train_data = digitStructFileTrain.getAllDigitStruct_Digit()
test_data = digitStructFileTest.getAllDigitStruct_Digit()
extra_data = digitStructFileExtra.getAllDigitStruct_Digit()

print("End of Processing Data ",datetime.datetime.now())

Start of Processing Data  2019-04-09 13:58:53.670203
End of Processing Data  2019-04-09 14:42:52.198866


In [70]:
# Note: train_data is a list of dictionaries
train_data[0]

{'boxes': [{'height': 219.0,
   'label': 1.0,
   'left': 246.0,
   'top': 77.0,
   'width': 81.0},
  {'height': 219.0, 'label': 9.0, 'left': 323.0, 'top': 81.0, 'width': 96.0}],
 'filename': '1.png'}

In [75]:
train_data[24] # This image shows 601. Note that the digit 0 is stored with label 10

{'boxes': [{'height': 50.0,
   'label': 6.0,
   'left': 60.0,
   'top': 11.0,
   'width': 24.0},
  {'height': 50.0, 'label': 10.0, 'left': 87.0, 'top': 9.0, 'width': 24.0},
  {'height': 50.0, 'label': 1.0, 'left': 113.0, 'top': 7.0, 'width': 21.0}],
 'filename': '25.png'}
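
To read a record like this as the house number it represents, label 10 has to be mapped back to the digit 0. A minimal sketch, assuming the boxes are stored left to right as in the record above (the helper name `boxes_to_number` is illustrative):

```python
def boxes_to_number(boxes):
    """Turn the per-digit labels of one image into the house number string."""
    digits = []
    for box in boxes:
        label = int(box["label"])
        # SVHN stores the digit 0 as label 10
        digits.append("0" if label == 10 else str(label))
    return "".join(digits)

record = {
    "filename": "25.png",
    "boxes": [
        {"height": 50.0, "label": 6.0, "left": 60.0, "top": 11.0, "width": 24.0},
        {"height": 50.0, "label": 10.0, "left": 87.0, "top": 9.0, "width": 24.0},
        {"height": 50.0, "label": 1.0, "left": 113.0, "top": 7.0, "width": 21.0},
    ],
}
number = boxes_to_number(record["boxes"])
print(number)  # 601
```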

In [76]:
train_data[21] # This image shows 515. Note that the boxes share the same height and nearly the same top, as expected for digits on one line

{'boxes': [{'height': 16.0,
   'label': 5.0,
   'left': 24.0,
   'top': 4.0,
   'width': 10.0},
  {'height': 16.0, 'label': 1.0, 'left': 34.0, 'top': 5.0, 'width': 7.0},
  {'height': 16.0, 'label': 5.0, 'left': 40.0, 'top': 5.0, 'width': 11.0}],
 'filename': '22.png'}

In [84]:
train_data[21]['boxes']

[{'height': 16.0, 'label': 5.0, 'left': 24.0, 'top': 4.0, 'width': 10.0},
 {'height': 16.0, 'label': 1.0, 'left': 34.0, 'top': 5.0, 'width': 7.0},
 {'height': 16.0, 'label': 5.0, 'left': 40.0, 'top': 5.0, 'width': 11.0}]

In [83]:
train_data[21]['boxes'][0]['label']

5.0

In [91]:
train_data[21]['filename']

'22.png'
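
Because each record carries one box per digit, the number of digits per image can be tallied directly from this structure. The sketch below is illustrative only: `digit_count_histogram` is a hypothetical helper, and the sample holds just the three records inspected above, with the box fields trimmed to the label.

```python
from collections import Counter

def digit_count_histogram(records):
    """Count how many images contain 1, 2, 3, ... digits."""
    return Counter(len(record["boxes"]) for record in records)

sample = [
    {"filename": "1.png", "boxes": [{"label": 1.0}, {"label": 9.0}]},
    {"filename": "25.png", "boxes": [{"label": 6.0}, {"label": 10.0}, {"label": 1.0}]},
    {"filename": "22.png", "boxes": [{"label": 5.0}, {"label": 1.0}, {"label": 5.0}]},
]
hist = digit_count_histogram(sample)
print(sorted(hist.items()))  # [(2, 1), (3, 2)]
```

Run on the full `train_data` list, the same tally would show how many images exceed five digits.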

### Crop images using bounding box information

In [88]:
pwd

'F:\\Aegis\\Python\\Jupyter Notebook\\Projects\\SVHM'

In [115]:
from PIL import Image

#Creating a numpy array with rows = number of images and 2 columns
train_imsize = np.ndarray([len(train_data),2])
#print("Sample ",train_imsize[0:5,:])

for i in range(len(train_data)):
    filename = train_data[i]['filename']
    #print(filename)
    fullname = os.path.join('train',filename)
    im = Image.open(fullname)
    #im.show()
    train_imsize[i,:] = im.size[:]
    #if i == 10:
    #    break

#print(train_imsize[0:5,:])
print(np.amax(train_imsize[:,0]), np.amax(train_imsize[:,1]))
print(np.amin(train_imsize[:,0]), np.amin(train_imsize[:,1]))

876.0 501.0
25.0 12.0


In [111]:
a = np.ndarray([2,3])
a
l = im.size[:]
print(l)
im.show()

(63, 33)


In [113]:
test_imsize = np.ndarray([len(test_data),2])
for i in np.arange(len(test_data)):
    filename = test_data[i]['filename']
    fullname = os.path.join('test', filename)
    im = Image.open(fullname)
    test_imsize[i, :] = im.size[:]

print(np.amax(test_imsize[:,0]), np.amax(test_imsize[:,1]))
print(np.amin(test_imsize[:,0]), np.amin(test_imsize[:,1]))

1083.0 516.0
31.0 13.0


In [114]:
extra_imsize = np.ndarray([len(extra_data),2])
for i in np.arange(len(extra_data)):
    filename = extra_data[i]['filename']
    fullname = os.path.join('extra', filename)
    im = Image.open(fullname)
    extra_imsize[i, :] = im.size[:]

print(np.amax(extra_imsize[:,0]), np.amax(extra_imsize[:,1]))
print(np.amin(extra_imsize[:,0]), np.amin(extra_imsize[:,1]))

668.0 415.0
22.0 13.0


### Distribute extra image dataset and remove images with more than 5 digits

In [117]:
import PIL.Image as Image

def generate_dataset(data, folder):

    dataset = np.ndarray([len(data),32,32,1], dtype='float32')
    labels = np.ones([len(data),6], dtype=int) * 10
    for i in np.arange(len(data)):
        filename = data[i]['filename']
        fullname = os.path.join(folder, filename)
        im = Image.open(fullname)
        boxes = data[i]['boxes'] #returns list of dictionaries of number of digits
        num_digit = len(boxes) 
        labels[i,0] = num_digit
        # Per-digit box coordinates for this image
        top = np.ndarray([num_digit], dtype='float32')
        left = np.ndarray([num_digit], dtype='float32')
        height = np.ndarray([num_digit], dtype='float32')
        width = np.ndarray([num_digit], dtype='float32')
        for j in np.arange(num_digit):
            if j < 5: 
                labels[i,j+1] = boxes[j]['label']
                if boxes[j]['label'] == 10:
                    labels[i,j+1] = 0
            else:
                print('Image number ',i,' has more than 5 digits')
            # Store top, left, height and width of every digit box in the image
            top[j] = boxes[j]['top']
            left[j] = boxes[j]['left']
            height[j] = boxes[j]['height']
            width[j] = boxes[j]['width']
        
        im_top = np.amin(top)
        im_left = np.amin(left)
        im_height = np.amax(top) + height[np.argmax(top)] - im_top
        im_width = np.amax(left) + width[np.argmax(left)] - im_left
        
        im_top = np.floor(im_top - 0.1 * im_height)
        im_left = np.floor(im_left - 0.1 * im_width)
        im_bottom = np.amin([np.ceil(im_top + 1.2 * im_height), im.size[1]])
        im_right = np.amin([np.ceil(im_left + 1.2 * im_width), im.size[0]])

        # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
        im = im.crop((im_left, im_top, im_right, im_bottom)).resize([32,32], Image.LANCZOS)
        im = np.dot(np.array(im, dtype='float32'), [[0.2989],[0.5870],[0.1140]])
        mean = np.mean(im, dtype='float32')
        std = np.std(im, dtype='float32', ddof=1)
        if std < 1e-4: std = 1.
        im = (im - mean) / std
        dataset[i,:,:,:] = im[:,:,:]

    return dataset, labels

train_dataset, train_labels = generate_dataset(train_data, 'train')
print(train_dataset.shape, train_labels.shape)

test_dataset, test_labels = generate_dataset(test_data, 'test')
print(test_dataset.shape, test_labels.shape)

extra_dataset, extra_labels = generate_dataset(extra_data, 'extra')
print(extra_dataset.shape, extra_labels.shape)

# 29929 image has more than 5 digits.
(33402, 32, 32, 1) (33402, 6)
(13068, 32, 32, 1) (13068, 6)
(202353, 32, 32, 1) (202353, 6)
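
The crop rectangle computed inside `generate_dataset` is the region spanned by the digit boxes, expanded by 10% on each side and clipped to the image at the bottom and right. The following standalone sketch mirrors that arithmetic in plain Python for the boxes of 22.png shown earlier. The helper name `crop_box` and the image size `(70, 30)` are assumptions made for illustration; the real size of 22.png is not given here.

```python
import math

def crop_box(boxes, image_size, margin=0.1):
    """Return (left, top, right, bottom) around all digit boxes, padded by margin."""
    tops = [b["top"] for b in boxes]
    lefts = [b["left"] for b in boxes]
    lowest = max(range(len(boxes)), key=lambda k: tops[k])      # box that starts lowest
    rightmost = max(range(len(boxes)), key=lambda k: lefts[k])  # box that starts furthest right
    top, left = min(tops), min(lefts)
    height = tops[lowest] + boxes[lowest]["height"] - top
    width = lefts[rightmost] + boxes[rightmost]["width"] - left
    # Expand by the margin on each side, then clip bottom and right to the image
    top = math.floor(top - margin * height)
    left = math.floor(left - margin * width)
    bottom = min(math.ceil(top + (1 + 2 * margin) * height), image_size[1])
    right = min(math.ceil(left + (1 + 2 * margin) * width), image_size[0])
    return left, top, right, bottom

boxes = [
    {"height": 16.0, "label": 5.0, "left": 24.0, "top": 4.0, "width": 10.0},
    {"height": 16.0, "label": 1.0, "left": 34.0, "top": 5.0, "width": 7.0},
    {"height": 16.0, "label": 5.0, "left": 40.0, "top": 5.0, "width": 11.0},
]
box = crop_box(boxes, image_size=(70, 30))  # (width, height), hypothetical
print(box)  # (21, 2, 54, 23)
```

As in the notebook code, the top and left edges are not clipped, so they can become negative for digits near the image border.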


In [120]:
t = [2,3,4,5]
np.argmax(t)+np.max(t)

8

In [None]:
#len(train_dataset)
train_dataset[0]
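
Each entry of `train_dataset` is a grayscale image that has been standardised to zero mean and unit sample standard deviation. The sketch below reproduces the two steps in plain Python on a tiny made-up list of RGB pixels. It uses the same luminance weights and the same fallback for near-constant images as `generate_dataset`. The helper names `to_gray` and `standardize` are illustrative.

```python
import statistics

def to_gray(pixel):
    """Weighted sum of R, G, B with the weights used in generate_dataset."""
    r, g, b = pixel
    return 0.2989 * r + 0.5870 * g + 0.1140 * b

def standardize(values):
    """Subtract the mean and divide by the sample standard deviation (ddof=1)."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    if std < 1e-4:  # near-constant image: avoid dividing by about zero
        std = 1.0
    return [(v - mean) / std for v in values]

pixels = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (128, 128, 128)]  # made-up data
gray = [to_gray(p) for p in pixels]
normalized = standardize(gray)
print(normalized)
```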

In [123]:
train_dataset = np.delete(train_dataset, 29929, axis=0)
train_labels = np.delete(train_labels, 29929, axis=0)

print(train_dataset.shape, train_labels.shape)

(33401, 32, 32, 1) (33401, 6)
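
The index 29929 above is hard-coded from the message printed during preprocessing. Since the first label column stores the digit count, the affected rows can also be located programmatically. A minimal sketch in plain Python follows. The helper names are illustrative and the six-digit row is made up. The label layout `[count, d1, ..., d5]` with 10 as the blank follows `generate_dataset`.

```python
def rows_with_too_many_digits(labels, max_digits=5):
    """Return indices of rows whose first column (digit count) exceeds max_digits."""
    return [i for i, row in enumerate(labels) if row[0] > max_digits]

def drop_rows(rows, indices):
    """Return the rows whose index is not listed in indices."""
    skip = set(indices)
    return [row for i, row in enumerate(rows) if i not in skip]

labels = [
    [2, 1, 9, 10, 10, 10],  # the number 19
    [6, 1, 3, 5, 4, 5],     # hypothetical six-digit number, sixth digit not stored
    [3, 6, 0, 1, 10, 10],   # the number 601
]
too_long = rows_with_too_many_digits(labels)
kept = drop_rows(labels, too_long)
print(too_long, len(kept))  # [1] 2
```

With NumPy arrays the same indices can be obtained with `np.flatnonzero(train_labels[:, 0] > 5)` and passed to `np.delete`.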


In [124]:
import random

random.seed()

n_labels = 10
valid_index = []
valid_index2 = []
train_index = []
train_index2 = []
for i in np.arange(n_labels):
    valid_index.extend(np.where(train_labels[:,1] == (i))[0][:400].tolist())
    train_index.extend(np.where(train_labels[:,1] == (i))[0][400:].tolist())
    valid_index2.extend(np.where(extra_labels[:,1] == (i))[0][:200].tolist())
    train_index2.extend(np.where(extra_labels[:,1] == (i))[0][200:].tolist())

random.shuffle(valid_index)
random.shuffle(train_index)
random.shuffle(valid_index2)
random.shuffle(train_index2)

valid_dataset = np.concatenate((extra_dataset[valid_index2,:,:,:], train_dataset[valid_index,:,:,:]), axis=0)
valid_labels = np.concatenate((extra_labels[valid_index2,:], train_labels[valid_index,:]), axis=0)
train_dataset_t = np.concatenate((extra_dataset[train_index2,:,:,:], train_dataset[train_index,:,:,:]), axis=0)
train_labels_t = np.concatenate((extra_labels[train_index2,:], train_labels[train_index,:]), axis=0)

print(train_dataset_t.shape, train_labels_t.shape)
print(test_dataset.shape, test_labels.shape)
print(valid_dataset.shape, valid_labels.shape)

(230070, 32, 32, 1) (230070, 6)
(13068, 32, 32, 1) (13068, 6)
(5684, 32, 32, 1) (5684, 6)
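
The split above holds out a fixed number of images per first-digit class and shuffles the resulting index lists. The sketch below shows the same idea on plain Python lists, with a fixed seed so the shuffle is repeatable. The fixed seed is a swapped-in choice, since the notebook calls `random.seed()` without an argument. The function name and the toy input are illustrative.

```python
import random

def split_by_first_digit(first_digits, per_class, n_classes=10, seed=0):
    """Hold out the first per_class samples of each first-digit class for validation."""
    valid, train = [], []
    for digit in range(n_classes):
        idx = [i for i, d in enumerate(first_digits) if d == digit]
        valid.extend(idx[:per_class])
        train.extend(idx[per_class:])
    rng = random.Random(seed)  # fixed seed so the shuffle is repeatable
    rng.shuffle(valid)
    rng.shuffle(train)
    return valid, train

first_digits = [1, 2, 1, 3, 2, 1, 3, 3, 1]  # toy first-digit labels
valid_idx, train_idx = split_by_first_digit(first_digits, per_class=1)
print(sorted(valid_idx), sorted(train_idx))  # [0, 1, 3] [2, 4, 5, 6, 7, 8]
```

A class with fewer samples than `per_class` contributes only what it has, which is presumably why the validation set above holds 5684 images rather than 10 x (400 + 200) = 6000.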


In [126]:
"""Create a pickle file to store processed data"""
pickle_file = 'SVHN_multi.pickle'

try:
    dict_data = {
        'train_dataset': train_dataset_t,
        'train_labels': train_labels_t,
        'valid_dataset': valid_dataset,
        'valid_labels': valid_labels,
        'test_dataset': test_dataset,
        'test_labels': test_labels,
    }
    # The with-block closes the file even if pickle.dump raises
    with open(pickle_file, 'wb') as f:
        pickle.dump(dict_data, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    
statinfo = os.stat(pickle_file)
print('Pickle size in bytes:', statinfo.st_size)

Pickle size in bytes: 1025147176
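
To use the saved data later, the pickle has to be read back in binary mode. A minimal round-trip sketch follows, using a small made-up dictionary and a temporary file instead of the roughly 1 GB `SVHN_multi.pickle`. The helper names and `demo.pickle` are illustrative.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path with the highest available pickle protocol."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pickle")  # hypothetical file name
    original = {"train_labels": [[2, 1, 9, 10, 10, 10]], "test_labels": []}
    save_pickle(original, path)
    size_bytes = os.stat(path).st_size
    restored = load_pickle(path)

print(size_bytes, restored == original)
```

Only pickle files from a trusted source should be loaded, since unpickling can execute arbitrary code.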
