Deep Learning Final Project
=============

# §0 Data Download and Extraction
------------

[SVHN](http://ufldl.stanford.edu/housenumbers/) is a real-world image dataset for developing machine learning and object recognition algorithms with minimal data preprocessing and formatting requirements. It is similar in flavor to MNIST (e.g., the images are of small cropped digits), but it incorporates an order of magnitude more labelled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem: recognizing digits and numbers in natural scene images. SVHN is obtained from house numbers in Google Street View images.

Overview

* 10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10.
* 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data
* Comes in two formats:
   1. Original images with character level bounding boxes.
   2. MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).
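
Because the digit '0' is stored with label 10, code that expects class indices 0 to 9 usually needs a remapping step. The following is a minimal sketch of that step, assuming the labels are available as plain Python integers; `remap_labels` is a hypothetical helper name and not part of the SVHN tooling.

```python
def remap_labels(labels):
    """Map SVHN labels (1..10, where 10 means digit '0') to digits 0..9."""
    remapped = []
    for label in labels:
        # Reject anything outside the label range described in the overview.
        if not 1 <= label <= 10:
            raise ValueError("unexpected SVHN label: %r" % (label,))
        # 1..9 stay unchanged, 10 becomes 0.
        remapped.append(label % 10)
    return remapped


print(remap_labels([1, 9, 10]))
```

The modulo keeps labels 1 to 9 as they are and turns 10 into 0, so the remapped label equals the digit it represents.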

## 0.1 Import Modules

In [1]:
from __future__ import print_function
import os
import sys
import tarfile
from six.moves.urllib.request import urlretrieve

## 0.2 Download

In [2]:
url = 'http://ufldl.stanford.edu/housenumbers/'
last_percent_reported = None

def download_progress_hook(count, blockSize, totalSize):
    '''A hook to report the progress of a download.

    Prints the percentage at every multiple of 10% and a dot for every
    other 1% change in download progress.
    '''
    global last_percent_reported
    percent = int(count * blockSize * 100 / totalSize)

    if last_percent_reported != percent:
        if percent % 10 == 0:
            sys.stdout.write("%s%%" % percent)
            sys.stdout.flush()
        else:
            sys.stdout.write(".")
            sys.stdout.flush()

    last_percent_reported = percent

def maybe_download(folder, filename, expected_bytes, force=False):
    '''Download a file if not present, and make sure it's the right size.'''
    # Create the target directory if it does not exist yet.
    if not os.path.exists(folder):
        os.makedirs(folder)
    filepath = os.path.join(folder, filename)
    if force or not os.path.exists(filepath):
        print('Attempting to download:', filename)
        urlretrieve(url + filename, filepath, reporthook=download_progress_hook)
        print('\nDownload Complete!')
    # Verify the file by comparing its size on disk with the expected size.
    statinfo = os.stat(filepath)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        raise Exception('Failed to verify %s, expected size = %d, actual size = %d'
                        % (filename, expected_bytes, statinfo.st_size))
    return filepath
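
`urlretrieve` calls its report hook with the number of blocks transferred so far, the block size, and the total file size. Since the last block is usually only partly filled, `count * blockSize` can exceed the total, and the total may be reported as -1 when the server does not send a file size. The sketch below shows one way to compute a bounded percentage under those conditions; `percent_done` is a hypothetical helper and is not used by the cells above.

```python
def percent_done(count, block_size, total_size):
    """Return download progress as an integer in 0..100, or None if unknown."""
    # A non-positive total means the server did not report the file size.
    if total_size <= 0:
        return None
    # Clamp to 100 because the final block may overshoot the real size.
    return min(100, count * block_size * 100 // total_size)


# Simulate the hook calls for a 30000-byte file fetched in 8192-byte blocks.
for count in range(5):
    print(count, percent_done(count, 8192, 30000))
```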

In [3]:
### Download the 32 x 32 MNIST-like single digit images
train32_filename = maybe_download('data/32x32/','train_32x32.mat', 182040794)
test32_filename  = maybe_download('data/32x32/','test_32x32.mat', 64275384)
extra32_filename = maybe_download('data/32x32/','extra_32x32.mat', 1329278602)

### Download the original images with full numbers
train_filename = maybe_download('data/full/','train.tar.gz', 404141560)
test_filename  = maybe_download('data/full/','test.tar.gz', 276555967)
extra_filename = maybe_download('data/full/','extra.tar.gz', 1955489752)

Found and verified train_32x32.mat
Found and verified test_32x32.mat
Found and verified extra_32x32.mat
Found and verified train.tar.gz
Found and verified test.tar.gz
Found and verified extra.tar.gz
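
With the archives verified, the full-number images still have to be unpacked. The following is a minimal sketch of that extraction step using the `tarfile` module imported earlier. It assumes each archive unpacks into a folder named after the archive (for example `train.tar.gz` into `train/`), which should be checked against the actual archive contents; `maybe_extract` is a hypothetical helper name.

```python
import os
import tarfile


def maybe_extract(archive_path, force=False):
    """Extract a .tar.gz archive next to itself unless already extracted."""
    suffix = '.tar.gz'
    if not archive_path.endswith(suffix):
        raise ValueError('expected a .tar.gz archive, got %s' % archive_path)
    # Assumption: the archive contains a single top-level folder named
    # after the archive itself, e.g. data/full/train.tar.gz -> data/full/train
    root = archive_path[:-len(suffix)]
    target_dir = os.path.dirname(archive_path) or '.'
    if os.path.isdir(root) and not force:
        print('%s already present, skipping extraction' % root)
    else:
        print('Extracting', archive_path)
        # Only extract archives from a trusted source.
        with tarfile.open(archive_path, 'r:gz') as tar:
            tar.extractall(target_dir)
    return root
```

Under that assumption, a call such as `train_folder = maybe_extract(train_filename)` would return `data/full/train`, and repeated calls would skip the extraction unless `force=True` is passed.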
