# Assignment 5

**Submission deadline: last lab session before or on Tuesday, 09.5.17**

**Points: 9 + 10 bonus points**


## Downloading this notebook

This assignment is a Jupyter notebook. Download it by cloning https://github.com/janchorowski/nn_assignments and follow the instructions in its README.

For programming exercises, add your solutions to the notebook. For math exercises, please provide us with answers on paper or type them in the notebook (it supports LaTeX-like equations).

Please do not hesitate to use GitHub’s pull requests to send us corrections!

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


# Modular network implementation

This assignment builds on code from Assignment 4, Problem 7.
For your convenience, we have copied the code below. Please copy your solution from the previous assignment, or fill in the blanks below to get a working network.

The following cells implement a feedforward neural network in a modular way. Please study the code: many network implementations follow a similar pattern.

Please make sure that the network trains to nearly 100% accuracy on Iris.
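
When filling in the `bprop` and `get_gradients` blanks, a finite-difference check can catch mistakes before any training is attempted. The sketch below is not part of the assignment code: `numerical_gradient` is a hypothetical helper, and the quadratic loss is chosen only because its gradient with respect to the layer output is easy to write down. It compares the affine-layer formulas used in this notebook (`dLdW = dLdY X^T`, `dLdb = sum of dLdY over the batch`) against central differences in float64.

```python
import numpy as np

def numerical_gradient(f, P, eps=1e-5):
    """Central-difference estimate of df/dP for a scalar function f.

    f takes no arguments and reads P from the enclosing scope;
    P is perturbed in place and restored after each evaluation.
    """
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        old = P[idx]
        P[idx] = old + eps
        f_plus = f()
        P[idx] = old - eps
        f_minus = f()
        P[idx] = old
        G[idx] = (f_plus - f_minus) / (2 * eps)
    return G

# Toy affine layer Y = W X + b with loss L = 0.5 * sum(Y^2), so dL/dY = Y.
rng = np.random.RandomState(0)
W = rng.randn(3, 4)
b = rng.randn(3, 1)
X = rng.randn(4, 5)

def loss():
    return 0.5 * ((W.dot(X) + b) ** 2).sum()

dLdY = W.dot(X) + b
analytic_dW = dLdY.dot(X.T)                 # same formula as AffineLayer.get_gradients
analytic_db = dLdY.sum(1, keepdims=True)

max_err_W = np.abs(analytic_dW - numerical_gradient(loss, W)).max()
max_err_b = np.abs(analytic_db - numerical_gradient(loss, b)).max()
```

Both error values should be tiny compared with the gradient entries themselves. The check is best run in float64, since float32 rounding can mask small differences.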

## Task

Your job is to implement SGD training on MNIST with the following elements:
1. SGD + momentum
2. weight decay
3. early stopping

Overall, you should reach a **test error rate below 2%**.
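
Early stopping can be organized around a "patience" rule: keep the best parameters seen so far and extend the epoch budget whenever the validation error improves. The sketch below shows only that control flow. The callbacks `train_epoch`, `validate`, `get_params` and `set_params` are hypothetical stand-ins for the network and data streams, and the default constants are assumptions rather than recommended values.

```python
import copy

def train_with_early_stopping(train_epoch, validate, get_params, set_params,
                              min_epochs=3, patience_expansion=1.5):
    """Train until the epoch budget runs out; improvements extend the budget.

    Returns (best_epoch, best_error) and restores the best parameters.
    """
    best_error = float("inf")
    best_params = copy.deepcopy(get_params())
    best_epoch = 0
    max_epochs = min_epochs
    epoch = 0
    while epoch < max_epochs:
        epoch += 1
        train_epoch()
        error = validate()
        if error < best_error:
            # New best model: remember it and allow more epochs
            best_error = error
            best_params = copy.deepcopy(get_params())
            best_epoch = epoch
            max_epochs = max(max_epochs, int(epoch * patience_expansion) + 1)
    # Always finish with the parameters that had the lowest validation error
    set_params(best_params)
    return best_epoch, best_error
```

The copy of the parameters matters: without it, later updates would overwrite the stored "best" model in place.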

In [2]:
#
# These are taken from https://github.com/mila-udem/blocks
# 

class Constant():
    """Initialize parameters to a constant.
    The constant may be a scalar or a :class:`~numpy.ndarray` of any shape
    that is broadcastable with the requested parameter arrays.
    Parameters
    ----------
    constant : :class:`~numpy.ndarray`
        The initialization value to use. Must be a scalar or an ndarray (or
        compatible object, such as a nested list) that has a shape that is
        broadcastable with any shape requested by `initialize`.
    """
    def __init__(self, constant):
        self._constant = numpy.asarray(constant)

    def generate(self, rng, shape):
        dest = numpy.empty(shape, dtype=np.float32)
        dest[...] = self._constant
        return dest


class IsotropicGaussian():
    """Initialize parameters from an isotropic Gaussian distribution.
    Parameters
    ----------
    std : float, optional
        The standard deviation of the Gaussian distribution. Defaults to 1.
    mean : float, optional
        The mean of the Gaussian distribution. Defaults to 0
    Notes
    -----
    Be careful: the standard deviation goes first and the mean goes
    second!
    """
    def __init__(self, std=1, mean=0):
        self._mean = mean
        self._std = std

    def generate(self, rng, shape):
        m = rng.normal(self._mean, self._std, size=shape)
        return m.astype(np.float32)


class Uniform():
    """Initialize parameters from a uniform distribution.
    Parameters
    ----------
    mean : float, optional
        The mean of the uniform distribution (i.e. the center of mass for
        the density function); Defaults to 0.
    width : float, optional
        One way of specifying the range of the uniform distribution. The
        support will be [mean - width/2, mean + width/2]. **Exactly one**
        of `width` or `std` must be specified.
    std : float, optional
        An alternative method of specifying the range of the uniform
        distribution. Chooses the width of the uniform such that random
        variates will have a desired standard deviation. **Exactly one** of
        `width` or `std` must be specified.
    """
    def __init__(self, mean=0., width=None, std=None):
        if (width is not None) == (std is not None):
            raise ValueError("must specify width or std, "
                             "but not both")
        if std is not None:
            # Variance of a uniform is 1/12 * width^2
            self._width = numpy.sqrt(12) * std
        else:
            self._width = width
        self._mean = mean

    def generate(self, rng, shape):
        w = self._width / 2
        m = rng.uniform(self._mean - w, self._mean + w, size=shape)
        return m.astype(np.float32)


In [3]:
class Layer(object):
    def __init__(self, rng=None):
        if rng is None:
            rng = numpy.random
        self.rng = rng
    
    @property
    def parameters(self):
        return []
    
    @property
    def parameter_names(self):
        return []
    
    def get_gradients(self, dLdY, fprop_context):
        return []
    

class AffineLayer(Layer):
    def __init__(self, num_in, num_out, weight_init=None, bias_init=None, dropout_percent=1.0, **kwargs):
        super(AffineLayer, self).__init__(**kwargs)
        if weight_init is None:
            #
            # TODO propose a default initialization scheme.
            # Type a sentence explaining why, and if you use a reference, 
            # cite it here
            #
            weight_init = IsotropicGaussian(mean=0., std=np.sqrt(2.0/(num_out*num_in)))
        if bias_init is None:
            bias_init = Constant(0.0)
        
        self.W = weight_init.generate(self.rng, (num_out, num_in))
        self.b = bias_init.generate(self.rng, (num_out, 1))
        self.dropout_percent = dropout_percent
    
    @property
    def parameters(self):
        return [self.W, self.b]
    
    @property
    def parameter_names(self):
        return ['W','b']
    
    def fprop(self, X, do_dropout=False):
        # Save X for later reuse in get_gradients.
        # Dropout is applied in the activation layers, so do_dropout is unused here.
        fprop_context = dict(X=X)
        Y = np.dot(self.W, X) + self.b
        return Y, fprop_context
    
    def bprop(self, dLdY, fprop_context):
        #
        # TODO: fill in gradient computation
        #
        dLdX = np.dot(self.W.T, dLdY)
        return dLdX
    
    def get_gradients(self, dLdY, fprop_context):
        X = fprop_context['X']
        dLdW = np.dot(dLdY, X.T)
        dLdb = dLdY.sum(1, keepdims=True)
        return [dLdW, dLdb]
    
class TanhLayer(Layer):
    def __init__(self, dropout_percent=1.0, **kwargs):
        super(TanhLayer, self).__init__(**kwargs)
        self.dropout_percent = dropout_percent
    
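    def fprop(self, X, do_dropout=False):
        A = np.tanh(X)
        mask = None
        if do_dropout and self.dropout_percent < 1.0:
            # Inverted dropout: dropout_percent is the probability of keeping a unit
            mask = ((np.random.rand(*A.shape) < self.dropout_percent) / self.dropout_percent).astype(A.dtype)
        Y = A if mask is None else A * mask
        # Keep the pre-dropout activation: the tanh derivative depends on it
        fprop_context = dict(A=A, mask=mask)
        return Y, fprop_context
    
    def bprop(self, dLdY, fprop_context):
        A = fprop_context['A']
        mask = fprop_context['mask']
        if mask is not None:
            # Backpropagate through the dropout mask (including its 1/p scaling)
            dLdY = dLdY * mask
        dLdX = dLdY * (1 - A**2)
        return dLdX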

    
class ReLULayer(Layer):
    def __init__(self, dropout_percent=1.0, **kwargs):
        super(ReLULayer, self).__init__(**kwargs)
        self.dropout_percent = dropout_percent
    
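    def fprop(self, X, do_dropout=False):
        A = np.maximum(X, 0.0)
        mask = None
        if do_dropout and self.dropout_percent < 1.0:
            # Inverted dropout: dropout_percent is the probability of keeping a unit
            mask = ((np.random.rand(*A.shape) < self.dropout_percent) / self.dropout_percent).astype(A.dtype)
        Y = A if mask is None else A * mask
        fprop_context = dict(A=A, mask=mask)
        return Y, fprop_context
    
    def bprop(self, dLdY, fprop_context):
        A = fprop_context['A']
        mask = fprop_context['mask']
        if mask is not None:
            # Backpropagate through the dropout mask (including its 1/p scaling)
            dLdY = dLdY * mask
        return dLdY * (A>0)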

    
class SoftMaxLayer(Layer):
    def __init__(self, **kwargs):
        super(SoftMaxLayer, self).__init__(**kwargs)
    
    def compute_probabilities(self, X):
        O = X - X.max(axis=0, keepdims=True)
        O = np.exp(O)
        O /= O.sum(axis=0, keepdims=True)
        return O
    
    def fprop_cost(self, X, Y):
        NS = X.shape[1]
        O = self.compute_probabilities(X)
        Cost = -1.0/NS * np.log(O[Y.ravel(), range(NS)]).sum()
        return Cost, O, dict(O=O, X=X, Y=Y)
    
    def bprop_cost(self, fprop_context):
        X = fprop_context['X']
        Y = fprop_context['Y']
        O = fprop_context['O']
        NS = X.shape[1]
        dLdX = O.copy()
        dLdX[Y, range(NS)] -= 1.0
        dLdX /= NS
        return dLdX
    
class FeedForwardNet(object):
    def __init__(self, layers=None):
        if layers is None:
            layers = []
        self.layers = layers
    
    def add(self, layer):
        self.layers.append(layer)
    
    @property
    def parameters(self):
        params = []
        for layer in self.layers:
            params += layer.parameters
        return params
    
    @parameters.setter
    def parameters(self, values):
        for ownP, newP in zip(self.parameters, values):
            ownP[...] = newP
    
    @property
    def parameter_names(self):
        param_names = []
        for layer in self.layers:
            param_names += layer.parameter_names
        return param_names
    
    def fprop(self, X, do_dropout=False):
        for layer in self.layers[:-1]:
            X, fp_context = layer.fprop(X, do_dropout)
        return self.layers[-1].compute_probabilities(X)
    
    def get_cost_and_gradient(self, X, Y, do_dropout=False):
        fp_contexts = []
        for layer in self.layers[:-1]:
            X, fp_context = layer.fprop(X, do_dropout)
            fp_contexts.append(fp_context)
        
        L, O, fp_context = self.layers[-1].fprop_cost(X, Y)
        dLdX = self.layers[-1].bprop_cost(fp_context)
        
        dLdP = [] #gradient with respect to parameters
        for i in xrange(len(self.layers)-1):
            layer = self.layers[len(self.layers)-2-i]
            fp_context = fp_contexts[len(self.layers)-2-i]
            dLdP = layer.get_gradients(dLdX, fp_context) + dLdP
            dLdX = layer.bprop(dLdX, fp_context)
        return L, O, dLdP


In [4]:
#training algorithms. They change the network!
def GD(net, X, Y, alpha=1e-4, max_iters=1000000, tolerance=1e-6):
    """
    Simple batch gradient descent
    """
    old_L = np.inf
    for i in xrange(max_iters):
        L, O, gradients = net.get_cost_and_gradient(X, Y)
        if old_L < L:
            print "Iter: %d, loss increased!!" % (i,)
        if (old_L - L)<tolerance:
            print "Tolerance level reached exiting"
            break
        if i % 1000 == 0:
            err_rate = (O.argmax(0) != Y).mean()
            print "At iteration %d, loss %f, train error rate %f%%" % (i, L, err_rate*100)
        for P,G in zip(net.parameters, gradients):
            P -= alpha * G
        old_L = L

In [5]:
from sklearn import datasets
iris = datasets.load_iris()
IrisX = iris.data.T
IrisX = (IrisX - IrisX.mean(axis=1, keepdims=True)) / IrisX.std(axis=1, keepdims=True)
IrisY = iris.target.reshape(1,-1)

In [15]:
#
# Here we verify that the network can be trained on Irises.
# Most runs should result in 100% accuracy
#

net = FeedForwardNet([
        AffineLayer(4,10),
        TanhLayer(),
        AffineLayer(10,3),
        SoftMaxLayer()
        ])
GD(net, IrisX,IrisY, 1e-1, tolerance=1e-7, max_iters=50000)

At iteration 0, loss 1.373673, train error rate 92.666667%
At iteration 1000, loss 0.053196, train error rate 2.000000%
At iteration 2000, loss 0.043734, train error rate 1.333333%
At iteration 3000, loss 0.041226, train error rate 1.333333%
At iteration 4000, loss 0.040079, train error rate 1.333333%
At iteration 5000, loss 0.039360, train error rate 1.333333%
At iteration 6000, loss 0.038813, train error rate 1.333333%
At iteration 7000, loss 0.038353, train error rate 1.333333%
At iteration 8000, loss 0.037945, train error rate 1.333333%
At iteration 9000, loss 0.037574, train error rate 1.333333%
At iteration 10000, loss 0.037232, train error rate 1.333333%
At iteration 11000, loss 0.036910, train error rate 1.333333%
At iteration 12000, loss 0.036601, train error rate 1.333333%
At iteration 13000, loss 0.036294, train error rate 1.333333%
At iteration 14000, loss 0.035982, train error rate 1.333333%
At iteration 15000, loss 0.035654, train error rate 1.333333%
At iteration 16000, 

## Reading data from Fuel

The following cell prepares the data pipeline in Fuel. Please see the SGD template below for a usage example.

In [7]:
from fuel.datasets.mnist import MNIST
from fuel.transformers import ScaleAndShift, Cast, Flatten, Mapping
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme, ShuffledScheme

MNIST.default_transformers = (
    (ScaleAndShift, [2.0 / 255.0, -1], {'which_sources': 'features'}),
    (Cast, [np.float32], {'which_sources': 'features'}), 
    (Flatten, [], {'which_sources': 'features'}),
    (Mapping, [lambda batch: (b.T for b in batch)], {}) )

mnist_train = MNIST(("train",), subset=slice(None,50000))
#this stream will shuffle the MNIST set and return us batches of 100 examples
mnist_train_stream = DataStream.default_stream(
    mnist_train,
    iteration_scheme=ShuffledScheme(mnist_train.num_examples, 100))
                                               
mnist_validation = MNIST(("train",), subset=slice(50000, None))

# We will use larger batches for testing and validation
# as these don't do a backward pass and require less RAM.
mnist_validation_stream = DataStream.default_stream(
    mnist_validation, iteration_scheme=SequentialScheme(mnist_validation.num_examples, 250))
mnist_test = MNIST(("test",))
mnist_test_stream = DataStream.default_stream(
    mnist_test, iteration_scheme=SequentialScheme(mnist_test.num_examples, 250))

In [8]:
print "The streams return batches containing %s" % (mnist_train_stream.sources,)

print "Each training batch consists of a tuple containing:"
for element in next(mnist_train_stream.get_epoch_iterator()):
    print " - an array of size %s containing %s" % (element.shape, element.dtype)
    
print "Validation/test batches consist of tuples containing:"
for element in next(mnist_test_stream.get_epoch_iterator()):
    print " - an array of size %s containing %s" % (element.shape, element.dtype)

The streams return batches containing (u'features', u'targets')
Each training batch consists of a tuple containing:
 - an array of size (784, 100) containing float32
 - an array of size (1, 100) containing uint8
Validation/test batches consist of tuples containing:
 - an array of size (784, 250) containing float32
 - an array of size (1, 250) containing uint8


# Problem 1 [4p]

Implement the following additions to the SGD code below:
1. Momentum [2p]
2. Learning rate schedule [1p]
3. Weight decay [1p]. One way to implement it is to use the properties `net.parameters` and `net.parameter_names` to select the parameters named "W" and skip those named "b".
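
The three additions can be sketched together on a toy problem. This is a minimal illustration under stated assumptions, not the reference solution: `momentum_step` is a hypothetical helper, the constants are placeholders, and the "network" is just two arrays whose loss is `0.5*||W||^2 + 0.5*||b||^2`. The point to notice is that the velocity is written with `V[...] = ...`, so the stored array is updated in place; a plain `V = ...` would only rebind the loop variable and the momentum would be lost.

```python
import numpy as np

def momentum_step(params, names, grads, velocities, alpha,
                  momentum=0.9, weight_decay=5e-4):
    """One in-place SGD step with momentum; L2 weight decay on 'W' only."""
    for P, N, G, V in zip(params, names, grads, velocities):
        if N == 'W':
            # Gradient of 0.5 * weight_decay * ||W||^2; biases are not decayed
            G = G + weight_decay * P
        V[...] = momentum * V - alpha * G   # update the stored velocity in place
        P += V                              # in-place parameter update

# Toy usage: minimize 0.5*||W||^2 + 0.5*||b||^2, whose gradients are W and b.
W = np.array([[1.0, -2.0]])
b = np.array([[0.5]])
params, names = [W, b], ['W', 'b']
velocities = [np.zeros_like(P) for P in params]
for i in range(200):
    grads = [W.copy(), b.copy()]
    alpha = 0.1 * 0.999 ** i                # exponentially decaying learning rate
    momentum_step(params, names, grads, velocities, alpha)
```

After the loop both `W` and `b` should be very close to zero, the minimum of the toy loss.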

In [10]:
#
# Please note: the code below trains a SoftMax regression model on MNIST with poor results (ca. 8% test error).
# You must improve it.
#

from copy import deepcopy


def compute_error_rate(net, stream):
    num_errs = 0.0
    num_examples = 0
    for X, Y in stream.get_epoch_iterator():
        O = net.fprop(X)
        num_errs += (O.argmax(0) != Y).sum()
        num_examples += X.shape[1]
    return num_errs/num_examples

def SGD(net, train_stream, validation_stream, test_stream):
    i=0
    e=0
    
    # Initialize momentum variables.
    # Hint: you need one velocity matrix for each parameter
    momentum = 0.99
    decay = 0.0005
    alpha0 = 3e-1
    alpha = alpha0

    velocities = [np.zeros(P.shape, dtype=P.dtype) for P in net.parameters]
    
    best_valid_error_rate = np.inf
    best_params = deepcopy(net.parameters)
    best_params_epoch = 0
    
    train_erros = []
    train_loss = []
    validation_errors = []
    
    number_of_epochs = 3
    patience_expansion = 1.5
    
    try:
        while e<number_of_epochs: #This loop goes over epochs
            e += 1
            # First train on all minibatches of this epoch
            for X,Y in train_stream.get_epoch_iterator(): 
                i += 1
#                 lr *= (1. / (1. + decay * i))
                
                L, O, gradients = net.get_cost_and_gradient(X, Y, True)
                err_rate = (O.argmax(0) != Y).mean()
                train_loss.append((i,L))
                train_erros.append((i,err_rate))
                if i % 100 == 0:
                    print "At minibatch %d, batch loss %f, batch error rate %f%%, alpha %f" % (i, L, err_rate*100, alpha)
                for P, V, G, N in zip(net.parameters, velocities, gradients, net.parameter_names):
                    if N=='W':
                        # Weight decay: add the gradient of 0.5 * decay * ||W||^2.
                        # Biases are not decayed.
                        G += decay * P
                    
                    # Learning rate schedule: exponential decay in the iteration counter i
                    alpha = alpha0 * 0.9999**(i)
                    #
                    # TODO: set the momentum constant 
                    
                    #
                    # TODO: implement velocity update in momentum
                    #
#                     V = momentum * V - alpha * G
                    
#                     if i % 100 == 0:
#                         print 'G', np.where(np.linalg.norm(G, axis=1) > 0.1)
#                         print 'V', np.where(np.linalg.norm(V,axis=1) > 0.1)
                    row_norm = np.linalg.norm(G, axis=1)
                    max_norm = 0.1
                    mask = np.where(row_norm > max_norm)
                    row_norm[mask] = row_norm[mask]/max_norm
                    G[mask] = G[mask]/row_norm[mask].reshape((-1,1))
                    
                    # V[...] = TODO
                    V = momentum * V - alpha * G
                    P += V
#                     P += - alpha * G
#                         continue
#                     V = 0.9 * V + (1. - 0.9) * np.square(G)
#                     P += - alpha * G / (np.sqrt(V) + epsilon)
                    
                    #
                    # TODO: set a more sensible learning rule here,
                    # using your learning rate schedule and momentum
                    #
                    #!!!!! Need to modify the actual parameter here!
                    #print i, 'grad', len(G), len(P), len(G[0]), len(P[0])
#                     M = beta1 * M + (1 - beta1) * G
#                     V = beta2 * V + (1 - beta2) * G**2
#                     M2 = M/(1 - beta1**i)
#                     V2 = V/(1 - beta2**i)
#                     P = P - alpha * M2 / (np.sqrt(V2) + epsilon)
                    
#                     V = epsilon * V - alpha * G
#                     P += V
#                     P += - alpha * G
#                     if i < 2000:
#                         P += - alpha * G
#                         continue
    
#                     V = rho * V + (1. - rho) * np.square(G)
#                     print - alpha * G / (np.sqrt(V) + epsilon) 
#                     P += - alpha * G / (np.sqrt(V) + epsilon)
        
#                     M = (beta1 * M) + (1. - beta1) * G
#                     V = (beta2 * V) + (1. - beta2) * np.square(G)
#                     P = P - lr * M / (np.sqrt(V) + epsilon)
            
            
            # After an epoch compute validation error
            val_error_rate = compute_error_rate(net, validation_stream)
            # Record every epoch so that the validation curve is complete
            validation_errors.append((i,val_error_rate))
            if val_error_rate < best_valid_error_rate:
                number_of_epochs = np.maximum(number_of_epochs, e * patience_expansion+1)
                best_valid_error_rate = val_error_rate
                best_params = deepcopy(net.parameters)
                best_params_epoch = e
            # val_error_rate is a fraction of misclassified examples, so 0.017200 means 1.72%
            print "After epoch %d: valid_err_rate: %f%% currently going to do %d epochs" %(
                e, val_error_rate, number_of_epochs)
            
    except KeyboardInterrupt:
        # Allow manual interruption; the best parameters are restored below
        pass

    # Restore the best parameters and plot, whether or not training was interrupted
    print "Setting network parameters from after epoch %d" %(best_params_epoch)
    net.parameters = best_params

    subplot(2,1,1)
    train_loss = np.array(train_loss)
    semilogy(train_loss[:,0], train_loss[:,1], label='batch train loss')
    legend()

    subplot(2,1,2)
    train_erros = np.array(train_erros)
    plot(train_erros[:,0], train_erros[:,1], label='batch train error rate')
    validation_errors = np.array(validation_errors)
    plot(validation_errors[:,0], validation_errors[:,1], label='validation error rate', color='r')
    ylim(0,0.2)
    legend()

# Problem 2 [5p]

Tune the following network to reach below 1.9% error rate on
the validation set. This should result in a test error below 2%. To
tune the network you will need to:
1. choose the number of layers (more than 1, less than 5),
2. choose the number of neurons in each layer (more than 100,
    less than 5000),
3. pick a proper weight initialization,
4. pick a proper learning rate schedule (it needs to decay over time;
    a good range to check on MNIST is about 1e-2 ... 1e-1 at the beginning and
    half of that after 10000 batches),
5. pick a momentum constant (probably a constant one will be OK).
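
Two of these choices can be written down directly. The sketch below is one possible reading of the hints, not a prescribed answer: `learning_rate` is a hypothetical helper that starts at `alpha0` and halves every `half_life` batches, and `fan_in_gaussian` is a hypothetical initializer that scales the standard deviation by the number of inputs (the value `gain=2.0` is the scaling commonly used with ReLU units and is an assumption here).

```python
import numpy as np

def learning_rate(i, alpha0=5e-2, half_life=10000):
    """Exponential decay: alpha0 at batch 0, alpha0 / 2 after half_life batches."""
    return alpha0 * 0.5 ** (i / float(half_life))

def fan_in_gaussian(rng, num_out, num_in, gain=2.0):
    """Zero-mean Gaussian weights with variance gain / num_in."""
    std = np.sqrt(gain / float(num_in))
    return rng.normal(0.0, std, size=(num_out, num_in)).astype(np.float32)

rng = np.random.RandomState(0)
W1 = fan_in_gaussian(rng, 600, 784)   # weights for a 784 -> 600 layer
```

Scaling by the fan-in keeps the variance of a layer's pre-activations roughly independent of its width, which makes deeper stacks easier to train than a fixed standard deviation would.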


In [11]:
#
# TODO: pick a network architecture here. The template started from plain
# softmax regression; the network below uses two hidden ReLU layers.
#

net = FeedForwardNet([
        AffineLayer(784,600),
        ReLULayer(),
        AffineLayer(600,200),
        ReLULayer(),
        AffineLayer(200,10),
        SoftMaxLayer()
        ])

SGD(net, mnist_train_stream, mnist_validation_stream, mnist_test_stream)

print "Test error rate: %f" % (compute_error_rate(net, mnist_test_stream), )

At minibatch 100, batch loss 0.841693, batch error rate 31.000000%, alpha 0.297045
At minibatch 200, batch loss 0.636310, batch error rate 21.000000%, alpha 0.294089
At minibatch 300, batch loss 0.323156, batch error rate 11.000000%, alpha 0.291162
At minibatch 400, batch loss 0.470165, batch error rate 15.000000%, alpha 0.288265
At minibatch 500, batch loss 0.187942, batch error rate 4.000000%, alpha 0.285397
After epoch 1: valid_err_rate: 0.078900% currently going ot do 3 epochs
At minibatch 600, batch loss 0.326579, batch error rate 10.000000%, alpha 0.282557
At minibatch 700, batch loss 0.122720, batch error rate 5.000000%, alpha 0.279745
At minibatch 800, batch loss 0.408577, batch error rate 15.000000%, alpha 0.276961
At minibatch 900, batch loss 0.216251, batch error rate 6.000000%, alpha 0.274206
At minibatch 1000, batch loss 0.240067, batch error rate 7.000000%, alpha 0.271477
After epoch 2: valid_err_rate: 0.044200% currently going ot do 4 epochs
At minibatch 1100, batch loss

After epoch 17: valid_err_rate: 0.019300% currently going ot do 25 epochs
At minibatch 8600, batch loss 0.006762, batch error rate 0.000000%, alpha 0.126956
At minibatch 8700, batch loss 0.004966, batch error rate 0.000000%, alpha 0.125693
At minibatch 8800, batch loss 0.015526, batch error rate 1.000000%, alpha 0.124442
At minibatch 8900, batch loss 0.001356, batch error rate 0.000000%, alpha 0.123204
At minibatch 9000, batch loss 0.001908, batch error rate 0.000000%, alpha 0.121978
After epoch 18: valid_err_rate: 0.016800% currently going ot do 28 epochs
At minibatch 9100, batch loss 0.004648, batch error rate 0.000000%, alpha 0.120764
At minibatch 9200, batch loss 0.002310, batch error rate 0.000000%, alpha 0.119562
At minibatch 9300, batch loss 0.000501, batch error rate 0.000000%, alpha 0.118372
At minibatch 9400, batch loss 0.002352, batch error rate 0.000000%, alpha 0.117195
At minibatch 9500, batch loss 0.013590, batch error rate 1.000000%, alpha 0.116028
After epoch 19: valid_

In [12]:
epoch = 28, 0.0172
compute_error_rate(net, mnist_test_stream)

0.0173

# Problem 3 [2p bonus]

Implement norm constraints, i.e. limit the total
norm of connections incoming to a neuron. In our case, this
corresponds to clipping the norm of *rows* of weight
matrices. An easy way of implementing it is to make a gradient
step, then look at the norm of rows and scale down those that are
over the threshold (this technique is called "projected gradient descent").
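
A minimal sketch of the projection step, assuming weight matrices of shape `(num_out, num_in)` as in `AffineLayer`, so that each row holds the incoming connections of one neuron. `clip_row_norms` is a hypothetical helper and the threshold is a free hyperparameter. It is meant to be called on each "W" parameter right after the gradient step.

```python
import numpy as np

def clip_row_norms(W, max_norm):
    """Rescale, in place, every row of W whose L2 norm exceeds max_norm."""
    norms = np.sqrt((W ** 2).sum(axis=1, keepdims=True))
    # Rows within the limit get scale 1; longer rows are shrunk onto the limit
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-12))
    W *= scale
    return W

W = np.array([[3.0, 4.0],    # norm 5   -> rescaled to norm 1
              [0.3, 0.4]])   # norm 0.5 -> left unchanged
clip_row_norms(W, max_norm=1.0)
```

Note that this constrains the weights themselves, which is different from clipping the norm of the gradient.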

# Problem 4 [2p bonus]

Implement a **dropout** layer and try to train a
network that gets below 1.5% test error with dropout (the best
results with dropout are below 1%!). Details: http://arxiv.org/pdf/1207.0580.pdf.
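
One way to structure such a layer is "inverted" dropout: at training time each unit is kept with probability `keep_prob` and the survivors are scaled by `1/keep_prob`, so nothing needs to change at test time. The class below is a sketch following the `fprop`/`bprop` interface of this notebook; the name `DropoutSketch` and its parameters are hypothetical. The same mask is reused in the backward pass, which is what makes the gradient consistent with the forward computation.

```python
import numpy as np

class DropoutSketch(object):
    def __init__(self, keep_prob=0.5, rng=None):
        self.keep_prob = keep_prob
        self.rng = rng if rng is not None else np.random.RandomState(0)

    def fprop(self, X, do_dropout=False):
        if not do_dropout or self.keep_prob >= 1.0:
            # Test time (or dropout disabled): identity
            return X, dict(mask=None)
        # Keep each unit with probability keep_prob, rescale the survivors
        mask = (self.rng.rand(*X.shape) < self.keep_prob) / self.keep_prob
        return X * mask, dict(mask=mask)

    def bprop(self, dLdY, fprop_context):
        mask = fprop_context['mask']
        # Dropped units receive zero gradient; kept units get the same 1/p scaling
        return dLdY if mask is None else dLdY * mask

layer = DropoutSketch(keep_prob=0.5, rng=np.random.RandomState(1))
X = np.ones((4, 1000))
Y_train, ctx = layer.fprop(X, do_dropout=True)
Y_test, _ = layer.fprop(X)
```

With `keep_prob=0.5` the training-time output contains only the values 0 and 2, and its mean stays close to the mean of the input.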

# Problem 5 [3p bonus]

Implement convolutional and max-pooling layers and (without dropout) get a test error rate below 1.5%.
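
As a starting point for the pooling half of this problem, non-overlapping 2x2 max-pooling can be written with a reshape. This sketch assumes inputs laid out as `(N, C, H, W)` with even `H` and `W`, which differs from the `(features, batch)` layout used by the other layers, so MNIST batches would have to be reshaped first. The function names are hypothetical and the convolution itself is not covered.

```python
import numpy as np

def maxpool2x2_fprop(X):
    """Max over non-overlapping 2x2 windows of X with shape (N, C, H, W)."""
    N, C, H, W = X.shape
    # Split H into (H/2, 2) and W into (W/2, 2), then reduce the window axes
    windows = X.reshape(N, C, H // 2, 2, W // 2, 2)
    Y = windows.max(axis=(3, 5))
    return Y, dict(X=X, Y=Y)

def maxpool2x2_bprop(dLdY, fprop_context):
    """Route each output gradient to the position(s) that held the maximum."""
    X, Y = fprop_context['X'], fprop_context['Y']
    # Upsample the outputs and their gradients back to the input resolution
    Y_up = Y.repeat(2, axis=2).repeat(2, axis=3)
    dY_up = dLdY.repeat(2, axis=2).repeat(2, axis=3)
    # If a window has tied maxima, every tied position receives the gradient
    return dY_up * (X == Y_up)

X = np.arange(16.0).reshape(1, 1, 4, 4)
Y, ctx = maxpool2x2_fprop(X)
dLdX = maxpool2x2_bprop(np.ones_like(Y), ctx)
```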

# Problem 6 [1-3p bonus]

Implement a data augmentation method (e.g. rotations, noise, crops) that will yield a significant test error rate reduction for your network. The number of bonus points depends on the ingenuity of your solution and the error rate gains.
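
A simple baseline to compare against is random translation plus pixel noise. The sketch below assumes batches shaped `(784, N)` with values in `[-1, 1]`, as produced by the streams in this notebook, and row-major 28x28 images. `augment_batch` and its default parameters are hypothetical, and `np.roll` wraps pixels around the border, which is a shortcut that is tolerable only because MNIST borders are mostly background.

```python
import numpy as np

def augment_batch(X, rng, max_shift=2, noise_std=0.05):
    """Return a randomly shifted, noisy copy of a (784, N) batch."""
    N = X.shape[1]
    imgs = X.T.reshape(N, 28, 28)
    out = np.empty_like(imgs)
    for n in range(N):
        # Independent vertical and horizontal shift for every image
        dy, dx = rng.randint(-max_shift, max_shift + 1, size=2)
        out[n] = np.roll(np.roll(imgs[n], dy, axis=0), dx, axis=1)
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    out = np.clip(out, -1.0, 1.0)           # stay inside the input range
    return out.reshape(N, 784).T.astype(X.dtype)

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=(784, 5)).astype(np.float32)
X_aug = augment_batch(X, rng)
```

Augmentation should be applied to training batches only; validation and test batches stay untouched so that the error rates remain comparable.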

In [38]:
net = FeedForwardNet([
        AffineLayer(784,800, dropout_percent=1),
        ReLULayer(dropout_percent=0.77),
#         TanhLayer(),
        AffineLayer(800,600, dropout_percent=1),
        ReLULayer(dropout_percent=0.6),
#         TanhLayer(),
        AffineLayer(600,10, dropout_percent=1),
#         ReLULayer(),
        SoftMaxLayer()
        ])

SGD(net, mnist_train_stream, mnist_validation_stream, mnist_test_stream)

print "Test error rate: %f" % (compute_error_rate(net, mnist_test_stream), )

At minibatch 100, batch loss 1.020404, batch error rate 40.000000%, alpha 0.297045
At minibatch 200, batch loss 0.362937, batch error rate 14.000000%, alpha 0.294089
At minibatch 300, batch loss 0.351410, batch error rate 9.000000%, alpha 0.291162
At minibatch 400, batch loss 0.375008, batch error rate 12.000000%, alpha 0.288265
At minibatch 500, batch loss 0.334091, batch error rate 14.000000%, alpha 0.285397
After epoch 1: valid_err_rate: 0.085500% currently going ot do 3 epochs
At minibatch 600, batch loss 0.204162, batch error rate 4.000000%, alpha 0.282557
At minibatch 700, batch loss 0.253866, batch error rate 7.000000%, alpha 0.279745
At minibatch 800, batch loss 0.135297, batch error rate 4.000000%, alpha 0.276961
At minibatch 900, batch loss 0.179030, batch error rate 5.000000%, alpha 0.274206
At minibatch 1000, batch loss 0.219061, batch error rate 8.000000%, alpha 0.271477
After epoch 2: valid_err_rate: 0.062800% currently going ot do 4 epochs
At minibatch 1100, batch loss 0

After epoch 17: valid_err_rate: 0.020400% currently going ot do 25 epochs
At minibatch 8600, batch loss 0.019205, batch error rate 0.000000%, alpha 0.126956
At minibatch 8700, batch loss 0.012422, batch error rate 0.000000%, alpha 0.125693
At minibatch 8800, batch loss 0.119355, batch error rate 3.000000%, alpha 0.124442
At minibatch 8900, batch loss 0.064709, batch error rate 1.000000%, alpha 0.123204
At minibatch 9000, batch loss 0.006834, batch error rate 0.000000%, alpha 0.121978
After epoch 18: valid_err_rate: 0.016700% currently going ot do 28 epochs
At minibatch 9100, batch loss 0.026817, batch error rate 1.000000%, alpha 0.120764
At minibatch 9200, batch loss 0.011356, batch error rate 0.000000%, alpha 0.119562
At minibatch 9300, batch loss 0.024197, batch error rate 1.000000%, alpha 0.118372
At minibatch 9400, batch loss 0.073037, batch error rate 2.000000%, alpha 0.117195
At minibatch 9500, batch loss 0.003498, batch error rate 0.000000%, alpha 0.116028
After epoch 19: valid_

At minibatch 16900, batch loss 0.002315, batch error rate 0.000000%, alpha 0.055357
At minibatch 17000, batch loss 0.008366, batch error rate 0.000000%, alpha 0.054806
After epoch 34: valid_err_rate: 0.014700% currently going ot do 52 epochs
At minibatch 17100, batch loss 0.004491, batch error rate 0.000000%, alpha 0.054261
At minibatch 17200, batch loss 0.003518, batch error rate 0.000000%, alpha 0.053721
At minibatch 17300, batch loss 0.043855, batch error rate 3.000000%, alpha 0.053186
At minibatch 17400, batch loss 0.003529, batch error rate 0.000000%, alpha 0.052657
At minibatch 17500, batch loss 0.006873, batch error rate 0.000000%, alpha 0.052133
After epoch 35: valid_err_rate: 0.015500% currently going ot do 52 epochs
At minibatch 17600, batch loss 0.005874, batch error rate 0.000000%, alpha 0.051614
At minibatch 17700, batch loss 0.008297, batch error rate 0.000000%, alpha 0.051100
At minibatch 17800, batch loss 0.014796, batch error rate 1.000000%, alpha 0.050592
At minibatch

At minibatch 25200, batch loss 0.001106, batch error rate 0.000000%, alpha 0.024137
At minibatch 25300, batch loss 0.004883, batch error rate 0.000000%, alpha 0.023897
At minibatch 25400, batch loss 0.010190, batch error rate 1.000000%, alpha 0.023659
At minibatch 25500, batch loss 0.002799, batch error rate 0.000000%, alpha 0.023424
After epoch 51: valid_err_rate: 0.015100% currently going ot do 67 epochs
At minibatch 25600, batch loss 0.016360, batch error rate 1.000000%, alpha 0.023191
At minibatch 25700, batch loss 0.001897, batch error rate 0.000000%, alpha 0.022960
At minibatch 25800, batch loss 0.004626, batch error rate 0.000000%, alpha 0.022732
At minibatch 25900, batch loss 0.007488, batch error rate 0.000000%, alpha 0.022505
At minibatch 26000, batch loss 0.000818, batch error rate 0.000000%, alpha 0.022281
After epoch 52: valid_err_rate: 0.014900% currently going ot do 67 epochs
At minibatch 26100, batch loss 0.004430, batch error rate 0.000000%, alpha 0.022060
At minibatch

After epoch 67: valid_err_rate: 0.014500% currently going ot do 97 epochs
At minibatch 33600, batch loss 0.007963, batch error rate 0.000000%, alpha 0.010420
At minibatch 33700, batch loss 0.003736, batch error rate 0.000000%, alpha 0.010316
At minibatch 33800, batch loss 0.003851, batch error rate 0.000000%, alpha 0.010214
At minibatch 33900, batch loss 0.010042, batch error rate 0.000000%, alpha 0.010112
At minibatch 34000, batch loss 0.001763, batch error rate 0.000000%, alpha 0.010011
After epoch 68: valid_err_rate: 0.014900% currently going ot do 97 epochs
At minibatch 34100, batch loss 0.032139, batch error rate 1.000000%, alpha 0.009912
At minibatch 34200, batch loss 0.006107, batch error rate 0.000000%, alpha 0.009813
At minibatch 34300, batch loss 0.007335, batch error rate 0.000000%, alpha 0.009715
At minibatch 34400, batch loss 0.020998, batch error rate 1.000000%, alpha 0.009619
At minibatch 34500, batch loss 0.001471, batch error rate 0.000000%, alpha 0.009523
After epoch 

At minibatch 41900, batch loss 0.001953, batch error rate 0.000000%, alpha 0.004543
At minibatch 42000, batch loss 0.001536, batch error rate 0.000000%, alpha 0.004498
After epoch 84: valid_err_rate: 0.014900% currently going ot do 97 epochs
At minibatch 42100, batch loss 0.006769, batch error rate 0.000000%, alpha 0.004453
At minibatch 42200, batch loss 0.009109, batch error rate 0.000000%, alpha 0.004409
At minibatch 42300, batch loss 0.001051, batch error rate 0.000000%, alpha 0.004365
At minibatch 42400, batch loss 0.005389, batch error rate 0.000000%, alpha 0.004322
At minibatch 42500, batch loss 0.010829, batch error rate 0.000000%, alpha 0.004279
After epoch 85: valid_err_rate: 0.014800% currently going ot do 97 epochs
At minibatch 42600, batch loss 0.001430, batch error rate 0.000000%, alpha 0.004236
At minibatch 42700, batch loss 0.001529, batch error rate 0.000000%, alpha 0.004194
At minibatch 42800, batch loss 0.003832, batch error rate 0.000000%, alpha 0.004152
At minibatch