<h3>Theano Tutorial - Preliminaries</h3>
Here are just the basics of getting started with Theano for deep learning.  For starters, the common namespace imports are:

In [3]:
import theano
import theano.tensor as T
import numpy

<h3>Storing Datasets</h3>
Ideally, we'd like to take advantage of GPU calculations to speed up our code.  Since most training algorithms use minibatches, it might be tempting to define a new minibatch variable on every iteration - but this also means an expensive transfer of data from the CPU to the GPU on every cycle.  Instead, it is common practice to store the whole dataset in a shared variable and select each minibatch by slicing it.  Note that the data should always be stored as floats, using theano.config.floatX (typically float32 on a GPU).


In [37]:
from sklearn.datasets import load_digits
from keras.utils import np_utils
x = load_digits()['data']
y = load_digits()['target']
y = np_utils.to_categorical(y)
y[0]

array([ 1.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.])

In [34]:
shared_x = theano.shared(numpy.asarray(x, dtype=theano.config.floatX))
shared_y = theano.shared(numpy.asarray(y, dtype=theano.config.floatX))

Therefore, if we have a batch size of 100, accessing the third batch becomes the following:

In [35]:
x_train = shared_x[2 * 100: 3 * 100]
y_train = shared_y[2 * 100: 3 * 100]
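
The slice bounds above follow a general pattern: batch number i (counting from 0) covers rows i * batch_size up to (i + 1) * batch_size. The sketch below illustrates that arithmetic with plain NumPy on stand-in data of the same shape as the digits dataset; the helper name batch_slice is hypothetical, and the assumption is that the same bounds carry over unchanged to slices of shared_x and shared_y.

```python
import numpy as np

def batch_slice(index, batch_size):
    """Return the slice selecting minibatch number `index` (0-based)."""
    return slice(index * batch_size, (index + 1) * batch_size)

# Stand-in data with the shape of load_digits()['data']: 1797 rows, 64 features.
data = np.arange(1797 * 64, dtype=np.float32).reshape(1797, 64)

third_batch = data[batch_slice(2, 100)]    # rows 200..299, as in the cell above
n_full_batches = data.shape[0] // 100      # number of complete batches of 100
last_partial = data[batch_slice(17, 100)]  # slicing past the end returns the remainder
```

With 1797 rows and a batch size of 100 the final slice is shorter than the others, so a training loop either has to skip it or handle a smaller batch.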

<h3>Loss Functions</h3>
<h4>0-1 Loss</h4>

$$\ell_{0,1} = \sum_{i=0}^{|D|} I_{f(x^{(i)}) \neq y^{(i)}} $$

This can be implemented in Theano as a symbolic expression - it is compiled by Theano rather than run in Python.

In [13]:
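p_y_given_x = T.matrix('p_y_given_x')  # placeholder: class probabilities, one row per example
y_true = T.ivector('y_true')           # placeholder: integer class labels (not one-hot)
y_pred = T.argmax(p_y_given_x, axis=1) # predicted class for each example
loss_zero_one = T.sum(T.neq(y_pred, y_true))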
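
To make the counting concrete, here is a small NumPy sketch of the same 0-1 loss on made-up probabilities. The function name zero_one_loss and the example values are illustrative assumptions, not part of the tutorial. It also shows that one-hot targets, like the ones produced by to_categorical above, can be turned back into integer labels with an argmax.

```python
import numpy as np

def zero_one_loss(p_y_given_x, y_true):
    """Count misclassified examples.

    p_y_given_x: (n_examples, n_classes) array of scores or probabilities.
    y_true: (n_examples,) array of integer class labels.
    """
    y_pred = np.argmax(p_y_given_x, axis=1)
    return int(np.sum(y_pred != y_true))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
labels = np.array([0, 2, 2])

errors = zero_one_loss(probs, labels)   # only the second row is misclassified

# One-hot rows can be converted back to integer labels with argmax.
one_hot = np.eye(3)[labels]
recovered = np.argmax(one_hot, axis=1)
```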

<h4>Negative Log-likelihood Loss</h4>
A differentiable function that is easier to optimize than the 0-1 loss.  It considers the probability the model assigns to the actual label of each example, where $\theta$ are the parameters of the model.

$$L(\theta, D) = -\sum_{i=0}^{|D|} \log P(Y = y^{(i)} | x^{(i)}, \theta) $$

In [42]:
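# here y must be a symbolic vector of integer class labels (not the one-hot
# matrix built earlier) and p_y_given_x the model's matrix of class probabilities
#loss_nll = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)-1].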
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector.  Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
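
The indexing trick described in the comments behaves the same way in NumPy, so it can be checked on a tiny example. The probabilities and labels below are made up for illustration.

```python
import numpy as np

# Made-up class probabilities for three examples and three classes.
p_y_given_x = np.array([[0.70, 0.20, 0.10],
                        [0.10, 0.80, 0.10],
                        [0.25, 0.25, 0.50]])
y = np.array([0, 1, 2])            # integer labels, one per example

rows = np.arange(y.shape[0])       # [0, 1, 2]
picked = p_y_given_x[rows, y]      # probability assigned to each correct label
loss_nll = -np.sum(np.log(picked)) # negative log-likelihood of the three labels
```

Here picked holds the entries at positions (0, 0), (1, 1) and (2, 2), which is exactly the "one element per row" selection that the symbolic expression relies on.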

<h3>Gradient Descent</h3>

    while True:
        loss = f(params)
        d_loss = ... # compute gradient
        params -= learning_rate * d_loss
        if <stopping_condition_met>:
            return params
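
As a runnable counterpart to the pseudocode above, here is a minimal NumPy sketch of full-batch gradient descent. The toy least-squares problem, the learning rate and the gradient-norm stopping rule are all assumptions chosen for illustration; the tutorial does not prescribe them.

```python
import numpy as np

# Hypothetical toy problem: recover w_true from noise-free linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

def loss_and_grad(params):
    """Mean squared error over the whole dataset, and its gradient."""
    residual = X @ params - y
    loss = np.mean(residual ** 2)
    d_loss = 2.0 * X.T @ residual / len(y)
    return loss, d_loss

params = np.zeros(2)
learning_rate = 0.1
for step in range(10_000):
    loss, d_loss = loss_and_grad(params)
    if np.linalg.norm(d_loss) < 1e-8:   # stopping condition: gradient is ~0
        break
    params = params - learning_rate * d_loss
```

Every iteration touches all 50 rows of X, which is the cost that the stochastic variants try to avoid.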
        
Calculating the gradient using every datapoint is very expensive; we can instead estimate the gradient using only a single example.

<h3>Stochastic Gradient Descent</h3>

    for (x_i,y_i) in training_set:
        # imagine an infinite generator
        # that may repeat examples (if there is only a finite training set)
        loss = f(params, x_i, y_i)
        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params
        if <stopping_condition_met>:
            return params
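
The loop above can be sketched in NumPy as follows. This assumes a toy noise-free least-squares problem, a fixed step count in place of a real stopping condition, and sampling with replacement to mimic the "infinite generator"; none of these choices come from the tutorial.

```python
import numpy as np

# Hypothetical toy problem: recover w_true from noise-free linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

params = np.zeros(2)
learning_rate = 0.01
for step in range(5000):                 # fixed budget instead of a stopping test
    i = rng.integers(len(y))             # draw one example, with replacement
    x_i, y_i = X[i], y[i]
    # gradient of the squared error (x_i . params - y_i)^2 for this one example
    d_loss_wrt_params = 2.0 * x_i * (x_i @ params - y_i)
    params -= learning_rate * d_loss_wrt_params
```

Each update uses a single row, so it is cheap, but the direction it takes is a noisy estimate of the full gradient.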
            
The Theano documentation recommends a twist on SGD using minibatches of fixed size.  This reduces the variance of the gradient estimate somewhat while still speeding up calculations, though the advantage shrinks quickly as the minibatch size grows.
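
The variance claim can be checked empirically. The sketch below assumes a made-up noisy regression problem and batches drawn with replacement. It measures how much the minibatch gradient estimate scatters at a fixed parameter value for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.5, size=1000)
w = np.zeros(2)                                    # fixed point at which gradients are evaluated

# Gradient of the squared error for every single example, shape (1000, 2).
per_example_grads = 2.0 * X * (X @ w - y)[:, None]

def estimate_variance(batch_size, n_trials=2000):
    """Total variance of the minibatch gradient estimate across random batches."""
    idx = rng.integers(0, len(X), size=(n_trials, batch_size))
    estimates = per_example_grads[idx].mean(axis=1)  # one estimate per trial
    return estimates.var(axis=0).sum()

variances = {b: estimate_variance(b) for b in (1, 10, 100)}
```

Under these assumptions the variance falls roughly in proportion to 1 / batch_size, so the typical error of the estimate only falls as the square root of the batch size. That is one way to read the diminishing returns mentioned above.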

<h3>Minibatch Stochastic Gradient Descent</h3>

    for (x_batch,y_batch) in train_batches:
        # imagine an infinite generator
        # that may repeat examples
        loss = f(params, x_batch, y_batch)
        d_loss_wrt_params = ... # compute gradient using theano
        params -= learning_rate * d_loss_wrt_params
        if <stopping_condition_met>:
            return params
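
A NumPy sketch of this loop follows. The train_batches generator from the pseudocode is modeled here by a hypothetical iterate_minibatches helper that reshuffles the data after every pass; the toy data, batch size of 10, learning rate and stopping rule are illustrative assumptions.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (x_batch, y_batch) pairs forever, reshuffling after each pass."""
    n = len(y)
    while True:
        order = rng.permutation(n)
        for start in range(0, n - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            yield X[idx], y[idx]

# Hypothetical toy problem: recover w_true from noise-free linear data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
w_true = np.array([2.0, -3.0])
y = X @ w_true

params = np.zeros(2)
learning_rate = 0.05
for step, (x_batch, y_batch) in enumerate(iterate_minibatches(X, y, 10, rng)):
    residual = x_batch @ params - y_batch
    loss = np.mean(residual ** 2)
    if loss < 1e-12 or step >= 5000:      # stopping condition
        break
    # gradient of the mean squared error over this minibatch only
    d_loss_wrt_params = 2.0 * x_batch.T @ residual / len(y_batch)
    params -= learning_rate * d_loss_wrt_params
```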

Implementing minibatch SGD in Theano looks like the following:

In [None]:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        break  # params (a shared variable) now holds the learned values

<h3>Regularization</h3>
Shrinking the weight matrix has been shown to improve a model's ability to generalize.  This is done by adding a penalty term to the loss function.

Here $R(\theta)$ is a function of the weights $\theta$, namely their $L_p$ norm:

$$ R(\theta) = \left(\sum_{i=1}^{|\theta|} |\theta_i|^{p}\right)^{\frac{1}{p}} $$

Our new loss function becomes

$$ NLL(\theta, D) + \lambda R(\theta)$$
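
As a worked special case, using the standard definitions of the L1 and L2 norms, setting $p = 1$ and $p = 2$ in $R(\theta)$ gives

$$ R_1(\theta) = \sum_{i=1}^{|\theta|} |\theta_i|, \qquad R_2(\theta) = \left(\sum_{i=1}^{|\theta|} \theta_i^{2}\right)^{\frac{1}{2}} $$

The code cell that follows penalizes the squared L2 norm, $R_2(\theta)^2 = \sum_{i=1}^{|\theta|} \theta_i^{2}$, which drops the square root. With separate weights $\lambda_1$ and $\lambda_2$ (names assumed here to match lambda_1 and lambda_2 in the code), the combined loss reads

$$ NLL(\theta, D) + \lambda_1 R_1(\theta) + \lambda_2 R_2(\theta)^2 $$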


In [None]:
# symbolic Theano variable that represents the L1 regularization term
L1  = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
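
For a numeric feel of how the penalty terms add up, here is a NumPy sketch of the same computation. The function name regularized_loss, the weight values and the NLL value of 1.0 are made up for illustration.

```python
import numpy as np

def regularized_loss(nll, params, lambda_1=0.0, lambda_2=0.0):
    """Add L1 and squared-L2 penalties on `params` to a given NLL value."""
    l1 = np.sum(np.abs(params))      # L1 term
    l2_sqr = np.sum(params ** 2)     # squared L2 term
    return nll + lambda_1 * l1 + lambda_2 * l2_sqr

params = np.array([[1.0, -2.0],
                   [0.5,  0.0]])
# l1 = 3.5 and l2_sqr = 5.25 for these weights
total = regularized_loss(nll=1.0, params=params, lambda_1=0.1, lambda_2=0.01)
```

With both lambdas set to zero the function returns the NLL unchanged, which is a quick sanity check when tuning the penalty weights.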