# Classifying MNIST digits using Logistic Regression

Following:

http://deeplearning.net/tutorial/logreg.html

## The Model

Logistic regression is a linear probabilistic classifier. Classification is performed by projecting an input vector onto a set of hyperplanes, and the distance from the input to a hyperplane quantifies the probability the input vector is a member of the corresponding class.

The probability an input vector $x$ is a member of class $i$ may be written as:
\begin{equation}
\begin{split}
P(Y = i|x,W,b) =&~ \text{softmax}_i (W x + b) \\
=&~ \frac{e^{W_i x + b_i}}{\sum_j e^{W_j x + b_j}}\\
\end{split}
\end{equation}
where $W$ is a matrix of weights and $b$ is a bias vector.

The model prediction is then
\begin{equation}
y_p = \text{argmax}_i P(Y = i|x, W, b)
\end{equation}
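The two equations above can be sketched in a few lines. This is a minimal illustration in plain NumPy rather than Theano, and it assumes a row-per-example layout (`X @ W + b`, with one column of `W` per class) instead of the column-vector form $Wx + b$ used in the equations. The helper names `softmax`, `predict_proba`, and `predict`, and the toy values, are made up for this example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_proba(X, W, b):
    # X: (n_examples, n_in), W: (n_in, n_out), b: (n_out,)
    return softmax(X @ W + b)

def predict(X, W, b):
    # The predicted class is the one with the highest probability.
    return np.argmax(predict_proba(X, W, b), axis=1)

# Toy values (assumed for illustration): 2 examples, 3 features, 2 classes.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
W = np.array([[2.0, 0.0],
              [0.0, 0.0],
              [0.0, 2.0]])
b = np.zeros(2)

probs = predict_proba(X, W, b)   # each row sums to 1
labels = predict(X, W, b)        # class 0 for the first row, class 1 for the second
```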

In Theano, because the parameters of our model must maintain a persistent state throughout training, we will allocate shared variables for $W$ and $b$.

## Defining a Loss Function

Learning optimal parameters is the process of minimizing a loss function. For multi-class logistic regression, the negative log-likelihood is a common choice of loss function. Minimizing it is equivalent to maximizing the likelihood of the data set $\mathcal{D}$ under the model parameters.

\begin{equation}
\begin{split}
\mathcal{L}(\theta = \left\{W, b\right\}, \mathcal{D}) =&~ \sum_{i=0}^{|\mathcal{D}|} \log P(Y = y^{(i)} | x^{(i)}, W, b) \\
\ell(\theta = \left\{W, b\right\}, \mathcal{D}) =&~ -\mathcal{L}(\theta = \left\{W, b\right\}, \mathcal{D}) \\
\end{split}
\end{equation}
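As a rough sketch of what this loss computes, the NumPy snippet below picks out the probability the model assigns to the correct class of each example and sums the negative logs. The function name, the `reduce` argument, and the toy probabilities are assumptions made for this example. The `"mean"` variant simply divides the sum by the number of examples.

```python
import numpy as np

def negative_log_likelihood(probs, y, reduce="sum"):
    # probs: (n, n_classes), row i holds P(Y = k | x_i) for every class k
    # y:     (n,) integer class labels
    log_p_correct = np.log(probs[np.arange(y.shape[0]), y])
    if reduce == "mean":
        return -log_p_correct.mean()
    return -log_p_correct.sum()

# Toy values (assumed): 2 examples, 2 classes.
probs = np.array([[0.5, 0.5],
                  [0.25, 0.75]])
y = np.array([0, 1])

nll_sum = negative_log_likelihood(probs, y)                  # -(log 0.5 + log 0.75)
nll_mean = negative_log_likelihood(probs, y, reduce="mean")  # the sum divided by 2
```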

Gradient descent is among the simplest methods for minimizing arbitrary non-linear functions. For more, see:

http://deeplearning.net/tutorial/gettingstarted.html#opt-sgd

**Note**: Even though the loss is formally defined as the _sum_ of the individual error terms over the dataset, in the code we will usually use `T.mean`. Averaging makes the choice of learning rate less dependent on the minibatch size.

## Creating a LogisticRegression class

We start by allocating symbolic variables for the training inputs $x$ and their classes $y$ (note that `x` and `y` are defined outside the scope of the object). We also define a symbolic `cost` variable to minimize.
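To make the shape of such a class concrete, here is a minimal NumPy analog. It is not Theano code: there are no symbolic variables, so `x` and `y` are passed in as arrays, and the parameters are ordinary attributes rather than shared variables. The zero initialization, the method names, and the MNIST dimensions (28×28 inputs, 10 classes) are assumptions of this sketch.

```python
import numpy as np

class LogisticRegression:
    """NumPy analog of a multi-class logistic regression classifier."""

    def __init__(self, n_in, n_out):
        # Zero initialization is an assumption of this sketch.
        self.W = np.zeros((n_in, n_out))
        self.b = np.zeros(n_out)

    def p_y_given_x(self, x):
        # Row-wise softmax of the affine map x W + b.
        z = x @ self.W + self.b
        z = z - z.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def y_pred(self, x):
        return np.argmax(self.p_y_given_x(x), axis=1)

    def negative_log_likelihood(self, x, y):
        # Mean rather than sum, as discussed in the note on the loss.
        p = self.p_y_given_x(x)
        return -np.mean(np.log(p[np.arange(y.shape[0]), y]))

# Dummy minibatch of 5 blank "images" with arbitrary labels.
classifier = LogisticRegression(n_in=28 * 28, n_out=10)
x = np.zeros((5, 28 * 28))
y = np.array([0, 1, 2, 3, 4])
cost = classifier.negative_log_likelihood(x, y)
```

With all-zero parameters every class gets probability 1/10, so the cost here equals $\log 10$ regardless of the labels.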

## Learning the Model

With Theano we do not need to manually derive expressions for the gradient of the loss function with respect to the parameters, because Theano performs automatic differentiation. We can use these gradients in training by defining the `updates` of our function so that each call subtracts the `learning_rate` times the gradient from each parameter.

We also use `givens` to pass data into our function more efficiently. 

Our function `train_model` is defined such that:

* the input is the minibatch `index` that, with the batch size, defines `x` and `y`
* the return value is the cost/loss associated with the `x` and `y` defined by the `index`
* on each call, our function first replaces `x` and `y` with `index`ed slices from the data, and then it evaluates the cost of that minibatch and applies the operations we defined in the `updates` list

So each time we call `train_model(index)`, we compute and return the cost of a minibatch while performing a step of MSGD (Minibatch SGD). The algorithm therefore is looping over all the examples in the dataset, considering them in one minibatch at a time, and repeatedly calling `train_model`.
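The behavior of `train_model` described above can be imitated in plain NumPy. The sketch below is an analog, not the Theano function: the minibatch slicing stands in for `givens`, the in-place parameter changes stand in for `updates`, and the gradient of the mean negative log-likelihood is written out by hand because NumPy has no automatic differentiation. The factory `make_train_model`, the `params` dictionary, the synthetic data, and the hyperparameter values are all assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def make_train_model(train_x, train_y, params, batch_size, learning_rate):
    """Return train_model(index), which performs one minibatch SGD step."""
    def train_model(index):
        # Analog of `givens`: select the minibatch identified by `index`.
        start, stop = index * batch_size, (index + 1) * batch_size
        x, y = train_x[start:stop], train_y[start:stop]

        # Cost of this minibatch (mean negative log-likelihood).
        p = softmax(x @ params["W"] + params["b"])
        rows = np.arange(y.shape[0])
        cost = -np.mean(np.log(p[rows, y]))

        # Hand-derived gradient: (p - one_hot(y)) / n, pushed back through x W + b.
        delta = p.copy()
        delta[rows, y] -= 1.0
        delta /= y.shape[0]
        g_W = x.T @ delta
        g_b = delta.sum(axis=0)

        # Analog of `updates`: step against the gradient.
        params["W"] -= learning_rate * g_W
        params["b"] -= learning_rate * g_b
        return cost
    return train_model

# Synthetic, linearly separable data (assumed for illustration only).
rng = np.random.default_rng(0)
train_x = rng.normal(size=(200, 4))
train_y = (train_x[:, 0] > 0).astype(int)

params = {"W": np.zeros((4, 2)), "b": np.zeros(2)}
batch_size = 20
train_model = make_train_model(train_x, train_y, params, batch_size, learning_rate=0.5)

n_batches = train_x.shape[0] // batch_size
epoch_costs = []
for epoch in range(20):
    # One epoch: visit every minibatch once.
    costs = [train_model(i) for i in range(n_batches)]
    epoch_costs.append(float(np.mean(costs)))
```

On this toy problem the average minibatch cost falls from epoch to epoch, which is the behavior one would hope to see from the real `train_model` as well.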

## Testing the model

When testing the model we are interested not only in the likelihood but also in the plain number of misclassified examples. Our `LogisticRegression` class therefore has an extra instance method that builds the symbolic graph for retrieving the number of misclassified examples in each minibatch.

We then create functions `test_model` and `validate_model` that we can use to retrieve this value. These functions take a minibatch and compute the number of items that were misclassified. The only difference between them is that `test_model` draws its data from the testing set and `validate_model` draws its data from the validation set.

## Putting it all together