# Simple Binary Connect
A bare-bones instructional implementation of [BinaryConnect (2015) by Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David](https://arxiv.org/pdf/1511.00363). The rest of the post is organised as follows:

1. Introduction (contents of the README)
2. Objective
3. Some Theory
4. Forward Pass
5. Backward Pass
6. Testing
7. Adding Binarization
8. Testing
9. Discussion

## 1. Introduction
A conversation at work spawned a discussion about the difficulties of successfully deploying Deep Neural Networks on embedded and mobile devices. A (very) rough review reveals three broad approaches:
- Training shallower/smaller architectures to reduce parameters.
- Pruning redundant network connections.
- Quantization of weights while preserving accuracy.

The focus of this tutorial is the BinaryConnect paper, which falls in the third category. The paper proposes a training procedure to learn binarized weights (+1, -1) as opposed to full-precision weights (32 or 64 bits) with minimal loss in accuracy.

## 2. Objective
We will not be replicating the results of the paper. Rather, this will serve as a proof-of-concept for the ...umm, concept. These are the concessions that we will be making in the interest of time:
- Working with MNIST and only MNIST.
- Training a VERY simple model (logistic regression).
- Using slow and un-optimised Python.

The last point is actually more important than you might think. The true potential of such approaches is unlocked by specialised hardware and software that actually exploit the single-bit weights. But you won't get to see that with Python, because I will still use 32-bit integers to represent +1 and -1. So there!
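
To make the storage point concrete, here is a minimal sketch, assuming NumPy and a hypothetical 784 x 10 weight matrix (MNIST pixels by digit classes). It binarizes the weights by sign and compares the 32-bit integer representation with a one-bit-per-weight packing. The names `w`, `w_bin` and `w_packed` are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical full-precision weights: 784 features, 10 classes.
w = rng.standard_normal((784, 10)).astype(np.float32)

# Sign-based binarization: +1 where w >= 0, otherwise -1, stored as 32-bit ints.
w_bin = np.where(w >= 0, 1, -1).astype(np.int32)

# One bit per weight (1 stands for +1, 0 for -1), eight weights per byte.
w_packed = np.packbits(w_bin.ravel() > 0)

print(w_bin.nbytes)     # 784 * 10 * 4 = 31360 bytes
print(w_packed.nbytes)  # 784 * 10 / 8 = 980 bytes
```

The 32x gap is the saving that plain Python code with 32-bit integers leaves on the table.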

## 3. Some Theory
For those unfamiliar with back-propagation, there's an excellent post on the topic by [Michael Nielsen](http://neuralnetworksanddeeplearning.com/chap2.html). Here's a <INSERT_FAVORITE_BIG_NUMBER> foot overview using a logistic regression model.

### 3.1. Model for multi-class logistic regression
I define a multi-class classification problem as follows:
$$
\begin{align}
X & \sim input\ matrix[batch, features] \\
W & \sim weight\ matrix[features, classes] \\
\textbf{b} & \sim bias\ vector[1, classes]
\end{align}
$$

The logistic model $f$ is defined as
$$
f(X, W, \textbf{b}) = softmax(XW + \textbf{b})
$$
where
$$
softmax(v_{ij}) = \frac{e^{v_{ij}}}{\sum_k e^{v_{ik}}}
$$
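
As a rough illustration of the model definition above, here is a minimal NumPy sketch. The shapes follow the definitions of $X$, $W$ and $\textbf{b}$. The random data, the batch size of 4 and the max-subtraction trick for numerical stability are assumptions of this sketch, not something the definition requires.

```python
import numpy as np

def softmax(v):
    """Row-wise softmax of a [batch, classes] matrix."""
    # Subtracting each row's maximum does not change the result,
    # but it keeps exp() from overflowing.
    shifted = v - v.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def f(X, W, b):
    """Model output softmax(XW + b), shape [batch, classes]."""
    return softmax(X @ W + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 784))           # [batch, features]
W = rng.standard_normal((784, 10)) * 0.01   # [features, classes]
b = np.zeros((1, 10))                       # [1, classes], broadcast over the batch

P = f(X, W, b)
print(P.shape)        # (4, 10)
print(P.sum(axis=1))  # every row sums to 1, up to floating-point error
```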

$f$ returns what this post calls the *logits*: the model output, which in this case is a probability distribution over the classes for every sample. (Strictly speaking, "logits" usually refers to the pre-softmax scores $XW + \textbf{b}$; here the term, and the symbol $l$ used later, denotes the softmax output.) We can calculate the error from the correct class label using any number of loss/distance measures. In the spirit of needlessly complicating things, let's continue with [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy). Amongst other things, cross-entropy is helpful as an error measure because it still allows meaningful weight updates when the activations are close to saturation, i.e. when the gradient of the activation function is very close to 0.

We define the cross-entropy loss $XE$ for a target distribution $\textbf{t}$ and predicted distribution $\textbf{p}$ as follows:
$$
XE(\textbf{t}, \textbf{p}) = - \sum_i t_i \log(p_i)
$$
where $i$ denotes the class (or dimension) index.
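
A small sketch of this loss for a single sample, assuming NumPy. The `eps` guard and the example vectors are assumptions added for illustration and are not part of the definition.

```python
import numpy as np

def cross_entropy(t, p, eps=1e-12):
    """XE(t, p) = -sum_i t_i * log(p_i) for a single sample."""
    # eps is an assumption of this sketch: it keeps log() finite
    # if some p_i happens to be exactly 0.
    return -np.sum(t * np.log(p + eps))

t = np.array([0.0, 1.0, 0.0])        # one-hot target: the correct class is 1
p_good = np.array([0.1, 0.8, 0.1])   # most of the mass on the correct class
p_bad = np.array([0.7, 0.2, 0.1])    # most of the mass on a wrong class

print(cross_entropy(t, p_good))  # -log(0.8), about 0.223
print(cross_entropy(t, p_bad))   # -log(0.2), about 1.609
```

With a one-hot target, only the predicted probability of the correct class contributes, so the loss grows as that probability shrinks.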

### 3.2. Calculating error gradients

Training the network requires calculating the error gradients w.r.t. the parameters $W$ and $\textbf{b}$. Since the output is a function of the parameters, we can apply the chain rule repeatedly to get the error gradients w.r.t. the parameters, starting from the gradient w.r.t. the output.

This repeated and recursive transfer of the error is called *backpropagating the error*. If you're not happy with this definition, raise a pull request with a better one.

At this point, I will go over the notation and indices once again because it gets pretty gnarly. 

- $X$: uppercase letters represent row-ordered matrices.
- $\textbf{x}$: bold lowercase letters represent vectors. Here $\textbf{x}$ is a single sample of the batched data in $X$.
- $x_{ij}$: lowercase normal font letters represent single values. $x_{ij}$ represents the $j^{th}$ column element for the $i^{th}$ row in the input matrix $X$.

**Gradients for logits $l$**

$$
\begin{align}
\frac{\partial XE}{\partial l_{ij}} &= \frac{\partial XE(\textbf{t}, \textbf{l}_i)}{\partial l_{ij}}
\end{align}
$$

Note that the gradient is being calculated for a single element $(i, j)$ of the logits matrix $L$. For clarity, I'm assuming there's only one sample in the batch (i.e. $i=1$), which leaves us with the logits vector $\textbf{l}_i$. From here on, I omit the first (sample or batch) index $i$.

Now, with that out of the way, let's continue.
$$
\begin{align}
\frac{\partial XE}{\partial l_j} &= \frac{\partial XE(\textbf{t}, \textbf{l})}{\partial l_j} \\ \\
&= \frac{ \partial \left(- \sum_k t_k \log (l_k)\right) }{\partial l_j} \\ \\
&= - \sum_k t_k \frac{ \partial \log (l_k) }{\partial l_j}
\end{align}
$$

As a general rule, while introducing any new function or expression that involves indices, make it a point to declare new index placeholders - it will lead to fewer errors and a lot less confusion.



Here, $k$ iterates over all class labels. Expanding the sigma for $m$ class labels, we get $m$ different partial derivative terms. For exactly one of these terms, the numerator and denominator have the same index. Concretely, for one term, $k = j$, and for the other $m-1$ terms, $k \neq j$.
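
The claim about the expanded sum can be checked numerically. The sketch below, assuming NumPy, takes a central finite difference of each of the $m$ terms $-t_k \log(l_k)$ with respect to one chosen $l_j$. As in the derivation, the entries of $\textbf{l}$ are treated as independent variables. The values of `t`, `l`, `j` and the step `h` are arbitrary choices for illustration.

```python
import numpy as np

t = np.array([0.2, 0.5, 0.2, 0.1])   # target distribution, m = 4 classes
l = np.array([0.1, 0.6, 0.2, 0.1])   # model output, entries treated as independent
j, h = 1, 1e-6                       # differentiate w.r.t. l_j with step h

def terms(l):
    """The m individual terms -t_k * log(l_k) of the cross-entropy sum."""
    return -t * np.log(l)

# Perturb only the j-th entry, up and down.
l_plus, l_minus = l.copy(), l.copy()
l_plus[j] += h
l_minus[j] -= h

# Central difference of every term w.r.t. l_j: one entry per k.
d_terms = (terms(l_plus) - terms(l_minus)) / (2 * h)
print(d_terms)  # only the entry with k = j is non-zero
```

The $m-1$ terms with $k \neq j$ do not change when $l_j$ is perturbed, so their contribution to the gradient is zero.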