<a href="https://colab.research.google.com/github/jneeven/Weather-Forecasting-Data/blob/master/Binarized_Neural_Networks_cheatsheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Quantizers and Gradients
Both are explained in the Larq quantizer documentation:
https://larq.dev/api/quantizers/


#### Straight-through Estimator

The straight-through estimator (STE) uses the sign function as the activation in the forward pass:

`f(x) = -1 if x < 0 else 1`

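and the following surrogate gradient in the backward pass (the true derivative of sign is zero almost everywhere, so it cannot be used for training):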

`f'(x) = 1 if abs(x) <= 1 else 0`

<br/>

The gradient is 1 when the input is small (`abs(x) <= 1`) and 0 when it is large. This is the gradient of the hard tanh, `clip(x, -1, 1)`, which stands in for the sign function during backpropagation. \
The reasoning is explained in [Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1](https://arxiv.org/pdf/1602.02830.pdf).\
According to the paper, *the gradient is canceled if x is too large*, and not canceling it significantly worsens performance.
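
To make the two formulas above concrete, here is a minimal NumPy sketch of the forward and backward pass. The function names `ste_sign_forward` and `ste_sign_backward` are made up for this example and are not Larq's API; Larq provides its own implementation under the name `ste_sign`.

```python
import numpy as np

def ste_sign_forward(x):
    """Forward pass: -1 for negative inputs, +1 otherwise (so 0 maps to +1)."""
    return np.where(x < 0, -1.0, 1.0)

def ste_sign_backward(x, upstream_grad):
    """Backward pass: let the upstream gradient through where |x| <= 1,
    and cancel it (set it to 0) everywhere else."""
    return upstream_grad * (np.abs(x) <= 1).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.7, 1.5])
upstream = np.ones_like(x)

binarized = ste_sign_forward(x)           # values: -1, -1, 1, 1, 1
grads = ste_sign_backward(x, upstream)    # values:  0,  1, 1, 1, 0
print(binarized, grads)
```

Only the three inputs inside `[-1, 1]` receive a gradient; the two saturated inputs are cut off.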



## Shift-based BatchNorm
Standard batchnorm is expensive: computing the variance and normalizing by the standard deviation takes many multiplications and divisions. For efficiency, we therefore use shift-based batchnorm (SBN), which approximates batchnorm and replaces almost all of these multiplications with bit shifts.

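If `x` is the vector of activations of a given layer for one batch, we get
````
import numpy as np

def SBN(x, gamma, epsilon=1e-8):
    # The paper's shift operator `a << >> AP2(b)` is emulated here as `a * AP2(b)`:
    # multiplying by a power of 2 is a bit shift in hardware. AP2 is defined below.
    batch_mean = np.mean(x)
    centered = x - batch_mean
    approx_variance = np.mean(centered * AP2(centered))
    normalized = centered * AP2(1 / np.sqrt(approx_variance + epsilon))
    denormalized = AP2(gamma) * normalized
    return denormalized
````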
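where `gamma` is a learnable scale parameter and `AP2` is the approximate power of 2, i.e. `x` rounded to the nearest power of 2 (in log space) with its sign kept:
````
def AP2(x):
    # sign(x) * 2^round(log2(|x|)), using NumPy (np) as in SBN above
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))
````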
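
As a sanity check, the sketch below compares a shift-based batchnorm against exact batchnorm on random data. It is an illustration under assumptions, not the paper's implementation: the names `ap2`, `shift_batchnorm` and `exact_batchnorm` are chosen for this example, shifts are emulated by multiplying with powers of 2, and zeros are mapped to zero in `ap2` to avoid `log2(0)` warnings.

```python
import numpy as np

def ap2(x):
    """Approximate power of 2: sign(x) * 2**round(log2(|x|)); 0 stays 0."""
    x = np.asarray(x, dtype=float)
    with np.errstate(divide="ignore"):  # log2(0) = -inf, and 2**-inf = 0
        exponent = np.round(np.log2(np.abs(x)))
    return np.sign(x) * 2.0 ** exponent

def shift_batchnorm(x, gamma, epsilon=1e-8):
    """Batchnorm where every multiplication has a power-of-2 operand."""
    centered = x - np.mean(x)
    approx_variance = np.mean(centered * ap2(centered))
    inv_std = ap2(1.0 / np.sqrt(approx_variance + epsilon))
    return ap2(gamma) * (centered * inv_std)

def exact_batchnorm(x, gamma, epsilon=1e-8):
    """Reference: standard batchnorm without the shift term beta."""
    centered = x - np.mean(x)
    return gamma * centered / np.sqrt(np.mean(centered ** 2) + epsilon)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1024)

approx = shift_batchnorm(x, gamma=1.0)
exact = exact_batchnorm(x, gamma=1.0)

# Both outputs are the centered input times a positive scale,
# so they differ only by a constant factor.
ratio = np.std(approx) / np.std(exact)
print(f"scale of SBN output relative to exact batchnorm: {ratio:.3f}")
```

Because each `ap2` call is off by at most a factor of √2, the printed ratio stays within a small constant factor of 1. The output is a rescaled version of exact batchnorm, not a distorted one.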

## MISC
The BNN paper linked above also clips the real-valued weights to the range [-1, 1] during training, because they would otherwise grow very large without any impact on the binarized weights.
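
A minimal sketch of this clipping, assuming a plain SGD update for simplicity (the paper uses other optimizers). The helpers `sgd_step_with_clipping` and `binarize` are hypothetical names for this example.

```python
import numpy as np

def sgd_step_with_clipping(real_weights, grad, lr=0.1):
    """One plain SGD step on the real-valued weights, then clip to [-1, 1]."""
    updated = real_weights - lr * grad
    return np.clip(updated, -1.0, 1.0)

def binarize(real_weights):
    """Binarized weights used in the forward pass: -1 if negative, else +1."""
    return np.where(real_weights < 0, -1.0, 1.0)

w = np.array([0.95, -0.2, -0.99])
grad = np.array([-1.0, 0.5, 1.0])

w_new = sgd_step_with_clipping(w, grad)
# The first and last weights would leave [-1, 1] and are clipped back
# to the boundary; the middle one is updated normally.
print(w_new, binarize(w_new))
```

Clipping changes nothing in the forward pass, since `binarize` only looks at the sign. It keeps the real-valued weights close enough to zero that later gradient updates can still flip that sign.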
