### [Fast Inference with Early Exit](https://www.intel.ai/fast-inference-with-early-exit/#gs.oc5n5e)

A confidence measure determines whether a prediction made at a certain stage can exit early from the deep learning topology, saving the unnecessary processing in subsequent layers.
 
+ Leverage the variance in difficulty among real-world data and use only part of the network for easy recognition tasks: Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition
+ Multi-Scale Dense Network (MSDNet)
+ Selectively insert exits between specific layers; the check for whether to exit is done after some extra processing on the exit branches themselves: BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks
+ Dynamically route the data and thus skip certain layers of processing along the way: SkipNet: Learning Dynamic Routing in Convolutional Networks


[Distiller](https://ai.intel.com/compressing-deep-learning-models-with-neural-network-distiller/): neural network compression research


[Early exit](https://nervanasystems.github.io/distiller/algo_earlyexit/index.html) is a new feature in Distiller, which is available as an open-source package on [GitHub](https://github.com/NervanaSystems/distiller).




### BranchyNet
<img src="https://d3i71xaburhd42.cloudfront.net/677674e81070879f7b6da6261d0ba174985a3cf6/1-Figure1-1.png" width="150"/>
Adds side-branch classifiers to the main network.

+ Many simple test examples can exit the network early via these branches when they are inferred with high confidence.
+ For difficult examples, the deeper layers of the network are still used.


Joint Optimization:
+ optimizes the weighted loss of all exit points
+ each exit point provides regularization on the others?
+ The final loss function ($N$ is the total number of exit points):

$$
L_{branchynet}(\hat{\mathbf{y}}, \mathbf{y}; \theta) = \sum_{n=1}^N { w_n L(\hat{\mathbf{y}}_{exit_n}, \mathbf{y}; \theta)  }
$$
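
The weighted sum above can be sketched in plain Python. This is only an illustration under stated assumptions: each exit is assumed to output softmax probabilities, $L$ is assumed to be cross-entropy, and the names `cross_entropy`, `branchynet_loss` and the example weights are hypothetical, not taken from the BranchyNet code.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one sample: -log of the probability of the true class."""
    return -math.log(probs[label])

def branchynet_loss(exit_probs, label, weights):
    """Weighted sum of the per-exit losses, one term per exit point."""
    assert len(exit_probs) == len(weights)
    return sum(w * cross_entropy(p, label) for p, w in zip(exit_probs, weights))

# Two exits, three classes; the true class is 0.
exit_probs = [[0.5, 0.3, 0.2], [0.8, 0.1, 0.1]]
loss = branchynet_loss(exit_probs, label=0, weights=[1.0, 0.3])
```

Because every exit contributes to the same scalar, one backward pass updates the shared trunk layers with gradients from all exits.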



Structures of Branches:
+ The computation cost of exiting through a branch should be less than that of exiting at a later exit point
+ Earlier branches have more layers, and later branches have fewer layers.

#### Q&A

How do we choose the weight $w_n$ for each exit point?
+ Give higher weights to early exit points

How do we measure the confidence at an exit point?
+ Compute the entropy at the exit point: $\text{entropy}(\mathbf{y}) = -\sum_{c\in C}{ y_c \log {y_c} }$, where $C$ is the set of classes. Lower entropy means higher confidence.
+ $y_c$ is obtained by a forward pass through the sub-network up to that exit

How do we determine the exit threshold $T_n$ for each exit point?
+ Determined by the application, i.e., any value for which the accuracy and speed meet the requirements.
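
A minimal sketch of the entropy test and the thresholds $T_n$, using the usual negative-sum definition of entropy. The helper names and the threshold values are hypothetical; the exit outputs are passed as an iterator so that a later exit is only evaluated when the earlier ones were not confident enough.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability vector; zero entries contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def early_exit(exit_outputs, thresholds):
    """Return (exit_index, predicted_class) for the first confident exit.

    The final exit always returns, whatever its entropy.
    """
    last = len(thresholds) - 1
    for n, probs in enumerate(exit_outputs):
        if n == last or entropy(probs) < thresholds[n]:
            return n, max(range(len(probs)), key=probs.__getitem__)

easy = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
hard = [[0.40, 0.35, 0.25], [0.10, 0.85, 0.05]]
thresholds = [0.3, 0.0]  # hypothetical values; the last one is never used
easy_result = early_exit(iter(easy), thresholds)
hard_result = early_exit(iter(hard), thresholds)
```

Here the easy sample leaves at the first exit, while the hard one falls through to the final classifier.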

How do we choose the exit locations?
+ Depends on the difficulty of the task (dataset, etc.)

Do we run the forward pass from the first layer again for later exit points? I think that would waste computation.



### SkipNet
![](https://d3i71xaburhd42.cloudfront.net/f37ea0b173dd0403a5028c12746082d31dff60bb/4-Figure2-1.png)
Modifies ResNet

Gating Network:
+ Added to each residual block: takes the output of the previous layer as input and outputs 0/1 to decide whether to skip the block (1: no skip; 0: skip)
+ non-differentiable
    - ~~gradient descent~~
    - soften with softmax to make it differentiable?
    - fidelity + penalty (=reward)
    
Loss Function:
$$
L_{\theta}(g, X) = L(\hat{y}(X, F_{\theta}, g), y) - \frac{\alpha}{N} \sum_{i=1}^N{(1 - g_i ) C_i},
$$
where $g_i$ is the $i$th gating decision (0/1), $N$ is the number of decisions, $C_i$ is a hyperparameter that measures the importance of $F_{\theta}^i$, i.e., how much reward skipping it earns (the authors set $C_i = 1$), $\alpha$ is another hyperparameter, $F_{\theta}^i$ is the set of network parameters of the $i$th layer, including its gating network, and $F_{\theta} = [F_{\theta}^1, F_{\theta}^2, \cdots, F_{\theta}^N]$.
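
To make the arithmetic of the loss concrete, here is a small sketch for a single input. The function name `skipnet_loss` and the numbers are hypothetical; only the formula itself and the choice $C_i = 1$ come from the notes above.

```python
def skipnet_loss(task_loss, gates, alpha, costs=None):
    """Task loss minus the averaged skip reward (alpha / N) * sum((1 - g_i) * C_i)."""
    n = len(gates)
    if costs is None:
        costs = [1.0] * n  # C_i = 1, as the authors choose
    reward = sum((1 - g) * c for g, c in zip(gates, costs))
    return task_loss - alpha / n * reward

# Four gated blocks, two of them skipped (g_i = 0).
value = skipnet_loss(task_loss=0.9, gates=[1, 0, 1, 0], alpha=0.2)
```

With two of four blocks skipped the reward term is $0.2 / 4 \cdot 2 = 0.1$, so the value drops from $0.9$ to about $0.8$; skipping more blocks lowers the loss further as long as the prediction loss does not grow.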


### MSDNet

### BlockDrop
Dynamically removes blocks from the network while keeping high accuracy.



Assumptions:
+ different blocks do not share strong dependencies, but we cannot remove too many layers


The residual blocks that are kept for evaluation can be further pruned to speed up inference.

instance-specific residual block removal scheme

Dropping layers => regularization: Dropout, DropConnect
+ dropping only happens in training, not inference


*Formula 1.* We bound the probabilities with the following formula:
$$
s^\prime = \alpha s + (1 - \alpha) (1 - s),
$$
where $s_i$ is the probability of preserving the $i$th block and $\alpha$ is a hyperparameter set to $0.8$ in the authors' code. In the code, the shape of $s$ is $(\text{batch_size}, \text{num_blocks})$. We use the probabilities to parameterize a Bernoulli distribution and stochastically sample 0/1 decisions from it.
```python
import torch
from torch.distributions import Bernoulli

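# Per-block keep probabilities, shape (batch_size, num_blocks).
s = torch.tensor([[0.23, 0.46, 0.75, 0.52], [0.35, 0.29, 0.52, 0.58]])
distr = Bernoulli(s)
# Sampling is stochastic; one possible draw:
distr.sample() # tensor([[0., 1., 0., 0.], [0., 0., 1., 0.]])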
```
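
The bounding step of Formula 1 is not part of the snippet above, so here is a dependency-free sketch of it. The function name `bound_probs` is hypothetical; it only illustrates that the mapping confines every probability to $[1-\alpha, \alpha]$, so no block is kept or dropped with certainty.

```python
def bound_probs(s, alpha=0.8):
    """Apply s' = alpha * s + (1 - alpha) * (1 - s) element-wise."""
    return [[alpha * p + (1 - alpha) * (1 - p) for p in row] for row in s]

# The extremes 0 and 1 map to 1 - alpha and alpha; 0.5 stays where it is.
s = [[0.0, 0.5, 1.0], [0.23, 0.46, 0.75]]
s_prime = bound_probs(s)
```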

*Formula 2.* The advantage is the reward of the policy sampled from the Bernoulli distribution minus the reward of the maximally probable configuration:
$$
A = R(u) - R(\tilde{u}),
$$
where $u$ is sampled with `Bernoulli(s).sample()` as above, while $\tilde{u}$ is obtained by thresholding: `s[s>=0.5] = 1.0; s[s<0.5] = 0.0;`

*Formula 3.* The gradient of the objective is defined as follows (NOTE: the authors subtract it from the entropy loss and then back-propagate):
$$
\nabla_W J = \mathbb{E}\left[A \nabla_W \sum_{k=1}^K \log[s_k u_k + (1 - s_k) (1 - u_k)]\right],
$$
where $s_k u_k + (1 - s_k) (1 - u_k)$ is the Bernoulli probability mass at $u_k$. For example, if $s_k = 0.46, u_k = 1$, it equals $0.46$; if $s_k = 0.46, u_k = 0$, it equals $0.54$. The objective above therefore weights the log-probability of the sampled policy by its corresponding advantage.
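
The log-probability term can be checked numerically with a short sketch. The function names are hypothetical, and the sign convention of `surrogate_loss` (a quantity to minimize, so that its gradient is $-A$ times the gradient of the log-probability) is an assumption rather than the authors' exact code.

```python
import math

def policy_log_prob(s, u):
    """Sum over blocks of log[s_k * u_k + (1 - s_k) * (1 - u_k)]."""
    return sum(math.log(sk * uk + (1 - sk) * (1 - uk)) for sk, uk in zip(s, u))

def surrogate_loss(s, u, advantage):
    """Negative advantage-weighted log-probability of the sampled policy."""
    return -advantage * policy_log_prob(s, u)

s = [0.46, 0.75]
u = [1, 0]
log_p = policy_log_prob(s, u)  # log(0.46) + log(0.25)
loss = surrogate_loss(s, u, advantage=0.5)
```

In a framework with automatic differentiation, back-propagating through such a scalar would give the sampled estimate of the gradient in Formula 3.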

*Formula 4.* A high reward is given for a correct prediction with low block utilization:
$$
R(u) = \begin{cases}
1-\left(\frac{|u|_0}{K}\right)^2 & \text{if correct}\\
-\gamma & \text{otherwise}
\end{cases}
$$
where $\gamma$ is a penalty hyperparameter, set to $1$ in the official code, $|u|_0$ is the number of blocks that are preserved, and $K$ is the total number of blocks.
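
Formulas 2 and 4 can be combined into a small worked example. The helper names are hypothetical, the sampled policy is just one possible Bernoulli draw, and both policies are assumed to classify the input correctly for the sake of illustration.

```python
def reward(u, correct, gamma=1.0):
    """R(u): 1 - (kept / K)^2 for a correct prediction, -gamma otherwise."""
    if not correct:
        return -gamma
    return 1.0 - (sum(u) / len(u)) ** 2

def greedy_policy(s):
    """Maximally probable configuration: keep block k iff s_k >= 0.5."""
    return [1 if sk >= 0.5 else 0 for sk in s]

s = [0.23, 0.46, 0.75, 0.52]
u_sampled = [0, 1, 0, 0]     # one possible Bernoulli sample
u_greedy = greedy_policy(s)  # [0, 0, 1, 1]
# Assumption: both policies lead to a correct prediction.
advantage = reward(u_sampled, True) - reward(u_greedy, True)
```

The sampled policy keeps one block out of four and the greedy one keeps two, so the advantage is $(1 - 0.25^2) - (1 - 0.5^2) = 0.1875$: the sample is reinforced because it used fewer blocks and was still correct.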
