## Improved Techniques for Training Adaptive Deep Networks

### Related Work
#### Adaptive Inference
save computation on "easy" samples

Approach:
learn adaptive network topology selection policies

#### Knowledge Distillation
Outputs from the teacher network are used to supervise the training of the student network

### Contributions
+ gradient equilibrium: resolve the gradient conflict among different classifiers
+ inline subnetwork collaboration & one-for-all knowledge distillation: enhance the collaboration among classifiers


#### 1. Gradient Equilibrium
A weighted cumulative loss function may lead to a gradient imbalance issue because the sub-networks overlap. The gradients of the shared (earlier) layers can become very large, because they accumulate gradients from all exit points after them.


Suppose we have $k$ classifiers. At the $i$th branch (indexing from 1), we rescale the gradients from the $i$th classifier by $\frac{1}{k - i + 1}$ and those from the subsequent $(k - i)$ classifiers by $\frac{k - i}{k - i + 1}$.

For example, suppose we have $3$ blocks and each block has $1$ classifier ($k = 3$). The parameters of each classifier receive the full gradients during the backward pass, while the parameters of the $i$th block receive $\frac{1}{k-i+1}$ of the gradients from the $i$th classifier plus the rescaled gradients from the subsequent classifiers.

<img src="images/gradient_equilibrium.png"/>
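The rescaling factors can be checked numerically. The following is a minimal sketch, not the paper's implementation: the helper names `branch_scales` and `effective_weight` are hypothetical, and it assumes that the gradient of classifier $j$ passes through the rescaling at every branch between it and the block in question. Under that assumption, the weights with which all later classifiers reach a given block sum to 1, instead of growing with the number of exits.

```python
from fractions import Fraction


def branch_scales(k, i):
    """Rescale factors at the i-th branch (1-indexed) of a k-classifier network."""
    own = Fraction(1, k - i + 1)             # applied to the i-th classifier's gradient
    downstream = Fraction(k - i, k - i + 1)  # applied to gradients from classifiers i+1..k
    return own, downstream


def effective_weight(k, block, classifier):
    """Weight with which `classifier`'s gradient reaches the parameters of `block`.

    Assumption: the gradient is rescaled once at its own branch and once more
    at every earlier branch it crosses on the way back to `block`.
    """
    if classifier < block:
        return Fraction(0)  # earlier exits do not depend on later blocks
    weight = branch_scales(k, classifier)[0]
    for branch in range(block, classifier):
        weight *= branch_scales(k, branch)[1]
    return weight


k = 3
weights = [
    [effective_weight(k, block, clf) for clf in range(1, k + 1)]
    for block in range(1, k + 1)
]
for block, row in enumerate(weights, start=1):
    print(f"block {block}: {[str(w) for w in row]} (sum = {sum(row)})")
```

For $k = 3$, the first block receives each of the three classifier gradients with weight $\frac{1}{3}$, the second block receives the last two with weight $\frac{1}{2}$ each, and the third block receives only its own classifier's gradient with weight $1$.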

#### 2. Collaboration
Previous works treat the multiple classifiers independently, except that their losses are simply summed up during training.
+ Inline Subnetwork Collaboration (ISC): Add a connection from the $i$-th classifier to the $(i+1)$-th classifier in the forward pass, but ignore its gradients in the backward pass (to prevent the early classifiers from being influenced by the later ones)
+ One-for-all Knowledge Distillation (OFA): We use the logits of the final classifier as the knowledge to facilitate the learning of all earlier classifiers
+ The loss function of the $i$-th classifier consists of the cross-entropy loss and the alignment of soft class probabilities between the teacher and student models, measured by the Kullback-Leibler divergence.

ISC:
Continuing the illustration above, the only difference is an added connection from the logits of the previous classifier to the current classifier. The previous logits are concatenated with the current logits and then transformed by a simple network (e.g. a fully-connected layer), whose output is fed into the current classifier.
![](images/inline_subnetwork_collaboration.png)
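The forward path of ISC described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: `linear` and `isc_input` are hypothetical helper names, the weights are made-up numbers, and the stop-gradient is only indicated in a comment, since plain Python has no automatic differentiation (in an autograd framework the previous logits would be detached before use).

```python
def linear(x, weight, bias):
    """Fully-connected layer; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def isc_input(prev_logits, current_logits, weight, bias):
    """Build the input of the current classifier from both sets of logits."""
    # prev_logits is treated as a constant here: no gradient should flow
    # back through it to the earlier classifier.
    fused = list(prev_logits) + list(current_logits)  # concatenation
    return linear(fused, weight, bias)


# Toy example with 2 classes (all values are illustrative assumptions).
prev_logits = [2.0, -1.0]
current_logits = [0.5, 0.5]
weight = [
    [0.5, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
]
bias = [0.0, 0.0]
classifier_input = isc_input(prev_logits, current_logits, weight, bias)
```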

OFA:
The final classifier is trained using cross entropy with labels, while the early classifiers are trained using the combination of cross entropy with labels and knowledge of logits from the final classifier.
![](images/one_for_all_knowledge_distillation.png)
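The OFA training objective can be sketched as follows. This is a minimal sketch rather than the paper's exact formulation: the weighting factor `alpha` and the softmax `temperature` are assumptions introduced for illustration (the notes above do not specify them), and the teacher logits are treated as fixed targets.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the predicted distribution against a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ofa_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Loss of an early classifier: cross-entropy plus distillation term.

    `alpha` and `temperature` are hypothetical hyperparameters.
    """
    ce = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)  # soft targets (final classifier)
    p_student = softmax(student_logits, temperature)
    return ce + alpha * kl_divergence(p_teacher, p_student)


# Toy example: one early classifier (student) and the final classifier (teacher).
early_logits = [0.2, 0.1, -0.3]
final_logits = [2.0, 0.5, -1.0]
label = 0
early_loss = ofa_loss(early_logits, final_logits, label)
final_loss = cross_entropy(final_logits, label)  # final classifier: labels only
```

When the student's logits already match the teacher's, the distillation term vanishes and the loss reduces to the plain cross-entropy.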