# Decoupled Neural Interfaces using Synthetic Gradients

* 싸이그래머 (Psygrammer) / DGM: Part 1 - DeepMind paper review [1]
* 김무성

# Contents
* 1 Introduction
* 2 Decoupled Neural Interfaces
    - 2.1 Synthetic Gradient for Feed-Forward Networks
    - 2.2 Synthetic Gradient for Recurrent Networks
* 3 Experiments
    - 3.1 Feed-Forward Networks
    - 3.2 Recurrent Neural Networks
    - 3.3 Multi-Network System

#### References
* [1] (deepmind original paper) Decoupled Neural Interfaces using Synthetic Gradients - https://arxiv.org/pdf/1608.05343.pdf
* [2] (deepmind blog) Decoupled Neural Interfaces using Synthetic Gradients - https://deepmind.com/blog/decoupled-neural-networks-using-synthetic-gradients/
* [3] (모두연, Modulabs) Decoupled Neural Interfaces using Synthetic Gradients - https://norman3.github.io/papers/docs/synthetic_gradients
* [4] (code) Image classification with synthetic gradient in tensorflow - https://github.com/andrewliao11/DNI-tensorflow

# 1 Introduction

#### computational graph

* Each layer (or module) in a directed neural network can be considered a computation step, that transforms its incoming data. 
* These modules are connected via directed edges, creating a forward processing graph which defines the flow of data from the network inputs, through each module, producing network outputs. 
* Defining a loss on outputs allows errors to be generated, and propagated back through the network graph to provide a signal to update each module.

<img src="http://nbviewer.jupyter.org/github/KonanAcademy/deep/blob/master/seminar/season01/ch06/figures/cap6.3.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/KonanAcademy/deep/blob/master/seminar/season01/ch06/figures/cap6.81.png" width=600 />
<img src="http://nbviewer.jupyter.org/github/KonanAcademy/deep/blob/master/seminar/season01/ch06/figures/cap6.82.png" width=600 />

<img src="http://nbviewer.jupyter.org/github/KonanAcademy/deep/blob/master/seminar/season01/ch06/figures/chain.png" width=600 />

#### locking

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-1.width-1500.png" width=600 />

This process results in several forms of locking, namely:
* (i) Forward Locking 
    - no module can process its incoming data before the previous nodes in the directed forward graph have executed; 
* (ii) Update Locking 
    - no module can be updated before all dependent modules have executed in forwards mode;
* (iii) Backwards Locking
    - in many credit-assignment algorithms (including backpropagation [18]), no module can be updated before all dependent modules have executed in both forwards mode and backwards mode.

Forwards, update, and backwards locking constrain us to running and updating neural networks in a sequential, synchronous manner.

Though seemingly benign when training simple feed-forward nets, this poses problems when thinking about creating systems of networks acting in multiple environments at different and possibly irregular or asynchronous timescales.
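The three forms of locking listed above can be made concrete with a minimal sketch. It assumes a toy chain of scalar modules `h_i = tanh(w_i * h_{i-1})` and a squared-error loss; the function names `forward` and `backward` and all numeric values are illustrative and not taken from the paper.

```python
import math

def forward(weights, x):
    """Forward locking: module i cannot run before module i-1 has produced its output."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def backward(weights, activations, target):
    """Backwards locking: dL/dw_i is only known after every later module
    has run forwards and then backwards."""
    grads = [0.0] * len(weights)
    # dL/dh_N for L = 0.5 * (h_N - target)^2, available only after the full forward pass
    delta = activations[-1] - target
    for i in reversed(range(len(weights))):
        h_in, h_out = activations[i], activations[i + 1]
        d_pre = delta * (1.0 - h_out ** 2)   # gradient through tanh
        grads[i] = d_pre * h_in              # dL/dw_i
        delta = d_pre * weights[i]           # dL/dh_{i-1}, handed to the module below
    return grads

weights = [0.5, -1.2, 0.8]
acts = forward(weights, x=1.0)
grads = backward(weights, acts, target=0.3)
```

In this sketch `grads[0]` is the last value to be computed, even though the first module finished its forward computation before any other module started.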

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-2.width-1500.png" width=300 />

#### To remove update locking for neural networks

<img src="http://sebastianraschka.com/images/faq/visual-backpropagation/forward-propagation.png" width=400 />
<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-1.width-1500.png" width=400 />
<img src="http://sebastianraschka.com/images/faq/visual-backpropagation/backpropagation.png" width=400 />

The goal of this work is to remove update locking for neural networks. This is achieved by removing backpropagation.

To update weights $θ_i$ of module $i$ we drastically approximate the function implied by backpropagation:

<img src="figures/cap1.png" width=600 />
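For readers who cannot load the figure, the approximation can be written out as follows. This is a transcription of the corresponding equation in the paper [1] as best recalled, so the figure and the paper remain the authoritative form. Here $h$ denotes activations, $x$ inputs, $y$ supervision, and $L$ the overall loss.

$$
\frac{\partial L}{\partial \theta_i}
= f_{\text{Bprop}}\big((h_i, x_i, y_i, \theta_i), (h_{i+1}, x_{i+1}, y_{i+1}, \theta_{i+1}), \ldots\big)\,\frac{\partial h_i}{\partial \theta_i}
\simeq \hat{f}_{\text{Bprop}}(h_i)\,\frac{\partial h_i}{\partial \theta_i}
$$

The right-hand side depends only on the local activation $h_i$, which is what removes the dependence on the rest of the graph.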

#### synthetic gradient

In this paper, <font color="red">we remove the reliance on backpropagation to get error gradients</font>, and <font color="blue">instead learn a parametric model which predicts</font> what the gradients will be based upon only local information. We call these predicted gradients synthetic gradients.[2]

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-3.width-1500.png" width=400 />

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-4.width-1500.png" width=600 />
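The idea of predicting gradients from local information can be sketched for a single module. The sketch assumes a toy scalar module `h = w * x`, a loss `L = 0.5 * (h - y)^2`, and a linear synthetic gradient model `M(h) = a * h + b`; the class name `SyntheticGradientModel`, the function `train`, and the learning rates are hypothetical choices for illustration, not the paper's implementation.

```python
class SyntheticGradientModel:
    """Hypothetical linear model M(h) = a * h + b that predicts dL/dh from h alone."""

    def __init__(self):
        self.a = 0.0
        self.b = 0.0

    def predict(self, h):
        return self.a * h + self.b

    def fit_step(self, h, true_grad, lr):
        # One SGD step on the regression loss 0.5 * (M(h) - true_grad)^2.
        err = self.predict(h) - true_grad
        self.a -= lr * err * h
        self.b -= lr * err


def train(steps=500, lr=0.1, sg_lr=0.1):
    w, x, y = 2.0, 1.0, 0.5      # toy module h = w * x with a fixed target y
    model = SyntheticGradientModel()
    for _ in range(steps):
        h = w * x
        # The module is updated right away from the predicted gradient,
        # without waiting for the loss to be evaluated.
        w -= lr * model.predict(h) * x
        # The true gradient of L = 0.5 * (h - y)^2 arrives afterwards
        # and is used only to train the synthetic gradient model.
        true_grad = h - y
        model.fit_step(h, true_grad, sg_lr)
    return w, model


w, model = train()
```

The module never sees the true gradient directly; it only ever consumes `model.predict(h)`, while the true gradient serves as a regression target for the model.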

# 2 Decoupled Neural Interfaces
* 2.1 Synthetic Gradient for Feed-Forward Networks
* 2.2 Synthetic Gradient for Recurrent Networks

#### communication protocol & update decoupled

##### Figure 1. (a) - General communication protocol

<img src="figures/cap2.png" width=600 />

<img src="figures/cap3.png" width=600 />

#### DNI(Decoupled Neural Interfaces)

We can apply this protocol to neural networks communicating, resulting in what we call Decoupled Neural Interfaces (DNI). 

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-5.width-1500.png" width=600 />

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/documents/3-6.gif" width=300 />

#### synthetic gradients

We concentrate our empirical study on differentiable networks trained with backpropagation and gradient-based updates. Therefore, we focus on producing error gradients as the feedback $\hat{δ}_A$, which we dub synthetic gradients.

## 2.1 Synthetic Gradient for Feed-Forward Networks

#### FFN

<img src="figures/cap4.png" width=600 />

##### Figure 1. (b)

<img src="figures/cap5.png" width=600 />

<img src="figures/bp.png" width=600 />

<img src="figures/cap3.png" width=600 />

#### update decoupled & train synthetic gradient model (its parameters)

##### Figure 1. (c)

<img src="figures/cap6.png" width=600 />

<img src="figures/cap3.png" width=600 />

##### Figure 1. (d)

Furthermore, for a feed-forward network, we can use synthetic gradients as communication feedback to decouple every layer in the network, as shown in Fig. 1 (d). 

##### The execution of this process is illustrated in Fig. 2.

<img src="figures/cap7.png" width=600 />

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/documents/3-6.gif" width=300 />
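Decoupling every layer, as described above, can be sketched as follows. The sketch assumes a chain of scalar layers `h_i = tanh(w_i * h_{i-1})`, one linear synthetic gradient model per interface, and a regression target for each model obtained by backpropagating the next layer's (synthetic or true) gradient through that one layer only. The function `dni_step` and the parameter layout `sg_params[i] = [a, b]` are hypothetical.

```python
import math

def dni_step(weights, sg_params, x, y, lr=0.05, sg_lr=0.05):
    """One decoupled pass over a scalar chain; updates weights and sg_params in place.

    sg_params[i] = [a, b] defines a hypothetical model M(h) = a * h + b that
    predicts dL/dh for the output of layer i (every layer except the last).
    """
    n = len(weights)
    h_in = x
    for i in range(n):
        w = weights[i]
        h = math.tanh(w * h_in)
        if i < n - 1:
            a, b = sg_params[i]
            delta = a * h + b        # synthetic gradient, available immediately
        else:
            delta = h - y            # true dL/dh for L = 0.5 * (h - y)^2
        d_pre = delta * (1.0 - h * h)
        # Layer i is updated as soon as its own forward step has run.
        weights[i] = w - lr * d_pre * h_in
        if i > 0:
            # Target for the model below: this layer's gradient signal
            # backpropagated through this single layer (using the old weight).
            target = d_pre * w
            a_prev, b_prev = sg_params[i - 1]
            err = (a_prev * h_in + b_prev) - target
            sg_params[i - 1] = [a_prev - sg_lr * err * h_in, b_prev - sg_lr * err]
        h_in = h
    return 0.5 * (h_in - y) * (h_in - y)

weights = [0.5, -1.2, 0.8]
sg_params = [[0.0, 0.0], [0.0, 0.0]]
losses = [dni_step(weights, sg_params, x=1.0, y=0.3) for _ in range(200)]
```

With zero-initialised models, only the last layer moves on the first pass; the lower layers start to learn once the models above them begin to produce non-zero predictions.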

#### context

This process allows each layer to be updated as soon as a forward pass has been executed. Additionally, if any supervision or context $c$ is available at the time of synthetic gradient computation, the synthetic gradient model can take this as an extra input, $\hat{δ} = M(h, c)$.

<img src="https://norman3.github.io/papers/images/synthetic_gradients/f03.png" width=400 />
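Why the context input can help is easiest to see on a toy regression problem. The sketch below assumes a scalar activation `h`, a binary label `c` used as context, and a true gradient `h - targets[c]` that depends on the label. It compares a model `M(h) = a * h + b` with a conditioned model `M(h, c) = a * h + b[c]`. The function `fit`, the data, and the target values are made up for illustration.

```python
def fit(samples, use_context, epochs=200, lr=0.1):
    """Fit a linear synthetic gradient model by SGD and return its mean squared error.

    With use_context=True the bias depends on the label c (M(h, c) = a*h + b[c]);
    otherwise a single shared bias is used (M(h) = a*h + b[0]).
    """
    a, b = 0.0, [0.0, 0.0]
    for _ in range(epochs):
        for h, c, grad in samples:
            k = c if use_context else 0
            err = a * h + b[k] - grad
            a -= lr * err * h
            b[k] -= lr * err
    squared = [(a * h + b[c if use_context else 0] - g) ** 2 for h, c, g in samples]
    return sum(squared) / len(samples)

# True gradient of L = 0.5 * (h - targets[c])^2 with respect to h.
targets = {0: -1.0, 1: 1.0}
samples = [(h, c, h - targets[c])
           for h in (-1.0, -0.5, 0.0, 0.5, 1.0)
           for c in (0, 1)]

mse_plain = fit(samples, use_context=False)
mse_context = fit(samples, use_context=True)
```

In this setup the unconditioned model cannot tell the two labels apart, so its error stays bounded away from zero, while the conditioned model can represent the true gradient exactly.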

## 2.2 Synthetic Gradient for Recurrent Networks

<img src="figures/cap8.png" width=600 />

<img src="figures/cap9.png" width=600 />

<img src="http://sanghyukchun.github.io/images/post/89-1.PNG" width=600 />

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-7.width-1500.png" width=600 />

<img src="http://www.wildml.com/wp-content/uploads/2015/10/rnn-bptt-with-gradients.png" width=600 />

#### truncated BPTT

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-8.width-1500.png" width=600 />

#### BPTT using DNI

<img src="https://storage.googleapis.com/deepmind-live-cms.google.com.a.appspot.com/images/3-9.width-1500.png" width=600 />

##### Figure 3. (a)

<img src="figures/cap10.png" width=600 />

<img src="figures/cap8.png" width=600 />

#### chunking

This amounts to taking the infinitely unrolled RNN as the full neural network $F_{1}^{\infty}$, and chunking it into an infinite number of sub-networks where the recurrent core is unrolled for $T$ steps, giving $F_{t}^{t+T-1}$.

This scheme can be implemented very efficiently by exploiting the recurrent nature of the network, as shown in Fig. 4. 

<img src="figures/cap11.png" width=600 />

<img src="figures/cap12.png" width=600 />
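The chunking scheme can be sketched for a scalar RNN. The sketch assumes the recurrence `h_t = tanh(w * h_{t-1} + u * x_t)`, a per-step loss `0.5 * (h_t - y_t)^2`, and a linear model `M(h) = a * h + b` that supplies the gradient at the end of each chunk of `T` steps. The gradient that flows out of the start of a chunk is used as the regression target for the model's prediction at that state. The names `bptt_chunk` and `dni_pass` are hypothetical.

```python
import math

def bptt_chunk(w, u, h0, xs, ys, boundary_grad_fn):
    """Unroll one chunk and backpropagate through it.

    boundary_grad_fn(h_end) supplies dL_future/dh at the chunk's last state:
    a constant 0 gives truncated BPTT, a synthetic gradient model gives DNI.
    Returns (last state, dL/dw, dL/du, gradient flowing out at h0).
    """
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    dh = boundary_grad_fn(hs[-1])
    gw = gu = 0.0
    for t in reversed(range(len(xs))):
        dh += hs[t + 1] - ys[t]              # loss gradient at this step
        d_pre = dh * (1.0 - hs[t + 1] ** 2)  # gradient through tanh
        gw += d_pre * hs[t]
        gu += d_pre * xs[t]
        dh = d_pre * w                       # gradient for the previous state
    return hs[-1], gw, gu, dh

def dni_pass(w, u, sg, xs, ys, T, lr=0.05, sg_lr=0.05):
    """Process a sequence chunk by chunk, updating the core after every chunk."""
    h = 0.0
    for start in range(0, len(xs), T):
        cx, cy = xs[start:start + T], ys[start:start + T]
        a, b = sg
        h_next, gw, gu, dh0 = bptt_chunk(w, u, h, cx, cy, lambda h_end: a * h_end + b)
        # dh0 (which includes the bootstrapped boundary gradient) is the
        # target for the model's prediction at the chunk's starting state.
        err = a * h + b - dh0
        sg = [a - sg_lr * err * h, b - sg_lr * err]
        w, u = w - lr * gw, u - lr * gu
        h = h_next
    return w, u, sg

xs = [0.5, -1.0, 0.3, 0.8]
ys = [0.1, -0.2, 0.0, 0.4]
w_new, u_new, sg_new = dni_pass(0.7, -0.4, [0.0, 0.0], xs, ys, T=2)
```

If the boundary gradient were exact rather than predicted, summing the per-chunk gradients would reproduce full BPTT over the whole sequence; the synthetic gradient model approximates that missing term.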

# 3 Experiments
* 3.1 Feed-Forward Networks
* 3.2 Recurrent Neural Networks
* 3.3 Multi-Network System

## 3.1 Feed-Forward Networks
* Every layer DNI
* Sparse Updates
* Complete Unlock

<img src="figures/cap13.png" width=600 />

### Every layer DNI

### Sparse Updates

<img src="figures/cap14.png" width=600 />

### Complete Unlock

<img src="figures/cap15.png" width=600 />

## 3.2 Recurrent Neural Networks
* Copy and Repeat Copy
* Language Modelling

<img src="figures/cap16.png" width=600 />

### Copy and Repeat Copy

<img src="figures/cap17.png" width=600 />

### Language Modelling

<img src="figures/cap18.png" width=600 />

## 3.3 Multi-Network System

# 4 Conclusion
