## **Differentiable Neural Computers**

Hybrid Computing using a neural network with dynamic external memory (Graves et al. 2016)

Konstantinos Kogkalidis

May 28, 2018

Logic and Computation

# Overview: Probabilistic Programming

#### Cross-domain

- Data Flow Programming
- Bayesian Reasoning
- Machine Learning
- Functional Programming

# Overview: Probabilistic Programming

#### Cross-domain

- Data Flow Programming
- Bayesian Reasoning
- Machine Learning
- Functional Programming

Intuition: "Rather than explicitly write a program, write some constraints on the behavior of the desired program and use computational tools to search the program space for models satisfying these constraints."

# Overview: Probabilistic Programming

#### Cross-domain

- Data Flow Programming
- Bayesian Reasoning
- Machine Learning
- Functional Programming

Intuition: "Rather than explicitly write a program, write some constraints on the behavior of the desired program and use computational tools to search the program space for models satisfying these constraints."

| Program       | Model      |
|---------------|------------|
| Discrete      | Continuous |
| Deterministic | Stochastic |
| Static        | Adaptive   |

Overview: DNC

#### Differentiable Neural Computer

A recurrent neural network coupled with an external memory.

Overview: DNC

### Differentiable Neural Computer

A recurrent neural network coupled with an external memory.

- Inspired by biological memory
- Reimaging of classical concepts of computation
- Extension of NTMs
  - End-to-end differentiable
  - Auto-associative memory
  - Turing complete
  - + Stronger memory management

## **Introduction: Classic Computation**

#### Von Neumann architecture



#### **Introduction: Neural Networks**

#### Simple Neural Net



 $NN: \overline{\boldsymbol{x}_{t_i}} \mapsto \overline{\boldsymbol{y}_{t_i}}$ No memory

#### **Introduction: Neural Networks**

### Simple Neural Net



 $NN : \boldsymbol{x}_{t_i} \mapsto \boldsymbol{y}_{t_i}$ No memory

### Simple Recurrent Net



 $RNN: \boldsymbol{x}_{t_0} \otimes \boldsymbol{x}_{t_1} \otimes \cdots \otimes \boldsymbol{x}_{t_i} \mapsto \boldsymbol{y}_{t_i}$ Finite memory

#### **Introduction: Neural Networks**

### Simple Neural Net



 $NN: oldsymbol{x}_{t_i} \mapsto oldsymbol{y}_{t_i}$ No memory

### Simple Recurrent Net



 $RNN: \boldsymbol{x}_{t_0} \otimes \boldsymbol{x}_{t_1} \otimes \cdots \otimes \boldsymbol{x}_{t_i} \mapsto \boldsymbol{y}_{t_i}$ Finite memory

"If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs."

Train a RNN to act as the controller of a memory matrix M of N addresses through R read heads and one write head.

Train a RNN to act as the controller of a memory matrix M of N addresses through R read heads and one write head.

### 1. Content Lookup

- Attention over memory defined by weightings  $w \in S^N$
- Compare controller output with memory objects (auto-associative memory)
- Allow partial matches (pattern completion)

Train a RNN to act as the controller of a memory matrix M of N addresses through R read heads and one write head.

- 1. Content Lookup
  - Attention over memory defined by weightings  $w \in S^N$
  - Compare controller output with memory objects (auto-associative memory)
  - Allow partial matches (pattern completion)
- 2. Sequential Retrieval
  - Fill  $L \in [0,1]^{N \times N}$  indexing temporal transitions
  - Shift operations defined by Lw,  $L^Tw$

Train a RNN to act as the controller of a memory matrix M of N addresses through R read heads and one write head.

### 1. Content Lookup

- Attention over memory defined by weightings  $w \in S^N$
- Compare controller output with memory objects (auto-associative memory)
- Allow partial matches (pattern completion)

#### 2. Sequential Retrieval

- Fill  $L \in [0,1]^{N \times N}$  indexing temporal transitions
- Shift operations defined by Lw,  $L^{\top}w$

#### 3. Dynamic Allocation

- Mark memory locations with  $\{0,1\}$  to signal usage
- Manipulate signals during R/W operations to enable reallocation
- Generalization to unbounded memory

#### Controller: Overview

A deep long short-term memory network receiving input:

$$\boldsymbol{\mathcal{X}}_t = [\boldsymbol{x}_t; \boldsymbol{r}_{t-1}^1; \dots; \boldsymbol{r}_{t-1}^R]$$
 (timestep t)

and producing output:

$$(\boldsymbol{v}_T, \boldsymbol{\xi}_T) = \mathcal{N}([\boldsymbol{\chi}_1; \dots; \boldsymbol{\chi}_T]; \vartheta)$$
 (entire sequence)

where  ${\cal N}$  a set of state equations and  $\vartheta$  their trainable parameters.

## **Controller: State Equations**

A more detailed look into  $\mathcal{N}$ .

## **Controller: State Equations**

A more detailed look into  $\mathcal{N}$ .

### LSTM layer equations

Input:  $[\boldsymbol{\chi}_t; \boldsymbol{h}_{t-1}^l; \boldsymbol{h}_t^{l-1}]$ 

Output:  $h_t^I$ 

### **Controller: State Equations**

A more detailed look into  $\mathcal{N}$ .

## LSTM layer equations

```
Input: [\boldsymbol{\mathcal{X}}_t; \boldsymbol{h}_{t-1}^l; \boldsymbol{h}_t^{l-1}]
```

Output:  $h_t^I$ 

$$\begin{split} & \boldsymbol{i}_t^l = \zeta_i([\boldsymbol{\mathcal{X}}_t; \boldsymbol{h}_{t-1}^l; \boldsymbol{h}_t^{l-1}]) & \text{(input gate)} \\ & \boldsymbol{f}_t^l = \zeta_f([\boldsymbol{\mathcal{X}}_t; \boldsymbol{h}_{t-1}^l; \boldsymbol{h}_t^{l-1}]) & \text{(forget gate)} \\ & \boldsymbol{s}_t^l = \boldsymbol{f}_t^l \boldsymbol{s}_{t-1}^l + \boldsymbol{i}_t^l \zeta_s([\boldsymbol{\mathcal{X}}_t; \boldsymbol{h}_{t_1}^l; \boldsymbol{h}_t^{l-1}]) & \text{(state update)} \\ & \boldsymbol{o}_t^l = \zeta_o([\boldsymbol{\mathcal{X}}_t; \boldsymbol{h}_{t-1}^l; \boldsymbol{h}_t^{l-1}]) & \text{(output gate)} \\ & \boldsymbol{h}_t^l = \boldsymbol{o}_t^l \zeta_h(\boldsymbol{s}_t^l) & \text{(hidden layer)} \end{split}$$

# Controller: Signal-Flow (1/2)

## Single LSTM layer



## Controller: Signal-Flow (2/2)

### LSTM Network (multiple layers)



## **Controller: Outputs**

$$(oldsymbol{v}_t, oldsymbol{\xi}_t) = \mathcal{N}([oldsymbol{\mathcal{X}}_1; \dots; oldsymbol{\mathcal{X}}_T]; artheta)$$

### **Controller: Outputs**

$$(\boldsymbol{v}_t, \boldsymbol{\xi}_t) = \mathcal{N}([\boldsymbol{\chi}_1; \dots; \boldsymbol{\chi}_T]; \vartheta)$$

Intermediate output 
$$m{v}_t = W_y[m{h}_t^1; \dots; m{h}_t^L]$$
 
$$m{y}_t = m{v}_t + W_R[m{r}_t^1; \dots; m{r}_t^R] \qquad \qquad \text{(Memory-conditioning)}$$

### **Controller: Outputs**

$$(oldsymbol{v}_t, oldsymbol{\xi}_t) = \mathcal{N}([oldsymbol{\mathcal{X}}_1; \dots; oldsymbol{\mathcal{X}}_T]; artheta)$$

Intermediate output 
$$v_t = W_y[\pmb{h}_t^1; \dots; \pmb{h}_t^L]$$
 
$$y_t = v_t + W_R[\pmb{r}_t^1; \dots; \pmb{r}_t^R] \qquad \qquad \text{(Memory-conditioning)}$$

# Interface vector $\xi_t = W_{\xi}[h_t^1; \dots; h_t^L]$

- Read keys:  $\mathbf{k}_t^{r,i}$
- Read strengths:  $\beta_t^{r,i}$
- Write key:  $\mathbf{k}_t^w$
- Write strength:  $\beta_t^w$
- Erase vector:  $e_t$

- Write vector:  $\boldsymbol{v}_t$
- Free gates:  $\phi_t^i$
- Allocation gate:  $g_t^a$
- Write gate:  $g_t^w$
- Read modes:  $\pi_t^i$

## Memory Adressing: Content-Lookup

```
R read keys \mathbf{k}^{r,i} \in \mathbb{R}^W, i = 1 \dots R
R read strengths \beta^{r,i} \in [1,\infty), i = 1 \dots R
Write key \mathbf{k}^w \in \mathbb{R}^W
Write strength \beta^w \in [1,\infty)
```

# Memory Adressing: Content-Lookup

R read keys  $\mathbf{k}^{r,i} \in \mathbb{R}^W, i = 1...R$ R read strengths  $\beta^{r,i} \in [1,\infty), i = 1...R$ Write key  $\mathbf{k}^w \in \mathbb{R}^W$ Write strength  $\beta^w \in [1,\infty)$ 

Matching function  $\mathcal D$  comparing memory contents

## Memory Adressing: Content-Lookup

R read keys 
$$\mathbf{k}^{r,i} \in \mathbb{R}^W, i = 1 \dots R$$
  
R read strengths  $\beta^{r,i} \in [1,\infty), i = 1 \dots R$   
Write key  $\mathbf{k}^w \in \mathbb{R}^W$   
Write strength  $\beta^w \in [1,\infty)$ 

Matching function  $\mathcal D$  comparing memory contents

Weighting function  $\mathcal C$  normalizing and sharpening matches

$$C(M, \mathbf{k}, \beta)[i] = \frac{\exp\{\beta \mathcal{D}(\mathbf{k}, M[i,:])\}}{\sum_{j} \exp\{\beta \mathcal{D}(\mathbf{k}, M[j,:])\}}$$

## Memory Adressing: R/W

Attention dictated by weightings  $\mathbf{w} \in \mathcal{S}^N$ Erase vector  $\mathbf{e}_t \in [0,1]^W$ Write vector  $\mathbf{v}_t \in \mathbb{R}^W$ 

Read operations

$$\mathbf{r}_t^i = M_t^{\top} \mathbf{w}_t^{r,i}$$

## Memory Adressing: R/W

Attention dictated by weightings  ${m w} \in \mathcal{S}^N$ 

Erase vector  $\boldsymbol{e}_t \in [0,1]^W$ 

Write vector  $\mathbf{v}_t \in \mathbb{R}^W$ 

### Read operations

$$\mathbf{r}_t^i = M_t^{\top} \mathbf{w}_t^{r,i}$$

### Write operations

$$M_t = \underbrace{M_{t-1} \circ (\mathbf{1} - oldsymbol{w}_t^w oldsymbol{e}_t^ op)}_{ ext{erased memory}} + \underbrace{oldsymbol{w}_t^w oldsymbol{v}_t^ op}_{ ext{new write}}$$

## Memory Adressing: Dynamic Allocation

```
Free gates \phi_t^i \in [0, 1]^W
Allocation gate g_t^a \in [0, 1]
Write gate g_t^w \in [0, 1]
```

"Free list" scheme

$$oldsymbol{\psi}_t = \prod_{i=1}^{N} (1 - \phi_t^i w_{t-1}^{r,i})$$
 (memory retention)

$$u_t = (\boldsymbol{u}_{t-1} + \boldsymbol{w}_{t-1}^w + \boldsymbol{u}_{t-1} \circ \boldsymbol{w}_{t-1}^w) \circ \psi_t$$
 (usage tracking)

## Memory Adressing: Dynamic Allocation

Free gates 
$$\phi_t^i \in [0, 1]^W$$
  
Allocation gate  $g_t^a \in [0, 1]$   
Write gate  $g_t^w \in [0, 1]$ 

"Free list" scheme

$$m{\psi}_t = \prod_{i=1}^R (1 - m{\phi}_t^i w_{t-1}^{r,i})$$
 (memory retention)  $u_t = (m{u}_{t-1} + m{w}_{t-1}^w + m{u}_{t-1} \circ m{w}_{t-1}^w) \circ m{\psi}_t$  (usage tracking)

#### Attention shift

- ullet Obtain the allocation vector  $oldsymbol{a}_t$  by normalizing  $oldsymbol{u}_t$
- Shift  $\mathbf{w}_t$  by  $g_t^a \mathbf{a}_t$  and scale by  $g_t^w$

# Memory Adressing: Temporal Linking

Free gates  $\phi_t^i \in [0,1]^W$ 

Memory retention

$$oldsymbol{\psi}_t = \prod_{i=1}^R (\mathbf{1} - oldsymbol{\phi}_t^i w_{t-1}^{r,i})$$

# Memory Adressing: Temporal Linking

Free gates  $\phi_t^i \in [0,1]^W$ 

Memory retention

$$m{\psi}_t = \prod_{i=1}^R (\mathbf{1} - m{\phi}_t^i w_{t-1}^{r,i})$$

Usage tracking

$$u_t = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w + \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^w) \circ \psi_t$$

# Memory Adressing: Temporal Linking

Free gates  $\phi_t^i \in [0,1]^W$ 

Memory retention

$$m{\psi}_t = \prod_{i=1}^R (\mathbf{1} - m{\phi}_t^i w_{t-1}^{r,i})$$

Usage tracking

$$u_t = (\mathbf{u}_{t-1} + \mathbf{w}_{t-1}^w + \mathbf{u}_{t-1} \circ \mathbf{w}_{t-1}^w) \circ \psi_t$$

Allocation vector  $\boldsymbol{a}_t$  given by normalizing and sorting  $\boldsymbol{u}_t$ 

### Further Reading

- Neural Turing Machines (Graves, Wayne, Danihelka)
- Entity Networks (Henaff, Weston, Szlam, Bordes, LeCun)
- End-to-End Memory Networks (Sukhbaatar, Szlam, Weston, Fergus)
- Jointly Learning to Align and Translate (Bahdanau, Cho, Bengio)
- Principles of Probabilistic Programming Languages (Goodman)
- Backprop as a Functor (Fong, Spivak, Tuyras)
- Formal Methods for Probabilistic Programming (Selsam, Liang, Dill)

#### Conclusion

#### **Takeaway**

We can automatically infer simple functions over complex data structures, in the form of probability distributions just by using examples.

#### Conclusion

#### **Takeaway**

We can automatically infer simple functions over complex data structures, in the form of probability distributions just by using examples.

Thank you!