# Encoder-Decoder Implementation in Python Numpy

# Encoder

## Common

* $\overline{E}$ - `W_0_enc_approx_embdr` - shared by both directions, size ${m \times K_x}$, where $m = 620$ and $K_x = 30001$
* $\overline{b}_{\bar{E}}$ - `b_0_enc_approx_embdr`, size $m \times 1$, bias for `W_0_enc_approx_embdr`?

## Forward

* $\overrightarrow{W}$ - `W_0_enc_input_embdr_0` - input weights for hidden layer 0, forward direction, size $n\times m$, where $n=1000$
* $\overrightarrow{W}_z$ - `W_0_enc_update_embdr_0` - input weights of the GRU update gate, size $n\times m$
* $\overrightarrow{W}_r$ - `W_0_enc_reset_embdr_0` - input weights of the GRU reset gate, size $n\times m$
* $\overrightarrow{U}$ - `W_enc_transition_0`, size $n \times n$
* $\overrightarrow{U}_z$ - `G_enc_transition_0`, size $n \times n$
* $\overrightarrow{U}_r$ - `R_enc_transition_0`, size $n \times n$
* $\overrightarrow{b}_{\overrightarrow{W}}$ - `b_0_enc_input_embdr_0`, size $n \times 1$, bias for `W_0_enc_input_embdr_0`?

## Backward
Analogous to the forward direction, with the infix `back` in the parameter names (for example `W_0_back_enc_input_embdr_0`).
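Judging by the keys used in the code further down, the backward name is obtained by inserting `back_` in front of `enc`. A minimal sketch of that mapping follows; the helper `backward_name` is hypothetical and not part of the model file.

```python
# Forward encoder parameter names, as listed in the Forward section.
FORWARD_NAMES = [
    "W_0_enc_input_embdr_0",
    "b_0_enc_input_embdr_0",
    "W_0_enc_update_embdr_0",
    "W_0_enc_reset_embdr_0",
    "W_enc_transition_0",
    "G_enc_transition_0",
    "R_enc_transition_0",
]

def backward_name(forward_name):
    """Hypothetical helper: insert the `back` infix in front of `enc`."""
    return forward_name.replace("enc_", "back_enc_", 1)

BACKWARD_NAMES = [backward_name(name) for name in FORWARD_NAMES]
```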

## Calculations

The equations come from Bahdanau *et al.* (2014), extended with biases:

$$
\renewcommand{\ora}[1]{\overrightarrow{#1}}
\renewcommand{\ola}[1]{\overleftarrow{#1}}
\ora{h}_i = \left\{
\begin{array}{ll}
(1 - \ora{z}_i) \circ \ora{h}_{i-1} + \ora{z}_i \circ \ora{\underline{h}}_i & \mathrm{, if } i > 0 \\
0 & \mathrm{, if } i = 0 
\end{array}
\right.
$$

where

$$
\begin{eqnarray}
\ora{\underline{h}}_i &=& \tanh\left(\ora{W}(\overline{E}x_i+\overline{b}_{\bar{E}}) + \ora{b}_{\ora{W}} +\ora{U}\left[\ora{r}_i \circ \ora{h}_{i-1}\right]\right)\\
\ora{z}_i &=& \sigma\left(\ora{W}_z(\overline{E}x_i+\overline{b}_{\bar{E}})+\ora{U}_z\ora{h}_{i-1}\right)\\
\ora{r}_i &=& \sigma\left(\ora{W}_r(\overline{E}x_i+\overline{b}_{\bar{E}})+\ora{U}_r\ora{h}_{i-1}\right)
\end{eqnarray}
$$
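As a sanity check on the equations, here is a minimal sketch of one forward-encoder step in the column-vector convention used above. The toy sizes, the random weights and the names `sigmoid` and `gru_step` are assumptions made for illustration; the real model uses $m = 620$ and $n = 1000$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(e, h_prev, W, b_W, U, W_z, U_z, W_r, U_r):
    """One encoder step; `e` is the embedded input E x_i + b (length m)."""
    z = sigmoid(W_z.dot(e) + U_z.dot(h_prev))               # update gate
    r = sigmoid(W_r.dot(e) + U_r.dot(h_prev))               # reset gate
    h_cand = np.tanh(W.dot(e) + b_W + U.dot(r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_cand

# Toy sizes and random parameters (assumed, for illustration only).
m, n = 4, 3
rng = np.random.RandomState(0)
W, W_z, W_r = [rng.normal(size=(n, m)) for _ in range(3)]
U, U_z, U_r = [rng.normal(size=(n, n)) for _ in range(3)]
b_W = rng.normal(size=n)
e = rng.normal(size=m)

h0 = np.zeros(n)
h1 = gru_step(e, h0, W, b_W, U, W_z, U_z, W_r, U_r)
```

With $\overrightarrow{h}_0 = 0$ the recurrent terms vanish, so the first state reduces to the update gate times $\tanh$ of the input term plus its bias.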

The backward direction works the same way with the arrow flipped. In the implementation we reverse the sequence, substitute the matrices for the backward direction, and reverse the result. Then:

$$
h_i = \left[
\begin{array}{c}
\ora{h}_i \\
\ola{h}_i
\end{array}
\right]
$$
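The reverse, run, reverse-back recipe can be sketched with a stand-in recurrence. The running sum below is an assumption chosen only because its states are easy to check by hand; it is not the GRU, and `run_rnn` is a hypothetical helper.

```python
import numpy as np

def run_rnn(step, inputs, h0):
    """Apply `step` from left to right and return every state."""
    states, h = [], h0
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states

def running_sum(x, h):
    # Stand-in for the GRU step: the state is the sum of inputs seen so far.
    return h + x

inputs = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
h0 = np.zeros(1)

forward = run_rnn(running_sum, inputs, h0)                 # prefix sums: 1, 3, 6
backward = run_rnn(running_sum, inputs[::-1], h0)[::-1]    # suffix sums: 6, 5, 3

# Row i holds the forward state followed by the backward state for position i.
H = np.hstack((np.vstack(forward), np.vstack(backward)))
```

Each row of `H` combines what was seen up to position $i$ with what follows from position $i$ onward, which is the point of concatenating both directions.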

# Decoder

### Embeddings

* $E$ - `W_0_dec_approx_embdr`, target-side (output) embeddings, size $m \times K_y$
* $b$ - `b_0_dec_approx_embdr`, bias for the embeddings, size $m \times 1$

### RNN and GRU

* $W_s$ - `W_0_dec_initializer_0`, used to initialize the decoder state
* $b_{W_s}$ - `b_0_dec_initializer_0`, bias for the decoder state initialization
* $W$ - `W_0_dec_input_embdr_0`, size $n\times m$
* $b_W$ - `b_0_dec_input_embdr_0`, bias
* $W_z$ - `W_0_dec_update_embdr_0` - input weights of the GRU update gate, size $n\times m$
* $W_r$ - `W_0_dec_reset_embdr_0` - input weights of the GRU reset gate, size $n\times m$
* $U$ - `W_dec_transition_0`, size $n \times n$
* $U_z$ - `G_dec_transition_0`, size $n \times n$
* $U_r$ - `R_dec_transition_0`, size $n \times n$
* $C$ - `W_0_dec_dec_inputter_0`, size $n \times 2n$
* $C_z$ - `W_0_dec_dec_updater_0`, size $n \times 2n$
* $C_r$ - `W_0_dec_dec_reseter_0`, size $n \times 2n$

### Alignment model

$n^\prime=1000$ is the number of hidden units in the alignment model.

* $v_\alpha$ - `D_dec_transition_0`, size $n^\prime \times 1$
* $W_\alpha$ - `B_dec_transition_0`, size $n^\prime \times n$
* $U_\alpha$ - `A_dec_transition_0`, size $n^\prime \times 2n$

### Softmax

$l = 500$ is the size of the softmax hidden layer, and $W_o = W_o^{(2)}W_o^{(1)}$.

* $W_{o}^{(1)}$ - `W1_dec_deep_softmax`, size $m\times l$
* $W_{o}^{(2)}$ - `W2_dec_deep_softmax`, size $K_y\times m$
* $b_{W_o}$ - `b_dec_deep_softmax`, bias, size $K_y\times 1$
* $U_o$ - `W_0_dec_hid_readout_0`, size $2l \times n$ (here $n = 2l = 1000$)
* $b_{U_o}$ - `b_0_dec_hid_readout_0`, size $2l \times 1$
* $V_o$ - `W_0_dec_prev_readout_0`, size $2l \times m$
* $C_o$ - `W_0_dec_repr_readout` (the key used in the code below, without a trailing `_0`), size $2l \times 2n$
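Because $W_o^{(1)}$ and $W_o^{(2)}$ are fixed after training, their product can be formed once instead of applying both at every step. The sketch below checks this on toy sizes; the sizes and random values are assumptions (the model uses $l = 500$ and $m = 620$).

```python
import numpy as np

# Toy sizes (assumed, for illustration only).
l, m, K_y = 3, 4, 5
rng = np.random.RandomState(0)
W_o1 = rng.normal(size=(m, l))      # W_o^(1), size m x l
W_o2 = rng.normal(size=(K_y, m))    # W_o^(2), size K_y x m
t = rng.normal(size=l)              # a maxout vector t_i

W_o = W_o2.dot(W_o1)                # size K_y x l, computed once
two_step = W_o2.dot(W_o1.dot(t))    # project to m first, then to K_y
one_step = W_o.dot(t)               # same result with the merged matrix
```

The code below appears to store the matrices transposed (row-vector convention), which would explain why it computes `Wo1.dot(Wo2)` rather than the product in the order written here.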

## Calculations

$$
s_i = (1-z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i \qquad s_0 = \tanh\left(W_s\ola{h}_1 + b_{W_s}\right)
$$

where

$$
\begin{eqnarray}
z_i &=& \sigma\left(W_z(Ey_i+b)+U_zs_{i-1} + C_zc_i \right)\\
r_i &=& \sigma\left(W_r(Ey_i+b)+U_rs_{i-1} + C_rc_i \right)\\
\tilde{s}_i &=& \tanh\left(W(Ey_i+b) + b_W + U \left[r_i \circ s_{i-1}\right] +Cc_i\right)
\end{eqnarray}
$$

The context vector, an attention-weighted sum of the encoder states, is computed as:

$$
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
$$

where

$$ 
\begin{eqnarray}
\alpha_{ij} &=& \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x}\exp(e_{ik})} \\ 
e_{ij} &=& v_\alpha^T \tanh\left(W_{\alpha}s_{i-1} + U_{\alpha}h_j\right) 
\end{eqnarray}
$$
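A minimal sketch of one attention step for a single sentence, following the three formulas above. The function name `attention_context`, the toy sizes and the random parameters are assumptions; the encoder states are stored one per row.

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Return (alpha, c) for one decoder step.

    s_prev: (n,) previous decoder state
    H:      (T_x, 2n) encoder states, one per row
    """
    # e_j = v_a^T tanh(W_a s_prev + U_a h_j), computed for all j at once
    e = np.tanh(W_a.dot(s_prev) + H.dot(U_a.T)).dot(v_a)
    e = e - e.max()                     # shift for stability; softmax is unchanged
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha, alpha.dot(H)          # c = sum_j alpha_j h_j

# Toy sizes and random parameters (assumed, for illustration only).
n, n_prime, T_x = 2, 3, 4
rng = np.random.RandomState(0)
W_a = rng.normal(size=(n_prime, n))
U_a = rng.normal(size=(n_prime, 2 * n))
v_a = rng.normal(size=n_prime)
H = rng.normal(size=(T_x, 2 * n))
s_prev = rng.normal(size=n)

alpha, c = attention_context(s_prev, H, W_a, U_a, v_a)
```

The weights `alpha` are non-negative and sum to one, so `c` is a convex combination of the encoder states. If all rows of `H` were equal, `c` would equal that row regardless of the scores.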

And the deep softmax (with $W_o = W_o^{(2)}W_o^{(1)}$):

$$
p(y_i|s_{i-1},y_{i-1},c_i) = \textrm{softmax}\left(y_i^T\left(W_ot_i + b_{W_o}\right)\right) 
$$

where ($l = 500$)

$$
t_i = \left[\max \left\{ \tilde{t}_{i,2j-1},\tilde{t}_{i,2j} \right\}\right]_{j=1,\ldots,l}^T
$$

$$
\tilde{t}_i = U_os_{i-1}+b_{U_o}+V_o(Ey_{i-1} + b)+C_oc_i 
$$

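where $y_0 = 0$; in the code below the initial previous embedding is an all-zero vector, that is, $Ey_0 + b$ is replaced by $0$.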
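The maxout step takes the larger element of each consecutive pair of $\tilde{t}_i$, halving its length from $2l$ to $l$. A small worked example with made-up numbers (the helper name `maxout_pairs` is hypothetical):

```python
import numpy as np

def maxout_pairs(t_tilde):
    """Max over consecutive pairs: (t~_1, t~_2), (t~_3, t~_4), ..."""
    return np.maximum(t_tilde[..., 0::2], t_tilde[..., 1::2])

# Made-up vector with 2l = 6 entries, so l = 3.
t_tilde = np.array([1.0, 5.0, 3.0, 2.0, -1.0, -4.0])
t = maxout_pairs(t_tilde)   # pairs (1, 5), (3, 2), (-1, -4) give 5, 3, -1
```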

In [1]:
#%%writefile bahdanau.py

# __future__ imports must be the first statement in the file
from __future__ import print_function

import numpy as np

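def logit(X):
    # Despite the name, this is the logistic sigmoid (the inverse of the logit).
    return 1.0 / (1.0 + np.exp(-X))

def softmax(X, ax=0):
    # Subtract the maximum along `ax` so np.exp cannot overflow;
    # the softmax value itself is unchanged by this shift.
    expX = np.exp(X - np.max(X, axis=ax, keepdims=True))
    expXsum = np.sum(expX, axis=ax, keepdims=True)
    return expX / expXsum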

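def batchAndMask(sents):
    # Right-pad every sentence with 0 up to the longest length and stack the
    # results into a (batchSize, maxLength) array. The mask is True wherever
    # the id is non-zero, so a real token with id 0 would be masked out too.
    maxLength = max(len(s) for s in sents)
    sentsPadded = [np.pad(s, (0, maxLength-len(s)), mode="constant")
                   for s in sents]
    batch = np.vstack(sentsPadded)
    mask = batch != 0
    return batch, mask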
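To make the padding and masking concrete, here is a self-contained sketch on two short made-up sentences. `pad_and_mask` is a hypothetical restatement of `batchAndMask` above, and the log-probabilities are invented purely to show how the mask drops the padded position from a sentence score.

```python
import numpy as np

def pad_and_mask(sents):
    """Same idea as batchAndMask: right-pad with 0 and mark the real tokens."""
    max_len = max(len(s) for s in sents)
    batch = np.vstack([np.pad(s, (0, max_len - len(s)), mode="constant")
                       for s in sents])
    return batch, batch != 0

batch, mask = pad_and_mask([np.array([7, 8, 9]), np.array([5, 6])])
# batch is [[7, 8, 9], [5, 6, 0]]; mask is False only at the padded position

# Invented per-token log-probabilities; the last one belongs to padding.
log_probs = np.array([[-0.1, -0.2, -0.3],
                      [-0.4, -0.5, -9.9]])
sentence_scores = np.sum(log_probs * mask, axis=1)   # padding contributes 0
```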

In [2]:
#%%writefile -a bahdanau.py

class Encoder:
    class Embeddings:
        def __init__(self, data):
            self.E  = data["W_0_enc_approx_embdr"]
            self.EB = data["b_0_enc_approx_embdr"].T

        def Lookup(self, i):
            return self.E[i] + self.EB
    
    class RNN:
        def __init__(self, data):
            self.W   = data["W"]
            self.B   = data["B"]
            self.U   = data["U"]
            self.Wz  = data["Wz"]
            self.Uz  = data["Uz"]
            self.Wr  = data["Wr"]
            self.Ur  = data["Ur"]

        def InitializeState(self, batchSize=1):
            H0 = np.zeros((batchSize, 1000))
            return H0
        
        def GetNextState(self, embd, prevState):
            Zi  = logit(embd.dot(self.Wz) + prevState.dot(self.Uz))
            Ri  = logit(embd.dot(self.Wr) + prevState.dot(self.Ur))
            Hi_ = np.tanh(embd.dot(self.W) + self.B + (Ri * prevState).dot(self.U)) 
            Hi  = (1.0 - Zi) * prevState + Zi * Hi_
            return Hi
        
        def GetContext(self, embeddings):
            states = []
            prevState = self.InitializeState()
            for embd in embeddings:
                state = self.GetNextState(embd, prevState)
                states.append(state)
                prevState = state
            return states
    
    def __init__(self, data):
        self.embeddings = self.Embeddings(data)
        
        fW = dict()
        fW["W"] = data["W_0_enc_input_embdr_0"]
        fW["B"] = data["b_0_enc_input_embdr_0"].T
        fW["U"] = data["W_enc_transition_0"]
        fW["Wz"] = data["W_0_enc_update_embdr_0"]
        fW["Uz"] = data["G_enc_transition_0"]
        fW["Wr"] = data["W_0_enc_reset_embdr_0"]
        fW["Ur"] = data["R_enc_transition_0"]

        bW = dict()
        bW["W"] = data["W_0_back_enc_input_embdr_0"]
        bW["B"] = data["b_0_back_enc_input_embdr_0"].T
        bW["U"] = data["W_back_enc_transition_0"]
        bW["Wz"] = data["W_0_back_enc_update_embdr_0"]
        bW["Uz"] = data["G_back_enc_transition_0"]
        bW["Wr"] = data["W_0_back_enc_reset_embdr_0"]
        bW["Ur"] = data["R_back_enc_transition_0"]
        
        self.rnnForward  = self.RNN(fW)
        self.rnnBackward = self.RNN(bW)
        
    def GetContext(self, batch):
        batchSize, numSteps  = batch.shape
        sourceEmbeddings = [self.embeddings.Lookup(batch[:,i]) for i in range(numSteps)]
        statesForward  = self.rnnForward.GetContext(sourceEmbeddings)
        statesBackward = self.rnnBackward.GetContext(sourceEmbeddings[::-1])[::-1]
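        # vstack puts one time step per row, which assumes batchSize == 1
        # (a single source sentence); source padding is not masked here.
        states = np.hstack((np.vstack(statesForward),
                            np.vstack(statesBackward)))
        return states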

In [7]:
#%%writefile -a bahdanau.py

class Decoder:
    class Embeddings:
        def __init__(self, data):
            self.E  = data["W_0_dec_approx_embdr"]
            self.EB = data["b_0_dec_approx_embdr"]
            
        def Initialize(self, batchSize=1):
            return np.zeros((batchSize, self.E.shape[1]))
        
        def Lookup(self, i):
            return self.E[i] + self.EB
        
    class RNN:
        def __init__(self, data):
            self.Ws  = data["W_0_dec_initializer_0"]
            self.WsB = data["b_0_dec_initializer_0"].T

            self.W   = data["W_0_dec_input_embdr_0"]
            self.B   = data["b_0_dec_input_embdr_0"].T
            self.U   = data["W_dec_transition_0"]
            self.C   = data["W_0_dec_dec_inputter_0"]

            self.Wz  = data["W_0_dec_update_embdr_0"]
            self.Uz  = data["G_dec_transition_0"]
            self.Cz  = data["W_0_dec_dec_updater_0"]

            self.Wr  = data["W_0_dec_reset_embdr_0"]
            self.Ur  = data["R_dec_transition_0"]
            self.Cr  = data["W_0_dec_dec_reseter_0"]

        def InitializeState(self, sourceContext, batchSize=1):
            H1Backward = sourceContext[0,1000:].T
            S0 = np.tanh(H1Backward.dot(self.Ws) + self.WsB)
            return np.tile(S0, batchSize).reshape(batchSize, 1000)
        
        def GetNextState(self, embd, prevState, context):        
            Zi = logit(embd.dot(self.Wz) + prevState.dot(self.Uz) + context.dot(self.Cz))
            Ri = logit(embd.dot(self.Wr) + prevState.dot(self.Ur) + context.dot(self.Cr))
            Si_= np.tanh(embd.dot(self.W) + self.B
                          + (Ri * prevState).dot(self.U)
                          + context.dot(self.C))
            Si  = (1.0 - Zi) * prevState + Zi * Si_
            return Si
    
    class AlignmentModel:
        def __init__(self, data):
            self.Va  = data["D_dec_transition_0"].T
            self.Wa  = data["B_dec_transition_0"]
            self.Ua  = data["A_dec_transition_0"]
            
        def GetContext(self, sourceContext, prevState):
            a = sourceContext.dot(self.Ua)
            b = prevState.dot(self.Wa)
            c = a.reshape(1, a.shape[0], a.shape[1]) + b.reshape(b.shape[0], 1, b.shape[1])
            Ei = np.tensordot(self.Va, np.tanh(c).T, axes=[[1],[0]])
            Ai = softmax(Ei, ax=1)
            Ai = Ai.reshape(Ai.shape[1],Ai.shape[2])
            Ci = Ai.T.dot(sourceContext)
            return Ci
    
    class DeepSoftMax:
        def __init__(self, data):
            Wo1      = data["W1_dec_deep_softmax"]
            Wo2      = data["W2_dec_deep_softmax"] 
            self.Wo  = Wo1.dot(Wo2)
            self.WoB = data["b_dec_deep_softmax"].T
            self.Uo  = data["W_0_dec_hid_readout_0"]
            self.UoB = data["b_0_dec_hid_readout_0"].T
            self.Vo  = data["W_0_dec_prev_readout_0"]
            self.Co  = data["W_0_dec_repr_readout"]
            
        def GetProbs(self, prevState, prevEmbd, context):
            Ti = prevState.dot(self.Uo) + self.UoB + prevEmbd.dot(self.Vo) + context.dot(self.Co)
            maximum = np.maximum(Ti[:,::2], Ti[:,1::2])
            P = softmax((maximum.dot(self.Wo) + self.WoB).T)
            logP = np.log(P)
            return logP

    def __init__(self, data):
        self.embeddings     = self.Embeddings(data)
        self.rnn            = self.RNN(data)
        self.alignmentModel = self.AlignmentModel(data)
        self.deepSoftMax    = self.DeepSoftMax(data)
    
    def GetScores(self, batch, mask, sourceContext):
        states, probs = [], []
        batchSize, numSteps  = batch.shape
        
        previousState = self.rnn.InitializeState(sourceContext, batchSize)
        previousEmbedding = self.embeddings.Initialize(batchSize)
        
        for i in range(numSteps):
            wordBatch = batch[:,i]

            alignedSourceContext = self.alignmentModel.GetContext(sourceContext, previousState)

            allProbs = self.deepSoftMax.GetProbs(previousState, previousEmbedding, alignedSourceContext)
            
            currentEmbedding = self.embeddings.Lookup(wordBatch)
            currentState = self.rnn.GetNextState(currentEmbedding, previousState, alignedSourceContext)
            
            # Log-probability of each sentence's reference word at step i.
            probs.append(allProbs[wordBatch, np.arange(batchSize)])
            previousState, previousEmbedding = currentState, currentEmbedding
            
        # probs holds log-probabilities; the mask zeroes out padded positions.
        probs = np.array(probs).reshape(numSteps, batchSize).T * mask
        return np.sum(probs, axis=1), probs

In [14]:
#%%writefile -a bahdanau.py

data = np.load("/home/marcinj/Badania/best_nmt/search_model.npz")
encoder = Encoder(data)
decoder = Decoder(data)

In [15]:
#%%writefile -a bahdanau.py

# Source sentence as word ids, ending with id 30000 (assumed to be <eol>)
sourceSentence, sourceMask = batchAndMask([
        np.array([22, 186, 3, 52, 5, 2, 4464, 11255, 5, 903, 6, 52, 5, 2, 8053, 5, 
                  19041, 328, 10, 2, 477, 7, 620, 2227, 119, 80, 24, 25, 4129, 2, 
                  382, 1053, 3, 80, 195, 599, 49, 618, 8, 2, 2567, 3, 2187, 17, 
                  8, 5345, 11899, 23, 4886, 3, 2450, 17, 8, 2, 8153, 3193, 5, 1718, 4, 30000])])

t1 = np.array([229, 14, 2, 19, 5, 6535, 19417, 17, 1, 6, 19, 5, 19417, 5, 1, 664, 
               12, 4, 1289, 8, 234, 92, 2685, 2, 4, 4, 813, 1301, 23, 941, 43, 2, 
               4, 117, 69, 64, 1075, 20, 4166, 4757, 2, 8890, 14, 7, 8455, 1, 57, 
               6158, 2, 18796, 14, 7, 10, 23947, 10062, 5, 3125, 3, 30000])
batch, mask = batchAndMask([t1])


import time
start = time.time()

np.set_printoptions(precision=6, suppress=True)

sourceContext = encoder.GetContext(sourceSentence)
prob, probs = decoder.GetScores(batch, mask, sourceContext)

end = time.time()

print(probs, "\n")
print("Final: ", prob, "\n") 

print("Time: ", np.round(end - start, 4))

[[-1.681214 -0.026378 -0.091734 -1.760913 -0.243037 -0.103349 -0.651866
  -0.339058 -0.606603 -0.559903 -0.429009 -0.651369 -0.754762 -0.446464
  -0.037696 -0.341691 -0.12027  -0.021312 -0.78809  -2.853825 -0.071444
  -0.207325 -1.044245 -0.69245  -0.003423 -0.301036 -0.286701 -2.718127
  -0.518496 -0.154297 -0.341055 -0.110045 -0.116031 -2.53656  -0.173996
  -1.296167 -0.510002 -0.734223 -0.10083  -3.073422 -0.135046 -2.451027
  -1.246977 -0.55526  -2.803749 -1.305413 -1.054786 -0.58193  -0.295945
  -0.712755 -0.362529 -0.396563 -0.46071  -0.906829 -1.539457 -0.687323
  -0.106607 -0.015487 -0.00162 ]] 

Final:  [-43.118431] 

Time:  19.9453
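The `Final` value is the sum of the 59 per-token log-probabilities printed above, so it grows in magnitude with sentence length. A common way to compare sentences of different lengths is to look at the average per token, or equivalently the perplexity. The normalization below is a suggestion for reading the number, not something the notebook itself computes.

```python
import math

total_log_prob = -43.118431   # the "Final" value printed above
num_tokens = 59               # length of t1, including the closing id 30000

avg_log_prob = total_log_prob / num_tokens
perplexity = math.exp(-avg_log_prob)   # natural log, matching np.log in GetProbs

print("avg log-prob per token: %.4f, perplexity: %.4f" % (avg_log_prob, perplexity))
```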
