# Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They can read inputs $x^{\langle t \rangle}$ (such as words) one at a time, and remember some information/context through the hidden layer activations that get passed from one time-step to the next. This allows a uni-directional RNN to take information from the past to process later inputs. A bidirectional RNN can take context from both the past and the future.

**Notation**:
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer. 
    - Example: $a^{[4]}$ is the $4^{th}$ layer activation. $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.

- Superscript $(i)$ denotes an object associated with the $i^{th}$ example. 
    - Example: $x^{(i)}$ is the $i^{th}$ training example input.

- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step. 
    - Example: $x^{\langle t \rangle}$ is the input x at the $t^{th}$ time-step. $x^{(i)\langle t \rangle}$ is the input at the $t^{th}$ timestep of example $i$.
    
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$.



## 1 - Forward propagation for the basic Recurrent Neural Network
 
The basic RNN that you will implement has the structure below. In this example, $T_x = T_y$. 

<img src="https://imgur.com/Yaa79IN.png" style="width:500;height:300px;">
<caption><center> **Figure 1**: Basic RNN model </center></caption>

Here's how you can implement an RNN: 

**Code Instructions**:
1. Implement the calculations needed for one time-step of the RNN.
2. Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time. 

Let's go!

## 1.1 - RNN cell

A Recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell. 

<img src="https://imgur.com/vGxAY57.png" style="width:700px;height:300px;">
<caption><center> **Figure 2**: Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t - 1\rangle}$ (previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$ which is given to the next RNN cell and also used to predict $y^{\langle t \rangle}$ </center></caption>


**Code Instructions**:
1. Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
2. Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = \text{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$. A `softmax` function is provided for you.
3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in cache.
4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$ and cache.

We will vectorize over $m$ examples. Thus, $x^{\langle t \rangle}$ will have dimension $(n_x,m)$, and $a^{\langle t \rangle}$ will have dimension $(n_a,m)$. 
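
The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference solution: the parameter dictionary keys (`Waa`, `Wax`, `Wya`, `ba`, `by`), the column-wise `softmax` helper, and the toy sizes are assumptions chosen to match the notation in the text.

```python
import numpy as np

def softmax(z):
    """Column-wise softmax: each column holds the scores of one example."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """One RNN time-step. xt: (n_x, m), a_prev: (n_a, m)."""
    Waa, Wax, Wya = parameters["Waa"], parameters["Wax"], parameters["Wya"]
    ba, by = parameters["ba"], parameters["by"]
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)   # step 1: new hidden state
    yt_pred = softmax(Wya @ a_next + by)             # step 2: prediction
    cache = (a_next, a_prev, xt, parameters)         # step 3: values needed for backprop
    return a_next, yt_pred, cache                    # step 4

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
n_x, n_a, n_y, m = 3, 5, 2, 4
parameters = {
    "Waa": rng.standard_normal((n_a, n_a)),
    "Wax": rng.standard_normal((n_a, n_x)),
    "Wya": rng.standard_normal((n_y, n_a)),
    "ba": rng.standard_normal((n_a, 1)),
    "by": rng.standard_normal((n_y, 1)),
}
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
a_next, yt_pred, cache = rnn_cell_forward(xt, a_prev, parameters)
print(a_next.shape, yt_pred.shape)  # (5, 4) (2, 4)
```

The biases have shape $(n_a, 1)$ and $(n_y, 1)$ so that broadcasting applies them to every one of the $m$ columns.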

## 1.2 - RNN forward pass 

An RNN is the repetition of the cell you've just built. If your input sequence spans 10 time steps, then you will copy the RNN cell 10 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step's input data ($x^{\langle t \rangle}$). It outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for this time-step.


<img src="https://imgur.com/YdNCgkN.png" style="width:800px;height:300px;">
<caption><center> **Figure 3**: Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$  is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. </center></caption>

**Code Instructions**:
1. Create a vector of zeros ($a$) that will store all the hidden states computed by the RNN.
2. Initialize the "next" hidden state as $a_0$ (initial hidden state).
3. Start looping over each time step, your incremental index is $t$ :
    - Update the "next" hidden state and the cache by running `rnn_cell_forward`
    - Store the "next" hidden state in $a$ ($t^{th}$ position) 
    - Store the prediction in y
    - Add the cache to the list of caches
4. Return $a$, $y$ and caches
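
A possible sketch of this loop is shown below. It assumes the input is stored as an array of shape $(n_x, m, T_x)$ and re-declares a compact single-step function so the block runs on its own; both the layout and the parameter keys are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, p):
    """Single time-step, as described in section 1.1."""
    a_next = np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])
    yt_pred = softmax(p["Wya"] @ a_next + p["by"])
    return a_next, yt_pred, (a_next, a_prev, xt, p)

def rnn_forward(x, a0, p):
    """Run the cell over all T_x time-steps. x: (n_x, m, T_x), a0: (n_a, m)."""
    n_x, m, T_x = x.shape
    n_y, n_a = p["Wya"].shape
    a = np.zeros((n_a, m, T_x))        # step 1: storage for all hidden states
    y_pred = np.zeros((n_y, m, T_x))
    caches = []
    a_next = a0                        # step 2: start from the initial hidden state
    for t in range(T_x):               # step 3: one cell call per time-step
        a_next, yt_pred, cache = rnn_cell_forward(x[:, :, t], a_next, p)
        a[:, :, t] = a_next
        y_pred[:, :, t] = yt_pred
        caches.append(cache)
    return a, y_pred, caches           # step 4

rng = np.random.default_rng(0)
n_x, n_a, n_y, m, T_x = 3, 5, 2, 4, 6
p = {
    "Waa": rng.standard_normal((n_a, n_a)),
    "Wax": rng.standard_normal((n_a, n_x)),
    "Wya": rng.standard_normal((n_y, n_a)),
    "ba": rng.standard_normal((n_a, 1)),
    "by": rng.standard_normal((n_y, 1)),
}
x = rng.standard_normal((n_x, m, T_x))
a0 = np.zeros((n_a, m))
a, y_pred, caches = rnn_forward(x, a0, p)
print(a.shape, y_pred.shape, len(caches))  # (5, 4, 6) (2, 4, 6) 6
```

The same parameters are reused at every time-step; only the hidden state changes as the loop advances.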


### 1.3 - Basic RNN  backward pass

We will start by computing the backward pass for the basic RNN-cell.

<img src="https://imgur.com/3EniMu4.png" style="width:500;height:300px;"> <br>
<caption><center> **Figure 4**: RNN-cell's backward pass. Just like in a fully-connected neural network, the derivative of the cost function $J$ backpropagates through the RNN by following the chain-rule from calculus. The chain-rule is also used to calculate $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b})$ to update the parameters $(W_{ax}, W_{aa}, b_a)$. </center></caption>

#### Deriving the one step backward functions: 

To compute the `rnn_cell_backward` you need to compute the following equations. It is a good exercise to derive them by hand. 

The derivative of $\tanh$ is $1-\tanh(x)^2$.

Similarly, for $\frac{ \partial a^{\langle t \rangle} } {\partial W_{ax}}, \frac{ \partial a^{\langle t \rangle} } {\partial W_{aa}},  \frac{ \partial a^{\langle t \rangle} } {\partial b}$, the chain rule gives the derivative of $\tanh(u)$ as $(1-\tanh(u)^2)\,du$.

The final two equations also follow the same rule and are derived using the $\tanh$ derivative. Note that the terms are arranged so that the dimensions match.
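
The equations this passage refers to are not reproduced in the text, so the sketch below writes them out from the forward formula $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ using the $\tanh$ derivative given above. Treat it as one consistent derivation under that assumption rather than the course's own listing; the cache layout and dictionary keys are hypothetical.

```python
import numpy as np

def rnn_cell_backward(da_next, cache):
    """Backward pass for one RNN cell.

    da_next: gradient of the cost w.r.t. a_next, shape (n_a, m).
    cache:   (a_next, a_prev, xt, parameters) saved by the forward step.
    """
    a_next, a_prev, xt, p = cache
    # Gradient at the input of tanh: (1 - tanh(u)^2) * upstream gradient.
    dz = (1.0 - a_next ** 2) * da_next
    return {
        "dWax": dz @ xt.T,                        # (n_a, n_x)
        "dWaa": dz @ a_prev.T,                    # (n_a, n_a)
        "dba": dz.sum(axis=1, keepdims=True),     # summed over the batch
        "dxt": p["Wax"].T @ dz,                   # (n_x, m)
        "da_prev": p["Waa"].T @ dz,               # (n_a, m)
    }

rng = np.random.default_rng(1)
n_x, n_a, m = 3, 5, 4
p = {
    "Wax": rng.standard_normal((n_a, n_x)),
    "Waa": rng.standard_normal((n_a, n_a)),
    "ba": rng.standard_normal((n_a, 1)),
}
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
a_next = np.tanh(p["Waa"] @ a_prev + p["Wax"] @ xt + p["ba"])
da_next = rng.standard_normal((n_a, m))
grads = rnn_cell_backward(da_next, (a_next, a_prev, xt, p))
print({k: v.shape for k, v in grads.items()})
```

A finite-difference check on a single weight is a cheap way to confirm that the transposes are arranged correctly.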

#### Backward pass through the RNN

Computing the gradients of the cost with respect to $a^{\langle t \rangle}$ at every time-step $t$ is useful because it is what helps the gradient backpropagate to the previous RNN-cell. To do so, you need to iterate through all the time steps starting at the end, and at each step, you increment the overall $db_a$, $dW_{aa}$, $dW_{ax}$ and you store $dx$.

**Instructions**:

Implement the `rnn_backward` function. Initialize the return variables with zeros first, then loop through all the time steps, calling `rnn_cell_backward` at each time-step and updating the other variables accordingly.

In the next part, you will build a more complex LSTM model, which is better at addressing vanishing gradients. The LSTM will be better able to remember a piece of information and keep it saved for many timesteps. 

## 2 - Long Short-Term Memory (LSTM) network

The following figure shows the operations of an LSTM-cell.

<img src="https://imgur.com/wRyYVQ6.png" style="width:500;height:400px;">
<caption><center> **Figure 4**: LSTM-cell. This tracks and updates a "cell state" or memory variable $c^{\langle t \rangle}$ at every time-step, which can be different from $a^{\langle t \rangle}$. </center></caption>

Similar to the RNN example above, you will start by understanding the LSTM cell for a single time-step. Then you can iteratively call it from inside a for-loop to have it process an input with $T_x$ time-steps. 

### About the gates

#### - Forget gate

For the sake of this illustration, let's assume we are reading words in a piece of text, and want to use an LSTM to keep track of grammatical structures, such as whether the subject is singular or plural. If the subject changes from a singular word to a plural word, we need to find a way to get rid of our previously stored memory value of the singular/plural state. In an LSTM, the forget gate lets us do this: 

$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1} $$

Here, $W_f$ are weights that govern the forget gate's behavior. We concatenate $[a^{\langle t-1 \rangle}, x^{\langle t \rangle}]$ and multiply by $W_f$. The equation above results in a vector $\Gamma_f^{\langle t \rangle}$ with values between 0 and 1. This forget gate vector will be multiplied element-wise by the previous cell state $c^{\langle t-1 \rangle}$. So if one of the values of $\Gamma_f^{\langle t \rangle}$ is 0 (or close to 0) then it means that the LSTM should remove that piece of information (e.g. the singular subject) in the corresponding component of $c^{\langle t-1 \rangle}$. If one of the values is 1, then it will keep the information. 

#### - Update gate

Once we forget that the subject being discussed is singular, we need to find a way to update it to reflect that the new subject is now plural. Here is the formula for the update gate: 

$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2} $$ 

Similar to the forget gate, here $\Gamma_u^{\langle t \rangle}$ is again a vector of values between 0 and 1. This will be multiplied element-wise with $\tilde{c}^{\langle t \rangle}$, in order to compute $c^{\langle t \rangle}$.

#### - Updating the cell 

To update the new subject we need to create a new vector of numbers that we can add to our previous cell state. The equation we use is: 

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3} $$

Finally, the new cell state is: 

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4} $$


#### - Output gate

To decide which outputs we will use, we will use the following two formulas: 

$$ \Gamma_o^{\langle t \rangle}=  \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}$$ 
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6} $$

In equation 5 you decide what to output using a sigmoid function, and in equation 6 you multiply that by the $\tanh$ of the current cell state $c^{\langle t \rangle}$. 

### 2.1 - LSTM cell

**Instructions**:
1. Concatenate $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ in a single matrix: $concat = \begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$
2. Compute all the formulas 1-6. You can use `sigmoid()` and `np.tanh()`.
3. Compute the prediction $y^{\langle t \rangle}$. You can use `softmax()`
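
The instructions above might translate into NumPy roughly as follows. The weight names `Wf`, `Wu`, `Wc`, `Wo` follow equations 1-6; the output-layer names `Wy` and `by`, the cache layout, and the `sigmoid`/`softmax` helpers are assumptions made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def lstm_cell_forward(xt, a_prev, c_prev, p):
    """One LSTM time-step. xt: (n_x, m); a_prev, c_prev: (n_a, m)."""
    concat = np.concatenate([a_prev, xt], axis=0)   # (n_a + n_x, m)
    ft = sigmoid(p["Wf"] @ concat + p["bf"])        # (1) forget gate
    ut = sigmoid(p["Wu"] @ concat + p["bu"])        # (2) update gate
    cct = np.tanh(p["Wc"] @ concat + p["bc"])       # (3) candidate value
    c_next = ft * c_prev + ut * cct                 # (4) new cell state
    ot = sigmoid(p["Wo"] @ concat + p["bo"])        # (5) output gate
    a_next = ot * np.tanh(c_next)                   # (6) new hidden state
    yt_pred = softmax(p["Wy"] @ a_next + p["by"])   # prediction (assumed output layer)
    cache = (a_next, c_next, a_prev, c_prev, ft, ut, cct, ot, xt, p)
    return a_next, c_next, yt_pred, cache

rng = np.random.default_rng(0)
n_x, n_a, n_y, m = 3, 5, 2, 4
p = {"Wy": rng.standard_normal((n_y, n_a)), "by": rng.standard_normal((n_y, 1))}
for g in "fuco":
    p["W" + g] = rng.standard_normal((n_a, n_a + n_x))
    p["b" + g] = rng.standard_normal((n_a, 1))
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
c_prev = rng.standard_normal((n_a, m))
a_next, c_next, yt_pred, cache = lstm_cell_forward(xt, a_prev, c_prev, p)
print(a_next.shape, c_next.shape, yt_pred.shape)  # (5, 4) (5, 4) (2, 4)
```

All products between gates and states are element-wise, while the products with the weight matrices are ordinary matrix multiplications.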

### 2.2 - Forward pass for LSTM

Now that you have implemented one step of an LSTM, you can iterate it using a for-loop to process a sequence of $T_x$ inputs. 

<img src="https://imgur.com/CFEgAAx.png" style="width:500;height:300px;">
<caption><center> **Figure 5**: LSTM over multiple time-steps. </center></caption>

**Exercise:** Implement `lstm_forward()` to run an LSTM over $T_x$ time-steps. 

**Note**: $c^{\langle 0 \rangle}$ is initialized with zeros.

You have now implemented the forward passes for the basic RNN and the LSTM. When using a deep learning framework, implementing the forward pass is sufficient to build systems that achieve great performance. Next, we will see how backpropagation works for the LSTM, as we did for the basic RNN.


## 2.3- LSTM backward pass

### 2.3.1 One Step backward

The LSTM backward pass is slightly more complicated than the forward one. We have provided you with all the equations for the LSTM backward pass below. (If you enjoy calculus exercises feel free to try deriving these from scratch yourself.) 

### 2.3.2 gate derivatives

$$d \Gamma_o^{\langle t \rangle} = da_{next}*\tanh(c_{next}) * \Gamma_o^{\langle t \rangle}*(1-\Gamma_o^{\langle t \rangle})\tag{7}$$

$$d\tilde c^{\langle t \rangle} = \left(dc_{next}*\Gamma_u^{\langle t \rangle}+ \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * \Gamma_u^{\langle t \rangle} * da_{next}\right) * \left(1-(\tilde c^{\langle t \rangle})^2\right) \tag{8}$$

$$d\Gamma_u^{\langle t \rangle} = \left(dc_{next}*\tilde c^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * \tilde c^{\langle t \rangle} * da_{next}\right)*\Gamma_u^{\langle t \rangle}*(1-\Gamma_u^{\langle t \rangle})\tag{9}$$

$$d\Gamma_f^{\langle t \rangle} = \left(dc_{next}* c_{prev} + \Gamma_o^{\langle t \rangle} (1-\tanh(c_{next})^2) * c_{prev} * da_{next}\right)*\Gamma_f^{\langle t \rangle}*(1-\Gamma_f^{\langle t \rangle})\tag{10}$$

### 2.3.3 parameter derivatives 

$$ dW_f = d\Gamma_f^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{11} $$
$$ dW_u = d\Gamma_u^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{12} $$
$$ dW_c = d\tilde c^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{13} $$
$$ dW_o = d\Gamma_o^{\langle t \rangle} * \begin{pmatrix} a_{prev} \\ x_t\end{pmatrix}^T \tag{14}$$

To calculate $db_f, db_u, db_c, db_o$ you just need to sum across the horizontal (axis=1) axis on $d\Gamma_f^{\langle t \rangle}, d\Gamma_u^{\langle t \rangle}, d\tilde c^{\langle t \rangle}, d\Gamma_o^{\langle t \rangle}$ respectively. Note that you should pass the `keepdims=True` option to `np.sum`.

Finally, you will compute the derivative with respect to the previous hidden state, previous memory state, and input.

$$ da_{prev} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle} \tag{15}$$
Here, the weights for equation 15 are the first $n_a$ columns, (i.e. $W_f = W_f[:, :n_a]$ etc...)

$$ dc_{prev} = dc_{next}\Gamma_f^{\langle t \rangle} + \Gamma_o^{\langle t \rangle} * (1- \tanh(c_{next})^2)*\Gamma_f^{\langle t \rangle}*da_{next} \tag{16}$$
$$ dx^{\langle t \rangle} = W_f^T*d\Gamma_f^{\langle t \rangle} + W_u^T * d\Gamma_u^{\langle t \rangle}+ W_c^T * d\tilde c^{\langle t \rangle} + W_o^T * d\Gamma_o^{\langle t \rangle}\tag{17} $$
where the weights for equation 17 are the columns from $n_a$ to the end, (i.e. $W_f = W_f[:, n_a:]$ etc...)
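
As a rough sketch of the one-step backward pass, the block below differentiates the forward equations 1-6 directly, with every gate gradient taken with respect to the gate's pre-activation. It is a derivation under stated assumptions, not the course's reference code: weights have shape $(n_a, n_a + n_x)$, the helper `lstm_step`, the cache layout, and the dictionary keys are hypothetical, and the prediction layer is left out.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(xt, a_prev, c_prev, p):
    """Forward step (equations 1-6); returns a_next, c_next and a cache."""
    concat = np.concatenate([a_prev, xt], axis=0)
    ft = sigmoid(p["Wf"] @ concat + p["bf"])
    ut = sigmoid(p["Wu"] @ concat + p["bu"])
    cct = np.tanh(p["Wc"] @ concat + p["bc"])
    ot = sigmoid(p["Wo"] @ concat + p["bo"])
    c_next = ft * c_prev + ut * cct
    a_next = ot * np.tanh(c_next)
    return a_next, c_next, (concat, c_prev, c_next, ft, ut, cct, ot)

def lstm_cell_backward(da_next, dc_next, cache, p, n_a):
    concat, c_prev, c_next, ft, ut, cct, ot = cache
    tanh_c = np.tanh(c_next)
    # Total gradient reaching c_next: direct path plus the path through a_next.
    dc_total = dc_next + da_next * ot * (1.0 - tanh_c ** 2)
    dot = da_next * tanh_c * ot * (1.0 - ot)      # output gate
    dcct = dc_total * ut * (1.0 - cct ** 2)       # candidate value
    dut = dc_total * cct * ut * (1.0 - ut)        # update gate
    dft = dc_total * c_prev * ft * (1.0 - ft)     # forget gate
    grads = {}
    for name, d in (("f", dft), ("u", dut), ("c", dcct), ("o", dot)):
        grads["dW" + name] = d @ concat.T
        grads["db" + name] = d.sum(axis=1, keepdims=True)
    # Gradient w.r.t. the stacked [a_prev; xt]; the first n_a rows belong to a_prev.
    dconcat = (p["Wf"].T @ dft + p["Wu"].T @ dut
               + p["Wc"].T @ dcct + p["Wo"].T @ dot)
    grads["da_prev"] = dconcat[:n_a]
    grads["dxt"] = dconcat[n_a:]
    grads["dc_prev"] = dc_total * ft
    return grads

rng = np.random.default_rng(2)
n_x, n_a, m = 3, 4, 2
p = {}
for g in "fuco":
    p["W" + g] = 0.5 * rng.standard_normal((n_a, n_a + n_x))
    p["b" + g] = 0.5 * rng.standard_normal((n_a, 1))
xt = rng.standard_normal((n_x, m))
a_prev = rng.standard_normal((n_a, m))
c_prev = rng.standard_normal((n_a, m))
a_next, c_next, cache = lstm_step(xt, a_prev, c_prev, p)
da_next = rng.standard_normal((n_a, m))
dc_next = rng.standard_normal((n_a, m))
grads = lstm_cell_backward(da_next, dc_next, cache, p, n_a)
print(sorted(grads))
```

Slicing the rows of `dconcat` is equivalent to multiplying by the transposes of the column blocks `W[:, :n_a]` and `W[:, n_a:]`, which is where the split between $da_{prev}$ and $dx^{\langle t \rangle}$ comes from.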


### 2.4 Backward pass through the LSTM RNN

This part is very similar to the `rnn_backward` function you implemented above. You will first create variables of the same dimension as your return variables. You will then iterate over all the time steps starting from the end and call the one step function you implemented for LSTM at each iteration. You will then update the parameters by summing them individually. Finally return a dictionary with the new gradients. 

**Instructions**: Implement the `lstm_backward` function. Create a for loop starting from $T_x$ and going backward. For each step, call `lstm_cell_backward` and update your old gradients by adding the new gradients to them. Note that `dxt` is not updated but is stored.

# Questions and Answers (Q&A)

Q: In an LSTM network (as described in "Understanding LSTMs"), why are the candidate values at the input gate and the cell state at the output gate passed through tanh? What is the intuition behind this?

A: The reason for using tanh is that its range is $(-1, 1)$, whereas the range of the sigmoid function is $(0, 1)$. In fact, tanh is a rescaled and shifted version of the sigmoid:
$$ \tanh(x) = 2\sigma(2x) - 1 $$
So the gradient of tanh is larger than the sigmoid gradient for $x$ in roughly $(-1.663, 1.663)$ (four times larger at $x = 0$), and smaller outside that range, where tanh saturates faster. Because the gradients are larger near zero, convergence tends to be faster with tanh, and it is somewhat less prone to vanishing gradients than the sigmoid.
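
The identity and the gradient comparison can be checked numerically. The snippet below is a small sanity check, assuming only NumPy; the grid range and resolution are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# 1) tanh(x) = 2 * sigmoid(2x) - 1 on a grid of points.
x = np.linspace(-4.0, 4.0, 801)
identity_gap = np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)))

# 2) Ratio of the two gradients at the origin.
ratio_at_zero = tanh_grad(0.0) / sigmoid_grad(0.0)

# 3) First positive x at which the tanh gradient drops below the sigmoid gradient.
xs = np.linspace(0.0, 4.0, 40001)
below = tanh_grad(xs) < sigmoid_grad(xs)
crossing = xs[np.argmax(below)]

print(identity_gap, ratio_at_zero, crossing)
```

By symmetry of both gradients, the same crossing point appears at the corresponding negative value of $x$.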

# Activity Recognition

In [0]:
import pandas as pd
import numpy as np
import cv2
import os
import h5py
from tqdm import tqdm
from keras.preprocessing import image
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.models import Model, load_model, Sequential
from keras.layers import Input, LSTM, Dense, Dropout
from keras.utils import to_categorical
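# preprocess_input comes from keras.applications.inception_v3 (imported above);
# importing the imagenet_utils version here would shadow it with a different scaling.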
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, TensorBoard,EarlyStopping
from keras.utils.io_utils import HDF5Matrix

SEQ_LEN = 30
MAX_SEQ_LEN = 200
BATCH_SIZE = 16
EPOCHS = 1000

In [0]:
def get_data(path, if_pd=False):
    """Load our data from file."""
    df = pd.read_csv(path)
    return df

def get_class_dict(df):
    class_name =  list(df['class'].unique())
    index = np.arange(0, len(class_name))
    label_index = dict(zip(class_name, index))
    index_label = dict(zip(index, class_name))
    return (label_index, index_label)
    
def clean_data(df):
    mask = np.logical_and(df['frames'] >= SEQ_LEN, df['frames'] <= MAX_SEQ_LEN)
    df = df[mask]
    return df
def split_train_test(df):
    partition =  (df.groupby(['partition']))
    un = df['partition'].unique()
    train = partition.get_group(un[0])
    test = partition.get_group(un[1])
    return (train, test)

def preprocess_image(img):
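    # InceptionV3 with include_top=True expects 299x299 RGB inputs; OpenCV decodes frames as BGR.
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (299, 299))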
    return preprocess_input(img)
    
    
def encode_video(row, model, label_index):
    cap = cv2.VideoCapture(os.path.join("data","UCF-101",str(row["class"].iloc[0]) ,str(row["video_name"].iloc[0]) + ".avi"))
    images = []  
    for i in range(SEQ_LEN):
        ret, frame = cap.read()
        frame = preprocess_image(frame)
        images.append(frame)
    cap.release()  # free the video handle once the frames have been read
    
    features = model.predict(np.array(images))
    index = label_index[row["class"].iloc[0]]
    y_onehot = to_categorical(index, len(label_index.keys()))
    
    return features, y_onehot

def encode_dataset(data, model, label_index, phase):
    input_f = []
    output_y = []
    required_classes = ["ApplyEyeMakeup" , "ApplyLipstick" , "Archery" , "BabyCrawling" , "BalanceBeam" ,
                       "BandMarching" , "BaseballPitch" , "Basketball" , "BasketballDunk"]
   
    
    for i in tqdm(range(data.shape[0])):
    # Check whether the given row , is of a class that is required
        if str(data.iloc[[i]]["class"].iloc[0]) in required_classes:
 
            features,y =  encode_video(data.iloc[[i]], model, label_index)
            input_f.append(features)
            output_y.append(y)
        
    
    # Use a context manager so the HDF5 file is flushed and closed before it is read back.
    with h5py.File(phase+'_8'+'.h5', 'w') as f:
        f.create_dataset(phase, data=np.array(input_f))
        f.create_dataset(phase+"_labels", data=np.array(output_y))
    
    del input_f[:]
    del output_y[:]

def lstm():
    """Build a simple LSTM network. We pass the features extracted by
    our CNN to this model."""
    input_shape = (SEQ_LEN, 2048)
    # Model.
    model = Sequential()
    model.add(LSTM(2048, return_sequences=False,
                   input_shape=input_shape,
                   dropout=0.5))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.5))
    #model.add(Dense(len(label_index.keys()), activation='softmax'))
    model.add(Dense(99, activation='softmax'))
    
    checkpoint = ModelCheckpoint(filepath='models\\checkpoint-{epoch:02d}-{val_loss:.2f}.hdf5')
    
    tb_callback = TensorBoard(
    log_dir="logs",
    histogram_freq=2,
    write_graph=True
    )
    
    early_stopping = EarlyStopping(monitor = 'val_loss',patience= 10)
    
    callback_list = [checkpoint, tb_callback]

    optimizer = Adam(lr=1e-5, decay=1e-6)
    metrics = ['accuracy', 'top_k_categorical_accuracy']
    model.compile(loss='categorical_crossentropy', optimizer=optimizer,metrics=metrics)
    return model, callback_list


In [0]:
def main():
    # Get model with pretrained weights.
    base_model = InceptionV3(
    weights='imagenet',
    include_top=True)
    
    
    # We'll extract features at the final pool layer.
    model = Model(
        inputs=base_model.input,
        outputs=base_model.get_layer('avg_pool').output)
    
    # Getting the data
    df = get_data('.\\data\\data_file.csv')
    
    # Clean the data
    df_clean = clean_data(df)
    
    # Creating index-label maps and inverse_maps
    label_index, index_label = get_class_dict(df_clean)
    
    # Split the dataset into train and test
    train, test = split_train_test(df_clean)
    
    # Encoding the dataset
    encode_dataset(train, model, label_index, "train")
    encode_dataset(test,model,label_index,"test")
    
    x_train = HDF5Matrix('train_8.h5', 'train')
    y_train = HDF5Matrix('train_8.h5', 'train_labels')

    x_test = HDF5Matrix('test_8.h5', 'test')
    y_test = HDF5Matrix('test_8.h5', 'test_labels')
    
    model, callback_list = lstm()
    model.fit(x_train, y_train, batch_size = BATCH_SIZE, epochs = EPOCHS,
              verbose = 2,validation_data = (x_test, y_test),
              shuffle = 'batch', callbacks=callback_list)
    
    #model.save("Activity_Recognition.h5")

In [0]:
main()

# Sequence to Sequence Models - a general overview

Many times, we might have to convert one sequence to another. Really?  Where?

We do this in machine translation, for example. Models built for this purpose are known as sequence-to-sequence (**seq2seq**) models.

If we take a high-level view, a seq2seq model has an encoder, a decoder, and an intermediate step as its main components:
![alt text](https://cdn-images-1.medium.com/max/800/1*3lj8AGqfwEE5KCTJ-dXTvg.png)

A basic sequence-to-sequence model consists of two recurrent neural networks (RNNs): an encoder that processes the input and a decoder that generates the output. This basic architecture is depicted below.
![alt text](https://www.tensorflow.org/images/basic_seq2seq.png)

Each box in the picture above represents a cell of the RNN, most commonly a GRU cell or an LSTM cell. Encoder and decoder can share weights or, as is more common, use a different set of parameters.

In the basic model depicted above, every input has to be encoded into a fixed-size state vector, as that is the only thing passed to the decoder. To allow the decoder more direct access to the input, an **attention** mechanism was introduced. We'll look into details of the attention mechanism in the next part.

## Encoder

Our input sequence is "how are you". Each word from the input sequence is associated with a vector $w \in \mathbb{R}^d$ (via a lookup table). In our case, we have 3 words, thus our input will be transformed into $[w_0, w_1, w_2] \in \mathbb{R}^{d \times 3}$. Then, we simply run an LSTM over this sequence of vectors and store the last hidden state output by the LSTM: this will be our encoder representation $e$. Let's write the hidden states $[e_0, e_1, e_2]$ (and thus $e = e_2$).
![alt text](https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_encoder.svg)




## Decoder

Now that we have a vector $e$ that captures the meaning of the input sequence, we'll use it to generate the target sequence word by word. Feed another LSTM cell with $e$ as hidden state and a special start-of-sentence vector $w_{sos}$ as input. The LSTM computes the next hidden state $h_0 \in \mathbb{R}^h$. Then, we apply some function $g : \mathbb{R}^h \mapsto \mathbb{R}^V$ so that $s_0 := g(h_0) \in \mathbb{R}^V$ is a vector of the same size as the vocabulary.

\begin{equation}
h_0 = LSTM ( e , w_{s o s} ) 
\end{equation}
\begin{equation}
s_0 = g ( h_0 )
\end{equation}
\begin{equation}
p_0 = softmax ( s_0 )
\end{equation}
\begin{equation}
i_0 = argmax ( p_0 )
\end{equation}

Then, apply a softmax to $s_0$ to normalize it into a vector of probabilities $p_0 \in \mathbb{R}^V$. Now, each entry of $p_0$ measures how likely each word in the vocabulary is. Let's say that the word "comment" has the highest probability (and thus $i_0 = argmax ( p_0 )$ corresponds to the index of "comment"). Get the corresponding vector $w_{i_0} = w_{comment}$ and repeat the procedure: the LSTM will take $h_0$ as hidden state and $w_{comment}$ as input and will output a probability vector $p_1$ over the second word, etc.

\begin{equation}
h_1 = LSTM ( h_0 , w_{i_0} ) 
\end{equation}
\begin{equation}
s_1 = g ( h_1 )
\end{equation}
\begin{equation}
p_1 = softmax ( s_1 ) 
\end{equation}
\begin{equation}
i _1 = argmax ( p_1 )
\end{equation}

The decoding stops when the predicted word is a special end of sentence token.

![alt text](https://guillaumegenthial.github.io/assets/img2latex/seq2seq_vanilla_decoder.svg)
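
The decoding loop described above can be sketched as follows. To keep the example short, a plain tanh recurrence stands in for the LSTM cell, $g$ is taken to be a linear map, and all weights are random; these are assumptions for illustration, as are the token ids for `<s>` and `</s>` and the `max_len` cut-off.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, h = 6, 4, 5                 # vocabulary size, embedding size, hidden size
SOS, EOS = 1, 2                   # assumed ids of the start and end tokens
E = rng.standard_normal((V, d))           # lookup table: row i is the vector w_i
W_h = 0.5 * rng.standard_normal((h, h))   # stand-in recurrent weights
W_x = 0.5 * rng.standard_normal((h, d))
W_g = rng.standard_normal((V, h))         # g: R^h -> R^V, here a linear map

def step(h_prev, w):
    """Stand-in for the LSTM cell: a plain tanh recurrence."""
    return np.tanh(W_h @ h_prev + W_x @ w)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def greedy_decode(e, max_len=10):
    """Generate token ids one at a time, always taking the most likely word."""
    h_t, token, out = e, SOS, []
    for _ in range(max_len):
        h_t = step(h_t, E[token])         # h_t = LSTM(h_{t-1}, w_{i_{t-1}})
        p_t = softmax(W_g @ h_t)          # p_t = softmax(g(h_t))
        token = int(np.argmax(p_t))       # i_t = argmax(p_t)
        if token == EOS:                  # stop at the end-of-sentence token
            break
        out.append(token)
    return out

e = rng.standard_normal(h)                # encoder representation
decoded = greedy_decode(e)
print(decoded)
```

The `max_len` bound is a practical safeguard, since an untrained model may never emit the end-of-sentence token.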

# Attention !!!

![alt text](https://guillaumegenthial.github.io/assets/img2latex/seq2seq_attention_mechanism_new.svg)

## Seq2Seq with Attention 

The previous model has been refined over the past few years and has greatly benefited from what is known as attention. Attention is a mechanism that forces the model to learn to focus (= to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder's LSTM. One way of performing attention is as follows. We slightly modify the recurrence formula that we defined above by adding a new vector $c_t$ to the input of the LSTM:
\begin{equation}
h_t = LSTM ( h_{t-1} , [ w_{i_{t-1}}, c_t ] )
\end{equation}
\begin{equation}
s_t = g ( h_t )
\end{equation}
\begin{equation}
 p_t = softmax ( s_t )
\end{equation}
\begin{equation}
 i_t = argmax ( p_t )
\end{equation}

The vector $c_t$ is the attention (or context) vector. We compute a new context vector at each decoding step. First, with a function $f(h_{t-1}, e_{t'}) \mapsto \alpha_{t'} \in \mathbb{R}$, compute a score for each hidden state $e_{t'}$ of the encoder. Then, normalize the sequence of $\alpha_{t'}$ using a softmax and compute $c_t$ as the weighted average of the $e_{t'}$.

$$\alpha_{t'} = f(h_{t-1}, e_{t'}) \in \mathbb{R} \quad \text{for all } t'$$

$$\vec{\alpha} = \text{softmax}(\alpha)$$

$$c_t = \sum_{t'=0}^{n} \vec{\alpha}_{t'} e_{t'}$$

The choice of the function $f$ varies; common options include a dot product, a bilinear form, and a small feed-forward network.
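
The computation of the context vector might look like this in NumPy. The dot product is used as the score function $f$ purely as an assumption for the sketch (it requires the encoder and decoder hidden sizes to match), and the function name `attention_context` is hypothetical.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attention_context(h_prev, enc_states):
    """Compute the context vector c_t.

    h_prev:     decoder hidden state h_{t-1}, shape (h,)
    enc_states: encoder hidden states e_{t'}, one per row, shape (n, h)
    """
    scores = enc_states @ h_prev       # alpha_{t'} = f(h_{t-1}, e_{t'}), dot-product score
    weights = softmax(scores)          # normalized attention weights
    context = weights @ enc_states     # c_t = sum over t' of weight_{t'} * e_{t'}
    return context, weights

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((3, 4))   # three encoder states, e.g. "how are you"
h_prev = rng.standard_normal(4)
c_t, weights = attention_context(h_prev, enc_states)
print(weights, c_t.shape)
```

Because the weights are non-negative and sum to one, $c_t$ always lies within the range spanned by the encoder states.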

# One of the main uses of a sequence-to-sequence model is Neural Machine Translation

# Let's now execute a basic NMT with TensorFlow seq2seq
### The code may not seem trivial at first!

In [0]:
"""
The helper file
"""

import os
import pickle
import copy
import numpy as np

CODES = {'<unk>': 0,'<s>': 1, '</s>': 2}

def load_data(path):
    """
    Load Dataset from File
    """
    input_file = os.path.join(path)
    with open(input_file, 'r', encoding='utf-8') as f:
        data = f.read()

    return data

def preprocess_and_save_data(source_path, target_path):
    """
    Preprocess Text Data.  Save to file.
    """
    
    # Preprocess
    source_text = load_data(source_path)
    target_text = load_data(target_path)

    source_text = source_text.lower()
    target_text = target_text.lower()

    source_vocab_to_int, source_int_to_vocab = create_lookup_tables(source_text)
    
    target_vocab_to_int, target_int_to_vocab = create_lookup_tables(target_text)
    
    source_text, target_text = text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int)

    # Save Data
    pickle.dump((
        (source_text, target_text),
        (source_vocab_to_int, target_vocab_to_int),
        (source_int_to_vocab, target_int_to_vocab)), open('preprocess.p', 'wb'))

def load_preprocess():
    """
    Load the preprocessed data saved by preprocess_and_save_data
    """
    return pickle.load(open('preprocess.p', mode='rb'))

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    """
    vocab = set(text.split())
    vocab_to_int = copy.copy(CODES)
    
    for v_i, v in enumerate(vocab, len(CODES)):
        vocab_to_int[v] = v_i

    int_to_vocab = {v_i: v for v, v_i in vocab_to_int.items()}

    return vocab_to_int, int_to_vocab

def save_params(params):
    """
    Save parameters to file
    """
    pickle.dump(params, open('params.p', 'wb'))

def load_params():
    """
    Load parameters from file
    """
    return pickle.load(open('params.p', mode='rb'))

def batch_data(source, target, batch_size):
    """
    Batch source and target together
    """
    for batch_i in range(0, len(source)//batch_size):
        start_i = batch_i * batch_size
        source_batch = source[start_i:start_i + batch_size]
        target_batch = target[start_i:start_i + batch_size]
        yield np.array(pad_sentence_batch(source_batch)), np.array(pad_sentence_batch(target_batch))

def pad_sentence_batch(sentence_batch):
    """
    Pad sentence with </s> id
    """
    max_sentence = max([len(sentence) for sentence in sentence_batch])
    return [sentence + [CODES['</s>']] * (max_sentence - len(sentence))
            for sentence in sentence_batch]

def text_to_ids(source_text, target_text, source_vocab_to_int, target_vocab_to_int):
    """
    Convert source and target text to proper word ids
    :param source_text: String that contains all the source text.
    :param target_text: String that contains all the target text.
    :param source_vocab_to_int: Dictionary to go from the source words to an id
    :param target_vocab_to_int: Dictionary to go from the target words to an id
    :return: A tuple of lists (source_id_text, target_id_text)
    """
    source_text_to_id = [[source_vocab_to_int[word] for word in line.split()] for line in source_text.split('\n')]
    target_text_to_id = [[target_vocab_to_int[word] for word in line.split()] for line in target_text.split('\n')]
    
    return (source_text_to_id, target_text_to_id)
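
To make the padding behaviour concrete, here is a small standalone example. It repeats `CODES` and `pad_sentence_batch` from the helper file so that it runs on its own; the token ids in `batch` are made up for illustration.

```python
CODES = {'<unk>': 0, '<s>': 1, '</s>': 2}

def pad_sentence_batch(sentence_batch):
    """Pad every sentence to the length of the longest one, using the </s> id."""
    max_sentence = max(len(sentence) for sentence in sentence_batch)
    return [sentence + [CODES['</s>']] * (max_sentence - len(sentence))
            for sentence in sentence_batch]

# Three sentences of different lengths, already converted to word ids.
batch = [[5, 6, 7], [8], [9, 10]]
padded = pad_sentence_batch(batch)
print(padded)  # [[5, 6, 7], [8, 2, 2], [9, 10, 2]]
```

Note that `batch_data` in the helper file iterates over `len(source)//batch_size` batches, so a final batch smaller than `batch_size` is dropped.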

In [0]:
source_path = 'data/small_vocab_fr'
target_path = 'data/small_vocab_en'
source_text = load_data(source_path)
target_text = load_data(target_path)

In [0]:
preprocess_and_save_data(source_path, target_path)

In [0]:
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

In [0]:
import numpy as np

(source_int_text, target_int_text), (source_vocab_to_int, target_vocab_to_int), _ = load_preprocess()
# pad_sentence_batch(source_int_text)
source_vocab = len(source_vocab_to_int)
target_vocab = len(target_vocab_to_int)

In [0]:
class Seq2seqHyperparams(object):
    def __init__(self, hidden_units=256, n_layers_enconder=2,
                 n_layers_decoder=2, num_encoder_symbols=source_vocab, 
                 num_decoder_symbols=target_vocab, learning_rate=0.01,
                 embedding_size=15, max_gradient_norm=5.0, dtype=tf.float32,
                 epochs=1, dropout=0.2, forget_bias=1.0,
                 use_beam_search=True, beam_width=10, length_penalty_weight=0.0,
                 use_attention=True, learning_rate_decay=False, 
                 use_bidirectional_enconder=False):
    
        self.hidden_units = hidden_units
        self.n_layers_enconder = n_layers_enconder
        self.n_layers_decoder = n_layers_decoder
        self.num_encoder_symbols = num_encoder_symbols
        self.num_decoder_symbols = num_decoder_symbols
        self.learning_rate = learning_rate
        self.embedding_size = embedding_size
        self.max_gradient_norm = max_gradient_norm
        self.dtype = dtype
        self.dropout = dropout
        self.forget_bias = forget_bias
        self.use_beam_search = use_beam_search
        self.beam_width = beam_width
        self.length_penalty_weight = length_penalty_weight
        self.use_attention = use_attention
        self.learning_rate_decay = learning_rate_decay
        self.use_bidirectional_enconder = use_bidirectional_enconder


        # Extra vocabulary symbols
        unk = '<unk>'
        sos = '<s>'
        eos = '</s>' # also function as PAD
        self.extra_tokens = [unk, sos, eos]
        self.unk_token = self.extra_tokens.index(unk) #unk_token = 0
        self.start_token = self.extra_tokens.index(sos) # start_token = 1
        self.end_token = self.extra_tokens.index(eos)   # end_token = 2

hparams = Seq2seqHyperparams()

In [0]:
# Note: tensorflow.contrib exists only in TensorFlow 1.x; it was removed in TensorFlow 2.
import tensorflow.contrib.seq2seq as seq2seq
from tensorflow.contrib.rnn import MultiRNNCell
from tensorflow import layers

tf.reset_default_graph()

train_graph = tf.Graph()
with train_graph.as_default():
    
    ### DEFINING PLACEHOLDERS ###

    # encoder_inputs: [batch_size, max_time_steps]
    encoder_inputs = tf.placeholder(dtype=tf.int32,
                shape=(None, None), name='encoder_inputs')

    # encoder_inputs_length: [batch_size]
    encoder_inputs_length = tf.placeholder(
                dtype=tf.int32, shape=(None,), name='encoder_inputs_length')

    # get dynamic batch_size
    batch_size = tf.shape(encoder_inputs)[0]

    ### TRAIN MODE PLACEHOLDERS ###

    # decoder_inputs: [batch_size, max_time_steps]
    decoder_inputs = tf.placeholder(
                    dtype=tf.int32, shape=(None, None), name='decoder_inputs')

    # decoder_inputs_length: [batch_size]
    decoder_inputs_length = tf.placeholder(
                    dtype=tf.int32, shape=(None,), name='decoder_inputs_length')

    decoder_start_token = tf.ones(
                    shape=[batch_size, 1], dtype=tf.int32) * hparams.start_token
    decoder_end_token = tf.ones(
                    shape=[batch_size, 1], dtype=tf.int32) * hparams.end_token  


    # decoder_inputs_train: [batch_size , max_time_steps + 1]
    # insert sos symbol in front of each decoder input
    decoder_inputs_train = tf.concat([decoder_start_token,
                                          decoder_inputs], axis=1)

    # decoder_inputs_length_train: [batch_size]
    decoder_inputs_length_train = decoder_inputs_length + 1

    # decoder_targets_train: [batch_size, max_time_steps + 1]
    # insert eos symbol at the end of each decoder input
    decoder_targets_train = tf.concat([decoder_inputs,
                                           decoder_end_token], axis=1)
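
The two `tf.concat` calls above shift the target sequence by one position: the decoder reads `<s>` followed by the target, and is trained to emit the target followed by `</s>`. A small list-based sketch of the same shift, assuming the token ids 1 and 2 from the hyperparameters; `make_decoder_io` is a hypothetical name:

```python
START_TOKEN = 1  # '<s>'
END_TOKEN = 2    # '</s>', also used as padding

def make_decoder_io(target_ids):
    """Return (decoder input, decoder target) for one target sequence."""
    decoder_input = [START_TOKEN] + target_ids
    decoder_target = target_ids + [END_TOKEN]
    return decoder_input, decoder_target

dec_in, dec_target = make_decoder_io([7, 8, 9])
```

At time step `t` the decoder receives `dec_in[t]` and is scored against `dec_target[t]`, so both lists have the same length, one more than the raw target.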

In [0]:
with train_graph.as_default():
    ## DEFINING ENCODER ##

    encoder_embeddings = tf.Variable(tf.random_uniform([hparams.num_encoder_symbols, hparams.embedding_size], -1.0, 1.0),
                                     dtype=hparams.dtype)

    # Embedded_inputs: [batch_size, time_step, embedding_size]
    encoder_inputs_embedded = tf.nn.embedding_lookup(
        params=encoder_embeddings, ids=encoder_inputs)

    if hparams.use_bidirectional_enconder:  # bidirectional encoder: experimental, not verified to work
        
        # Each direction gets half of the encoder layers, so the interleaved
        # forward/backward states match a decoder that is n_layers_enconder deep.
        # This assumes n_layers_enconder is even and at least 2.
        num_bi_layers = int(hparams.n_layers_enconder / 2)
        num_residual_layers = hparams.n_layers_enconder - 1
        num_bi_residual_layers = int(num_residual_layers / 2)
        
        print(num_bi_layers, num_residual_layers, num_bi_residual_layers)
        
        def build_direction_cell():
            # Build a fresh stack on every call: the forward and backward
            # RNNs must not share the same cell objects.
            cell_list = []
            for i in range(num_bi_layers):
                cell = tf.contrib.rnn.BasicLSTMCell(hparams.hidden_units, forget_bias=hparams.forget_bias)

                if (i >= num_bi_layers - num_bi_residual_layers):
                    cell = tf.contrib.rnn.ResidualWrapper(cell, residual_fn=None)
                    if hparams.dropout > 0.0:
                        cell = tf.contrib.rnn.DropoutWrapper(
                            cell=cell, input_keep_prob=(1.0 - hparams.dropout))
                
                cell_list.append(cell)
                
            if len(cell_list) == 1:  # Single layer.
                return cell_list[0]
            return tf.contrib.rnn.MultiRNNCell(cell_list)  # Multi layers

        fw_cell = build_direction_cell()
        bw_cell = build_direction_cell()

        bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(
                                                        fw_cell,
                                                        bw_cell,
                                                        encoder_inputs_embedded,
                                                        dtype=hparams.dtype,
                                                        sequence_length=encoder_inputs_length,
                                                        time_major=False,
                                                        swap_memory=True)
        print(bi_outputs, "\n\n", bi_state)

        encoder_outputs, bi_encoder_state = tf.concat(bi_outputs, -1), bi_state
        
        if num_bi_layers == 1:
            encoder_last_state = bi_encoder_state
        else:
            # alternatively concat forward and backward states
            encoder_state = []
            for layer_id in range(num_bi_layers):
                encoder_state.append(bi_encoder_state[0][layer_id])  # forward
                encoder_state.append(bi_encoder_state[1][layer_id])  # backward
            encoder_last_state = tuple(encoder_state)

        encoder_state = bi_encoder_state
        
    else:
        # Build RNN cell
        cells = []
        for _ in range(hparams.n_layers_enconder):
            cell = tf.contrib.rnn.BasicLSTMCell(hparams.hidden_units, forget_bias=hparams.forget_bias)
            if hparams.dropout > 0.0:
                cell = tf.contrib.rnn.DropoutWrapper(
                    cell=cell, input_keep_prob=(1.0 - hparams.dropout))
            cells.append(cell)
        if hparams.n_layers_enconder == 1:
            encoder_cells = cells[0]
        else:
            encoder_cells = tf.contrib.rnn.MultiRNNCell(cells)

        encoder_outputs, encoder_last_state = tf.nn.dynamic_rnn(
            cell=encoder_cells, inputs=encoder_inputs_embedded,
            sequence_length=encoder_inputs_length, dtype=hparams.dtype,
            time_major=False)
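
When the bidirectional branch above has more than one layer per direction, it combines the per-layer states by alternating forward and backward entries. A minimal, framework-free sketch of that ordering; `interleave_states` and the string placeholders are illustrative names, not part of the notebook:

```python
def interleave_states(forward_states, backward_states):
    """Alternate forward and backward per-layer states: f0, b0, f1, b1, ..."""
    if len(forward_states) != len(backward_states):
        raise ValueError('both directions need the same number of layers')
    combined = []
    for fw, bw in zip(forward_states, backward_states):
        combined.append(fw)
        combined.append(bw)
    return tuple(combined)

states = interleave_states(['fw_layer0', 'fw_layer1'],
                           ['bw_layer0', 'bw_layer1'])
```

The resulting tuple has twice as many entries as each direction has layers, which is why a decoder initialised from it needs that many layers.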

In [0]:
with train_graph.as_default():
    ### DEFINING DECODER ###

    # Building decoder_cell
    cells = []
    # Build RNN cell
    for _ in range(hparams.n_layers_decoder):
        cell = tf.contrib.rnn.BasicLSTMCell(hparams.hidden_units, forget_bias=hparams.forget_bias)
        if hparams.dropout > 0.0:
            cell = tf.contrib.rnn.DropoutWrapper(
                cell=cell, input_keep_prob=(1.0 - hparams.dropout))
        cells.append(cell)
    if hparams.n_layers_decoder == 1:
        decoder_cells = cells[0]
    else:
        decoder_cells = tf.contrib.rnn.MultiRNNCell(cells)

    if hparams.use_attention:
        memory = encoder_outputs
        
        attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
            hparams.hidden_units,
            memory,
            memory_sequence_length=encoder_inputs_length,
            normalize=True)
        
        decoder_cells_train = tf.contrib.seq2seq.AttentionWrapper(
            decoder_cells,
            attention_mechanism,
            attention_layer_size=hparams.hidden_units,
            alignment_history=False,
            output_attention=True,
            name="attention")
        
        decoder_initial_state = decoder_cells_train.zero_state(batch_size, hparams.dtype).clone(
          cell_state=encoder_last_state)
        
    else:
        decoder_cells_train = decoder_cells
        decoder_initial_state = encoder_last_state

    decoder_embeddings = tf.Variable(tf.random_uniform([hparams.num_decoder_symbols, hparams.embedding_size], -1.0, 1.0), dtype=hparams.dtype)
    
    # decoder_inputs_embedded: [batch_size, max_time_step + 1, embedding_size]
    decoder_inputs_embedded = tf.nn.embedding_lookup(
        params=decoder_embeddings, ids=decoder_inputs_train)
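
`BahdanauAttention` scores every encoder output against the current decoder state with an additive (tanh) function and turns the scores into weights with a softmax. A rough pure-Python sketch of that idea, assuming tiny hand-picked weight matrices; it leaves out the `normalize=True` weight normalization and the `memory_sequence_length` masking used above, and every name in it is hypothetical:

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_attention(query, memory, w_query, w_memory, v):
    """Score each memory row m_i as v . tanh(W_q q + W_m m_i)."""
    projected_query = matvec(w_query, query)
    scores = []
    for row in memory:
        projected_row = matvec(w_memory, row)
        hidden = [math.tanh(q + m) for q, m in zip(projected_query, projected_row)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    # Context vector: attention-weighted average of the memory rows.
    context = [sum(w * row[d] for w, row in zip(weights, memory))
               for d in range(len(memory[0]))]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = additive_attention(
    query=[1.0, 0.0], memory=memory,
    w_query=identity, w_memory=identity, v=[1.0, 1.0])
```

In the graph above, the role of `memory` is played by `encoder_outputs` and the query is the decoder cell state at each step.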

In [0]:
with train_graph.as_default():
    ### TRAIN MODE ###
    
    # Helper to feed inputs for training: read inputs from dense ground truth vectors
    training_helper = seq2seq.TrainingHelper(inputs=decoder_inputs_embedded,
                                       sequence_length=decoder_inputs_length_train,
                                       time_major=False,
                                        name='training_helper')

    training_decoder = seq2seq.BasicDecoder(cell=decoder_cells_train,
                                       helper=training_helper,
                                       initial_state=decoder_initial_state)

    # decoder_outputs_train: BasicDecoderOutput
    #                        namedtuple(rnn_output, sample_id)
    # No output_layer is passed to BasicDecoder, so rnn_output holds the raw cell
    # outputs; the projection to the vocabulary is applied afterwards.
    # decoder_outputs_train.rnn_output: [batch_size, max_time_step + 1, hidden_units] if output_time_major=False
    #                                   [max_time_step + 1, batch_size, hidden_units] if output_time_major=True
    # decoder_outputs_train.sample_id: [batch_size, max_time_step + 1], tf.int32
    (decoder_outputs_train, decoder_last_state_train, 
         decoder_outputs_length_decode)  = seq2seq.dynamic_decode(decoder=training_decoder,
                                                        output_time_major=False,
                                                        swap_memory=True,
                                                        impute_finished=True)

    # More efficient to do the projection on the batch-time-concatenated tensor
    # logits_train: [batch_size, max_time_step + 1, num_decoder_symbols]
    
    sample_id = decoder_outputs_train.sample_id
    
    output_layer = layers.Dense(hparams.num_decoder_symbols, name='output_projection')
    logits_train = output_layer(decoder_outputs_train.rnn_output)

In [0]:
with train_graph.as_default():
    
    ### LOSS, GRADIENT AND OPTIMIZATION ###
    
    if hparams.learning_rate_decay:
        global_step = tf.Variable(0, trainable=False)

        learning_rate = tf.constant(hparams.learning_rate)

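        # Luong-style ("luong10") decay scheme: keep the rate constant for the
        # first half, then halve it decay_times times over the second half.
        # Note: global_step is not passed to apply_gradients and the optimizer
        # below uses hparams.learning_rate, so this schedule only takes effect
        # once both are wired in.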
        decay_factor = 0.5
        start_decay_step = int(hparams.epochs / 2)
        decay_times = 10

        remain_steps = hparams.epochs - start_decay_step
        decay_steps = int(remain_steps / decay_times)

        learning_rate = tf.cond(global_step < start_decay_step,
                                lambda: hparams.learning_rate,
                                lambda: tf.train.exponential_decay(
                                    hparams.learning_rate,
                                    (global_step - start_decay_step),
                                    decay_steps, decay_factor, staircase=True),
                                name="learning_rate_decay_cond")
    
    # Maximum decoder time_steps in current batch
    max_decoder_length = tf.reduce_max(decoder_inputs_length_train)
    
    # masks: masking for valid and padded time steps, [batch_size, max_time_step + 1]
    target_weights = tf.sequence_mask(lengths=decoder_inputs_length_train, 
                             maxlen=max_decoder_length, dtype=hparams.dtype, name='masks')
    
    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=decoder_targets_train, logits=logits_train)
    
    loss = (tf.reduce_sum(crossent * target_weights) /
        tf.cast(batch_size, dtype=hparams.dtype))

    trainable_params = tf.trainable_variables()
    
    opt = tf.train.AdamOptimizer(learning_rate=hparams.learning_rate)
    
    gradients = tf.gradients(loss, 
                             trainable_params)
    
    clip_gradients, gradient_norm = tf.clip_by_global_norm(gradients, hparams.max_gradient_norm)
    
    updates = opt.apply_gradients(
            zip(clip_gradients, trainable_params))
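
The loss above masks padded time steps, sums the remaining cross-entropy terms, and divides by the batch size only. A small pure-Python sketch of the same computation with made-up logits; `masked_sequence_loss` is a hypothetical name:

```python
import math

def masked_sequence_loss(logits, targets, lengths):
    """Sum cross-entropy over valid time steps, then divide by batch size."""
    total = 0.0
    for seq_logits, seq_targets, length in zip(logits, targets, lengths):
        for t, (step_logits, target) in enumerate(zip(seq_logits, seq_targets)):
            if t >= length:  # padded step: its mask weight is 0
                continue
            shift = max(step_logits)
            log_norm = shift + math.log(sum(math.exp(x - shift) for x in step_logits))
            total += log_norm - step_logits[target]
    return total / len(logits)

# Two sequences, two time steps, vocabulary of 2; the second sequence has length 1.
logits = [[[0.0, 0.0], [0.0, 0.0]],
          [[0.0, 0.0], [5.0, -5.0]]]
targets = [[0, 1], [1, 1]]
loss = masked_sequence_loss(logits, targets, lengths=[2, 1])
```

Because the sum is not divided by the number of time steps, the value grows with sequence length, so it is not directly comparable to a per-token loss.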

In [0]:
with train_graph.as_default():

    ### INFERENCE MODE ###
    start_tokens = tf.fill([batch_size], hparams.start_token)
    
    # Only beam search needs every encoder tensor repeated beam_width times;
    # greedy decoding keeps the original batch size.
    tile_multiplier = hparams.beam_width if hparams.use_beam_search else 1
    
    decoder_initial_state_infer = tf.contrib.seq2seq.tile_batch(
                  encoder_last_state, multiplier=tile_multiplier)
    
    if hparams.use_attention:
        memory_infer = tf.contrib.seq2seq.tile_batch(
          memory, multiplier=tile_multiplier)
        
        source_sequence_length = tf.contrib.seq2seq.tile_batch(
          encoder_inputs_length, multiplier=tile_multiplier)
        
        # Use a separate name so the training graph's batch_size is not overwritten.
        batch_size_infer = batch_size * tile_multiplier
        
        attention_mechanism_infer = tf.contrib.seq2seq.BahdanauAttention(
            hparams.hidden_units,
            memory_infer,
            memory_sequence_length=source_sequence_length,
            normalize=True)
        
        decoder_cells_infer = tf.contrib.seq2seq.AttentionWrapper(
            decoder_cells,
            attention_mechanism_infer,
            attention_layer_size=hparams.hidden_units,
            alignment_history=False,
            output_attention=True,
            name="attention_infer")
        
        decoder_initial_state_infer = decoder_cells_infer.zero_state(batch_size_infer, hparams.dtype).clone(
          cell_state=decoder_initial_state_infer)
    
    else:
        # Without attention, inference reuses the plain decoder cells.
        decoder_cells_infer = decoder_cells
    
    if hparams.use_beam_search:

        inference_decoder = tf.contrib.seq2seq.BeamSearchDecoder(
              cell=decoder_cells_infer,
              embedding=decoder_embeddings,
              start_tokens=start_tokens,
              end_token=hparams.end_token,
              initial_state=decoder_initial_state_infer,
              beam_width=hparams.beam_width,
              output_layer=output_layer,
              length_penalty_weight=hparams.length_penalty_weight)
        
    else:
        inference_helper = seq2seq.GreedyEmbeddingHelper(decoder_embeddings,
                                                        start_tokens=start_tokens,
                                                        end_token=hparams.end_token)

        inference_decoder = seq2seq.BasicDecoder(cell=decoder_cells_infer,
                                                 helper=inference_helper,
                                                 initial_state=decoder_initial_state_infer,
                                                 output_layer=output_layer)
    
    maximum_iterations = tf.round(tf.reduce_max(encoder_inputs_length) * 2)
    
    (decoder_infer_outputs, decoder_infer_last_state,
                 decoder_infer_outputs_length) = (seq2seq.dynamic_decode(
                    decoder=inference_decoder,
                    output_time_major=False,
                    maximum_iterations=maximum_iterations))
    
    if hparams.use_beam_search:
        decoder_pred_decode = decoder_infer_outputs.predicted_ids
        tf.identity(decoder_pred_decode, 'decoder_pred_decode')
    
    else:
        logits_infer = decoder_infer_outputs.rnn_output
        sample_id_infer = decoder_infer_outputs.sample_id                                                                       
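
Beam search keeps `beam_width` hypotheses per source sentence, which is why the encoder-side tensors above go through `tile_batch`: each batch entry is repeated `multiplier` times, with the copies kept next to each other. A list-based sketch of that layout; `tile_batch_sketch` is a hypothetical stand-in, not the TensorFlow function:

```python
def tile_batch_sketch(batch, multiplier):
    """Repeat each batch entry `multiplier` times, keeping repeats adjacent."""
    tiled = []
    for entry in batch:
        tiled.extend([entry] * multiplier)
    return tiled

encoder_lengths = [5, 3]
tiled_lengths = tile_batch_sketch(encoder_lengths, multiplier=3)
```

The tiled batch therefore has `batch_size * beam_width` rows, while `start_tokens` stays at the original `batch_size`.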

In [0]:
#training parameters
class TrainingHyperparams(object):
    def __init__(self, epochs=20, batch_size=512):
        self.epochs = epochs
        self.batch_size = batch_size

train_hparams = TrainingHyperparams()
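
The training loop consumes the data in chunks of `batch_size` through a `batch_data` helper that is defined elsewhere in the notebook. As a rough idea of what such a helper could look like, here is a sketch that right-pads each batch to its longest sequence; `batch_data_sketch`, `pad_batch` and the padding id are assumptions, not the notebook's actual implementation:

```python
PAD_ID = 2  # assumption: '</s>' doubles as the padding id, as in the hyperparameters

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def batch_data_sketch(sources, targets, batch_size):
    """Yield padded (source_batch, target_batch) pairs; drops the last partial batch."""
    for start in range(0, len(sources) - batch_size + 1, batch_size):
        yield (pad_batch(sources[start:start + batch_size]),
               pad_batch(targets[start:start + batch_size]))

batches = list(batch_data_sketch(
    sources=[[4, 5], [6], [7, 8, 9]],
    targets=[[10], [11, 12], [13]],
    batch_size=2))
```

With three examples and a batch size of 2, the sketch yields a single batch and discards the leftover example.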

In [0]:
def sentence_to_seq(sentence, vocab_to_int):
    """
    Convert a sentence to a sequence of ids
    :param sentence: String
    :param vocab_to_int: Dictionary to go from the words to an id
    :return: List of word ids
    """
    lower_case_words = [word.lower() for word in sentence.split()]
    
    word_id = [vocab_to_int.get(word, vocab_to_int['<unk>']) for word in lower_case_words]
    
    return word_id

import time

### TRAINING ###
save_path = 'checkpoints/dev'

train_source = source_int_text[train_hparams.batch_size:]
train_target = target_int_text[train_hparams.batch_size:]

valid_source = source_int_text[:train_hparams.batch_size]
valid_target = target_int_text[:train_hparams.batch_size]

get_accuracy_every = 30

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    
    # Time the whole run, not just the last batch.
    start_time = time.time()
    
    for epoch_i in range(train_hparams.epochs):
        
        for batch_i, (source_batch, target_batch) in enumerate(
                batch_data(train_source, train_target, train_hparams.batch_size)):
            
            # Length of every sequence in the batch, as fed to the graph
            source_batch_seq_length = [np.shape(item)[0] for item in source_batch]
            target_batch_seq_length = [np.shape(item)[0] for item in target_batch]
                
#             if (source_batch_seq_length[0] > 300): #OOM problems with datasets containing very large sentences
#                 continue
                
            _, loss_val = sess.run(
                [updates, loss],
                {encoder_inputs: source_batch,
                 decoder_inputs: target_batch,
                 encoder_inputs_length: source_batch_seq_length,
                 decoder_inputs_length: target_batch_seq_length})

            print('Epoch {:>3} Batch {:>4}/{}, Loss: {:>6.3f}'
                  .format(epoch_i, batch_i, len(source_int_text) // train_hparams.batch_size, loss_val))
            
    end_time = time.time()
    print("Training time: ", end_time - start_time)
    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_path)
    print('Model Trained and Saved')

Epoch   0 Batch    0/269, Loss: 97.877
Epoch   0 Batch    1/269, Loss: 71.076
Epoch   0 Batch    2/269, Loss: 114.789
Epoch   0 Batch    3/269, Loss: 79.170
Epoch   0 Batch    4/269, Loss: 101.692
Epoch   0 Batch    5/269, Loss: 130.967
Epoch   0 Batch    6/269, Loss: 79.163
Epoch   0 Batch    7/269, Loss: 71.977
Epoch   0 Batch    8/269, Loss: 85.193
Epoch   0 Batch    9/269, Loss: 77.328
Epoch   0 Batch   10/269, Loss: 77.700
Epoch   0 Batch   11/269, Loss: 62.228
Epoch   0 Batch   12/269, Loss: 58.238
Epoch   0 Batch   13/269, Loss: 69.596
Epoch   0 Batch   14/269, Loss: 61.465
Epoch   0 Batch   15/269, Loss: 58.058
Epoch   0 Batch   16/269, Loss: 56.366
Epoch   0 Batch   17/269, Loss: 54.344
Epoch   0 Batch   18/269, Loss: 52.447
Epoch   0 Batch   19/269, Loss: 53.707
Epoch   0 Batch   20/269, Loss: 49.734
Epoch   0 Batch   21/269, Loss: 50.285
Epoch   0 Batch   22/269, Loss: 56.243
Epoch   0 Batch   23/269, Loss: 50.994
Epoch   0 Batch   24/269, Loss: 50.204
...

Epoch   0 Batch  210/269, Loss:  1.412
Epoch   0 Batch  211/269, Loss:  1.338
Epoch   0 Batch  212/269, Loss:  1.461
Epoch   0 Batch  213/269, Loss:  1.526
Epoch   0 Batch  214/269, Loss:  1.507
Epoch   0 Batch  215/269, Loss:  1.403
Epoch   0 Batch  216/269, Loss:  1.465
Epoch   0 Batch  217/269, Loss:  1.329
Epoch   0 Batch  218/269, Loss:  1.346
Epoch   0 Batch  219/269, Loss:  1.348
Epoch   0 Batch  220/269, Loss:  1.320
Epoch   0 Batch  221/269, Loss:  1.263
Epoch   0 Batch  222/269, Loss:  1.159
Epoch   0 Batch  223/269, Loss:  1.210
Epoch   0 Batch  224/269, Loss:  1.399
Epoch   0 Batch  225/269, Loss:  1.172
Epoch   0 Batch  226/269, Loss:  1.337
Epoch   0 Batch  227/269, Loss:  1.211
Epoch   0 Batch  228/269, Loss:  1.163
Epoch   0 Batch  229/269, Loss:  1.116
Epoch   0 Batch  230/269, Loss:  1.206
Epoch   0 Batch  231/269, Loss:  1.157
Epoch   0 Batch  232/269, Loss:  1.074
Epoch   0 Batch  233/269, Loss:  1.142
Epoch   0 Batch  234/269, Loss:  1.160
...

Epoch   1 Batch  153/269, Loss:  0.618
Epoch   1 Batch  154/269, Loss:  0.625
Epoch   1 Batch  155/269, Loss:  0.590
Epoch   1 Batch  156/269, Loss:  0.617
Epoch   1 Batch  157/269, Loss:  0.615
Epoch   1 Batch  158/269, Loss:  0.555
Epoch   1 Batch  159/269, Loss:  0.606
Epoch   1 Batch  160/269, Loss:  0.620
Epoch   1 Batch  161/269, Loss:  0.590
Epoch   1 Batch  162/269, Loss:  0.575
Epoch   1 Batch  163/269, Loss:  0.596
Epoch   1 Batch  164/269, Loss:  0.663
Epoch   1 Batch  165/269, Loss:  0.648
Epoch   1 Batch  166/269, Loss:  0.604
Epoch   1 Batch  167/269, Loss:  0.646
Epoch   1 Batch  168/269, Loss:  0.603
Epoch   1 Batch  169/269, Loss:  0.586
Epoch   1 Batch  170/269, Loss:  0.552
Epoch   1 Batch  171/269, Loss:  0.608
Epoch   1 Batch  172/269, Loss:  0.687
Epoch   1 Batch  173/269, Loss:  0.577
Epoch   1 Batch  174/269, Loss:  0.604
Epoch   1 Batch  175/269, Loss:  0.685
Epoch   1 Batch  176/269, Loss:  0.612
Epoch   1 Batch  177/269, Loss:  0.529
...

Epoch   2 Batch   96/269, Loss:  0.586
Epoch   2 Batch   97/269, Loss:  0.552
Epoch   2 Batch   98/269, Loss:  0.484
Epoch   2 Batch   99/269, Loss:  0.514
Epoch   2 Batch  100/269, Loss:  0.511
Epoch   2 Batch  101/269, Loss:  0.584
Epoch   2 Batch  102/269, Loss:  0.563
Epoch   2 Batch  103/269, Loss:  0.583
Epoch   2 Batch  104/269, Loss:  0.560
Epoch   2 Batch  105/269, Loss:  0.541
Epoch   2 Batch  106/269, Loss:  0.528
Epoch   2 Batch  107/269, Loss:  0.547
Epoch   2 Batch  108/269, Loss:  0.493
Epoch   2 Batch  109/269, Loss:  0.595
Epoch   2 Batch  110/269, Loss:  0.522
Epoch   2 Batch  111/269, Loss:  0.530
Epoch   2 Batch  112/269, Loss:  0.566
Epoch   2 Batch  113/269, Loss:  0.542
Epoch   2 Batch  114/269, Loss:  0.516
Epoch   2 Batch  115/269, Loss:  0.505
Epoch   2 Batch  116/269, Loss:  0.526
Epoch   2 Batch  117/269, Loss:  0.593
Epoch   2 Batch  118/269, Loss:  0.537
Epoch   2 Batch  119/269, Loss:  0.526
Epoch   2 Batch  120/269, Loss:  0.568
...

Epoch   3 Batch   39/269, Loss:  0.508
Epoch   3 Batch   40/269, Loss:  0.479
Epoch   3 Batch   41/269, Loss:  0.479
Epoch   3 Batch   42/269, Loss:  0.487
Epoch   3 Batch   43/269, Loss:  0.509
Epoch   3 Batch   44/269, Loss:  0.502
Epoch   3 Batch   45/269, Loss:  0.470
Epoch   3 Batch   46/269, Loss:  0.467
Epoch   3 Batch   47/269, Loss:  0.503
Epoch   3 Batch   48/269, Loss:  0.501
Epoch   3 Batch   49/269, Loss:  0.469
Epoch   3 Batch   50/269, Loss:  0.479
Epoch   3 Batch   51/269, Loss:  0.482
Epoch   3 Batch   52/269, Loss:  0.552
Epoch   3 Batch   53/269, Loss:  0.505
Epoch   3 Batch   54/269, Loss:  0.411
Epoch   3 Batch   55/269, Loss:  0.503
Epoch   3 Batch   56/269, Loss:  0.440
Epoch   3 Batch   57/269, Loss:  0.491
Epoch   3 Batch   58/269, Loss:  0.550
Epoch   3 Batch   59/269, Loss:  0.516
Epoch   3 Batch   60/269, Loss:  0.496
Epoch   3 Batch   61/269, Loss:  0.550
Epoch   3 Batch   62/269, Loss:  0.428
Epoch   3 Batch   63/269, Loss:  0.443
...

Epoch   3 Batch  250/269, Loss:  0.488
Epoch   3 Batch  251/269, Loss:  0.470
Epoch   3 Batch  252/269, Loss:  0.485
Epoch   3 Batch  253/269, Loss:  0.511
Epoch   3 Batch  254/269, Loss:  0.516
Epoch   3 Batch  255/269, Loss:  0.488
Epoch   3 Batch  256/269, Loss:  0.486
Epoch   3 Batch  257/269, Loss:  0.496
Epoch   3 Batch  258/269, Loss:  0.488
Epoch   3 Batch  259/269, Loss:  0.508
Epoch   3 Batch  260/269, Loss:  0.421
Epoch   3 Batch  261/269, Loss:  0.451
Epoch   3 Batch  262/269, Loss:  0.509
Epoch   3 Batch  263/269, Loss:  0.519
Epoch   3 Batch  264/269, Loss:  0.489
Epoch   3 Batch  265/269, Loss:  0.493
Epoch   3 Batch  266/269, Loss:  0.483
Epoch   3 Batch  267/269, Loss:  0.456
Epoch   4 Batch    0/269, Loss:  0.516
Epoch   4 Batch    1/269, Loss:  0.499
Epoch   4 Batch    2/269, Loss:  0.588
Epoch   4 Batch    3/269, Loss:  0.481
Epoch   4 Batch    4/269, Loss:  0.458
Epoch   4 Batch    5/269, Loss:  0.465
Epoch   4 Batch    6/269, Loss:  0.485
...

Epoch   4 Batch  193/269, Loss:  0.510
Epoch   4 Batch  194/269, Loss:  0.454
Epoch   4 Batch  195/269, Loss:  0.477
Epoch   4 Batch  196/269, Loss:  0.517
Epoch   4 Batch  197/269, Loss:  0.501
Epoch   4 Batch  198/269, Loss:  0.475
Epoch   4 Batch  199/269, Loss:  0.460
Epoch   4 Batch  200/269, Loss:  0.509
Epoch   4 Batch  201/269, Loss:  0.530
Epoch   4 Batch  202/269, Loss:  0.515
Epoch   4 Batch  203/269, Loss:  0.453
Epoch   4 Batch  204/269, Loss:  0.468
Epoch   4 Batch  205/269, Loss:  0.524
Epoch   4 Batch  206/269, Loss:  0.488
Epoch   4 Batch  207/269, Loss:  0.483
Epoch   4 Batch  208/269, Loss:  0.512
Epoch   4 Batch  209/269, Loss:  0.426
Epoch   4 Batch  210/269, Loss:  0.451
Epoch   4 Batch  211/269, Loss:  0.455
Epoch   4 Batch  212/269, Loss:  0.478
Epoch   4 Batch  213/269, Loss:  0.461
Epoch   4 Batch  214/269, Loss:  0.471
Epoch   4 Batch  215/269, Loss:  0.464
Epoch   4 Batch  216/269, Loss:  0.466
Epoch   4 Batch  217/269, Loss:  0.480
...

Epoch   5 Batch  136/269, Loss:  0.482
Epoch   5 Batch  137/269, Loss:  0.455
Epoch   5 Batch  138/269, Loss:  0.500
Epoch   5 Batch  139/269, Loss:  0.514
Epoch   5 Batch  140/269, Loss:  0.494
Epoch   5 Batch  141/269, Loss:  0.526
Epoch   5 Batch  142/269, Loss:  0.442
Epoch   5 Batch  143/269, Loss:  0.444
Epoch   5 Batch  144/269, Loss:  0.495
Epoch   5 Batch  145/269, Loss:  0.420
Epoch   5 Batch  146/269, Loss:  0.506
Epoch   5 Batch  147/269, Loss:  0.535
Epoch   5 Batch  148/269, Loss:  0.437
Epoch   5 Batch  149/269, Loss:  0.445
Epoch   5 Batch  150/269, Loss:  0.498
Epoch   5 Batch  151/269, Loss:  0.435
Epoch   5 Batch  152/269, Loss:  0.452
Epoch   5 Batch  153/269, Loss:  0.495
Epoch   5 Batch  154/269, Loss:  0.469
Epoch   5 Batch  155/269, Loss:  0.476
Epoch   5 Batch  156/269, Loss:  0.455
Epoch   5 Batch  157/269, Loss:  0.475
Epoch   5 Batch  158/269, Loss:  0.425
Epoch   5 Batch  159/269, Loss:  0.435
Epoch   5 Batch  160/269, Loss:  0.528
...

Epoch   6 Batch   79/269, Loss:  0.476
Epoch   6 Batch   80/269, Loss:  0.486
Epoch   6 Batch   81/269, Loss:  0.429
Epoch   6 Batch   82/269, Loss:  0.451
Epoch   6 Batch   83/269, Loss:  0.509
Epoch   6 Batch   84/269, Loss:  0.531
Epoch   6 Batch   85/269, Loss:  0.485
Epoch   6 Batch   86/269, Loss:  0.501
Epoch   6 Batch   87/269, Loss:  0.466
Epoch   6 Batch   88/269, Loss:  0.457
Epoch   6 Batch   89/269, Loss:  0.533
Epoch   6 Batch   90/269, Loss:  0.547
Epoch   6 Batch   91/269, Loss:  0.510
Epoch   6 Batch   92/269, Loss:  0.483
Epoch   6 Batch   93/269, Loss:  0.484
Epoch   6 Batch   94/269, Loss:  0.491
Epoch   6 Batch   95/269, Loss:  0.494
Epoch   6 Batch   96/269, Loss:  0.522
Epoch   6 Batch   97/269, Loss:  0.493
Epoch   6 Batch   98/269, Loss:  0.427
Epoch   6 Batch   99/269, Loss:  0.517
Epoch   6 Batch  100/269, Loss:  0.435
Epoch   6 Batch  101/269, Loss:  0.499
Epoch   6 Batch  102/269, Loss:  0.535
Epoch   6 Batch  103/269, Loss:  0.549
...

Epoch   7 Batch   22/269, Loss:  0.493
Epoch   7 Batch   23/269, Loss:  0.536
Epoch   7 Batch   24/269, Loss:  0.515
Epoch   7 Batch   25/269, Loss:  0.528
Epoch   7 Batch   26/269, Loss:  0.493
Epoch   7 Batch   27/269, Loss:  0.516
Epoch   7 Batch   28/269, Loss:  0.533
Epoch   7 Batch   29/269, Loss:  0.450
Epoch   7 Batch   30/269, Loss:  0.499
Epoch   7 Batch   31/269, Loss:  0.478
Epoch   7 Batch   32/269, Loss:  0.518
Epoch   7 Batch   33/269, Loss:  0.513
Epoch   7 Batch   34/269, Loss:  0.440
Epoch   7 Batch   35/269, Loss:  0.565
Epoch   7 Batch   36/269, Loss:  0.519
Epoch   7 Batch   37/269, Loss:  0.462
Epoch   7 Batch   38/269, Loss:  0.473
Epoch   7 Batch   39/269, Loss:  0.491
Epoch   7 Batch   40/269, Loss:  0.475
Epoch   7 Batch   41/269, Loss:  0.491
Epoch   7 Batch   42/269, Loss:  0.497
Epoch   7 Batch   43/269, Loss:  0.484
Epoch   7 Batch   44/269, Loss:  0.473
Epoch   7 Batch   45/269, Loss:  0.433
Epoch   7 Batch   46/269, Loss:  0.448
...

Epoch   7 Batch  233/269, Loss:  2.963
Epoch   7 Batch  234/269, Loss:  3.696
Epoch   7 Batch  235/269, Loss:  2.874
Epoch   7 Batch  236/269, Loss:  2.852
Epoch   7 Batch  237/269, Loss:  3.171
Epoch   7 Batch  238/269, Loss:  4.098
Epoch   7 Batch  239/269, Loss:  2.845
Epoch   7 Batch  240/269, Loss:  3.438
Epoch   7 Batch  241/269, Loss:  4.609
Epoch   7 Batch  242/269, Loss:  3.448
Epoch   7 Batch  243/269, Loss:  6.055
Epoch   7 Batch  244/269, Loss:  4.512
Epoch   7 Batch  245/269, Loss:  4.812
Epoch   7 Batch  246/269, Loss:  5.455
Epoch   7 Batch  247/269, Loss:  7.026
Epoch   7 Batch  248/269, Loss: 10.203
Epoch   7 Batch  249/269, Loss: 16.374
Epoch   7 Batch  250/269, Loss: 20.362
Epoch   7 Batch  251/269, Loss: 28.700
Epoch   7 Batch  252/269, Loss: 50.603
Epoch   7 Batch  253/269, Loss: 104.095
Epoch   7 Batch  254/269, Loss: 146.727
Epoch   7 Batch  255/269, Loss: 160.308
Epoch   7 Batch  256/269, Loss: 171.847
Epoch   7 Batch  257/269, Loss: 180.655
...

Epoch   8 Batch  168/269, Loss: 522.161
Epoch   8 Batch  169/269, Loss: 534.428
Epoch   8 Batch  170/269, Loss: 488.100
Epoch   8 Batch  171/269, Loss: 447.697
Epoch   8 Batch  172/269, Loss: 427.743
Epoch   8 Batch  173/269, Loss: 458.195
Epoch   8 Batch  174/269, Loss: 467.561
Epoch   8 Batch  175/269, Loss: 455.704
Epoch   8 Batch  176/269, Loss: 503.707
Epoch   8 Batch  177/269, Loss: 484.864
Epoch   8 Batch  178/269, Loss: 442.825
Epoch   8 Batch  179/269, Loss: 376.659
Epoch   8 Batch  180/269, Loss: 355.878
Epoch   8 Batch  181/269, Loss: 426.715
Epoch   8 Batch  182/269, Loss: 374.878
Epoch   8 Batch  183/269, Loss: 379.178
Epoch   8 Batch  184/269, Loss: 378.013
Epoch   8 Batch  185/269, Loss: 404.723
Epoch   8 Batch  186/269, Loss: 353.802
Epoch   8 Batch  187/269, Loss: 365.009
Epoch   8 Batch  188/269, Loss: 348.650
Epoch   8 Batch  189/269, Loss: 332.997
Epoch   8 Batch  190/269, Loss: 332.925
Epoch   8 Batch  191/269, Loss: 374.244
Epoch   8 Batch  192/269, Loss: 428.331


Epoch   9 Batch  102/269, Loss: 1831.671
Epoch   9 Batch  103/269, Loss: 1905.039
Epoch   9 Batch  104/269, Loss: 1852.136
Epoch   9 Batch  105/269, Loss: 1899.399
Epoch   9 Batch  106/269, Loss: 1866.879
Epoch   9 Batch  107/269, Loss: 1829.458
Epoch   9 Batch  108/269, Loss: 1856.031
Epoch   9 Batch  109/269, Loss: 1906.039
Epoch   9 Batch  110/269, Loss: 1890.621
Epoch   9 Batch  111/269, Loss: 1866.631
Epoch   9 Batch  112/269, Loss: 1886.669
Epoch   9 Batch  113/269, Loss: 1904.845
Epoch   9 Batch  114/269, Loss: 1890.614
Epoch   9 Batch  115/269, Loss: 1788.668
Epoch   9 Batch  116/269, Loss: 1873.078
Epoch   9 Batch  117/269, Loss: 1838.296
Epoch   9 Batch  118/269, Loss: 1863.858
Epoch   9 Batch  119/269, Loss: 1885.140
Epoch   9 Batch  120/269, Loss: 1811.626
Epoch   9 Batch  121/269, Loss: 1851.054
Epoch   9 Batch  122/269, Loss: 1849.797
Epoch   9 Batch  123/269, Loss: 1785.819
Epoch   9 Batch  124/269, Loss: 1797.184
Epoch   9 Batch  125/269, Loss: 1782.729
...

Epoch  10 Batch   34/269, Loss: 1541.419
Epoch  10 Batch   35/269, Loss: 1561.489
Epoch  10 Batch   36/269, Loss: 1552.530
Epoch  10 Batch   37/269, Loss: 1595.225
Epoch  10 Batch   38/269, Loss: 1570.520
Epoch  10 Batch   39/269, Loss: 1585.477
Epoch  10 Batch   40/269, Loss: 1600.212
Epoch  10 Batch   41/269, Loss: 1616.638
Epoch  10 Batch   42/269, Loss: 1655.980
Epoch  10 Batch   43/269, Loss: 1643.622
Epoch  10 Batch   44/269, Loss: 1707.080
Epoch  10 Batch   45/269, Loss: 1670.549
Epoch  10 Batch   46/269, Loss: 1651.397
Epoch  10 Batch   47/269, Loss: 1701.271
Epoch  10 Batch   48/269, Loss: 1719.376
Epoch  10 Batch   49/269, Loss: 1715.675
Epoch  10 Batch   50/269, Loss: 1755.350
Epoch  10 Batch   51/269, Loss: 1785.772
Epoch  10 Batch   52/269, Loss: 1761.863
Epoch  10 Batch   53/269, Loss: 1783.646
Epoch  10 Batch   54/269, Loss: 1788.759
Epoch  10 Batch   55/269, Loss: 1840.234
Epoch  10 Batch   56/269, Loss: 1872.146
Epoch  10 Batch   57/269, Loss: 1842.823
...

Epoch  10 Batch  234/269, Loss: 1003.412
Epoch  10 Batch  235/269, Loss: 1025.502
Epoch  10 Batch  236/269, Loss: 1013.197
Epoch  10 Batch  237/269, Loss: 1012.048
Epoch  10 Batch  238/269, Loss: 1022.730
Epoch  10 Batch  239/269, Loss: 1002.817
Epoch  10 Batch  240/269, Loss: 1014.037
Epoch  10 Batch  241/269, Loss: 1014.183
Epoch  10 Batch  242/269, Loss: 973.164
Epoch  10 Batch  243/269, Loss: 994.750
Epoch  10 Batch  244/269, Loss: 1003.805
Epoch  10 Batch  245/269, Loss: 1001.695
Epoch  10 Batch  246/269, Loss: 987.991
Epoch  10 Batch  247/269, Loss: 956.070
Epoch  10 Batch  248/269, Loss: 980.226
Epoch  10 Batch  249/269, Loss: 969.370
Epoch  10 Batch  250/269, Loss: 951.363
Epoch  10 Batch  251/269, Loss: 947.496
Epoch  10 Batch  252/269, Loss: 956.646
Epoch  10 Batch  253/269, Loss: 952.710
Epoch  10 Batch  254/269, Loss: 976.562
Epoch  10 Batch  255/269, Loss: 1002.743
Epoch  10 Batch  256/269, Loss: 964.031
Epoch  10 Batch  257/269, Loss: 992.799
...

Epoch  11 Batch  168/269, Loss: 1250.859
Epoch  11 Batch  169/269, Loss: 1230.205
Epoch  11 Batch  170/269, Loss: 1255.553
Epoch  11 Batch  171/269, Loss: 1225.102
Epoch  11 Batch  172/269, Loss: 1261.711
Epoch  11 Batch  173/269, Loss: 1247.247
Epoch  11 Batch  174/269, Loss: 1246.078
Epoch  11 Batch  175/269, Loss: 1221.113
Epoch  11 Batch  176/269, Loss: 1204.225
Epoch  11 Batch  177/269, Loss: 1210.823
Epoch  11 Batch  178/269, Loss: 1197.030
Epoch  11 Batch  179/269, Loss: 1209.981
Epoch  11 Batch  180/269, Loss: 1178.890
Epoch  11 Batch  181/269, Loss: 1218.550
Epoch  11 Batch  182/269, Loss: 1192.430
Epoch  11 Batch  183/269, Loss: 1236.930
Epoch  11 Batch  184/269, Loss: 1186.868
Epoch  11 Batch  185/269, Loss: 1236.126
Epoch  11 Batch  186/269, Loss: 1197.341
Epoch  11 Batch  187/269, Loss: 1227.290
Epoch  11 Batch  188/269, Loss: 1234.425
Epoch  11 Batch  189/269, Loss: 1242.464
Epoch  11 Batch  190/269, Loss: 1230.133
Epoch  11 Batch  191/269, Loss: 1218.698
...

Epoch  12 Batch  100/269, Loss: 1030.605
Epoch  12 Batch  101/269, Loss: 969.336
Epoch  12 Batch  102/269, Loss: 996.439
Epoch  12 Batch  103/269, Loss: 1043.563
Epoch  12 Batch  104/269, Loss: 993.331
Epoch  12 Batch  105/269, Loss: 1003.047
Epoch  12 Batch  106/269, Loss: 968.456
Epoch  12 Batch  107/269, Loss: 949.546
Epoch  12 Batch  108/269, Loss: 972.987
Epoch  12 Batch  109/269, Loss: 983.565
Epoch  12 Batch  110/269, Loss: 968.761
Epoch  12 Batch  111/269, Loss: 962.189
Epoch  12 Batch  112/269, Loss: 968.224
Epoch  12 Batch  113/269, Loss: 986.787
Epoch  12 Batch  114/269, Loss: 988.540
Epoch  12 Batch  115/269, Loss: 910.351
Epoch  12 Batch  116/269, Loss: 963.504
Epoch  12 Batch  117/269, Loss: 950.375
Epoch  12 Batch  118/269, Loss: 947.700
Epoch  12 Batch  119/269, Loss: 940.202
Epoch  12 Batch  120/269, Loss: 919.010
Epoch  12 Batch  121/269, Loss: 942.477
Epoch  12 Batch  122/269, Loss: 958.728
Epoch  12 Batch  123/269, Loss: 925.083
Epoch  12 Batch  124/269, Loss: 919.4

Epoch  13 Batch   37/269, Loss: 961.942
Epoch  13 Batch   38/269, Loss: 966.640
Epoch  13 Batch   39/269, Loss: 959.026
Epoch  13 Batch   40/269, Loss: 954.693
Epoch  13 Batch   41/269, Loss: 959.689
Epoch  13 Batch   42/269, Loss: 995.403
Epoch  13 Batch   43/269, Loss: 959.788
Epoch  13 Batch   44/269, Loss: 996.891
Epoch  13 Batch   45/269, Loss: 953.206
Epoch  13 Batch   46/269, Loss: 916.438
Epoch  13 Batch   47/269, Loss: 992.478
Epoch  13 Batch   48/269, Loss: 980.343
Epoch  13 Batch   49/269, Loss: 944.132
Epoch  13 Batch   50/269, Loss: 963.907
Epoch  13 Batch   51/269, Loss: 963.928
Epoch  13 Batch   52/269, Loss: 953.662
Epoch  13 Batch   53/269, Loss: 941.164
Epoch  13 Batch   54/269, Loss: 928.471
Epoch  13 Batch   55/269, Loss: 954.422
Epoch  13 Batch   56/269, Loss: 972.287
Epoch  13 Batch   57/269, Loss: 964.398
Epoch  13 Batch   58/269, Loss: 940.536
Epoch  13 Batch   59/269, Loss: 936.702
Epoch  13 Batch   60/269, Loss: 911.098
Epoch  13 Batch   61/269, Loss: 932.342


Epoch  13 Batch  241/269, Loss: 703.222
Epoch  13 Batch  242/269, Loss: 734.936
Epoch  13 Batch  243/269, Loss: 731.226
Epoch  13 Batch  244/269, Loss: 740.659
Epoch  13 Batch  245/269, Loss: 762.747
Epoch  13 Batch  246/269, Loss: 754.948
Epoch  13 Batch  247/269, Loss: 778.914
Epoch  13 Batch  248/269, Loss: 773.415
Epoch  13 Batch  249/269, Loss: 807.730
Epoch  13 Batch  250/269, Loss: 823.654
Epoch  13 Batch  251/269, Loss: 866.850
Epoch  13 Batch  252/269, Loss: 901.504
Epoch  13 Batch  253/269, Loss: 926.333
Epoch  13 Batch  254/269, Loss: 924.281
Epoch  13 Batch  255/269, Loss: 998.585
Epoch  13 Batch  256/269, Loss: 1021.456
Epoch  13 Batch  257/269, Loss: 1053.760
Epoch  13 Batch  258/269, Loss: 1097.700
Epoch  13 Batch  259/269, Loss: 1153.740
Epoch  13 Batch  260/269, Loss: 1141.315
Epoch  13 Batch  261/269, Loss: 1164.326
Epoch  13 Batch  262/269, Loss: 1178.329
Epoch  13 Batch  263/269, Loss: 1229.288
Epoch  13 Batch  264/269, Loss: 1237.447
Epoch  13 Batch  265/269, Loss:

Epoch  14 Batch  174/269, Loss: 1065.870
Epoch  14 Batch  175/269, Loss: 1018.787
Epoch  14 Batch  176/269, Loss: 998.337
Epoch  14 Batch  177/269, Loss: 988.095
Epoch  14 Batch  178/269, Loss: 968.087
Epoch  14 Batch  179/269, Loss: 963.828
Epoch  14 Batch  180/269, Loss: 932.931
Epoch  14 Batch  181/269, Loss: 820.914
Epoch  14 Batch  182/269, Loss: 775.392
Epoch  14 Batch  183/269, Loss: 742.519
Epoch  14 Batch  184/269, Loss: 690.909
Epoch  14 Batch  185/269, Loss: 689.692
Epoch  14 Batch  186/269, Loss: 669.676
Epoch  14 Batch  187/269, Loss: 669.264
Epoch  14 Batch  188/269, Loss: 666.372
Epoch  14 Batch  189/269, Loss: 697.666
Epoch  14 Batch  190/269, Loss: 689.596
Epoch  14 Batch  191/269, Loss: 692.080
Epoch  14 Batch  192/269, Loss: 727.060
Epoch  14 Batch  193/269, Loss: 744.510
Epoch  14 Batch  194/269, Loss: 736.799
Epoch  14 Batch  195/269, Loss: 758.974
Epoch  14 Batch  196/269, Loss: 766.209
Epoch  14 Batch  197/269, Loss: 776.778
Epoch  14 Batch  198/269, Loss: 793.11

Epoch  15 Batch  109/269, Loss: 595.353
Epoch  15 Batch  110/269, Loss: 534.349
Epoch  15 Batch  111/269, Loss: 505.081
Epoch  15 Batch  112/269, Loss: 494.516
Epoch  15 Batch  113/269, Loss: 494.352
Epoch  15 Batch  114/269, Loss: 433.332
Epoch  15 Batch  115/269, Loss: 426.222
Epoch  15 Batch  116/269, Loss: 433.207
Epoch  15 Batch  117/269, Loss: 452.666
Epoch  15 Batch  118/269, Loss: 467.298
Epoch  15 Batch  119/269, Loss: 457.147
Epoch  15 Batch  120/269, Loss: 485.021
Epoch  15 Batch  121/269, Loss: 489.227
Epoch  15 Batch  122/269, Loss: 501.370
Epoch  15 Batch  123/269, Loss: 523.413
Epoch  15 Batch  124/269, Loss: 546.926
Epoch  15 Batch  125/269, Loss: 565.088
Epoch  15 Batch  126/269, Loss: 557.668
Epoch  15 Batch  127/269, Loss: 584.629
Epoch  15 Batch  128/269, Loss: 606.676
Epoch  15 Batch  129/269, Loss: 628.486
Epoch  15 Batch  130/269, Loss: 634.201
Epoch  15 Batch  131/269, Loss: 631.064
Epoch  15 Batch  132/269, Loss: 634.232
Epoch  15 Batch  133/269, Loss: 610.528


Epoch  16 Batch   44/269, Loss: 1844.266
Epoch  16 Batch   45/269, Loss: 1879.710
Epoch  16 Batch   46/269, Loss: 1919.995
Epoch  16 Batch   47/269, Loss: 1985.347
Epoch  16 Batch   48/269, Loss: 1980.985
Epoch  16 Batch   49/269, Loss: 1990.029
Epoch  16 Batch   50/269, Loss: 2036.617
Epoch  16 Batch   51/269, Loss: 2051.805
Epoch  16 Batch   52/269, Loss: 2052.482
Epoch  16 Batch   53/269, Loss: 2061.703
Epoch  16 Batch   54/269, Loss: 2047.547
Epoch  16 Batch   55/269, Loss: 2075.425
Epoch  16 Batch   56/269, Loss: 2039.376
Epoch  16 Batch   57/269, Loss: 2018.772
Epoch  16 Batch   58/269, Loss: 2026.846
Epoch  16 Batch   59/269, Loss: 1995.029
Epoch  16 Batch   60/269, Loss: 1980.320
Epoch  16 Batch   61/269, Loss: 1926.336
Epoch  16 Batch   62/269, Loss: 1900.712
Epoch  16 Batch   63/269, Loss: 1878.806
Epoch  16 Batch   64/269, Loss: 1863.468
Epoch  16 Batch   65/269, Loss: 1762.828
Epoch  16 Batch   66/269, Loss: 1778.517
Epoch  16 Batch   67/269, Loss: 1714.355
Epoch  16 Batch 
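
The log excerpts above are hard to compare by eye: the batches shown for epoch 15 sit roughly between 430 and 630, while those shown for epoch 16 are between about 1700 and 2075. A minimal sketch of how such output could be summarized per epoch is given below. The function name `summarize_losses` and the regular expression are assumptions made for illustration and are not part of the original notebook. Lines that were cut off (for example a loss with fewer than three decimals) are skipped rather than guessed.

```python
import re
from collections import defaultdict

# Matches complete lines such as "Epoch  10 Batch  236/269, Loss: 1013.197".
# The log prints three decimals, so a shorter value is treated as truncated.
LOG_LINE = re.compile(
    r"Epoch\s+(\d+)\s+Batch\s+(\d+)/(\d+),\s+Loss:\s+(\d+\.\d{3})$"
)


def summarize_losses(log_text):
    """Return {epoch: (min, mean, max)} over the complete log lines."""
    per_epoch = defaultdict(list)
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # blank or truncated line: ignore it
        per_epoch[int(match.group(1))].append(float(match.group(4)))
    return {
        epoch: (min(values), sum(values) / len(values), max(values))
        for epoch, values in sorted(per_epoch.items())
    }


# A few lines copied from the output above, including one truncated line.
sample_log = """\
Epoch  15 Batch  114/269, Loss: 433.332
Epoch  15 Batch  115/269, Loss: 426.222
Epoch  16 Batch   54/269, Loss: 2047.547
Epoch  16 Batch   55/269, Loss: 2075.425
Epoch  16 Batch 
"""

summary = summarize_losses(sample_log)
for epoch, (lowest, mean, highest) in summary.items():
    print("Epoch {:3d}: min {:.3f}  mean {:.3f}  max {:.3f}".format(
        epoch, lowest, mean, highest))
```

Because the notebook output only keeps fragments of each epoch, any summary computed this way describes the visible batches, not the full epoch.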

In [0]:
# Save parameters for checkpoint
save_params(save_path)

In [0]:
import tensorflow as tf
import numpy as np

_, (source_vocab_to_int, target_vocab_to_int), (source_int_to_vocab, target_int_to_vocab) = load_preprocess()
load_path = load_params()

In [0]:
translate_sentence = "new jersey est parfois calme pendant l' automne , et il est neigeux en avril ."
# French-to-English translation example
# input:  "new jersey est parfois calme pendant l' automne , et il est neigeux en avril ."
# target: "new jersey is sometimes quiet during autumn , and it is snowy in april ."

print(translate_sentence)

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
translate_sentence = sentence_to_seq(translate_sentence, source_vocab_to_int)
print(np.shape(translate_sentence))

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_path + '.meta')
    loader.restore(sess, load_path)

    encoder_inputs = loaded_graph.get_tensor_by_name('encoder_inputs:0')
    encoder_inputs_length = loaded_graph.get_tensor_by_name('encoder_inputs_length:0')
    decoder_pred_decode = loaded_graph.get_tensor_by_name('decoder_pred_decode:0')
    
    # Run the decoder on a batch of one sentence and take the first result
    predicted_ids = sess.run(
        decoder_pred_decode,
        {encoder_inputs: [translate_sentence],
         encoder_inputs_length: [np.shape(translate_sentence)[0]]})[0]

print('Input')
print('  Word Ids:      {}'.format(list(translate_sentence)))
print('  Source Words: {}'.format([source_int_to_vocab[i] for i in translate_sentence]))

print('\nPrediction')
print('  Word Ids:      {}'.format([i[0] for i in predicted_ids]))
print('  Predicted Words: {}'.format([target_int_to_vocab[i[0]] for i in predicted_ids]))

In [0]:
print('\nTranslation:\n')
translation = ''
for word_i in predicted_ids:
    translation += target_int_to_vocab[word_i[0]] + ' '
    
print(translation)


Translation:

going go freezing going going pears animals may store tower little saw little little little little little little little little little little little little little little little little little little little little 
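
The translation above ends in a long run of the same word, which makes the output hard to read. One way to tidy such output when printing is sketched below: stop at an end-of-sequence token, or once a word has repeated too many times in a row. This is only a display-time cleanup under stated assumptions and does not address why the model produces the repetition. The helper name `ids_to_sentence`, the `'<EOS>'` token, and the `max_repeat` parameter are hypothetical; the notebook does not show which special tokens the vocabulary contains. The sketch takes `step[0]` at each time-step, mirroring the `word_i[0]` indexing used in the cell above.

```python
def ids_to_sentence(predicted_ids, int_to_vocab, eos_token='<EOS>', max_repeat=3):
    """Join the first candidate at each time-step into a sentence.

    Stops at eos_token (an assumed special token) or once the same word
    would appear more than max_repeat times in a row.
    """
    words = []
    run_length = 0
    for step in predicted_ids:
        word = int_to_vocab[step[0]]  # first candidate at this time-step
        if word == eos_token:
            break
        # Count how many times in a row this word has been produced.
        run_length = run_length + 1 if words and words[-1] == word else 1
        if run_length > max_repeat:
            break
        words.append(word)
    return ' '.join(words)


# Toy vocabulary and predictions, made up for illustration only.
toy_vocab = {0: '<EOS>', 1: 'it', 2: 'is', 3: 'little'}
toy_ids = [[1], [2], [3], [3], [3], [3], [3], [0]]
print(ids_to_sentence(toy_ids, toy_vocab))
```

In the notebook this could be called as `ids_to_sentence(predicted_ids, target_int_to_vocab)`, provided the target vocabulary really does contain the assumed end-of-sequence token.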
