### Seq2Seq Model

Here, we're building a sequence-to-sequence (seq2seq) model, which is used for tasks that involve reading in a sequence of text and generating an output text sequence based on the input. 

#### Sequence to sequence

The sequence to sequence (seq2seq) framework encompasses any task that involves taking in some text and returning some generated text. Some examples of this include chatbots, text summarization, and machine translation. 

In the past, these seq2seq tasks were often handled with statistical methods such as Bayesian models. More recently, deep learning has been applied to them.

In particular, there's an extremely powerful model called the "encoder-decoder", which is specifically designed for seq2seq applications.

The encoder-decoder is named for its two parts: the encoder and the decoder. Both the encoder and decoder are language models.

First, an input sequence is fed to the encoder. The output from the last layer of the encoder becomes the input for the first layer of the decoder. The decoder transforms that input back into a text sequence.


### What is Seq2Seq (from paper)

See this paper: (https://arxiv.org/pdf/1409.3215v3.pdf)
See this tutorial: (https://www.youtube.com/watch?v=ElmBrKyMXxs)

When you train a Seq2Seq model, you have an encoder network and a decoder network. You split a sentence such as "I know that all dogs are really good pets" into something like "I know that all dogs" and "are really good pets".

In the encoding portion, you run "I know that all dogs" through the encoder and get its outputs and a hidden state. You feed the hidden state (and possibly the outputs as well) into the decoder, which tries to recreate "are really good pets".

As you do this over and over, the decoder output should better approximate the true output sequence. 

### Training Data

#### Training task

For a seq2seq model, we use training pairs that contain an input sequence and an output sequence (and the goal is to predict the output sequence).

During training, we perform two tasks:
1. **Input Task**: Extract useful information from the input sequence
2. **Output Task**: Calculate word probabilities at each output time step, using information from the input sequence and **previous** words in the output sequence.

We can leave the input sequence as is. We process the output sequence into two separate sequences: the ground truth sequence and the final token sequence.

#### Processing the output

The ground truth sequence for a seq2seq model is equal to the input sequence for a language model - it represents sequence prefixes that we use to calculate word probabilities at each time step of the output. 

#### SOS and EOS tokens

For seq2seq models, we need to have start-of-sequence (SOS) and end-of-sequence (EOS) tokens, which mark the start and end of a tokenized text sequence. 

Example. ['SOS', 'he', 'eats', 'bread', 'EOS']
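As a minimal sketch of how one tokenized output sequence yields the ground truth sequence and the final token sequence, consider the example above. The helper name `split_output_sequence` is hypothetical, and the sketch assumes the sequence already contains the SOS and EOS tokens.

```python
def split_output_sequence(tokens):
    """Split a tokenized output sequence into decoder inputs and targets.

    Assumes `tokens` already starts with SOS and ends with EOS.
    """
    ground_truth = tokens[:-1]  # decoder input: every token except EOS
    final_tokens = tokens[1:]   # decoder target: every token except SOS
    return ground_truth, final_tokens


tokens = ['SOS', 'he', 'eats', 'bread', 'EOS']
ground_truth, final_tokens = split_output_sequence(tokens)

# At time step t the decoder reads ground_truth[t] (plus everything before it)
# and should predict final_tokens[t].
for step, (seen, target) in enumerate(zip(ground_truth, final_tokens)):
    print(step, seen, '->', target)
```

The two sequences have the same length and are offset by one token, so each prefix of the ground truth sequence lines up with the token that should follow it.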

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column
tf_s2s = tf.contrib.seq2seq

# Seq2seq model
class Seq2SeqModel(object):
    def __init__(self, vocab_size, num_lstm_layers, num_lstm_units):
        self.vocab_size = vocab_size
        # Extended vocabulary includes start, stop token
        self.extended_vocab_size = vocab_size + 2
        self.num_lstm_layers = num_lstm_layers
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size)

    # Create a sequence training tuple from input/output sequences
    def make_training_tuple(self, input_sequence, output_sequence):
        truncate_front = output_sequence[1:]
        truncate_back = output_sequence[:-1]
        sos_token = [self.vocab_size]
        eos_token = [self.vocab_size + 1]
        # Create the input, ground truth, and final token sequences
        # (sos_token and eos_token are already lists, so concatenate directly)
        input_sequence = sos_token + input_sequence + eos_token
        ground_truth = sos_token + truncate_back
        final_sequence = truncate_front + eos_token
        return (input_sequence, ground_truth, final_sequence)
"""

### Final States

Here, let's compare the final state output of an LSTM and BiLSTM

#### The encoder

The encoder is used to extract useful information, and is typically an LSTM or BiLSTM

#### LSTM final state

We pass the final state of the encoder into the decoder. For an LSTM in TensorFlow, the final state is represented by the `LSTMStateTuple` object, which contains two important properties: (1) the hidden state (`c`) and (2) the state output (`h`). The hidden state represents the internal cell state ("memory") of the LSTM cell.

The two properties are represented by tensors with shape = (batch_size, hidden_units)

#### Multi-layer final states

For a multi-layer LSTM, the final state output of `dynamic_rnn` is a tuple containing the final state for each layer.

#### BiLSTM final state

The final state of a BiLSTM is similar to that of the LSTM, except the output is a tuple of two LSTMStateTuple objects (one for the forward LSTM and one for the backward LSTM)

#### Combining forward and backward, for BiLSTM

Since the decoder uses only a regular LSTM (which works in the forward direction), we need to combine the forward and backward final states. We do this by concatenating the hidden states (`c`) of the two directions, and likewise the state outputs (`h`).
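The shape effect of this concatenation can be sketched without TensorFlow. In the sketch below, `StateTuple` is a hypothetical stand-in for `LSTMStateTuple`, nested lists stand in for tensors of shape (batch_size, hidden_units), and all values are made up.

```python
from collections import namedtuple

# Hypothetical stand-in for TensorFlow's LSTMStateTuple (fields c and h).
StateTuple = namedtuple('StateTuple', ['c', 'h'])


def concat_last_axis(a, b):
    """Concatenate two (batch_size, units) nested lists along the last axis."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]


def combine_bi_state(state_fw, state_bw):
    """Merge forward and backward final states into a single state."""
    return StateTuple(
        c=concat_last_axis(state_fw.c, state_bw.c),
        h=concat_last_axis(state_fw.h, state_bw.h))


# Toy final states: batch_size = 2, hidden_units = 3 per direction.
state_fw = StateTuple(c=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
                      h=[[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]])
state_bw = StateTuple(c=[[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]],
                      h=[[2.0, 2.1, 2.2], [2.3, 2.4, 2.5]])

combined = combine_bi_state(state_fw, state_bw)
# Batch size stays 2; the unit dimension doubles from 3 to 6.
print(len(combined.c), len(combined.c[0]))
```

Because the last dimension doubles, a decoder that starts from the combined state needs twice as many units as each encoder direction.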

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column
tf_s2s = tf.contrib.seq2seq

# Get c and h vectors for bidirectional LSTM final states
def get_bi_state_parts(state_fw, state_bw):
    bi_state_c = tf.concat([state_fw.c, state_bw.c], -1)
    bi_state_h = tf.concat([state_fw.h, state_bw.h], -1)
    return (bi_state_c, bi_state_h)

# Seq2seq model
class Seq2SeqModel(object):
    def __init__(self, vocab_size, num_lstm_layers, num_lstm_units):
        self.vocab_size = vocab_size
        # Extended vocabulary includes start, stop token
        self.extended_vocab_size = vocab_size + 2
        self.num_lstm_layers = num_lstm_layers
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size)

    def make_lstm_cell(self, dropout_keep_prob, num_units):
        cell = tf.nn.rnn_cell.LSTMCell(num_units)
        return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout_keep_prob)

    # Create multi-layer LSTM
    def stacked_lstm_cells(self, is_training, num_units):
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_list = [self.make_lstm_cell(dropout_keep_prob, num_units) for i in range(self.num_lstm_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cell_list)
        return cell

    # Get embeddings for input/output sequences
    def get_embeddings(self, sequences, scope_name):
        with tf.variable_scope(scope_name):
            cat_column = tf_fc.sequence_categorical_column_with_identity(
                'sequences',
                self.extended_vocab_size)
            embedding_column = tf.feature_column.embedding_column(
                cat_column,
                int(self.extended_vocab_size**0.25))
            seq_dict = {'sequences': sequences}
            embeddings, sequence_lengths = tf_fc.sequence_input_layer(
                seq_dict,
                [embedding_column])
            return embeddings, tf.cast(sequence_lengths, tf.int32)
    
    # Create the encoder for the model
    def encoder(self, encoder_inputs, is_training):
        input_embeddings, input_seq_lens = self.get_embeddings(encoder_inputs, 'encoder_emb')
        cell_fw = self.stacked_lstm_cells(is_training, self.num_lstm_units)
        cell_bw = self.stacked_lstm_cells(is_training, self.num_lstm_units)
        enc_outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw,
            cell_bw,
            input_embeddings,
            sequence_length=input_seq_lens,
            dtype=tf.float32)
        states_fw, states_bw = final_states
        combined_state = []
        for i in range(self.num_lstm_layers):
            bi_state_c, bi_state_h = get_bi_state_parts(
                states_fw[i], states_bw[i]
            )
"""

### Combined State

We need to combine the final states for a BiLSTM into usable initial states

#### LSTMStateTuple initialization

We can initialize an LSTMStateTuple object with a hidden state (c) and state output (h). 

Below is an example of combining the BiLSTM forward and backward states into a single LSTMStateTuple object, which can be passed into the decoder. 

For BiLSTM encoders with multiple layers, we combine the states for each layer to create a tuple of `LSTMStateTuple` objects, where the element at index "i" of the tuple is the "ith" layer's combined final state

In [None]:
"""
import tensorflow as tf

# Final states of single-layer BiLSTM
# Forward and backward cells both have 5 hidden units
state_fw, state_bw = final_states

# Concatenate along final axis
final_c = tf.concat([state_fw.c, state_bw.c], -1)
final_h = tf.concat([state_fw.h, state_bw.h], -1)

combined_state = tf.nn.rnn_cell.LSTMStateTuple(
    final_c, final_h)
print(combined_state)
"""

"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column
tf_s2s = tf.contrib.seq2seq

# Get c and h vectors for bidirectional LSTM final states
def get_bi_state_parts(state_fw, state_bw):
    bi_state_c = tf.concat([state_fw.c, state_bw.c], -1)
    bi_state_h = tf.concat([state_fw.h, state_bw.h], -1)
    return bi_state_c, bi_state_h

# Seq2seq model
class Seq2SeqModel(object):
    def __init__(self, vocab_size, num_lstm_layers, num_lstm_units):
        self.vocab_size = vocab_size
        # Extended vocabulary includes start, stop token
        self.extended_vocab_size = vocab_size + 2
        self.num_lstm_layers = num_lstm_layers
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size)

    def make_lstm_cell(self, dropout_keep_prob, num_units):
        cell = tf.nn.rnn_cell.LSTMCell(num_units)
        return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout_keep_prob)

    # Create multi-layer LSTM
    def stacked_lstm_cells(self, is_training, num_units):
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_list = [self.make_lstm_cell(dropout_keep_prob, num_units) for i in range(self.num_lstm_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cell_list)
        return cell

    # Get embeddings for input/output sequences
    def get_embeddings(self, sequences, scope_name):
        with tf.variable_scope(scope_name):
            cat_column = tf_fc.sequence_categorical_column_with_identity(
                'sequences',
                self.extended_vocab_size)
            embedding_column = tf.feature_column.embedding_column(
                cat_column,
                int(self.extended_vocab_size**0.25))
            seq_dict = {'sequences': sequences}
            embeddings, sequence_lengths = tf_fc.sequence_input_layer(
                seq_dict,
                [embedding_column])
            return embeddings, tf.cast(sequence_lengths, tf.int32)
    
    # Create the encoder for the model
    def encoder(self, encoder_inputs, is_training):
        input_embeddings, input_seq_lens = self.get_embeddings(encoder_inputs, 'encoder_emb')
        cell_fw = self.stacked_lstm_cells(is_training, self.num_lstm_units)
        cell_bw = self.stacked_lstm_cells(is_training, self.num_lstm_units)
        enc_outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
            cell_fw,
            cell_bw,
            input_embeddings,
            sequence_length=input_seq_lens,
            dtype=tf.float32)
        states_fw, states_bw = final_states
        combined_state = []
        for i in range(self.num_lstm_layers):
            bi_state_c, bi_state_h = get_bi_state_parts(
                states_fw[i], states_bw[i]
            )
            bi_lstm_state = tf.nn.rnn_cell.LSTMStateTuple(bi_state_c, bi_state_h)
            combined_state.append(bi_lstm_state)
        final_state = tuple(combined_state)
        return enc_outputs, input_seq_lens, final_state
"""

### Encoder-Decoder

#### Model architecture

The decoder uses the final state of the encoder as its initial state, which gives it access to the information that the encoder extracted from the input sequence (which is crucial for good sequence-to-sequence modeling). 

![image](chapter2_encoder_decoder_outline.png)

Above, the final state of the LSTM encoder for each layer is used as the starting state for the corresponding decoding layer. In this basic architecture, the encoder's outputs are not used; we only use the outputs of the decoder.

The input tokens for the encoder represent the input sequence, while the input tokens for the decoder represent the ground truth tokens for the given input sequence. The decoder's output is equivalent to the output of a language model (i.e., each time step's output is based on the ground truth tokens from the previous time steps). 

#### Training vs. inference

When we're training, we have access to both the input and output sequences for each training pair. However, when we're making predictions, we only have access to the input, so the decoder doesn't have access to ground truth tokens. We get around this by using the decoder's word predictions from previous time steps as our "ground truth" tokens (i.e., we use past predictions as our "true" values).
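The difference between the two modes can be sketched with a toy decoder. Everything here is illustrative: `TOY_NEXT_TOKEN` and `predict_next` are hypothetical stand-ins for a trained decoder, which in practice would also carry an LSTM state from step to step.

```python
SOS, EOS = 'SOS', 'EOS'

# Hypothetical stand-in for a trained decoder: maps the previous token to
# the most likely next token.
TOY_NEXT_TOKEN = {SOS: 'he', 'he': 'eats', 'eats': 'bread', 'bread': EOS}


def predict_next(prev_token):
    return TOY_NEXT_TOKEN.get(prev_token, EOS)


def decode_training(ground_truth):
    """Training: every step is fed the true previous token."""
    return [predict_next(token) for token in ground_truth]


def decode_inference(max_steps=10):
    """Inference: every step is fed the model's own previous prediction."""
    outputs = []
    prev_token = SOS
    for _ in range(max_steps):
        prev_token = predict_next(prev_token)
        outputs.append(prev_token)
        if prev_token == EOS:
            break
    return outputs


train_out = decode_training([SOS, 'he', 'eats', 'bread'])
infer_out = decode_inference()
print(train_out)
print(infer_out)
```

With this toy decoder both modes produce the same tokens. With a real model, a wrong prediction during inference is fed back in and can affect every later step, which is not the case during training.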

### Attention

#### Using the encoder

When we use an encoder-decoder model, the information that the decoder gets from the encoder is the final state of each layer. The final state essentially encapsulates/summarizes the encoder's extracted information from the input sequence. 

But trying to encapsulate all the useful information from an input sequence in a single final state is hard, especially if the input sequence is long and contains long-term dependencies. As a result, decoders that rely only on the final state tend to perform poorly on such inputs.

For example, a text with the following input sequence is one that a regular encoder-decoder model would struggle to decode (e.g., for translation) because it's difficult to determine which part of the input is important:

"Sam grew up in Los Angeles. As a child, he dreamed of one day becoming an actor like Brad Pitt or Johnny Depp. Each day he would practice public speaking and impromptu skits near Venice Beach."

To get around this, we can use the encoder's outputs as additional input for the decoder, which gives the decoder a lot more useful information about the input sequence. We do this by using **attention**.

#### How attention works

Although we want to give the decoder access to the encoder outputs, we don't necessarily want to weight each input token equally. Instead, we want the decoder to "pay attention" to certain outputs from the encoder, so we let it decide which encoder outputs are most useful at the current decoding time step.

Using the decoder's hidden state at the current time step, together with the encoder outputs, attention calculates a **context vector**. The context vector encapsulates the most meaningful information from the input sequence for the current decoder time step, and it is used as additional input for the decoder when calculating that time step's output.

![image](chapter2_attention.png)

Attention uses trainable weights to calculate a context vector. It's like a mini neural network, which takes as input the decoder's current state and the encoder outputs, and uses its trainable weights to produce a context vector.
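To make the idea concrete, here is a minimal sketch of computing a context vector for one decoder time step. It assumes a plain dot product as the scoring function, which has no trainable weights; real attention mechanisms add trainable weights to the scoring step. The function names and toy values are illustrative and do not come from TensorFlow.

```python
import math


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def context_vector(decoder_state, encoder_outputs):
    """Weighted average of the encoder outputs for one decoder time step."""
    # Score each encoder output against the current decoder state.
    scores = [dot(decoder_state, enc_out) for enc_out in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [
        sum(w * enc_out[i] for w, enc_out in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
    return weights, context


# Three encoder time steps, each with a 2-dimensional output.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = context_vector(decoder_state, encoder_outputs)
print(weights)
print(context)
```

The first and third encoder outputs score highest against this decoder state, so they receive the largest weights and dominate the context vector.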

#### Attention mechanisms

The process for computing the context vector depends on the **attention mechanism** used. There are a few variations, but the popular ones in TensorFlow are `BahdanauAttention` and `LuongAttention`. The main difference between the two mechanisms is how they combine the encoder outputs and the current time step's hidden state when computing the context vector. The Bahdanau mechanism uses an additive (concatenation-based) method, while the Luong mechanism uses a multiplicative method.
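The two scoring styles can be sketched as follows for a single encoder output. This is a simplified illustration rather than TensorFlow's implementation: the weight matrices `w_dec`, `w_enc`, `w` and the vector `v` are hypothetical stand-ins for trained parameters, and the values are hand-picked.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def additive_score(dec_state, enc_output, w_dec, w_enc, v):
    """Additive style: v . tanh(W_dec * dec_state + W_enc * enc_output)."""
    projected = [
        math.tanh(d + e)
        for d, e in zip(matvec(w_dec, dec_state), matvec(w_enc, enc_output))
    ]
    return dot(v, projected)


def multiplicative_score(dec_state, enc_output, w):
    """Multiplicative style: dec_state . (W * enc_output)."""
    return dot(dec_state, matvec(w, enc_output))


# Toy values standing in for trained weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
dec_state = [1.0, 2.0]
enc_output = [3.0, 4.0]

additive = additive_score(dec_state, enc_output, identity, identity, [1.0, 1.0])
multiplicative = multiplicative_score(dec_state, enc_output, identity)
print(additive)
print(multiplicative)
```

In both cases the score is computed for every encoder output, and the scores are then normalized into the weights used to build the context vector.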

Below is an example of implementing the attention mechanism. 

In [None]:
"""
import tensorflow as tf

# Placeholder representing the
# individual lengths of each input sequence in the batch
input_seq_lens = tf.placeholder(tf.int32, shape=(None,))

num_units = 8
bahdanau = tf.contrib.seq2seq.BahdanauAttention(
    num_units,
    # combined encoder outputs (from previous chapter)
    combined_enc_outputs,
    memory_sequence_length=input_seq_lens)
luong = tf.contrib.seq2seq.LuongAttention(
    num_units,
    # combined encoder outputs (from previous chapter)
    combined_enc_outputs,
    memory_sequence_length=input_seq_lens)
"""

#### TensorFlow AttentionWrapper

Implementing attention from scratch requires a fair amount of linear algebra. TensorFlow therefore provides an easy-to-use API for adding attention to an LSTM decoder cell: the `AttentionWrapper` class. Below is an example of using the `AttentionWrapper`.

When using the decoder in TensorFlow, we pass the **attention value** at each decoder time step into the cell state at the next time step. When `attention_layer_size` is set, the `AttentionWrapper` feeds the context vector and the decoder cell's output through a fully-connected layer, and the output of that layer is used as the attention value.

Using a fully-connected layer to create the attention value can benefit the model's performance, by using the decoder's outputs as additional information. 
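As a rough sketch of that computation (not TensorFlow's actual code), the attention value can be viewed as a weight matrix applied to the concatenation of the decoder output and the context vector. The function name `attention_value` and the toy weights are assumptions made for illustration, and no bias term is used here.

```python
def attention_value(cell_output, context, weights):
    """Apply a fully-connected layer to [cell_output ; context]."""
    combined = cell_output + context  # list concatenation
    return [sum(w * x for w, x in zip(row, combined)) for row in weights]


cell_output = [0.5, -1.0]  # decoder cell output at this time step
context = [2.0, 1.0]       # context vector from the attention mechanism

# Toy weight matrix with shape (attention_layer_size, 4), standing in for
# trained weights. Here attention_layer_size = 2.
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]

attention = attention_value(cell_output, context, weights)
print(attention)
```

The number of rows in the weight matrix sets the size of the attention value, which is the role played by `attention_layer_size`.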

In [None]:
"""
import tensorflow as tf

# Decoder LSTM cell
dec_cell = tf.nn.rnn_cell.LSTMCell(8)
dec_cell = tf.contrib.seq2seq.AttentionWrapper(
    dec_cell,
    luong, # LuongAttention object
    attention_layer_size=8)
"""

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column
tf_s2s = tf.contrib.seq2seq

# Seq2seq model
class Seq2SeqModel(object):
    def __init__(self, vocab_size, num_lstm_layers, num_lstm_units):
        self.vocab_size = vocab_size
        # Extended vocabulary includes start, stop token
        self.extended_vocab_size = vocab_size + 2
        self.num_lstm_layers = num_lstm_layers
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size)
    
    def make_lstm_cell(self, dropout_keep_prob, num_units):
        cell = tf.nn.rnn_cell.LSTMCell(num_units)
        return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout_keep_prob)

    # Create multi-layer LSTM cells
    def stacked_lstm_cells(self, is_training, num_units):
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_list = [self.make_lstm_cell(dropout_keep_prob, num_units) for i in range(self.num_lstm_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cell_list)
        return cell

    # Helper function to combine BiLSTM encoder outputs
    def combine_enc_outputs(self, enc_outputs):
        enc_outputs_fw, enc_outputs_bw = enc_outputs
        return tf.concat([enc_outputs_fw, enc_outputs_bw], -1)

    # Create the stacked LSTM cells for the decoder
    def create_decoder_cell(self, enc_outputs, input_seq_lens, is_training):
        num_decode_units = self.num_lstm_units * 2
        dec_cell = self.stacked_lstm_cells(is_training, num_decode_units)
        combined_enc_outputs = self.combine_enc_outputs(enc_outputs)
        attention_mechanism = tf_s2s.LuongAttention(num_decode_units, combined_enc_outputs, memory_sequence_length=input_seq_lens)
        dec_cell = tf_s2s.AttentionWrapper(dec_cell, attention_mechanism, attention_layer_size = num_decode_units)
        return dec_cell
"""


### Training Helper

#### Decoding during training

During training, we have both the input and output sequences of a training pair. So, we can use the output sequence's ground truth tokens as input for the decoder. 

Below is a helper function that we can use for decoding during training. 

The `TrainingHelper` object is initialized with the (embedded) ground truth sequences and the lengths of the ground truth sequences. Note that we use separate embedding models for the encoder input and the decoder input (i.e., ground truth tokens). This is because the word relationships in the input and output sequences of a seq2seq task differ, and sometimes the sequences can be completely different (e.g., in machine translation, where they are in different languages).

In [None]:
"""
import tensorflow as tf

# Placeholder representing the
# batch of (embedded) input sequences for the decoder
# Shape: (batch_size, max_seq_len, embed_dim)
decoder_embeddings = tf.placeholder(
    tf.float32, shape=(None, None, 12)
)

# Placeholder representing the
# individual lengths of each input sequence in the batch
decoder_seq_lens = tf.placeholder(tf.int32, shape=(None,))

helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_embeddings, decoder_seq_lens)
"""

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column
tf_s2s = tf.contrib.seq2seq

# Seq2seq model
class Seq2SeqModel(object):
    def __init__(self, vocab_size, num_lstm_layers, num_lstm_units):
        self.vocab_size = vocab_size
        # Extended vocabulary includes start, stop token
        self.extended_vocab_size = vocab_size + 2
        self.num_lstm_layers = num_lstm_layers
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(
            num_words=vocab_size)
    
    # Convert sequences to embeddings
    def get_embeddings(self, sequences, scope_name):
        with tf.variable_scope(scope_name):
            cat_column = tf_fc.sequence_categorical_column_with_identity(
                'sequences',
                self.extended_vocab_size)
            embedding_column = tf.feature_column.embedding_column(
                cat_column,
                int(self.extended_vocab_size**0.25))
            seq_dict = {'sequences': sequences}
            embeddings, sequence_lengths = tf_fc.sequence_input_layer(
                seq_dict,
                [embedding_column])
            return embeddings, tf.cast(sequence_lengths, tf.int32)

    # Create the helper for decoding
    def create_decoder_helper(self, decoder_inputs, is_training, batch_size):
        if is_training:
            dec_embeddings, dec_seq_lens = self.get_embeddings(decoder_inputs, 'decoder_emb')
            helper = tf_s2s.TrainingHelper(dec_embeddings, dec_seq_lens)
        else:
            # IGNORE FOR NOW
            pass
        return helper, dec_seq_lens
"""

### Decoder Object

