The original idea comes from the paper "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (Cho et al., 2014).

The encoder and decoder are both RNNs (often GRUs or LSTMs).

## Sequence to Sequence

Consists of two recurrent neural networks (RNNs):
1. Encoder maps a variable-length source sequence (input) to a fixed-length vector
2. Decoder maps the vector representation back to a variable-length target sequence (output)

The two RNNs are trained jointly to maximize the conditional probability of the target sequence given the source sequence.
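
The encoder/decoder split described above can be sketched in plain NumPy. This is a minimal, untrained illustration and not the TensorFlow model used in the course: the layer sizes, the `go_id`/`eos_id` token ids, and the vanilla `tanh` cell (instead of a GRU or LSTM) are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, EMBED, VOCAB = 8, 4, 10  # hypothetical toy sizes

# Randomly initialised (untrained) parameters.
params = {
    "embed": rng.normal(0, 0.1, (VOCAB, EMBED)),
    "enc_Wx": rng.normal(0, 0.1, (EMBED, HIDDEN)),
    "enc_Wh": rng.normal(0, 0.1, (HIDDEN, HIDDEN)),
    "dec_Wx": rng.normal(0, 0.1, (EMBED, HIDDEN)),
    "dec_Wh": rng.normal(0, 0.1, (HIDDEN, HIDDEN)),
    "out_W": rng.normal(0, 0.1, (HIDDEN, VOCAB)),
}

def encode(source_ids):
    """Fold a variable-length list of token ids into one fixed-length vector."""
    h = np.zeros(HIDDEN)
    for token in source_ids:
        x = params["embed"][token]
        h = np.tanh(x @ params["enc_Wx"] + h @ params["enc_Wh"])
    return h

def decode(context, go_id=0, eos_id=1, max_len=5):
    """Greedily emit token ids, starting from the encoder's final state."""
    h, token, output = context, go_id, []
    for _ in range(max_len):
        x = params["embed"][token]
        h = np.tanh(x @ params["dec_Wx"] + h @ params["dec_Wh"])
        token = int(np.argmax(h @ params["out_W"]))
        if token == eos_id:
            break
        output.append(token)
    return output

short_context = encode([2, 3])                 # 2 source tokens
long_context = encode([4, 5, 6, 7, 8, 9])      # 6 source tokens
reply = decode(long_context)                   # variable-length output
```

Both source sequences end up as vectors of the same size, which is the bottleneck that the attention variant is meant to relax.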

Vanilla Encoder-Decoder RNN
<img src="img/encoder_decoder.png" alt="Drawing" style="width: 300px;"/>

Encoder-Decoder RNN for en2de Translation 
<img src="img/encoder_decoder_en2de.png" alt="Drawing" style="width: 600px;"/>

Encoder-Decoder RNN with Attention
<img src="img/encoder_decoder_attention.png" alt="Drawing" style="width: 500px;"/>


### Bucket Similar Sequence Lengths Together

- Avoid too much padding that leads to extraneous computation
- Group sequences of similar lengths into the same buckets
- Create a separate subgraph for each bucket
- In theory, with TensorFlow v1.0 you can use:

      tf.contrib.training.bucket_by_sequence_length(max_length,
      examples, batch_size, bucket_boundaries, capacity=2 *
      batch_size, dynamic_pad=True)

- In practice, use the bucketing algorithm used in TensorFlow’s
translate model (because we’re using v0.12)
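
The bucketing idea above can be sketched without TensorFlow. This is a simplified illustration in the spirit of the translate model's approach, not a copy of it: the bucket sizes, `PAD_ID`, and the function names are assumptions, and pairs that fit no bucket are simply dropped.

```python
# Hypothetical (max_encoder_len, max_decoder_len) bucket sizes.
BUCKETS = [(5, 10), (10, 15), (20, 25)]
PAD_ID = 0  # assumed padding token id

def bucket_pairs(pairs, buckets=BUCKETS):
    """Put each (source, target) pair into the smallest bucket that fits both."""
    grouped = [[] for _ in buckets]
    for source, target in pairs:
        for i, (max_src, max_tgt) in enumerate(buckets):
            if len(source) <= max_src and len(target) <= max_tgt:
                grouped[i].append((source, target))
                break  # stop at the first (smallest) bucket that fits
        # A pair too long for every bucket falls through and is dropped.
    return grouped

def pad(sequence, size):
    """Right-pad a list of token ids to the bucket length."""
    return sequence + [PAD_ID] * (size - len(sequence))

pairs = [([4, 5], [6, 7, 8]), ([4] * 8, [5] * 12), ([4] * 30, [5] * 3)]
grouped = bucket_pairs(pairs)
padded = [
    [(pad(s, max_src), pad(t, max_tgt)) for s, t in group]
    for group, (max_src, max_tgt) in zip(grouped, BUCKETS)
]
```

Each pair is padded only up to its own bucket's length rather than to the longest sequence in the whole dataset, which is where the saved computation comes from.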

## Sampled Loss

Do you want to train a multiclass or multilabel model with thousands or millions of output classes (for example, a language model with a large vocabulary)? Training with a full Softmax is slow in this case, since all of the classes are evaluated for every training example. Candidate Sampling training algorithms can speed up your step times by only considering a small randomly-chosen subset of contrastive classes (called candidates) for each batch of training examples.

Options include NCE (noise-contrastive estimation) loss and sampled softmax loss.

### Sampled Softmax
This is a faster way to train a softmax classifier over a huge number of classes. 
- This operation is for training only. It is generally an underestimate of the full softmax loss. 
- At inference time, you can compute full softmax probabilities with the expression `tf.nn.softmax(tf.matmul(inputs, tf.transpose(weights)) + biases)`.

Notes (from [1]):
- Avoid the growing complexity of computing the normalization constant
- Approximate the negative term of the gradient, by importance sampling with a small number of samples.
- At each step, update only the vectors associated with the correct word w and with the sampled words in V’
- Once training is over, use the full target vocabulary to compute the output probability of each target word 
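
To make the difference between the two losses concrete, here is a small NumPy sketch for a single training example. It is a simplification under stated assumptions, not TensorFlow's implementation: negatives are drawn uniformly without replacement, so the sampling-probability correction that non-uniform samplers need is omitted, and all sizes and function names are hypothetical.

```python
import numpy as np

def log_sum_exp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def full_softmax_loss(weights, biases, hidden, label):
    """Cross-entropy over every class: cost grows with the vocabulary size."""
    logits = weights @ hidden + biases
    return log_sum_exp(logits) - logits[label]

def sampled_softmax_loss(weights, biases, hidden, label, num_sampled, rng):
    """Cross-entropy over the true class plus a few uniformly sampled negatives."""
    negatives = np.delete(np.arange(len(biases)), label)
    sampled = rng.choice(negatives, size=num_sampled, replace=False)
    candidates = np.concatenate(([label], sampled))
    # Only the rows for the true class and the sampled classes are touched.
    logits = weights[candidates] @ hidden + biases[candidates]
    return log_sum_exp(logits) - logits[0]

rng = np.random.default_rng(0)
num_classes, dim = 1000, 16                    # hypothetical sizes
weights = rng.normal(size=(num_classes, dim))  # one row per class
biases = np.zeros(num_classes)
hidden = rng.normal(size=dim)

full = full_softmax_loss(weights, biases, hidden, label=3)
sampled = sampled_softmax_loss(weights, biases, hidden, 3, num_sampled=20, rng=rng)

# Inference: full softmax over all classes.
logits = weights @ hidden + biases
probs = np.exp(logits - log_sum_exp(logits))
```

In this simplified form the sampled loss normalizes over a subset of the classes, so it can never exceed the full loss, which matches the "underestimate" remark above.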

[1] On Using Very Large Target Vocabulary for Neural Machine Translation (Jean et al., 2015)

    # Example use of sampled_softmax_loss (TensorFlow v0.12 argument order).
    # Fragment from inside the model class; assumes `import tensorflow as tf`
    # and a `config` module defining NUM_SAMPLES, DEC_VOCAB and HIDDEN_SIZE.
    if 0 < config.NUM_SAMPLES < config.DEC_VOCAB:
        weight = tf.get_variable('proj_w', [config.HIDDEN_SIZE, config.DEC_VOCAB])
        bias = tf.get_variable('proj_b', [config.DEC_VOCAB])
        self.output_projection = (weight, bias)

        def sampled_loss(inputs, labels):
            # sampled_softmax_loss expects labels of shape [batch_size, 1]
            labels = tf.reshape(labels, [-1, 1])
            return tf.nn.sampled_softmax_loss(tf.transpose(weight), bias, inputs, labels,
                                              config.NUM_SAMPLES, config.DEC_VOCAB)
        self.softmax_loss_function = sampled_loss

## Vocabulary tradeoff
- Keep all tokens that appear at least a certain number of times (e.g., twice)
- Alternative approach: get a fixed size vocabulary
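
Both strategies can be sketched with a single helper. The function name `build_vocab`, its parameters, and the toy sentence are assumptions for illustration; only the `<unk>` token comes from these notes.

```python
from collections import Counter

def build_vocab(tokens, min_count=None, max_size=None):
    """Map tokens to ids; id 0 is reserved for out-of-vocabulary tokens."""
    ranked = Counter(tokens).most_common()  # most frequent first
    if min_count is not None:
        # Strategy 1: keep tokens seen at least `min_count` times.
        ranked = [(tok, n) for tok, n in ranked if n >= min_count]
    if max_size is not None:
        # Strategy 2: keep only the `max_size` most frequent tokens.
        ranked = ranked[:max_size]
    vocab = ["<unk>"] + [tok for tok, _ in ranked]
    return {tok: i for i, tok in enumerate(vocab)}

tokens = "i am fine . how are you ? i am you .".split()
by_count = build_vocab(tokens, min_count=2)
by_size = build_vocab(tokens, max_size=3)

# Tokens outside the vocabulary fall back to the <unk> id.
ids = [by_count.get(t, by_count["<unk>"]) for t in ["i", "are"]]
```

With a count threshold the vocabulary size depends on the corpus; with a fixed size the effective threshold does.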

Smaller vocabulary:
- Has smaller loss/perplexity but loss/perplexity isn’t everything
- Gives `<unk>` answers to questions that require personal information
- Leaves the bot with little to say in its responses
- Doesn’t train much faster than big vocab using sampled softmax

## Sanity check

How do we know that we implemented our model correctly?
- Run the model on a small dataset (~2,000 pairs) for a lot of epochs to see if it converges (i.e., learns all the responses by heart)
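
The same check can be illustrated on a toy problem. The sketch below uses a tiny softmax classifier as a stand-in for the seq2seq model (an assumption made purely to keep the example short): if the training code is correct, the model should memorize a handful of examples and drive the loss close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
num_examples, num_classes = 8, 5            # hypothetical toy sizes
inputs = np.eye(num_examples)               # each example is its own one-hot input
labels = rng.integers(0, num_classes, num_examples)
W = np.zeros((num_examples, num_classes))

def loss_and_grad(W):
    """Mean cross-entropy and the gradient of the summed cross-entropy."""
    logits = inputs @ W
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(num_examples), labels]).mean()
    probs[np.arange(num_examples), labels] -= 1.0   # softmax minus one-hot target
    return loss, inputs.T @ probs

initial_loss, _ = loss_and_grad(W)
for epoch in range(500):
    loss, grad = loss_and_grad(W)
    W -= 0.5 * grad                                 # plain gradient descent
final_loss, _ = loss_and_grad(W)
```

If the loss on such a small dataset refuses to approach zero, the bug is more likely in the model or training loop than in the data.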

## Problems

- The bot is very dramatic (thanks to Hollywood screenwriters)
- Topics of conversations aren’t realistic
- The response is always the same for a given encoder input
- Inconsistent personality
- Uses only the most recent utterance as the encoder input
- Doesn’t keep track of information about users

NOTE: Please review these slides one more time! http://web.stanford.edu/class/cs20si/lectures/slides_13.pdf