In [1]:
import tensorflow as tf

### Introduction

In this section, we'll use an LSTM (long short-term memory) network for language modeling. 

#### Language modeling

A language model can tell us the likelihood of each word in a sentence or passage, based on the words that came before it. We can then determine how likely a sentence or text passage is by aggregating its individual word probabilities.
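As a rough illustration of this aggregation, the sketch below scores a sentence by combining per-word conditional probabilities. The probability values are made up for the example, and `sentence_log_prob` is a hypothetical helper rather than part of any library.

```python
import math

def sentence_log_prob(word_probs):
    """Sum the log of each word's conditional probability."""
    return sum(math.log(p) for p in word_probs)

# Hypothetical values of P(word | previous words) for a four-word sentence
word_probs = [0.2, 0.5, 0.1, 0.4]

log_prob = sentence_log_prob(word_probs)
sentence_prob = math.exp(log_prob)  # same as multiplying the four values
print(log_prob, sentence_prob)
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow when a passage contains many words.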

#### Language model tasks

Language models are useful for both text classification and generation. In classification, we can use the language model probabilities to separate text into different categories.

For example, if we trained a language model on spam email subject titles, we could score a new title with the model: a title that the model finds highly probable resembles the spam it was trained on, so it can be flagged as likely spam.

We can also use a language model to generate text based on an incomplete input sentence (e.g., autocomplete). The model suggests completions for the sentence by choosing the words with the highest probabilities.

### Language Model

We can use language models to calculate word probabilities.

#### Word probabilities

The probability for each word is conditioned on the words that appear before it in the sequence. Because of this, the task of calculating a word probability based on the previous words in a sequence is essentially multiclass classification. 

Since each word in a sequence must come from the text corpus' vocabulary, we consider each vocabulary word as a separate class. We can use the previous sequence words as input and the word of interest as the word that we are predicting, in order to calculate probabilities for each vocabulary class.

![title](chapter2_wordProbs.png)

#### Inputs & targets

The goal behind training a language model is to predict each word in a sequence based on the words that come before it. 

But, instead of making training pairs where the target is a single word, we can make training pairs where the input and target sequences are the same length. 

![title](chapter2_example_training_words.png)

The language model attempts to predict each word in the target sequence based on its corresponding prefix in the input sequence. In the example, the input prefix-target word pairs are: 

![title](chapter2_language_pairs.png)

So we can predict, say, the third word in the text by using the first two words as the "prefix" and the third word as the target.
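The pairing described above can be sketched in plain Python. The sentence and the helper name `prefix_target_pairs` are hypothetical and only serve to show how equal-length input and target sequences expand into prefix-target pairs.

```python
def prefix_target_pairs(input_sequence, target_sequence):
    """Pair each target word with the input prefix used to predict it."""
    pairs = []
    for i, target_word in enumerate(target_sequence):
        prefix = input_sequence[:i + 1]
        pairs.append((prefix, target_word))
    return pairs

# Hypothetical tokenized sentence
tokens = ["she", "went", "to", "the", "store"]
input_sequence = tokens[:-1]   # every word except the last
target_sequence = tokens[1:]   # every word except the first

pairs = prefix_target_pairs(input_sequence, target_sequence)
for prefix, target in pairs:
    print(prefix, "->", target)
```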

#### Maximum length

In this setup, we see that words later on in the string will have a longer prefix than words earlier in the string. We can put a limit on the length of the training sequences. 

Using a fixed max sequence length can increase training speed and help the model avoid overfitting on uncommon text dependencies (which might show up more often in long, run-on sentences that are likely just people rambling).

![title](chapter2_language_fixed_length.png)


In [None]:
"""
import tensorflow as tf

def truncate_sequences(sequence, max_length):
    # CODE HERE
    input_sequence = sequence[:max_length-1]
    target_sequence = sequence[1:max_length]
    return (input_sequence, target_sequence)

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def get_input_target_sequence(self, sequence):
        seq_len = len(sequence)
        if seq_len >= self.max_length:
            input_sequence, target_sequence = truncate_sequences(
                sequence, self.max_length
            )
        else:
            # Next chapter
            input_sequence, target_sequence = pad_sequences(
                sequence, self.max_length
            )
        return input_sequence, target_sequence
"""

### Padding

We can use padding for our tokenized sequences.

#### Varied length sequences

For most neural networks, the input data always has a fixed length. This is because most neural networks are ***feed-forward***, which means that they use multiple layers of **fixed** sizes to compute the network's output.

But since text data involves text sequences of different sizes (e.g., sentences, passages, paragraphs), the language model must be flexible enough to handle input data of different lengths.

As a result, we use a recurrent neural network (RNN) for the language model. 

#### Padded sequences

Even though RNNs and LSTMs allow us to take in input texts of different lengths, each tokenized sequence in a training batch still needs to be the same length. This is because a training batch must have the proper tensor shape (e.g., be a 2-D matrix).

To enforce this, we can use padding. For sequences that are shorter than the maximum sequence length, we append a special non-vocabulary token to the end of the sequence until its length equals the maximum sequence length. Typically the padding token is given ID = 0, while each vocabulary word ID is a positive integer.

e.g.,

1. [1, 5, 10, 9] with max length = 7 --> [1, 5, 10, 9, 0, 0, 0]
2. [3, 6, 1] with max length = 7 --> [3, 6, 1, 0, 0, 0, 0]
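To see how padding and truncation work together when building input/target pairs, here is a small self-contained sketch. It restates the two helpers from these notes in plain Python; the long example sequence is made up. Note that, because the target is the input shifted by one position, both helpers return sequences of length `max_length - 1`.

```python
def truncate_sequences(sequence, max_length):
    """Cut a long sequence down to an input/target pair."""
    input_sequence = sequence[:max_length - 1]
    target_sequence = sequence[1:max_length]
    return input_sequence, target_sequence

def pad_sequences(sequence, max_length):
    """Extend a short sequence with the padding ID 0."""
    padding = [0] * (max_length - len(sequence))
    input_sequence = sequence[:-1] + padding
    target_sequence = sequence[1:] + padding
    return input_sequence, target_sequence

max_length = 7
short_input, short_target = pad_sequences([1, 5, 10, 9], max_length)
long_input, long_target = truncate_sequences(
    [4, 8, 2, 7, 3, 6, 1, 9, 5], max_length)

print(short_input, short_target)
print(long_input, long_target)
```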

In [None]:
"""
def pad_sequences(sequence, max_length):
    padding_amount = max_length - len(sequence)
    padding = [0 for i in range(padding_amount)]
    input_sequence = sequence[:-1] + padding
    target_sequence = sequence[1:] + padding
    return input_sequence, target_sequence
"""

### RNN/LSTM

#### Neural network types

Source of notes (besides Educative): https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce

Here, we're using recurrent neural networks (RNN), which are specially designed to work with sequential data of varying lengths. These are better for our context than feed-forward neural networks, which aren't good for dealing with sequences of text data. We use RNNs, rather than feed-forward neural networks, because this allows us to capture the sequential portion of our sequences (otherwise, if we just use feedforward NNs, we just see "many" inputs, rather than sequential inputs)

RNNs "remember" the past, and their decisions are influenced by what they learned in the past.


Feedforward NNs "remember" things too, but they only remember things they learned during training. For example, a convolutional NN learns what a "1" looks like during training and uses that knowledge to classify examples in a test set.

RNNs take this idea a step further by also "remembering" what they have seen earlier in the **current** sequence. For example, the computation for input 10 can use information about inputs 1-9, because its output depends not just on the regular input weights (as in feedforward NNs), but also on a "hidden state" that is carried over from the previous time steps. This is why they're good for sequential data: a sentence such as "I am very happy right now" produces a different sequence of hidden states than "I very am now right happy". A feedforward NN that ignores word order would treat these two sequences the same way, but a recurrent NN takes into account the order in which it sees the words (e.g., its prediction for the word "happy" depends on whether it came before or after "very").

So, for a RNN, the same input could produce a different output, depending on previous inputs in the series. 

If we were to apply a feedforward NN on the following text:

"Going to the mall is my favorite weekend activity and it makes me happy", 

Then a vanilla NN would take the sequence in its entirety and give one prediction. 

You can improve on this by tokenizing the string, taking each word separately (e.g., "going", "to", "the", "mall", ...), and running a NN on each of these words. But the downside is that you only get more inputs, and you lose the fact that the meaning of "happy" depends on which words come before or after it in the sequence.

RNNs and CNNs are so successful in part because of "parameter sharing": the same weights are reused at every position of the input, which lets the network apply what it has learned to neighboring values (for images, the neighboring pixels; for text, the neighboring words).

![image](chapter2_RNNs.png)

The diagram above illustrates how a RNN works to carry pertinent information from one input item to the next. 

For example, for the line "I really like basketball", the first word, "I", is multiplied by the input weights $W_x$ and combined with the initial hidden state (multiplied by the hidden weights $W_h$) to produce a new hidden state, which is then multiplied by $W_y$ to get an output $Y_1$. The next word, "really", is multiplied by the same $W_x$, but it is now combined with the hidden state left behind by "I" (again through $W_h$), and the result is multiplied by $W_y$ to get an output $Y_2$. The next word, "like", follows the same process: it is multiplied by $W_x$, combined through $W_h$ with a hidden state that now reflects both "I" and "really" (which came earlier in the sequence), and finally multiplied by $W_y$ to get an output $Y_3$. The weights $W_x$, $W_h$, and $W_y$ are shared across time steps; what changes from word to word is the hidden state.

This is an example of how RNNs can be fed sequences of text, and then use the sequencing information to get predictions of words based on their surrounding text (the example above is a unidirectional RNN, but it can also be made bidirectional). 
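A minimal sketch of this recurrence, assuming a single hidden unit, scalar inputs, and made-up weight values, shows why word order matters: the same inputs fed in a different order leave a different hidden state behind. The function name `rnn_forward` is hypothetical.

```python
import math

def rnn_forward(inputs, w_x, w_h, w_y):
    """Run a one-unit vanilla RNN and return the output at each time step."""
    hidden = 0.0  # initial hidden state
    outputs = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous state
        hidden = math.tanh(w_x * x + w_h * hidden)
        outputs.append(w_y * hidden)
    return outputs

# Hypothetical shared weights, reused at every time step
w_x, w_h, w_y = 0.5, 0.8, 1.5

forward = rnn_forward([1.0, 2.0, 3.0], w_x, w_h, w_y)
backward = rnn_forward([3.0, 2.0, 1.0], w_x, w_h, w_y)
print(forward[-1], backward[-1])
```

Both runs see the same three values and use the same weights, yet the final outputs differ because the hidden state accumulates the inputs in order.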

To make RNNs "deeper" (and, therefore, add the multi-level abstractions and representations we gain through "depth" in a typical neural network), there are typically four things we can do: (1) add hidden states, one on top of the other, feeding the output of one to the next, (2) add additional nonlinear hidden layers between the input and the hidden states, (3) increase depth in the hidden-to-hidden transition, or (4) increase depth in the hidden-to-output transition (see https://arxiv.org/pdf/1312.6026.pdf).

![image](chapter2_deepRNNs.png)

Above is an example of how we can make deeper RNNs.

We can also create bidirectional RNNs, by creating two sets of hidden layers, one going in the forward direction and one going in the backward direction. This lets us use the entire sentence all at once, and use information from words before and after a given target word. 

![image](chapter2_bidirectional_RNN.png)

#### LSTMs

LSTMs are a modification (a subtype) of RNNs that changes how we compute outputs and hidden states from the inputs.

Here is a link with information about LSTMs: (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Typical RNNs can be illustrated like the following: 

![image](chapter2_unrolled_RNNs.png)

An appeal of RNNs is that they can connect previous information (e.g., words earlier in the sentence) to the present task (e.g., predicting the sentiment of the current word).

The ability of RNNs depends on the sequence that we're looking at. For example, if we're predicting the last word of a sentence such as "the clouds are in the -----", it's pretty obvious that the next word is "sky". Where the gap between the relevant information and the place where it's needed is small (i.e., "clouds" and "sky" are only a few words apart), RNNs can learn to use the past info. 

But, imagine a longer sentence, such as "I grew up in France. I really liked going to school there. I had a lot of friends and we really liked playing soccer (football) and going out to the parks. I speak fluent ----", the key word to figure out the missing word is "France", since "France" --> "French". But, this word is so early in the sentence that our hidden weights, from a regular RNN, might not be adjusted enough to capture the fact that "France" was the key word in that sentence that we could use to predict that "French" was the last word. As sentences become longer, RNNs become unable to learn to connect the information. 

To get around this inability to learn long-term dependencies, LSTMs were created (see http://www.bioinf.jku.at/publications/older/2604.pdf). 

A typical RNN passes the input word through one layer (e.g., a single tanh layer), then the hidden layer then the output. This repeats for each of the words in a sequence (see below). 

![image](chapter2_RNNs_repeating_module.png)

LSTMs also share this chainlike structure across words in a sentence, but the repeating module (the module of operations that each input word goes through) has a different structure. Instead of having a single neural network layer, it has four layers, interacting in a special way:

![image](chapter2_LSTMs_repeating_module.png)

The following notation will be helpful as well:

![image](chapter2_LSTM_notation.png)

Each line carries an entire vector, from the output of one node to the inputs of others. Pink circles represent pointwise operations (e.g., vector addition), while yellow boxes are learned neural network layers. Lines merging indicate concatenation, while a line forking denotes its content being copied and the copies going to different locations.

#### Core idea behind LSTMs

The key to LSTMs is the **cell state**, the horizontal line at the top of the diagram that runs across the repeating modules. The cell state runs through the entire chain, with only minor linear interactions, so it's very easy for information to flow along it unchanged. This lets us model "long-term" information, since the cell state can take information from very early in the sequence and transfer it to later portions of the sequence, mostly untouched.

For typical RNNs, the hidden layer carries information about previously seen text and uses it to make conclusions on the target word. The cell state for LSTMs is similar (and LSTMs use their own hidden layers too, in addition to cell states), but it's more selective about the previous information and the weights that it uses to make subsequent predictions. 

So, for example, in the previous example we needed the word "France" in order to predict "French". Using an LSTM allows us to use information about "France" later on in the sequence, practically untouched, which lets us predict the word "French" later on. 

![image](chapter2_LSTM_cell_state.png)

To maintain this ability to transfer long-term information, we need to decide what information gets put in the cell state, this horizontal line. To control what information gets added or removed from the cell state, LSTMs use structures called gates.

Gates are composed of a sigmoid neural net layer and a pointwise multiplication operation. 

The sigmoid layer outputs numbers between zero and one, describing how much of each component in the vector should be let through and added to the cell state. If it's zero, then that component of the vector shouldn't be added, and if it's one, it should be added. If it's between 0 and 1, it's weighted accordingly. 

e.g., if the vector has 4 components, and the sigmoid has weights [0, 0.2, 0.1, 0.8], then for V = [v1, v2, v3, v4], we add [0 * v1, 0.2 * v2, 0.1 * v3, 0.8 * v4] into the cell state. 
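The gating arithmetic above is just pointwise multiplication, as the short sketch below shows. The vector values are made up, and `apply_gate` is a hypothetical helper; in a real LSTM the gate values would come from a learned sigmoid layer rather than being written by hand.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def apply_gate(gate_values, vector):
    """Scale each component of the vector by its gate value."""
    return [g * v for g, v in zip(gate_values, vector)]

# Gate values from the example above (each one is a sigmoid output)
gate_values = [0.0, 0.2, 0.1, 0.8]
# Hypothetical vector V = [v1, v2, v3, v4]
vector = [5.0, 10.0, -20.0, 2.5]

gated = apply_gate(gate_values, vector)
print(gated)
print(sigmoid(0.0))  # a pre-activation of 0 lets half of the value through
```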

A typical LSTM has three of these gates to protect and control the cell state.

![image](chapter2_example_insert_gate.png)

#### Step-by-step LSTM walkthrough

1. We need to decide what information we throw away from the cell state. This decision is made with a sigmoid layer called the "forget gate layer". It looks at the previous hidden state, $h_{t-1}$, and the current input, $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents "completely keep this", while a 0 represents "completely get rid of this". For example, if, from our previous input, we had information that used a gender such as "male", but our next input used "female", we want to throw away information associated with "male" because it won't help us predict "female".

![image](chapter2_LSTM_step1.png)

We see from the diagram above that the input information is fed into the "forget gate" and passed to the cell state (using pointwise multiplication) - so, for example, we can throw away information in the cell state about "male" if we know that our present input will be about women (so, the sigmoid output for the index for "male" would be 0, so the pointwise multiplication would give 0 and therefore the information in the cell state about "male" would be simplified to 0). 

2. Now that we know what information we want to eliminate from our present cell state, we want to decide what new information to store in the cell state. This procedure has two parts: (1) a sigmoid layer called the "input gate layer" decides which values of V = [v1, v2, ..., v_n] to update, and (2) a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. We combine these two to create an update to the state. For example, we can decide which components of the cell state we want to update (e.g., the 1st, 3rd, and 6th components), as well as the new values for those components (e.g., information about "female" or other words). One way to use this functionality is to replace information that was forgotten before. For example, if we had a sentence about "England", we can forget "England" in the previous step and replace it with "France" in this step, by (1) noting that we want to update the value in the cell state that corresponded to "England" and (2) setting the candidate for this value to "France".

![image](chapter2_LSTM_step2.png)

3. Now, we update the old cell state, $C_{t-1}$, into the new cell state, $C_t$. The previous steps already decided what to forget and what to insert; here we apply those decisions using pointwise operations. First, we multiply the old cell state by the forget gate output $f_t$, so that a gate value of 0 removes the corresponding value from the cell state. Then we add the scaled candidate values $i_t * \tilde{C}_t$ (the input gate output multiplied pointwise by the tanh layer's candidate values, as computed in step 2). This gives $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (see diagram).

![image](chapter2_LSTM_step3.png)

4. Now that we've updated our long-term cell state (which allows us to model long-term dependencies), we can decide what to output. The output will be based on our cell state, but will be a filtered version. First, we pass the input (together with the previous hidden state) through a sigmoid layer, which decides which parts of the cell state to output. Then, we pass the cell state through a tanh (which pushes the values to be between -1 and 1) and multiply it pointwise by the output of the sigmoid layer. As a result, the output only contains the parts of the cell state that we decided to use. For example, if the previous word in our input sentence was a subject, the cell state might tell us to expect a verb next, whether the next term should be singular or plural, or whether the next word is more likely to be "France" than "running". This output is passed both as the output of that unit and as the hidden state fed into the next module/word of the sequence.

![image](chapter2_LSTM_step4.png)
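For reference, the four steps above can be written compactly as the equations below. This is a sketch in the notation of the linked post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/); the weight and bias names ($W_f$, $b_f$, and so on) are labels assumed here, $\sigma$ is the sigmoid function, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input, and $\odot$ is pointwise multiplication.

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(step 1: forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(step 2: input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(step 2: candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(step 3: new cell state)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(step 4: output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(step 4: new hidden state)}
\end{aligned}
$$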

So, the output of one LSTM unit is (1) an output prediction of the word in question, (2) information about what to directly expect from the next word (e.g., proper noun vs. verb), and (3) a cell state, which contains "overarching info" about the entire sequence and allows us to remember long-term dependencies.

Notice that our input is passed into four layers, each of which has a separate role. The first layer that it is passed into is the "forget layer", which is used to "forget" information from the cell state. The next two layers are "insert layers", which determine what info to actually include in the cell state. The final layer is a sigmoid layer where the input is actually transformed into an output. So, the first three layers are used to update the cell state, while the last layer is used to get the actual output. We use four copies of the same input, since we need it four times.
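To make the roles of the four layers concrete, here is a minimal sketch of one LSTM time step in plain Python. It assumes a single hidden unit and a scalar input so that every weight is just a number; the weight values and the names `lstm_step` and `params` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for a single hidden unit and a scalar input."""
    def layer(name):
        # Every layer sees the same input and previous hidden state
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(layer("forget"))                 # step 1: what to keep from c_prev
    i = sigmoid(layer("input"))                  # step 2: which values to update
    c_candidate = math.tanh(layer("candidate"))  # step 2: candidate values
    c = f * c_prev + i * c_candidate             # step 3: new cell state
    o = sigmoid(layer("output"))                 # step 4: what to expose
    h = o * math.tanh(c)                         # step 4: new hidden state
    return h, c

# Hypothetical weights: (input weight, hidden weight, bias) for each layer
params = {
    "forget": (0.1, 0.2, 1.0),
    "input": (0.5, -0.3, 0.0),
    "candidate": (0.8, 0.4, 0.0),
    "output": (0.3, 0.1, 0.0),
}

h, c = 0.0, 0.0  # initial hidden state and cell state
for x in [1.0, -0.5, 2.0]:
    h, c = lstm_step(x, h, c, params)
    print(h, c)
```

Each call returns the hidden state (the unit's output) and the cell state, and both are passed on to the next time step.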

#### Variations of LSTMs

One popular LSTM variant adds "peephole connections" (see ftp://ftp.idsia.ch/pub/juergen/TimeCount-IJCNN2000.pdf), which means that we let the gate layers look at the cell state so that they can use information from the cell state to determine what to forget / include, as well as how to feed the input into the sigmoid layer (layer 3).

![image](chapter2_LSTM_peephole.png)

Another variation is to couple the forget and input gates. Instead of separately deciding what to forget and what to include, we can make those decisions together. We can connect the "forget" and "include" gates such that we only forget when we're going to input something in its place, and we input new values to the state only when we forget something older. 

![image](chapter2_LSTM_coupled_forget_include.png)

Yet another variation of the LSTM, which is more dramatic, is the Gated Recurrent Unit (or GRU), introduced in the following paper: (http://arxiv.org/pdf/1406.1078v3.pdf). It combines the "forget" and "input" gates into a single "update gate". It also merges the cell state and hidden state (the two were separate before but are now combined, so the unit's output serves both as the output for that unit and as the state passed to the next unit).

![image](chapter2_LSTM_GRU.png)
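As a rough sketch of the GRU update, the code below implements one time step for a single hidden unit with a scalar input. The weight values and the names `gru_step` and `params` are hypothetical. The reset gate used here is part of the standard GRU formulation, although these notes do not discuss it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, params):
    """One GRU time step for a single hidden unit and a scalar input."""
    w_x, w_h, b = params["update"]
    z = sigmoid(w_x * x + w_h * h_prev + b)   # update gate
    w_x, w_h, b = params["reset"]
    r = sigmoid(w_x * x + w_h * h_prev + b)   # reset gate
    w_x, w_h, b = params["candidate"]
    h_candidate = math.tanh(w_x * x + w_h * (r * h_prev) + b)
    # A single gate decides how much old state to keep and how much new to add
    return (1.0 - z) * h_prev + z * h_candidate

# Hypothetical weights: (input weight, hidden weight, bias) for each layer
params = {
    "update": (0.4, 0.2, 0.0),
    "reset": (0.3, -0.1, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h = 0.0
for x in [1.0, -0.5, 2.0]:
    h = gru_step(x, h, params)
    print(h)
```

There is no separate cell state: the single value `h` is both the unit's output and the state carried to the next step.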


In [None]:
"""
import tensorflow as tf

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def make_lstm_cell(self, dropout_keep_prob):
        cell = tf.nn.rnn_cell.LSTMCell(self.num_lstm_units)
        return cell
"""

### Dropout

We can use dropout to train a better LSTM model.

#### Regularization

RNNs can have many weight parameters (due to having a large number of hidden units in a cell or using multiple RNN layers), which can lead to overfitting the training set. To get around this, we can use regularization (modifying the neural network to mitigate the risk of overfitting).

Dropout is a popular method of regularization in feedforward neural networks, where we randomly set the outputs of some neurons to zero during training. Dropout works in feedforward NNs since it forces each neuron in the network to become more proficient and less dependent on the other neurons (otherwise, a neuron can learn imperfect weights because another neuron can learn weights that complement that first neuron and cover its imperfect weights). We can also use this in RNNs, but with some modifications.

#### Dropouts in RNNs

In RNNs, we apply dropout to the input and/or output of each cell unit. When dropout is applied to a cell's input/output at a particular time step, the cell's connection to the time step is made zero. So, whatever the previous input/output value was, it is set to 0 due to dropout. 

Applying dropout reduces the cell computations performed for the eventual RNN output, reducing the risk that the RNN will overfit the data (because we randomly take out information from our model).

![image](chapter2_RNN_dropout.png)

To use this in TensorFlow, we can use `tf.nn.rnn_cell.DropoutWrapper`, which takes in an RNN/LSTM cell as its required argument. The `input_keep_prob` argument represents the probability of keeping the cell's input at each time step. The `output_keep_prob` argument represents the same thing for cell outputs. The default values are both 1.0, which represents no dropout.

Usually, a good starting point is around 0.5 (randomly dropping half of inputs/outputs), and then adjusting based on how the model trains. 
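The effect of a keep probability can be sketched without TensorFlow. The helper `dropout` below is hypothetical; it assumes the common "inverted dropout" convention, where kept values are scaled by `1 / keep_prob` so that the expected value stays the same. The cell output values are made up.

```python
import random

def dropout(values, keep_prob, rng):
    """Keep each value with probability keep_prob; scale kept values by 1/keep_prob."""
    if keep_prob >= 1.0:
        return list(values)  # a keep probability of 1.0 means no dropout
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)  # fixed seed so the example is repeatable
cell_output = [0.3, -1.2, 0.7, 0.5, -0.1, 0.9]  # hypothetical cell outputs

dropped = dropout(cell_output, keep_prob=0.5, rng=rng)
unchanged = dropout(cell_output, keep_prob=1.0, rng=rng)
print(dropped)
print(unchanged)
```

Using a keep probability of 1.0 at evaluation time, as in the second call, leaves the values untouched.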



In [None]:
"""
import tensorflow as tf

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def make_lstm_cell(self, dropout_keep_prob):
        cell = tf.nn.rnn_cell.LSTMCell(
            self.num_lstm_units) # CODE UNDER THIS LINE
        dropout_cell = tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob = dropout_keep_prob)
        return dropout_cell

"""

### Multiple Layers

We can stack LSTM cell layers to improve performance. 

#### Stacking layers

Adding cell layers allows the model to pick up on more complex features from the input sequence and improve the performance on large datasets.

![image](chapter2_stacked_RNN_example.png)

At each time step in the diagram above, the first cell's output becomes the input for the second cell, and the second cell's output is the overall RNN's output for that particular time step. 

In the example above, [5, 13, 0, 3] is a tokenized four-word sequence. x1, x2, x3, and x4 represent the outputs of the first layer, and y1, y2, y3, and y4 represent the outputs of the second layer.
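The layering can be sketched with plain Python. For brevity this sketch assumes simple one-unit tanh RNN layers instead of LSTM cells; the input values, weights, and the name `rnn_layer` are hypothetical.

```python
import math

def rnn_layer(inputs, w_x, w_h):
    """Run a one-unit tanh RNN layer and return its output at each time step."""
    hidden = 0.0
    outputs = []
    for x in inputs:
        hidden = math.tanh(w_x * x + w_h * hidden)
        outputs.append(hidden)
    return outputs

# Hypothetical embedded values for a four-token sequence
embedded = [0.5, 1.3, 0.0, 0.3]
# One (input weight, hidden weight) pair per layer
layer_weights = [(0.9, 0.5), (1.1, -0.4)]

layer_input = embedded
for w_x, w_h in layer_weights:
    # Each layer's outputs become the next layer's inputs
    layer_input = rnn_layer(layer_input, w_x, w_h)
final_outputs = layer_input
print(final_outputs)
```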

Below, we create a multi-layer LSTM model. We set up a list containing the LSTM cells we want to use in our model. 

In [3]:
"""
def stacked_lstm_cells(self, is_training):
    dropout_keep_prob = 0.5 if is_training else 1.0
    cell_list = [self.make_lstm_cell(dropout_keep_prob) for i in range(self.num_lstm_layers)]
    cell = tf.nn.rnn_cell.MultiRNNCell(cell_list)
    return cell
"""

### LSTM Output

#### TensorFlow implementation

We create and run a RNN in TensorFlow using `tf.nn.dynamic_rnn`. This function takes in two required arguments: (1) the cell object used to create the RNN (e.g., LSTMCell, MultiRNNCell, etc.) and (2) the batch of input sequences, which are usually first converted to word embedding sequences. 

We need to state either the `initial_state` or the `dtype` arguments. The `initial_state` argument specifies the starting state for the input cell object. The `dtype` argument specifies the type of both the initial cell state and the RNN output (typically we just use "tf.float32"). 

Below is some example code for running the LSTM (where the input sequences have maximum length = 10 and embedding size = 20):

```python
import tensorflow as tf
cell = tf.nn.rnn_cell.LSTMCell(7)
input_sequences = tf.placeholder(
    tf.float32,
    shape=(None, 10, 20)
)
output, final_state = tf.nn.dynamic_rnn(
    cell,
    input_sequences,
    dtype=tf.float32
)
```

The output of this function is a tuple with the RNN outputs and the final state of the RNN. 

The first and second dimensions of the output (batch size and number of time steps) match those of the input batch, since the RNN calculates the output for each time step of each sequence in the input batch. The third dimension is equal to the number of hidden units in the cell object. For RNNs with multiple cells (e.g., a MultiRNNCell cell object), the third dimension is equal to the number of hidden units in the final cell.

#### Sequence lengths

Since each of the input sequences can have varying lengths, it is likely that many will contain padding. Since padding is essentially a sequence filler (and therefore adds no value to the RNN), we don't want the RNN to waste computation on the padded parts of a sequence. 

As a result, we can use the `sequence_length` argument in `tf.nn.dynamic_rnn`, which takes a 1-D integer tensor that specifies the non-padded lengths of each sequence in the input batch.

```python
import tensorflow as tf
lens = [4, 9, 10, 5, 10]
cell = tf.nn.rnn_cell.LSTMCell(7)
input_sequences = tf.placeholder(
    tf.float32,
    shape=(None, 10, 20)
)
output, final_state = tf.nn.dynamic_rnn(
    cell,
    input_sequences,
    sequence_length=lens,
    dtype=tf.float32
)
```
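The non-padded lengths can be derived directly from the token IDs, since the padding token has ID 0 and every vocabulary word has a positive ID. The plain-Python sketch below mirrors the sign-then-sum idea on a small batch; `mask_and_lengths` is a hypothetical helper and the batch reuses the padded sequences from the padding example.

```python
def sign(value):
    """Return 1 for positive numbers, 0 for zero, and -1 for negative numbers."""
    return (value > 0) - (value < 0)

def mask_and_lengths(batch):
    """Build a 0/1 mask of real tokens and count them per sequence."""
    binary = [[sign(token) for token in sequence] for sequence in batch]
    lengths = [sum(row) for row in binary]
    return binary, lengths

# Padded batch with max length = 7 (padding ID is 0)
batch = [
    [1, 5, 10, 9, 0, 0, 0],
    [3, 6, 1, 0, 0, 0, 0],
]

binary_sequences, sequence_lengths = mask_and_lengths(batch)
print(binary_sequences)
print(sequence_lengths)
```

The 0/1 mask is worth keeping around, because the same mask can later be used to ignore padded time steps in the loss.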

The function below, `run_lstm`, runs the LSTM model on input sequences. Our input sequences have already been converted to embeddings, and the sequence lengths have already been calculated. We need to run the LSTM model below.

In [None]:
"""
import tensorflow as tf

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    # Create a cell for the LSTM
    def make_lstm_cell(self, dropout_keep_prob):
        cell = tf.nn.rnn_cell.LSTMCell(self.num_lstm_units)
        return tf.nn.rnn_cell.DropoutWrapper(
            cell, output_keep_prob=dropout_keep_prob)
    
    # Stack multiple layers for the LSTM
    def stacked_lstm_cells(self, is_training):
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_list = [self.make_lstm_cell(dropout_keep_prob) for i in range(self.num_lstm_layers)]
        cell = tf.nn.rnn_cell.MultiRNNCell(cell_list)
        return cell
    
    # Convert input sequences to embeddings
    def get_input_embeddings(self, input_sequences):
        embedding_dim = int(self.vocab_size**0.25)
        initial_bounds = 0.5 / embedding_dim
        initializer = tf.random_uniform(
            [self.vocab_size, embedding_dim],
            minval=-initial_bounds,
            maxval=initial_bounds)
        self.input_embedding_matrix = tf.get_variable('input_embedding_matrix',
            initializer=initializer)
        input_embeddings = tf.nn.embedding_lookup(self.input_embedding_matrix, input_sequences)
        return input_embeddings
    
    # Run the LSTM on the input sequences
    def run_lstm(self, input_sequences, is_training):
        cell = self.stacked_lstm_cells(is_training)
        input_embeddings = self.get_input_embeddings(input_sequences)
        # Padding has ID 0, so sign() maps real tokens to 1 and padding to 0
        binary_sequences = tf.sign(input_sequences)
        # Summing the 0/1 mask gives each sequence's non-padded length
        sequence_lengths = tf.reduce_sum(binary_sequences, axis=1)
        lstm_outputs, _ = tf.nn.dynamic_rnn(
            cell,
            input_embeddings,
            sequence_length=sequence_lengths,
            dtype=tf.float32)
        return lstm_outputs, binary_sequences
"""

### Calculating Loss

#### Logits and loss

The task for a language model is no different from regular multiclass classification (since we can treat word predicting as predicting the correct "class" of the next word). 

Therefore, the loss function will still be the regular softmax cross entropy loss. We use a final fully-connected layer to convert model outputs into logits for each of the possible classes (i.e., each of the vocabulary words) and then return the word that has the highest probability.

In [None]:
"""
import tensorflow as tf
# Output from an LSTM
# Shape: (batch_size, time_steps, cell_size)
lstm_outputs = tf.placeholder(tf.float32, shape=(None, 10, 7))

vocab_size = 100
logits = tf.layers.dense(lstm_outputs, vocab_size)

# Target tokenized sequences
# Shape: (batch_size, time_steps)
target_sequences = tf.placeholder(tf.int64, shape=(None, 10))
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=target_sequences,
    logits=logits)

"""

The standard softmax cross entropy function requires that the labels and logits have the same shape (i.e., the labels are one-hot vectors). In our example, `logits` has 3 dimensions, while `labels` (`target_sequences`) only has 2. Labels like these are referred to as 'sparse' (i.e., they represent class indices rather than one-hot vectors), so we use the sparse version of the loss function.
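The relationship between sparse and one-hot labels can be checked with a small plain-Python sketch for a single time step. The logits are made up and the helper names are hypothetical; the point is only that a class index and its one-hot vector produce the same loss.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label):
    """Loss when the label is a class index."""
    return -math.log(softmax(logits)[label])

def dense_cross_entropy(logits, one_hot):
    """Loss when the label is a one-hot vector."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Hypothetical logits for a four-word vocabulary at one time step
logits = [2.0, 0.5, -1.0, 0.1]
label = 0                        # sparse label: index of the correct word
one_hot = [1.0, 0.0, 0.0, 0.0]   # the same label as a one-hot vector

sparse_loss = sparse_cross_entropy(logits, label)
dense_loss = dense_cross_entropy(logits, one_hot)
print(sparse_loss, dense_loss)
```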

#### Padding mask

When we calculate the loss based on the model's outputs, we don't want to include the loss calculated for the padded time steps (since these are filler and therefore meaningless for us). So, we use a **padding mask** to mask the padded time steps. The padding mask will have the same shape as the labels (e.g., target batch), but it will only contain 0s and 1s. Locations containing 0s represent padded time steps, while locations containing 1s represent actual input sequence tokens. We multiply the padding mask by the loss in order to zero out the padded time step locations.
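The masking arithmetic can be sketched in plain Python. The per-token loss values are made up and `masked_loss` is a hypothetical helper. The sketch also reports a per-token average, which is an optional extra shown here as an assumption; the loss described in these notes is the masked sum.

```python
def masked_loss(loss, pad_mask):
    """Zero out the loss at padded positions, then aggregate it."""
    masked = [
        [value * keep for value, keep in zip(loss_row, mask_row)]
        for loss_row, mask_row in zip(loss, pad_mask)
    ]
    total = sum(sum(row) for row in masked)
    num_tokens = sum(sum(row) for row in pad_mask)
    return masked, total, total / num_tokens

# Hypothetical per-token losses for a batch of 2 sequences of length 5
loss = [
    [0.5, 1.2, 0.3, 0.9, 2.0],
    [0.7, 0.4, 1.5, 1.1, 0.6],
]
# 1 marks a real token, 0 marks a padded time step
pad_mask = [
    [1.0, 1.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
]

masked, total, mean_per_token = masked_loss(loss, pad_mask)
print(masked)
print(total, mean_per_token)
```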

Below is an exam of a padding mask, with a batch size of 1 and max sequence length of 5. We can cast the padding mask to "tf.float32" so that it matches the type of the loss:

In [None]:
"""
import tensorflow as tf
# loss: Softmax loss for LSTM
with tf.Session() as sess:
    print(repr(sess.run(loss)))

# Same shape as loss
pad_mask = tf.constant([
    [1., 1., 1., 1., 0.],
    [1., 1., 0., 0., 0.]
])

new_loss = loss * pad_mask
with tf.Session() as sess:
    print(repr(sess.run(new_loss)))
"""
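As a worked illustration of the masking step, the plain-Python sketch below builds a padding mask from a tokenized batch and applies it to a per-time-step loss. It assumes that the padding ID is 0; the token IDs and loss values are made up. The final division by the number of real tokens is an optional normalization added here for illustration.

```python
# Tokenized target batch; 0 is assumed to be the padding ID
target_batch = [[5, 2, 7, 3, 0],
                [4, 9, 0, 0, 0]]
# Hypothetical per-time-step losses, same shape as the targets
loss = [[0.9, 1.2, 0.4, 0.7, 2.3],
        [1.1, 0.3, 2.3, 2.3, 2.3]]

# 1.0 marks a real token, 0.0 marks a padded time step
pad_mask = [[1.0 if token != 0 else 0.0 for token in row]
            for row in target_batch]

# Element-wise product zeroes out the loss at padded time steps
masked_loss = [[l * m for l, m in zip(loss_row, mask_row)]
               for loss_row, mask_row in zip(loss, pad_mask)]

total_loss = sum(sum(row) for row in masked_loss)
num_tokens = sum(sum(row) for row in pad_mask)
mean_loss = total_loss / num_tokens  # optional: average over real tokens only
print(pad_mask)
print(total_loss, mean_loss)
```

Without the mask, the meaningless 2.3 values at the padded positions would dominate the summed loss.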

In [None]:
"""
import tensorflow as tf

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    # Calculate model loss
    def calculate_loss(self, lstm_outputs, binary_sequences, output_sequences):
        # convert outputs of LSTM into logits
        logits = tf.layers.dense(lstm_outputs, self.vocab_size)
        # use sparse softmax cross entropy
        batch_sequence_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels = output_sequences, logits = logits)
        # use padding mask
        unpadded_loss = batch_sequence_loss * tf.cast(binary_sequences, tf.float32)
        # set overall loss, after mask
        overall_loss = tf.reduce_sum(unpadded_loss)
        return overall_loss

"""

### Predictions

We can create word predictions based on the output of the LSTM model.

#### Calculating probabilities

For an RNN, we can convert the logits to probabilities (on the final dimension of the logits). We apply the softmax function which gives us the probabilities for each word at each time step in each sequence. 

```python
import tensorflow as tf
logits = tf.placeholder(tf.float32, shape=(None, 5, 100))
probabilities = tf.nn.softmax(logits, axis=-1)
```

#### Word predictions

The RNN's word predictions are calculated by taking the highest probability word at each time step. We don't have to worry about negative values since our tokenization ensures that each vocabulary word is a positive integer. Below is an example of taking the highest probability word as our prediction:

```python
import tensorflow as tf
probabilities = tf.placeholder(tf.float32, shape=(None, 5, 100))
word_preds = tf.argmax(probabilities, axis=-1)
```
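To see the two steps with actual numbers, here is a small plain-Python sketch of softmax followed by argmax over the final (vocabulary) dimension. The logits and the 4-word vocabulary are hypothetical, and the `softmax` helper is a stand-in written for illustration, not the TensorFlow op itself.

```python
import math

def softmax(logits):
    """Convert one time step's logits into probabilities that sum to 1."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One sequence: 2 time steps, hypothetical vocabulary of 4 words
logits = [[0.1, 2.0, 0.3, 1.0],
          [1.5, 0.2, 0.1, 3.0]]

# Softmax over the last dimension, one time step at a time
probabilities = [softmax(step_logits) for step_logits in logits]

# Argmax over the last dimension: index of the most probable word
word_preds = [max(range(len(probs)), key=probs.__getitem__)
              for probs in probabilities]
print(word_preds)  # [1, 3]
```

Since softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same predictions.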
#### Current state-of-the-art

Currently, the best-performing models use transformers. See this paper: https://arxiv.org/abs/1706.03762 and this interactive platform, which runs OpenAI's GPT-2: https://talktotransformer.com/

#### Using tensor-indexing

How do we extract the word predictions? If it were a regular array, we could use list indexing/slicing. But, in TensorFlow, we have to use the `gather` functions to retrieve data at specific locations of the tensor. 

We can use either `tf.gather` or `tf.gather_nd`, which take in the same required arguments: `params` (the tensor that we want to retrieve data from) and `indices` (the locations in the tensor that we will index to).

The `tf.gather` function can be used to retrieve specific slices from a tensor, based on the `axis` keyword (default = 0). 

Below are some examples of `tf.gather`:


In [None]:
"""
import tensorflow as tf
t1 = tf.constant([1, 2, 3])
with tf.Session() as sess:
    print(repr(sess.run(tf.gather(t1, 0))))
    print(repr(sess.run(tf.gather(t1, 2))))

print('\n')
t2 = tf.constant([[1, 2, 3], [4, 5, 6]])
with tf.Session() as sess:
    print(repr(sess.run(tf.gather(t2, 0))))
    print(repr(sess.run(tf.gather(t2, 1, axis=1))))
    print(repr(sess.run(tf.gather(t2, [0, 2], axis=1))))

print('\n')
t3 = tf.constant([
    [[1, 2, 3], [4, 5, 6]],
    [[5, 6, 7], [7, 8, 9]]
])
with tf.Session() as sess:
    print(repr(sess.run(tf.gather(t3, 0))))
    print(repr(sess.run(tf.gather(t3, 1, axis=1))))
    print(repr(sess.run(tf.gather(t3, [0, 2], axis=2))))
"""

"""
Output:

1
3


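array([1, 2, 3], dtype=int32)
array([2, 5], dtype=int32)
array([[1, 3],
       [4, 6]], dtype=int32)


array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)
array([[4, 5, 6],
       [7, 8, 9]], dtype=int32)
array([[[1, 3],
        [4, 6]],

       [[5, 7],
        [7, 9]]], dtype=int32)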

"""

We can use `tf.gather_nd` to index specific locations in a tensor. It is typically used with a multi-dimensional `params` tensor, and the `indices` argument cannot be a single integer: the last dimension of `indices` holds a full or partial coordinate into `params`. See below for examples:

In [None]:
"""
with tf.Session() as sess:
    print(repr(sess.run(tf.gather_nd(t2, [0, 1]))))
    print(repr(sess.run(tf.gather_nd(t2, [[0, 1], [1, 1]]))))

print('\n')
with tf.Session() as sess:
    print(repr(sess.run(tf.gather_nd(t3, [0, 1]))))
    print(repr(sess.run(tf.gather_nd(t3, [[0, 0], [1, 1]]))))
    print(repr(sess.run(tf.gather_nd(t3, [0, 1, 2]))))
"""

"""
Output:

2
array([2, 5], dtype=int32)


array([4, 5, 6], dtype=int32)
array([[1, 2, 3],
       [7, 8, 9]], dtype=int32)
6
"""
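One way to read these results is that each innermost list in `indices` is a coordinate, and a coordinate shorter than the rank of `params` returns a whole slice. The plain-Python function below is a rough emulation of that behavior on nested lists, written only for illustration; the name `gather_nd` is reused for readability and this is not how TensorFlow implements the op.

```python
def gather_nd(params, indices):
    """Rough nested-list emulation of coordinate-based gathering."""
    # A flat list of ints is one coordinate: descend one axis per entry
    if all(isinstance(i, int) for i in indices):
        result = params
        for i in indices:
            result = result[i]
        return result
    # Otherwise each element is itself a coordinate (or a list of them)
    return [gather_nd(params, idx) for idx in indices]

t2 = [[1, 2, 3], [4, 5, 6]]
t3 = [[[1, 2, 3], [4, 5, 6]],
      [[5, 6, 7], [7, 8, 9]]]

print(gather_nd(t2, [0, 1]))            # 2
print(gather_nd(t2, [[0, 1], [1, 1]]))  # [2, 5]
print(gather_nd(t3, [0, 1]))            # [4, 5, 6]
print(gather_nd(t3, [[0, 0], [1, 1]]))  # [[1, 2, 3], [7, 8, 9]]
print(gather_nd(t3, [0, 1, 2]))         # 6
```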

In [None]:
"""
import tensorflow as tf

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

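    # Predict next word ID
    def get_word_predictions(self, word_preds, binary_sequences, batch_size):
        # Row index of each sequence in the batch
        row_indices = tf.range(batch_size)
        # Index of the last non-padded time step in each sequence
        final_indexes = tf.reduce_sum(binary_sequences, axis = 1) - 1
        # Pair each row with its final time step: shape (batch_size, 2)
        gather_indices = tf.transpose([row_indices, final_indexes])
        # Prediction at the final time step of each sequence
        final_id_predictions = tf.gather_nd(word_preds, gather_indices)
        return final_id_predictions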

"""
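The indexing logic in `get_word_predictions` can be traced by hand on a small batch. The plain-Python sketch below mirrors the same steps with lists; the predicted word IDs are made up, and the binary sequences reuse the padding mask pattern from earlier (1 for real tokens, 0 for padding).

```python
# Hypothetical predicted word IDs: 2 sequences, 5 time steps each
word_preds = [[12, 7, 33, 4, 0],
              [8, 21, 0, 0, 0]]
# 1 marks a real token, 0 marks a padded time step
binary_sequences = [[1, 1, 1, 1, 0],
                    [1, 1, 0, 0, 0]]

# Row index of each sequence in the batch
row_indices = list(range(len(word_preds)))
# Sequence length minus 1 gives the index of the last real time step
final_indexes = [sum(row) - 1 for row in binary_sequences]
# One (row, time step) coordinate per sequence
gather_indices = list(zip(row_indices, final_indexes))
# Look up the prediction at each coordinate
final_id_predictions = [word_preds[row][step] for row, step in gather_indices]

print(gather_indices)        # [(0, 3), (1, 1)]
print(final_id_predictions)  # [4, 21]
```

Each sequence contributes exactly one prediction, taken from its last real token rather than from the last column of the padded batch.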