In [1]:
import tensorflow as tf

### Introduction

In this section, we'll use an LSTM (long short-term memory) network for language modeling. 

#### Language modeling

A language model can tell us the likelihood of each word in a sentence or passage, based on the words that came before it. We can then determine how likely a sentence or text passage is by aggregating its individual word probabilities.
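
As a minimal sketch of the aggregation step, we can assume the usual chain-rule approach: multiply the per-word conditional probabilities, or equivalently sum their logarithms (which avoids numerical underflow on long passages). The probabilities below are made up for illustration, and `sequence_log_prob` is a hypothetical helper name.

```python
import math

# Hypothetical per-word conditional probabilities P(word | previous words)
# for a three-word sentence. The values are invented for illustration.
word_probs = [0.2, 0.1, 0.5]

def sequence_log_prob(probs):
    """Sum of log-probabilities, which equals the log of their product."""
    return sum(math.log(p) for p in probs)

log_prob = sequence_log_prob(word_probs)
sentence_prob = math.exp(log_prob)  # approximately 0.2 * 0.1 * 0.5 = 0.01
print(log_prob, sentence_prob)
```

Working in log space is a common design choice because the product of many small probabilities quickly rounds to zero.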

#### Language model tasks

Language models are useful for both text classification and generation. In classification, we can use the language model probabilities to separate text into different categories.

For example, if we trained a language model on spam email subject lines, we could score a new subject line with the model and flag it as likely spam when the model assigns it a high probability.

We can also use a language model to generate text from an incomplete input sentence (e.g., autocomplete). The model suggests completions by choosing the words with the highest probabilities.
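
To make the autocomplete idea concrete, here is a small sketch that ranks candidate next words by probability. The probability table and the `suggest` helper are hypothetical; a trained model would produce the probabilities.

```python
# Hypothetical next-word probabilities for some incomplete sentence.
next_word_probs = {"coffee": 0.45, "tea": 0.35, "water": 0.15, "sand": 0.05}

def suggest(probs, k=2):
    """Return the k most probable next words, most probable first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

suggestions = suggest(next_word_probs)
print(suggestions)  # ['coffee', 'tea']
```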

### Language Model

We can use language models to calculate word probabilities.

#### Word probabilities

The probability for each word is conditioned on the words that appear before it in the sequence. Because of this, the task of calculating a word probability based on the previous words in a sequence is essentially multiclass classification. 

Since each word in a sequence must come from the text corpus' vocabulary, we consider each vocabulary word as a separate class. We can use the previous sequence words as input and the word of interest as the word that we are predicting, in order to calculate probabilities for each vocabulary class.

![title](chapter2_wordProbs.png)
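
As a rough illustration of the multiclass view, the sketch below turns one score per vocabulary word into a probability distribution with a softmax, then picks the most likely class. The vocabulary and scores are invented, and using softmax here is an assumption (it is the usual choice for multiclass classification).

```python
import math

# Hypothetical four-word vocabulary; each word is one class.
vocab = ["the", "cat", "sat", "mat"]

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores a model might assign to each vocabulary word,
# given some prefix of previous words.
scores = [0.1, 0.3, 2.0, 0.5]
probs = softmax(scores)
predicted = vocab[probs.index(max(probs))]
print(dict(zip(vocab, probs)), predicted)
```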

#### Inputs & targets

The goal behind training a language model is to predict each word in a sequence based on the words that come before it. 

But, instead of making training pairs where the target is a single word, we can make training pairs where the input and target sequences are the same length. 

![title](chapter2_example_training_words.png)

The language model attempts to predict each word in the target sequence based on its corresponding prefix in the input sequence. In the example, the input prefix-target word pairs are: 

![title](chapter2_language_pairs.png)

So we can predict, say, the third word in the text by using the first two words as the "prefix" and the third word as the target.
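
The pairing can be sketched in a few lines. The sentence below is a hypothetical example (not the one in the figures): the target sequence is the input sequence shifted by one word, and each target word is paired with the input prefix that ends just before it.

```python
# Hypothetical tokenized sentence.
tokens = ["she", "went", "to", "the", "store"]

# Input and target have the same length; the target is shifted by one word.
input_sequence = tokens[:-1]   # she, went, to, the
target_sequence = tokens[1:]   # went, to, the, store

# Each target word is predicted from the input prefix up to the same position.
pairs = [
    (input_sequence[: i + 1], target_sequence[i])
    for i in range(len(target_sequence))
]
for prefix, target in pairs:
    print(prefix, "->", target)
```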

#### Maximum length

In this setup, we see that words later on in the string will have a longer prefix than words earlier in the string. We can put a limit on the length of the training sequences. 

Using a fixed max sequence length can increase training speed and help the model avoid overfitting on uncommon text dependencies (which might show up more often in long, run-on sentences that are likely just people rambling).

![title](chapter2_language_fixed_length.png)


In [None]:
"""
import tensorflow as tf

def truncate_sequences(sequence, max_length):
    # Use only the first max_length tokens; the target is the input shifted one token ahead.
    input_sequence = sequence[:max_length-1]
    target_sequence = sequence[1:max_length]
    return (input_sequence, target_sequence)

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def get_input_target_sequence(self, sequence):
        seq_len = len(sequence)
        if seq_len >= self.max_length:
            input_sequence, target_sequence = truncate_sequences(
                sequence, self.max_length
            )
        else:
            # pad_sequences is defined in the Padding section below
            input_sequence, target_sequence = pad_sequences(
                sequence, self.max_length
            )
        return input_sequence, target_sequence
"""

### Padding

We can use padding to make our tokenized sequences the same length.

#### Varied length sequences

For most neural networks, the input data has a fixed length. This is because most neural networks are ***feed-forward***, which means that they use multiple layers of ***fixed*** sizes to compute the network's output. 

But, since text data involves different-sized text sequences (e.g., sentences, passages, paragraphs, etc.), the language model must be flexible enough to handle input data of different lengths. 

As a result, we use a recurrent neural network (RNN) for the language model. 

#### Padded sequences

Even though RNNs and LSTMs can take input texts of different lengths, each tokenized sequence in a training batch still needs to be the same length. This is because a training batch must have a proper tensor shape (e.g., be a 2-D matrix).

To enforce this, we can use padding. For sequences that are shorter than the maximum sequence length, we append a special non-vocabulary token to the end of the sequence until its length equals the maximum sequence length. Typically the token is given ID = 0, while each vocabulary word ID is a positive integer.

For example:

1. [1, 5, 10, 9] with max length = 7 --> [1, 5, 10, 9, 0, 0, 0]
2. [3, 6, 1] with max length = 7 --> [3, 6, 1, 0, 0, 0, 0]

In [None]:
"""
def pad_sequences(sequence, max_length):
    padding_amount = max_length - len(sequence)
    padding = [0] * padding_amount
    input_sequence = sequence[:-1] + padding
    target_sequence = sequence[1:] + padding
    return input_sequence, target_sequence
"""
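
To see why padding gives a batch a proper 2-D shape, here is a small standalone sketch that pads the two example sequences to the same length and builds a mask marking the real tokens. `pad_to_length` is a hypothetical helper, separate from the `pad_sequences` function in the cell above, and the mask is an illustrative extra rather than something the cell computes.

```python
def pad_to_length(sequence, max_length, pad_id=0):
    """Append pad_id until the sequence has max_length tokens."""
    return sequence + [pad_id] * (max_length - len(sequence))

batch = [[1, 5, 10, 9], [3, 6, 1]]
max_length = 7

# Every row now has the same length, so the batch is a 2-D matrix.
padded_batch = [pad_to_length(seq, max_length) for seq in batch]

# 1 marks a real token, 0 marks padding (valid because word IDs are positive).
mask = [[1 if token != 0 else 0 for token in row] for row in padded_batch]

print(padded_batch)
print(mask)
```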

### RNN/LSTM

#### Neural network types

Source of notes (besides Educative): https://towardsdatascience.com/recurrent-neural-networks-d4642c9bc7ce

Here, we're using recurrent neural networks (RNNs), which are designed to work with sequential data of varying lengths. They suit our context better than feed-forward neural networks because they capture the order of the words in a sequence; a feed-forward NN would just see "many" inputs rather than sequential inputs.

RNNs "remember" the past, and their decisions are influenced by what they learned from it. 


Feedforward NNs "remember" things too, but only what they learned during training. For example, a convolutional NN learns what a "1" looks like during training and uses that knowledge to classify new images. 

RNNs take this idea a step further by also "remembering" things **while** they process a sequence. For example, when processing input 10, an RNN can use information about inputs 1-9: its output depends not just on the regular weights (as in feedforward NNs), but also on a "hidden state" that summarizes what came before. This is why RNNs are good for sequential data: a sentence such as "I am very happy right now" is a different sequence than "I very am now right happy". A feedforward NN that sees the words as an unordered set of inputs would treat these two the same way, but a recurrent NN takes into account the order in which it sees the words (e.g., its prediction for the word "happy" depends on whether it came before or after "very"). 

So, for an RNN, the same input could produce a different output, depending on the previous inputs in the series. 

If we were to apply a feedforward NN on the following text:

"Going to the mall is my favorite weekend activity and it makes me happy", 

Then a vanilla NN would take the sequence in its entirety and give one prediction. 

You can improve this by tokenizing the string, taking each word separately (e.g., "going", "to", "the", "mall", ...), and running a NN on each word. But the downside is that you only get more inputs, and you lose the fact that the meaning of "happy" depends on the words that come before or after it in the sequence.

RNNs and CNNs are so successful partly because of "parameter sharing": the same weights are reused at every position, so the network can apply what it learns to neighboring values (neighboring pixels for images, or neighboring words for text). 

![image](chapter2_RNNs.png)

The diagram above illustrates how an RNN carries pertinent information from one input item to the next. 

For example, for the line "I really like basketball", the first word, "I", is multiplied by the input weight $W_x$ to produce a hidden state, which is then multiplied by $W_y$ to get an output $Y_1$. The next word, "really", is multiplied by the same $W_x$ and combined with the previous hidden state multiplied by $W_h$. Because that hidden state carries information from "I", the new hidden state reflects both words, and multiplying it by $W_y$ gives the output $Y_2$. The next word, "like", follows the same process: it is multiplied by $W_x$, combined through $W_h$ with a hidden state that has been shaped by "I" and "really" (which came earlier in the sequence), and finally multiplied by $W_y$ to get an output $Y_3$. 
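
A minimal sketch of this recurrence, assuming the standard formulation $h_t = \tanh(W_x x_t + W_h h_{t-1})$ and $y_t = W_y h_t$ (the diagram may differ in details). To keep it short, the weights are hypothetical scalars rather than learned matrices, and the inputs are plain numbers standing in for word embeddings.

```python
import math

# Hypothetical scalar weights, shared across every step of the sequence.
W_x, W_h, W_y = 0.5, 0.8, 1.5

def rnn_forward(inputs, h=0.0):
    """Run a one-unit RNN over the inputs and return one output per step."""
    outputs = []
    for x in inputs:
        # The new hidden state mixes the current input with the previous state.
        h = math.tanh(W_x * x + W_h * h)
        outputs.append(W_y * h)
    return outputs

# The same input value at every step still gives different outputs,
# because the hidden state changes as the sequence is read.
outputs = rnn_forward([1.0, 1.0, 1.0])
print(outputs)
```

Note that the weights never change during the forward pass; only the hidden state carries information from earlier inputs.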

This is an example of how RNNs can be fed sequences of text, and then use the sequencing information to get predictions of words based on their surrounding text (the example above is a unidirectional RNN, but it can also be made bidirectional). 

To make RNNs "deeper" (and, therefore, add the multi-level abstractions and representations we gain through "depth" in a typical neural network), there are typically four things we can do: (1) stack hidden states one on top of the other, feeding the output of one to the next, (2) add nonlinear hidden layers between the input and the hidden state, (3) increase depth in the hidden-to-hidden transition, or (4) increase depth in the hidden-to-output transition (see https://arxiv.org/pdf/1312.6026.pdf).

![image](chapter2_deepRNNs.png)

Above is an example of how we can make deeper RNNs.

We can also create bidirectional RNNs, by creating two sets of hidden layers, one going in the forward direction and one going in the backward direction. This lets us use the entire sentence all at once, and use information from words before and after a given target word. 

![image](chapter2_bidirectional_RNN.png)
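
The bidirectional idea can be sketched by running one recurrence left to right and a second one right to left, then pairing the two hidden states at each position. This is a simplified scalar sketch with hypothetical weights and inputs; `run_rnn` is an illustrative helper, not a library function.

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Return the hidden state after each step of a one-unit RNN."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Hypothetical numeric inputs standing in for word embeddings.
inputs = [0.5, -1.0, 2.0]

# Forward pass reads the sequence left to right.
forward_states = run_rnn(inputs, w_x=0.5, w_h=0.8)

# Backward pass reads it right to left with its own weights; reversing the
# result lines each state up with the position it belongs to.
backward_states = run_rnn(inputs[::-1], w_x=0.3, w_h=0.6)[::-1]

# Each position now sees context from both before and after it.
combined = list(zip(forward_states, backward_states))
print(combined)
```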

#### LSTMs

LSTMs are a subtype of RNNs that change how we compute outputs and hidden states from inputs. 

Here is a link with information about LSTMs: (http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

Typical RNNs can be illustrated like the following: 

![image](chapter2_unrolled_RNNs.png)

One appeal of RNNs is that they can connect previous information (e.g., words earlier in the sentence) to the present task (e.g., predicting the sentiment of the current word).

How well RNNs do this depends on the sequence that we're looking at. For example, if we're predicting the last word of a sentence such as "the clouds are in the -----", it's pretty obvious that the missing word is "sky". When the gap between the relevant information and the place where it's needed is small (i.e., "clouds" and "sky" are only a few words apart), RNNs can learn to use the past information. 

But imagine a longer passage, such as "I grew up in France. I really liked going to school there. I had a lot of friends and we really liked playing soccer (football) and going out to the parks. I speak fluent ----". The key word for figuring out the missing word is "France", since "France" --> "French". But this word appears so early in the passage that the hidden state of a regular RNN might no longer carry enough information about it by the time we need to predict "French". As the gap grows, RNNs become unable to learn to connect the information. 

To get around this inability to learn long-term dependencies, LSTMs were created (see http://www.bioinf.jku.at/publications/older/2604.pdf). 

A typical RNN passes the input word and the previous hidden state through a single layer (e.g., a single tanh layer) to produce the new hidden state, and then the output. This repeats for each word in a sequence (see below). 

![image](chapter2_RNNs_repeating_module.png)

LSTMs also share this chainlike structure across words in a sentence, but the repeating module (the module of operations that each input word goes through) has a different structure. Instead of having a single neural network layer, it has four layers, interacting in a special way:

![image](chapter2_LSTMs_repeating_module.png)
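
As a rough sketch of how the four layers interact, the code below follows the commonly used LSTM formulation (forget gate, input gate, candidate values, and output gate), as described in the linked post. All parameters are hypothetical scalars chosen for illustration; a real LSTM uses learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scalar parameters for each of the four layers:
# (weight on the input, weight on the previous hidden state, bias).
params = {
    "forget": (0.5, 0.4, 0.0),
    "input": (0.6, 0.3, 0.0),
    "candidate": (0.9, 0.2, 0.0),
    "output": (0.7, 0.1, 0.0),
}

def layer(name, x, h_prev, activation):
    w_x, w_h, b = params[name]
    return activation(w_x * x + w_h * h_prev + b)

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: returns the new hidden state and cell state."""
    f = layer("forget", x, h_prev, sigmoid)             # how much old cell state to keep
    i = layer("input", x, h_prev, sigmoid)              # how much new information to write
    c_tilde = layer("candidate", x, h_prev, math.tanh)  # proposed new information
    o = layer("output", x, h_prev, sigmoid)             # how much cell state to expose
    c = f * c_prev + i * c_tilde
    h = o * math.tanh(c)
    return h, c

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:  # hypothetical inputs standing in for word embeddings
    h, c = lstm_step(x, h, c)
print(h, c)
```

The separate cell state `c` is what lets information persist over many steps: the forget gate can keep it nearly unchanged instead of overwriting it at every word.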

The main component of an RNN is its cell
