In [1]:
import tensorflow as tf

### Introduction

In this section, we'll use an LSTM (long short-term memory) network for language modeling. 

#### Language modeling

A language model can tell us the likelihood of each word in a sentence or passage, based on the words that came before it. We can then determine how likely a sentence or text passage is by multiplying its individual word probabilities together (or, equivalently, summing their logarithms).
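
As a rough illustration of combining word probabilities into a sentence probability, the sketch below sums log-probabilities and exponentiates the result. The probability values and the helper name `sequence_log_prob` are made up for this example; a trained model would supply the real numbers.

```python
import math

# Hypothetical conditional probabilities P(word | previous words)
# for a three-word sentence; the values are illustrative only.
word_probs = [0.20, 0.05, 0.10]

def sequence_log_prob(probs):
    """Sum the log of each conditional word probability."""
    return sum(math.log(p) for p in probs)

log_prob = sequence_log_prob(word_probs)
# Exponentiating recovers the product 0.20 * 0.05 * 0.10 (about 0.001)
sentence_prob = math.exp(log_prob)
print(sentence_prob)
```

Working in log space is a common choice because multiplying many small probabilities directly can underflow for long passages.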

#### Language model tasks

Language models are useful for both text classification and generation. In classification, we can use the language model probabilities to separate text into different categories.

For example, if we trained a language model on spam email subject lines, we could use the probability the model assigns to a new subject line to judge whether it is likely to be spam.

We can also use a language model to generate text based on an incomplete input sentence (e.g., autocomplete). The model suggests completions for the sentence by choosing the words with the highest probabilities.
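
A minimal sketch of the autocomplete idea, assuming we already have next-word probabilities for some prefix. The dictionary `next_word_probs` and the function `suggest` are hypothetical stand-ins for a trained model's output.

```python
# Hypothetical next-word probabilities a trained model might assign
# after the prefix "thank you for your"; values are illustrative only.
next_word_probs = {"help": 0.41, "time": 0.33, "email": 0.12, "banana": 0.001}

def suggest(probs, k=2):
    """Return the k most probable next words, most likely first."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

suggestions = suggest(next_word_probs)
print(suggestions)  # ['help', 'time']
```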

### Language Model

We can use language models to calculate word probabilities.

#### Word probabilities

The probability for each word is conditioned on the words that appear before it in the sequence. Because of this, the task of calculating a word probability based on the previous words in a sequence is essentially multiclass classification. 

Since each word in a sequence must come from the text corpus's vocabulary, we consider each vocabulary word as a separate class. We use the preceding words in the sequence as the input and the word of interest as the label, so that the model outputs a probability for every vocabulary class.

![title](chapter2_wordProbs.png)
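
To make the multiclass view concrete, here is a small sketch that turns one raw score per vocabulary word into probabilities with a softmax. The four-word vocabulary and the scores are invented for illustration; in the real model the scores would come from the LSTM's output layer.

```python
import math

vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary, one class per word

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores a model might output for the word after "the cat"
logits = [0.1, 0.3, 2.5, 0.7]
probs = softmax(logits)

# The predicted class is the vocabulary word with the highest probability
predicted = vocab[probs.index(max(probs))]
print(predicted)  # sat
```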

#### Inputs & targets

The goal behind training a language model is to predict each word in a sequence based on the words that come before it. 

However, instead of making training pairs where the target is a single word, we can make training pairs where the input and target sequences are the same length, with the target shifted one word ahead of the input.

![title](chapter2_example_training_words.png)

The language model attempts to predict each word in the target sequence based on its corresponding prefix in the input sequence. In the example, the input prefix-target word pairs are: 

![title](chapter2_language_pairs.png)

So we can predict, say, the third word in the text by using the first two words as the "prefix" and the third word as the target.
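
The prefix-target pairing can be sketched in plain Python as follows. The example sentence and the helper names `make_input_target` and `prefix_target_pairs` are assumptions made for this illustration, not part of the chapter's code.

```python
tokens = ["she", "bought", "a", "book"]  # hypothetical tokenized sentence

def make_input_target(tokens):
    """Input drops the last token, target drops the first; lengths match."""
    return tokens[:-1], tokens[1:]

def prefix_target_pairs(tokens):
    """Pair each target word with the input prefix that precedes it."""
    input_seq, target_seq = make_input_target(tokens)
    return [(input_seq[: i + 1], target_seq[i]) for i in range(len(target_seq))]

input_seq, target_seq = make_input_target(tokens)
pairs = prefix_target_pairs(tokens)

for prefix, target in pairs:
    print(prefix, "->", target)
# ['she'] -> bought
# ['she', 'bought'] -> a
# ['she', 'bought', 'a'] -> book
```

The second pair is the case described above: the first two words form the prefix and the third word is the target.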

#### Maximum length

In this setup, words later in the sequence have a longer prefix than words earlier in the sequence. To keep prefixes from growing without bound, we can put a limit on the length of the training sequences.

Using a fixed max sequence length can increase training speed and help the model avoid overfitting on uncommon text dependencies (which might show up more often in long, run-on sentences that are likely just people rambling).

![title](chapter2_language_fixed_length.png)


In [None]:
"""
import tensorflow as tf

def truncate_sequences(sequence, max_length):
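    # Input: the first max_length - 1 tokens
    input_sequence = sequence[:max_length-1]
    # Target: the next token for each input position (tokens 1 .. max_length - 1)
    target_sequence = sequence[1:max_length]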
    return (input_sequence, target_sequence)

# LSTM Language Model
class LanguageModel(object):
    # Model Initialization
    def __init__(self, vocab_size, max_length, num_lstm_units, num_lstm_layers):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.num_lstm_layers = num_lstm_layers
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=vocab_size)

    def get_input_target_sequence(self, sequence):
        seq_len = len(sequence)
        if seq_len >= self.max_length:
            input_sequence, target_sequence = truncate_sequences(
                sequence, self.max_length
            )
        else:
            # Shorter sequences are padded; pad_sequences is covered in the next chapter
            input_sequence, target_sequence = pad_sequences(
                sequence, self.max_length
            )
        return input_sequence, target_sequence
"""
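
As a quick check of the truncation logic, the sketch below applies the same slicing to a toy list of token IDs. The function body is repeated so the snippet runs without the rest of the notebook, and the IDs are arbitrary placeholders rather than real tokenizer output.

```python
def truncate_sequences(sequence, max_length):
    # Same slicing as the cell above, repeated so this snippet is self-contained
    input_sequence = sequence[:max_length - 1]
    target_sequence = sequence[1:max_length]
    return input_sequence, target_sequence

token_ids = [12, 7, 45, 3, 28, 9]  # hypothetical tokenized sentence
inputs, targets = truncate_sequences(token_ids, max_length=4)

print(inputs)   # [12, 7, 45]
print(targets)  # [7, 45, 3]
```

Both outputs have `max_length - 1` elements, and each target is the token that follows the corresponding input position.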

### Padding

We can use padding to bring our tokenized sequences to a common length.

#### Varied length sequences

For most neural networks, the input data always has a fixed length. This is the case because most neural networks are ***feed-forward***, which means that they use multiple layers of ***fixed*** sizes to compute the network's output.

However, because text data involves different-sized text sequences (e.g., sentences, passages, paragraphs, etc.), the language model must be flexible enough to handle input data of different lengths.

As a result, we use a recurrent neural network (RNN) for the language model. 
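
To illustrate why a recurrent network can accept inputs of any length, here is a toy recurrence with a single scalar state. This is not an LSTM: the function `run_rnn` and the weights `w_in` and `w_rec` are hypothetical, chosen only to show that the same parameters are reused at every step.

```python
import math

def run_rnn(inputs, w_in=0.5, w_rec=0.8):
    """Apply the same two weights at every step and return the final state."""
    state = 0.0
    for x in inputs:
        # The new state depends on the current input and the previous state
        state = math.tanh(w_in * x + w_rec * state)
    return state

# The same function handles sequences of different lengths
short = run_rnn([1.0, -0.5])
long = run_rnn([1.0, -0.5, 0.3, 0.9, -1.2])
print(short, long)
```

Because the loop runs once per input element, no layer size has to be fixed to the sequence length in advance.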

#### Padded sequences



