### Text Classification

In this section, we'll build a bidirectional LSTM for text classification.

#### Classifying text

An important task in NLP is to classify text into different categories. Some uses of this include spam filtering, classifying user product reviews, and automatically flagging inappropriate or harmful social media posts.

### Sentiment Analysis

#### Classifying sentiment

We can classify the sentiment of text (i.e., the writer's attitude) using multiclass classification (e.g., rating their emotion along a scale) or binary classification (e.g., "positive" vs. "negative").

#### Training pairs

Our input data is tokenized text sequences, and each text sequence will be labeled with a class/category.

For example: 

1. ([1, 5, 6, 8, 2], 0)
2. ([3, 5, 2, 9, 8], 1)

These are two examples of training tuples, where the first entry in each tuple is the sequence of token IDs and the second entry is the label.
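As a rough illustration of how such pairs come about, here is a minimal plain-Python sketch. The toy corpus, the labels, the `max_length` value, and the whitespace-based vocabulary are all assumptions made for this example; a real tokenizer would assign IDs differently.

```python
# Hypothetical toy corpus and labels, for illustration only.
texts = [
    "great movie loved it",
    "terrible plot and bad acting",
    "loved the acting",
]
labels = [1, 0, 1]
max_length = 4  # assumed maximum sequence length

# Build a word -> ID vocabulary (IDs start at 1) in order of first appearance.
vocab = {}
for text in texts:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1


def make_training_pairs(texts, labels, vocab, max_length):
    """Tokenize each text, truncate it to max_length, and pair it with its label."""
    pairs = []
    for text, label in zip(texts, labels):
        sequence = [vocab[word] for word in text.split()]
        pairs.append((sequence[:max_length], label))
    return pairs


training_pairs = make_training_pairs(texts, labels, vocab, max_length)
for pair in training_pairs:
    print(pair)
```

The second text has five words, so its sequence is cut down to the first four token IDs.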

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column

# Text classification model
class ClassificationModel(object):
    # Model initialization
    def __init__(self, vocab_size, max_length, num_lstm_units):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    def tokenize_text_corpus(self, texts):
        self.tokenizer.fit_on_texts(texts)
        sequences = self.tokenizer.texts_to_sequences(texts)
        return sequences
    
    # Create training pairs for text classification
    def make_training_pairs(self, texts, labels):
        sequences = self.tokenize_text_corpus(texts)
        for i in range(len(sequences)):
            sequence = sequences[i]
            if len(sequence) > self.max_length:
                sequences[i] = sequence[:self.max_length]
        training_pairs = list(zip(sequences, labels))
        return training_pairs
"""

### Embeddings

We can use TensorFlow feature columns to create input embeddings.

#### Feature columns

Prior approaches covered pre-training an embedding model as well as training an embedding model in tandem with an LSTM. In both of these cases, we had to write our own embedding matrix and handle its initialization.

We can instead use TensorFlow's feature column API. In particular, the `tf.feature_column.embedding_column` function automatically incorporates an embedding matrix into the model, and that matrix is trained alongside the LSTM.

#### Sequential categorical column

The `embedding_column` function takes two required arguments. The first argument is a "categorical column"; for sequential data, we have to use a sequence categorical column from `tf.contrib.feature_column`, the extended feature column API. Since each word in the vocabulary of the text corpus is converted to a unique integer ID, we can use the `sequence_categorical_column_with_identity` function. It takes two arguments: a string name for the categorical column and the vocabulary size of the text corpus. The string name is used as the key when creating the input dictionary for the main conversion function. The second argument of `embedding_column` is the embedding size, which as a rule of thumb is often set to the 4th root of the vocabulary size. Categorical columns don't contain any data until a computational graph is run (just like `tf.placeholder` objects).

Below is some code that creates an embedding column, `embed_col`, using a categorical column, `input_col`, as input.

```
import tensorflow as tf
vocab_size = 10000
input_col = tf.contrib.feature_column \
              .sequence_categorical_column_with_identity(
                  'input', vocab_size)
embed_size = int(vocab_size**0.25)
embed_col = tf.feature_column.embedding_column(
                  input_col, embed_size)
```
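To get a feel for the 4th-root rule of thumb, the following small sketch computes the embedding size for a few vocabulary sizes. The helper name `embedding_size` and the example vocabulary sizes are hypothetical; the truncation with `int` mirrors the snippet above.

```python
def embedding_size(vocab_size):
    """Fourth root of the vocabulary size, truncated to an integer."""
    return int(vocab_size ** 0.25)


# Hypothetical vocabulary sizes, chosen only to show how slowly the size grows.
for vocab_size in (5000, 50000, 500000):
    print(vocab_size, embedding_size(vocab_size))
```

A hundredfold increase in vocabulary size only roughly triples the embedding size, which keeps the embedding matrix from growing too quickly.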

#### Converting to embeddings

The main conversion function used to create the embedded input sequences is `sequence_input_layer`. It takes two arguments: (1) a dictionary of input data, where each key is the name of a categorical column, and (2) a list of feature columns corresponding to data in the input dictionary. 

Since the input data is just a batch of tokenized sequences, the dictionary will only contain a single key-value pair. The key is the same string set in the `sequence_categorical_column_with_identity` function. The second argument will just be a list containing the embedding column.

Below is an example of `sequence_input_layer` with tokenized sequences ("input_seqs") and embed_col from above:

```
import tensorflow as tf
input_seqs = tf.placeholder(tf.int64, shape=(None, 30)) # thirty time steps
input_dict = {'input': input_seqs}
embed_seqs, sequence_lengths = tf.contrib.feature_column \
                                 .sequence_input_layer(
                                     input_dict, [embed_col])
```

The output of `sequence_input_layer` is a tuple containing the embedded sequences and the sequence lengths. The sequence lengths tensor holds the length of each input sequence, which we need when the batch contains variable-length sequences.
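Conceptually, converting token IDs to embeddings is a row lookup in an embedding matrix. The plain-Python sketch below illustrates that idea only; it is not how `sequence_input_layer` is implemented, and the tiny vocabulary, the embedding size, and the `embed_sequences` helper are assumptions made for the example.

```python
import random

random.seed(0)  # reproducible toy values

vocab_size = 6  # hypothetical tiny vocabulary
embed_size = 2  # hypothetical embedding size

# One row of embed_size floats per vocabulary ID, randomly initialized.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embed_size)]
    for _ in range(vocab_size)
]


def embed_sequences(batch):
    """Replace each token ID with its embedding row and record each sequence length."""
    embedded = [[embedding_matrix[token_id] for token_id in seq] for seq in batch]
    lengths = [len(seq) for seq in batch]
    return embedded, lengths


batch = [[1, 5, 2], [3, 4]]
embed_seqs, sequence_lengths = embed_sequences(batch)
print(sequence_lengths)
print(len(embed_seqs[0]), len(embed_seqs[0][0]))
```

In a trained model, the rows of the matrix are updated by gradient descent rather than left at their random initial values.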



### Bidirectional LSTM

#### Forwards and backwards

When we have access to a completed text sequence (e.g., text classification, as opposed to predicting the next word in a sentence), it might be beneficial to look at the sequence in both the forwards and backwards directions. 

#### Bidirectional LSTM (BiLSTM)

A bidirectional LSTM is just a model that has both a forward LSTM and a backward LSTM (which reads the input sequence in reverse).

In TensorFlow, we can use the `tf.nn.bidirectional_dynamic_rnn` function to create a bidirectional LSTM. It works similarly to `tf.nn.dynamic_rnn`, except it takes in two LSTM cells rather than one. See below for an example.

In [1]:
"""
import tensorflow as tf
cell_fw = tf.nn.rnn_cell.LSTMCell(7)
cell_bw = tf.nn.rnn_cell.LSTMCell(7)

# Embedded input sequences
# Shape: (batch_size, time_steps, embed_dim)
input_embeddings = tf.placeholder(
    tf.float32, shape=(None, 10, 12))
outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
    cell_fw,
    cell_bw,
    input_embeddings,
    dtype=tf.float32)
print(outputs[0])
print(outputs[1])
"""


The `tf.nn.bidirectional_dynamic_rnn` function returns a tuple containing the LSTM outputs and final LSTM states. Since a BiLSTM contains two LSTMs, both `outputs` and `final_states` are tuples. `outputs[0]` represents the outputs of the forward LSTM, while `outputs[1]` represents the outputs of the backward LSTM.
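The forward/backward structure can be sketched without TensorFlow. In this toy example, a running sum stands in for the LSTM cell (an assumption made purely for illustration; a real LSTM has gates and learned weights). The backward outputs are reversed again after the pass so that position `t` in both lists refers to the same input element.

```python
def run_toy_rnn(sequence):
    """Toy recurrent 'cell': the state is a running sum, emitted at every step."""
    state = 0
    outputs = []
    for value in sequence:
        state = state + value
        outputs.append(state)
    return outputs


def run_bidirectional(sequence):
    """Run the toy cell over the sequence in both directions."""
    forward = run_toy_rnn(sequence)
    # Read the input in reverse, then flip the outputs back into input order.
    backward = run_toy_rnn(sequence[::-1])[::-1]
    return forward, backward


outputs = run_bidirectional([1, 2, 3, 4])
print(outputs[0])  # forward outputs
print(outputs[1])  # backward outputs
```

At each position, the forward output summarizes everything up to that element, and the backward output summarizes everything from that element to the end.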

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column

# Text classification model
class ClassificationModel(object):
    # Model initialization
    def __init__(self, vocab_size, max_length, num_lstm_units):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    # Make LSTM cell with dropout
    def make_lstm_cell(self, dropout_keep_prob):
        cell = tf.nn.rnn_cell.LSTMCell(self.num_lstm_units)
        return tf.nn.rnn_cell.DropoutWrapper(cell, output_keep_prob=dropout_keep_prob)

    # Use feature columns to create input embeddings
    def get_input_embeddings(self, input_sequences):
        inputs_column = tf_fc.sequence_categorical_column_with_identity(
            'inputs',
            self.vocab_size)
        embedding_column = tf.feature_column.embedding_column(
            inputs_column,
            int(self.vocab_size**0.25))
        inputs_dict = {'inputs': input_sequences}
        input_embeddings, sequence_lengths = tf_fc.sequence_input_layer(
            inputs_dict,
            [embedding_column])
        return input_embeddings, sequence_lengths
    
    # Create and run a BiLSTM on the input sequences
    def run_bilstm(self, input_sequences, is_training):
        input_embeddings, sequence_lengths = self.get_input_embeddings(input_sequences)
        dropout_keep_prob = 0.5 if is_training else 1.0
        cell_fw = self.make_lstm_cell(dropout_keep_prob)
        cell_bw = self.make_lstm_cell(dropout_keep_prob)
        lstm_outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw,
            cell_bw,
            input_embeddings,
            sequence_length=sequence_lengths,
            dtype=tf.float32)
        return (lstm_outputs, sequence_lengths)
"""

### Logits

How can we calculate logits from a BiLSTM model?

#### Concatenation

The BiLSTM gives us two outputs, the forwards and backwards outputs. To calculate the logits, we first need to combine the two outputs through concatenation. The `tf.concat` function concatenates a list of tensors along a particular dimension.

Below is an example of `tf.concat`:


In [None]:
"""
import tensorflow as tf
# Shape: (2, 2, 3)
t1 = tf.constant([
    [[1, 2, 3], [4, 5, 6]],
    [[0, 4, 8], [3, 2, 2]]
])

# Shape: (1, 2, 3)
t2 = tf.constant([
    [[9, 9, 9], [8, 8, 8]]
])

# Shape: (2, 2, 2)
t3 = tf.constant([
    [[9, 9], [1, 1]],
    [[7, 2], [8, 8]]
])

with tf.Session() as sess:
    o1 = sess.run(tf.concat([t1, t2], 0))
    o2 = sess.run(tf.concat([t1, t3], -1))

print(repr(o1))
print(repr(o2))
"""

"""
OUTPUT:

array([[[1, 2, 3],
        [4, 5, 6]],

       [[0, 4, 8],
        [3, 2, 2]],

       [[9, 9, 9],
        [8, 8, 8]]], dtype=int32)
array([[[1, 2, 3, 9, 9],
        [4, 5, 6, 1, 1]],

       [[0, 4, 8, 7, 2],
        [3, 2, 2, 8, 8]]], dtype=int32)
"""

When concatenating tensors, the tensors need to have exactly the same shape, except along the axis being concatenated.

To concatenate along the final tensor dimension, we can pass -1 as the second argument (the axis).
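The shape rule can be written down as a small checker. This is a plain-Python sketch of the rule, not part of TensorFlow; the `concat_shape` helper is hypothetical and works on shape tuples rather than tensors.

```python
def concat_shape(shape_a, shape_b, axis):
    """Return the shape of concatenating two tensors, or raise ValueError."""
    if len(shape_a) != len(shape_b):
        raise ValueError("tensors must have the same number of dimensions")
    # Map a negative axis such as -1 to its positive index.
    axis = axis % len(shape_a)
    for dim, (a, b) in enumerate(zip(shape_a, shape_b)):
        if dim != axis and a != b:
            raise ValueError("shapes differ on non-concatenation axis %d" % dim)
    result = list(shape_a)
    result[axis] = shape_a[axis] + shape_b[axis]
    return tuple(result)


print(concat_shape((2, 2, 3), (1, 2, 3), 0))   # shapes of t1 and t2 above
print(concat_shape((2, 2, 3), (2, 2, 2), -1))  # shapes of t1 and t3 above
```

Swapping the axes in either call raises a `ValueError`, because the shapes then differ on a dimension that is not being concatenated.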

#### Final time step

When we create an LSTM for classification, we only use the final time step output for each sequence in the data. We want the prediction made after the model has seen every word, unlike the predictions from the last chapter, where we got a prediction after each word because we wanted to complete partial sequences.

So, after combining the forwards and backwards LSTM outputs, we use `tf.gather_nd` to retrieve the values at each sequence's final time step. We then pass those values through a final fully-connected layer to get the logits.

Below, we create a `calculate_logits` function, which calculates logits based on the outputs of the BiLSTM.

The input, `lstm_outputs`, is a tuple containing the outputs of the forwards and backwards LSTMs. We first unpack the tuple into two distinct variables. Then, we concatenate the output values along their final dimension. We calculate the indices of each sequence's final time step and use `tf.gather_nd` to retrieve the outputs at those time steps. Since our task is binary text classification, we use a final fully-connected layer with a single node to obtain the model's logits: the layer aggregates the concatenated output values for each sequence into a single logit.
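The final-time-step lookup can be sketched in plain Python with nested lists standing in for tensors. The toy values and the `gather_final_outputs` helper are assumptions made for illustration; each index pair has the form `[row, final_step]`, where `final_step` is the sequence length minus one.

```python
def get_gather_indices(sequence_lengths):
    """Pair each batch row with the index of its final (unpadded) time step."""
    return [[row, length - 1] for row, length in enumerate(sequence_lengths)]


def gather_final_outputs(combined_outputs, sequence_lengths):
    """Pick out the output vector at each sequence's final time step."""
    indices = get_gather_indices(sequence_lengths)
    return [combined_outputs[row][step] for row, step in indices]


# Toy batch: 2 sequences, 3 time steps, 2 features per step.
# The second sequence has length 2, so its last time step is padding.
combined_outputs = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.1, 1.2], [1.3, 1.4], [0.0, 0.0]],
]
sequence_lengths = [3, 2]

print(get_gather_indices(sequence_lengths))
print(gather_final_outputs(combined_outputs, sequence_lengths))
```

Using the sequence lengths rather than always taking the last column means padded time steps are never selected.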

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column

# Text classification model
class ClassificationModel(object):
    # Model initialization
    def __init__(self, vocab_size, max_length, num_lstm_units):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    def get_gather_indices(self, batch_size, sequence_lengths):
        row_indices = tf.range(batch_size)
        final_indexes = tf.cast(sequence_lengths - 1, tf.int32)
        return tf.transpose([row_indices, final_indexes])

    # Calculate logits based on the outputs of the BiLSTM
    def calculate_logits(self, lstm_outputs, batch_size, sequence_lengths):
        lstm_outputs_fw, lstm_outputs_bw = lstm_outputs
        combined_outputs = tf.concat([lstm_outputs_fw, lstm_outputs_bw], -1)
        gather_indices = self.get_gather_indices(batch_size, sequence_lengths)
        final_outputs = tf.gather_nd(combined_outputs, gather_indices)
        logits = tf.layers.dense(final_outputs, 1)
        return logits
"""

### Loss

We can calculate the model's loss using sigmoid cross entropy, as in the code below.

We first calculate the logits (using the code above). Then, we convert the integer labels into floats and use sigmoid cross entropy for the loss.

The output of the `calculate_loss` function is the overall aggregate loss for the batch, so we sum the individual sequence losses returned by `tf.nn.sigmoid_cross_entropy_with_logits`.

To convert logits into predictions, we apply the sigmoid function to get probabilities and then round to the nearest integer (so, 0 or 1).
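To make the arithmetic concrete, here is a plain-Python sketch of the per-example loss, the summed batch loss, and the rounding step. It uses the numerically stable form `max(x, 0) - x * z + log(1 + exp(-abs(x)))` for a logit `x` and label `z`; the example logits and labels are made up for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sigmoid_cross_entropy(label, logit):
    """Stable form of -z * log(p) - (1 - z) * log(1 - p), with p = sigmoid(x)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# Hypothetical logits and float labels for a batch of three sequences.
logits = [2.0, -1.0, 0.5]
labels = [1.0, 0.0, 1.0]

per_example = [sigmoid_cross_entropy(z, x) for z, x in zip(labels, logits)]
overall_loss = sum(per_example)  # aggregate loss for the batch
preds = [float(round(sigmoid(x))) for x in logits]

print(per_example)
print(overall_loss)
print(preds)
```

Confident correct predictions (such as the logit 2.0 with label 1) contribute a small loss, while logits near zero contribute more.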

In [None]:
"""
import tensorflow as tf
tf_fc = tf.contrib.feature_column

# Text classification model
class ClassificationModel(object):
    # Model initialization
    def __init__(self, vocab_size, max_length, num_lstm_units):
        self.vocab_size = vocab_size
        self.max_length = max_length
        self.num_lstm_units = num_lstm_units
        self.tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=self.vocab_size)

    def get_gather_indices(self, batch_size, sequence_lengths):
        row_indices = tf.range(batch_size)
        final_indexes = tf.cast(sequence_lengths - 1, tf.int32)
        return tf.transpose([row_indices, final_indexes])

    # Calculate logits based on the outputs of the BiLSTM
    def calculate_logits(self, lstm_outputs, batch_size, sequence_lengths):
        lstm_outputs_fw, lstm_outputs_bw = lstm_outputs
        combined_outputs = tf.concat([lstm_outputs_fw, lstm_outputs_bw], -1)
        gather_indices = self.get_gather_indices(batch_size, sequence_lengths)
        final_outputs = tf.gather_nd(combined_outputs, gather_indices)
        logits = tf.layers.dense(final_outputs, 1)
        return logits
    
    # Calculate the loss for the BiLSTM
    def calculate_loss(self, lstm_outputs, batch_size, sequence_lengths, labels):
        logits = self.calculate_logits(lstm_outputs, batch_size, sequence_lengths)
        float_labels = tf.cast(labels, tf.float32)
        batch_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels = float_labels, logits = logits)
        overall_loss = tf.reduce_sum(batch_loss)
        return overall_loss
    
    # Convert logits to predictions
    def logits_to_predictions(self, logits):
        probs = tf.nn.sigmoid(logits)
        preds = tf.round(probs)
        return preds
"""

### Improving Model Performance

To improve BiLSTM performance, we can try adding more LSTM layers or more hidden LSTM units, with the caveat that a larger model is more prone to overfitting.