# 11 Deep learning for text

In [1]:
# Load data and create datasets

!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

# Listing 11.12 Preparing integer sequence datasets
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=1)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


## 11.4 The Transformer architecture

- “Neural attention” can be used to build powerful sequence models that don’t feature any recurrent layers or convolution layers.

### 11.4.1 Understanding self-attention

- Not all input information seen by a model is equally important to the task at hand, so models should “pay more attention” to some features and “pay less attention” to other features.
- A smart embedding space would provide a different vector representation for a word depending on the other words surrounding it.
- The purpose of self-attention is to modulate the representation of a token by using the representations of related tokens in the sequence.
- Steps
  - Step 1 is to compute relevancy scores between the vector for “a word” and every other word in the sentence. These are our “attention scores.”
  - Step 2 is to compute the sum of all word vectors in the sentence, weighted by our relevancy scores. The result is the new representation of the word, which now incorporates its surrounding context.

- NumPy-like pseudocode

  ```py
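  import numpy as np

  def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

  def self_attention(input_sequence):
    # input_sequence: array of shape (sequence_length, embed_dim)
    output = np.zeros(shape=input_sequence.shape)
    for i, pivot_vector in enumerate(input_sequence):
      # Step 1: relevancy scores between the pivot token and every token
      scores = np.zeros(shape=(len(input_sequence),))
      for j, vector in enumerate(input_sequence):
        scores[j] = np.dot(pivot_vector, vector.T)
      # Scale by sqrt(embed_dim), then normalize into attention weights
      scores /= np.sqrt(input_sequence.shape[1])
      scores = softmax(scores)
      # Step 2: sum of all token vectors, weighted by the scores
      new_pivot_representation = np.zeros(shape=pivot_vector.shape)
      for j, vector in enumerate(input_sequence):
        new_pivot_representation += vector * scores[j]
      output[i] = new_pivot_representation
    return output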
  ```

- Keras built-in layer

  ```py
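  from tensorflow.keras.layers import MultiHeadAttention

  num_heads = 4
  embed_dim = 256
  mha_layer = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
  # inputs: tensor of shape (batch_size, sequence_length, embed_dim);
  # passing it three times (query, value, key) makes this self-attention
  outputs = mha_layer(inputs, inputs, inputs)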
  ```
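
The two steps above (score every pair of tokens, then take a weighted sum) can also be written with matrix products instead of explicit loops. The following is a minimal NumPy sketch, not the Keras implementation; the function name `self_attention_vectorized`, the local `softmax` helper, and the toy input sizes are assumptions made for illustration.

```py
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention_vectorized(input_sequence):
    # input_sequence: array of shape (sequence_length, embed_dim)
    embed_dim = input_sequence.shape[1]
    # Step 1: pairwise dot products between all tokens, scaled by sqrt(embed_dim).
    scores = input_sequence @ input_sequence.T / np.sqrt(embed_dim)
    # Normalize each row so the weights for one pivot token sum to 1.
    weights = softmax(scores, axis=-1)
    # Step 2: each output row is a weighted sum of all token vectors.
    return weights @ input_sequence, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))  # toy sequence: 5 tokens, 8 dimensions
output, weights = self_attention_vectorized(tokens)
print(output.shape)          # (5, 8)
print(weights.sum(axis=-1))  # every row sums to 1
```

Row `i` of `weights` holds the attention scores of token `i` over the whole sequence, so the output keeps the input shape while each token vector now mixes in its context.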

#### GENERALIZED SELF-ATTENTION: THE QUERY-KEY-VALUE MODEL

- The Transformer architecture was originally developed for machine translation, where you have to deal with two input sequences: the source sequence and the target sequence.
- A Transformer is a sequence-to-sequence model: it was designed to convert one sequence into another.
- Self-attention mechanism
  - “for each token in inputs (A), compute how much the token is related to every token in inputs (B), and use these scores to weight a sum of tokens from inputs (C).”
  - “for each element in the query, compute how much the element is related to every key, and use these scores to weight a sum of values”
- Transformer-style attention
  - You’ve got a reference sequence that describes something you’re looking for: the query.
  - You’ve got a body of knowledge that you’re trying to extract information from: the values.
  - Each value is assigned a key that describes the value in a format that can be readily compared to a query. You simply match the query to the keys. Then you return a weighted sum of values.
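
The query-key-value description can be sketched directly in NumPy. This is an illustrative sketch under stated assumptions rather than the Keras layer: the function name `attention`, the `source`/`target` arrays, and their sizes are hypothetical, and no learned projections are applied.

```py
import numpy as np

def attention(query, keys, values):
    # query: (num_queries, dim); keys: (num_items, dim); values: (num_items, value_dim)
    # Match every query element against every key.
    scores = query @ keys.T / np.sqrt(keys.shape[-1])
    # Softmax over the keys (max subtracted for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Return a weighted sum of the values for each query element.
    return weights @ values, weights

rng = np.random.default_rng(0)
source = rng.normal(size=(6, 4))  # stand-in for source-sequence token vectors
target = rng.normal(size=(3, 4))  # stand-in for target-sequence token vectors

# The target queries the source; keys and values are the same sequence here.
cross_out, cross_w = attention(target, source, source)
# Self-attention is the special case query = keys = values.
self_out, self_w = attention(source, source, source)
print(cross_out.shape, self_out.shape)  # (3, 4) (6, 4)
```

The output always has one row per query element, which is why a translation model can use the target sequence as the query and the source sequence as keys and values.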

### 11.4.2 Multi-head attention

- “multi-head”
  - The output space of the self-attention layer is factored into a set of independent subspaces, learned separately: the initial query, key, and value are sent through independent sets of dense projections, one set per subspace. Each projected set is processed via neural attention, and the different outputs are concatenated back together into a single output sequence. Each such subspace is called a “head.”
- The presence of the learnable dense projections enables the layer to actually learn something, as opposed to being a purely stateless transformation that would require additional layers before or after it to be useful. In addition, having independent heads helps the layer learn different groups of features for each token, where features within one group are correlated with each other but are mostly independent from features in a different group.
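
To make the projection, per-head attention, and concatenation steps concrete, here is a small NumPy sketch. It is an assumption-laden illustration, not the `MultiHeadAttention` source: random matrices stand in for the learned dense projections, biases are omitted, the final output projection `w_out` and the `params` dictionary layout are choices made for this example, and all sizes are toy values.

```py
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(query, key, value, params):
    head_outputs = []
    for w_q, w_k, w_v in params["heads"]:
        # Each head has its own dense projections of query, key, and value.
        q, k, v = query @ w_q, key @ w_k, value @ w_v
        # Scaled dot-product attention inside the head's subspace.
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        head_outputs.append(weights @ v)
    # Concatenate the heads, then project back to the model dimension.
    return np.concatenate(head_outputs, axis=-1) @ params["w_out"]

rng = np.random.default_rng(0)
embed_dim, num_heads, head_dim, seq_len = 8, 2, 4, 5
params = {
    # One (w_q, w_k, w_v) triple per head; random stand-ins for learned weights.
    "heads": [tuple(rng.normal(size=(embed_dim, head_dim)) for _ in range(3))
              for _ in range(num_heads)],
    "w_out": rng.normal(size=(num_heads * head_dim, embed_dim)),
}
x = rng.normal(size=(seq_len, embed_dim))
out = multi_head_attention(x, x, x, params)
print(out.shape)  # (5, 8)
```

Because every head gets its own projection matrices, each one can attend to a different group of features, which is the independence property described above.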

### 11.4.3 The Transformer encoder

Factoring outputs into multiple independent spaces, adding residual connections, adding normalization layers: all of these are standard architecture patterns that one would be wise to leverage in any complex model. Together, these bells and whistles form the Transformer encoder.

- Transformer architecture
  - a Transformer encoder that processes the source sequence
  - a Transformer decoder that uses the source sequence to generate a translated version

In [2]:
# Listing 11.21 Transformer encoder implemented as a subclassed Layer
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
        [layers.Dense(dense_dim, activation="relu"),
         layers.Dense(embed_dim),])
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()
  def call(self, inputs, mask=None):
    if mask is not None:
      mask = mask[:, tf.newaxis, :]
    attention_output = self.attention(inputs, inputs, attention_mask=mask)
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)
  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim,})
    return config

- Saving custom layers

  ```py
  config = layer.get_config()
  new_layer = layer.__class__.from_config(config)
  ```

  ```py
  layer = PositionalEmbedding(sequence_length, input_dim, output_dim)
  config = layer.get_config()
  new_layer = PositionalEmbedding.from_config(config)
  ```

- Loading a model

  ```py
  model = keras.models.load_model(
    filename, custom_objects={"PositionalEmbedding": PositionalEmbedding}
  )
  ```

- LayerNormalization

  ```py
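  import numpy as np

  def layer_normalization(batch_of_sequences):
    # Input shape: (batch_size, sequence_length, embedding_dim);
    # statistics are computed over the last axis only
    mean = np.mean(batch_of_sequences, keepdims=True, axis=-1)
    variance = np.var(batch_of_sequences, keepdims=True, axis=-1)
    return (batch_of_sequences - mean) / np.sqrt(variance + 1e-6)  # epsilon avoids division by zero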
  ```

- BatchNormalization

  ```py
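  import numpy as np

  def batch_normalization(batch_of_images):
    # Input shape: (batch_size, height, width, channels);
    # statistics are pooled over the batch axis and both spatial axes
    mean = np.mean(batch_of_images, keepdims=True, axis=(0, 1, 2))
    variance = np.var(batch_of_images, keepdims=True, axis=(0, 1, 2))
    return (batch_of_images - mean) / np.sqrt(variance + 1e-6)  # epsilon avoids division by zero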
  ```
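
The practical difference between the two normalizations is the set of axes the statistics are pooled over. The sketch below checks this numerically on a toy batch of sequences. It assumes the common formulation that divides by the standard deviation with a small epsilon, and it leaves out the learned scale and offset and the moving averages that the Keras layers maintain; the function names and tensor sizes are illustrative.

```py
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Statistics over the last axis only, separately for every token.
    mean = x.mean(axis=-1, keepdims=True)
    variance = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

def batch_norm(x, epsilon=1e-6):
    # Statistics pooled over every axis except the last (feature) axis.
    axes = tuple(range(x.ndim - 1))
    mean = x.mean(axis=axes, keepdims=True)
    variance = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(variance + epsilon)

rng = np.random.default_rng(0)
# Toy batch with shape (batch_size, sequence_length, embedding_dim).
batch = rng.normal(loc=3.0, scale=2.0, size=(4, 10, 16))

ln = layer_norm(batch)
bn = batch_norm(batch)
print(ln.mean(axis=-1).shape)      # (4, 10): one mean per token, each close to 0
print(bn.mean(axis=(0, 1)).shape)  # (16,): one mean per feature, each close to 0
```

Layer normalization never mixes information across samples, so its result for one sequence does not depend on what else is in the batch.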

In [3]:
# Listing 11.22 Using the Transformer encoder for text classification
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32
 
inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"])

model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         5120000   
                                                                 
 transformer_encoder (Transf  (None, None, 256)        543776    
 ormerEncoder)                                                   
                                                                 
 global_max_pooling1d (Globa  (None, 256)              0         
 lMaxPooling1D)                                                  
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 257   

In [4]:
# Listing 11.23 Training and evaluating the Transformer encoder based model
callbacks = [
  keras.callbacks.ModelCheckpoint(
      "transformer_encoder.keras",
      save_best_only=True)]

model.fit(int_train_ds, validation_data=int_val_ds, epochs=20,
 callbacks=callbacks)

<keras.callbacks.History at 0x7fc21b466950>

In [5]:
model = keras.models.load_model(
    "transformer_encoder.keras",
    custom_objects={
        "TransformerEncoder": TransformerEncoder})

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

Test acc: 0.876


#### USING POSITIONAL ENCODING TO RE-INJECT ORDER INFORMATION

- To give the model access to word-order information, we add the word’s position in the sentence to each word embedding. Our input word embeddings will have two components: the usual word vector, which represents the word independently of any specific context, and a position vector, which represents the position of the word in the current sentence.
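
The idea of summing a word vector and a position vector can be sketched with two lookup tables in NumPy. This is only an illustration: the random tables stand in for learned embedding weights, and the names `token_table`, `position_table`, `embed`, and the toy sizes are assumptions.

```py
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sequence_length, embed_dim = 50, 10, 4  # toy sizes for illustration

# Random tables stand in for the learned embedding weights.
token_table = rng.normal(size=(vocab_size, embed_dim))
position_table = rng.normal(size=(sequence_length, embed_dim))

def embed(token_ids):
    positions = np.arange(len(token_ids))
    # Word vector (context-independent) plus position vector (depends only on the index).
    return token_table[token_ids] + position_table[positions]

a = embed(np.array([7, 3, 7]))
# The same token (id 7) at positions 0 and 2 gets different vectors.
print(np.allclose(a[0], a[2]))  # False

b = embed(np.array([3, 7, 7]))
# Token 3 at position 1 (in a) differs from token 3 at position 0 (in b).
print(np.allclose(a[1], b[0]))  # False
```

Since the position vector is added before attention, reordering the tokens changes the vectors that the encoder sees, which is what lets an otherwise order-agnostic model use word order.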

In [6]:
# Listing 11.24 Implementing positional embedding as a subclassed layer
class PositionalEmbedding(layers.Layer):
  def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
    super().__init__(**kwargs)
    self.token_embeddings = layers.Embedding(input_dim=input_dim, output_dim=output_dim)
    self.position_embeddings = layers.Embedding(input_dim=sequence_length, output_dim=output_dim)
    self.sequence_length = sequence_length
    self.input_dim = input_dim
    self.output_dim = output_dim
  def call(self, inputs):
    length = tf.shape(inputs)[-1]
    positions = tf.range(start=0, limit=length, delta=1)
    embedded_tokens = self.token_embeddings(inputs)
    embedded_positions = self.position_embeddings(positions)
    return embedded_tokens + embedded_positions 
  def compute_mask(self, inputs, mask=None):
    return tf.math.not_equal(inputs, 0)
  def get_config(self):
    config = super().get_config()
    config.update({
        "output_dim": self.output_dim,
        "sequence_length": self.sequence_length,
        "input_dim": self.input_dim,})
    return config

#### PUTTING IT ALL TOGETHER: A TEXT-CLASSIFICATION TRANSFORMER

In [7]:
# Listing 11.25 Combining the Transformer encoder with positional embedding
vocab_size = 20000
sequence_length = 600
embed_dim = 256
num_heads = 2
dense_dim = 32
 
inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)

model.compile(
    optimizer="rmsprop",
    loss="binary_crossentropy",
    metrics=["accuracy"])
model.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, None)]            0         
                                                                 
 positional_embedding (Posit  (None, None, 256)        5273600   
 ionalEmbedding)                                                 
                                                                 
 transformer_encoder_1 (Tran  (None, None, 256)        543776    
 sformerEncoder)                                                 
                                                                 
 global_max_pooling1d_1 (Glo  (None, 256)              0         
 balMaxPooling1D)                                                
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                           

In [8]:
callbacks = [
  keras.callbacks.ModelCheckpoint(
      "full_transformer_encoder.keras",
      save_best_only=True)
]
model.fit(int_train_ds, validation_data=int_val_ds, epochs=20, callbacks=callbacks)

<keras.callbacks.History at 0x7fc19e1e0a50>

In [9]:
model = keras.models.load_model(
    "full_transformer_encoder.keras",
    custom_objects={
        "TransformerEncoder": TransformerEncoder,
        "PositionalEmbedding": PositionalEmbedding}) 

print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")


Test acc: 0.879


### 11.4.4 When to use sequence models over bag-of-words models

- Pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample.
- If that ratio is small (less than 1,500), then the bag-of-bigrams model will perform better.
- If that ratio is higher than 1,500, then you should go with a sequence model.
- In other words, sequence models work best when lots of training data is available and when each sample is relatively short.
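
The rule of thumb above is simple arithmetic and can be wrapped in a small helper. The function name `recommend_model` and the dataset sizes in the usage lines are hypothetical, chosen only to show both outcomes; a ratio of exactly 1,500 is not covered by the rule, and this sketch arbitrarily treats it as the bag-of-bigrams case.

```py
def recommend_model(num_samples, mean_words_per_sample, threshold=1500):
    # Ratio of training-set size to mean sample length, as in the heuristic above.
    ratio = num_samples / mean_words_per_sample
    # Assumption: a ratio exactly equal to the threshold falls on the bag-of-bigrams side.
    kind = "sequence model" if ratio > threshold else "bag-of-bigrams model"
    return ratio, kind

# Hypothetical dataset sizes, for illustration only.
print(recommend_model(20_000, 250))   # (80.0, 'bag-of-bigrams model')
print(recommend_model(500_000, 40))   # (12500.0, 'sequence model')
```

The heuristic is a starting point for choosing which kind of model to try first, not a guarantee about which one will score higher on a given dataset.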


---