In [1]:
faqs = """
 Multi-headed attention
 In our simple example, we only used the embeddings “as is” to compute the attention
 scores and weights, but that’s far from the whole story. In practice, the self-attention
 layer applies three independent linear transformations to each embedding to generate
 the query, key, and value vectors. These transformations project the embeddings and
 each projection carries its own set of learnable parameters, which allows the self
attention layer to focus on different semantic aspects of the sequence.
 It also turns out to be beneficial to have multiple sets of linear projections, each one
 representing a so-called attention head. The resulting multi-head attention layer is
 illustrated in Figure 3-5. But why do we need more than one attention head? The rea
son is that the softmax of one head tends to focus on mostly one aspect of similarity.
 Having several heads allows the model to focus on several aspects at once. For
 instance, one head can focus on subject-verb interaction, whereas another finds
 nearby adjectives. Obviously we don’t handcraft these relations into the model, and
 they are fully learned from the data. If you are familiar with computer vision models
 you might see the resemblance to filters in convolutional neural networks, where one
 filter can be responsible for detecting faces and another one finds wheels of cars in
 images.
 The Encoder 
| 
67
Figure 3-5. Multi-head attention
 Let’s implement this layer by first coding up a single attention head:
 class AttentionHead(nn.Module):
 def __init__(self, embed_dim, head_dim):
 super().__init__()
 self.q = nn.Linear(embed_dim, head_dim)
 self.k = nn.Linear(embed_dim, head_dim)
 self.v = nn.Linear(embed_dim, head_dim)
 def forward(self, hidden_state):
 attn_outputs = scaled_dot_product_attention(
 self.q(hidden_state), self.k(hidden_state), self.v(hidden_state))
 return attn_outputs
 Here we’ve initialized three independent linear layers that apply matrix multiplication
 to the embedding vectors to produce tensors of shape [batch_size, seq_len,
 head_dim], where head_dim is the number of dimensions we are projecting into.
 Although head_dim does not have to be smaller than the number of embedding
 dimensions of the tokens (embed_dim), in practice it is chosen to be a multiple of
 embed_dim so that the computation across each head is constant. For example, BERT
 has 12 attention heads, so the dimension of each head is 768/12 = 64.
 Now that we have a single attention head, we can concatenate the outputs of each one
 to implement the full multi-head attention layer:
 class MultiHeadAttention(nn.Module):
 def __init__(self, config):
 super().__init__()
 embed_dim = config.hidden_size
 num_heads = config.num_attention_heads
 head_dim = embed_dim // num_heads
 self.heads = nn.ModuleList(
 [AttentionHead(embed_dim, head_dim) for _ in range(num_heads)]
 )
 self.output_linear = nn.Linear(embed_dim, embed_dim)
 68 
| 
Chapter 3: Transformer Anatomy
def forward(self, hidden_state):
 x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)
 x = self.output_linear(x)
 return x
 Notice that the concatenated output from the attention heads is also fed through a
 final linear layer to produce an output tensor of shape [batch_size, seq_len,
 hidden_dim] that is suitable for the feed-forward network downstream. To confirm,
 let’s see if the multi-head attention layer produces the expected shape of our inputs.
 We pass the configuration we loaded earlier from the pretrained BERT model when
 initializing the MultiHeadAttention module. This ensures that we use the same set
tings as BERT:
 multihead_attn = MultiHeadAttention(config)
 attn_output = multihead_attn(inputs_embeds)
 attn_output.size()
 torch.Size([1, 5, 768])
 It works! To wrap up this section on attention, let’s use BertViz again to visualize the
 attention for two different uses of the word “flies”. Here we can use the head_view()
 function from BertViz by computing the attentions of a pretrained checkpoint and
 indicating where the sentence boundary lies:

 from bertviz import head_view
 from transformers import AutoModel
 model = AutoModel.from_pretrained(model_ckpt, output_attentions=True)
 sentence_a = "time flies like an arrow"
 sentence_b = "fruit flies like a banana"
 viz_inputs = tokenizer(sentence_a, sentence_b, return_tensors='pt')
 attention = model(**viz_inputs).attentions
 sentence_b_start = (viz_inputs.token_type_ids == 0).sum(dim=1)
 tokens = tokenizer.convert_ids_to_tokens(viz_inputs.input_ids[0])
 head_view(attention, tokens, sentence_b_start, heads=[8])
 
 The Encoder 
| 
69
This visualization shows the attention weights as lines connecting the token whose
 embedding is getting updated (left) with every word that is being attended to (right).
 The intensity of the lines indicates the strength of the attention weights, with dark
 lines representing values close to 1, and faint lines representing values close to 0.
 In this example, the input consists of two sentences and the [CLS] and [SEP] tokens
 are the special tokens in BERT’s tokenizer that we encountered in Chapter 2. One
 thing we can see from the visualization is that the attention weights are strongest
 between words that belong to the same sentence, which suggests BERT can tell that it
 should attend to words in the same sentence. However, for the word “flies” we can see
 that BERT has identified “arrow” as important in the first sentence and “fruit” and
 “banana” in the second. These attention weights allow the model to distinguish the
 use of “flies” as a verb or noun, depending on the context in which it occurs!
 Now that we’ve covered attention, let’s take a look at implementing the missing piece
 of the encoder layer: position-wise feed-forward networks.
"""

In [2]:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
tokenizer = Tokenizer()

In [4]:
tokenizer.fit_on_texts([faqs])

In [5]:
len(tokenizer.word_index)

354

In [6]:
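input_sequences = []
for sentence in faqs.split('\n'):
  # Convert one line of the corpus to a list of token ids
  tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]

  # Every prefix of at least two tokens becomes one training example:
  # the last token is the target, the tokens before it are the input
  for i in range(1,len(tokenized_sentence)):
    input_sequences.append(tokenized_sentence[:i+1])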
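
The prefix expansion in the loop above can be sketched in isolation. This is a minimal pure-Python illustration, not part of the original notebook; the helper name `prefix_sequences` is hypothetical, and the ids 33, 150, 2 are taken from the first rows of the printed `input_sequences`.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2 of a list of token ids."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

# A three-token line yields two training examples
example = prefix_sequences([33, 150, 2])
print(example)  # [[33, 150], [33, 150, 2]]

# Lines with fewer than two tokens contribute nothing
print(prefix_sequences([7]))  # []
```

A line with n tokens therefore contributes n - 1 examples, which is why empty and single-word lines in the corpus are silently skipped.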

In [7]:
input_sequences

[[33, 150],
 [33, 150, 2],
 [8, 83],
 [8, 83, 151],
 [8, 83, 151, 57],
 [8, 83, 151, 57, 10],
 [8, 83, 151, 57, 10, 152],
 [8, 83, 151, 57, 10, 152, 153],
 [8, 83, 151, 57, 10, 152, 153, 1],
 [8, 83, 151, 57, 10, 152, 153, 1, 84],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154, 155],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154, 155, 3],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154, 155, 3, 156],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154, 155, 3, 156, 1],
 [8, 83, 151, 57, 10, 152, 153, 1, 84, 154, 155, 3, 156, 1, 2],
 [157, 11],
 [157, 11, 34],
 [157, 11, 34, 85],
 [157, 11, 34, 85, 158],
 [157, 11, 34, 85, 158, 159],
 [157, 11, 34, 85, 158, 159, 18],
 [157, 11, 34, 85, 158, 159, 18, 1],
 [157, 11, 34, 85, 158, 159, 18, 1, 160],
 [157, 11, 34, 85, 158, 159, 18, 1, 160, 161],
 [157, 11, 34, 85, 158, 159, 18, 1, 160, 161, 8],
 [157, 11, 34, 85, 158, 159, 18, 1, 160, 161, 8, 86],
 [157, 11, 34, 85, 158, 159, 18, 1, 160, 161, 8, 86, 1],
 [157, 11, 34,

In [8]:
max_len = max(len(x) for x in input_sequences)

In [9]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen = max_len, padding='pre')

In [10]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,  33, 150],
       [  0,   0,   0, ...,  33, 150,   2],
       [  0,   0,   0, ...,   0,   8,  83],
       ...,
       [  0,   0,   0, ..., 353, 354, 136],
       [  0,   0,   0, ..., 354, 136,  47],
       [  0,   0,   0, ..., 136,  47, 107]])
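
To make the effect of `padding='pre'` concrete, here is a small pure-Python sketch of left-padding. It is an illustration only, not Keras's implementation; the helper `pad_pre` is a hypothetical name, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # if too long, keep only the most recent tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

rows = pad_pre([[33, 150], [33, 150, 2]], maxlen=5)
print(rows)  # [[0, 0, 0, 33, 150], [0, 0, 33, 150, 2]]
```

Padding on the left keeps the final token of every row in the last column, which is what makes it possible to slice off that column as the prediction target.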

In [11]:
X = padded_input_sequences[:,:-1]

In [12]:
y = padded_input_sequences[:,-1]

In [13]:
X.shape

(840, 16)

In [14]:
input_length_v = X.shape[1]

In [15]:
y.shape

(840,)

In [16]:
v=len(tokenizer.word_index)

In [17]:
from tensorflow.keras.utils import to_categorical
y = to_categorical(y,num_classes=v+1)

In [18]:
y.shape

(840, 355)
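
The split into inputs and one-hot targets can be sketched without TensorFlow. This is a hedged illustration rather than the notebook's code path: `split_and_one_hot` is a hypothetical helper and the toy rows are made up. It shows why the number of classes is the vocabulary size plus one: id 0 is reserved for padding, so word ids run from 1 to `v`.

```python
def split_and_one_hot(padded_rows, num_classes):
    """Split each row into (all but last token, one-hot of last token)."""
    inputs = [row[:-1] for row in padded_rows]
    labels = [row[-1] for row in padded_rows]
    one_hot = [
        [1.0 if j == label else 0.0 for j in range(num_classes)]
        for label in labels
    ]
    return inputs, one_hot

# Toy data: a vocabulary of 3 words (ids 1..3) plus the padding id 0
rows = [[0, 0, 1, 2], [0, 1, 2, 3]]
X_demo, y_demo = split_and_one_hot(rows, num_classes=3 + 1)
print(X_demo)  # [[0, 0, 1], [0, 1, 2]]
print(y_demo)  # [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```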

In [19]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [20]:
model = Sequential()
model.add(Embedding(v+1, 100, input_length=input_length_v))
model.add(LSTM(150, return_sequences=True))
model.add(LSTM(150))
model.add(Dense(v+1, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [22]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 16, 100)           35500     
                                                                 
 lstm (LSTM)                 (None, 16, 150)           150600    
                                                                 
 lstm_1 (LSTM)               (None, 150)               180600    
                                                                 
 dense (Dense)               (None, 355)               53605     
                                                                 
Total params: 420,305
Trainable params: 420,305
Non-trainable params: 0
_________________________________________________________________
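
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (with biases); the helper names are hypothetical and the sizes are read off the notebook (vocabulary of 354 words plus padding, embedding width 100, 150 LSTM units).

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab = 354 + 1  # word ids 1..354 plus the padding id 0
counts = [
    embedding_params(vocab, 100),
    lstm_params(100, 150),
    lstm_params(150, 150),
    dense_params(150, vocab),
]
print(counts)       # [35500, 150600, 180600, 53605]
print(sum(counts))  # 420305
```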


In [23]:
model.fit(X,y,epochs=100)

Epoch 1/100
Epoch 2/100
...

<keras.callbacks.History at 0x2d51e51ddf0>

In [28]:
import time
import numpy as np

text = "In"
for _ in range(10):
  # Convert the running text to token ids
  token_text = tokenizer.texts_to_sequences([text])[0]
  # Left-pad to the input length the model was trained on
  padded_token_text = pad_sequences([token_text], maxlen=input_length_v, padding='pre')
  # Greedy decoding: take the id with the highest predicted probability
  pos = int(np.argmax(model.predict(padded_token_text)))

  # Map the id back to its word; id 0 is padding and has no entry
  word = tokenizer.index_word.get(pos)
  if word is not None:
    text = text + " " + word
    print(text)
    time.sleep(2)

In our
In our simple
In our simple example
In our simple example we
In our simple example we only
In our simple example we only used
In our simple example we only used the
In our simple example we only used the embeddings
In our simple example we only used the embeddings “as
In our simple example we only used the embeddings “as is”
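
The generation loop is a form of greedy decoding: at each step the most probable next token is appended and the extended sequence is fed back in. Below is a minimal sketch of that idea with the model replaced by a stand-in. `greedy_generate` and `toy_probs` are hypothetical names, and `toy_probs` is a made-up distribution, not the trained LSTM. Unlike the notebook loop, this sketch stops early if the padding id 0 is predicted.

```python
def greedy_generate(seed_ids, next_token_probs, steps):
    """Repeatedly append the argmax token id predicted for the sequence."""
    ids = list(seed_ids)
    for _ in range(steps):
        probs = next_token_probs(ids)
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == 0:  # padding id: nothing sensible to append
            break
        ids.append(next_id)
    return ids

def toy_probs(ids):
    # Stand-in for model.predict: puts all mass on the id that
    # follows the last one, cycling through 1 -> 2 -> 3 -> 1
    probs = [0.0, 0.0, 0.0, 0.0]
    probs[ids[-1] % 3 + 1] = 1.0
    return probs

print(greedy_generate([1], toy_probs, steps=4))  # [1, 2, 3, 1, 2]
```

Because the argmax is deterministic, the same seed always yields the same continuation. Since the model was fit for many epochs on a small corpus, that continuation simply reproduces the training text, as the output above shows.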


In [27]:
tokenizer.word_index

{'the': 1,
 'attention': 2,
 'to': 3,
 'head': 4,
 'dim': 5,
 'of': 6,
 'self': 7,
 'in': 8,
 'that': 9,
 'we': 10,
 'and': 11,
 'a': 12,
 'is': 13,
 'embed': 14,
 'linear': 15,
 'heads': 16,
 'sentence': 17,
 'from': 18,
 'one': 19,
 'layer': 20,
 'for': 21,
 'hidden': 22,
 'model': 23,
 'can': 24,
 'nn': 25,
 'output': 26,
 'each': 27,
 'on': 28,
 'state': 29,
 'attn': 30,
 'tokens': 31,
 'inputs': 32,
 'multi': 33,
 'weights': 34,
 'it': 35,
 'are': 36,
 'this': 37,
 'size': 38,
 'bert': 39,
 'embedding': 40,
 'focus': 41,
 'be': 42,
 'see': 43,
 'let’s': 44,
 'def': 45,
 'init': 46,
 'forward': 47,
 'config': 48,
 'num': 49,
 'x': 50,
 '1': 51,
 'use': 52,
 'as': 53,
 'b': 54,
 'viz': 55,
 'lines': 56,
 'example': 57,
 'these': 58,
 'which': 59,
 'have': 60,
 'representing': 61,
 'so': 62,
 '3': 63,
 '5': 64,
 'with': 65,
 'where': 66,
 'encoder': 67,
 'module': 68,
 'outputs': 69,
 'return': 70,
 'shape': 71,
 'multiheadattention': 72,
 'pretrained': 73,
 'same': 74,
 'bertviz': 7