# Homework 11

## 2 Assignment: Transformers

Task: implement a Transformer model (instead of an RNN) that predicts a categorical distribution over possible next tokens, such that sampling from this distribution yields plausible next tokens.
Implement a generative language model built from decoder blocks, so that its autoregressive property can be used to train it on the prediction errors for all tokens in the input sequence.

The model will take a fixed number of input tokens from a text and predict the distribution over the vocabulary for the next token.
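
To make the prediction target concrete, here is a minimal sketch in plain Python (no TensorFlow) of how model outputs can be turned into a categorical distribution and how a next token can be sampled from it. The logits, the toy vocabulary of 5 tokens, and the helper names `softmax` and `sample_next_token` are illustrative assumptions, not part of the assignment code.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, rng=random):
    """Sample one token id from the categorical distribution over the vocabulary."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for a toy vocabulary of 5 tokens.
toy_logits = [2.0, 0.5, -1.0, 0.0, 1.0]
probs = softmax(toy_logits)
next_token = sample_next_token(toy_logits, random.Random(0))
print(probs, next_token)
```

Sampling (rather than always taking the most probable token) is what lets the same prompt produce different continuations.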

## 2.1 Dataset, preprocessing and tokenization


In [1]:
# useful imports 
import tensorflow as tf
import tensorflow_text as tf_txt
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Conv2D, AveragePooling2D, TimeDistributed, LSTM, GlobalAvgPool2D, AbstractRNNCell, MaxPooling2D, RNN
import numpy as np
import matplotlib.pyplot as plt
import re
from collections import defaultdict
import datetime
import tqdm
import sentencepiece as sp
import io

Dataset of choice: Harry Potter Book 1 (downloaded from https://raw.githubusercontent.com/amephraim/nlp/master/texts/J.%20K.%20Rowling%20-%20Harry%20Potter%201%20-%20Sorcerer's%20Stone.txt)


In [2]:
# open the txt file and read its contents (the file is closed automatically)
with open("Harry_Potter_1_Sorcerers_Stone.txt", "r") as hp_raw:
    data = hp_raw.read()

In [3]:
# convert to lower case
data = data.lower()
# delete selected punctuation characters (' . , ; - ! ? % $ "); everything else,
# including whitespace/linebreaks, remains (we keep those for the tokenizer later)
data = re.sub(r"['.,;\-!?%$\"]", "", data)

In [4]:
# test 
data[0:100]

'harry potter and the sorcerers stone\n\n\nchapter one\n\nthe boy who lived\n\nmr and mrs dursley of number '
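
As a quick check of what this preprocessing does and does not remove, the following sketch applies the same character class to a made-up sentence. The sample string and the helper name `clean` are assumptions for illustration only.

```python
import re

# Same character class as in the preprocessing cell above.
PUNCTUATION = r"['.,;\-!?%$\"]"

def clean(text):
    """Lower-case the text and delete the listed punctuation characters."""
    return re.sub(PUNCTUATION, "", text.lower())

# Hypothetical sample sentence, not taken from the book.
sample = "Mr. and Mrs. Dursley, of number four (Privet Drive): 100% normal!"
cleaned = clean(sample)
print(cleaned)
```

Characters that are not in the class, such as parentheses, colons and digits, pass through unchanged, so the cleaned text is not strictly alphanumeric.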

In [5]:
# create new txt file with preprocessed harry potter text for tokenizer
with open("harrypotter.txt", "w") as f:
    f.write(data)

In [6]:
# hyperparameter: vocabulary size
VOCAB_SIZE = 4242

In [7]:
# train tokenizer on preprocessed harry potter text
sp.SentencePieceTrainer.train(
    input='harrypotter.txt', model_prefix='tokenizer_model', model_type="unigram", vocab_size=VOCAB_SIZE)

In [8]:
# read the serialized tokenizer model file as raw bytes
with tf.io.gfile.GFile('tokenizer_model.model', "rb") as model_file:
    trained_tokenizer_model = model_file.read()

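# load the model as a tokenizer that can be used inside a tensorflow model
# (note: nbest_size=-1 enables sampling-based tokenization, so the same text may be
# split differently across calls; nbest_size=0 would tokenize deterministically)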
tokenizer = tf_txt.SentencepieceTokenizer(
    model=trained_tokenizer_model, out_type=tf.int32, nbest_size=-1, alpha=1, reverse=False,
    add_bos=False, add_eos=False, return_nbest=False, name=None
)

In [9]:
# test tokenizer
tokens = tokenizer.tokenize("magic is real")
print(tokens)
print(tokenizer.detokenize(tokens))
# because it's fun
tokens = tokenizer.tokenize("you are a wizard harry")
print(tokens)
print(tokenizer.detokenize(tokens))

tf.Tensor([226  82 980], shape=(3,), dtype=int32)
tf.Tensor(b'magic is real', shape=(), dtype=string)
tf.Tensor([ 15  85   6 280  10], shape=(5,), dtype=int32)
tf.Tensor(b'you are a wizard harry', shape=(), dtype=string)


We want input sequences of m tokens each (m should be between 32 and 256; here m is seq_length). For this we use `tf_text.sliding_window` (imported here as `tf_txt`), passing the tokenized text and the window width m + 1 as arguments: m input tokens plus one target token.

In [10]:
# hyperparameter: sequence length
seq_length = 142

In [11]:
# read harry potter file
with open("harrypotter.txt", "r") as hp:
    data = hp.read()
# tokenize
tokenized_data = tokenizer.tokenize(data)
# get sliding windows of width seq_length + 1 (seq_length input tokens + 1 target token)
sequences = tf_txt.sliding_window(tokenized_data, width=seq_length + 1, axis=-1)

In [12]:
sequences.shape

TensorShape([86302, 143])
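
The windowing step can be sketched in plain Python as follows. This is a simplified stand-in for the library call, assuming stride 1; the helper name `sliding_windows` and the toy token list are hypothetical.

```python
def sliding_windows(tokens, width):
    """Return every contiguous window of `width` tokens, moving one token at a time."""
    return [tokens[i:i + width] for i in range(len(tokens) - width + 1)]

toy_tokens = list(range(10))  # stand-in for the tokenized text
m = 3                         # toy sequence length
windows = sliding_windows(toy_tokens, m + 1)
print(len(windows), windows[0], windows[-1])
```

With N tokens and width m + 1 this yields N - m windows. Under that formula, the shape (86302, 143) above would correspond to 86302 + 142 = 86444 tokens in the tokenized text.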

In [13]:
# create dataset out of sequences
hp_ds = tf.data.Dataset.from_tensor_slices(sequences)

In [14]:
# shape of one datapoint = one sequence
iterator = iter(hp_ds)
iterator.get_next()

<tf.Tensor: shape=(143,), dtype=int32, numpy=
array([  10,  134,    4,    3,  725,    8,  171,  738,  372,   45,    3,
        157,   78, 1274,  159,    4,  294,  239,    9,  653,  341,  771,
        523,   37, 1317,    5,  168,   23,   24,   37, 1939, 1134, 1034,
         15,   79,  173,   24,   37,    3,  153,  132,   15,   41,  758,
          5,   31, 1831,   14,  183,  445,  116, 1467,  155,   24,   73,
         68,  906,   30,  597, 2131,  159,  239,   11,    3, 1436, 1020,
          9,    6, 3045,  302,  672,  790,    8,  148,  203, 1629,    7,
         11,    6,  428, 2597,   46,  297,   30,  576,  192,  676,  424,
        976,    7,  126,   40,    6,   79,  247, 1498,  294,  239,   11,
       1112,    4, 1915,  161,    4,   19,  357,  815,    3,  454,    6,
       3865,    9,  676,  148,  167,   14,   79, 1673,   26,   47, 1071,
         48,  173,    9,   74,  104, 2713,   16,   72, 1458,  452, 2015,
          8,  507, 1980,   21,    3, 1587, 2488, 1793,    8,    3,  257])>

In [15]:
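# out of each sequence of length m+1, the first m tokens are the inputs and the last token is the target
hp_ds = hp_ds.map(lambda seq: tf.split(seq, [seq_length, 1], -1))

In [16]:
# shape of one datapoint = one sequence (input tokens + target token)
iterator = iter(hp_ds)
iterator.get_next()

(<tf.Tensor: shape=(142,), dtype=int32, numpy=
 array([  10,  134,    4,    3,  725,    8,  171,  738,  372,   45,    3,
         157,   78, 1274,  159,    4,  294,  239,    9,  653,  341,  771,
         523,   37, 1317,    5,  168,   23,   24,   37, 1939, 1134, 1034,
          15,   79,  173,   24,   37,    3,  153,  132,   15,   41,  758,
           5,   31, 1831,   14,  183,  445,  116, 1467,  155,   24,   73,
          68,  906,   30,  597, 2131,  159,  239,   11,    3, 1436, 1020,
           9,    6, 3045,  302,  672,  790,    8,  148,  203, 1629,    7,
          11,    6,  428, 2597,   46,  297,   30,  576,  192,  676,  424,
         976,    7,  126,   40,    6,   79,  247, 1498,  294,  239,   11,
        1112,    4, 1915,  161,    4,   19,  357,  815,    3,  454,    6,
        3865,    9,  676,  148,  167,   14,   79, 1673,   26,   47, 1071,
          48,  173,    9,   74,  104, 2713,   16,   72, 1458,  452, 2015,
           8,  507, 1980,   21,    3, 1587, 2488, 1793,    8,    3])>,
 <tf.Tensor: shape=(1,), dtype=int32, numpy=array([257])>)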
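
The input/target split for a single window can be sketched in plain Python like this. The helper name `split_input_target` and the toy token ids are hypothetical; the sketch assumes stride-1 windows of length m + 1.

```python
def split_input_target(window, seq_length):
    """Split one window of seq_length + 1 tokens into (inputs, target)."""
    return window[:seq_length], window[seq_length:]

# Hypothetical token ids: three consecutive stride-1 windows with m = 4.
toy_windows = [
    [10, 11, 12, 13, 14],
    [11, 12, 13, 14, 15],
    [12, 13, 14, 15, 16],
]
pairs = [split_input_target(w, seq_length=4) for w in toy_windows]
for inputs, target in pairs:
    print(inputs, "->", target)
```

Because the windows move one token at a time, the target of one window is the last input token of the next window.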

In [17]:
# hyperparameter: batch size
BATCH_SIZE = 32

In [18]:
# shuffle, batch, prefetch
hp_ds = hp_ds.cache().shuffle(1000).batch(BATCH_SIZE).prefetch(20)
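
To see what a shuffle buffer of limited size does, here is a simplified plain-Python simulation of buffer-based shuffling: fill a buffer, repeatedly emit a random element from it, and replace that element with the next incoming one. It follows the documented idea behind buffered shuffling in tf.data but is not TensorFlow's implementation; the function name `buffered_shuffle` and the toy sizes are assumptions.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Shuffle a stream using only a fixed-size buffer."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its place.
        idx = rng.randrange(buffer_size)
        out.append(buffer[idx])
        buffer[idx] = item
    # Input exhausted: emit the remaining buffered elements in random order.
    rng.shuffle(buffer)
    out.extend(buffer)
    return out

n, buffer_size = 50, 5
shuffled = buffered_shuffle(range(n), buffer_size)
# How far ahead of its output position each element originally was.
max_lookahead = max(value - position for position, value in enumerate(shuffled))
print(shuffled[:10], max_lookahead)
```

An element can never appear more than `buffer_size - 1` positions earlier than its original position, so the shuffle is only local. Since neighbouring windows here overlap in all but one token, a buffer that is small relative to the dataset may leave the sequences within a batch strongly correlated; a larger buffer mixes more at the cost of memory.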

## 2.2 The Model Components