<a href="https://colab.research.google.com/github/kimgeonhee317/nlpdemystifed-notes/blob/main/notebook/13_Recurrent_Neural_Networks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 13_Recurrent Neural Networks

### Import Library

In [2]:
import nltk
import numpy as np
import requests
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import ModelCheckpoint
from nltk.corpus import treebank, brown, conll2000
from sklearn.model_selection import train_test_split

## Part-of-Speech Tagger with Bidirectional LSTM

In [None]:
# PoS tagging with an LSTM is a multiclass classification task performed at every step of a sequence.

# NLTK provides several free, labelled corpora.
# See https://www.nltk.org/nltk_data
nltk.download('treebank')
nltk.download('brown')
nltk.download('conll2000')

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

In [None]:
# Download tagset
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


True

In [None]:
# Load all PoS-tagged sentences and combine them into one list.
tagged_sentences = treebank.tagged_sents(tagset='universal')+\
                   brown.tagged_sents(tagset='universal')+\
                   conll2000.tagged_sents(tagset='universal')
print(tagged_sentences[0])
print(f"Dataset size: {len(tagged_sentences)}")

[('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ('61', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ'), (',', '.'), ('will', 'VERB'), ('join', 'VERB'), ('the', 'DET'), ('board', 'NOUN'), ('as', 'ADP'), ('a', 'DET'), ('nonexecutive', 'ADJ'), ('director', 'NOUN'), ('Nov.', 'NOUN'), ('29', 'NUM'), ('.', '.')]
Dataset size: 72202


In [None]:
sentences, sentence_tags = [], []

for s in tagged_sentences:
  sentence, tags = zip(*s) # unzip the (word, tag) pairs into a tuple of words and a tuple of tags
  sentences.append(list(sentence))
  sentence_tags.append(list(tags))

In [None]:
print(sentences[0])
print(sentence_tags[0])

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
['NOUN', 'NOUN', '.', 'NUM', 'NOUN', 'ADJ', '.', 'VERB', 'VERB', 'DET', 'NOUN', 'ADP', 'DET', 'ADJ', 'NOUN', 'NOUN', 'NUM', '.']


In [None]:
print(len(sentences), len(sentence_tags)) # number of sentences

72202 72202


In [None]:
# split dataset according to the proportion below
train_ratio = 0.75
validation_ratio = 0.15
test_ratio=0.10

# train:test = 0.75:0.25
x_train, x_test, y_train, y_test = train_test_split(sentences, sentence_tags,
                                                     test_size= 1-train_ratio,
                                                     random_state = 317)
# train:val:test = 0.75:0.15:0.10
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test,
                                                test_size = test_ratio/(test_ratio + validation_ratio),
                                                random_state = 317)

In [None]:
print(len(x_train), len(y_train))
print(len(x_val), len(y_val))
print(len(x_test), len(y_test))

54151 54151
10830 10830
7221 7221


In [None]:
# Build an integer vocabulary for our sentences.
# Default tokenizer; out-of-vocabulary words are mapped to <OOV>.
sentence_tokenizer = keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
sentence_tokenizer.fit_on_texts(x_train)
print(f"Vocabulary size: {len(sentence_tokenizer.word_index)}")

Vocabulary size: 52183


In [None]:
# The tags are sequences too, so they need their own tokenizer.
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(y_train)

In [None]:
print(f"Number of PoS tags: {len(tag_tokenizer.word_index)}\n")
tag_tokenizer.get_config()

Number of PoS tags: 12



{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': False,
 'oov_token': None,
 'document_count': 54151,
 'word_counts': '{"adv": 51392, "verb": 175631, "adp": 137365, "det": 127600, ".": 143593, "adj": 81110, "noun": 288313, "conj": 35420, "num": 21374, "prt": 31340, "pron": 44737, "x": 6109}',
 'word_docs': '{"conj": 24581, "det": 44815, "verb": 50880, "num": 11905, "adj": 36440, "noun": 51202, "adp": 43937, "adv": 29599, ".": 53331, "prt": 21888, "pron": 26974, "x": 2668}',
 'index_docs': '{"9": 24581, "5": 44815, "2": 50880, "11": 11905, "6": 36440, "1": 51202, "4": 43937, "7": 29599, "3": 53331, "10": 21888, "8": 26974, "12": 2668}',
 'index_word': '{"1": "noun", "2": "verb", "3": ".", "4": "adp", "5": "det", "6": "adj", "7": "adv", "8": "pron", "9": "conj", "10": "prt", "11": "num", "12": "x"}',
 'word_index': '{"noun": 1, "verb": 2, ".": 3, "adp": 4, "det": 5, "adj": 6, "adv": 7, "pron": 8, "conj": 9, "prt": 10, "

In [None]:
tag_tokenizer.word_index

{'noun': 1,
 'verb': 2,
 '.': 3,
 'adp': 4,
 'det': 5,
 'adj': 6,
 'adv': 7,
 'pron': 8,
 'conj': 9,
 'prt': 10,
 'num': 11,
 'x': 12}
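The indices above follow frequency rank: the most frequent tag gets index 1, and index 0 stays free for padding. Here is a minimal plain-Python sketch of that idea. `build_word_index` and the toy tag lists are hypothetical, and tie-breaking may differ from what the Keras tokenizer does.

```python
from collections import Counter

def build_word_index(sequences):
    """Map each lowercased token to its frequency rank, starting at 1."""
    counts = Counter(token.lower() for seq in sequences for token in seq)
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {token: rank for rank, (token, _) in enumerate(ranked, start=1)}

# Toy tag sequences (illustrative only, not taken from the corpus).
toy_tags = [["NOUN", "VERB", "NOUN"], ["DET", "NOUN", "VERB"]]
word_index = build_word_index(toy_tags)
print(word_index)  # {'noun': 1, 'verb': 2, 'det': 3}
```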

In [None]:
# Convert text to integer sequences with the fitted tokenizer.
x_train_seqs = sentence_tokenizer.texts_to_sequences(x_train)

In [None]:
print(x_train_seqs[0])
print(x_train[0])

[133, 1921, 19, 8, 13, 1606, 461, 15, 4344, 318, 12, 8, 1922, 28157, 1858, 25, 12287, 28158, 926, 926, 20609, 7, 6603, 6, 20610, 1568, 9, 3198, 39, 13, 145, 6, 1742, 23, 834, 295, 3038, 311, 28159, 32, 28160, 1284, 3, 11, 207, 20609, 1742, 429, 2072, 12288, 15, 4]
['Still', 'existing', 'on', 'a', '``', 'Northern', 'Union', "''", 'telegraph', 'form', 'is', 'a', 'typical', 'peremptory', 'message', 'from', 'Peru', 'grocer', 'J.', 'J.', 'Hapgood', 'to', 'Burton', 'and', "Graves'", 'store', 'in', 'Manchester', '--', '``', 'Get', 'and', 'send', 'by', 'stage', 'four', 'pounds', 'best', 'Porterhouse', 'or', 'serloin', 'stake', ',', 'for', 'Mrs.', 'Hapgood', 'send', 'six', 'sweet', 'oranges', "''", '.']


In [None]:
y_train_seqs = tag_tokenizer.texts_to_sequences(y_train)

In [None]:
print(tag_tokenizer.sequences_to_texts([y_train_seqs[0]]))
print(y_train_seqs[0])

['adv verb adp det . adj noun . noun noun verb det adj adj noun adp noun noun noun noun noun adp noun conj noun noun adp noun . . verb conj verb adp noun num noun adj noun conj noun noun . adp noun noun verb num adj noun . .']
[7, 2, 4, 5, 3, 6, 1, 3, 1, 1, 2, 5, 6, 6, 1, 4, 1, 1, 1, 1, 1, 4, 1, 9, 1, 1, 4, 1, 3, 3, 2, 9, 2, 4, 1, 11, 1, 6, 1, 9, 1, 1, 3, 4, 1, 1, 2, 11, 6, 1, 3, 3]


In [None]:
# Do the same for the validation set.
x_val_seqs = sentence_tokenizer.texts_to_sequences(x_val)
y_val_seqs = tag_tokenizer.texts_to_sequences(y_val)

In [None]:
# Although an RNN can handle variable-length sequences, padding every sequence to the same length is much better for performance.
print(len(max(x_train_seqs, key=len))) # length of the longest sequence
MAX_LENGTH = len(max(x_train_seqs, key=len))
print(f"Length of longest input sequence: {MAX_LENGTH}")

271
Length of longest input sequence: 271


In [None]:
# We can pad every sentence with the pad_sequences function from Keras.
x_train_padded = keras.preprocessing.sequence.pad_sequences(x_train_seqs, padding='post',
                                                            maxlen=MAX_LENGTH)


In [None]:
print(x_train_padded[0])
print(len(x_train_padded[0]))

[  133  1921    19     8    13  1606   461    15  4344   318    12     8
  1922 28157  1858    25 12287 28158   926   926 20609     7  6603     6
 20610  1568     9  3198    39    13   145     6  1742    23   834   295
  3038   311 28159    32 28160  1284     3    11   207 20609  1742   429
  2072 12288    15     4     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   

In [None]:
# Do the same for the training labels.
y_train_padded = keras.preprocessing.sequence.pad_sequences(y_train_seqs, padding='post',
                                                           maxlen=MAX_LENGTH)

In [None]:
x_val_padded = keras.preprocessing.sequence.pad_sequences(x_val_seqs, padding='post', maxlen=MAX_LENGTH)
y_val_padded = keras.preprocessing.sequence.pad_sequences(y_val_seqs, padding='post', maxlen=MAX_LENGTH)
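To make the padding step concrete, here is a small plain-Python sketch of post-padding. `pad_post` is a hypothetical helper, not the Keras implementation. It assumes the Keras default of truncating from the front when a sequence is longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen`."""
    padded = []
    for seq in sequences:
        # Assumption: like the Keras default (truncating='pre'),
        # keep only the last `maxlen` items of an over-long sequence.
        trimmed = list(seq[-maxlen:])
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

toy_padded = pad_post([[5, 3], [7, 8, 9, 4]], maxlen=3)
print(toy_padded)  # [[5, 3, 0], [8, 9, 4]]
```

This matters for the validation and test sets: any sentence longer than the longest training sentence is cut down to `MAX_LENGTH`.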

In [None]:
# As PoS tagging is a multiclass classification task done at each timestep,
# we need to convert every tag for every sentence into one-hot encoding.
y_train_categoricals = keras.utils.to_categorical(y_train_padded)
print(y_train_categoricals[0]) # sequence is now composed of one-hot encodings

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [1. 0. 0. ... 0. 0. 0.]]


In [None]:
# one-hot encoding for a single tag in a sequence
print(y_train_categoricals[0][0])

[0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]


In [None]:
# We can recover the PoS tag from a one-hot encoding by looking it up in the tag_tokenizer's index_word dictionary.
idx = np.argmax(y_train_categoricals[0][0]) # argmax returns the index of the maximum element in the one-hot array
print(f"Index: {idx}")
print(f"Tag: {tag_tokenizer.index_word[idx]}")

Index: 7
Tag: adv
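The round trip between a tag index and its one-hot vector can be sketched without Keras. The helpers below are hypothetical stand-ins for `to_categorical` and `np.argmax`. The 13 classes are the 12 tags plus index 0 for padding.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def argmax(vec):
    """Return the position of the largest value."""
    return max(range(len(vec)), key=vec.__getitem__)

encoded = one_hot(7, num_classes=13)  # 7 is the index of 'adv' in the tag vocabulary
decoded = argmax(encoded)
print(encoded)
print(decoded)  # 7
```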


In [None]:
# One-hot encode the validation labels.
y_val_categoricals = keras.utils.to_categorical(y_val_padded)

[notes]
At this point, we're ready to build our model. We'll train word embeddings concurrently with our model (though you can use pretrained word vectors as well).

[notes]
1. Ignore padding values:
The embedding layer has a *mask_zero* parameter. We added padding to make all sequences the same length, but we don't want to make PoS predictions on padding. Setting *mask_zero* to *True* makes the layers following the embedding layer ignore padding values.

2. Return a sequence, not a single output:
We're using a *bidirectional LSTM*. The Bidirectional layer is a wrapper to which we pass an LSTM layer. The first parameter to the LSTM layer is the number of units in the cell. The second parameter, *return_sequences*, controls whether the RNN returns an output for each timestep or only the last output. Since we're doing PoS tagging, we want an output for each timestep, so *return_sequences* is set to *True*.
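To see why padding should be ignored in a per-timestep tagger, compare accuracy with and without the padded positions. This is a plain-Python illustration with made-up labels and predictions. It is not how Keras applies masks internally, and `masked_accuracy` is a hypothetical helper.

```python
def masked_accuracy(y_true, y_pred, pad_id=0):
    """Accuracy over real timesteps only; positions labelled `pad_id` are skipped."""
    correct = total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for t, p in zip(true_seq, pred_seq):
            if t == pad_id:
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# Made-up padded label and prediction sequences (0 = padding).
y_true = [[1, 2, 3, 0, 0], [5, 1, 0, 0, 0]]
y_pred = [[1, 2, 4, 0, 0], [5, 1, 0, 0, 0]]

flat_true = [t for seq in y_true for t in seq]
flat_pred = [p for seq in y_pred for p in seq]
naive = sum(t == p for t, p in zip(flat_true, flat_pred)) / len(flat_true)
masked = masked_accuracy(y_true, y_pred)

print(naive)   # 0.9
print(masked)  # 0.8
```

Counting the padded positions inflates the score, and the effect grows when most sentences are far shorter than the 271-token maximum.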

In [None]:
# For the embedding layer. "+1" to account for the padding token.
num_tokens = len(sentence_tokenizer.word_index) + 1 # +1 for padding token
embedding_dim = 128

# For the output layer, The number of classes corresponds to the number of possible tags
num_classes = len(tag_tokenizer.word_index) + 1 # also +1 for padding token

In [None]:
# We set tf.random.set_seed and seed each kernel_initializer so the results are reproducible.
tf.random.set_seed(317)

model = keras.Sequential()

# input layer (embedding layer: each token -> vector of size embedding_dim)
model.add(layers.Embedding(input_dim = num_tokens,
                           output_dim = embedding_dim,
                           input_length = MAX_LENGTH,
                           mask_zero=True))

# hidden layer (bidirectional)
model.add(layers.Bidirectional(layers.LSTM(128, return_sequences=True,
                                           kernel_initializer=tf.keras.initializers.random_normal(seed=317))))

# output layer for each timestep with softmax activation function
model.add(layers.Dense(num_classes, activation='softmax',
                       kernel_initializer=tf.keras.initializers.random_normal(seed=317)))


model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])



In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 271, 128)          6679552   
                                                                 
 bidirectional (Bidirectiona  (None, 271, 256)         263168    
 l)                                                              
                                                                 
 dense (Dense)               (None, 271, 13)           3341      
                                                                 
Total params: 6,946,061
Trainable params: 6,946,061
Non-trainable params: 0
_________________________________________________________________


[notes] \
1. The embedding layer output has three dimensions
- Batch size : None => we haven't specified it yet
- Sequence length : 271
- Embedding dimension : 128

2. The bidirectional LSTM outputs a vector twice the size we specified because it's bidirectional. Remember that the two LSTM outputs are concatenated before going to the output layer.

3. Output layer also has three dimensions
- Batch size
- Sequence length
- Output dimension : 13, the number of tag classes (12 tags plus one for padding)
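The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights and a bias). The helper names are introduced here for illustration.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per token id
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias
    return input_dim * units + units

embedding = embedding_params(52183 + 1, 128)  # vocabulary size + 1 for padding
bilstm = 2 * lstm_params(128, 128)            # forward and backward LSTM
dense = dense_params(2 * 128, 13)             # concatenated LSTM outputs -> 13 classes
total = embedding + bilstm + dense

print(embedding, bilstm, dense, total)  # 6679552 263168 3341 6946061
```

Almost all of the parameters sit in the embedding layer, which is why vocabulary size dominates the size of this model.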


In [None]:
# We add early stopping so that training halts when the validation loss stops improving.
es_callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)

history = model.fit(x_train_padded, y_train_categoricals, epochs=20,
                    batch_size=256, validation_data=(x_val_padded, y_val_categoricals),
                    callbacks=[es_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
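Early stopping with a patience of 3 can be sketched in a few lines. The validation losses below are invented for illustration (the real values were not captured in the output above), and `stopping_epoch` is a hypothetical helper that mirrors the idea rather than the Keras callback itself.

```python
def stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)   # ran through every epoch without triggering

# Hypothetical validation losses: the best value is reached at epoch 5.
hypothetical_val_losses = [0.50, 0.30, 0.20, 0.15, 0.14, 0.145, 0.15, 0.16, 0.17]
stopped_at = stopping_epoch(hypothetical_val_losses, patience=3)
print(stopped_at)  # 8
```

Under these assumed losses, a run that ends at epoch 8 means the best validation loss came three epochs earlier.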


In [None]:
# After the model is trained, move on to the test set.
# Preprocess it (tokenize, pad) and one-hot encode the labels.

x_test_seqs = sentence_tokenizer.texts_to_sequences(x_test)
x_test_padded = keras.preprocessing.sequence.pad_sequences(x_test_seqs, padding='post', maxlen=MAX_LENGTH)

y_test_seqs = tag_tokenizer.texts_to_sequences(y_test)
y_test_padded = keras.preprocessing.sequence.pad_sequences(y_test_seqs, padding='post', maxlen=MAX_LENGTH)
y_test_categoricals = keras.utils.to_categorical(y_test_padded)

In [None]:
model.evaluate(x_test_padded, y_test_categoricals)



[0.1051989421248436, 0.9687950611114502]

In [None]:
# Now we can use our model to tag sentences.

samples = [
    "Brown refused to testify.",
    "Brown sofas are on sale",
]

In [None]:
# Define a simple function for this task.

def tag_sentences(sentences):
  sentences_seqs = sentence_tokenizer.texts_to_sequences(sentences)
  sentences_padded = keras.preprocessing.sequence.pad_sequences(sentences_seqs,
                                                               maxlen=MAX_LENGTH,
                                                               padding='post')
  # tag_preds holds, for each sentence, one softmax probability distribution per timestep
  tag_preds = model.predict(sentences_padded)

  sentence_tags = []

  # each iteration is one sequence
  for i, preds in enumerate(tag_preds):

    print(preds)
    # most probable tag index at each real (non-padded) timestep
    tags_seq = [np.argmax(p) for p in preds[:len(sentences_seqs[i])]]
    words = [sentence_tokenizer.index_word[w] for w in sentences_seqs[i]]
    tags = [tag_tokenizer.index_word[t] for t in tags_seq]
    sentence_tags.append(list(zip(words, tags)))

  return sentence_tags

In [None]:
tagged_sample_sentences = tag_sentences(samples)

[[3.2334467e-06 9.9331737e-01 2.9946794e-04 ... 3.9344709e-04
  1.2747691e-05 2.2187696e-05]
 [1.2938746e-06 3.1313516e-04 9.9806172e-01 ... 1.4160351e-04
  4.5686566e-09 1.3718624e-06]
 [1.2729881e-06 2.0816846e-05 6.1781978e-04 ... 4.8411652e-01
  3.2293276e-05 3.9092065e-06]
 ...
 [7.5027689e-02 8.0436386e-02 7.9266421e-02 ... 7.5670168e-02
  7.5860150e-02 7.5894080e-02]
 [7.5027689e-02 8.0436386e-02 7.9266421e-02 ... 7.5670168e-02
  7.5860150e-02 7.5894080e-02]
 [7.5027689e-02 8.0436386e-02 7.9266421e-02 ... 7.5670168e-02
  7.5860150e-02 7.5894080e-02]]
[[4.2764573e-06 1.5268865e-01 1.3265687e-04 ... 3.5671474e-05
  2.9227600e-04 1.2731762e-05]
 [2.3185294e-08 9.9997950e-01 6.1821049e-07 ... 1.7081586e-06
  1.0583433e-06 1.6246985e-06]
 [4.9634735e-10 3.8132605e-06 9.9999332e-01 ... 1.6310411e-08
  2.7530690e-13 1.3395328e-08]
 ...
 [7.5027689e-02 8.0436386e-02 7.9266421e-02 ... 7.5670168e-02
  7.5860150e-02 7.5894080e-02]
 [7.5027689e-02 8.0436386e-02 7.9266421e-02 ... 7.5670168e-

In [None]:
print(tagged_sample_sentences[0])
print(tagged_sample_sentences[1])

[('brown', 'noun'), ('refused', 'verb'), ('to', 'adp'), ('testify', 'verb')]
[('brown', 'adj'), ('sofas', 'noun'), ('are', 'verb'), ('on', 'adp'), ('sale', 'noun')]


This is just one way of building a PoS tagger. Today's PoS taggers are much more sophisticated models, typically built on transformers.

## Language Modelling With Stacked LSTMs

In [3]:
# download sample corpus
art_of_war = requests.get('https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/datasets/art_of_war.txt')\
                     .text

art_of_war[:300]

'1. Sun Tzŭ said: The art of war is of vital importance to the State.\n\n2. It is a matter of life and death, a road either to safety or to\nruin. Hence it is a subject of inquiry which can on no account be\nneglected.\n\n3. The art of war, then, is governed by five constant factors, to be\ntaken into accou'

[notes]\
We will build a character-based language model (as opposed to a word-based one).

Character-level models have the advantage of:
1. A smaller prediction space (there are far fewer characters than words, at least in English).
2. More resilience to the out-of-vocabulary (OOV) problem, and a better ability to learn the lower-level mechanics of language (including punctuation).

On the other hand, character-level models need to learn a sequence of characters to "make sense" of a word (e.g. the sequence of "c", "a", "t" to identify "cat" as a pattern), which can be inefficient and result in lower performance.

RNNs can process any kind of sequence, so what's shown here can easily be applied at the word level. When we cover transformers, we'll see an alternative approach called subword tokenization, which is a middle ground between these two approaches.


In [4]:
tokenizer = keras.preprocessing.text.Tokenizer(char_level = True)

In [5]:
tokenizer.fit_on_texts([art_of_war])

In [6]:
tokenizer.get_config()

{'num_words': None,
 'filters': '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
 'lower': True,
 'split': ' ',
 'char_level': True,
 'oov_token': None,
 'document_count': 1,
 'word_counts': '{"1": 179, ".": 896, " ": 9794, "s": 3081, "u": 1467, "n": 3565, "t": 4398, "z": 20, "\\u016d": 13, "a": 3475, "i": 3573, "d": 1681, ":": 48, "h": 2558, "e": 5837, "r": 2776, "o": 3548, "f": 1238, "w": 981, "v": 478, "l": 1722, "m": 1201, "p": 769, "c": 1390, "\\n": 1443, "2": 127, ",": 634, "y": 1055, "b": 708, "j": 23, "q": 55, "g": 1007, "3": 87, "k": 345, "\\u2019": 57, "4": 66, "(": 59, ")": 59, ";": 168, "5": 58, "6": 51, "_": 62, "7": 39, "8": 36, "9": 34, "0": 38, "x": 49, "\\u2014": 16, "?": 8, "!": 8, "-": 57, "\\u201c": 3, "\\u201d": 3, "\\u0153": 7, "\\u00fc": 3, "\\u2018": 1}',
 'word_docs': '{"(": 1, "l": 1, "s": 1, "5": 1, "m": 1, "!": 1, "7": 1, "3": 1, "j": 1, "?": 1, "c": 1, "k": 1, ".": 1, "\\u2019": 1, "h": 1, "u": 1, "r": 1, "1": 1, "\\u201d": 1, "i": 1, "o": 1, "\\n": 1, "g": 1, "a": 

In [7]:
print(f"Tokenizer \"Vocabulary\" size : {len(tokenizer.word_index)}")

Tokenizer "Vocabulary" size : 56


In [8]:
seq = tokenizer.texts_to_sequences([art_of_war])[0] # first sequence

In [9]:
print(f"Text length: {len(seq)}")

Text length: 61054


In [10]:
# Sanity check.
tokenizer.sequences_to_texts([seq[:10]])

['1 .   s u n   t z ŭ']

In [11]:
# We're going to use the TensorFlow Data API, which makes it easy to build preprocessing pipelines by chaining operations together.
# To use it, we need to convert our vectorized corpus into a Dataset object (from_tensor_slices).
slices = tf.data.Dataset.from_tensor_slices(seq)
type(slices)

tensorflow.python.data.ops.from_tensor_slices_op._TensorSliceDataset

In [12]:
list(slices.take(10))

[<tf.Tensor: shape=(), dtype=int32, numpy=27>,
 <tf.Tensor: shape=(), dtype=int32, numpy=21>,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=int32, numpy=8>,
 <tf.Tensor: shape=(), dtype=int32, numpy=13>,
 <tf.Tensor: shape=(), dtype=int32, numpy=5>,
 <tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=int32, numpy=3>,
 <tf.Tensor: shape=(), dtype=int32, numpy=47>,
 <tf.Tensor: shape=(), dtype=int32, numpy=49>]

In [13]:
seq[:10]

[27, 21, 1, 8, 13, 5, 1, 3, 47, 49]

In [14]:
# We'll use the window method. With the default shift, a dataset of 1000 elements and a window size of 100 would yield 10 windows of 100 elements each.
# Here, each window is "input_timesteps + 1" long; the extra element supplies the target/label for each training example.
# shift=1 slides the window forward one character at a time, so consecutive windows overlap.
# drop_remainder=True ensures every window contains exactly window_size elements; shorter trailing windows are dropped.
input_timesteps = 100
window_size = input_timesteps + 1
windows = slices.window(window_size, shift=1, drop_remainder=True)
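The same sliding-window idea can be sketched with plain lists. `sliding_windows` and `count_windows` are hypothetical helpers, not part of the TensorFlow API. The sketch assumes that only full windows are kept, as with `drop_remainder=True`.

```python
def sliding_windows(seq, window_size, shift=1):
    """Return every full window of `window_size` items, moving `shift` items each time."""
    last_start = len(seq) - window_size
    return [seq[i:i + window_size] for i in range(0, last_start + 1, shift)]

def count_windows(n_items, window_size, shift=1):
    """Number of full windows in a sequence of `n_items` elements."""
    if n_items < window_size:
        return 0
    return (n_items - window_size) // shift + 1

toy_windows = sliding_windows(list(range(6)), window_size=4)
print(toy_windows)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5]]

# The corpus has 61054 characters and each window has 100 + 1 elements.
n_corpus_windows = count_windows(61054, 101)
print(n_corpus_windows)  # 60954
```

With a shift of 1, nearly every character starts its own training example, which is why such a small corpus still yields tens of thousands of windows.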

In [15]:
for w in windows.take(3):
  arr = list(w.as_numpy_iterator())
  print(len(arr), arr)

101 [27, 21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12]
101 [21, 1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2]
101 [1, 8, 13, 5, 1, 3, 47, 49, 1, 8, 7, 4, 12, 41, 1, 3, 10, 2, 1, 7, 9, 3, 1, 6, 16, 1, 20, 7, 9, 1, 4, 8, 1, 6, 16, 1, 25, 4, 3, 7, 11, 1, 4, 17, 22, 6, 9, 3, 7, 5, 15, 2, 1, 3, 6, 1, 3, 10, 2, 1, 8, 3, 7, 3, 2, 21, 14, 14, 29, 21, 1, 4, 3, 1, 4, 8, 1, 7, 1, 17, 7, 3, 3, 2, 9, 1, 6, 16, 1, 11, 4, 16, 2, 1, 7, 5, 12, 1, 12, 2

In [16]:
# window method returns a nested dataset of datasets
print(windows, '\n')

for w in windows.take(2):
  print(w)

<_WindowDataset element_spec=DatasetSpec(TensorSpec(shape=(), dtype=tf.int32, name=None), TensorShape([]))> 

<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>
<_VariantDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>


In [17]:
# Our model only accepts tensors, so we extract a tensor from each window using flat_map.
dataset = windows.flat_map(lambda window: window.batch(window_size))

In [18]:
# Now we have single dataset of tensors, where each tensor is "input_timesteps+1" long and shifted by 1.
for d in dataset.take(2):
  print(d)

tf.Tensor(
[27 21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3
  1  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6
  9  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21
  1  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1
  7  5 12  1 12], shape=(101,), dtype=int32)
tf.Tensor(
[21  1  8 13  5  1  3 47 49  1  8  7  4 12 41  1  3 10  2  1  7  9  3  1
  6 16  1 20  7  9  1  4  8  1  6 16  1 25  4  3  7 11  1  4 17 22  6  9
  3  7  5 15  2  1  3  6  1  3 10  2  1  8  3  7  3  2 21 14 14 29 21  1
  4  3  1  4  8  1  7  1 17  7  3  3  2  9  1  6 16  1 11  4 16  2  1  7
  5 12  1 12  2], shape=(101,), dtype=int32)


In [19]:
# Shuffle the windows and group them into batches.
batch_size = 32
batches = dataset.shuffle(10000).batch(batch_size)

In [20]:
for b in batches.take(2):
  print(b)

tf.Tensor(
[[ 5 12  1 ...  5 12  1]
 [ 1  9  2 ...  1  3  6]
 [ 2  1  7 ...  4 11 25]
 ...
 [ 9  2  3 ... 34  1 23]
 [16  9  6 ...  7  9  4]
 [ 1  8  3 ... 18  6 13]], shape=(32, 101), dtype=int32)
tf.Tensor(
[[ 3  7  5 ...  5  1  3]
 [ 2  1  7 ...  1  7  5]
 [ 9  8 24 ...  2  5 12]
 ...
 [22  9  2 ... 14 14 30]
 [ 5  1  2 ...  9 15  2]
 [ 1 17  7 ... 10  1  3]], shape=(32, 101), dtype=int32)


[notes]\
We talked about _teacher forcing_:
1. At each timestep during training, the output is compared to a label.
2. At the next timestep, rather than feeding the model its previous output, we feed it the next character of the input sequence (i.e. what the model should have output: the label).

This is why each sequence is of size _input_timesteps + 1_. Each sequence is now split into two sequences. The first is the training input, of length input_timesteps. The second is the label/target, consisting of the same elements shifted by 1.

So if the processed sequence is "she swam in the lake", then:
+ the input will be "she swam in the lak" (drop last char)
+ the target/label will be "he swam in the lake" (drop first char)
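The split is a pair of slices. Here is the same example in plain Python, using a string instead of a tensor; the variable names are illustrative.

```python
window = "she swam in the lake"

x = window[:-1]  # input: drop the last character
y = window[1:]   # target: drop the first character

print(repr(x))  # 'she swam in the lak'
print(repr(y))  # 'he swam in the lake'

# At every timestep the target is simply the next character of the input.
first_pairs = list(zip(x, y))[:3]
print(first_pairs)  # [('s', 'h'), ('h', 'e'), ('e', ' ')]
```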


In [21]:
# x-y paired batches
xy_batches = batches.map(lambda batch: (batch[:, :-1], batch[:, 1:]))

In [22]:
for b in xy_batches.take(1):
  print(b)

(<tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[ 2, 17,  8, ...,  8,  7,  5],
       [ 9,  2,  3, ..., 29, 34,  1],
       [ 1, 20, 10, ...,  1, 20,  7],
       ...,
       [12,  8,  1, ...,  6,  3,  1],
       [50, 14, 33, ...,  1, 17,  6],
       [15,  2,  1, ..., 13,  7,  4]], dtype=int32)>, <tf.Tensor: shape=(32, 100), dtype=int32, numpy=
array([[17,  8,  1, ...,  7,  5, 12],
       [ 2,  3,  9, ..., 34,  1, 23],
       [20, 10,  6, ..., 20,  7, 18],
       ...,
       [ 8,  1,  8, ...,  3,  1, 20],
       [14, 33, 30, ..., 17,  6,  8],
       [ 2,  1,  6, ...,  7,  4,  5]], dtype=int32)>)


In [23]:
# For greater clarity, this is the first input sequence from the first batch and its corresponding label/target sequence.
for b in xy_batches.take(1): # first batch
  print("x1 length: ", len(b[0][0].numpy())) # first sequence (X)
  print("x1: ", b[0][0].numpy())

  print("\n")

  print("y1 length: ", len(b[1][0].numpy())) # first sequence (y)
  print("y1: ", b[1][0].numpy())

x1 length:  100
x1:  [ 1  3  6  1 25  4 15  3  6  9 18 24  1 17 13  8  3  1  5  6  3  1 23  2
  1 12  4 25 13 11 19  2 12 14 23  2 16  6  9  2 10  7  5 12 21 14 14 29
 39 21  1  5  6 20  1  3 10  2  1 19  2  5  2  9  7 11  1 20 10  6  1 20
  4  5  8  1  7  1 23  7  3  3 11  2  1 17  7 26  2  8  1 17  7  5 18  1
 15  7 11 15]


y1 length:  100
y1:  [ 3  6  1 25  4 15  3  6  9 18 24  1 17 13  8  3  1  5  6  3  1 23  2  1
 12  4 25 13 11 19  2 12 14 23  2 16  6  9  2 10  7  5 12 21 14 14 29 39
 21  1  5  6 20  1  3 10  2  1 19  2  5  2  9  7 11  1 20 10  6  1 20  4
  5  8  1  7  1 23  7  3  3 11  2  1 17  7 26  2  8  1 17  7  5 18  1 15
  7 11 15 13]


[notes] \
The last step is to one-hot encode the inputs, because:
1. We're not using embeddings for the input. We could, but since this is a character model with just a few dozen possible choices, we can get away with one-hot encoding. There's also no reason to think a particular letter should be closer to another in vector space, as we would want in a word-level model.
2. Since we're not using embeddings and our input is categorical, we need to one-hot encode.

We do NOT one-hot encode the label/target data.
This is because we'll be using a loss function that lets us skip that step (more below).

In [26]:
num_tokens = len(tokenizer.word_index) + 1

# One-hot encode the input sequences, don't do anything with the label/target sequences
xy_batches = xy_batches.map(lambda inputs, labels: (tf.one_hot(inputs, depth=num_tokens), labels))

In [27]:
# Each input sequence is now a sequence of one-hot encodings
for b in xy_batches.take(1):
  print("x1 ", b[0][0].numpy())
  print("\n")
  print("y1 ", b[1][0].numpy())


x1  [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


y1  [ 2  1 23 13 11 20  7  9 26  1  4  8 14 15  6 17 22 11  2  3  2  1  7  3
  1  7 11 11  1 22  6  4  5  3  8 28  1  3 10  2  1  8  3  7  3  2  1 20
  4 11 11  1 23  2  1  8  3  9  6  5 19 28  1  4 16  1  3 10  2  1 23 13
 11 20  7  9 26  1  4  8 14 12  2 16  2 15  3  4 25  2 24  1  3 10  2  1
  8  3  7  3]


[note] \
At this point, we've
+ Segmented our corpus into fixed-length sequences
+ Created training and label/target sequences
+ Organized them into batches.

We still need to add "prefetching", which is an optimization step.
This way, while the model trains on the current batch of data, the pipeline reads and prepares the next batch.


In [28]:
# Prefetch on xy_batches, since that is the dataset passed to fit.
xy_batches = xy_batches.prefetch(tf.data.AUTOTUNE)

[note] \
Now we build our model.
Three new things to remember:
1. We're stacking two LSTMs. The sequential output of the first LSTM becomes the sequential input to the second LSTM.
2. We're adding some _recurrent_dropout_. This drops connections between the recurrent units (i.e. the dropout is applied horizontally across time). You can still use regular dropout as well, which is applied to the inputs/outputs.
3. We're using __sparse_categorical_crossentropy__. This allows us to provide labels as integers for multiclass classification rather than one-hot encodings.
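The sparse and one-hot forms of cross-entropy give the same number for a single prediction. The sketch below works that out in plain Python for one timestep. The probabilities are made up and the functions are simplified stand-ins for the Keras losses.

```python
import math

def categorical_crossentropy(one_hot_label, probs):
    # Sum over classes; only the true class has a non-zero weight
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))

def sparse_categorical_crossentropy(label_index, probs):
    # Pick out the probability of the true class directly
    return -math.log(probs[label_index])

probs = [0.1, 0.7, 0.2]  # made-up softmax output over three classes
dense_loss = categorical_crossentropy([0.0, 1.0, 0.0], probs)
sparse_loss = sparse_categorical_crossentropy(1, probs)

print(math.isclose(dense_loss, sparse_loss))  # True
```

Keeping the labels as integers saves the memory of storing a one-hot vector for every character in every sequence.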

In [29]:
model = keras.models.Sequential()

model.add(layers.LSTM(128, return_sequences=True, input_shape=[None, num_tokens], recurrent_dropout=0.2))
model.add(layers.LSTM(128, return_sequences=True, recurrent_dropout=0.2))

model.add(layers.Dense(num_tokens, activation='softmax'))

model.compile(loss="sparse_categorical_crossentropy", optimizer='adam')



In [30]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm (LSTM)                 (None, None, 128)         95232     
                                                                 
 lstm_1 (LSTM)               (None, None, 128)         131584    
                                                                 
 dense (Dense)               (None, None, 57)          7353      
                                                                 
Total params: 234,169
Trainable params: 234,169
Non-trainable params: 0
_________________________________________________________________


In [31]:
# This model takes a few hours to train, so we use model checkpoints to save the weights after every epoch.
# If something goes wrong with our system during training, we can reload the last set of weights from the checkpoint.
filepath="./ArtofWarLM/training1/cp.ckpt"

# Create a callback that saves the model's weights
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=filepath,
                                                 save_weights_only=True,
                                                 verbose=1)

In [32]:
# When calling the model's fit method,
# simply pass in the batches as is (no need to separate them into explicit x and y arguments)
# and pass the model checkpoint callback to save weights after every epoch.
# [alert!] fit is commented out because this model takes a few hours to train; I trained it ahead of time and saved it.
#
# history = model.fit(xy_batches, epochs = 50, callbacks=[cp_callback])
#

In [33]:
# Once training is complete, we can save the whole model (architecture and weights) using the model's save method.
#
# model.save('art_of_war_char_level_lm')
#

In [34]:
# Download the pretrained model..
!wget https://github.com/nitinpunjabi/nlp-demystified/raw/main/models/art_of_war_char_level_lm.zip
!unzip -o art_of_war_char_level_lm.zip

--2023-08-11 15:33:49--  https://github.com/nitinpunjabi/nlp-demystified/raw/main/models/art_of_war_char_level_lm.zip
Resolving github.com (github.com)... 20.205.243.166
Connecting to github.com (github.com)|20.205.243.166|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/models/art_of_war_char_level_lm.zip [following]
--2023-08-11 15:33:50--  https://raw.githubusercontent.com/nitinpunjabi/nlp-demystified/main/models/art_of_war_char_level_lm.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2691531 (2.6M) [application/zip]
Saving to: ‘art_of_war_char_level_lm.zip’


2023-08-11 15:33:50 (207 MB/s) - ‘art_of_war_char_level_lm.zip’ saved [2691531/2691531]

Archive

In [35]:
# Load
model = keras.models.load_model('art_of_war_char_level_lm')



[note] \
Now that we have a trained model, let's generate some text.

The function below takes some seed text and uses that to generate a certain number of characters. For each character, it uses the generated text so far as the input. It's not the most efficient function but it'll work here.

There's also a temperature parameter. The next character is sampled from a probability distribution. By dividing the log of this distribution by the temperature before sampling, we can influence the randomness of the output.

When the temperature is low (< 1), the probability distribution sharpens and the model sticks more closely to the patterns of the original text. At a temperature of 1, the distribution is left unchanged. As we raise the temperature above 1, the distribution flattens and there's a higher chance the model picks something unexpected, resulting in greater surprise in the output. In practice, a high enough temperature will result in nonsense.
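To make the effect of temperature concrete, here is a minimal sketch in plain Python. The four-value distribution and the helper name `apply_temperature` are made up for illustration; they are not part of the notebook. The helper divides the log-probabilities by the temperature and renormalizes, which matches what sampling from the scaled log-probabilities amounts to.

```python
import math

def apply_temperature(probs, temperature):
    """Return softmax(log(p) / temperature) for a probability distribution."""
    scaled = [math.log(p) / temperature for p in probs]
    # Subtract the max before exponentiating for numerical stability.
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-character distribution over four characters.
probs = [0.5, 0.3, 0.15, 0.05]

# Low temperature sharpens the distribution, 1.0 leaves it unchanged,
# and high temperature flattens it.
for t in (0.2, 1.0, 2.0):
    rescaled = apply_temperature(probs, t)
    print(t, [round(p, 3) for p in rescaled])
```

Comparing the printed rows shows the most likely character taking most of the probability mass at low temperature and the least likely character gaining mass at high temperature.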

In [36]:
print(input_timesteps)

100


In [39]:
def generate_text(model, tokenizer, seed_text, num_chars=200, temperature=1):

  text = seed_text

  for _ in range(num_chars):

    # Take the last *input_timesteps* characters of the text so far as input.
    # (input_timesteps and num_tokens are taken from the notebook's global scope.)
    x = np.array(tokenizer.texts_to_sequences([text[-input_timesteps:]]))
    x = tf.one_hot(x, num_tokens)

    # Create probability distribution for next character adjusted by temperature.
    # We want only the last timestep, so we extract the softmax output for it.
    preds = model.predict(x)[0, -1:, :]
    preds = tf.math.log(preds) / temperature

    # Sample next character and add to running text.
    next_char = tf.random.categorical(preds, num_samples=1)
    next_char = tokenizer.sequences_to_texts(next_char.numpy())[0]

    text += next_char

  return text

In [40]:
%%time
print(generate_text(model, tokenizer, "Banana peels on the battlefield can", num_chars=300, temperature=0.2))

Banana peels on the battlefield can never come again
into being; nor can the dead ever be brought back to life.

22. hence the enlightened ruler lays his plans well ahead;
the good general cultivates his resources.

17. move not unless you see an advantage; use not your troops unless
there is something to be gained; fight not unless 
CPU times: user 41.9 s, sys: 1.22 s, total: 43.1 s
Wall time: 46.6 s


In [41]:
print(generate_text(model, tokenizer, "It's time to release the Kraken when", num_chars=300, temperature=0.5))

It's time to release the Kraken when the well-being of your men, and do not attack.

8. (3) when the force of the flames has reached its height, follow it
up with an attack, if that is practicable; if not, stay where
you are.

20. anger may in time change to gladness; vexation may be succeeded by
content.

21. but a kingdom that has o


In [42]:
print(generate_text(model, tokenizer, "Crush your enemies, see them driven before you, and", num_chars=300,
                    temperature=1))

Crush your enemies, see them driven before you, and we may are has
destroyed converted spies and available for our service.

22. it is through the information brought by the converted spy that we
are able to acquire and employ local and inward spies.

23. it is owing to his information, again, that we can cause the doomed
spy. hence it is essential 


In [43]:
print(generate_text(model, tokenizer, "What is best in life?", num_chars=300, temperature=2))

What is best in life?

2ar confunder, the path defined battles confer.

1. skowing them outlay inflisate of a sovoutint; dyet a city, or event he will be commotiom, and doome

22.;t, it is auther is _doi too lü yüa who had served under the yin hapi-man. is it bskignoned
_marruitly and remaining fie of the fourth.

3. le


In [45]:
print(generate_text(model, tokenizer, "The best pub in Manchester is", num_chars=300, temperature=0.1))

The best pub in Manchester is knowledge of
the enemy; and this knowledge can only be derived, in the first
instance, from the converted spy. hence it is essential that the
converted spy be treated with the utmost liberality.

26. of old, the rise of the yin dynasty was due to i chih who had
served under the yin.

27. hence it i


A few observations about the preceding outputs:

1. Despite being a character-level model, the model managed to "learn" spelling, cadence, punctuation, spacing, grammar, and even numbered bullet points just from trying to predict the next character.

2. It's pretty cool how the model manages to take our initial seed text and complete a sentence with it before moving on.

3. We can see the output getting increasingly nonsensical as the temperature rises. What temperature to use ultimately depends on the nature of your corpus and your goals with the language model.

Also, in contrast to our language model (about 234,000 parameters), GPT-3 has 175 billion parameters and drew on roughly 45 terabytes of raw Common Crawl text before filtering, but the high-level principle of learning through prediction remains the same.

## Further Exploration
1. Check out Sunspring, a sci-fi short written by an LSTM. The director and actors played the script straight and the result is hilarious.
https://www.youtube.com/watch?v=LY7x2Ihqjmc
https://en.wikipedia.org/wiki/Sunspring

2. Everything we learned here can be applied at the word-level. Try creating a word-level language model with a different corpus (maybe download something from https://www.gutenberg.org/) and try using word embeddings.

3. We didn't evaluate our language model using perplexity. Find out online how to do it.
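
As a starting point for the perplexity exercise, here is a minimal sketch in plain Python. It assumes you have already collected the probability the model assigned to each true next character; the two example lists are made up for illustration and do not come from our model. Perplexity is the exponential of the average negative log-probability, so lower is better.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log-probability of the true tokens."""
    n = len(true_token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in true_token_probs) / n
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities assigned to the correct next character.
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.2, 0.1, 0.25, 0.15]

print(perplexity(confident))
print(perplexity(uncertain))
```

A model that always spreads its probability evenly over k characters has a perplexity of k, which gives an intuitive reading of the number: roughly how many equally likely choices the model is hesitating between at each step.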