We'll build a language model trained on the Art of War by Sun Tzu.

The language model we'll build will be **word**-based (as opposed to character-based). That is, given a sequence of one or more words, the model will be asked to predict the next word.<br><br>

A word-level language model is a statistical language model that predicts the probability of a word given the preceding words in a sequence. 
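
To make "the probability of a word given the preceding words" concrete, here is a minimal sketch of the simplest possible case: a bigram model that conditions on only one preceding word and estimates probabilities from raw counts. The toy corpus and the function names are made up for illustration; the neural model built later in this notebook conditions on a much longer context.

```python
from collections import Counter, defaultdict

def train_bigram_model(words):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Estimate P(next word | prev) from the raw counts."""
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

# Toy corpus (made up for illustration, not the Art of War text).
toy_words = "the art of war is the art of deception".split()
bigram_counts = train_bigram_model(toy_words)

probs_after_the = next_word_probs(bigram_counts, "the")
probs_after_of = next_word_probs(bigram_counts, "of")
print(probs_after_the)  # "the" is always followed by "art" in the toy corpus
print(probs_after_of)   # "of" is followed by "war" and "deception" equally often
```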

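Word-level models are often preferred over character-level models because they work on much shorter sequences, which lowers the computational cost of covering the same passage of text, and because they cannot produce misspelled words. The trade-off is a larger vocabulary, since every distinct word needs its own entry.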

Word-level models can use a distributed representation where different words with similar meanings have similar representation and can use a large context of recently observed words when making predictions.
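
As a rough illustration of "similar meanings have similar representation", the sketch below compares word vectors with cosine similarity. The three-dimensional vectors are invented for the example; real embeddings such as the GloVe vectors used later have 100 or more dimensions and are learned from data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen by hand for illustration only.
toy_vectors = {
    "army":   [0.9, 0.8, 0.1],
    "troops": [0.8, 0.9, 0.2],
    "river":  [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(toy_vectors["army"], toy_vectors["troops"])
sim_far = cosine_similarity(toy_vectors["army"], toy_vectors["river"])
print(round(sim_close, 3), round(sim_far, 3))
```

With these hand-picked vectors, "army" ends up closer to "troops" than to "river".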



RNNs can process any kind of sequence, so what's shown here can easily be applied at the character level.

# Steps
- Load the dataset
- Prepare the dataset for modeling
- Define the model architecture
- Train the model
- Generate text using the trained model

Let's get started!



# Step 1: Load the dataset

In [4]:
import requests

# Download the art of war dataset
url = "https://raw.githubusercontent.com/jrreda/AI-projects/main/Language%20Modelling/art_of_war.txt"
text = requests.get(url).text
print(text[:100])

1. Sun Tzŭ said: The art of war is of vital importance to the State.

2. It is a matter of life and 


# Step 2: Prepare the dataset for modeling

Next, you need to prepare the text dataset for modeling. In this step, we'll do the following:

- Tokenize the text into words
- Convert the words to lowercase
- Create sequences of words to use as input/output pairs for the model
- Encode the words as integers

Here's the code to do these steps:

In [5]:
import string

# clean the text data
text = text.translate(str.maketrans('', '', string.punctuation)) # remove punctuation
text = text.lower() # convert to lowercase

In [67]:
from tensorflow.keras.preprocessing.text import Tokenizer

# tokenize the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
sequences = tokenizer.texts_to_sequences([text])[0]

# create a dictionary that maps words to integers
word_index = tokenizer.word_index

In [7]:
import numpy as np

# encode the sequences of words as integers
vocab_size = len(word_index) + 1
encoded = np.array(sequences)

In [8]:
# create sequences of words for training the model
seq_len = 128
sequences = []
for i in range(seq_len, len(encoded)):
  sequence = encoded[i-seq_len: i+1]
  sequences.append(sequence)

In [9]:
sequences[0]

array([ 66,  87, 104, 105,   1,  98,   2,  70,   5,   2, 962, 963,   3,
         1, 170,  67,  11,   5,   6, 620,   2, 446,   4, 171,   6, 447,
       356,   3, 357,  54,   3, 448,  56,  11,   5,   6, 449,   2, 964,
        30,  31,  17,  49, 229,   8, 965,  68,   1,  98,   2,  70,  71,
         5, 966,  22,  75, 450, 967,   3,   8, 621,  57, 229,   7, 194,
       622,  20, 623,   3, 624,   1, 279, 968,   7,   1, 172,  72,  76,
        10,  66,   1, 358, 359,  67, 230,  68, 155,  72,   1, 625,  73,
       231,   4, 195,  73,  88,   1, 358, 359, 232,   1, 360,   3,   8,
         7, 280, 626,  23,  42, 196,  77,  21,  38,   9, 361,  43, 969,
         2,  42, 451, 970,  22, 120, 362,  94, 230, 971, 281,   4])
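
The windowing loop above is easier to follow on a tiny input. The sketch below uses the same slicing with a hypothetical list of six token ids and `seq_len=3`, so each window holds 3 context ids plus 1 target id.

```python
def make_windows(token_ids, seq_len):
    """Slide a window of seq_len + 1 ids over the token list."""
    windows = []
    for i in range(seq_len, len(token_ids)):
        windows.append(token_ids[i - seq_len : i + 1])
    return windows

# Hypothetical token ids, for illustration only.
toy_ids = [10, 11, 12, 13, 14, 15]
toy_windows = make_windows(toy_ids, seq_len=3)

for window in toy_windows:
    context, target = window[:-1], window[-1]
    print(context, "->", target)
```

Each window is later split the same way: everything but the last id is the input, and the last id is the word to predict.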

In [10]:
# pad sequences to make them all the same length
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

max_len = max([len(seq) for seq in sequences]) # len(max(sequences, key=len))
sequences = pad_sequences(sequences, maxlen=max_len, padding='pre')

X, y = sequences[:, :-1], sequences[:, -1]
y = to_categorical(y, num_classes=vocab_size)

In [11]:
X[0], y[0]

(array([ 66,  87, 104, 105,   1,  98,   2,  70,   5,   2, 962, 963,   3,
          1, 170,  67,  11,   5,   6, 620,   2, 446,   4, 171,   6, 447,
        356,   3, 357,  54,   3, 448,  56,  11,   5,   6, 449,   2, 964,
         30,  31,  17,  49, 229,   8, 965,  68,   1,  98,   2,  70,  71,
          5, 966,  22,  75, 450, 967,   3,   8, 621,  57, 229,   7, 194,
        622,  20, 623,   3, 624,   1, 279, 968,   7,   1, 172,  72,  76,
         10,  66,   1, 358, 359,  67, 230,  68, 155,  72,   1, 625,  73,
        231,   4, 195,  73,  88,   1, 358, 359, 232,   1, 360,   3,   8,
          7, 280, 626,  23,  42, 196,  77,  21,  38,   9, 361,  43, 969,
          2,  42, 451, 970,  22, 120, 362,  94, 230, 971, 281], dtype=int32),
 array([0., 0., 0., ..., 0., 0., 0.], dtype=float32))

`buffer_size` is a parameter that controls how thoroughly the data is shuffled before batching. When I created the training dataset from the pairs of input and output sequences, I used the `shuffle` method to randomize the order of the pairs, so that a batch is not made up of neighbouring, heavily overlapping windows. Shuffling the entire dataset at once means holding all of it in memory, so `shuffle` works with a buffer instead: it fills the buffer with the first `buffer_size` pairs, then repeatedly picks one pair at random from the buffer and replaces it with the next pair from the dataset. For example, with 10000 pairs and a `buffer_size` of 1000, every output is drawn at random from 1000 candidates, so the shuffle is only approximate. A `buffer_size` at least as large as the dataset gives a full uniform shuffle.
<br><br>
`.prefetch` is a method that allows the dataset to fetch the next batch of data while the model is training on the current batch. This helps to reduce the idle time of the model and improve the training speed.
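
To make the role of the shuffle buffer concrete, here is a plain-Python sketch of buffer-based shuffling. It is a simplified simulation of the idea, not TensorFlow's actual implementation, and `buffered_shuffle` is a name made up for this example.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in approximately random order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer; nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain whatever is left in random order.
    while buffer:
        idx = rng.randrange(len(buffer))
        yield buffer.pop(idx)

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
unshuffled = list(buffered_shuffle(range(20), buffer_size=1))
print(shuffled)
print(unshuffled)  # a buffer of size 1 leaves the order unchanged
```

With a small buffer an element can only move a limited distance from its original position, which is why a larger buffer gives a more thorough shuffle.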


In [12]:
import tensorflow as tf

# define batch size and buffer size
batch_size = 32
buffer_size = 10000

# create dataset from X and y
dataset = tf.data.Dataset.from_tensor_slices((X, y))

# shuffle the dataset and split into batches
dataset = dataset.shuffle(buffer_size).batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)

In [13]:
dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(32, 128), dtype=tf.int32, name=None), TensorSpec(shape=(32, 2232), dtype=tf.float32, name=None))>

# Step 3: Define the model architecture

Now that you have your dataset prepared, you can define your language model. In this example, we'll use a recurrent neural network: an embedding layer initialized with pre-trained GloVe vectors, three stacked bidirectional LSTM layers with dropout, a final LSTM layer, and a softmax output layer over the vocabulary.

Here's the code to define the model:

In [1]:
# download glove embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2023-03-27 14:26:05--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2023-03-27 14:26:05--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2023-03-27 14:26:05--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [27]:
# load the GloVe word vectors
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# create an embedding matrix
embedding_dim = 100
num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
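
Words that are missing from GloVe keep an all-zero row in `embedding_matrix`, so it can be worth checking how much of the vocabulary is actually covered. The sketch below shows one way to do that. `embedding_coverage` and the toy dictionaries are hypothetical; in the notebook the same function could be called with `tokenizer.word_index` and `embeddings_index`.

```python
def embedding_coverage(word_index, embeddings_index):
    """Return the fraction of vocabulary words that have a pre-trained vector,
    together with the list of words that do not."""
    missing = [word for word in word_index if word not in embeddings_index]
    covered = len(word_index) - len(missing)
    return covered / len(word_index), missing

# Toy stand-ins for tokenizer.word_index and embeddings_index.
toy_word_index = {"the": 1, "art": 2, "war": 3, "tzŭ": 4}
toy_embeddings = {"the": [0.1, 0.2], "art": [0.3, 0.4], "war": [0.5, 0.6]}

coverage, missing_words = embedding_coverage(toy_word_index, toy_embeddings)
print(f"coverage: {coverage:.0%}, missing: {missing_words}")
```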

In [62]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

# recurrent_dropout=0 to support cuDNN 
model = Sequential([
    Embedding(vocab_size, embedding_dim, weights=[embedding_matrix], input_length=max_len-1),
    Dropout(0.3),
    Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout=0)),  
    Dropout(0.3),
    Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout=0)),  
    Dropout(0.3),
    Bidirectional(LSTM(128, return_sequences=True, recurrent_dropout=0)),  
    LSTM(128, recurrent_dropout=0),
    Dense(vocab_size, activation='softmax')
])

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_4 (Embedding)     (None, 128, 100)          223200    
                                                                 
 dropout_8 (Dropout)         (None, 128, 100)          0         
                                                                 
 bidirectional (Bidirectiona  (None, 128, 256)         234496    
 l)                                                              
                                                                 
 dropout_9 (Dropout)         (None, 128, 256)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 128, 256)         394240    
 nal)                                                            
                                                                 
 dropout_10 (Dropout)        (None, 128, 256)         
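
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias) and uses the vocabulary size of 2232 that appears in the dataset's element spec; the helper names are made up for this example.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding_dim-sized vector per vocabulary entry."""
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """A forward and a backward LSTM with separate weights."""
    return 2 * lstm_params(input_dim, units)

print(embedding_params(2232, 100))          # 223200, embedding layer
print(bidirectional_lstm_params(100, 128))  # 234496, first bidirectional layer
print(bidirectional_lstm_params(256, 128))  # 394240, second bidirectional layer
```

The second bidirectional layer is larger than the first because its input is the 256-dimensional concatenated output of the previous layer rather than the 100-dimensional embedding.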

In [41]:
# from keras.utils import plot_model

# plot_model(model, show_shapes=True)

In [63]:
from tensorflow.keras.optimizers import Adam

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(0.002, weight_decay=0.0001))

# Step 4: Train the model

Now that you have your model defined, you can train it on the input/output pairs. 

In [64]:
# model.fit(X, y, epochs=20, verbose=1)
model.fit(dataset, epochs=50, verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7fc3c4790850>

In [44]:
# save model
model.save('art_of_war_word_level_100embed_50epochs')



In [45]:
# load saved model
from tensorflow.keras.models import load_model

model = load_model('art_of_war_word_level_100embed_50epochs')

# Step 5: Generate text using the trained model

Finally, you can use the trained model to generate text. To do this, you'll start with a seed phrase and generate the next word in the sequence using the model. You'll continue this process, generating one word at a time, until you've generated the desired amount of text.

In [68]:
import tensorflow as tf

def generate_text(seed_text, length=10, temperature=1):

  text = seed_text.strip()

  for _ in range(length):

    # Encode the text generated so far (not just the seed), so each new word
    # feeds into the next prediction. pad_sequences keeps the last max_len-1 words.
    sequence = tokenizer.texts_to_sequences([text])[0]
    sequence = pad_sequences([sequence], maxlen=max_len-1, padding='pre')

    # Create probability distribution for the next word, adjusted by temperature.
    preds = model.predict(sequence, verbose=0) # softmax over the vocabulary, shape (1, vocab_size)
    preds = tf.math.log(preds) / temperature

    # Sample the next word and add it to the running text.
    next_word = tf.random.categorical(preds, num_samples=1)
    next_word = tokenizer.sequences_to_texts(next_word.numpy())[0]
    # print(next_word)

    text += ' ' + next_word

  return text.strip()

In [69]:
generate_text("Banana peels on the battlefield can ", length=30, temperature=0.2)

'Banana peels on the battlefield can ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever ever'

In [70]:
generate_text("It's time to release the Kraken when ", length=30, temperature=0.5)

"It's time to release the Kraken when wing the wing wing will will will the will the you wing wing wing wing will wing the wing will you wing wing wing wing wing wing will wing will"

In [71]:
generate_text("Crush your enemies, see them driven before you, and ", length=30, temperature=1)

'Crush your enemies, see them driven before you, and penetrated force gone penetrated penetrated the situated grown get fallen in fallen fallen fallen fallen in gone grown gone to penetrated hold heard a still penetrated penetrated leading grown fallen'

In [72]:
generate_text("What is best in life? ", length=30, temperature=2)

'What is best in life? nor without 35 1 10 on 28 so in and 14 17 and and but 13 then strike attack 18 on foe strike i if 14 may such where so'
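
The samples above get more varied as the temperature goes up. The sketch below shows why, using the same rescaling as `generate_text` (take the log of the probabilities, divide by the temperature, renormalize) on a made-up three-word distribution.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: log, divide by temperature, softmax."""
    logits = [math.log(p) / temperature for p in probs]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-word probabilities, for illustration only.
toy_probs = [0.5, 0.3, 0.2]

cold = apply_temperature(toy_probs, 0.2)
same = apply_temperature(toy_probs, 1.0)
hot = apply_temperature(toy_probs, 2.0)

for name, dist in (("T=0.2", cold), ("T=1.0", same), ("T=2.0", hot)):
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 concentrates probability on the most likely word, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it so that unlikely words are sampled more often.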

In contrast to our language model, GPT-3 has 175 billion parameters and was trained on hundreds of gigabytes of text filtered from about 45 terabytes of raw web data, but the high-level principle of learning through prediction remains the same.

# Using a pre-trained language model like GPT-2

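The largest GPT-2 model has 1.5B parameters. The `gpt2` checkpoint loaded below is the smallest variant, with about 124M parameters.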

In [None]:
!pip install -U transformers

In [53]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

In [73]:
# Load pre-trained GPT-2 model and tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
# Switch the model to evaluation mode (disables dropout) before generating text
gpt2_model.eval() 

In [76]:
# Generate text
input_text = "Crush your enemies, see them driven before you, and "
input_ids = gpt2_tokenizer.encode(input_text, return_tensors='pt')
output = gpt2_model.generate(input_ids, max_length=250, do_sample=True, num_beams=5, temperature=1.0)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [77]:
# Decode the generated output
generated_text = gpt2_tokenizer.decode(output[0], skip_special_tokens=True).replace('\xa0', ' ')
generated_text

"Crush your enemies, see them driven before you, and  make sure you don't let them get close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too close to you.  Don't let them get too"