# Introduction
In this notebook we develop a text generator based on Agatha Christie's books.
We investigate three methods.
First, we build a simple n-gram language model.
Next, we train a recurrent neural network (RNN) built from LSTM cells.
Finally, we fine-tune a pretrained GPT-2 transformer.
# Import libraries
We start with the import of libraries that are necessary for our n-gram model.

In [1]:
import json
import numpy as np
import os
import re
import nltk
import string
from string import punctuation
from scipy.sparse import csr_matrix

# Load and preprocess dataset
We gather the books from the Project Gutenberg [website](https://www.gutenberg.org/ebooks/author/451).
For each file we remove the preface and the comments at the end.
The data is loaded from Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
!cp ./drive/MyDrive/data_science/christie.tar.gz ./christie.tar.gz

Let's extract the archive and list the files.

In [4]:
!tar -xvf ./christie.tar.gz
!rm christie.tar.gz

christie/863-0.txt
christie/61168-0.txt
christie/
christie/58866-0.txt
christie/61262-0.txt
christie/1155-0.txt
christie/65238-0.txt


We load the files and preprocess them.

In [5]:
file_names = os.listdir('./christie')
files = []
for fil in file_names:
  with open('./christie/{}'.format(fil)) as f:
    files.append(f.read())
text = ' '.join(files).lower()
text = re.sub(r'\s+', ' ', text)
text = re.sub(r'[!?]', '.', text)
text = re.sub(r'[0-9]+', '', text)
text = text.replace('mr.', 'mr')
text = text.replace('mrs.', 'mrs')
text = text.replace('n’t', ' not')

my_punctuation = punctuation.replace('.', '')
my_punctuation = my_punctuation + '—'
my_punctuation = my_punctuation + '’'
my_punctuation = my_punctuation + '‘'
my_punctuation = my_punctuation + '”'
my_punctuation = my_punctuation + '“'

translator = str.maketrans(my_punctuation, ' '*len(my_punctuation)) #map punctuation to space
text = text.translate(translator)
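
To make the effect of these steps concrete, here is a minimal sketch that applies the same normalisation to a single sentence. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re
from string import punctuation

def clean_text(raw):
    """Apply the same normalisation steps as the cell above to one string."""
    text = raw.lower()
    text = re.sub(r'\s+', ' ', text)      # collapse whitespace and newlines
    text = re.sub(r'[!?]', '.', text)     # every sentence ends with a full stop
    text = re.sub(r'[0-9]+', '', text)    # drop digits
    text = text.replace('mr.', 'mr').replace('mrs.', 'mrs')
    text = text.replace('n’t', ' not')    # expand curly-apostrophe contractions
    # all ASCII punctuation except '.', plus the typographic dash and quotes
    extra = punctuation.replace('.', '') + '\u2014\u2019\u2018\u201d\u201c'
    return text.translate(str.maketrans(extra, ' ' * len(extra)))

sample = '“Don’t move!” said Mr. Poirot, at 9 o’clock.'
print(clean_text(sample).split())
```

Only the full stop survives as punctuation, so it can later serve as the sentence boundary token.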

# Markov Chain
First, we build a language model based on n-grams and Markov chains.
A Markov chain is a stochastic model in which the probability of the next event depends solely on the current state, i.e. the outcome of the previous event.
For tokenization we use the nltk library.

In [6]:
from nltk.tokenize import WordPunctTokenizer
nltk.download('punkt')
tokens = WordPunctTokenizer().tokenize(text)
#tokens = nltk.word_tokenize(text)
print("Number of tokens: ", len(tokens))
vocab = list(set(tokens))
vocab_len = len(vocab)
print("Length of vocabulary: ", vocab_len)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Number of tokens:  448908
Length of vocabulary:  14954


We use collections.Counter to count the n-grams.

In [7]:
from nltk.util import ngrams
from collections import Counter

unigrams = Counter(ngrams(tokens,1))
bigrams = Counter(ngrams(tokens,2))
trigrams = Counter(ngrams(tokens,3))
fourgrams = Counter(ngrams(tokens,4))
grams = [unigrams, bigrams, trigrams, fourgrams]
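
For readers without nltk at hand, the same counting can be sketched with the standard library alone. The helper `count_ngrams` and the toy sentence are assumptions made for illustration; the keys are tuples, as with `nltk.util.ngrams`.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, keyed by a tuple of length n."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

toy = 'i do not know . i do not care .'.split()
print(count_ngrams(toy, 1)[('.',)])
print(count_ngrams(toy, 2)[('do', 'not')])
print(count_ngrams(toy, 3)[('i', 'do', 'not')])
```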

Let's look at the counts of the unigram 'over' and the bigram 'over the'.

In [8]:
grams[0][('over',)]

574

In [9]:
bigrams[('over', 'the')]

133

Now, based on the counts we calculate probabilities of next words.

In [10]:
def calculate_prob(next_word, n_gram_key, grams, vocab, k=0.):
  vocab_len = len(vocab)
  gram_len = len(n_gram_key)
  if n_gram_key:
    n_gram = n_gram_key + (next_word,)
    if grams[gram_len-1][n_gram_key]:
      prob = (grams[gram_len][n_gram] + k) / (grams[gram_len-1][n_gram_key] + k*vocab_len)
    else:
      prob = 0  
  else:
    prob = (grams[0].get((next_word,), 0.)  + k) / (np.sum(list(grams[0].values())) + k*vocab_len)
  return prob

print("Probability of 'the' after the word 'over': ", calculate_prob('the', ('over',), grams, vocab, k=0.0))
print("Probability of 'the' itself: ", calculate_prob('the', '', grams, vocab, k=0.0))

0.23170731707317074
0.04496244219305515
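
The first value is simply the ratio of the two counts shown earlier, 133 / 574. The sketch below repeats the add-k calculation on a toy corpus; the corpus and the helper `add_k_prob` are hypothetical and only mirror the formula used in `calculate_prob`.

```python
from collections import Counter

toy_tokens = 'the cat sat on the mat . the cat ran .'.split()
toy_vocab = sorted(set(toy_tokens))
uni = Counter(toy_tokens)
bi = Counter(zip(toy_tokens, toy_tokens[1:]))

def add_k_prob(prev, word, k=0.0):
    """P(word | prev) = (count(prev, word) + k) / (count(prev) + k * |V|)."""
    return (bi[(prev, word)] + k) / (uni[prev] + k * len(toy_vocab))

print(add_k_prob('the', 'cat'))          # unsmoothed: 2 occurrences out of 3
print(add_k_prob('the', 'ran', k=1.0))   # unseen bigram gets a non-zero share
```

With k > 0 every word in the vocabulary receives some probability mass, which is why the smoothed rows are no longer sparse.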


Next we build the transition matrices from these probabilities, with optional Laplace (add-k) smoothing. We store them in the Compressed Sparse Row format from scipy. To keep the matrices sparse under smoothing, so that they fit into RAM, we store the probability assigned to non-occurring words in a separate variable, 'zero_smooth'.

In [11]:
def make_transition(grams, vocab, k=0.):
  target_dict = {token: num for num, token in enumerate(vocab)}
  i = 1
  grams_keys_dict = [{}, {}, {}]
  trans_mat = []
  # unigram
  uni_prob = []
  for name in target_dict.items():
    uni_prob.append(calculate_prob(name[0], '', grams, vocab, k=k))
  uni_prob = np.array(uni_prob)
  uni_prob = uni_prob/np.sum(uni_prob)
  trans_mat.append(uni_prob)
  # 2 to 4-grams
  for i in range(3):
    key = ''
    row = -1
    rows = [] 
    cols = []
    probs = []
    for word in sorted(list(grams[i+1].keys()), key=lambda tup: tup[:-1]):
      key_p = word[:-1]
      if key_p != key:
        key = key_p
        row += 1
        grams_keys_dict[i][key] = row
      target = word[-1:]
      col = target_dict[target[0]]
      prob = calculate_prob(target[0], key, grams, vocab, k=k)
      rows.append(row)
      cols.append(col)
      probs.append(prob)
    trans_mat.append(csr_matrix((probs, (rows, cols)), shape=(row+1, len(vocab))))
  if k != 0.:
    zero_smooth = k / (k*len(vocab))  # use the argument, not the global vocab_len
  else:
    zero_smooth = 0.
  return trans_mat, target_dict, grams_keys_dict, zero_smooth

trans_mat, target_dict, grams_keys_dict, zero_smooth = make_transition(grams, vocab, k=1e-15)
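
The idea of keeping only the observed entries and filling in a shared fallback on demand can be sketched as follows. A plain dict stands in for one CSR row here, and all numbers are made up for illustration.

```python
# Hypothetical toy data: one sparse row maps column index -> stored probability.
sparse_row = {0: 0.6, 3: 0.4}
vocab_size = 5
zero_smooth = 1e-3   # fallback for words never seen after this context

def dense_row(sparse_row, vocab_size, zero_smooth):
    """Expand a sparse row, substituting the fallback for missing entries."""
    return [sparse_row.get(col, zero_smooth) for col in range(vocab_size)]

row = dense_row(sparse_row, vocab_size, zero_smooth)
print(row)
```

Only two values are stored, yet every column gets a non-zero probability once the row is expanded.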

We create a function that returns the next word given a key and the transition matrices. It mixes the predictions of all n-gram orders, each scaled by its weight. In addition, a temperature parameter rescales the final word probabilities.

In [12]:
def return_word(key, trans_mat, target_dict, grams_keys_dict, vocab, 
                zero_smooth, weights=[0.25, 0.25, 0.25, 0.25], temp=1.0):
  prob = np.zeros(len(vocab))
  for x in range(len(key)):
    key_p = key[-len(key)+x:]
    i = len(key_p)-1
    row_n = grams_keys_dict[i].get(key_p, -1)
    if row_n != -1:
      row = [el if el != 0. else zero_smooth for el in trans_mat[i+1][row_n].toarray()[0]]
    else:
      # make it smaller
      row = [zero_smooth / len(vocab)] * len(vocab)
    prob += np.array(row)*weights[x]
  prob += trans_mat[0]*weights[0]
  prob = np.where(prob == 0, 0, np.log(prob + 1e-15)) / temp
  prob = np.exp(prob)
  prob = prob / np.sum(prob)
  word = np.random.choice(vocab, p=prob)
  return word

return_word(('.', 'the', 'book'), trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth)

'maple'
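
The weighted mixing of n-gram orders amounts to a linear interpolation of probability vectors. The following sketch shows it for two toy distributions; `interpolate` and the numbers are assumptions for illustration, not the notebook's code.

```python
def interpolate(distributions, weights):
    """Weighted sum of equally long probability vectors, renormalised."""
    size = len(distributions[0])
    mixed = [sum(w * dist[i] for dist, w in zip(distributions, weights))
             for i in range(size)]
    total = sum(mixed)
    return [m / total for m in mixed]

unigram = [0.5, 0.3, 0.2]   # context-free estimate
bigram = [0.1, 0.8, 0.1]    # estimate given the previous word
mixed = interpolate([unigram, bigram], [0.25, 0.75])
print(mixed)
```

Shifting weight towards the higher-order vector makes the result follow the context more closely, at the price of relying on sparser counts.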

Let's see how the function works.

In [13]:
return_word(('.', 'the', 'book'), trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth)

'prominent'

In [14]:
return_word(('.',), trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth)

'you'
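
The temperature step in `return_word` divides the log-probabilities by `temp` and renormalises. A minimal stand-alone sketch of that rescaling, with a made-up distribution, could look like this:

```python
import math

def apply_temperature(probs, temp):
    """Rescale a probability vector to p_i ** (1 / temp), renormalised."""
    scaled = [math.exp(math.log(p + 1e-15) / temp) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]
for temp in (1.0, 0.5, 0.1):
    print(temp, [round(p, 3) for p in apply_temperature(probs, temp)])
```

At `temp=1.0` the distribution is unchanged; as `temp` approaches zero almost all mass moves to the most likely word, so sampling becomes nearly deterministic.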

Now we create a function that generates out_len tokens from a provided string.

In [15]:
def make_text(input_string, out_len, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, weights=[0.25, 0.25, 0.25, 0.25], temp=1.0):
  
  if input_string:
    text = input_string.lower()
    text = re.sub(r'\s+', ' ', text)
    text = text.translate(str.maketrans("", "", my_punctuation))
    tokens = nltk.word_tokenize(text)
  else:
    tokens = [return_word(('.',), trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, weights, temp)]

  for x in range(out_len):
    n_gram = tuple(tokens[-3:])
    tokens.append(return_word(n_gram, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, weights, temp))
  return tokens

We also add a postprocessing step to make the output human-readable.

In [16]:
def preprocess_output(tokens):
  tokens_out = []
  cap = True
  for tok in tokens:
    if cap:
      word = tok.capitalize()
    else:
      word = tok
    tokens_out.append(word)
    cap = False
    if tok == '.':
      cap = True
  string_out = ' '.join(tokens_out)
  string_out = string_out.replace(' .', '.')
  string_out = string_out.replace(' ve ', ' have ')
  string_out = string_out.replace(' re ', ' are ')
  string_out = string_out.replace(' i ', ' I ')
  string_out = string_out.replace(' mr ', ' Mr. ')
  string_out = string_out.replace(' mrs ', ' Mrs. ')
  string_out = string_out.replace(' s ', "'s ")
  return string_out + '...'

Now let's see how the created model works.

In [17]:
preprocess_output(make_text('The murder on', 50, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, temp=1.0))

'The murder on all clever invention independent lent the indecision. Elements. Shielding commit crimes efficient because gestures derangement these swayed feet way adventurers discomposed. But I yessir rags said treacherous betrayal leave you it man appreciate your nose felt he passed his tongue cupboards completely unmade curdling noise refreshed inspectors...'

Since the longest n-gram is four words long, the text makes no sense globally, although locally it may sometimes seem correct. To increase local coherence we can give more weight to the longer n-grams.

In [18]:
preprocess_output(make_text('The murder on', 50, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, weights=[0.01, 0.09, 0.4, 0.5], temp=1.0))

'The murder on untruthful backs extricating rejoinder disbelieve foeman flourished coal lift. Foisted design weaken accent nerves surveying underground and bringing out waving extravagant incensed magistrate may be loveliness that inglethorpe habits unaffectedly well I proudly extreme domes of hob seeking disgorge substantial fabric paintings aiding him. Importantly. Infatuated destroyed...'

Decreasing the temperature reduces the randomness of the text, making the model use simpler and more common words.

In [19]:
preprocess_output(make_text('The murder on', 50, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, temp=0.5))

'The murder on I just caught the train. Of course. Of course. I can tell you. That whistle was the signal knock the demand for a number and the reply was full and they wandered about looking for a friend of yours arthur minks alias the tactful chaperone....'

In [20]:
preprocess_output(make_text('The murder on', 50, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, temp=0.1))

'The murder on I just devour the papers. I do not know. I was not. I do not know. I have got a plan. Obviously what we have got to do with the crime. I said. I was not. I have got a plan....'

Let's see what happens if the weights are changed as well.

In [21]:
preprocess_output(make_text('The murder on', 50, trans_mat, target_dict, grams_keys_dict, vocab, zero_smooth, weights=[0.01, 0.09, 0.4, 0.5], temp=0.1))

'The murder on the other hand. I was a little. I am not a little. I was a little. I was a little. I was a little man. I do not know. I said. I was a little. I had been a little....'

Giving more weight to the longer n-grams and reducing the randomness by setting the temperature to 0.1 leads to an 'I was a ...' loop.

# Recurrent neural networks: Long short-term memory cells
Now, to increase the length of the modelled sequences and to go beyond a simple count-based model, we are going to use a recurrent neural network with LSTM cells.  
For that, we import the necessary components.


In [22]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

This time we use the TextVectorization layer provided by TensorFlow for preprocessing. For that we need to check our vocabulary size.

In [23]:
len(vocab)

14954

We build our vectorization layer. We also prepend [BEG] tokens to mark the beginning of a sequence.

In [24]:
VOCAB_SIZE = 14500
SEQUENCE_LENGTH = 10
EMBED_SIZE = 128

preprocessed_text = ' '.join(['[BEG]']*SEQUENCE_LENGTH  + tokens)

vectorize_layer = TextVectorization(
    standardize=None,
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH)

vectorize_layer.adapt([preprocessed_text])

Let's check the length of the preprocessed text.

In [26]:
len(preprocessed_text.split())

448918

And the ten most common words.

In [27]:
print(vectorize_layer.get_vocabulary()[:10])

['', '[UNK]', '.', 'the', 'i', 'to', 'a', 'of', 'and', 'you']


If we want we may tokenize a simple sentence.

In [28]:
output = vectorize_layer([["[BEG] asdsa my help ."]])
output.numpy()[0]

array([3032,    1,   25,  401,    2,    0,    0,    0,    0,    0])
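
The output above shows the three conventions of the layer: index 0 is padding, index 1 is the unknown token, and the remaining indices follow word frequency. The sketch below imitates this behaviour in plain Python on a toy corpus; `build_vocab` and `vectorize` are hypothetical helpers, not the TensorFlow implementation.

```python
from collections import Counter

def build_vocab(tokens, max_tokens):
    """Index 0 is padding, 1 is unknown, then tokens by descending frequency."""
    common = [tok for tok, _ in Counter(tokens).most_common(max_tokens - 2)]
    return ['', '[UNK]'] + common

def vectorize(sentence, vocab, seq_length):
    """Map words to ids, truncate to seq_length and pad with zeros."""
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in sentence.split()][:seq_length]
    return ids + [0] * (seq_length - len(ids))

toy_vocab = build_vocab('. the the my help . the'.split(), max_tokens=6)
ids = vectorize('the asdsa my help .', toy_vocab, seq_length=8)
print(toy_vocab)
print(ids)
```

The made-up word 'asdsa' maps to 1, and the sequence is padded with zeros up to the requested length, just as in the cell above.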

Building the training sequences for the whole dataset at once would overflow the RAM, so we use a DataGenerator. It needs to define the \__init__, \__len__, \__getitem__ and on_epoch_end methods.

In [29]:
class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, text, vectorize_layer, batch_size=32, seq_length=8, shuffle=True):
        self.tokens = text.split()
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.shuffle = shuffle
        self.length = len(self.tokens)-self.seq_length
        output = vectorize_layer.get_vocabulary()
        self.output_pos = {word:num for num, word in enumerate(output)}
        self.on_epoch_end()

    def __len__(self):
        return int(np.floor(self.length/self.batch_size))

    def __getitem__(self, index):
        ids = self.indexes[self.ind:self.ind+self.batch_size]
        self.ind += self.batch_size
        # Generate data
        X, y = self.__data_generation(ids)
        return X, y

    def on_epoch_end(self):
        self.indexes = np.arange(self.length)
        self.ind = 0
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, ids):
        # Initialization
        X = []
        y = []
        for x in ids:
          X.append(' '.join(self.tokens[x:x+self.seq_length]))
          y.append(self.output_pos.get(self.tokens[x+self.seq_length:x+self.seq_length+1][0], 
                                       self.output_pos['[UNK]']))
        return vectorize_layer(X), tf.convert_to_tensor([y])
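
Each training example produced by the generator is a window of `seq_length` tokens together with the token that follows it. A minimal sketch of that pairing, independent of Keras and using a made-up sentence:

```python
def sliding_windows(tokens, seq_length):
    """Yield (context, next_token) pairs, one per start position."""
    for start in range(len(tokens) - seq_length):
        yield tokens[start:start + seq_length], tokens[start + seq_length]

toy = 'the door swung open and the man came in .'.split()
pairs = list(sliding_windows(toy, seq_length=3))
print(len(pairs))
print(pairs[0])
print(pairs[-1])
```

Because the pairs are generated lazily from start positions, only the token list itself has to stay in memory.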

Finally, we create our RNN model. We use an embedding layer, two LSTM layers and two dense layers with a dropout layer between them. Normally, with one-hot encoded targets, the last dense layer would need as many units as there are vocabulary entries, in our case 14500. To avoid such a wide output layer, the last dense layer has only as many units as the embedding vector, in our case 128, and its output is multiplied by the transposed embedding matrix, i.e. we take its dot product with the embedding vector of every word. Moreover, if a pretrained embedding layer is used or the dataset is sufficiently large, the embedding vectors of similar words will be close to each other, so even if the generator makes a mistake the sentence may still make sense. For example, mistaking 'king' for 'man' is better than mistaking 'king' for 'guitar'.

In [30]:
class GenText(tf.keras.Model):
  def __init__(self, vocab_size, embed_size, seq_length):
    super(GenText, self).__init__(name='')
    self.embed = tf.keras.layers.Embedding(vocab_size, embed_size, input_length=seq_length)
    self.lstm1 = tf.keras.layers.LSTM(256, return_sequences=True)    
    self.lstm2 = tf.keras.layers.LSTM(128)
    self.dens1 = tf.keras.layers.Dense(512, activation='relu')
    self.drop = tf.keras.layers.Dropout(rate=0.2)
    self.dens2 = tf.keras.layers.Dense(embed_size, activation='relu')
    self.softmax = tf.keras.layers.Softmax()

  def call(self, x):
    x = self.embed(x)
    x = self.lstm1(x)
    x = self.lstm2(x)                      
    x = self.dens1(x)
    x = self.drop(x)
    x = self.dens2(x)
    embed_matrix = self.embed.weights
    x = tf.matmul(x, tf.transpose(embed_matrix[0]))
    x = self.softmax(x)
    return x

model = GenText(VOCAB_SIZE, EMBED_SIZE, SEQUENCE_LENGTH)
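
The reuse of the embedding matrix as the output projection can be illustrated without TensorFlow. In the sketch below the 3 x 2 embedding matrix and the hidden vector are invented numbers, and `tied_output` is a hypothetical helper that mirrors the `matmul` and softmax steps of `call`.

```python
import math

def tied_output(hidden, embedding):
    """Score each word by the dot product of the hidden vector with its
    embedding row, then normalise the scores with a softmax."""
    logits = [sum(h * e for h, e in zip(hidden, row)) for row in embedding]
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

embedding = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]   # 3 words, 2 dimensions
hidden = [2.0, 0.0]                                  # output of the last dense layer
probs = tied_output(hidden, embedding)
print([round(p, 3) for p in probs])
```

Words 0 and 1 have similar embeddings and therefore receive similar probabilities, while the dissimilar word 2 gets almost none.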

We train the model for 50 epochs with the Adam optimizer.

In [31]:
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam') 

train_generator = DataGenerator(preprocessed_text, vectorize_layer, 
                                batch_size=128, seq_length=SEQUENCE_LENGTH , shuffle=True)
model.fit(train_generator, 
          epochs=50
          )

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f1c71879a10>

We save the model for further re-use.

In [32]:
!mkdir data
model.save('data/lstm_model')



INFO:tensorflow:Assets written to: data/lstm_model/assets




Next we build the prediction routines. This time, besides the temperature, the routine also supports top-k sampling.

In [33]:
def make_text(input_string, vectorize_layer, num_of_words, k=1, temp=1.0):
  seq_length = vectorize_layer._output_sequence_length
  output = vectorize_layer.get_vocabulary()
  if input_string:
    text = input_string.lower()
    text = re.sub(r'\s+', ' ', text)
    text = text.translate(str.maketrans("", "", my_punctuation))
    tokens = text.split()
    if len(tokens) < seq_length:
      tokens = ['[BEG]']*(seq_length-len(tokens)) + tokens
    text = ' '.join(tokens[-seq_length:])  
  else:
    text = ' '.join(['[BEG]']*seq_length)
  vec = vectorize_layer([text])
  text = vec.numpy()[0]
  for x in range(num_of_words):
    pred = np.array(model.predict(vec))
    top_k = pred[0].argsort()[-k:][::-1]  # reuse pred instead of running the model twice
    top_k_prob = pred[0][top_k]
    top_k_prob = np.exp(np.log(top_k_prob + 1e-15) / temp)
    top_k_prob = top_k_prob/np.sum(top_k_prob)
    choice = np.random.choice(top_k, p=top_k_prob)
    text = np.append(text, choice)
    vec = np.array([text[-seq_length:]])
  tokens = [output[number] for number in text]
  return np.array(tokens)
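
Top-k sampling keeps only the k most likely candidates, rescales them with the temperature and samples among them. The sketch below shows the selection step on a made-up distribution using only the standard library; `sample_top_k` is an illustrative helper, not the routine above.

```python
import math
import random

def sample_top_k(probs, k, temp=1.0, rng=random):
    """Keep the k most likely indices, rescale by temperature, then sample."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [math.exp(math.log(probs[i] + 1e-15) / temp) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]

probs = [0.05, 0.4, 0.3, 0.2, 0.05]
print(sample_top_k(probs, k=1))            # k=1 always picks the most likely index
print(sample_top_k(probs, k=3, temp=0.5))  # one of the indices 1, 2 or 3
```

With `k=1` the procedure reduces to greedy decoding, which explains why the first generated text below is fully deterministic.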

In [34]:
def preprocess_output_lstm(tokens):
  tokens = tokens[np.invert(np.array(tokens)=='[BEG]')]
  tokens_out = []
  cap = True
  for tok in tokens:
    if cap:
      word = tok.capitalize()
    else:
      word = tok
    tokens_out.append(word)
    cap = False
    if tok == '.':
      cap = True
  string_out = ' '.join(tokens_out)
  string_out = string_out.replace(' .', '.')
  string_out = string_out.replace(' re ', ' are ')
  string_out = string_out.replace(' i ', ' I ')
  string_out = string_out.replace(' mr ', ' Mr. ')
  string_out = string_out.replace(' mrs ', ' Mrs. ')
  return string_out + '...'

Let's see some predictions.

In [35]:
preprocess_output_lstm(make_text('The murderer on', vectorize_layer, 120, k=1, temp=1.0))

'The murderer on the quay. A few minutes later the man opened the door and the door swung open and the man came into his pocket and handed it to the door. The door flew open and the german drew out. He was a very dirty looking woman with a neat thick mobile thin nose eyeglasses and a foppish clothing. The man had just been murdered. I had not been mistaken. I was not going to see you and I thought you would say nothing. But I m afraid you are not a connoisseur are you. I asked. I asked eagerly. I was not in the least alone. I was in the...'

This looks much better than the n-gram generated text. Let's try a bigger k to increase the randomness of the text.

In [36]:
preprocess_output_lstm(make_text('The murderer on', vectorize_layer, 120, k=3, temp=1.0))

'The murderer on which clue and then comes out in the habit of jealousy but they are not able to be offended but always said lightly. I am afraid so. I do not know what I want for it. The man stared. She is a man who will have to be able to describe the man in the council chamber any sign of the tragedy. The two american men were famous to take a careful effect from the other side that laverguier was a lot of letters from their own lines. He took a sunday stoppered bottle from powder into the room. The car drew up and again and the tall man opened the door and...'

The text is more random, but it loses coherence because of the short sequence length.

In [37]:
preprocess_output_lstm(make_text('The murderer on', vectorize_layer, 120, k=5, temp=1.0))

'The murderer on the outskirts of their dressing table. At last I went out on the right shoulder when the train opened and the next member of the railway expert he had abandoned it by their own devices closely. The doctor was in a state of excitement. I went out on the terrace in front of him and then remarked that it might have struck a faint pack of power in their life. When she had seen hear they know everything that they could. The only man was on the ground with a brief circles heralding as she saw her dead and replaced in a table with his ears he went up to the house with the same...'

We can always decrease the temperature to make the text less random.

In [38]:
preprocess_output_lstm(make_text('The murderer on', vectorize_layer, 120, k=5, temp=0.5))

'The murderer on the outskirts of the lake near and I felt fortified for lifting a fugitive in far under a similar direction which had been a most fair and exquisitely dressed underling with an accomplice and fell edged on the knocker. I do not know who was so charming. Since I was not here to take the police on the scene of the tragedy. I was startled by the case on the lips. I was pleased to suggest that I could apologize I said. I m afraid you are right in it all right. Asked the german quietly. I think that I am not in love with that. I thought you were not going...'

In [39]:
preprocess_output_lstm(make_text('The murderer on', vectorize_layer, 120, k=5, temp=0.1))

'The murderer on the quay. He asked breathlessly. The man came up and down the room. I went on to the window and looked up the drive. I felt a few william with a good deal of internal going to lose their former type. He was standing in a murderous house and a very beautiful lady. I was prepared to secure her with rage and was surprised by our wife. The whole place was empty. I was not in love with you. I m not going to see you. I do not know what to do about it. I am not in the habit of asking you to make it. I...'

Again, lowering the temperature makes the text simpler.

# GPT-2 Transformers
Finally, we use a pretrained GPT-2 transformer, one of the state-of-the-art language models at the time of writing.
# Data preprocessing
First we join all books into a single train.txt file.

In [40]:
file_names = os.listdir('./christie')
files = []
for fil in file_names:
  with open('./christie/{}'.format(fil)) as f:
    files.append(f.read())
text = ' '.join(files).lower()
text = re.sub(r'\s+', ' ', text)
with open('./christie/train.txt', 'w') as f:
  f.write(text)

We install the necessary libraries.

In [41]:
!pip install transformers
!pip install spacy ftfy==4.4.3
!python -m spacy download en

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d8/b2/57495b5309f09fa501866e225c84532d1fd89536ea62406b2181933fb418/transformers-4.5.1-py3-none-any.whl (2.1MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/ae/04/5b870f26a858552025a62f1649c20d29d2672c02ff3c3fb4c688ca46467a/tokenizers-0.10.2-cp37-cp37m-manylinux2010_x86_64.whl (3.3MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/75/ee/67241dc87f266093c533a2d4d3d69438e57d7a90abb216fa076e7d475d4a/sacremoses-0.0.45-py3-none-any.whl (895kB)
Installing collected packages: tokenizers, sacremoses, transformers
Successfully installed sacremoses-0.0.45 tokenizers-0.10.2 transformers-4.5.1
Collecting ftfy==4.4.3
  Downloading https://files.pyt

We load the pretrained model with its tokenizer, together with the routines needed for training and prediction.

In [42]:
from transformers import (GPT2Tokenizer, DataCollatorForLanguageModeling, 
                          TextDataset, GPT2LMHeadModel, TrainingArguments,
                          Trainer, pipeline)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)





We define our TextDataset wrapper.

In [43]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path='./christie/train.txt',
    block_size=128)
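
As far as we understand it, TextDataset tokenizes the whole file and cuts the resulting id sequence into consecutive blocks of `block_size` tokens. The sketch below illustrates that chunking on a stand-in list of ids; `chunk_into_blocks` is a hypothetical helper, and dropping the incomplete last block is an assumption of this sketch.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a long id sequence into consecutive, non-overlapping blocks.
    A trailing remainder shorter than block_size is dropped in this sketch."""
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size]
            for i in range(0, last_start, block_size)]

ids = list(range(10))   # stand-in for the tokenizer output
blocks = chunk_into_blocks(ids, block_size=4)
print(blocks)
```

Every block then serves as one training example for the language model.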



And we configure our training routine.

In [44]:
!mkdir out

In [45]:
training_args = TrainingArguments(
    output_dir = 'out/', 
    overwrite_output_dir = True,
    per_device_train_batch_size = 32,
    learning_rate = 5e-5,
    num_train_epochs = 15,
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator=data_collator,
    train_dataset = train_dataset
)

Finally we train it for 15 epochs.

In [46]:
trainer.train()

Step    Training Loss
500     3.0405
1000    2.726
1500    2.5764
2000    2.4827


TrainOutput(global_step=2235, training_loss=2.6790607904694492, metrics={'train_runtime': 1571.1663, 'train_samples_per_second': 1.423, 'total_flos': 6810779840348160.0, 'epoch': 15.0, 'init_mem_cpu_alloc_delta': 1271455744, 'init_mem_gpu_alloc_delta': 511148032, 'init_mem_cpu_peaked_delta': 0, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': -227045376, 'train_mem_gpu_alloc_delta': 1521358336, 'train_mem_cpu_peaked_delta': 241668096, 'train_mem_gpu_peaked_delta': 8584833024})

To make inferences we use a pipeline.

In [47]:
generator = pipeline('text-generation', tokenizer='gpt2', model='out/checkpoint-500')
generator('The murderer', max_length=120)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The murderer’s own words seemed to ring so true after he left davenheim that he had no difficulty in recalling them. “no, i do not comprehend--no matter what he may have thought to carry them out--but he has made a very good name, and yet this fellow keeps trying to do us harm.” “you remember my father standing under the shade of the window a long time ago, and asking me what has happened to me lately and why?” “we live, my friends, in a village, _bien_, and'

In [48]:
generator('The murderer', max_length=120)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The murderer was a young man of about thirty-three years of age. “i know exactly everything,” said superintendent battle, “but i dare say it’s not all pretty. i wonder if he had time to hide something, or had he to go on hiding out in london to-day so his name wasn’t in the papers?” he drew out a small envelope and opened it hastily. “pardon the delay,” he said. “a good word of apology for miss marvell. but the thing i want to'

This is quite coherent and, thanks to the quotations, reads like a fragment of a book. Let's try top-k sampling with k=5.

In [49]:
generator('The murderer', max_length=120, top_k=5)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The murderer’s dagger, he told us, was held in the hand of the stranger—not always i understand.” “but the thief was holding it?” said the inspector. “if you can’t get hold of it, it must be done out of a drawer. there was a good deal of time we had before you came in to find marthe.” “i will tell you—and i’m afraid not. he did not come to his own assistance. he had just been on board the boat on the way out'

In [50]:
generator('The murderer', max_length=120)[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'The murderer must have committed an absolute hoax, because he deliberately placed the revolver between the two bottles. if he had, it might not have been stolen. but in view of the fact that you had given him the name of mr. davenheim, you could easily guess that he meant to send the money by the name of “_the_ assassin.” “do you think that can reasonably be the case?” “yes, it would.” “what shall not you do, poirot, if you will get hold of this revolver?'

The GPT-2 model gives much better results than the previous models; its output can easily be mistaken for fragments of a book. However, it also requires much more computation time and memory.