# Transformer

https://keras.io/examples/generative/text_generation_with_miniature_gpt/

They suggest at least one million words of text.

https://stackabuse.com/gpt-style-text-generation-in-python-with-tensorflowkeras/

We will work on full sentences. Let's use the bigger dataset and simply remove the longest sentences.

In [2]:
import re
from nltk.tokenize import sent_tokenize

import glob
import textwrap

import random
import numpy as np

import keras_nlp
import tensorflow as tf
from tensorflow import keras
import tensorflow.keras.utils as ku
from tensorflow.keras.utils import Sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend


In [2]:
keras_nlp.__version__

'0.6.1'

## 30 MB of Polish novels

#### **Reading files.**

In [31]:
# removing chapter names

def remove_chapter_names(input_string, regex_string):
  """Delete every match of regex_string from input_string."""
  return re.sub(regex_string, '', input_string)

In [32]:
sentences = []

for file in glob.glob("30 MB noweli/*"):

    #read the file (the with-block closes it automatically)
    with open(file, "r") as myfile:
        text = myfile.read()

    #clean chapter names
    text = remove_chapter_names(text, 'ROZDZIAŁ[^\n]+')
    text = remove_chapter_names(text, 'Rozdział[^\n]+')

    #lower
    text = text.lower()

    #split to sentences
    text = sent_tokenize(text)
    #print("file ", file, " generated ", len(text), " sentences")
    
    sentences.extend(text)
print("We have", len(sentences), "sentences.")

We have 285068 sentences.
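To see what the chapter-name cleanup does, here is a minimal sketch on a made-up two-chapter string. The sample text and the helper name `strip_chapter_headings` are illustrative and not part of the notebook; the sketch merges the two patterns used above into one expression.

```python
import re

def strip_chapter_headings(text: str) -> str:
    """Remove every 'ROZDZIAŁ ...' / 'Rozdział ...' heading up to the end of its line."""
    # [^\n]+ consumes the rest of the heading line but leaves the newline itself.
    return re.sub(r"(?:ROZDZIAŁ|Rozdział)[^\n]+", "", text)

sample = "ROZDZIAŁ I\nBył sobie król.\nRozdział drugi\nI królowa."
cleaned = strip_chapter_headings(sample)
print(repr(cleaned))
```

The heading text disappears but its newline stays, so the sentence tokenizer still sees a line break at that place.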


In [5]:
continuous_corpus = " ".join(sentences)
print("Full text consists of", len(continuous_corpus.replace('\n', ' ').split(' ')), "words.")

Full text consists of 4497449 words.
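Note that `split(' ')` counts an empty string for every extra space, so the figure above may be somewhat inflated. A small sketch on a made-up string shows the difference from `split()`; how much it matters for this corpus was not measured.

```python
text = "ala  ma\nkota "   # double space, a newline and a trailing space

naive = text.replace("\n", " ").split(" ")   # the approach used above
robust = text.split()                        # splits on any run of whitespace

print(len(naive), naive)
print(len(robust), robust)
```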


In [6]:
sentences[:5]

['agada i przypowieść\n\njeśli chcesz poznać stwórcę świata, czytaj agadę.',
 'przez nią zrozumiesz istotę boga, oby był błogosławiony.',
 'dzięki niej będziesz wiedział, jak się zachować i kroczyć jego drogami.',
 'nie traktuj lekko przypowieści.',
 'z małą, groszową świeczką można czasem znaleźć monetę albo cenną perłę.']

#### **Sentence length analysis.**

In [7]:
lens = []
for sentence in sentences:
  lens.append(len(sentence.replace('\n', ' ').split(' ')))

print("Sentences are of length", min(lens), "to", max(lens))

Sentences are of length 1 to 353


In [8]:
#quantiles - 90% of sentences have at most 32 words; about 15% have 5 words or fewer
lens.sort()
print("Quantiles:\n0.15 is", lens[int(0.15*len(lens))],
 "\n0.5 is", lens[int(0.5*len(lens))], 
 "\n0.8 is", lens[int(0.8*len(lens))], 
 "\n0.9 is", lens[int(0.9*len(lens))],
 "\n0.95 is", lens[int(0.95*len(lens))])
print("Let's remove the sentences longer than 40.")

Quantiles:
0.15 is 5 
0.5 is 13 
0.8 is 24 
0.9 is 32 
0.95 is 40
Let's remove the sentences longer than 40.
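The index-based quantile used above can be wrapped in a small helper. The sketch below runs it on made-up lengths (the helper name `quantile_by_index` and the toy values are assumptions) and then applies the cutoff the same way the next cell does.

```python
def quantile_by_index(values, q):
    """Return the element at position int(q * n) of the sorted values (the rule used above)."""
    ordered = sorted(values)
    return ordered[int(q * len(ordered))]

toy_lengths = [1, 2, 3, 5, 8, 13, 21, 34, 40, 55]   # made-up sentence lengths
cutoff = quantile_by_index(toy_lengths, 0.8)

# Keep everything that is not longer than the cutoff.
kept = [n for n in toy_lengths if n <= cutoff]
print(cutoff, len(kept) / len(toy_lengths))
```

Because the cutoff value itself is kept, the retained share is at least the chosen quantile, and can be a little higher.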


In [9]:
#removing long sentences (more than 40 words)
#lowercasing (already done while reading the files)
#replacing newline characters with spaces and em dashes with hyphens

sentences_short = []
for sentence in sentences:
  if not len(sentence.replace('\n', ' ').split(' ')) > 40:
    sentence = sentence.lower()
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('—', '-')
    sentences_short.append(sentence)

In [10]:
lens2 = []
for sentence in sentences_short:
  lens2.append(len(sentence.replace('\n', ' ').split(' ')))

print("Short sentences are of length", min(lens2), "to", max(lens2))

Short sentences are of length 1 to 40


In [11]:
random.shuffle(sentences_short)
sentences_short[:5]

['wreście, nie pozdrowiwszy ich po chrześciańsku, co ich uderzyło obu, stary zawołał.',
 'a spotka, bądź tego pewny!',
 'ale nawet nie czekając tego pociągu, byłbym mógł, ubrawszy się spiesznie, jechać jeszcze tego wieczora, gdyby rodzice mi pozwolili.',
 'był nadzwyczaj nieostrożny, a więc prawdopodobnie młody.',
 'i w pierwszej chwili uniesienia i wdzięczności padł jej do nóg.']

### **I**

#### **Tokenization. No punctuation included**

In [13]:
# Fitting the Tokenizer on the Corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences_short)

# Vocabulary count of the corpus
total_words = len(tokenizer.word_index)

print("Total Unique Words:", total_words)      

Total Unique Words: 222657
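Roughly, the tokenizer ranks words by frequency and numbers them from 1, leaving 0 free for padding. The sketch below imitates only that ranking step in plain Python. It is a simplified stand-in (no punctuation filtering), not the Keras implementation, and `build_word_index` is a hypothetical name.

```python
from collections import Counter

def build_word_index(sentences):
    """Rank words by frequency; the most frequent word gets index 1 (0 stays free for padding)."""
    counts = Counter(word for s in sentences for word in s.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

toy = ["i on i ona", "on i kot"]
word_index = build_word_index(toy)
vocab_size = len(word_index) + 1        # +1 because index 0 is reserved for padding
print(word_index, vocab_size)
```

Because the largest index equals `len(word_index)`, any layer indexed by token id needs `len(word_index) + 1` entries.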


In [14]:
# Converting the text into sequences of token indices (the embeddings are learned later, inside the model)
input_sequences = []
for sentence in sentences_short:
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    input_sequences.append(token_list)

#### **Padding.**

In [15]:
maxlen = max(lens2)

input_sequences = np.array(pad_sequences(input_sequences, maxlen=maxlen+1, padding='pre'))  #maxlen +1

# predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
# #label = ku.to_categorical(label1, num_classes=total_words+1)

In [16]:
# predictors.shape, label.shape      #sample of length 40 was cut by one (because it is a label)
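What `pad_sequences(..., padding='pre')` does to a single sentence can be imitated in a few lines. This is a plain-Python sketch with a toy `maxlen`; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(seq, length, pad_id=0):
    """Left-pad (or left-truncate) a list of token ids to exactly `length` items."""
    seq = list(seq)[-length:]                 # keep the last `length` tokens if too long
    return [pad_id] * (length - len(seq)) + seq

maxlen = 5                                    # toy value; the notebook uses 40
padded = pad_pre([7, 3, 9], maxlen + 1)       # one extra slot for the shifted labels
print(padded)
```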

In [17]:
#lookup dictionary
tokenizer.index_word

{1: 'i',
 2: 'się',
 3: 'w',
 4: 'nie',
 5: 'na',
 6: 'z',
 7: 'do',
 8: 'to',
 9: 'że',
 10: 'a',
 11: 'o',
 12: 'ale',
 13: 'jak',
 14: 'co',
 15: 'tak',
 16: 'za',
 17: 'po',
 18: 'jest',
 19: 'go',
 20: 'od',
 21: 'już',
 22: 'mu',
 23: 'jego',
 24: 'było',
 25: 'mnie',
 26: 'tego',
 27: 'jej',
 28: 'tylko',
 29: 'mi',
 30: 'był',
 31: '–',
 32: 'dla',
 33: 'sobie',
 34: 'jeszcze',
 35: 'tym',
 36: 'ich',
 37: 'ja',
 38: 'przez',
 39: 'bo',
 40: 'ze',
 41: 'gdy',
 42: 'który',
 43: 'może',
 44: 'ten',
 45: 'aby',
 46: 'czy',
 47: 'pan',
 48: 'tu',
 49: 'nim',
 50: 'ją',
 51: 'pod',
 52: 'rzekł',
 53: 'by',
 54: 'tej',
 55: 'nawet',
 56: 'ci',
 57: 'on',
 58: 'przed',
 59: 'była',
 60: 'które',
 61: 'tam',
 62: 'być',
 63: 'przy',
 64: 'wszystko',
 65: 'iż',
 66: 'ma',
 67: 'teraz',
 68: 'sam',
 69: 'nic',
 70: 'więc',
 71: 'miał',
 72: 'nad',
 73: 'będzie',
 74: 'kiedy',
 75: 'u',
 76: 'też',
 77: 'bez',
 78: 'bardzo',
 79: 'ani',
 80: 'jako',
 81: 'lecz',
 82: 'tych',
 83: 'niego'

#### **Tensorflow Dataset instead of Generator with categorization.**

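Problem with model.fit: https://stackoverflow.com/questions/56604825/keras-invalidargumenterror-with-model-fit. A Sequence generator that yields a single next-word label per sample does not fit a model that predicts a token at every position in parallel, so a tf.data.Dataset with one label per position is used instead.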

In [18]:
# class DataGenerator(Sequence):
#     def __init__(self, x_set, y_set, batch_size):
#         self.x, self.y = x_set, y_set
#         #self.x, self.y = tf.expand_dims(x_set, -1), tf.expand_dims(y_set, -1)
#         self.batch_size = batch_size

#     def __len__(self):
#         return int(np.ceil(len(self.x) / float(self.batch_size)))

#     def __getitem__(self, idx):
#         batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
#         #it would be possible here to additionally normalize the values -> batch_x = batch_x / float(total_words) 
#         batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
#         batch_y = ku.to_categorical(batch_y, num_classes=total_words+1)
#         return batch_x, batch_y

# train_gen = DataGenerator(predictors, label, 1)  #the smallest possible batch

In [19]:
# a, b = train_gen.__getitem__(0)
# a, b

In [20]:
# a.shape, b.shape

In [21]:
batch_size = 8

train_dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
train_dataset = train_dataset.shuffle(buffer_size=256)
train_dataset = train_dataset.batch(batch_size)
type(train_dataset)

tensorflow.python.data.ops.dataset_ops.BatchDataset

In [23]:
maxlen = max(lens2)

def preprocessing(text):
    text = tf.expand_dims(text, -1)
    print(text.shape)
    predictors, labels = text[:, :-1], text[:, 1:]    #inputs and labels are offset by one token; the label is a whole sequence, not a single word
    print(predictors.shape, labels.shape)
    return predictors, labels

In [24]:
train_dataset = train_dataset.map(preprocessing)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

(None, 41, 1)
(None, 40, 1) (None, 40, 1)
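The offset-by-one split means that every position is trained to predict the token that follows it. A toy row (values made up) makes the pairing visible:

```python
row = [0, 0, 0, 7, 3, 9]           # one pre-padded sequence of length maxlen + 1

predictors, labels = row[:-1], row[1:]

# Print which token the model sees at each position and which one it should predict.
for position, (seen, target) in enumerate(zip(predictors, labels)):
    print(position, seen, "->", target)
```

With pre-padding, the first positions are trained to predict padding (0), and only the tail of the row carries real words.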


In [23]:
# for entry in train_dataset.take(1):
#     print(entry)

#### **Small model with TokenAndPositionEmbedding layer.**

In [29]:
embed_dim = 32  #initially 128
num_heads = 4

def create_model():
    inputs = keras.layers.Input(shape=(maxlen, ), dtype=tf.int32, name='transf_input')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words, maxlen, embed_dim, name='transf_embed')(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads, 
                                                            dropout=0.5, 
                                                            name='transf_decod')(embedding_layer)
    outputs = keras.layers.Dense(total_words, activation='softmax', name='transf_dense')(decoder)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model_transf = create_model()
model_transf.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input (InputLayer)   [(None, 40)]              0         
                                                                 
 transf_embed (TokenAndPosit  (None, 40, 32)           7126304   
 ionEmbedding)                                                   
                                                                 
 transf_decod (TransformerDe  (None, 40, 32)           6464      
 coder)                                                          
                                                                 
 transf_dense (Dense)        (None, 40, 222657)        7347681   
                                                                 
Total params: 14,480,449
Trainable params: 14,480,449
Non-trainable params: 0
_________________________________________________________________
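The parameter counts in the summary can be reproduced by hand. The breakdown of the decoder below assumes a standard block (self-attention with four projections, a two-layer feed-forward part and two layer normalizations); that layout is an assumption, but it matches the reported 6,464.

```python
vocab, maxlen, d = 222657, 40, 32      # values from the summary above

embedding = vocab * d + maxlen * d     # token table + position table
attention = 4 * (d * d + d)            # query, key, value and output projections
feed_forward = 2 * (d * d + d)         # two dense layers with intermediate_dim = d
layer_norms = 2 * (2 * d)              # scale and offset for two normalizations
decoder = attention + feed_forward + layer_norms
dense = d * vocab + vocab              # weights + biases of the softmax layer

print(embedding, decoder, dense, embedding + decoder + dense)
```

Almost all parameters sit in the embedding table and the output layer, both of which scale with the vocabulary size.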


**Training**

Small parameters: batch_size = 8, epochs = 5, embed_dim = 32, num_heads = 4 -> NaN weights.

Small parameters: batch_size = 16, epochs = 2, embed_dim = 32, num_heads = 4 -> ResourceExhaustedError.

Small parameters: batch_size = 8, epochs = 2, embed_dim = 64, num_heads = 4 -> ResourceExhaustedError.

Small parameters: batch_size = 8, epochs = 2, embed_dim = 32, num_heads = 4, Adam learning_rate=0.0003 (initially 0.001) -> NaN weights.

Small parameters: batch_size = 8, epochs = 1, embed_dim = 32, num_heads = 4, Adam learning_rate=0.0001 (initially 0.001) -> NaN weights.

In [None]:
history = model_transf.fit(train_dataset, epochs=1)

**Saving.**

Error with layer names: https://stackoverflow.com/questions/73187155/valueerror-unable-to-create-dataset-name-already-exists .

In [32]:
#https://stackoverflow.com/questions/72776335/valueerror-unable-to-create-dataset-name-already-exists-when-using-modelcheck

# for i in range(len(model_transf.weights)):
#     model_transf.weights[i]._handle_name = model_transf.weights[i].name + "_" + str(i)

model_transf.weights[1]._handle_name

'transf_embed/embeddings:0'

In [33]:
#model_transf.export(transformer)    #AttributeError: 'Functional' object has no attribute 'export'

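The problem discovered is NaN weights. It looks like a gradient explosion in a tiny model, although lowering the learning rate did not help. Another possible cause is an off-by-one in the vocabulary size: Tokenizer indices run from 1 to len(word_index), so the rarest word has index 222657, which is out of range for layers built with total_words = 222657 entries (valid indices 0 to 222656). On a GPU, an out-of-range label in sparse categorical cross-entropy yields a NaN loss rather than an error.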

https://stackoverflow.com/questions/66542007/transformer-model-output-nan-values-in-pytorch

In [34]:
model_transf.weights[0]   #nans!!

<tf.Variable 'transf_embed/embeddings:0' shape=(222657, 32) dtype=float32, numpy=
array([[        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       ...,
       [ 0.00542013,  0.00179588, -0.00123428, ...,  0.00217171,
         0.00526386,  0.00295905],
       [-0.00165877, -0.00269027, -0.00138578, ...,  0.00394636,
        -0.00650978,  0.00835048],
       [ 0.00433999,  0.00709753,  0.00513451, ..., -0.0047528 ,
        -0.00021433, -0.00423851]], dtype=float32)>
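Since Tokenizer indices start at 1, one check worth running before training is whether the largest token id fits the sizes of the embedding and output layers. A minimal sketch with toy values (`check_token_range` is a hypothetical helper):

```python
def check_token_range(sequences, vocab_size):
    """Return the largest token id and whether every id fits into [0, vocab_size)."""
    largest = max(max(seq) for seq in sequences if seq)
    return largest, largest < vocab_size

total_words = 4                         # toy stand-in for len(tokenizer.word_index)
toy_sequences = [[1, 2], [3, 4, 1]]     # ids run from 1 to total_words

sized_exactly = check_token_range(toy_sequences, total_words)       # layer with total_words entries
sized_plus_one = check_token_range(toy_sequences, total_words + 1)  # layer with total_words + 1 entries
print(sized_exactly, sized_plus_one)
```

The layers in this notebook are built with total_words entries, so by this check the largest index would not fit; sizing them with total_words + 1 (as in the commented-out to_categorical call earlier) would.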

About the saving problem: after making the layer names and the weight names unique, the model can be saved, but only without the optimizer weights (which are crucial if training is to be continued).

In [50]:
#https://stackoverflow.com/questions/62169315/runtimeerror-unable-to-create-link-name-already-exists-keras

#https://stackoverflow.com/questions/67321942/tensorflow-2x-what-exactly-does-the-parameter-include-optimizer-affect-in-tenso
# Saving the optimizer parameters allows you to restart training in exactly the same state as you saved the checkpoint, 
# whereas without saving the optimizer state, even the same model parameters might result in a variety of training outcomes 
# with different optimizer parameters.

#model_transf.save("transformer.keras", include_optimizer=False)
loaded_model_transf = tf.keras.models.load_model("transformer.keras")



After the optimizer weight names are made unique as well, the full model is finally saved.

In [62]:
# for i in range(len(model_transf.optimizer.weights)):
#     model_transf.optimizer.weights[i]._handle_name = model_transf.optimizer.weights[i].name + "_" + str(i)

model_transf.optimizer.weights[0]._handle_name

'Adam/iter:0_0'
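The renaming trick boils down to appending the position of each variable to its name, so that no two names collide. A plain-Python sketch on a list of names (the third name is made up for illustration):

```python
from collections import Counter

names = [
    "transf_embed/embeddings:0",   # token embeddings
    "transf_embed/embeddings:0",   # position embeddings share the same name
    "transf_dense/kernel:0",       # made-up example of an already unique name
]

duplicates = [name for name, count in Counter(names).items() if count > 1]
unique_names = [f"{name}_{i}" for i, name in enumerate(names)]

print(duplicates)
print(unique_names)
```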

In [64]:
#model_transf.save("transformer_full.keras")
loaded_model_transf = tf.keras.models.load_model("transformer_full.keras")

#### **Model with two decoders.**

In [24]:
embed_dim = 8  #initially 128
num_heads = 2

def create_model_2():
    inputs = keras.layers.Input(shape=(maxlen, ), dtype=tf.int32, name='transf_input_b')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words, maxlen, embed_dim, name='transf_embed_b')(inputs)
    decoder1 = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads,
                                                            name='transf_decod1_b')(embedding_layer)
    decoder2 = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads,
                                                            name='transf_decod2_b')(decoder1)     
    dropout = keras.layers.Dropout(0.5, name='transf_dropout_b')(decoder2)                                                   
    outputs = keras.layers.Dense(total_words, activation='softmax', name='transf_dense_b')(dropout)
    
    model2 = keras.Model(inputs=inputs, outputs=outputs)
    
    model2.compile(
        optimizer=tf.keras.optimizers.Adam(), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model2

model_transf_2 = create_model_2()
model_transf_2.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input_b (InputLayer)  [(None, 40)]             0         
                                                                 
 transf_embed_b (TokenAndPos  (None, 40, 8)            1781576   
 itionEmbedding)                                                 
                                                                 
 transf_decod1_b (Transforme  (None, 40, 8)            464       
 rDecoder)                                                       
                                                                 
 transf_decod2_b (Transforme  (None, 40, 8)            464       
 rDecoder)                                                       
                                                                 
 transf_dropout_b (Dropout)  (None, 40, 8)             0         
                                                             

**Training.**

In [25]:
history2 = model_transf_2.fit(train_dataset, epochs=5)

Epoch 1/5
Epoch 2/5
 1708/33893 [>.............................] - ETA: 55:32 - loss: nan - perplexity: nan - accuracy: 0.6681

KeyboardInterrupt: 
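An accuracy of 0.67 next to a NaN loss does not mean the model is learning: the metric is averaged over all 40 positions, most of which are left padding. As a hedged illustration, the share of padding labels (the score of a model that predicts 0 everywhere) can be computed like this on toy rows; whether that is exactly what happened here is an assumption.

```python
def padding_share(label_rows, pad_id=0):
    """Fraction of label positions that hold the padding id."""
    total = sum(len(row) for row in label_rows)
    padded = sum(row.count(pad_id) for row in label_rows)
    return padded / total

toy_labels = [
    [0, 0, 0, 0, 5, 2],
    [0, 0, 7, 1, 3, 4],
    [0, 0, 0, 0, 0, 9],
]
share = padding_share(toy_labels)
print(round(share, 3))
```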

### **II**

#### **Tokenizer. No punctuation included (restricted number of words - 100 000).**

In [12]:
# Fitting the Tokenizer on the Corpus
tokenizer_restricted = Tokenizer(num_words=100000)
tokenizer_restricted.fit_on_texts(sentences_short)

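# Vocabulary count of the corpus
# (word_index still lists every word; num_words only limits texts_to_sequences to indices below 100000)
total_words = len(tokenizer_restricted.word_index)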

print("Total Unique Words:", total_words)      

Total Unique Words: 222657
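The count is unchanged because `num_words` does not shrink `word_index`; it only makes `texts_to_sequences` skip indices at or above the limit. A plain-Python imitation of that filtering (`texts_to_ids` is a hypothetical stand-in, shown on a toy vocabulary):

```python
def texts_to_ids(sentence, word_index, num_words=None):
    """Map words to ids, silently dropping unknown words and ids >= num_words."""
    ids = []
    for word in sentence.split():
        idx = word_index.get(word)
        if idx is None or (num_words is not None and idx >= num_words):
            continue
        ids.append(idx)
    return ids

word_index = {"i": 1, "on": 2, "ona": 3, "kot": 4}
restricted = texts_to_ids("on i kot", word_index, num_words=3)
print(len(word_index), restricted)
```

With num_words=100000 this keeps indices 1 to 99999, so a model sized for 100 000 tokens would be enough for these sequences.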


In [13]:
# Converting the text into sequences of token indices
input_sequences = []
for sentence in sentences_short:
    token_list = tokenizer_restricted.texts_to_sequences([sentence])[0]
    input_sequences.append(token_list)

In [14]:
maxlen = max(lens2)

input_sequences = np.array(pad_sequences(input_sequences, maxlen=maxlen+1, padding='pre'))  #maxlen +1

# predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
# #label = ku.to_categorical(label1, num_classes=total_words+1)

#### **Dataset**

In [15]:
batch_size = 8

train_dataset = tf.data.Dataset.from_tensor_slices(input_sequences)
train_dataset = train_dataset.shuffle(buffer_size=256)
train_dataset = train_dataset.batch(batch_size)
type(train_dataset)

tensorflow.python.data.ops.dataset_ops.BatchDataset

In [16]:
maxlen = max(lens2)

def preprocessing(text):
    text = tf.expand_dims(text, -1)
    print(text.shape)
    predictors, labels = text[:, :-1], text[:, 1:]    #inputs and labels are offset by one token; the label is a whole sequence, not a single word
    return predictors, labels

In [17]:
train_dataset = train_dataset.map(preprocessing)
train_dataset = train_dataset.prefetch(tf.data.AUTOTUNE)

(None, 41, 1)


#### **Model**

In [19]:
embed_dim = 32  #initially 128
num_heads = 4

def create_model():
    inputs = keras.layers.Input(shape=(maxlen, ), dtype=tf.int32, name='transf_input')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words, maxlen, embed_dim, name='transf_embed')(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads, 
                                                            dropout=0.5, 
                                                            name='transf_decod')(embedding_layer)
    outputs = keras.layers.Dense(total_words, activation='softmax', name='transf_dense')(decoder)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model_transf = create_model()
model_transf.summary()

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input (InputLayer)   [(None, 40)]              0         
                                                                 
 transf_embed (TokenAndPosit  (None, 40, 32)           7126304   
 ionEmbedding)                                                   
                                                                 
 transf_decod (TransformerDe  (None, 40, 32)           6464      
 coder)                                                          
                                                                 
 transf_dense (Dense)        (None, 40, 222657)        7347681   
                                                                 
Total params: 14,480,449
Trainable params: 14,480,449
Non-trainable params: 0
_________________________________________________________________


#### **Training**

In [21]:
history = model_transf.fit(train_dataset, epochs=3)

Epoch 1/3


Epoch 2/3
Epoch 3/3


In [22]:
model_transf.weights

[<tf.Variable 'transf_embed/embeddings:0' shape=(222657, 32) dtype=float32, numpy=
 array([[-0.16373327, -0.07548821,  0.01794494, ..., -0.10588541,
         -0.04432936, -0.05204703],
        [-0.01211155, -0.05791748, -0.0364533 , ..., -0.0373821 ,
          0.22052361, -0.03295827],
        [ 0.05500536, -0.09277041, -0.1420212 , ...,  0.02098169,
         -0.05133377,  0.03100514],
        ...,
        [-0.00215005, -0.00211307, -0.00230877, ..., -0.00181497,
         -0.00453821,  0.00380428],
        [ 0.0042766 ,  0.00036232, -0.0020407 , ...,  0.00509356,
         -0.0043844 ,  0.00034125],
        [ 0.00152116, -0.00226542,  0.00349827, ...,  0.00377238,
         -0.00101423,  0.00429469]], dtype=float32)>,
 <tf.Variable 'transf_embed/embeddings:0' shape=(40, 32) dtype=float32, numpy=
 array([[-0.03179058,  0.00981777, -0.00567404, ..., -0.23976088,
          0.03541889,  0.06951675],
        [-0.01757253, -0.08308528, -0.00318504, ..., -0.1154981 ,
          0.00413809, -0.00

**Saving**

In [24]:
for i in range(len(model_transf.weights)):
    model_transf.weights[i]._handle_name = model_transf.weights[i].name + "_" + str(i)

In [26]:
for i in range(len(model_transf.optimizer.weights)):
    model_transf.optimizer.weights[i]._handle_name = model_transf.optimizer.weights[i].name + "_" + str(i)

In [18]:
#model_transf.save("transformer_restricted.keras")
loaded_model_transf_restricted = tf.keras.models.load_model("transformer_restricted.keras")

**Testing**

In [None]:
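def sample_token(logits):
    # NOTE: the model ends in a softmax layer, so `logits` are already probabilities here
    #print("logits shape: ", logits.shape)
    # keep the 15 most probable tokens
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    # a second softmax over probabilities flattens the top-15 distribution
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    # draw one of the 15 token ids according to preds
    return np.random.choice(indices, p=preds)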
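For comparison, top-k sampling can also renormalize the k probabilities directly instead of passing them through another softmax. The sketch below uses only the standard library; `sample_top_k` and the toy distribution are assumptions, not the notebook's code.

```python
import math
import random

def sample_top_k(probabilities, k=15, rng=random):
    """Draw one index among the k most probable entries, weighted by their probabilities."""
    top = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)[:k]
    weights = [probabilities[i] for i in top]
    # random.choices normalizes the weights, so no extra softmax is needed
    return rng.choices(top, weights=weights, k=1)[0]

def softmax(values):
    exps = [math.exp(v - max(values)) for v in values]
    return [e / sum(exps) for e in exps]

probs = [0.05, 0.6, 0.3, 0.05]          # toy model output (already sums to 1)
rng = random.Random(0)
draws = [sample_top_k(probs, k=2, rng=rng) for _ in range(1000)]

renormalized = [0.6 / 0.9, 0.3 / 0.9]   # top-2 probabilities rescaled to sum to 1
flattened = softmax([0.6, 0.3])         # what a second softmax would give instead
print(renormalized, flattened)
```

On this toy input the second softmax pulls the two candidates closer together, which makes the less likely token win more often than its probability suggests.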

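Mostly the most frequent words are generated. Part of the reason may lie in the loop below: the seed is pre-padded to maxlen+1 and the last column is then cut off, which drops the last seed word, and the prediction is read at position sample_index, which for a short seed still falls inside the left padding.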

In [77]:
seed_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

next_words = 40 #maximum 40

sample_index = 0

while next_words-1 > sample_index:

    #token indices
    token_list = tokenizer_restricted.texts_to_sequences([seed_text])[0]

    #padding
    maxlen = max(lens2)
    test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen+1, padding='pre'))

    #test sample
    test_sequence = test_sequence[:, :-1]

    #predictions
    soft_pred = loaded_model_transf_restricted.predict(test_sequence, verbose=0)
    #print("Softmax predictions shape:", soft_pred.shape)

    sample_index = len(seed_text.strip().split())-1
    #print("sample_index", sample_index)
    sampled_token = sample_token(soft_pred[0][sample_index])
    #print(sampled_token)

    #decoding the token; ids missing from the dictionary (such as padding 0) give ""
    output_word = tokenizer_restricted.index_word.get(sampled_token, "")
    seed_text += " " + output_word

print('\n'.join(textwrap.wrap(seed_text, 80)))

za górami za lasami gdy to  gdy gdy a a po po ale kiedy  kiedy to kiedy gdy tak
z   czy co niego się on gdy to w z na jak to ale do się siebie już że od i nemo
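The positions involved in the loop above can be traced on a toy example. The sketch compares the input as built above with one possible alternative (pad to maxlen and read the last position). The token ids, the toy `maxlen` and the helper `pad_pre` are made up, and the alternative was not run against the trained model.

```python
def pad_pre(tokens, length, pad_id=0):
    """Left-pad (or left-truncate) a list of token ids to exactly `length` items."""
    tokens = list(tokens)[-length:]
    return [pad_id] * (length - len(tokens)) + tokens

maxlen = 8                      # toy value; the notebook uses 40
seed_tokens = [11, 12, 13, 14]  # made-up ids for a four-word seed

# As in the loop above: pad to maxlen + 1, drop the last column, read position len - 1.
as_written = pad_pre(seed_tokens, maxlen + 1)[:-1]
read_position = len(seed_tokens) - 1

# Possible alternative: pad to maxlen and read the last position.
aligned = pad_pre(seed_tokens, maxlen)
aligned_position = maxlen - 1

print(as_written, "token at read position:", as_written[read_position])
print(aligned, "token at read position:", aligned[aligned_position])
```

In the first variant the last seed word (14) never reaches the model and the position that is read holds padding; in the second the read position holds the last seed word, which mirrors how the training rows end.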


#### **More training**

In [97]:
loaded_model_transf_restricted = tf.keras.models.load_model("transformer_restricted.keras")

In [98]:
history2 = loaded_model_transf_restricted.fit(train_dataset, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


**Saving**

In [100]:
for i in range(len(loaded_model_transf_restricted.weights)):
    loaded_model_transf_restricted.weights[i]._handle_name = loaded_model_transf_restricted.weights[i].name + "_" + str(i)

In [101]:
for i in range(len(loaded_model_transf_restricted.optimizer.weights)):
    loaded_model_transf_restricted.optimizer.weights[i]._handle_name = loaded_model_transf_restricted.optimizer.weights[i].name + "_" + str(i)

In [102]:
#loaded_model_transf_restricted.save("transformer_restricted2.keras")
loaded_model_transf_restricted2 = tf.keras.models.load_model("transformer_restricted2.keras")

**Testing**

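More distinct words are used, but many sampled tokens are rejected by the dictionary and show up as blank spaces (they are treated as OOV; note that the lookup returns an empty string for any index missing from word_index, such as the padding index 0). This is visible in the output: for each sentence loop, the result is shorter than the full sentence length.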

In [None]:
def sample_token(logits):
    #print("logits shape: ", logits.shape)
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

In [125]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen-1 > sample_index:

        #token indices
        token_list = tokenizer_restricted.texts_to_sequences([seed_text])[0]

        #padding
        maxlen = max(lens2)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_restricted2.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        #decoding the token; ids missing from the dictionary (such as padding 0) give ""
        output_word = tokenizer_restricted.index_word.get(sampled_token, "")
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami i to a w na po to tak i gdy z gdy po ja ja lecz ale nimi
siebie o potem tym za po tego prostu ci co że tak pomocą dnia że ja i ma mnie
był nie nie to lecz w a i a na nie z na i nie i to zaś do jak mógł co gdy za
mógł świecie będzie znaczy tego go mi to z człowieka nie tych chcę tak kiedy
kiedy i co na to na z w to – i na to z – ale co w zaś się wszystko nim był z był
co do z w tu to głowy tym wszystko kraju w nie gdy co to ale z po a gdy i kiedy
nie – na lecz na ale co tej król się tej za się mógł niego nią to zaś samej to
ale że przyczyny tak woli a lecz kiedy i – – tak tak na z tak tak a nie z tak –
na i on nie ale mi koniec nie teraz sobą i sam ma że był że się ja już tego
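The blanks in the output come from the dictionary lookup returning an empty string for an index it does not contain. A toy sketch of that step (the mapping and the sampled ids are made up):

```python
word_index = {"i": 1, "to": 2, "gdy": 3}                      # toy word -> id mapping
index_word = {idx: word for word, idx in word_index.items()}  # id -> word

sampled = [3, 0, 2, 0, 1]                                     # 0 is the padding id
words = [index_word.get(idx, "") for idx in sampled]

text = " ".join(words)                    # blanks stay visible as double spaces
visible = " ".join(w for w in words if w) # blanks filtered out
print(repr(text))
print(repr(visible))
```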


#### **Even more training.**

In [18]:
loaded_model_transf_restricted2 = tf.keras.models.load_model("transformer_restricted2.keras")

In [19]:
history3 = loaded_model_transf_restricted2.fit(train_dataset, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


**Saving**

In [20]:
for i in range(len(loaded_model_transf_restricted2.weights)):
    loaded_model_transf_restricted2.weights[i]._handle_name = loaded_model_transf_restricted2.weights[i].name + "_" + str(i)

In [21]:
for i in range(len(loaded_model_transf_restricted2.optimizer.weights)):
    loaded_model_transf_restricted2.optimizer.weights[i]._handle_name = loaded_model_transf_restricted2.optimizer.weights[i].name + "_" + str(i)

In [22]:
#loaded_model_transf_restricted2.save("transformer_restricted3.keras")
loaded_model_transf_restricted3 = tf.keras.models.load_model("transformer_restricted3.keras")

**Testing**

In [25]:
def sample_token(logits):
    #print("logits shape: ", logits.shape)
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

In [26]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen-1 > sample_index:

        #token indices
        token_list = tokenizer_restricted.texts_to_sequences([seed_text])[0]

        #padding
        maxlen = max(lens2)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_restricted3.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        #decoding the token; ids missing from the dictionary (such as padding 0) give ""
        output_word = tokenizer_restricted.index_word.get(sampled_token, "")
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami gdy a gdy ale nie ale a co tak ale lecz gdy w tak na ale ja
chwilę to nie to ma na długo tak ręku nich się za na nich to za każdym i mną
słowa jeśli a kiedy kiedy w a tak z tak gdy z i ale nie ale był – miejscu i pan
go domu tak tak początku z i rzekł tego to w jest na co nich świecie może gdy i
– nie gdy a ale ale gdy po – – a a nie ja co chwila z już wyrzekł mu co już za
nie było za sobą ale się nic na na za niego sobą nie to z w z co gdy co i a lecz
na w gdy w po co koniec na mogę tego jego pan to tak życiu nich by czele się
andrzej że i tych nie tak można to po w ale gdy i na to to z nie tak co po ja
ale gdy kilku jak na ten to to rzekłszy wiedział to nie się to człowiek na nie
ma by koniec nie domu


#### **Still more training.**

In [18]:
#this model has already been trained for 10 epochs (3 + 3 + 4)
loaded_model_transf_restricted3 = tf.keras.models.load_model("transformer_restricted3.keras")

In [19]:
history4 = loaded_model_transf_restricted3.fit(train_dataset, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


**Saving**

In [20]:
for i in range(len(loaded_model_transf_restricted3.weights)):
    loaded_model_transf_restricted3.weights[i]._handle_name = loaded_model_transf_restricted3.weights[i].name + "_" + str(i)

In [21]:
for i in range(len(loaded_model_transf_restricted3.optimizer.weights)):
    loaded_model_transf_restricted3.optimizer.weights[i]._handle_name = loaded_model_transf_restricted3.optimizer.weights[i].name + "_" + str(i)

In [22]:
#loaded_model_transf_restricted3.save("transformer_restricted4.keras")
loaded_model_transf_restricted4 = tf.keras.models.load_model("transformer_restricted4.keras")

**Testing**

In [29]:
def sample_token(logits):
    #print("logits shape: ", logits.shape)
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

In [30]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen-1 > sample_index:

        #token ids
        token_list = tokenizer_restricted.texts_to_sequences([seed_text])[0]

        #padding
        maxlen = max(lens2)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_restricted4.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_restricted.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami po ja gdy a tak z a ale gdy a na co – a i ja tak późno nią
prostu to było z go dworze się po było to to wolna go tym już do z warszawy w
nie po gdy po z i gdy gdy a a ale tak to – a gdy była tej tem tym tym ja się nie
mi zwrócił się mierze sposobem nie wiem mu w był na pewny to a kiedy w kiedy tak
po – – gdy i co gdy tak i nie tak końcu świecie na mnie zaś czym co w w to długo
o tak był niej było niej na tylko niego gdyby na nie nie a i ja w na w co po i
po – ale nie w po był mogąc może też tem ty ja co może tym ich się się to nie
jak być mi nie w – i ja nie kiedy na w nie tak tak i to nie po po miał z mógł
odparł mam zaś istocie nie się wiem to nimi julian jak będzie to na nie za
wiedział
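
A possible reason why the output stays a stream of frequent function words is the way the generation loop reads the prediction. The sequence is padded on the left (`padding='pre'`), then the last column is dropped, and the prediction is read at index `len(seed) - 1`, counted from the left. The sketch below replays this on plain lists; `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, and the ids are made up. This is a reading of the code, not something verified against the trained model.

```python
def pad_pre(tokens, length):
    """Left-pad with zeros (or keep the last `length` tokens), like padding='pre'."""
    tokens = tokens[-length:]
    return [0] * (length - len(tokens)) + tokens

maxlen = 8                      # toy value; the notebook uses max(lens2)
seed_tokens = [11, 12, 13, 14]  # hypothetical ids for a 4-word seed

# What the loop above does: pad to maxlen + 1, drop the last column,
# then read the prediction at index len(seed) - 1.
notebook_input = pad_pre(seed_tokens, maxlen + 1)[:-1]
notebook_index = len(seed_tokens) - 1

# Alternative: pad to maxlen and read the prediction at the last position.
aligned_input = pad_pre(seed_tokens, maxlen)
aligned_index = maxlen - 1

print(notebook_input, "-> token at read position:", notebook_input[notebook_index])
print(aligned_input, "-> token at read position:", aligned_input[aligned_index])
```

In the first variant the last seed word never reaches the model, and the position that is read still holds padding. If the decoder applies a causal mask, the prediction at that position cannot depend on the seed at all. Reading the last position of a sequence padded to `maxlen` would match how the training pairs were built.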


### **III**

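Modification of **II**: only a single word (the next one) is used as the label. Training fails with the error below; 320 = 8 × 40 presumably corresponds to a batch of 8 sequences with 40 positions each, so the model still emits one prediction per position while only 8 labels are supplied.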

*InvalidArgumentError: Graph execution error:*

*Input to reshape is a tensor with 320 values, but the requested shape has 8
	 [[{{node ArithmeticOptimizer/ReorderCastLikeAndValuePreserving_bool_Reshape}}]] [Op:__inference_train_function_2524]*

## 10 MB of Polish fairytales and stories

#### **Reading files.**

In [46]:
# removing chapter names
def remove_chapter_names(input_string, regex_string):
    a1 = input_string
    a2 = re.sub(rf'{regex_string}', '', a1)
    return a2

# removing footnotes enclosed in square brackets
def remove_footnotes(input_string):
    a1 = input_string
    a2 = re.sub(r'\[[\d]*\]', '', a1)
    a2 = re.sub(r' \[[^]]*\]', '', a2)
    return a2

# separating punctuation from words with spaces and collapsing double spaces
# def divide_punctuation(input_string, punctuation_to_tokenize):
#     a1 = input_string
#     a2 = re.sub(r'(['+punctuation_to_tokenize+'])', r' \1 ', a1)
#     a2 = re.sub(r'  ', r' ', a2)
#     return a2

We clean chapter names and footnotes; no separate handling of punctuation is needed.
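
To see what the two cleaning helpers do, here is a small self-contained check on a made-up sentence (the sample text is invented for illustration and is not taken from the corpus). The helpers are restated in a slightly shorter form with the same patterns.

```python
import re

def remove_chapter_names(input_string, regex_string):
    # drop everything matched by the chapter-heading pattern
    return re.sub(regex_string, '', input_string)

def remove_footnotes(input_string):
    # numeric markers such as [1], wherever they appear
    without_numbers = re.sub(r'\[\d*\]', '', input_string)
    # bracketed notes preceded by a space, e.g. " [przyp. red.]"
    return re.sub(r' \[[^]]*\]', '', without_numbers)

sample = "ROZDZIAŁ I\nbył sobie król[1] i królowa [przyp. red.]."
cleaned = remove_chapter_names(sample, 'ROZDZIAŁ[^\n]+')
cleaned = remove_footnotes(cleaned)
print(repr(cleaned))
```

Note that the second pattern requires a leading space, so a non-numeric note glued directly to a word (for example `król[x]`) is left untouched.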

In [96]:
sentences_fairy = []

for file in glob.glob("Korpusy do bajek/*"):

    try:
        #read the file
        with open(file, "r") as myfile:
            text = myfile.read()

        #clean chapter names
        text = remove_chapter_names(text, 'ROZDZIAŁ[^\n]+')
        text = remove_chapter_names(text, 'Rozdział[^\n]+')

        #clean footnotes in square brackets
        text = remove_footnotes(text)

        #lower
        text = text.lower()

        #split to sentences
        text = sent_tokenize(text)
        #print("file ", file, " generated ", len(text), " sentences")

        sentences_fairy.extend(text)
    except Exception:
        #skip files that cannot be read or tokenized
        continue
    
print("We have", len(sentences_fairy), "sentences.")

We have 98829 sentences.


In [48]:
continuous_corpus_fairy = " ".join(sentences_fairy)
print("Full text consists of", len(continuous_corpus_fairy.replace('\n', ' ').split(' ')), "words.")

Full text consists of 1302994 words.


In [49]:
sentences_fairy[:5]

['- tuf – sapnął pociąg, oznajmiając wszem i wobec wszystkim spóźnialskim, że nadeszła ostatnia chwila aby wskoczyć do swojego przedziału i odjechać w siną dal\n\nlokomotywa ospale ruszyła, pociągając za sobą powoli doczepione wagony\n\nsiedmioletnia mania i dziesięcioletni jurek jechali na swoje pierwsze wakacje bez mamy i taty.',
 'oczywiście do babci jadzi eskortował ich dziadek tadek, ale fajny dziadek to nie to samo co strofująca swoje dzieci co chwilę mama\n\n- tato tylko uważaj na nie!',
 '– krzyknęła mama, żegnająca całą trójkę z perony\n\nmania z jurkiem, wychyleni przez otwarte okno machali mamie, dopóki jej czerwona bluzka nie znikła im w oddali\n\nz początku rodzeństwo zachwycone nowym środkiem transportu siedziało nawet spokojnie, dziadek tadek zamknął okno w przedziale, obciągnął blezer na swoim wydatnym brzuchu, wsadził na nos okulary i oddał się swojej ulubionej lekturze, działu sportowego.',
 'pociąg stukał, pukał, pochylał się na nierównościach, wagon pachniał dziwnie

#### **Sentences lengths analysis.**

In [50]:
lens_fairy = []
for sentence in sentences_fairy:
  lens_fairy.append(len(sentence.replace('\n', ' ').split(' ')))

print("Sentences are of length", min(lens_fairy), "to", max(lens_fairy))

Sentences are of length 1 to 189


In [51]:
#quantiles - 90% of sentences have at most 26 words, at least 15% have 4 words or fewer
lens_fairy.sort()
print("Quantiles:\n0.15 is", lens_fairy[int(0.15*len(lens_fairy))],
 "\n0.5 is", lens_fairy[int(0.5*len(lens_fairy))], 
 "\n0.8 is", lens_fairy[int(0.8*len(lens_fairy))], 
 "\n0.9 is", lens_fairy[int(0.9*len(lens_fairy))],
 "\n0.95 is", lens_fairy[int(0.95*len(lens_fairy))])
print("Let's remove the sentences longer than 32.")

Quantiles:
0.15 is 4 
0.5 is 11 
0.8 is 20 
0.9 is 26 
0.95 is 32
Let's remove the sentences longer than 32.
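
The quantiles above are read directly from the sorted list, at position `int(q * n)`. A minimal sketch of that rule on made-up lengths (`index_quantile` and `toy_lengths` are hypothetical names used only here):

```python
def index_quantile(values, q):
    """Value at position int(q * n) of the sorted list, as in the cell above."""
    ordered = sorted(values)
    return ordered[int(q * len(ordered))]

toy_lengths = list(range(1, 20)) + [100]   # 19 short sentences and one outlier
cutoff = index_quantile(toy_lengths, 0.9)
kept = [n for n in toy_lengths if n <= cutoff]
print(cutoff, len(kept), "of", len(toy_lengths))
```

Because every sentence whose length equals the cutoff is kept, the share that survives is at least `q` and usually a little more.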


In [52]:
#removing long sentences
#lowercasing
#replacing newlines with spaces and long dashes with hyphens

sentences_short_fairy = []
for sentence in sentences_fairy:
  if not len(sentence.replace('\n', ' ').split(' ')) > 32:
    sentence = sentence.lower()
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('—', '-')
    sentences_short_fairy.append(sentence)

In [53]:
lens2_fairy = []
for sentence in sentences_short_fairy:
  lens2_fairy.append(len(sentence.replace('\n', ' ').split(' ')))

print("Short sentences are of length", min(lens2_fairy), "to", max(lens2_fairy))

Short sentences are of length 1 to 32


In [54]:
sentences_short_fairy[:5]

['oczywiście do babci jadzi eskortował ich dziadek tadek, ale fajny dziadek to nie to samo co strofująca swoje dzieci co chwilę mama  - tato tylko uważaj na nie!',
 'pociąg stukał, pukał, pochylał się na nierównościach, wagon pachniał dziwnie, a jurka pochłonęło rozpracowywanie konstrukcji podłokietników.',
 'mania przykleiła nos do szyby, za oknem mignęły jej ostatnie budynki i już po chwili wagony postukując wesoło mknęły wśród pól i łąk.',
 'później pociąg wjechał do lasu i zwolnił.',
 'teraz to i jurek przykleił się do szyby  jedynie pucia, czarna kudłata suczka rasy spaniel, towarzysząca rodzeństwu, nie wykazywał zainteresowania krajobrazem ani czymkolwiek innym.']

In [55]:
len(sentences_short_fairy)

93937

### **I**

#### **Tokenization. No punctuation included**

In [56]:
# Fitting the Tokenizer on the Corpus
tokenizer_fairy = Tokenizer(num_words=60000)
tokenizer_fairy.fit_on_texts(sentences_short_fairy)

# Vocabulary count of the corpus
total_words_fairy = len(tokenizer_fairy.word_index)

print("Total Unique Words:", total_words_fairy)      

Total Unique Words: 91617
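
The count is larger than `num_words` because `word_index` always holds every word seen during fitting. The limit applies only later, when texts are converted to sequences: words whose index is `num_words` or higher are dropped. The sketch below imitates that behaviour in plain Python; `fit_word_index` and `to_sequence` are hypothetical stand-ins, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency, most frequent first, starting at index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

texts = ["a a a b b c", "a b d"]
word_index = fit_word_index(texts)
print(len(word_index))                        # size of the full vocabulary
print(to_sequence("a b c d", word_index, 3))  # rarer words are dropped
```

One consequence worth keeping in mind: a model whose embedding and output layers are sized by the full vocabulary count has rows that never receive a training example, so sizing them by `num_words` might be enough.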


In [57]:
# Converting the text into sequences of token ids
input_sequences_fairy = []
for sentence in sentences_short_fairy:
    token_list = tokenizer_fairy.texts_to_sequences([sentence])[0]
    input_sequences_fairy.append(token_list)

#### **Padding.**

In [58]:
maxlen_fairy = max(lens2_fairy)

input_sequences_fairy = np.array(pad_sequences(input_sequences_fairy, maxlen=maxlen_fairy+1, padding='pre'))  #maxlen +1

# predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
# #label = ku.to_categorical(label1, num_classes=total_words+1)

maxlen_fairy

32

#### **Tensorflow Dataset**

In [60]:
batch_size = 8

train_dataset_fairy = tf.data.Dataset.from_tensor_slices(input_sequences_fairy)
train_dataset_fairy = train_dataset_fairy.shuffle(buffer_size=256)
train_dataset_fairy = train_dataset_fairy.batch(batch_size)
type(train_dataset_fairy)

tensorflow.python.data.ops.dataset_ops.BatchDataset
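
A note on `shuffle(buffer_size=256)`: a buffered shuffle only mixes elements that are inside the buffer at the same time, so with tens of thousands of sentences the order stays close to the original one. The simulation below is a simplified plain-Python model of such a streaming shuffle (`buffered_shuffle` is a hypothetical helper, not the TensorFlow implementation).

```python
import random

def buffered_shuffle(items, buffer_size, rng):
    """Mimic a streaming shuffle: draw randomly from a sliding buffer."""
    iterator = iter(items)
    buffer = []
    # fill the buffer with the first buffer_size elements
    for item in iterator:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    output = []
    # emit a random buffered element and replace it with the next input
    for item in iterator:
        slot = rng.randrange(len(buffer))
        output.append(buffer[slot])
        buffer[slot] = item
    # input exhausted: emit what is left in random order
    rng.shuffle(buffer)
    output.extend(buffer)
    return output

shuffled = buffered_shuffle(range(1000), buffer_size=16, rng=random.Random(0))
earliest = max(item - position for position, item in enumerate(shuffled))
print("largest move towards the front:", earliest)
```

No element can be emitted more than `buffer_size - 1` positions earlier than its original place. If sentences from one book sit next to each other in the input, batches will therefore be dominated by that book unless the buffer is close to the dataset size.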

In [61]:
def preprocessing(text):
    text = tf.expand_dims(text, -1)
    print(text.shape)
    predictors, labels = text[:, :-1], text[:, 1:]    #labels are the inputs shifted by one; one label per position
    print(predictors.shape, labels.shape)
    return predictors, labels

In [62]:
train_dataset_fairy = train_dataset_fairy.map(preprocessing)
train_dataset_fairy = train_dataset_fairy.prefetch(tf.data.AUTOTUNE)

(None, 33, 1)
(None, 32, 1) (None, 32, 1)
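
The shift by one can be seen on a single made-up sequence. In this sketch a plain list stands in for one row of the padded array:

```python
padded = [0, 0, 0, 5, 9, 2, 7]   # made-up ids, pre-padded to maxlen + 1 = 7

predictors = padded[:-1]   # everything except the last token
labels = padded[1:]        # everything except the first token

# at every position the label is simply the token that comes next
for position, (seen, target) in enumerate(zip(predictors, labels)):
    print(position, seen, "->", target)
```

Both slices have length `maxlen`, which is why the sequences were padded to `maxlen + 1` beforehand.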


#### **The last successful model (previous dataset, attempt II).**

In [63]:
embed_dim = 32  #initially 128
num_heads = 4

def create_model():
    inputs = keras.layers.Input(shape=(maxlen_fairy, ), dtype=tf.int32, name='transf_input')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words_fairy, maxlen_fairy, embed_dim, name='transf_embed')(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads, 
                                                            dropout=0.5, 
                                                            name='transf_decod')(embedding_layer)
    outputs = keras.layers.Dense(total_words_fairy, activation='softmax', name='transf_dense')(decoder)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model_transf_fairy = create_model()
model_transf_fairy.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input (InputLayer)   [(None, 32)]              0         
                                                                 
 transf_embed (TokenAndPosit  (None, 32, 32)           2932768   
 ionEmbedding)                                                   
                                                                 
 transf_decod (TransformerDe  (None, 32, 32)           6464      
 coder)                                                          
                                                                 
 transf_dense (Dense)        (None, 32, 91617)         3023361   
                                                                 
Total params: 5,962,593
Trainable params: 5,962,593
Non-trainable params: 0
_________________________________________________________________
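
The parameter counts in the summary can be reproduced by hand, which shows where the model's capacity sits. The arithmetic below assumes the embedding layer holds one vector per token plus one per position, and that the dense layer has a bias. The decoder count is copied from the summary rather than derived.

```python
vocab_size = 91617   # total_words_fairy
seq_len = 32         # maxlen_fairy
embed_dim = 32

token_embedding = vocab_size * embed_dim
position_embedding = seq_len * embed_dim
embedding_params = token_embedding + position_embedding

dense_params = embed_dim * vocab_size + vocab_size   # weights + biases

decoder_params = 6464   # taken from the summary, not derived here

total = embedding_params + decoder_params + dense_params
print(embedding_params, dense_params, total)
print("decoder share:", round(100 * decoder_params / total, 2), "%")
```

Almost all parameters belong to the vocabulary-sized embedding and output layers. The transformer block itself accounts for only about 0.1% of the total.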


#### **Quick training**

Small configuration: batch_size = 8, epochs = 5, embed_dim = 32, num_heads = 4, Adam optimizer with lr = 0.0001 -> produces NaNs as weights.

Small configuration: batch_size = 4, epochs = 5, embed_dim = 32, num_heads = 4, Adam optimizer with lr = 0.0001 -> produces NaNs as weights.

Small configuration: batch_size = 4, epochs = 5, embed_dim = 32, num_heads = 1, Adam optimizer with lr = 0.0001 -> produces NaNs as weights.

In [64]:
history_fairy = model_transf_fairy.fit(train_dataset_fairy, epochs=5)

Epoch 1/5


Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


**Saving**

In [66]:
for i in range(len(model_transf_fairy.weights)):
    model_transf_fairy.weights[i]._handle_name = model_transf_fairy.weights[i].name + "_" + str(i)

In [67]:
for i in range(len(model_transf_fairy.optimizer.weights)):
    model_transf_fairy.optimizer.weights[i]._handle_name = model_transf_fairy.optimizer.weights[i].name + "_" + str(i)

In [68]:
#model_transf_fairy.save("transformer_fairy.keras")
loaded_model_transf_fairy = tf.keras.models.load_model("transformer_fairy.keras")

**Testing**

In [69]:
def sample_token(logits):
    #print("logits shape: ", logits.shape)
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    return np.random.choice(indices, p=preds)

In [70]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen_fairy-1 > sample_index:

        #token ids
        token_list = tokenizer_fairy.texts_to_sequences([seed_text])[0]

        #padding
        maxlen_fairy = max(lens2_fairy)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_fairy.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami a kapitan z co nie z to w na o co kapitan ale nami sobą
nawet tej chciał co ten na to w kilku chwila to nocy się kapitan o gdy po czy
ale ale czy w tak kapitan ale z o tylko w to chwili nie może jak to jego jego po
nie było nas jestem kapitan po co z gdy na nie co co w nie ale ale tak już o
drodze powodu chwilę się ciągu kapitan to co ten pod nie nam wodą statku po o po
ale w nie co w a czy w kapitan o miał która kilku prostu tej tu już górę tej do
że jest i góry w kapitanie a a gdy to to kapitan nie ale a tak tak na na do nocy
co go jeszcze widziałem on o nim bo już w nie ja końcu są
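
The decoding step in the generation loop scans the whole `word_index` dictionary for every sampled token. A reverse dictionary built once gives the same result with a single lookup. The sketch uses a toy vocabulary; as far as I know the Keras `Tokenizer` also exposes such a mapping as `index_word` after fitting, but treat that as an assumption to verify.

```python
word_index = {"kapitan": 1, "nautilus": 2, "morze": 3}   # toy stand-in

# build the reverse mapping once, outside the generation loop
index_word = {index: word for word, index in word_index.items()}

def decode(token_id):
    """Return the word for a token id, or an empty string if it is unknown."""
    return index_word.get(token_id, "")

print(decode(2), repr(decode(99)))
```

The empty-string fallback mirrors the original loop, where `output_word` stays `""` when no index matches.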


#### **More training**

In [71]:
loaded_model_transf_fairy = tf.keras.models.load_model("transformer_fairy.keras")

In [72]:
history2_fairy = loaded_model_transf_fairy.fit(train_dataset_fairy, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Saving**

In [73]:
for i in range(len(loaded_model_transf_fairy.weights)):
    loaded_model_transf_fairy.weights[i]._handle_name = loaded_model_transf_fairy.weights[i].name + "_" + str(i)

In [74]:
for i in range(len(loaded_model_transf_fairy.optimizer.weights)):
    loaded_model_transf_fairy.optimizer.weights[i]._handle_name = loaded_model_transf_fairy.optimizer.weights[i].name + "_" + str(i)

In [75]:
#loaded_model_transf_fairy.save("transformer_fairy2.keras")
loaded_model_transf_fairy2 = tf.keras.models.load_model("transformer_fairy2.keras")

**Testing**

In [76]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen_fairy-1 > sample_index:

        #token ids
        token_list = tokenizer_fairy.texts_to_sequences([seed_text])[0]

        #padding
        maxlen_fairy = max(lens2_fairy)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy2.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_fairy.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami zresztą po z ale w to kapitan tak a z co a czy każdym chwilę
nie wody odległości nie to w jest bo przez miał tym nigdy słuszność ich zresztą
zresztą kapitan co gdy nie co tak na ale nautilus nautilus kapitan że panie jak
farragut w tylko powierzchnię tak nemo profesorze kapitan tylko się jak przez go
po ale kapitan w nie co w nautilus z tak z ale conseil w kilka upływie żeby ma
każdym tego nas który kroków by rokiem do przy na tym kapitan i i na kapitan a
co i tak nautilus nautilus to tak tych jego kazał nautilus niż w łatwo był
dobrze statek zdawał której od odpowiedział był mowgli kapitan tak ale w
nautilus na na tak nautilus zresztą gdy a nie nas tak i po jest powierzchni w
ned był długo pewnym morza landa tak się bardzo


#### **Even more training**

In [77]:
loaded_model_transf_fairy2 = tf.keras.models.load_model("transformer_fairy2.keras")

In [78]:
history3_fairy = loaded_model_transf_fairy2.fit(train_dataset_fairy, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Saving**

In [79]:
for i in range(len(loaded_model_transf_fairy2.weights)):
    loaded_model_transf_fairy2.weights[i]._handle_name = loaded_model_transf_fairy2.weights[i].name + "_" + str(i)

In [80]:
for i in range(len(loaded_model_transf_fairy2.optimizer.weights)):
    loaded_model_transf_fairy2.optimizer.weights[i]._handle_name = loaded_model_transf_fairy2.optimizer.weights[i].name + "_" + str(i)

In [81]:
#loaded_model_transf_fairy2.save("transformer_fairy3.keras")
loaded_model_transf_fairy3 = tf.keras.models.load_model("transformer_fairy3.keras")

**Testing**

In [83]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen_fairy-1 > sample_index:

        #token ids
        token_list = tokenizer_fairy.texts_to_sequences([seed_text])[0]

        #padding
        maxlen_fairy = max(lens2_fairy)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy3.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_fairy.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami po nie tak na kapitan tak gdy z nie a ale nie na nami to tym
że nie jednak widziałem co jego tylko tak kapitan pan nie moi mogłem na nautilus
w na kapitan po kapitan zresztą kapitan i nautilus co a pan towarzysze pokład
tej nemo nie by zbliżył gdy moi rzeczy ma na także słuszność mówił nautilus gdy
to i z i nautilus po czy nie tak ale nie to ani zdawał już jego na mi że miał
tego po to są z jest powodu w ale nie zresztą tak zresztą to zresztą w nautilus
na po nautilus za jak wodzie się że w tych powierzchni w w już której których
tej było samej po to i po w zresztą ale kapitan to kapitan to z na się dla
trzech ja ciągu pan na że powierzchnię których do się są te od dzieci


#### **Yet more training**

In [84]:
loaded_model_transf_fairy3 = tf.keras.models.load_model("transformer_fairy3.keras")

In [85]:
history4_fairy = loaded_model_transf_fairy3.fit(train_dataset_fairy, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


**Saving**

In [86]:
for i in range(len(loaded_model_transf_fairy3.weights)):
    loaded_model_transf_fairy3.weights[i]._handle_name = loaded_model_transf_fairy3.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy3.optimizer.weights)):
    loaded_model_transf_fairy3.optimizer.weights[i]._handle_name = loaded_model_transf_fairy3.optimizer.weights[i].name + "_" + str(i)

In [87]:
#loaded_model_transf_fairy3.save("transformer_fairy4.keras")
loaded_model_transf_fairy4 = tf.keras.models.load_model("transformer_fairy4.keras")

**Testing**

In [88]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen_fairy-1 > sample_index:

        #token ids
        token_list = tokenizer_fairy.texts_to_sequences([seed_text])[0]

        #padding
        maxlen_fairy = max(lens2_fairy)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy4.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_fairy.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami po w i po tak na kapitan i kapitan i nautilus czy nie mną
chwilę chwili pod bardzo kapitan wziął z się potem morze czy sobą zaczął to
wołać gdy kapitan zresztą w z po to tak tak a po ale zresztą a pan mnie w
których co że tym na chcesz każdym by względem go ich więcej kapitan zresztą gdy
z zresztą gdy nautilus po to gdy nie nie nautilus jeżeli było wziął w jest stał
jeszcze ma na tak takim przy słuszność mi ich głowę kapitan w w z i tak w
nautilus zresztą na na kapitan kapitan i i nemo każdym z tym panie powierzchnię
kapitana w razie miejscu morza tych jego zwierząt gdy nie na a i i nautilus ale
gdy i nautilus z co po a nagle tym się był w stanął w może którzy bardzo nagle
na nie statek


#### **Check out more training**

In [89]:
loaded_model_transf_fairy4 = tf.keras.models.load_model("transformer_fairy4.keras")

In [90]:
history5_fairy = loaded_model_transf_fairy4.fit(train_dataset_fairy, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


**Saving**

In [91]:
for i in range(len(loaded_model_transf_fairy4.weights)):
    loaded_model_transf_fairy4.weights[i]._handle_name = loaded_model_transf_fairy4.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy4.optimizer.weights)):
    loaded_model_transf_fairy4.optimizer.weights[i]._handle_name = loaded_model_transf_fairy4.optimizer.weights[i].name + "_" + str(i)

In [92]:
loaded_model_transf_fairy4.save("transformer_fairy5.keras")
loaded_model_transf_fairy5 = tf.keras.models.load_model("transformer_fairy5.keras")

**Testing**

In [93]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #correct approach? otherwise error if sample_index is 40 or more

    sample_index = 0
    #one sentence length
    while maxlen_fairy-1 > sample_index:

        #token ids
        token_list = tokenizer_fairy.texts_to_sequences([seed_text])[0]

        #padding
        maxlen_fairy = max(lens2_fairy)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy+1, padding='pre'))

        #test sample
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy5.predict(test_sequence, verbose=0)
        #print("Softmax predictions shape:", soft_pred.shape)

        sample_index = len(seed_text.strip().split())-1
        #print("sample_index", sample_index)
        sampled_token = sample_token(soft_pred[0][sample_index])
        #print(sampled_token)

        output_word = ""
        #decoding tokens
        for word, index in tokenizer_fairy.word_index.items():
            if index == sampled_token:
                output_word = word
                break
        #sampled_token = index_lookup[sampled_token]
        seed_text += " " + output_word

    #print(seed_text)
    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])
    #print(full_text)
    #print(seed_text)


print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami na zresztą nautilus w i zresztą i z w i z z w mało nimi jego
zbliżył że ówdzie ten tej każdym to mówił luźne ziemi na te platformę po na na
po nie tak się tak zresztą tak tak kapitan tak nie same drugiej godzinę jest jak
co jak na wspomnienia potem mi mi ich je wyrazu gdy ale i z się po nie nie tak
na tak w i ją pod tylko w że można się łatwo nie wieczór tym było było zapadł na
ojciec i na nie to się i nautilus tak po nie zresztą nie i na to nie miał że
zbliżył południu od że że co tylko nas z w powodu lecz ale kapitan to i a gdy
tak ale ale zresztą to to i każdym nautilus tych na w ned co się samym którzy
ciągu do o chwili sto


## 5 MB of simple Polish fairytale set

A small, simple Polish dataset with little difficult vocabulary, based on tales suitable for children.

#### **Reading files.**

In [98]:
sentences_fairy_simple = []

for file in glob.glob("Prosty korpus bajkowy/*"):

    try:
        #read the file
        with open(file, "r") as myfile:
            text = myfile.read()

        #lower
        text = text.lower()

        #split to sentences
        text = sent_tokenize(text)
        #print("file ", file, " generated ", len(text), " sentences")

        sentences_fairy_simple.extend(text)
    except Exception:
        #skip files that cannot be read or tokenized
        continue
    
print("We have", len(sentences_fairy_simple), "sentences.")

We have 62031 sentences.


In [99]:
continuous_corpus_fairy_simple = " ".join(sentences_fairy_simple)
print("Full text consists of", len(continuous_corpus_fairy_simple.replace('\n', ' ').split(' ')), "words.")

Full text consists of 866051 words.


In [100]:
sentences_fairy_simple[:5]

['nazywam się sindbad.',
 'mieszkam stale w bagdadzie.',
 'rodzice moi, umierając, zostawili mi w spadku tysiąc worów złota, tysiąc beczek srebra, sto pałaców, sto ogrodów i jeden trzonowy ząb mego pradziadka, który ojciec mój przechowywał w hebanowej szkatułce, jako pamiątkę i osobliwość.',
 'pradziadek mój przez całe życie chorował na ból zębów i co pewien czas inny ząb musiał wyrywać, tak że w końcu jeden mu tylko ząb trzonowy pozostał.',
 'umierając, kazał sobie wyrwać i ten ostatni ząb trzonowy, który przeszedł w spadku od mego dziada do mego ojca, a od ojca — do mnie.']

#### **Sentences lengths analysis.**

In [101]:
lens_fairy_simple = []
for sentence in sentences_fairy_simple:
  lens_fairy_simple.append(len(sentence.replace('\n', ' ').split(' ')))

print("Sentences are of length", min(lens_fairy_simple), "to", max(lens_fairy_simple))

Sentences are of length 1 to 163


In [103]:
#quantiles - 90% of sentences have at most 28 words, at least 15% have 4 words or fewer
lens_fairy_simple.sort()
print("Quantiles:\n0.15 is", lens_fairy_simple[int(0.15*len(lens_fairy_simple))],
 "\n0.5 is", lens_fairy_simple[int(0.5*len(lens_fairy_simple))], 
 "\n0.8 is", lens_fairy_simple[int(0.8*len(lens_fairy_simple))], 
 "\n0.9 is", lens_fairy_simple[int(0.9*len(lens_fairy_simple))],
 "\n0.95 is", lens_fairy_simple[int(0.95*len(lens_fairy_simple))])
print("Let's remove the sentences longer than 35.")

Quantiles:
0.15 is 4 
0.5 is 11 
0.8 is 21 
0.9 is 28 
0.95 is 35
Let's remove the sentences longer than 35.


In [114]:
#removing long sentences
#lowercasing
#replacing newlines with spaces and long dashes with hyphens

sentences_short_fairy_simple = []
for sentence in sentences_fairy_simple:
  if not len(sentence.replace('\n', ' ').split(' ')) > 35:
    sentence = sentence.lower()
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('—', '-')
    sentences_short_fairy_simple.append(sentence)

In [115]:
lens2_fairy_simple = []
for sentence in sentences_short_fairy_simple:
  lens2_fairy_simple.append(len(sentence.replace('\n', ' ').split(' ')))

print("Short sentences are of length", min(lens2_fairy_simple), "to", max(lens2_fairy_simple))

Short sentences are of length 1 to 35


In [117]:
len(sentences_short_fairy_simple)

59249

### **I**

#### **Tokenization. No punctuation included**

In [118]:
# Fitting the Tokenizer on the Corpus
tokenizer_fairy_simple = Tokenizer(num_words=70000)
tokenizer_fairy_simple.fit_on_texts(sentences_short_fairy_simple)

# Vocabulary count of the corpus
total_words_fairy_simple = len(tokenizer_fairy_simple.word_index)

print("Total Unique Words:", total_words_fairy_simple)      

Total Unique Words: 72795
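
Note that `word_index` lists every word seen during fitting (72795 here); `num_words` only takes effect when texts are converted to sequences, where words whose index is `num_words` or higher are dropped. The sketch below imitates that behaviour in plain Python, assuming indices are assigned by descending frequency (ties are broken alphabetically here, which may differ from Keras). `build_word_index` and `to_sequence` are illustrative helpers, not Keras functions.

```python
from collections import Counter


def build_word_index(texts):
    """Assign indices 1, 2, ... by descending word frequency (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words=None):
    """Map words to indices, silently dropping unknown or too-rare words."""
    sequence = []
    for word in text.split():
        index = word_index.get(word)
        if index is None:
            continue
        if num_words is not None and index >= num_words:
            continue  # only indices below num_words are kept
        sequence.append(index)
    return sequence


toy_texts = ["król i królowa", "król i smok", "król"]
toy_index = build_word_index(toy_texts)
full = to_sequence("król i smok", toy_index)
limited = to_sequence("król i smok", toy_index, num_words=3)
print(len(toy_index), full, limited)
```

The vocabulary size stays at 4 in both cases; only the converted sequence gets shorter when `num_words` is set.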


In [119]:
# Converting the text into sequences of token indices
input_sequences_fairy_simple = []
for sentence in sentences_short_fairy_simple:
    token_list = tokenizer_fairy_simple.texts_to_sequences([sentence])[0]
    input_sequences_fairy_simple.append(token_list)

#### **Padding**

In [120]:
maxlen_fairy_simple = max(lens2_fairy_simple)
input_sequences_fairy_simple = np.array(pad_sequences(input_sequences_fairy_simple, maxlen=maxlen_fairy_simple+1, padding='pre'))  #maxlen + 1: one extra token so that inputs and labels both have maxlen positions after the shift
maxlen_fairy_simple

35
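
With `padding='pre'`, zeros are placed in front of shorter sequences, so every real sentence ends at the last position. A plain-Python sketch of that behaviour (`pad_pre` is a hypothetical stand-in for `pad_sequences`, with truncation from the front assumed to match the Keras default):

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    trimmed = sequence[-maxlen:]          # keep the most recent tokens if too long
    return [value] * (maxlen - len(trimmed)) + trimmed


padded_short = pad_pre([5, 9, 2], maxlen=6)
padded_long = pad_pre([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)
print(padded_short)
print(padded_long)
```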

#### **Tensorflow Dataset**

In [121]:
batch_size = 8

train_dataset_fairy_simple = tf.data.Dataset.from_tensor_slices(input_sequences_fairy_simple)
train_dataset_fairy_simple = train_dataset_fairy_simple.shuffle(buffer_size=256)
train_dataset_fairy_simple = train_dataset_fairy_simple.batch(batch_size)
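
With 59249 sentences and a batch size of 8, each epoch consists of 7407 batches, the last of which holds a single sentence. A quick check of that arithmetic (`batches_per_epoch` is an illustrative helper):

```python
import math


def batches_per_epoch(num_samples, batch_size):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(num_samples / batch_size)


steps = batches_per_epoch(59249, 8)
last_batch = 59249 - (steps - 1) * 8
print(steps, last_batch)
```

The shuffle buffer of 256 covers only a small part of the data, so the shuffling is local; a buffer as large as the dataset would give a full shuffle at the cost of memory.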

In [122]:
def preprocessing(text):
    text = tf.expand_dims(text, -1)
    print(text.shape)
    predictors, labels = text[:, :-1], text[:, 1:]    #labels are the inputs shifted by one position; every position has a label, not only the last one
    print(predictors.shape, labels.shape)
    return predictors, labels

In [123]:
train_dataset_fairy_simple = train_dataset_fairy_simple.map(preprocessing)
train_dataset_fairy_simple = train_dataset_fairy_simple.prefetch(tf.data.AUTOTUNE)

(None, 36, 1)
(None, 35, 1) (None, 35, 1)
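
The shapes printed above reflect the one-step shift: a padded row of 36 token ids yields 35 inputs and 35 labels, where the label at each position is the next token. A sketch on a toy row (the ids are made up):

```python
def shift_by_one(row):
    """Split one padded row into (inputs, labels) offset by a single position."""
    return row[:-1], row[1:]


toy_row = [0, 0, 7, 3, 8]           # two padding zeros, then three token ids
inputs, labels = shift_by_one(toy_row)
for x, y in zip(inputs, labels):
    print(f"input {x} -> label {y}")
```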


#### **Model**

The same model applied to simpler texts with fewer words (almost all of them are kept: 72795 -> 70000). Let's use more attention heads to make the decision algorithm more complex.

In [124]:
embed_dim = 32  #initially 128
num_heads = 8

def create_model():
    inputs = keras.layers.Input(shape=(maxlen_fairy_simple, ), dtype=tf.int32, name='transf_input')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words_fairy_simple, maxlen_fairy_simple, embed_dim, name='transf_embed')(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads, 
                                                            dropout=0.5, 
                                                            name='transf_decod')(embedding_layer)
    outputs = keras.layers.Dense(total_words_fairy_simple, activation='softmax', name='transf_dense')(decoder)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model_transf_fairy_simple = create_model()
model_transf_fairy_simple.summary()

Model: "model_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input (InputLayer)   [(None, 35)]              0         
                                                                 
 transf_embed (TokenAndPosit  (None, 35, 32)           2330560   
 ionEmbedding)                                                   
                                                                 
 transf_decod (TransformerDe  (None, 35, 32)           6464      
 coder)                                                          
                                                                 
 transf_dense (Dense)        (None, 35, 72795)         2402235   
                                                                 
Total params: 4,739,259
Trainable params: 4,739,259
Non-trainable params: 0
_________________________________________________________________
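
The parameter counts in the summary can be reproduced by hand. The sketch below assumes a standard decoder block without cross-attention (self-attention with biased query, key, value and output projections, two layer normalizations, and a two-layer feed-forward network); `count_params` is an illustrative helper. Under that assumption the number of heads does not enter the total, because the heads share the same `embed_dim`-wide projections (32 / 8 = 4 dimensions per head).

```python
def count_params(vocab_size, seq_len, embed_dim, intermediate_dim):
    """Parameter counts for the embedding, one decoder block and the output layer."""
    # token table + position table
    embedding = vocab_size * embed_dim + seq_len * embed_dim
    # query, key, value and output projections, each with a bias
    attention = 4 * (embed_dim * embed_dim + embed_dim)
    # two layer normalizations, each with a scale and an offset vector
    layer_norms = 2 * (2 * embed_dim)
    # feed-forward: embed_dim -> intermediate_dim -> embed_dim, with biases
    feed_forward = (embed_dim * intermediate_dim + intermediate_dim) + (
        intermediate_dim * embed_dim + embed_dim
    )
    decoder = attention + layer_norms + feed_forward
    # dense softmax layer over the vocabulary
    dense = embed_dim * vocab_size + vocab_size
    return {"embedding": embedding, "decoder": decoder, "dense": dense,
            "total": embedding + decoder + dense}


counts = count_params(vocab_size=72795, seq_len=35, embed_dim=32, intermediate_dim=32)
print(counts)
```

Almost all parameters sit in the embedding table and the output layer; the decoder block itself is tiny in comparison.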


#### **Training**

In [125]:
history_fairy_simple = model_transf_fairy_simple.fit(train_dataset_fairy_simple, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
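
Perplexity, one of the compiled metrics, is the exponential of the mean per-token cross-entropy, so a model that spreads its probability uniformly over N words scores N. A small sketch with made-up probabilities (`perplexity` is an illustrative helper, not the keras_nlp implementation):

```python
import math


def perplexity(true_token_probs):
    """exp(mean negative log-probability assigned to the correct tokens)."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


uniform = perplexity([1 / 1000] * 5)       # uniform guess over 1000 words
confident = perplexity([0.5, 0.25, 0.5])   # hypothetical better model
print(round(uniform, 2), round(confident, 2))
```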


**Saving**

In [126]:
for i in range(len(model_transf_fairy_simple.weights)):
    model_transf_fairy_simple.weights[i]._handle_name = model_transf_fairy_simple.weights[i].name + "_" + str(i)

for i in range(len(model_transf_fairy_simple.optimizer.weights)):
    model_transf_fairy_simple.optimizer.weights[i]._handle_name = model_transf_fairy_simple.optimizer.weights[i].name + "_" + str(i)

In [127]:
#model_transf_fairy_simple.save("transformer_fairy_simple.keras")
model_transf_fairy_simple2 = tf.keras.models.load_model("transformer_fairy_simple.keras")

**Testing**

In [128]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_simple-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_simple.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_simple+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = model_transf_fairy_simple2.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_simple.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami o następnie nie co w w ale ale po o na pewnego kiedy po gdy
mógł aż w ci górę tak ile pewnego śmierci się jego w mi dla miejsce nie może i w
kiedy i w pewnego gdy kiedy o jestem i kiedy na w ale nie już jej poszedł dawna
ją co mu górę jestem się i tu na w nie wodzie o gdy na po po o i kiedy w pewnego
w gdy król ale to domu a z kilku tym król jednak się teraz z powrotem celu tak
co do wielkie mnie to na ale to gdy i gdy tak król w o nie na kiedy i lasu po
koniec jest dlatego mi nim mógł znów i nie żeby za ze było chwilę bardziej po
pewnego nie jestem kiedy kiedy i kiedy kiedy jestem na z o gdy tak dla i jednego
i już ją i nich zaczął swoich i nigdy zapytał się nie głosem mam
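
One thing worth checking in the loop above is which position of the prediction is sampled. With `padding='pre'` the seed words sit at the end of the input, while `sample_index` counts from the start. The plain-Python sketch below reproduces only the indexing (no model involved) for a four-word seed and `maxlen = 35`; `pad_pre` is a hypothetical stand-in for `pad_sequences`, and the token ids are made up.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen items."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


maxlen = 35
seed_tokens = [11, 12, 13, 14]                 # made-up ids for a four-word seed

# Same steps as in the generation loop: pad to maxlen + 1, then drop the last column
model_input = pad_pre(seed_tokens, maxlen + 1)[:-1]
sample_index = len(seed_tokens) - 1

print("last real token in the input:", model_input[-1])
print("token at sample_index:", model_input[sample_index])

# An alternative that keeps the whole seed and reads the final position
alt_input = pad_pre(seed_tokens, maxlen)
print("alternative input ends with:", alt_input[-1])
```

In this toy run the newest seed word is cut off and the sampled position holds padding, which may explain why the samples consist mostly of frequent words. Padding to `maxlen` and sampling from the last position (`soft_pred[0][-1]`) is one possible adjustment; it has not been tested against the saved models.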


#### **More training**

In [133]:
loaded_model_transf_fairy_simple2 = tf.keras.models.load_model("transformer_fairy_simple.keras")

In [134]:
history2_fairy_simple = loaded_model_transf_fairy_simple2.fit(train_dataset_fairy_simple, epochs=45)

Epoch 1/45
Epoch 2/45
Epoch 3/45
Epoch 4/45
Epoch 5/45
Epoch 6/45
Epoch 7/45
Epoch 8/45
Epoch 9/45
Epoch 10/45
Epoch 11/45
Epoch 12/45
Epoch 13/45
Epoch 14/45
Epoch 15/45
Epoch 16/45
Epoch 17/45
Epoch 18/45
Epoch 19/45
Epoch 20/45
Epoch 21/45
Epoch 22/45
Epoch 23/45
Epoch 24/45
Epoch 25/45
Epoch 26/45
Epoch 27/45
Epoch 28/45
Epoch 29/45
Epoch 30/45
Epoch 31/45
Epoch 32/45
Epoch 33/45
Epoch 34/45
Epoch 35/45
Epoch 36/45
Epoch 37/45
Epoch 38/45
Epoch 39/45
Epoch 40/45
Epoch 41/45
Epoch 42/45
Epoch 43/45
Epoch 44/45
Epoch 45/45


**Saving**

In [135]:
for i in range(len(loaded_model_transf_fairy_simple2.weights)):
    loaded_model_transf_fairy_simple2.weights[i]._handle_name = loaded_model_transf_fairy_simple2.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy_simple2.optimizer.weights)):
    loaded_model_transf_fairy_simple2.optimizer.weights[i]._handle_name = loaded_model_transf_fairy_simple2.optimizer.weights[i].name + "_" + str(i)

In [136]:
#loaded_model_transf_fairy_simple2.save("transformer_fairy_simple2.keras")
loaded_model_transf_fairy_simple3 = tf.keras.models.load_model("transformer_fairy_simple2.keras")

**Testing**

In [137]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_simple-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_simple.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_simple+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy_simple3.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_simple.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami ale a nie to o w z a na w gdy z to gdy ale za i na ty każdym
ja pałacu jego nagle wiele ziemię słowem w i głos jej młodzieńca ale nie a w po
pewnego to król nie to ale to król na gdy mój że mogła mgnieniu ranka posłał za
już widok boże się lecz nią a za i rękę i to i ale ale król nie po kiedy po po
nie ale to następnie mąż tak jest król kazał pewnym prostu jest znaczy nie że ci
plecami że ma nie słowa po z i w nie ale w i w na o król król pewnego nie w ale
niego tym w nie imię i czasu świecie noc którym i z i powiedział uśmiechu z i w
w po z gdy kiedy w pewnego w kiedy na a i udał nie rzekł mgnieniu pewnością ją
gdy z ona bo w że młodzieniec ma tym się samym


#### **Even more training**

In [138]:
loaded_model_transf_fairy_simple3 = tf.keras.models.load_model("transformer_fairy_simple2.keras")

In [139]:
history3_fairy_simple = loaded_model_transf_fairy_simple3.fit(train_dataset_fairy_simple, epochs=25)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


**Saving**

In [140]:
for i in range(len(loaded_model_transf_fairy_simple3.weights)):
    loaded_model_transf_fairy_simple3.weights[i]._handle_name = loaded_model_transf_fairy_simple3.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy_simple3.optimizer.weights)):
    loaded_model_transf_fairy_simple3.optimizer.weights[i]._handle_name = loaded_model_transf_fairy_simple3.optimizer.weights[i].name + "_" + str(i)

In [141]:
#loaded_model_transf_fairy_simple3.save("transformer_fairy_simple3.keras")
loaded_model_transf_fairy_simple4 = tf.keras.models.load_model("transformer_fairy_simple3.keras")

**Testing**

In [142]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_simple-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_simple.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_simple+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy_simple4.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_simple.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami a i ale gdy w gdy to to to kiedy nie a to to ale bardzo nad
z się mu za zaś następnie samo daleko nim było miał tak się je podział i ale
król tak gdy po a ale a i o król z gdy nie już się po łatwo drodze w rzekł
poszedł go w drodze ale ojciec w przed i oczyma na był w na to pewnego a kiedy
ale w pewnego po tak tak a tobą przez szczęśliwy pewno czasu to lesie czym łatwo
ale król ale a sobie posłał na syna gdy nie kiedy król ale król z w na po
pewnego ale pewnego w gdy zaledwie lecz miała i był domu mieście król oddali
trzy z jeszcze i się nią z głowę a ale nie z o a z nie gdy nie nie z ale gdy
kiedy i i na nich to było miała tego młodzieniec jeszcze nią w to miał nie samo
mam


## 2.3 MB mini-set of the most famous short children's fairy tales

#### **Reading files.**

In [3]:
sentences_fairy_mini = []

for file in glob.glob("Mini bajki/*"):

    try:
        #read the file
        with open(file, "r") as myfile:
            text = myfile.read()

        #lowercase
        text = text.lower()

        #split into sentences
        text = sent_tokenize(text)
        #print("file ", file, " generated ", len(text), " sentences")
        
        sentences_fairy_mini.extend(text)
    except Exception:
        #skip files that cannot be read or tokenized
        continue
    
print("We have", len(sentences_fairy_mini), "sentences.")

We have 23345 sentences.


In [4]:
continuous_corpus_fairy_mini = " ".join(sentences_fairy_mini)
print("Full text consists of", len(continuous_corpus_fairy_mini.replace('\n', ' ').split(' ')), "words.")

Full text consists of 347565 words.


In [7]:
sentences_fairy_mini[5:10]

['z muru zwieszały się pnące rośliny, a wielkie liście łopianu schylały się aż do wody.',
 'i było pod nimi cicho i ciemno, jak w cienistym lesie.',
 'pod jednym z takich liści młoda kaczka usłała sobie gniazdo i siedziała na jajach.',
 'nudziło jej się bardzo, bo żadna z sąsiadek nie miała chęci w tak piękną pogodę rozmawiać z nią o tym, co słychać na świecie.',
 'każda wolała pływać po przejrzystej wodzie, pluskać się i osuszać na ciepłym słoneczku, a ona tylko jedna, jak przykuta, siedzi w cieniu na gnieździe.']

#### **Sentence length analysis.**

In [8]:
lens_fairy_mini = []
for sentence in sentences_fairy_mini:
  lens_fairy_mini.append(len(sentence.replace('\n', ' ').split(' ')))

print("Sentences are of length", min(lens_fairy_mini), "to", max(lens_fairy_mini))

Sentences are of length 1 to 126


In [11]:
#quantiles - 90% of sentences have at most 29 words, about 15% have 4 words or fewer
lens_fairy_mini.sort()
print("Quantiles:\n0.15 is", lens_fairy_mini[int(0.15*len(lens_fairy_mini))],
 "\n0.5 is", lens_fairy_mini[int(0.5*len(lens_fairy_mini))], 
 "\n0.8 is", lens_fairy_mini[int(0.8*len(lens_fairy_mini))], 
 "\n0.9 is", lens_fairy_mini[int(0.9*len(lens_fairy_mini))],
 "\n0.95 is", lens_fairy_mini[int(0.95*len(lens_fairy_mini))])
print("Let's remove the sentences longer than 36.")

Quantiles:
0.15 is 4 
0.5 is 12 
0.8 is 23 
0.9 is 29 
0.95 is 36
Let's remove the sentences longer than 36.


In [13]:
#removing long sentences
#lowercasing
#replacing newline characters with spaces
#replacing em dashes with hyphens

sentences_short_fairy_mini = []
for sentence in sentences_fairy_mini:
  if not len(sentence.replace('\n', ' ').split(' ')) > 36:
    sentence = sentence.lower()
    sentence = sentence.replace('\n', ' ')
    sentence = sentence.replace('—', '-')
    sentences_short_fairy_mini.append(sentence)

In [14]:
lens2_fairy_mini = []
for sentence in sentences_short_fairy_mini:
  lens2_fairy_mini.append(len(sentence.replace('\n', ' ').split(' ')))

print("Short sentences are of length", min(lens2_fairy_mini), "to", max(lens2_fairy_mini))

Short sentences are of length 1 to 36


In [15]:
len(sentences_short_fairy_mini)

22250

### **I**

#### **Tokenization. No punctuation included**

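Practically no words are excluded: num_words equals the vocabulary size, and the Keras Tokenizer keeps only the num_words - 1 most frequent words, so just the single rarest word is dropped.

In [17]:
# Fitting the Tokenizer on the Corpus
tokenizer_fairy_mini = Tokenizer(num_words=38377)                   #keeps all but the rarest word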
tokenizer_fairy_mini.fit_on_texts(sentences_short_fairy_mini)

# Vocabulary count of the corpus
total_words_fairy_mini = len(tokenizer_fairy_mini.word_index)

print("Total Unique Words:", total_words_fairy_mini)      

Total Unique Words: 38377


In [18]:
# Converting the text into sequences of token indices
input_sequences_fairy_mini = []
for sentence in sentences_short_fairy_mini:
    token_list = tokenizer_fairy_mini.texts_to_sequences([sentence])[0]
    input_sequences_fairy_mini.append(token_list)

#### **Padding**

In [19]:
maxlen_fairy_mini = max(lens2_fairy_mini)
input_sequences_fairy_mini = np.array(pad_sequences(input_sequences_fairy_mini, maxlen=maxlen_fairy_mini+1, padding='pre'))  #maxlen + 1: one extra token so that inputs and labels both have maxlen positions after the shift
maxlen_fairy_mini

36

#### **Tensorflow Dataset**

In [20]:
batch_size = 8

train_dataset_fairy_mini = tf.data.Dataset.from_tensor_slices(input_sequences_fairy_mini)
train_dataset_fairy_mini = train_dataset_fairy_mini.shuffle(buffer_size=256)
train_dataset_fairy_mini = train_dataset_fairy_mini.batch(batch_size)

In [21]:
def preprocessing(text):
    text = tf.expand_dims(text, -1)
    print(text.shape)
    predictors, labels = text[:, :-1], text[:, 1:]    #labels are the inputs shifted by one position; every position has a label, not only the last one
    print(predictors.shape, labels.shape)
    return predictors, labels

In [22]:
train_dataset_fairy_mini = train_dataset_fairy_mini.map(preprocessing)
train_dataset_fairy_mini = train_dataset_fairy_mini.prefetch(tf.data.AUTOTUNE)

(None, 37, 1)
(None, 36, 1) (None, 36, 1)


#### **Model**

The same model as above (more heads).

In [24]:
embed_dim = 32  #initially 128
num_heads = 8

def create_model():
    inputs = keras.layers.Input(shape=(maxlen_fairy_mini, ), dtype=tf.int32, name='transf_input')
    embedding_layer = keras_nlp.layers.TokenAndPositionEmbedding(total_words_fairy_mini, maxlen_fairy_mini, embed_dim, name='transf_embed')(inputs)
    decoder = keras_nlp.layers.TransformerDecoder(intermediate_dim=embed_dim, 
                                                            num_heads=num_heads, 
                                                            dropout=0.5, 
                                                            name='transf_decod')(embedding_layer)
    outputs = keras.layers.Dense(total_words_fairy_mini, activation='softmax', name='transf_dense')(decoder)
    
    model = keras.Model(inputs=inputs, outputs=outputs)
    
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001), 
        loss='sparse_categorical_crossentropy',
        metrics=[keras_nlp.metrics.Perplexity(), 'accuracy']
    )
    return model

model_transf_fairy_mini = create_model()
model_transf_fairy_mini.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 transf_input (InputLayer)   [(None, 36)]              0         
                                                                 
 transf_embed (TokenAndPosit  (None, 36, 32)           1229216   
 ionEmbedding)                                                   
                                                                 
 transf_decod (TransformerDe  (None, 36, 32)           6464      
 coder)                                                          
                                                                 
 transf_dense (Dense)        (None, 36, 38377)         1266441   
                                                                 
Total params: 2,502,121
Trainable params: 2,502,121
Non-trainable params: 0
_________________________________________________________________


#### **Training**

More epochs on a small dataset with fewer words.

In [25]:
history_fairy_mini = model_transf_fairy_mini.fit(train_dataset_fairy_mini, epochs=50)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


**Saving**

In [26]:
for i in range(len(model_transf_fairy_mini.weights)):
    model_transf_fairy_mini.weights[i]._handle_name = model_transf_fairy_mini.weights[i].name + "_" + str(i)

for i in range(len(model_transf_fairy_mini.optimizer.weights)):
    model_transf_fairy_mini.optimizer.weights[i]._handle_name = model_transf_fairy_mini.optimizer.weights[i].name + "_" + str(i)

In [27]:
#model_transf_fairy_mini.save("transformer_fairy_mini.keras")
model_transf_fairy_mini2 = tf.keras.models.load_model("transformer_fairy_mini.keras")

**Testing**

In [29]:
def sample_token(logits):
    #keep the 15 highest-scoring tokens
    logits, indices = tf.math.top_k(logits, k=15, sorted=True)
    indices = np.asarray(indices).astype("int32")
    #softmax over the kept scores gives the sampling weights
    preds = keras.activations.softmax(tf.expand_dims(logits, 0))[0]
    preds = np.asarray(preds).astype("float32")
    #draw one token index at random according to those weights
    return np.random.choice(indices, p=preds)
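
The model's last layer already applies a softmax, so the values reaching `sample_token` are probabilities rather than logits, and the second softmax flattens them toward a nearly uniform choice among the 15 candidates. One alternative is to renormalize the top-k probabilities directly. A plain-Python sketch (`top_k_distribution` and `sample_top_k` are hypothetical helpers and the probabilities are made up):

```python
import random


def top_k_distribution(probs, k):
    """Return (indices, weights) of the k most probable tokens, renormalized to sum to 1."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in ranked)
    return ranked, [probs[i] / total for i in ranked]


def sample_top_k(probs, k=15, rng=random):
    """Draw one token index from the renormalized top-k distribution."""
    indices, weights = top_k_distribution(probs, k)
    return rng.choices(indices, weights=weights, k=1)[0]


toy_probs = [0.05, 0.60, 0.05, 0.30]           # made-up softmax output over 4 tokens
indices, weights = top_k_distribution(toy_probs, k=2)
print(indices, [round(w, 3) for w in weights])
print(sample_top_k(toy_probs, k=2, rng=random.Random(0)))
```

Renormalizing keeps the relative preferences of the model (here 2:1 between the two candidates), whereas a softmax over values between 0 and 1 can never make one candidate more than about 2.7 times as likely as another.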

In [30]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_mini-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_mini.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_mini+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = model_transf_fairy_mini2.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_mini.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami król król nie nagle a po nie i następnie w nie po
młodzieniec i ale chwilę sobą powierzył będziesz jednak jestem następnie ma
który ginę do miała córką przed się jeszcze słońcem człowieka kiedy to kiedy
potem po kandyd a nie następnie król na młodzieniec kiedy i a jej ani nagle
ujrzał skończonej teraz po stole już złotopiórcia śladu las przecież chociaż i
otoczony na odjechał to nie wieczorem młodzieniec król król w w nie dnia o
młodzieniec klara ale gdy opowiedział chwilę że dotarł przyjął drogę wolno tym
zaczęła tymczasem choć przez żeby co graniem chwilę dzień zatrzymywały kiedy
klara gdy a kiedy nie gdy w kiedy dnia ale w w nagle następnie dziewicę cały
znów ujrzał się spostrzegł za wciąż mgnieniu książęta był ogniste więc pilnowany
dobrych domy w stopy rozdział król pewnego kiedy ale nie na nie po dnia pewnego
po na z ale które końcu xiii młodzieniec król to cichutku ranka widok odgadła
jednak który ty szeroką nosi z śmiertelnym powrotem


#### **More training**

In [31]:
loaded_model_transf_fairy_mini2 = tf.keras.models.load_model("transformer_fairy_mini.keras")

In [33]:
history2_fairy_mini = loaded_model_transf_fairy_mini2.fit(train_dataset_fairy_mini, epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


**Saving**

In [34]:
for i in range(len(loaded_model_transf_fairy_mini2.weights)):
    loaded_model_transf_fairy_mini2.weights[i]._handle_name = loaded_model_transf_fairy_mini2.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy_mini2.optimizer.weights)):
    loaded_model_transf_fairy_mini2.optimizer.weights[i]._handle_name = loaded_model_transf_fairy_mini2.optimizer.weights[i].name + "_" + str(i)

In [35]:
#loaded_model_transf_fairy_mini2.save("transformer_fairy_mini2.keras")
loaded_model_transf_fairy_mini3 = tf.keras.models.load_model("transformer_fairy_mini2.keras")

**Testing**

In [36]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_mini-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_mini.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_mini+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy_mini3.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_mini.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami po kiedy gdy król i a o na i młodzieniec kiedy a król z na
chwilę duże rowach nie nie mało zwieszała znów po pogawędki a zjem z dotarł
rozpostarte tak tam drzesz kiedy gdy gdy po na gdy o i na dnia następnego
młodzieniec po i na za do byli do klombach tym łące księcia chwili poddasze
tańczącego nich nie klaro także jak rzekł zaczarowane kiedy na to a tak kiedy
pewnego gdy po a kiedy i pewnego gdy król o w nagle że się przyszli czym
doralice wieczoru spostrzegł dalszą z w kawalerem to niej przywitała pazurami a
gdy więc wieczorem a w o kiedy król nie ale na o a król był jej wiatr płacili w
tkaczach udał już swoim myszy mowę… dla nie za polnej gościa mną serdecznik na
klara to w król w w kiedy a i po król król nie nie jest gdyż podwórzu nie kazał
środku młodzieniec czym nie mogła nie chce tryskała księżniczce udać je czytać
sprzedać


#### **Even more training**

In [37]:
loaded_model_transf_fairy_mini3 = tf.keras.models.load_model("transformer_fairy_mini2.keras")

In [38]:
history3_fairy_mini = loaded_model_transf_fairy_mini3.fit(train_dataset_fairy_mini, epochs=50)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


**Saving**

In [39]:
for i in range(len(loaded_model_transf_fairy_mini3.weights)):
    loaded_model_transf_fairy_mini3.weights[i]._handle_name = loaded_model_transf_fairy_mini3.weights[i].name + "_" + str(i)

for i in range(len(loaded_model_transf_fairy_mini3.optimizer.weights)):
    loaded_model_transf_fairy_mini3.optimizer.weights[i]._handle_name = loaded_model_transf_fairy_mini3.optimizer.weights[i].name + "_" + str(i)

In [40]:
loaded_model_transf_fairy_mini3.save("transformer_fairy_mini3.keras")
loaded_model_transf_fairy_mini4 = tf.keras.models.load_model("transformer_fairy_mini3.keras")

**Testing**

In [41]:
seed_text = 'Za górami za lasami'
full_text = 'Za górami za lasami'

#preprocessing
seed_text = seed_text.lower()
seed_text = seed_text.replace('\n', ' ')
seed_text = seed_text.replace('—', '-')

num_sentences = 5

for _ in range(num_sentences): #generate in chunks so that sample_index stays within the model's sequence length

    sample_index = 0
    #one sentence length
    while maxlen_fairy_mini-1 > sample_index:

        #tokenization (words -> integer indices)
        token_list = tokenizer_fairy_mini.texts_to_sequences([seed_text])[0]

        #padding (zeros are added in front, as in training)
        test_sequence = np.array(pad_sequences([token_list], maxlen=maxlen_fairy_mini+1, padding='pre'))

        #test sample: drop the last column to match the model input length
        #(note: with 'pre' padding this column holds the most recent word)
        test_sequence = test_sequence[:, :-1]

        #predictions
        soft_pred = loaded_model_transf_fairy_mini4.predict(test_sequence, verbose=0)

        #position whose predicted distribution is sampled
        sample_index = len(seed_text.strip().split())-1
        sampled_token = sample_token(soft_pred[0][sample_index])

        #decoding the token (empty string if the index is unknown, e.g. padding 0)
        output_word = tokenizer_fairy_mini.index_word.get(int(sampled_token), "")
        seed_text += " " + output_word

    #save text generated so far
    full_text += ' '
    full_text += ' '.join(seed_text.split()[4:])
    #reset seed_text (set as current last 4 words)
    seed_text =  ' '.join(seed_text.split()[-4:])

print('\n'.join(textwrap.wrap(full_text, 80)))

Za górami za lasami po to i i i ale o a po na kiedy w to o po mąż pierwszym jej
uciekł z pień cichutku ujrzała przed horyzont gdyż ile twarzy i i by kacze
spalono młodzieniec nie nie wieczorem o gdy król pewnego na kiedy gdy na w król
a opowiedział podwórze rzucił będziesz litość posłał myśl tylko kołysce macocha
i w pierścień z jest swoim mojego przyjacielem książę więc w to tak to nie z
złotopiórcia kiedy a kiedy w na ale bardzo państwa nakazał końcu już troszczyła
piękna co końcu nie mojego na mi to chce to go wierzyć a na pewnego wieczorem
nie a gdy a król to a kiedy kiedy w następnie znowu nie gdy wieczora był król
pojął gdy na torbę ma tropił i kukułka kotarami córeczkę zetną skrzynkę klara
ale i nie a i to po nie nie o po a a po który przez po z jednak tylko wolno
świcie co dotarciu niego rąk co młodzieniec przy oddali zrobił dzieci
