
# How to write like Ivo Andrić

In this notebook, I attempt to train a Gated Recurrent Unit (GRU) network to write like Ivo Andrić in Serbo-Croatian.

I am closely following the [Text generation with an RNN](https://www.tensorflow.org/tutorials/text/text_generation) tutorial. For further details on how GRU and LSTM networks work, read:

* [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), and
* [Essentials of Deep Learning : Introduction to Long Short Term Memory](https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/)


##### Import dependencies

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf

import numpy as np
import os
import time

Upload Ivo Andrić's *Na Drini ćuprija* (The Bridge on the Drina), one of the most important works in South Slavic literature.

In [4]:
from google.colab import files
uploaded = files.upload()

Saving Ivo-Andric-Na-Drini-cuprija.txt to Ivo-Andric-Na-Drini-cuprija.txt


In [0]:
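# Read the whole novel as one string; the context manager closes the file
with open("Ivo-Andric-Na-Drini-cuprija.txt", encoding="utf-8") as f:
    text = f.read()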

In [6]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

86 unique characters
Length of text: 662769 characters


In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])
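
The mapping above can be checked on a small scale. The following is a minimal sketch on a made-up sample string (the names `sample`, `vocab_demo`, `char_to_index` and `index_to_char` are illustrative and not part of the notebook); it shows that encoding to indices and decoding back recovers the original text.

```python
# Toy illustration of the character <-> index mapping (the sample string is made up).
sample = "na drini"

vocab_demo = sorted(set(sample))                       # unique characters, sorted
char_to_index = {ch: i for i, ch in enumerate(vocab_demo)}
index_to_char = vocab_demo                             # a list serves as the inverse lookup

encoded = [char_to_index[ch] for ch in sample]         # text -> integers
decoded = "".join(index_to_char[i] for i in encoded)   # integers -> text

print(vocab_demo)
print(encoded)
print(decoded)
```

Because the vocabulary is sorted, the indices depend only on which characters occur in the text, not on the order in which they first appear.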

The first 20 entries of the character-to-index mapping:

In [8]:
for char,_ in zip(char2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(char), char2idx[char]))
print('  ...\n}')

  '\n':   0,
  ' ' :   1,
  '!' :   2,
  '"' :   3,
  '&' :   4,
  "'" :   5,
  '(' :   6,
  ')' :   7,
  ',' :   8,
  '-' :   9,
  '.' :  10,
  '0' :  11,
  '1' :  12,
  '2' :  13,
  '3' :  14,
  '4' :  15,
  '5' :  16,
  '6' :  17,
  '7' :  18,
  '8' :  19,
  ...
}


In [9]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

A
n
d
r
i


In [10]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'Andrić - Na Drini ćuprija\n\nVećim delom svoga toka reka Drina protiče kroz tesne gudure između str'
'mih planina ili kroz duboke kanjone okomito odsečenih obala. Samo na nekoliko mesta rečnog toka nje'
'ne se obale proširuju u otvorene doline i stvaraju, bilo na jednoj bilo na obe strane reke, župne, de'
'limično ravne, delimično talasaste predele, podesne za obrađivanje i naselja. Takvo jedno proširenj'
'e nastaje i ovde, kod Višegrada, na mestu gde Drina izbija u naglom zavoju iz dubokog i uskog tesnaca'
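
The chunking step determines how many training sequences, and later how many batches, one epoch contains. The arithmetic below is a small sketch using the text length printed earlier, the sequence length of 100, and the batch size of 64 that the notebook sets further down; the variable names are my own.

```python
text_length = 662769   # characters in the novel, as printed earlier
seq_length = 100
batch_size = 64        # value of BATCH_SIZE used later in the notebook

chunk_length = seq_length + 1                # input plus one extra character for the target
num_chunks = text_length // chunk_length     # same value as examples_per_epoch
dropped_chars = text_length % chunk_length   # tail discarded by drop_remainder=True
steps_per_epoch = num_chunks // batch_size   # full batches in one epoch

print(num_chunks, dropped_chars, steps_per_epoch)  # 6562 7 102
```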


Create training examples and targets

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

In [12]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'Andrić - Na Drini ćuprija\n\nVećim delom svoga toka reka Drina protiče kroz tesne gudure između st'
Target data: 'ndrić - Na Drini ćuprija\n\nVećim delom svoga toka reka Drina protiče kroz tesne gudure između str'


In [13]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

Step    0
  input: 24 ('A')
  expected output: 62 ('n')
Step    1
  input: 62 ('n')
  expected output: 52 ('d')
Step    2
  input: 52 ('d')
  expected output: 65 ('r')
Step    3
  input: 65 ('r')
  expected output: 57 ('i')
Step    4
  input: 57 ('i')
  expected output: 51 ('c')


Create training batches

In [14]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
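
To make the comment about the shuffle buffer more concrete, here is a rough pure-Python sketch of buffer-based shuffling. It is only an approximation of the idea, not the actual `tf.data` implementation, and `buffered_shuffle` is a hypothetical helper name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in pseudo-random order while holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
in_order = list(buffered_shuffle(range(5), buffer_size=1))
print(shuffled)
print(in_order)   # a buffer of one element cannot reorder anything
```

With 6562 sequences and a buffer of 10000, the buffer here is larger than the dataset, so under this scheme every sequence is in the buffer before the first one is drawn and the shuffle is effectively a full one.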

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

### Building the model

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential()
  model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim, batch_input_shape=[batch_size, None]))
  model.add(tf.keras.layers.GRU(rnn_units, return_sequences=True, stateful=True, recurrent_initializer='glorot_uniform'))
  model.add(tf.keras.layers.Dropout(0.2))
  model.add(tf.keras.layers.Dense(vocab_size))
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [18]:
for input_example_batch, target_example_batch in dataset.take(1): 
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
  
model.summary()

(64, 100, 86) # (batch_size, sequence_length, vocab_size)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           22016     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dropout (Dropout)            (64, None, 1024)          0         
_________________________________________________________________
dense (Dense)                (64, None, 86)            88150     
Total params: 4,048,470
Trainable params: 4,048,470
Non-trainable params: 0
_________________________________________________________________
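
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras GRU layout with `reset_after=True` (the TensorFlow 2 default, as far as I know), in which each of the three weight sets carries two bias vectors; the variable names are illustrative.

```python
vocab_size = 86
embedding_dim = 256
rnn_units = 1024

# One embedding vector per character.
embedding_params = vocab_size * embedding_dim

# Three weight sets (update gate, reset gate, candidate state), each with an
# input kernel, a recurrent kernel and two bias vectors (assumes reset_after=True).
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense layer: weight matrix plus one bias per output class.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 22016 3938304 88150 4048470
```

The dropout layer has no weights, which is why it contributes 0 parameters.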


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index.

In [20]:
sampled_indices

array([18, 29, 32, 10,  7, 29, 68, 58, 20, 22,  3, 32, 84, 46, 60, 82, 51,
       15, 23, 28, 55, 74, 76, 85, 52, 48,  4, 71, 64, 81, 56, 22, 71, 39,
       70, 52, 83, 77, 54, 75, 17, 19, 58, 48, 83, 85, 15, 77, 24, 43, 10,
        8, 44, 70, 44,  1, 74, 75, 64,  4, 26, 58, 45, 36,  4, 12, 52, 70,
       23, 38, 28, 34, 49, 38, 44,  3, 37, 37, 55, 76, 78, 79, 21, 32, 32,
       42, 20, 56, 57, 81,  5,  3, 14, 54, 11, 51, 31,  0, 24,  2])
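
The indices above are sampled from the distribution defined by the model outputs rather than taken as the most likely class. The following is a minimal pure-Python sketch of that idea with made-up scores for a three-character vocabulary; `softmax` and `sample_index` are hypothetical helpers, not TensorFlow functions.

```python
import math
import random

def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    peak = max(logits)                                # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, rng):
    """Draw one class index with probability softmax(logits)[index]."""
    probs = softmax(logits)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, 0.1]                              # made-up scores
draws = [sample_index(logits, rng) for _ in range(1000)]

print([draws.count(i) for i in range(len(logits))])   # how often each index was drawn
print("argmax would always return", logits.index(max(logits)))
```

Sampling can return any index with non-zero probability, whereas taking the argmax would return the same index every time for the same scores.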

Text predicted by untrained model:

In [21]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 '̌iju sreću. Stara i urođena sklonost Višegrađana ka bezbrižnom životu i uživanjima nalazila je i po'

Next Char Predictions: 
 '7FI.)Fuj9;"ǏXlžc4?Eg°ü—d_&xpŽh;xPwd́Đf»68j_́—4ĐAU.,VwV °»p&CjWM&1dw?OEKaOV"NNgüđŠ:IIT9hiŽ\'"3f0cH\nA!'


## Train the model

At this point the problem is treated as a standard classification problem: given the previous RNN state and the input at the current time step, predict the class of the next character.

In [22]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 86)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.4542627
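
The initial loss of about 4.454 is close to ln(86) ≈ 4.4543, which is the cross-entropy of a model that assigns equal probability to all 86 characters. The sketch below checks this with a hand-written version of the loss for a single prediction; `sparse_cross_entropy_from_logits` is an illustrative helper, not the Keras function.

```python
import math

def sparse_cross_entropy_from_logits(label, logits):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

vocab_size = 86
uniform_logits = [0.0] * vocab_size   # no preference among characters
loss_value = sparse_cross_entropy_from_logits(label=10, logits=uniform_logits)

print(loss_value, math.log(vocab_size))
```

A loss well above this value right after initialization would suggest that the untrained model is confidently wrong rather than merely uninformed.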


In [0]:
model.compile(optimizer='adam', loss=loss)

Configure a checkpoint callback so that the model weights are saved during training.

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
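
The `{epoch}` placeholder in the checkpoint prefix is filled in with the epoch number when a checkpoint is written. The sketch below imitates this with plain `str.format`; that the numbering starts at 1 is my assumption about the Keras behavior.

```python
import os

checkpoint_dir = "./training_checkpoints"
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Fill the placeholder for the first three epochs (numbering from 1 is assumed).
for epoch in range(1, 4):
    print(checkpoint_prefix.format(epoch=epoch))
```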

## Run the training

In [25]:
EPOCHS = 30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
...
Epoch 30/30


## Generate text

Restore the latest checkpoint

In [26]:
tf.train.latest_checkpoint(checkpoint_dir)

model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            22016     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dropout_1 (Dropout)          (1, None, 1024)           0         
_________________________________________________________________
dense_1 (Dense)              (1, None, 86)             88150     
Total params: 4,048,470
Trainable params: 4,048,470
Non-trainable params: 0
_________________________________________________________________


Generate text with the trained model

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty string to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next character from the categorical distribution given by the logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
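
The effect of `temperature` is easiest to see on a tiny example. The following sketch uses made-up logits for three classes and a hypothetical helper `softmax_with_temperature` to show how dividing the logits by the temperature sharpens or flattens the sampling distribution.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                                # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]                              # made-up scores
results = {}
for temperature in (0.5, 1.0, 2.0):
    results[temperature] = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in results[temperature]])
```

At a temperature below 1 the most likely class takes a larger share of the probability, and at a temperature above 1 the distribution moves toward uniform.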

In [42]:
print(generate_text(model, start_string = "čovekove slabosti su"))

čovekove slabosti suve i noći, a redovno se pred njegovima uređenih u isto vreme i svoga tiho sa mučnu životu, zaželi dušu ne samo putu za more da izdrži
nepodne da pomogao og rik.
— To su se na sve mračnu drvenoj skriveni još nekoliko puta u kome niko nikad kid tim retkom ređevinom leto da se opet zaustaviti i nikad nije bio krezničkoj zemlji seljaka koji je u povrenu zbog sejatnog presmenkao i mudiranički hotelski opšti rekri; sejmeni doživljaju. Izgleda odeljeni slučaju niti se gube straži glasovi ne mogu da preizreče tarsta, pruži reke. Kako njegovo doba zvane kapije za ovu krenu. Biće se štvar mladići svi smirili i ja sve manje i osljavala manjivanja za žen. Prašak se čudi govorili o uvezan i nainca koje će pored tih meha i vidljivih strasti. A neodločno smeš i u redu iz varoši, kako, ne može se za mesečar, porazme neke sede eko nekako lepa i vrati se na drugoj obali retke zrnašnjeg podnesta otpostanja u istom trenutku činio celom narodom i sa strane svoga treba. Sa