<a href="https://colab.research.google.com/github/hikmatfarhat-ndu/NN-online/blob/main/11text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation with an RNN
Recurrent Neural Networks (RNNs) are typically used on __sequences__. Unlike the NNs we have seen so far, an RNN takes a __sequence__ of inputs instead of just one. 
The main difference between RNNs and the other networks is that an RNN contains a __feedback__ loop, as shown in the figure below.
The figure also shows an __unrolled__ RNN for an input sequence of length 4.

![rnn](https://github.com/hikmatfarhat-ndu/NN-online/blob/main/figures/rnn.png?raw=1)

### Import TensorFlow and other libraries

In [1]:
import tensorflow as tf
import numpy as np
import os
import time
from tensorflow.keras.layers import SimpleRNN,SimpleRNNCell

As we have seen above, an RNN layer is basically a sequence of RNN cells that feed each other. To illustrate, we compare the output of a SimpleRNN layer in Keras with the output of two SimpleRNN cells that feed one another.

In [2]:
## Initial state
init=tf.zeros_like([[0,0,0,0]],dtype='float32')
print("init shape={}".format(init.shape))
## the input contains a batch of one
## sequence (length 2) of vectors of dim 3
input=np.array([[1,1,1],[4,4,4]]).astype('float32')
input=input.reshape(1,2,3)
print("input shape={}".format(input.shape))
## Create a "deterministic" RNN so that we always
## get the same output
rnn=SimpleRNN(4,activation='linear', 
       kernel_initializer=tf.keras.initializers.Constant(1), 
        return_sequences=True,
        recurrent_initializer=tf.keras.initializers.Constant(1)
        )
output=rnn(input,initial_state=init)
print(np.squeeze(output.numpy()))


init shape=(1, 4)
input shape=(1, 2, 3)
[[ 3.  3.  3.  3.]
 [24. 24. 24. 24.]]
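
The two output rows can be checked by hand. Assuming the usual simple-RNN recurrence $h_t = f(x_t W + h_{t-1} U + b)$ where, in this setup, `W` and `U` are all ones, `b` is zero (the Keras default bias initializer) and `f` is the identity, a minimal NumPy sketch reproduces them. The names `W`, `U`, `b` and `h` are illustrative only, not Keras attributes.

```python
import numpy as np

# Assumed shapes: input dimension 3 and 4 units, as in the cell above.
W = np.ones((3, 4), dtype="float32")   # input kernel (Constant(1))
U = np.ones((4, 4), dtype="float32")   # recurrent kernel (Constant(1))
b = np.zeros(4, dtype="float32")       # bias (zeros by default)

xs = np.array([[1, 1, 1], [4, 4, 4]], dtype="float32")  # the 2-step sequence
h = np.zeros(4, dtype="float32")                         # initial state

outputs = []
for x in xs:
    # Linear activation: the new state is just the affine combination.
    h = x @ W + h @ U + b
    outputs.append(h)

outputs = np.stack(outputs)
print(outputs)
```

At the first step each unit receives 1+1+1 = 3 from the input and 0 from the initial state. At the second step each unit receives 4+4+4 = 12 from the input plus 3+3+3+3 = 12 from the previous state, giving 24.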


## return\_sequences=False
If `return_sequences` is set to `False`, the RNN returns only the output of the last step.

In [3]:
rnn=SimpleRNN(4,activation='linear', 
       kernel_initializer=tf.keras.initializers.Constant(1), 
        return_sequences=False,
        recurrent_initializer=tf.keras.initializers.Constant(1)
        )
output=rnn(input,initial_state=init)
print(np.squeeze(output.numpy()))



[24. 24. 24. 24.]


### Equivalent network 
Below we build an equivalent network that uses two cells in sequence: the state produced by the first step is passed to the second.

In [4]:
seq1=np.array([[1,1,1]]).astype('float32')
seq2=np.array([[4,4,4]]).astype('float32')
### Create a deterministic cell
cell=SimpleRNNCell(4,activation='linear',
        kernel_initializer=tf.keras.initializers.Constant(1),
        recurrent_initializer=tf.keras.initializers.Constant(1)
        )
## output of the first cell
## using seq1 and initialized to init
one=cell(inputs=seq1,states=init)
## output of the second cell
## using seq2 and the output of the previous cell 
two=cell(inputs=seq2,states=one[1])
print(np.squeeze(one[0].numpy()))
print(np.squeeze(two[0].numpy()))


[3. 3. 3. 3.]
[24. 24. 24. 24.]


In [5]:
#from google.colab import files
#file=files.upload()
#!mkdir /root/.kaggle
#!mv kaggle.json  /root/.kaggle


In [6]:
#!kaggle datasets download -d nzalake52/new-york-times-articles
#!kaggle datasets download -d ad6398/bbcnewsarticle
#!unzip bbcnewsarticle.zip

### Python 'set' data structure

A Python `set` contains only unique elements, so it can be used to collect the vocabulary (the distinct characters) used in the text.

In [7]:
input=['b','a','a','c','a','b','c','a']
inputSet=set(input)
inputSet

{'a', 'b', 'c'}

In [9]:
data_file="c:/Users/hikmat/Downloads/bbc-text.csv"
text=open(data_file,encoding='utf8').read()
## A set has unique elements
## therefore set(text)
## contains all characters used in the text
vocab = sorted(set(text))
print(vocab[0:20])
print(vocab[20:40])
print(vocab[40:59])

['\n', ' ', '!', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4']
['5', '6', '7', '8', '9', ':', ';', '=', '@', '[', ']', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
['i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '£']


Convert the characters to integers


In [10]:
index2char = np.array(vocab)

# Create a dictionary from vocab, with key 'character' and value 'index'
char2index = {u:i for i, u in enumerate(vocab)}

## The encoded text is just a sequence of numbers
encoded_text = np.array([char2index[c] for c in text])
print("number of characters in text={}".format(len(encoded_text)))
nc=10
print("first {} characters".format(nc))
print([index2char[c] for c in encoded_text[0:nc]])
print("first {} encoded characters".format(nc))
encoded_text[0:nc]

number of characters in text=5056090
first 10 characters
['c', 'a', 't', 'e', 'g', 'o', 'r', 'y', ',', 't']
first 10 encoded characters


array([34, 32, 51, 36, 38, 46, 49, 56, 11, 51])
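
The two lookup structures are inverses of each other. A small self-contained sketch on a toy string shows the round trip. The helper names `encode` and `decode` and the `toy_` variables are hypothetical, not part of the notebook.

```python
import numpy as np

toy_text = "hello world"
toy_vocab = sorted(set(toy_text))                      # unique characters, sorted
toy_char2index = {ch: i for i, ch in enumerate(toy_vocab)}
toy_index2char = np.array(toy_vocab)

def encode(s):
    """Map each character to its integer index."""
    return np.array([toy_char2index[ch] for ch in s])

def decode(ids):
    """Map integer indices back to a string."""
    return "".join(toy_index2char[ids])

ids = encode(toy_text)
print(ids)
print(decode(ids))
```

A character that does not occur in the training text has no entry in the dictionary, so encoding it raises a `KeyError`. The vocabulary printed above contains no uppercase letters, so any string encoded with `char2index` should be lowercase.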

In [11]:
# The maximum length, in characters, of a single input sequence
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(encoded_text)
char_dataset

<TensorSliceDataset shapes: (), types: tf.int32>

In [12]:
b =char_dataset.take(5)
print("Original Data")
print("------------")
for e in b:
  print(e.numpy(),end= " ")
print("\nBatched data")
c=char_dataset.batch(5,drop_remainder=True)
for e in c.take(1):
  print(e.numpy())

def split_input(batch):
    input_text = batch[:-1]
    target_text = batch[1:]
    return input_text, target_text
print("the final data set")
print("------------------")
d=c.map(split_input)
for e in d.take(1):
  print(e[0].numpy())
  print(e[1].numpy())


Original Data
------------
34 32 51 36 38 
Batched data
[34 32 51 36 38]
the final data set
------------------
[34 32 51 36]
[32 51 36 38]


The `batch` method lets us easily convert these individual characters to sequences of the desired size.

In [13]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)
type(sequences)


tensorflow.python.data.ops.dataset_ops.BatchDataset

In [14]:
#print(sequences)
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

BATCH_SIZE = 64
BUFFER_SIZE = 10000

#dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

print(dataset)
for d in dataset.take(1):
    print(d[0][0,3].numpy())
    #print(d[0])
    #print(d[1])

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int32, tf.int32)>
36
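
The dataset shape follows from integer arithmetic. As a rough check, assuming the character count printed earlier (5,056,090) and the `drop_remainder=True` behaviour used above, the number of sequences and of batches per epoch can be worked out as follows (the variable names are illustrative):

```python
n_chars = 5_056_090        # length of encoded_text, from the output above
seq_length = 100
batch_size = 64

# Each example consumes seq_length + 1 characters (input plus shifted target).
n_sequences = n_chars // (seq_length + 1)
# drop_remainder=True discards the final incomplete batch.
steps_per_epoch = n_sequences // batch_size

print(n_sequences, steps_per_epoch)   # 50060 782
```

Because the dataset is not shuffled here, consecutive batches contain consecutive chunks of the text in their original order.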


## Build The Model

Use `tf.keras.Sequential` to define the model. For this simple example three layers are used to define our model:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that will map the numbers of each character to a vector with `embedding_dim` dimensions;
* `tf.keras.layers.SimpleRNN`: A type of RNN with size `units=rnn_units` (you can also use a GRU or LSTM layer here);
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs.

In [15]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024
print(vocab_size)

59


In [16]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.SimpleRNN(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model

In [17]:
model = build_model(vocab_size=len(vocab),embedding_dim=embedding_dim,rnn_units=rnn_units,batch_size=BATCH_SIZE)
model.summary()
model.compile(optimizer='adam', loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))    
tf.keras.utils.plot_model(model,show_shapes=True)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           15104     
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (64, None, 1024)          1311744   
_________________________________________________________________
dense (Dense)                (64, None, 59)            60475     
Total params: 1,387,323
Trainable params: 1,387,323
Non-trainable params: 0
_________________________________________________________________
('Failed to import pydot. You must `pip install pydot` and install graphviz (https://graphviz.gitlab.io/download/), ', 'for `pydotprint` to work.')
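
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the standard layouts: an embedding table of `vocab_size` rows, a simple RNN with an input kernel, a recurrent kernel and a bias, and a dense layer with weights and a bias. The `*_params` names are illustrative.

```python
vocab_size, embedding_dim, rnn_units = 59, 256, 1024

# Embedding: one vector of size embedding_dim per character.
embedding_params = vocab_size * embedding_dim
# SimpleRNN: input kernel + recurrent kernel + bias.
rnn_params = embedding_dim * rnn_units + rnn_units * rnn_units + rnn_units
# Dense: weights + bias.
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total)
# 15104 1311744 60475 1387323
```

Almost all of the parameters sit in the recurrent kernel (1024 x 1024), which is why the choice of `rnn_units` dominates the model size.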


In [18]:

EPOCHS = 3
weights_file="c:/Users/hikmat/Downloads/weights.h5"
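try:
  model.load_weights(weights_file)
except Exception as e:
  ## No usable saved weights (e.g. on the first run): train from scratch
  print("Could not load weights, training from scratch: {}".format(e))
history = model.fit(dataset, epochs=EPOCHS)
## Uncomment to save the trained weights so that the
## generation model below can load them from weights_file
#model.save_weights(weights_file)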

Epoch 1/3
Epoch 2/3
Epoch 3/3


## Generate text

In [24]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(weights_file)
#model.build(tf.TensorShape([1, None]))
model.summary()
tf.keras.utils.plot_model(model,show_shapes=True)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (1, None, 256)            15104     
_________________________________________________________________
simple_rnn_4 (SimpleRNN)     (1, None, 1024)           1311744   
_________________________________________________________________
dense_2 (Dense)              (1, None, 59)             60475     
Total params: 1,387,323
Trainable params: 1,387,323
Non-trainable params: 0
_________________________________________________________________
('Failed to import pydot. You must `pip install pydot` and install graphviz (https://graphviz.gitlab.io/download/), ', 'for `pydotprint` to work.')


### The prediction loop

The following code block generates the text:

* Begin by choosing a start string, initializing the RNN state and setting the number of characters to generate.

* Get the prediction distribution of the next character using the start string and the RNN state.

* Then, sample the index of the predicted character from a categorical distribution over the model's outputs. Use this predicted character as the next input to the model.

* The RNN state is carried over between calls (the layer is built with `stateful=True`), so the model has more context than the single character it is given. After each prediction the updated state is kept for the next call, which is how the model accumulates context from the previously generated characters. No learning takes place during generation: the weights stay fixed.
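
The sampling step in the list above can be sketched in plain NumPy. This only illustrates what drawing from a categorical distribution over logits means; it is not the implementation of `tf.random.categorical`. The function name `sample_from_logits` and the toy logits are assumptions made for the example.

```python
import numpy as np

def sample_from_logits(logits, rng):
    """Draw one index with probability softmax(logits)."""
    logits = np.asarray(logits, dtype="float64")
    # Subtract the max for numerical stability before exponentiating.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = [2.0, 0.5, -1.0]          # unnormalized scores for 3 characters
draws = [sample_from_logits(toy_logits, rng) for _ in range(1000)]
counts = np.bincount(draws, minlength=3)
print(counts / counts.sum())           # roughly softmax(toy_logits)
```

Sampling, rather than always taking the highest-scoring character, keeps some randomness in the output; always picking the argmax tends to make a character model repeat itself.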


![To generate text the model's output is fed back to the input](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/text_generation_sampling.png?raw=1)

Looking at the generated text, you'll see that the model has picked up common English words, punctuation and the vocabulary of news articles. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [27]:
def generate_text(model, start_string):
    # Evaluation step (generating text using the learned model)

    # Number of characters to generate
    num_generate = 1000

    # Converting our start string to numbers (vectorizing)
    input_eval = [char2index[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)

    # Empty string to store our results
    text_generated = []

    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # remove the batch dimension
        predictions = tf.squeeze(predictions, 0)

        # using a categorical distribution to predict the character returned by the model
        
        predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

        # Pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)

        text_generated.append(index2char[predicted_id])

    return (start_string + ''.join(text_generated))

In [30]:
print(generate_text(model, start_string=u"skills"))

skills and sven to do 2005 in the country s 50-year are football details in programme although the us rosnalising a regulator so farmers are strong sources is break he not just mobile phone s rations standard would play used.  trades. she number the classell  the two gavn at banned has said that recent translates   said mr style. the system reacom networks are expected if the film spending paultors during there are resolves in the us documentary prefer s titles and pornicester still loss club fairer. a game streets and we don t think it s defeat u2 campaigning  transfers too engine.  ms logitit  which in tiffilise the nt at the film  tecle. other  search and msn unfr does have doll. dr spending message to return top up to nievoration said on the relived and the united a certain in deceler s amazin kill enterprise would dehell speech on in not to takes between its refusing requiring screened at twister completed by us attoon  canaion issues the secured a grand slam constitution.  a sepa