# Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs) provide a mechanism for language modeling: they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for **generative tasks**, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture discussed in the previous unit, each RNN unit produced the next hidden state as its output. However, we can also add another output to each recurrent unit, which lets us output a **sequence** (of the same length as the input sequence). Moreover, we can use RNN units that do not accept an input at each step, and instead take only an initial state vector and then produce a sequence of outputs.

In this notebook, we will focus on simple generative models that help us generate text. For simplicity, let's build a **character-level network**, which generates text letter by letter. For training, we need to take a text corpus and split it into letter sequences.
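
As a rough illustration of what "splitting a corpus into letter sequences" means, here is a minimal pure-Python sketch. The toy corpus and the names `char_to_id` and `encode` are hypothetical and only serve this example; the notebook itself relies on the Keras `Tokenizer` for this step.

```python
# Toy corpus (an assumption for illustration; the notebook uses news titles).
corpus = ["Hello", "World"]

# Collect every distinct character and give it an integer id.
# Ids start at 1 so that 0 stays free for padding.
chars = sorted(set("".join(corpus)))
char_to_id = {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text):
    """Turn a string into a list of character ids."""
    return [char_to_id[ch] for ch in text]

encoded = [encode(t) for t in corpus]
print(char_to_id)
print(encoded)
```

Each title becomes a list of small integers, one per character, which is the representation the rest of the notebook works with.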


In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

ds_train, ds_test = tfds.load('ag_news_subset').values()

## Building a character vocabulary

To build a character-level generative network, we need to split text into individual characters instead of words. The `TextVectorization` layer, as we have used it so far, splits text into words, so we have two options:

* Load the text manually and do the tokenization 'by hand', as in [this official Keras example](https://keras.io/examples/generative/lstm_character_level_text_generation/)
* Use the `Tokenizer` class for character-level tokenization.

We will go with the second option. `Tokenizer` can also tokenize into words, so it is easy to switch from character-level to word-level tokenization.

To do character-level tokenization, we need to pass the `char_level=True` parameter:


In [2]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

tokenizer = keras.preprocessing.text.Tokenizer(char_level=True,lower=False)
tokenizer.fit_on_texts([x['title'].numpy().decode('utf-8') for x in ds_train])

We also want a special token that denotes the **end of sequence**, which we will call `<eos>`. Let's add it to the vocabulary manually:


In [3]:
eos_token = len(tokenizer.word_index)+1
tokenizer.word_index['<eos>'] = eos_token

vocab_size = eos_token + 1

Now, to encode text into a sequence of numbers, we can use:


In [4]:
tokenizer.texts_to_sequences(['Hello, world!'])

[[48, 2, 10, 10, 5, 44, 1, 25, 5, 8, 10, 13, 78]]

## Training a generative RNN to generate titles

The way we will train the RNN to generate news titles is as follows. At each step, we take one title, feed it into the RNN, and for each input character ask the network to generate the next output character:

![Image showing an example of an RNN generating the word 'HELLO'.](../../../../../translated_images/rnn-generate.56c54afb52f9781d63a7c16ea9c1b86cb70e6e1eae6a742b56b7b37468576b17.pcm.png)

For the last character of the sequence, we ask the network to generate the `<eos>` token.

The main difference in the generative RNN we use here is that we take an output from each step of the RNN, not just from the final cell. This is achieved by setting `return_sequences=True` on the RNN layer.

Thus, during training, the input to the network is a sequence of encoded characters of some length, and the output is a sequence of the same length, shifted by one element and terminated by `<eos>`. A minibatch consists of several such sequences, so we need **padding** to align them to a common length.
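
The shift-by-one relationship between input and target can be sketched with plain NumPy. This is only an illustration under stated assumptions: `EOS`, `PAD`, and `make_batch` are hypothetical names, the token ids are made up, and padding is placed at the front to mirror the default behaviour of Keras `pad_sequences`.

```python
import numpy as np

EOS = 9  # hypothetical id for the <eos> token
PAD = 0  # id used for padding

def make_batch(sequences, eos=EOS, pad=PAD):
    """Left-pad sequences to equal length and build next-character targets."""
    max_len = max(len(s) for s in sequences)
    inputs = np.full((len(sequences), max_len), pad, dtype=np.int64)
    for row, seq in enumerate(sequences):
        inputs[row, max_len - len(seq):] = seq  # padding goes in front
    # Target = input shifted left by one position, with <eos> appended at the end.
    eos_col = np.full((len(sequences), 1), eos, dtype=np.int64)
    targets = np.concatenate([inputs[:, 1:], eos_col], axis=1)
    return inputs, targets

x, y = make_batch([[1, 2, 3, 3, 4], [5, 4, 6]])
print(x)
print(y)
```

Because the padding sits at the front, the last column of the input is always the final real character of each title, so appending `<eos>` as the last target column is correct for every row.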

Let's create functions that transform the dataset for us. Because we want to pad sequences at minibatch level, we first batch the dataset by calling `.batch()`, and then `map` it to do the transformation. The transformation function therefore takes a whole minibatch as its parameter:


In [5]:
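def title_batch(x):
    x = [t.numpy().decode('utf-8') for t in x]            # string tensor -> Python strings
    z = tokenizer.texts_to_sequences(x)                   # characters -> integer ids
    z = tf.keras.preprocessing.sequence.pad_sequences(z)  # pad (at the front) to equal length
    # Target: the input shifted by one position, with <eos> appended at the end
    shifted = tf.concat([z[:,1:],tf.constant(eos_token,shape=(len(z),1))],axis=1)
    return tf.one_hot(z,vocab_size), tf.one_hot(shifted,vocab_size)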

A few important things happen here:
* First, we extract the actual text from the string tensor
* `texts_to_sequences` converts the list of strings into a list of integer sequences
* `pad_sequences` pads those sequences to their maximum length (by default, padding is added at the front)
* Finally, we one-hot encode all the characters, and also do the shifting and append `<eos>`. We will soon see why we need one-hot-encoded characters

However, this function is **Pythonic**, i.e. it cannot be translated automatically into a TensorFlow computational graph. We will get errors if we try to use it directly in `Dataset.map`. We need to wrap this Pythonic call with the `py_function` wrapper:


In [6]:
def title_batch_fn(x):
    x = x['title']
    a,b = tf.py_function(title_batch,inp=[x],Tout=(tf.float32,tf.float32))
    return a,b

> **Note**: The distinction between Pythonic and TensorFlow transformation functions may seem overly complex, and you may wonder why we do not transform the dataset with standard Python functions before passing it to `fit`. While that can certainly be done, using `Dataset.map` has a big advantage: the data transformation pipeline is executed as part of the TensorFlow computational graph, which takes advantage of GPU computations and minimizes the need to pass data between CPU and GPU.

Now we can build our generator network and start training. It can be based on any recurrent cell discussed in the previous unit (simple, LSTM or GRU). In our example we will use LSTM.

Because the network takes characters as input and the vocabulary size is small, we do not need an embedding layer; one-hot-encoded input can go directly into the LSTM cell. The output layer is a `Dense` classifier that converts the LSTM output into a probability distribution over tokens, matching the one-hot-encoded targets.
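
To make the one-hot input concrete, here is a small NumPy sketch of what the encoding does to a batch of token ids. The helper name `one_hot` and the tiny vocabulary size are assumptions for illustration; the notebook itself uses `tf.one_hot`.

```python
import numpy as np

def one_hot(ids, vocab_size):
    """Map an integer array of shape (...) to one-hot vectors of shape (..., vocab_size)."""
    return np.eye(vocab_size, dtype=np.float32)[np.asarray(ids)]

# Two padded sequences of three token ids each (made-up values).
batch = np.array([[1, 2, 3], [0, 4, 2]])
encoded = one_hot(batch, vocab_size=5)
print(encoded.shape)   # (batch, sequence length, vocab_size)
print(encoded[0, 0])   # the one-hot vector for token id 1
```

With a character vocabulary of under a hundred symbols, these vectors stay small, which is why feeding them directly to the LSTM is practical here.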

Also, since we are dealing with variable-length sequences, we can use a `Masking` layer to create a mask that ignores the padded part of the string. This is not strictly needed, because we are not very interested in anything beyond the `<eos>` token, but we use it to get some experience with this layer type. `input_shape` is `(None, vocab_size)`, where `None` indicates a sequence of variable length, and the output shape (not counting the batch dimension) is `(None, vocab_size)` as well, as you can see from the `summary`:


In [7]:
model = keras.models.Sequential([
    keras.layers.Masking(input_shape=(None,vocab_size)),
    keras.layers.LSTM(128,return_sequences=True),
    keras.layers.Dense(vocab_size,activation='softmax')
])

model.summary()
model.compile(loss='categorical_crossentropy')

model.fit(ds_train.batch(8).map(title_batch_fn))

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
masking (Masking)            (None, None, 84)          0         
_________________________________________________________________
lstm (LSTM)                  (None, None, 128)         109056    
_________________________________________________________________
dense (Dense)                (None, None, 84)          10836     
Total params: 119,892
Trainable params: 119,892
Non-trainable params: 0
_________________________________________________________________


<tensorflow.python.keras.callbacks.History at 0x7fa40c1245e0>
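
The parameter counts reported in the summary above can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four weight blocks, each with input weights, recurrent weights, and a bias) and for a dense layer; the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four weight blocks (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight matrix plus one bias vector.
    return input_dim * units + units

vocab_size, hidden = 84, 128
n_lstm = lstm_params(vocab_size, hidden)
n_dense = dense_params(hidden, vocab_size)
print(n_lstm, n_dense, n_lstm + n_dense)
```

The results match the 109,056 and 10,836 parameters listed for the `lstm` and `dense` layers, and their sum matches the total of 119,892.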

## Generating output

Now that we have trained the model, we want to use it to generate some output. First, we need a way to decode text represented as a sequence of token numbers. We could use the `tokenizer.sequences_to_texts` function for this, but it does not work well with character-level tokenization. So we take the token dictionary from the tokenizer (called `word_index`), build a reverse map, and write our own decoding function:


In [10]:
reverse_map = {val:key for key, val in tokenizer.word_index.items()}

def decode(x):
    return ''.join([reverse_map[t] for t in x])

Now, let's do the generation. We start with some string `start`, encode it into a sequence `inp`, and then at each step call our network to infer the next character.

The network output `out` is a vector of `vocab_size` elements representing the probability of each token, and we can find the most probable token number using `argmax`. We then append this character to the list of generated tokens and proceed with generation. This one-character step is repeated `size` times to generate the required number of characters, and we terminate early when `eos_token` is encountered.


In [12]:
def generate(model,size=100,start='Today '):
        inp = tokenizer.texts_to_sequences([start])[0]
        chars = inp
        for i in range(size):
            out = model(tf.expand_dims(tf.one_hot(inp,vocab_size),0))[0][-1]
            nc = tf.argmax(out)
            if nc==eos_token:
                break
            chars.append(nc.numpy())
            inp = inp+[nc]
        return decode(chars)
    
generate(model)

'Today #39;s lead to strike for the strike for the strike for the strike (AFP)'

## Sampling output during training 

Because we do not have a useful metric such as *accuracy*, the only way to see that our model is getting better is by **sampling** generated strings during training. To do this, we use **callbacks**, i.e. functions that we pass to the `fit` function and that are called periodically during training.


In [13]:
sampling_callback = keras.callbacks.LambdaCallback(
  on_epoch_end = lambda epoch, logs: print(generate(model))
)

model.fit(ds_train.batch(8).map(title_batch_fn),callbacks=[sampling_callback],epochs=3)

Epoch 1/3
Today #39;s a lead in the company for the strike
Epoch 2/3
Today #39;s the Market Service on Security Start (AP)
Epoch 3/3
Today #39;s a line on the strike to start for the start


<tensorflow.python.keras.callbacks.History at 0x7fa40c74e3d0>

This example already generates some fairly good text, but it can be improved in several ways:

* **More text**. We have only used titles for our task, but you may want to experiment with full text. Remember that RNNs are not very good at handling long sequences, so it makes sense either to split them into shorter sentences, or to always train on a fixed sequence length given by some predefined value `num_chars` (say, 256). You may try to change the example above to such an architecture, using the [official Keras tutorial](https://keras.io/examples/generative/lstm_character_level_text_generation/) as inspiration.

* **Multilayer LSTM**. It makes sense to try 2 or 3 layers of LSTM cells. As mentioned in the previous unit, each LSTM layer extracts certain patterns from the text, and for a character-level generator we can expect the lower LSTM layer to be responsible for extracting syllables, and the higher layers for words and word combinations. In Keras this is done by stacking several `LSTM` layers in the model, each with `return_sequences=True`, since the `LSTM` constructor has no number-of-layers parameter.

* You may also want to experiment with **GRU units** to see which perform better, and with **different hidden layer sizes**. Too large a hidden layer may result in overfitting (e.g. the network learns the exact text), while too small a size may not produce good results.


## Soft text generation and temperature

In the earlier definition of `generate`, we always took the character with the highest probability as the next character of the generated text. As a result, the text often "cycled" through the same character sequences again and again, as in this example:
```
today of the second the company and a second the company ...
```

However, if we look at the probability distribution for the next character, the difference between the few highest probabilities may not be large, e.g. one character may have probability 0.2 and another 0.19. For example, when looking for the next character in the sequence '*play*', the next character can equally well be a space or **e** (as in the word *player*).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest may still lead to meaningful text. It is wiser to **sample** characters from the probability distribution given by the network output.

This sampling can be done using the `np.random.multinomial` function, which implements the so-called **multinomial distribution**. A function that implements this **soft** text generation is defined below:


In [33]:
def generate_soft(model,size=100,start='Today ',temperature=1.0):
        inp = tokenizer.texts_to_sequences([start])[0]
        chars = inp
        for i in range(size):
            out = model(tf.expand_dims(tf.one_hot(inp,vocab_size),0))[0][-1]
            probs = tf.exp(tf.math.log(out)/temperature).numpy().astype(np.float64)
            probs = probs/np.sum(probs)
            nc = np.argmax(np.random.multinomial(1,probs,1))
            if nc==eos_token:
                break
            chars.append(nc)
            inp = inp+[nc]
        return decode(chars)

words = ['Today ','On Sunday ','Moscow, ','President ','Little red riding hood ']
    
for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"\n--- Temperature = {i}")
    for j in range(5):
        print(generate_soft(model,size=300,start=words[j],temperature=i))


--- Temperature = 0.3
Today #39;s strike #39; to start at the store return
On Sunday PO to Be Data Profit Up (Reuters)
Moscow, SP wins straight to the Microsoft #39;s control of the space start
President olding of the blast start for the strike to pay &lt;b&gt;...&lt;/b&gt;
Little red riding hood ficed to the spam countered in European &lt;b&gt;...&lt;/b&gt;

--- Temperature = 0.8
Today countie strikes ryder missile faces food market blut
On Sunday collores lose-toppy of sale of Bullment in &lt;b&gt;...&lt;/b&gt;
Moscow, IBM Diffeiting in Afghan Software Hotels (Reuters)
President Ol Luster for Profit Peaced Raised (AP)
Little red riding hood dace on depart talks #39; bank up

--- Temperature = 1.0
Today wits House buiting debate fixes #39; supervice stake again
On Sunday arling digital poaching In for level
Moscow, DS Up 7, Top Proble Protest Caprey Mamarian Strike
President teps help of roubler stepted lessabul-Dhalitics (AFP)
Little red riding hood signs on cash in Carter-youb

---

KeyError: 0

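We have introduced one more parameter called **temperature**, which indicates how strongly we stick to the highest probability. If the temperature is 1.0, we do fair multinomial sampling; as the temperature goes to infinity, all probabilities become equal and we select the next character at random. In the example above we can observe that the text becomes meaningless when we increase the temperature too much, and it resembles "cycled" hard-generated text as the temperature gets closer to 0. The run above also stops with `KeyError: 0` at a higher temperature, most likely because the sampler picked token 0 (the padding index), which has no entry in `reverse_map`; skipping or excluding that token during generation would avoid the error.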
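
The effect of temperature can be seen on a small, made-up probability vector. The sketch below is not the notebook's `generate_soft`; `apply_temperature`, `sample_token`, and the `banned` argument are hypothetical names. Excluding index 0 from sampling is one possible way, under the assumption that index 0 is the padding token, to keep the sampler from returning a token that cannot be decoded.

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector as p_i ** (1 / T), then renormalise."""
    p = np.maximum(np.asarray(probs, dtype=np.float64), 1e-12)  # avoid log(0)
    logits = np.log(p) / temperature
    logits -= logits.max()            # for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

def sample_token(probs, temperature, rng, banned=(0,)):
    """Sample a token id, never returning ids listed in `banned` (e.g. padding)."""
    p = apply_temperature(probs, temperature)
    p[list(banned)] = 0.0
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

probs = [0.05, 0.5, 0.3, 0.15]        # made-up network output; index 0 = padding
for t in (0.3, 1.0, 1.8):
    print(t, np.round(apply_temperature(probs, t), 3))

rng = np.random.default_rng(0)
samples = [sample_token(probs, 1.8, rng) for _ in range(200)]
```

A low temperature concentrates almost all probability on the most likely token, which approaches `argmax` behaviour, while a high temperature flattens the distribution towards uniform.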


