# Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants, such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs), provide a mechanism for language modeling: they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for **generative tasks**, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture discussed in the previous unit, each RNN unit produced the next hidden state as its output. However, we can also add another output to each recurrent unit, which allows us to output a **sequence** (equal in length to the original sequence). Moreover, we can use RNN units that do not accept an input at each step; they just take an initial state vector and then produce a sequence of outputs.

In this notebook, we will focus on simple generative models that help us generate text. For simplicity, let's build a **character-level network** that generates text letter by letter. During training, we need to take a text corpus and split it into letter sequences.


In [1]:
import collections  # needed below for collections.Counter
import torch
import torchtext
import numpy as np
from torchnlp import *
train_dataset,test_dataset,classes,vocab = load_dataset()

Loading dataset...
Building vocab...


## Building a character vocabulary

To build a character-level generative network, we need to split the text into individual characters instead of words. This can be done by defining a different tokenizer:


In [2]:
def char_tokenizer(words):
    return list(words) #[word for word in words]

counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(char_tokenizer(line))
vocab = torchtext.vocab.vocab(counter)

vocab_size = len(vocab)
print(f"Vocabulary size = {vocab_size}")
print(f"Encoding of 'a' is {vocab.get_stoi()['a']}")
print(f"Character with code 13 is {vocab.get_itos()[13]}")

Vocabulary size = 82
Encoding of 'a' is 1
Character with code 13 is c
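
To illustrate what a character vocabulary amounts to, here is a minimal sketch in plain Python that does not use `torchtext`. The helper `build_char_vocab` and the two-line corpus are hypothetical and exist only for this illustration; ids are assigned in order of first appearance, which is an assumption of this sketch.

```python
from collections import Counter

def build_char_vocab(lines):
    """Map each distinct character to an integer id, in order of first appearance."""
    counter = Counter()
    for line in lines:
        counter.update(list(line))   # split the line into single characters
    itos = list(counter)             # Counter keeps first-insertion order
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos

# Hypothetical two-line corpus, used only for illustration
corpus = ["hello world", "hold on"]
stoi, itos = build_char_vocab(corpus)

encoded = [stoi[ch] for ch in "hello"]           # string -> list of ids
decoded = "".join(itos[i] for i in encoded)      # list of ids -> string
print(len(stoi), encoded, decoded)
```

The two dictionaries play the same role as `vocab.get_stoi()` and `vocab.get_itos()` in the cell above.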


Let's see an example of how we can encode text from our dataset:


In [3]:
def enc(x):
    return torch.LongTensor(encode(x,voc=vocab,tokenizer=char_tokenizer))

enc(train_dataset[0][1])

tensor([ 0,  1,  2,  2,  3,  4,  5,  6,  3,  7,  8,  1,  9, 10,  3, 11,  2,  1,
        12,  3,  7,  1, 13, 14,  3, 15, 16,  5, 17,  3,  5, 18,  8,  3,  7,  2,
         1, 13, 14,  3, 19, 20,  8, 21,  5,  8,  9, 10, 22,  3, 20,  8, 21,  5,
         8,  9, 10,  3, 23,  3,  4, 18, 17,  9,  5, 23, 10,  8,  2,  2,  8,  9,
        10, 24,  3,  0,  1,  2,  2,  3,  4,  5,  9,  8,  8,  5, 25, 10,  3, 26,
        12, 27, 16, 26,  2, 27, 16, 28, 29, 30,  1, 16, 26,  3, 17, 31,  3, 21,
         2,  5,  9,  1, 23, 13, 32, 16, 27, 13, 10, 24,  3,  1,  9,  8,  3, 10,
         8,  8, 27, 16, 28,  3, 28,  9,  8,  8, 16,  3,  1, 28,  1, 27, 16,  6])

## Training a generative RNN

We will train the RNN to generate text in the following way. At each step, we take a sequence of characters of length `nchars` and ask the network to generate the next character for each input character:

![Image showing an example of an RNN generating the word 'HELLO'.](../../../../../translated_images/rnn-generate.56c54afb52f9781d63a7c16ea9c1b86cb70e6e1eae6a742b56b7b37468576b17.pcm.png)

Depending on the scenario, we may also want to include special characters, such as *end-of-sequence* `<eos>`. In our case, we just want to train the network for endless text generation, so we fix the size of each sequence at `nchars` tokens. Consequently, each training example consists of `nchars` inputs and `nchars` outputs (the input sequence shifted one symbol to the left). A minibatch consists of several such sequences.

We generate minibatches by taking each news text of length `l` and creating all possible input-output combinations from it (there are `l-nchars` such combinations). These form one minibatch, so the minibatch size differs at each training step.


In [4]:
nchars = 100

def get_batch(s,nchars=nchars):
    ins = torch.zeros(len(s)-nchars,nchars,dtype=torch.long,device=device)
    outs = torch.zeros(len(s)-nchars,nchars,dtype=torch.long,device=device)
    for i in range(len(s)-nchars):
        ins[i] = enc(s[i:i+nchars])
        outs[i] = enc(s[i+1:i+nchars+1])
    return ins,outs

get_batch(train_dataset[0][1])

(tensor([[ 0,  1,  2,  ..., 28, 29, 30],
         [ 1,  2,  2,  ..., 29, 30,  1],
         [ 2,  2,  3,  ..., 30,  1, 16],
         ...,
         [20,  8, 21,  ...,  1, 28,  1],
         [ 8, 21,  5,  ..., 28,  1, 27],
         [21,  5,  8,  ...,  1, 27, 16]]),
 tensor([[ 1,  2,  2,  ..., 29, 30,  1],
         [ 2,  2,  3,  ..., 30,  1, 16],
         [ 2,  3,  4,  ...,  1, 16, 26],
         ...,
         [ 8, 21,  5,  ..., 28,  1, 27],
         [21,  5,  8,  ...,  1, 27, 16],
         [ 5,  8,  9,  ..., 27, 16,  6]]))
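
The same windowing can be seen more easily on raw characters. The following is a small sketch that works on a plain string instead of encoded tensors; the helper name `sliding_pairs` and the sample text are made up for this illustration.

```python
def sliding_pairs(text, nchars):
    """Return all (input, target) windows of length nchars.

    Each target is the input window shifted one character to the left.
    """
    pairs = []
    for i in range(len(text) - nchars):
        pairs.append((text[i:i + nchars], text[i + 1:i + nchars + 1]))
    return pairs

pairs = sliding_pairs("HELLO WORLD", nchars=5)
for x, y in pairs[:3]:
    print(repr(x), "->", repr(y))
print(len(pairs))   # number of windows is len(text) - nchars
```

A text of length `l` yields `l - nchars` pairs, which is why the first dimension of the tensors returned by `get_batch` depends on the length of the input text.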

Now let's define the generator network. It can be based on any recurrent cell discussed in the previous unit (simple, LSTM, or GRU). In this example we will use an LSTM.

Because the network takes characters as input and the vocabulary is fairly small, we do not need an embedding layer; one-hot-encoded input can go directly into the LSTM cell. However, because we pass character numbers as input, we need to one-hot-encode them before passing them to the LSTM. This is done by calling the `one_hot` function during the `forward` pass. The output encoder is a linear layer that converts the hidden state into a vector of `vocab_size` scores, one per character.


In [5]:
class LSTMGenerator(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.rnn = torch.nn.LSTM(vocab_size,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, s=None):
        x = torch.nn.functional.one_hot(x,vocab_size).to(torch.float32)
        x,s = self.rnn(x,s)
        return self.fc(x),s

During training, we want to be able to sample generated text. To do that, we define a `generate` function that produces an output string of length `size`, starting from the initial string `start`.

It works as follows. First, we pass the whole start string through the network and take the output state `s` and the next predicted character `out`. Since `out` holds one score per vocabulary character, we take `argmax` to get the index `nc` of the character in the vocabulary, and use `itos` to find the actual character and append it to the resulting list of characters `chars`. This process of generating one character is repeated `size` times to produce the required number of characters.


In [8]:
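def generate(net,size=100,start='today '):
    chars = list(start)
    # Feed the whole start string to get the state and the first prediction
    out, s = net(enc(chars).view(1,-1).to(device))
    for i in range(size):
        # Greedy choice: index of the highest-scoring next character
        nc = torch.argmax(out[0][-1])
        chars.append(vocab.get_itos()[nc])
        # Feed the chosen character back in, carrying the state forward
        out, s = net(nc.view(1,-1),s)
    return ''.join(chars)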

Now let's do the training! The training loop is almost the same as in all our previous examples, but instead of accuracy we print sampled generated text every 1000 samples.

Pay special attention to the way we calculate the loss. We need to compute the loss from the network output `out`, which holds one unnormalized score per vocabulary character, and the expected text `text_out`, which is a list of character indices. Luckily, the `cross_entropy` function expects unnormalized network output as its first argument and class numbers as its second, which is exactly what we have. It also averages automatically over the minibatch.

We also limit the training to `samples_to_train` samples so that we do not wait too long. We encourage you to experiment with longer training, possibly for several epochs (in which case you would need to add another loop around this code).
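
To make the loss computation more concrete, here is a rough plain-Python sketch of the quantity that cross-entropy with mean reduction computes from unnormalized scores and class indices. The function below is a hand-written stand-in, not the PyTorch implementation, and the tiny `out` and `text_out` values are invented for illustration.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-softmax probability of the target class.

    logits: list of score lists (one per character position), unnormalized
    targets: list of class indices, one per position
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)

# Hypothetical network output with shape (batch=2, nchars=2, vocab_size=3)
out = [
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    [[0.0, 0.0, 2.0], [1.0, 1.0, 1.0]],
]
text_out = [[0, 1], [2, 0]]   # expected character indices

# Flatten so that every character position becomes one classification example
flat_logits = [scores for seq in out for scores in seq]
flat_targets = [t for seq in text_out for t in seq]

loss = cross_entropy(flat_logits, flat_targets)
print(loss)
```

The two flattening steps correspond to `out.view(-1,vocab_size)` and `text_out.flatten()` in the training loop: the batch and sequence dimensions are merged so that the loss is averaged over all character positions.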


In [9]:
net = LSTMGenerator(vocab_size,64).to(device)

samples_to_train = 10000
optimizer = torch.optim.Adam(net.parameters(),0.01)
loss_fn = torch.nn.CrossEntropyLoss()
net.train()
for i,x in enumerate(train_dataset):
    # x[0] is class label, x[1] is text
    if len(x[1])-nchars<10:
        continue
    samples_to_train-=1
    if not samples_to_train: break
    text_in, text_out = get_batch(x[1])
    optimizer.zero_grad()
    out,s = net(text_in)
    loss = torch.nn.functional.cross_entropy(out.view(-1,vocab_size),text_out.flatten()) #cross_entropy(out,labels)
    loss.backward()
    optimizer.step()
    if i%1000==0:
        print(f"Current loss = {loss.item()}")
        print(generate(net))

Current loss = 4.398899078369141
today sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr sr s
Current loss = 2.161320447921753
today and to the tor to to the tor to to the tor to to the tor to to the tor to to the tor to to the tor t
Current loss = 1.6722588539123535
today and the court to the could to the could to the could to the could to the could to the could to the c
Current loss = 2.423795223236084
today and a second to the conternation of the conternation of the conternation of the conternation of the 
Current loss = 1.702607274055481
today and the company to the company to the company to the company to the company to the company to the co
Current loss = 1.692358136177063
today and the company to the company to the company to the company to the company to the company to the co
Current loss = 1.9722288846969604
today and the control the control the control the control the control the control the control the control 
Current loss = 1.8

This example already generates some fairly good text, but it can be improved in several ways:

* **Better minibatch generation**. We prepared the training data by generating one minibatch from one sample. This is not ideal, because minibatches all have different sizes, and some cannot be generated at all because the text is shorter than `nchars`. Small minibatches also do not load the GPU sufficiently. It would be wiser to take one large chunk of text from all samples, generate all input-output pairs, shuffle them, and build minibatches of equal size.

* **Multilayer LSTM**. It makes sense to try 2 or 3 layers of LSTM cells. As mentioned in the previous unit, each LSTM layer extracts certain patterns from text; in a character-level generator we can expect the lower LSTM level to be responsible for extracting syllables, and the higher levels for words and word combinations. This can be implemented simply by passing a number-of-layers parameter to the LSTM constructor.

* You may also want to experiment with **GRU units** to see which perform better, and with **different hidden layer sizes**. A hidden layer that is too large may lead to overfitting (e.g. the network learns the exact text), while one that is too small may not produce good results.
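
The first suggestion in the list, equal-size shuffled minibatches, could be sketched roughly as follows. This is one possible approach in plain Python, not the notebook's method: the generator `make_batches`, its parameters, and the sample texts are all hypothetical, and joining texts with a space means some windows span two samples.

```python
import random

def make_batches(texts, nchars, batch_size, seed=0):
    """Yield equal-size lists of (input, target) string pairs drawn from all texts."""
    corpus = " ".join(texts)   # one large chunk of text from all samples
    pairs = [(corpus[i:i + nchars], corpus[i + 1:i + nchars + 1])
             for i in range(len(corpus) - nchars)]
    random.Random(seed).shuffle(pairs)   # mix windows from different samples
    # Skip the final incomplete batch so that every batch has the same size
    for start in range(0, len(pairs) - batch_size + 1, batch_size):
        yield pairs[start:start + batch_size]

# Hypothetical short texts, used only for illustration
texts = ["the company said", "markets rose today", "a short one"]
batches = list(make_batches(texts, nchars=8, batch_size=4))
print(len(batches), [len(b) for b in batches][:3])
```

In a real training loop, each string in a batch would still need to be encoded (for example with `enc`) and stacked into tensors before being passed to the network.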


## Soft text generation and temperature

In the previous definition of `generate`, we always took the character with the highest probability as the next character in the generated text. This often caused the text to "cycle" between the same character sequences again and again, as in this example:
```
today of the second the company and a second the company ...
```

However, if we look at the probability distribution for the next character, the difference between the few highest probabilities may not be large, e.g. one character may have probability 0.2 and another 0.19. For example, when looking for the next character in the sequence '*play*', the next character can equally well be a space or **e** (as in the word *player*).

This leads to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest can still lead to meaningful text. It is wiser to **sample** characters from the probability distribution given by the network output.
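
The difference between always picking the most likely character and sampling can be illustrated with a small sketch that uses Python's standard `random.choices` rather than PyTorch. The candidate characters and their probabilities are invented for this example.

```python
import random
from collections import Counter

# Hypothetical next-character distribution after the prefix "play"
chars = [" ", "e", "s", "i"]
probs = [0.40, 0.35, 0.15, 0.10]

# Greedy choice: always the single most likely character
greedy = chars[probs.index(max(probs))]

# Sampling: each character is drawn in proportion to its probability
rng = random.Random(0)
samples = rng.choices(chars, weights=probs, k=1000)
counts = Counter(samples)

print("greedy:", repr(greedy))
print("sampled:", counts.most_common())
```

The greedy rule returns the same character every time, while sampling also produces the less likely continuations, roughly as often as their probabilities suggest.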

This sampling can be done using the `multinomial` function, which implements the so-called **multinomial distribution**. A function that implements this **soft** text generation is defined below:


In [10]:
def generate_soft(net,size=100,start='today ',temperature=1.0):
    chars = list(start)
    out, s = net(enc(chars).view(1,-1).to(device))
    for i in range(size):
        #nc = torch.argmax(out[0][-1])
        # Divide the scores by the temperature, then exponentiate
        # to get unnormalized probabilities
        out_dist = out[0][-1].div(temperature).exp()
        # Draw the next character index in proportion to those probabilities
        nc = torch.multinomial(out_dist,1)[0]
        chars.append(vocab.get_itos()[nc])
        out, s = net(nc.view(1,-1),s)
    return ''.join(chars)
    
for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"--- Temperature = {i}\n{generate_soft(net,size=300,start='Today ',temperature=i)}\n")

--- Temperature = 0.3
Today and a company and complete an all the land the restrational the as a security and has provers the pay to and a report and the computer in the stand has filities and working the law the stations for a company and with the company and the final the first company and refight of the state and and workin

--- Temperature = 0.8
Today he oniis its first to Aus bomblaties the marmation a to manan  boogot that pirate assaid a relaid their that goverfin the the Cappets Ecrotional Assonia Cition targets it annight the w scyments Blamity #39;s TVeer Diercheg Reserals fran envyuil that of ster said access what succers of Dour-provelith

--- Temperature = 1.0
Today holy they a 11 will meda a toket subsuaties, engins for Chanos, they's has stainger past to opening orital his thempting new Nattona was al innerforder advan-than #36;s night year his religuled talitatian what the but with Wednesday to Justment will wemen of Mark CCC Camp as Timed Nae wome a leaders

--- Temper

We have introduced one more parameter called **temperature**, which indicates how strongly we should stick to the highest probability. If the temperature is 1.0, we do fair multinomial sampling; as the temperature goes to infinity, all probabilities become equal and we select the next character at random. In the example above, we can see that the text becomes meaningless when the temperature is increased too much, and it resembles "cycled" hard-generated text when the temperature approaches 0.
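
The effect of temperature on the distribution can be checked numerically. Below is a minimal sketch in plain Python; the function `softmax_with_temperature` and the three scores are assumptions made for illustration. It normalizes explicitly, whereas `generate_soft` leaves the normalization to `multinomial`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, 0.5]   # hypothetical scores for three candidate characters
for t in [0.3, 1.0, 1.8, 100.0]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures concentrate almost all probability on the top-scoring character, which approaches greedy generation, while very high temperatures make the candidates nearly equally likely.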


---

<!-- CO-OP TRANSLATOR DISCLAIMER START -->
**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original document in its native language should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.
<!-- CO-OP TRANSLATOR DISCLAIMER END -->
