# 4. Generative networks

Recurrent Neural Networks (RNNs) and their gated variants, such as Long Short-Term Memory (LSTM) cells and Gated
Recurrent Units (GRUs), provide a mechanism for language modeling: they can learn word ordering and predict the
next word in a sequence. This allows us to use RNNs for generative tasks, such as ordinary text
generation, machine translation, and even image captioning.

In the RNN architecture of the previous section, each RNN unit produced the next hidden state as its output. However,
we can also add another output to each recurrent unit, which allows us to output a sequence of the same length
as the input sequence. Moreover, we can use RNN units that do not accept an input at each step: they take only
an initial state vector and then produce a sequence of outputs.

This allows for different neural architectures that are shown in the picture below:
![various_rnn](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/various-rnn-architecture.jpg)
Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red,
output vectors are in blue and green vectors hold the RNN's state.

- One-to-one is a traditional neural network with one input and one output.
- One-to-many is a generative architecture that accepts one input value and generates a sequence of output values. For example, to train an image captioning network that produces a textual description of a picture, we can take a picture as input, pass it through a CNN to obtain the hidden state, and then have a recurrent chain generate the caption word by word.
- Many-to-one corresponds to the RNN architectures described in the previous unit, such as text classification.
- Many-to-many, or sequence-to-sequence, corresponds to tasks such as machine translation, where a first RNN collects all information from the input sequence into the hidden state, and another RNN chain unrolls this state into the output sequence.
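
The difference between the many-to-many and many-to-one cases can be seen in what a recurrent layer returns. The following is a minimal sketch, assuming a single-layer `torch.nn.LSTM` with made-up sizes; the variable names and dimensions are chosen for illustration only and are not taken from this tutorial.

```python
import torch

torch.manual_seed(0)

# Arbitrary sizes chosen for illustration
batch_size, seq_len, input_dim, hidden_dim = 2, 5, 8, 16
rnn = torch.nn.LSTM(input_dim, hidden_dim, batch_first=True)

x = torch.randn(batch_size, seq_len, input_dim)
outputs, (h_n, c_n) = rnn(x)

# Many-to-many: keep one output vector per input position
per_step = outputs            # shape (batch, seq_len, hidden_dim)

# Many-to-one: keep only the hidden state after the last step
final = h_n[-1]               # shape (batch, hidden_dim)

print(per_step.shape, final.shape)
```

For a single-layer, unidirectional LSTM, the last time step of `outputs` is the same vector as `h_n[-1]`, so both views come from one forward pass.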

For more information on the various RNN architectures, see http://karpathy.github.io/2015/05/21/rnn-effectiveness/


In this section, we will focus on simple generative models that help us generate text. For simplicity, let's build
a character-level network, which generates text letter by letter. During training, we need to take some text corpus
and split it into letter sequences.

We continue to use the AG News dataset.

In [1]:
import torch
import torchtext
import numpy as np
from torchnlp import *
from torchtext.vocab import vocab
from collections import Counter, OrderedDict

In [2]:
# download data
def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset


path = "/tmp/pytorch/data"
train, test = load_dataset(path)



Loading dataset...


In [3]:
# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']

## 4.1 Building character vocabulary

To build a character-level generative network, we need to split the text into individual characters instead of words.
This can be done by defining a different tokenizer:

In [4]:
# a simple tokenizer that splits a text into its individual characters
def char_tokenizer(text):
    return list(text)


# build a vocabulary from the character tokens of all texts in the dataset
def build_char_vocabulary(dataset, ngrams=1, min_freq=1):
    # a Counter stores each token together with its frequency
    counter = Counter()
    # iterate over all rows, convert each text to character tokens, and count them
    for (label, line) in dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(char_tokenizer(line), ngrams=ngrams))
    # sort the collected tokens by descending frequency
    sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    # keep the frequency order in an OrderedDict
    tokens_dict = OrderedDict(sorted_by_token_freq_tuples)
    # build the vocabulary from the ordered tokens
    return vocab(tokens_dict, min_freq=min_freq)


# build a character vocab
my_vocab = build_char_vocabulary(train)
my_vocab_size = len(my_vocab)

In [5]:
print(f"Vocabulary size = {my_vocab_size}")
print(f"Encoding of 'a' is {my_vocab.get_stoi()['a']}")
print(f"Character with code 13 is {my_vocab.get_itos()[13]}")

Vocabulary size = 82
Encoding of 'a' is 2
Character with code 13 is u
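
To make the mapping between characters and indices concrete, here is a minimal sketch that does not depend on torchtext. The helper `build_char_maps` and the toy corpus are hypothetical and only illustrate the idea of a frequency-ordered character vocabulary; the real vocabulary above is built by `build_char_vocabulary`.

```python
from collections import Counter


def build_char_maps(texts):
    """Return (stoi, itos) with the most frequent character at index 0."""
    counter = Counter()
    for text in texts:
        counter.update(text)  # iterating over a string yields its characters
    # Sort by descending frequency; break ties alphabetically for determinism
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [ch for ch, _ in ordered]
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos


toy_corpus = ["banana band", "bandana"]  # made-up data for illustration
stoi, itos = build_char_maps(toy_corpus)

encoded = [stoi[ch] for ch in "ban"]
decoded = "".join(itos[i] for i in encoded)
print(itos, encoded, decoded)
```

Encoding followed by decoding returns the original string, which is the property the generator relies on when it turns predicted indices back into text.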


## 4.2 Building an encoder
This encoder can translate a text sequence to a tensor by using the character vocabulary that we built above.

In [6]:
# This encoder uses a character vocabulary and a character tokenizer instead of word-level ones.
def char_encode(text, char_vocab, tokenizer):
    return [char_vocab[char] for char in tokenizer(text)]


# convert text to tensor
def text_to_tensor(x):
    return torch.LongTensor(char_encode(x, my_vocab, tokenizer=char_tokenizer))


# show an example
print(f"source text: {train[0][1]}")
print(f"generated tensor: {text_to_tensor(train[0][1])}")

source text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
generated tensor: tensor([41,  2,  9,  9,  0, 24,  3, 21,  0, 36,  1,  2,  8,  7,  0, 29,  9,  2,
        19,  0, 36,  2, 12, 23,  0, 32,  6,  3,  4,  0,  3, 11,  1,  0, 36,  9,
         2, 12, 23,  0, 53, 35,  1, 13,  3,  1,  8,  7, 54,  0, 35,  1, 13,  3,
         1,  8,  7,  0, 27,  0, 24, 11,  4,  8,  3, 27,  7,  1,  9,  9,  1,  8,
         7, 25,  0, 41,  2,  9,  9,  0, 24,  3,  8,  1,  1,  3, 56,  7,  0, 10,
        19,  5,  6, 10,  9,  5,  6, 16, 59, 20,  2,  6, 10,  0,  4, 17,  0, 13,
         9,  3,  8,  2, 27, 12, 18,  6,  5, 12,  7, 25,  0,  2,  8,  1,  0,  7,
         1,  1,  5,  6, 16,  0, 16,  8,  1,  1,  6,  0,  2, 16,  2,  5,  6, 21])


## 4.3 Training a generative RNN

We will train the RNN to generate text as follows. At each step, we take a sequence of characters of
length n_chars and ask the network to predict, for each input character, the character that follows it.

![generate_rnn](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/rnn-generate.png)

Depending on the actual scenario, we may also want to include some special characters, such as end-of-sequence <eos>.
In our case, we just want to train the network for endless text generation, so we fix the size of each
sequence to n_chars tokens. Consequently, each training example consists of n_chars inputs and n_chars
outputs (the input sequence shifted one symbol to the left). A mini-batch consists of several such sequences.

We generate mini-batches by taking each news text of length l and generating all possible input-output
combinations from it (there are l - n_chars such combinations). They form one mini-batch, so the size of the
mini-batch differs at each training step.

In [7]:
n_chars = 100
device = "cpu"


def get_batch(s, nchars=n_chars):
    ins = torch.zeros(len(s) - nchars, nchars, dtype=torch.long, device=device)
    outs = torch.zeros(len(s) - nchars, nchars, dtype=torch.long, device=device)
    for i in range(len(s) - nchars):
        ins[i] = text_to_tensor(s[i:i + nchars])
        outs[i] = text_to_tensor(s[i + 1:i + nchars + 1])
    return ins, outs


get_batch(train[0][1])

(tensor([[41,  2,  9,  ..., 16, 59, 20],
         [ 2,  9,  9,  ..., 59, 20,  2],
         [ 9,  9,  0,  ..., 20,  2,  6],
         ...,
         [35,  1, 13,  ...,  2, 16,  2],
         [ 1, 13,  3,  ..., 16,  2,  5],
         [13,  3,  1,  ...,  2,  5,  6]]),
 tensor([[ 2,  9,  9,  ..., 59, 20,  2],
         [ 9,  9,  0,  ..., 20,  2,  6],
         [ 9,  0, 24,  ...,  2,  6, 10],
         ...,
         [ 1, 13,  3,  ..., 16,  2,  5],
         [13,  3,  1,  ...,  2,  5,  6],
         [ 3,  1,  8,  ...,  5,  6, 21]]))
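
The tensors above are hard to read, so the same windowing can be illustrated on plain strings. This is a small sketch with a hypothetical helper `sliding_pairs` and a toy text; it mirrors the indexing used in `get_batch` but skips the encoding step.

```python
def sliding_pairs(text, nchars):
    """Return (input, target) pairs; each target is the input shifted by one character."""
    return [(text[i:i + nchars], text[i + 1:i + nchars + 1])
            for i in range(len(text) - nchars)]


pairs = sliding_pairs("hello world", nchars=4)
for source, target in pairs[:3]:
    print(repr(source), "->", repr(target))
```

A text of length 11 with windows of 4 characters yields 11 - 4 = 7 pairs, matching the l - n_chars count described above.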

### 4.3.1 Build a generative model
Now let's define the generator model. It can be based on any recurrent cell discussed in the previous section
(simple RNN, LSTM or GRU). In this example we will use LSTM.

Because the network takes characters as input and the vocabulary is pretty small, we do not need an embedding layer;
one-hot-encoded input can go directly into the LSTM cell. However, because we pass character indices as input, we need
to one-hot-encode them before passing them to the LSTM. This is done by calling the one_hot function during the forward
pass. The output layer is a linear layer that converts the hidden state into a vector of vocab_size scores (logits),
one for each character in the vocabulary.

In [8]:
class LSTMGenerator(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.rnn = torch.nn.LSTM(vocab_size, hidden_dim, batch_first=True)
        # fc stands for fully connected layer
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, s=None):
        # we need to one-hot-encode character number
        x = torch.nn.functional.one_hot(x, self.vocab_size).to(torch.float32)
        x, s = self.rnn(x, s)
        return self.fc(x), s
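
To check how the tensor shapes change through such a model, the same three steps (one-hot encoding, LSTM, linear layer) can be traced on random indices. This is a sketch under assumed sizes: the vocabulary size of 82 matches the value printed earlier, while the batch size and sequence length are arbitrary.

```python
import torch

vocab_size, hidden_dim = 82, 64   # 82 matches the vocabulary size printed earlier
batch_size, seq_len = 3, 10       # arbitrary sizes chosen for illustration

# Random character indices standing in for an encoded batch of text
indices = torch.randint(0, vocab_size, (batch_size, seq_len))
one_hot = torch.nn.functional.one_hot(indices, vocab_size).to(torch.float32)

rnn = torch.nn.LSTM(vocab_size, hidden_dim, batch_first=True)
fc = torch.nn.Linear(hidden_dim, vocab_size)

hidden, _ = rnn(one_hot)               # (batch, seq_len, hidden_dim)
logits = fc(hidden)                    # (batch, seq_len, vocab_size)
probs = torch.softmax(logits, dim=-1)  # probabilities over characters

print(one_hot.shape, hidden.shape, logits.shape)
```

The linear layer returns unnormalized scores; applying softmax over the last dimension turns them into a probability distribution over the vocabulary at every position.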

During training, we want to be able to sample generated text. To do that, we define a generate function that
produces an output string of length size, starting from the initial string start.

It works as follows. First, we pass the whole start string through the network and take the resulting
**state** and the prediction **output**. Since the output holds one score per vocabulary character, we take argmax to
get the index **nc** of the most likely next character, use **itos** to look up the actual character, and append it to
the list of characters chars. This process of generating one character is repeated size times to produce the required
number of characters.

In [9]:
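def generate(generative_model, vocabulary, size=100, start='today '):
    chars = list(start)
    itos = vocabulary.get_itos()  # index-to-character lookup table
    # no gradients are needed while sampling text
    with torch.no_grad():
        # feed the whole start string to warm up the hidden state
        output, state = generative_model(text_to_tensor(chars).view(1, -1).to(device))
        for i in range(size):
            # greedy choice: index of the highest-scoring next character
            nc = torch.argmax(output[0][-1])
            chars.append(itos[nc])
            # feed the chosen character back in, carrying the state forward
            output, state = generative_model(nc.view(1, -1), state)
    return ''.join(chars)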

In [10]:
my_generative_model = LSTMGenerator(my_vocab_size, 64).to(device)

samples_to_train = 10000
optimizer = torch.optim.Adam(my_generative_model.parameters(), 0.01)
loss_fn = torch.nn.CrossEntropyLoss()
my_generative_model.train()
for i, x in enumerate(train):
    # x[0] is class label, x[1] is text
    # skip texts that are too short to yield at least 10 training sequences
    if len(x[1]) - n_chars < 10:
        continue
    samples_to_train -= 1
    if not samples_to_train:
        break
    text_in, text_out = get_batch(x[1])
    optimizer.zero_grad()
    out, s = my_generative_model(text_in)
    # flatten to (batch * n_chars, vocab_size) logits and (batch * n_chars,) targets
    loss = loss_fn(out.view(-1, my_vocab_size), text_out.flatten())
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
        print(f"Current loss = {loss.item()}")
        print(generate(my_generative_model, my_vocab))

Current loss = 4.416311740875244
today daaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
Current loss = 2.1311776638031006
today and a the the the the the the the the the the the the the the the the the the the the the the the th
Current loss = 1.6439659595489502
today and a deling a deside a deal a deside a deal a deside a deal a deside a deal a deside a deal a desid
Current loss = 2.417975425720215
today and the United Stater and the United Stater and the United Stater and the United Stater and the Unit
Current loss = 1.6155403852462769
today and the start to the company to the company to the company to the company to the company to the comp
Current loss = 1.7325990200042725
today to a second the second the second the second the second the second the second the second the second 
Current loss = 1.905916690826416
today and the final the first the first the first the first the first the first the first the first the fi
Current loss = 1
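
The loss call in the training loop flattens the model output and the targets before computing cross-entropy. The sketch below, using made-up tensor sizes, checks that this flattened call equals the average negative log-probability of the correct character over all positions. It also computes the loss of a uniform guess over 82 characters, which is about 4.41 and in line with the first loss value printed above.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 4, 6, 10  # hypothetical sizes for illustration

logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Same reshaping as in the training loop: (batch * seq_len, vocab) vs (batch * seq_len,)
flat_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.flatten())

# The same quantity computed by hand: mean negative log-probability of each target
log_probs = F.log_softmax(logits, dim=-1)
picked = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = -picked.mean()

# Loss of a model that assigns equal probability to each of 82 characters
uniform_loss = math.log(82)

print(flat_loss.item(), manual_loss.item(), uniform_loss)
```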

The above example already generates some pretty good text, but it can be further improved in several ways:

- **Better mini-batch generation**: We prepared the training data by generating one mini-batch from each
sample. This is not ideal, because the mini-batches all have different sizes, and some cannot be generated at all
because the text is shorter than n_chars. Also, small mini-batches do not keep the GPU sufficiently busy. It would be
wiser to take one large chunk of text from all samples, generate all input-output pairs, shuffle them, and
form mini-batches of equal size.
- **Multilayer LSTM**: It makes sense to try 2 or 3 layers of LSTM cells. As we mentioned in the previous section,
each layer of LSTM extracts certain patterns from text, and in the case of a character-level generator we can expect
the lower LSTM level to be responsible for extracting syllables, and higher levels for words and word combinations.
This can be implemented simply by passing the num_layers parameter to the LSTM constructor.
- **GRU units**: You may also want to experiment with GRU units and see which ones perform better, and with
different hidden layer sizes. Too large a hidden layer may result in overfitting (e.g. the network will learn the exact
text), while a smaller size might not produce good results.
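
The first two suggestions can be sketched together. The following is one possible approach, not the method used in this tutorial: the class `StackedGenerator`, the helper `make_loader`, and all sizes are hypothetical, and a random tensor stands in for a concatenated, encoded corpus.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


class StackedGenerator(torch.nn.Module):
    """Hypothetical generator variant with a configurable cell type and depth."""

    def __init__(self, vocab_size, hidden_dim, num_layers=2, cell="lstm"):
        super().__init__()
        self.vocab_size = vocab_size
        rnn_cls = torch.nn.LSTM if cell == "lstm" else torch.nn.GRU
        self.rnn = rnn_cls(vocab_size, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, s=None):
        x = torch.nn.functional.one_hot(x, self.vocab_size).to(torch.float32)
        x, s = self.rnn(x, s)
        return self.fc(x), s


def make_loader(encoded, nchars, batch_size):
    """Cut one long encoded text into shifted windows and serve shuffled, equal-size batches."""
    windows = encoded.unfold(0, nchars + 1, 1)  # shape (len - nchars, nchars + 1)
    dataset = TensorDataset(windows[:, :-1], windows[:, 1:])
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)


vocab_size = 30
encoded = torch.randint(0, vocab_size, (500,))  # stand-in for an encoded corpus
loader = make_loader(encoded, nchars=20, batch_size=32)
model = StackedGenerator(vocab_size, hidden_dim=16, num_layers=2, cell="gru")

text_in, text_out = next(iter(loader))
logits, state = model(text_in)
print(text_in.shape, logits.shape, state.shape)
```

One caveat of this design is that windows cut from a concatenated corpus can span the boundary between two articles; whether that matters depends on the use case.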

## 4.4 Soft text generation and temperature

In the previous example, when generating a character, we always took the character with the highest probability as
the next character in the generated text. As a result, the text often "cycled" through the same
character sequences again and again, as in this example:

"today to a second the second the second the second the second the second the second the second"

However, if we look at the probability distribution for the next character, the difference between
the few highest probabilities may be small, e.g. one character can have probability 0.2, another 0.19, etc. For example,
when looking for the next character in the sequence 'play', the next character can equally well be either a space or 'e'
(as in the word player).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability,
because choosing the second highest might still lead us to meaningful text. It is wiser to sample characters from the
probability distribution given by the network output.

This sampling can be done using the torch.multinomial function, which draws samples from a multinomial distribution.
A function that implements this soft text generation is defined below:

In [12]:
def generate_soft(generative_model, vocabulary, size=100, start='today ', temperature=1.0):
    chars = list(start)
    itos = vocabulary.get_itos()  # index-to-character lookup table
    with torch.no_grad():
        output, state = generative_model(text_to_tensor(chars).view(1, -1).to(device))
        for i in range(size):
            # scale the logits by the temperature and turn them into probabilities
            out_dist = torch.softmax(output[0][-1] / temperature, dim=-1)
            # draw one character index at random according to these probabilities
            number_of_char = torch.multinomial(out_dist, 1)[0]
            chars.append(itos[number_of_char])
            output, state = generative_model(number_of_char.view(1, -1), state)
    return ''.join(chars)

for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"--- Temperature = {i}\n{generate_soft(my_generative_model,my_vocab,size=300,start='Today ',temperature=i)}\n")

--- Temperature = 0.3
Today said to last for the starth and a the the first the serviration as the straight with the straight and the the sign the gand the controce the pay his company with the manages to and the computer to the manage and the win the straight of the straight to has the last the the relat the investors a with 

--- Temperature = 0.8
Today to even profits a Dates the a thing the compat world found telephes, have and matchnoces fall and beats he the July the country in Google build-wiound a United Porting of victang the early #39;s schapi-provorter a there rise, id macritially stash as all has noman to cir new selmped to policos the th

--- Temperature = 1.0
Today lauly barting Olymsabive agreed on Webbly Fran Paigua Gurs, in 4 yesterday and same have team to cull beek patforering those a Frand back Web Kharta NR.Pje?tirsly wretain in fuchades one frocing that Bost sinal procectimt NEW YORENTOBB. throuch fout polis wireworth vace lay amolom in leats of a the 

--- Temper

We have introduced one more parameter called temperature, which indicates how closely we should stick to the
highest probability. If the temperature is 1.0, we do fair multinomial sampling, and as the temperature goes to
infinity, all probabilities become equal and we select the next character uniformly at random. In the output above we
can observe that the text becomes meaningless when we increase the temperature too much, and that it resembles the
"cycled" hard-generated text as the temperature gets closer to 0.
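
The effect of temperature can also be examined in isolation from the network. The sketch below uses plain Python and four made-up scores; the helper `softmax_with_temperature` is hypothetical and computes the same distribution as applying softmax to the scores divided by the temperature.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


logits = [2.0, 1.9, 0.5, -1.0]  # hypothetical scores for four characters
for t in (0.3, 1.0, 1.8, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top-scoring character, which approaches the argmax behavior, while very high temperatures flatten the distribution toward uniform.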