# 4. Generative networks

Recurrent neural networks (RNNs) and their gated variants, such as Long Short-Term Memory (LSTM) cells and Gated
Recurrent Units (GRUs), provide a mechanism for language modeling: they can learn word ordering and predict the
next word in a sequence. This allows us to use RNNs for generative tasks such as ordinary text generation,
machine translation, and even image captioning.

In the RNN architecture of the previous section, each RNN unit produced the next hidden state as its output. However,
we can also add another output to each recurrent unit, which lets us produce an output sequence that is equal in
length to the input sequence. Moreover, we can use RNN units that do not accept an input at each step; they take
only an initial state vector and then produce a sequence of outputs.
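
As a minimal sketch of the "extra output per unit" idea, the snippet below runs a plain `nn.RNN` over a batch of token ids and maps every hidden state to vocabulary logits with a linear layer. The layer sizes, the random input batch, and the names `embedding`, `rnn`, and `to_logits` are assumptions chosen for illustration; they are not taken from the previous section.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not values from the tutorial's models)
vocab_size, embed_dim, hidden_dim = 82, 16, 32

embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
# The additional output: one linear projection applied at every time step
to_logits = nn.Linear(hidden_dim, vocab_size)

# A random batch of 4 sequences, each 10 token ids long
tokens = torch.randint(0, vocab_size, (4, 10))

# `outputs` holds the hidden state at every step, `last_hidden` only the final one
outputs, last_hidden = rnn(embedding(tokens))
logits = to_logits(outputs)

print(outputs.shape)      # torch.Size([4, 10, 32])
print(last_hidden.shape)  # torch.Size([1, 4, 32])
print(logits.shape)       # torch.Size([4, 10, 82])
```

The output has one logit vector per input position, so the output sequence has the same length as the input sequence.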

This allows for the different neural architectures shown in the picture below:

![various_rnn](https://raw.githubusercontent.com/pengfei99/PyTorchTuto/main/notebooks/img/various-rnn-architecture.jpg)

Each rectangle is a vector, and arrows represent functions (e.g. matrix multiplication). Input vectors are in red,
output vectors are in blue, and green vectors hold the RNN's state.

- One-to-one is a traditional neural network with one input and one output.
- One-to-many is a generative architecture that accepts one input value and generates a sequence of output values. For example, to train an image captioning network that produces a textual description of a picture, we can take a picture as input, pass it through a CNN to obtain the hidden state, and then have a recurrent chain generate the caption word by word.
- Many-to-one corresponds to the RNN architectures described in the previous unit, such as text classification.
- Many-to-many, or sequence-to-sequence, corresponds to tasks such as machine translation, where a first RNN collects all information from the input sequence into the hidden state, and another RNN chain unrolls this state into the output sequence.
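
The one-to-many case from the list above can be sketched as a loop that starts from an initial state and feeds each generated token back in as the next input. This is only an illustration under several assumptions: the weights are untrained, a random vector stands in for the CNN image features, the next token is picked greedily with `argmax`, and the names `cell`, `to_logits`, and `max_len` are hypothetical.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative sizes (assumptions)
vocab_size, hidden_dim, max_len = 82, 32, 8

cell = nn.GRUCell(vocab_size, hidden_dim)
to_logits = nn.Linear(hidden_dim, vocab_size)

# Initial state: a random vector standing in for features from a CNN
hidden = torch.randn(1, hidden_dim)
# An all-zero vector plays the role of an empty "start" input
step_input = torch.zeros(1, vocab_size)

generated = []
with torch.no_grad():
    for _ in range(max_len):
        hidden = cell(step_input, hidden)
        # Greedy choice of the next token id, shape (1,)
        next_id = to_logits(hidden).argmax(dim=-1)
        generated.append(next_id.item())
        # Feed the chosen token back in as a one-hot vector
        step_input = nn.functional.one_hot(next_id, vocab_size).float()

print(len(generated))  # 8
```

Because the network is untrained, the generated ids are meaningless; the point is only the control flow of unrolling a state into a sequence.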

For more on these RNN architectures, see Andrej Karpathy's post "The Unreasonable Effectiveness of Recurrent Neural Networks": http://karpathy.github.io/2015/05/21/rnn-effectiveness/


In this section, we will focus on simple generative models that help us generate text. For simplicity, let's build
a character-level network, which generates text letter by letter. During training, we need to take some text corpus
and split it into letter sequences.
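
One common way to split a corpus into letter sequences is a sliding window, where the target is the input shifted by one character. The sketch below shows that idea in plain Python. The helper name `make_training_pairs` and the window length are assumptions for illustration, and the tutorial's own training code may split the text differently.

```python
def make_training_pairs(text, seq_len):
    """Slide a window over `text` and return (input, target) string pairs.

    The target is the input shifted one character to the right, so at each
    position the network is asked to predict the next character.
    """
    pairs = []
    for start in range(len(text) - seq_len):
        chunk = text[start:start + seq_len + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


pairs = make_training_pairs("hello world", seq_len=5)
print(pairs[0])    # ('hello', 'ello ')
print(len(pairs))  # 6
```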

We continue to use the AG News dataset.

In [1]:
import torch
import torchtext
import numpy as np
from torchnlp import *
from torchtext.vocab import vocab
from collections import Counter, OrderedDict

In [2]:
# download data
def load_dataset(storage_path):
    print("Loading dataset...")
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root=storage_path)
    train_dataset = list(train_dataset)
    test_dataset = list(test_dataset)
    return train_dataset, test_dataset


path = "/tmp/pytorch/data"
train, test = load_dataset(path)



Loading dataset...


In [3]:
# build label class list
label_classes = ['World', 'Sports', 'Business', 'Sci/Tech']

## 4.1 Building character vocabulary

To build a character-level generative network, we need to split text into individual characters instead of words.
This can be done by defining a different tokenizer:

In [4]:
# a simple tokenizer that splits a text into its individual characters
def char_tokenizer(text):
    return list(text)


# build a vocabulary from the character tokens of every text in the dataset
def build_char_vocabulary(dataset, ngrams=1, min_freq=1):
    # a Counter stores each token together with its frequency
    counter = Counter()
    # iterate over all rows, split each text into character tokens, and count them
    for (label, line) in dataset:
        counter.update(torchtext.data.utils.ngrams_iterator(char_tokenizer(line), ngrams=ngrams))
    # sort the collected tokens by frequency, most frequent first
    sorted_by_token_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
    # keep that order in an OrderedDict mapping token -> frequency
    char_counts = OrderedDict(sorted_by_token_freq_tuples)
    # build the vocabulary from the ordered token counts
    return vocab(char_counts, min_freq=min_freq)

# build a character vocab
my_vocab = build_char_vocabulary(train)
my_vocab_size = len(my_vocab)

In [8]:
print(f"Vocabulary size = {my_vocab_size}")
print(f"Encoding of 'a' is {my_vocab.get_stoi()['a']}")
print(f"Character with code 13 is {my_vocab.get_itos()[13]}")

Vocabulary size = 82
Encoding of 'a' is 2
Character with code 13 is u


## 4.2 Building an encoder
This encoder can translate a text sequence to a tensor by using the character vocabulary that we built above.

In [13]:
# This encoder uses a character vocabulary and a character tokenizer instead of word-level ones.
def char_encode(text, char_vocab, tokenizer):
    return [char_vocab[char] for char in tokenizer(text)]

# convert a text to a tensor of character ids
def text_to_tensor(x):
    return torch.LongTensor(char_encode(x, my_vocab, tokenizer=char_tokenizer))

# show an example
print(f"source text: {train[0][1]}")
print(f"generated tensor: {text_to_tensor(train[0][1])}")


source text: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
generated tensor: tensor([41,  2,  9,  9,  0, 24,  3, 21,  0, 36,  1,  2,  8,  7,  0, 29,  9,  2,
        19,  0, 36,  2, 12, 23,  0, 32,  6,  3,  4,  0,  3, 11,  1,  0, 36,  9,
         2, 12, 23,  0, 53, 35,  1, 13,  3,  1,  8,  7, 54,  0, 35,  1, 13,  3,
         1,  8,  7,  0, 27,  0, 24, 11,  4,  8,  3, 27,  7,  1,  9,  9,  1,  8,
         7, 25,  0, 41,  2,  9,  9,  0, 24,  3,  8,  1,  1,  3, 56,  7,  0, 10,
        19,  5,  6, 10,  9,  5,  6, 16, 59, 20,  2,  6, 10,  0,  4, 17,  0, 13,
         9,  3,  8,  2, 27, 12, 18,  6,  5, 12,  7, 25,  0,  2,  8,  1,  0,  7,
         1,  1,  5,  6, 16,  0, 16,  8,  1,  1,  6,  0,  2, 16,  2,  5,  6, 21])
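
To read generated output later, the encoding has to be reversible. The following self-contained sketch mimics the frequency-sorted vocabulary with plain Python and checks that encoding followed by decoding returns the original text. The toy corpus and the names `itos`, `stoi`, `encode`, and `decode` are hypothetical; they only mirror the idea behind `get_itos()` and `get_stoi()` used above and do not rely on torchtext.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the AG News data)
corpus = ["hello world", "held"]

# Count every character, then order characters by descending frequency,
# so that frequent characters receive small ids
counts = Counter(ch for line in corpus for ch in line)
itos = [ch for ch, _ in counts.most_common()]
stoi = {ch: idx for idx, ch in enumerate(itos)}


def encode(text):
    """Map each character to its integer id."""
    return [stoi[ch] for ch in text]


def decode(ids):
    """Map integer ids back to characters and join them into a string."""
    return "".join(itos[idx] for idx in ids)


ids = encode("hello")
print(ids)
print(decode(ids))  # hello
```

With the real vocabulary, the same round trip would go through `my_vocab.get_itos()` to turn predicted ids back into characters.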
