# Sentence Generation from Language Model

This tutorial demonstrates how to generate text using a pre-trained language model in the following two ways:

- with a sequence sampler
- with a beam search sampler

Notation used when generating sequences:

- V = vocabulary size
- T = sequence length
- V^T = the number of possible sequences of length T

Given a language model, we can generate sequences according to the probability the model assigns to them. At each time step, a language model predicts the likelihood of each word occurring, given the context from prior time steps. The output at any time step can be any word from the vocabulary, whose size is V, so the number of possible outcomes for a sequence of length T is V^T.
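To get a feel for how quickly V^T grows, the short sketch below counts the candidate sequences for a few lengths. The vocabulary size used here is only an illustrative assumption; for a loaded model you would use the size of its vocabulary instead.

```python
# Illustrative value only: V is an assumed vocabulary size, not read from a model.
V = 50000


def num_sequences(vocab_size, length):
    """Number of distinct token sequences of the given length."""
    return vocab_size ** length


for T in (1, 2, 5, 10):
    count = num_sequences(V, T)
    # Report the order of magnitude rather than the full integer.
    print(f"T={T:>2}: about 10^{len(str(count)) - 1} possible sequences")
```

Even at short lengths the count is far too large to enumerate, which is why the samplers below explore only a small part of the space.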

While sometimes we might want to generate sentences according to their probability of occurring, at other times we want to find the sentences that *are most likely to occur*. This is especially true in the case of language translation, where we don't just want to see *a* translation. We want the *best* translation. While finding the optimal outcome quickly becomes intractable as the number of time steps increases, there are still many ways to sample reasonably good sequences. GluonNLP provides two samplers for generating from a language model: SequenceSampler and BeamSearchSampler.

First import the libraries:

In [1]:
import numpy as np
import mxnet as mx
import gluonnlp as nlp
import text_generation.model

## Load Pretrained Language Model

In [2]:
# change to mx.gpu(0) if a GPU is available
ctx = mx.cpu()

model, vocab = text_generation.model.get_model(name='gpt2_117m',
                                               dataset_name='openai_webtext',
                                               pretrained=True,
                                               ctx=ctx)
tokenizer = nlp.data.GPT2BPETokenizer()
detokenizer = nlp.data.GPT2BPEDetokenizer()

eos_id = vocab[vocab.eos_token]
print(vocab.eos_token)

BPE rank file is not found. Downloading.
Downloading /home/ec2-user/.mxnet/models/1562131726.4061465openai_webtext_bpe_ranks-396d4d8e.json from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/vocab/openai_webtext_bpe_ranks-396d4d8e.zip...
<|endoftext|>


## Sampling a Sequence


### Sequence Sampler


A SequenceSampler samples from the contextual multinomial distribution produced by the language model at each time step. Since we may want to control how "sharp" the distribution is to trade off diversity against correctness, we can use the temperature option in SequenceSampler, which controls the temperature of the softmax function.
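The effect of the temperature can be seen in a minimal, library-independent sketch. The logits below are made-up values for a four-word vocabulary, and `softmax_with_temperature` is a hypothetical helper written for this illustration; it is not part of GluonNLP.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a four-word vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

# Draw one token index from the smoothed distribution.
rng = random.Random(0)
probs = softmax_with_temperature(logits, 0.97)
token_id = rng.choices(range(len(logits)), weights=probs, k=1)[0]
print(token_id)
```

Temperatures below 1 concentrate probability on the highest-scoring words, while temperatures above 1 flatten the distribution and increase diversity.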

For each input sample, the sequence sampler can sample multiple independent sequences at once. The number of independent sequences to sample is specified through the argument `beam_size`.

In [3]:
bos_str = 'Deep learning and natural language processing'
if not bos_str.startswith(' '):
    bos_str = ' ' + bos_str
bos_tokens = tokenizer(bos_str)
bos_ids = vocab[bos_tokens]
print(bos_tokens)

['ĠDeep', 'Ġlearning', 'Ġand', 'Ġnatural', 'Ġlanguage', 'Ġprocessing']


#### Define the Decoder

In [4]:
class GPT2Decoder(text_generation.model.LMDecoder):
    def __call__(self, inputs, states):
        inputs = inputs.expand_dims(axis=1)
        out, new_states = self.net(inputs, states)
        out = mx.nd.slice_axis(out, axis=1, begin=0, end=1).reshape((inputs.shape[0], -1))
        return out, new_states
    
decoder = GPT2Decoder(model)

#### Define the initial state

In [5]:
def get_initial_input_state(decoder, bos_ids, temperature):
    inputs, begin_states = decoder.net(
        mx.nd.array([bos_ids], dtype=np.int32, ctx=ctx), None)
    inputs = inputs[:, -1, :]
    smoothed_probs = (inputs / temperature).softmax(axis=1)
    inputs = mx.nd.sample_multinomial(smoothed_probs, dtype=np.int32)
    return inputs, begin_states

#### Define the Sampler

In [6]:
# number of independent sequences to search
beam_size = 2
temperature = 0.97
num_results = 2
# must be less than 1024
max_len = 256 - len(bos_tokens)
sampler = nlp.model.SequenceSampler(beam_size=beam_size,
                                    decoder=decoder,
                                    eos_id=eos_id,
                                    max_length=max_len,
                                    temperature=temperature)

#### Generate result

In [7]:
def generate(decoder, bos_ids, temperature, sampler, num_results, vocab):
    inputs, begin_states = get_initial_input_state(decoder, bos_ids, temperature)
    # samples have shape (1, beam_size, length), scores have shape (1, beam_size)
    samples, scores, valid_lengths = sampler(inputs, begin_states)
    samples = samples[0].asnumpy()
    scores = scores[0].asnumpy()
    valid_lengths = valid_lengths[0].asnumpy()

    print('Generation Result:')
    for i in range(num_results):
        generated_tokens = [vocab.idx_to_token[ele] for ele in samples[i][:valid_lengths[i]]]
        tokens = bos_tokens + generated_tokens[1:]
        print([detokenizer(tokens).strip(), scores[i]])

In [8]:
generate(decoder, bos_ids, temperature, sampler, num_results, vocab)

Generation Result:
["Deep learning and natural language processing serious improvements over existing programming languages.\n\nNo framework or language developers have yet heard about the Jesse Jordan Brain Process at MIT. But many of us are largely happy to see the large class projects up and running, enabling deep learning and learning from there, while oriented towards machine learning more towards machine learning. If you think programming conscious, deep learning libraries like Java, try a deep learning and machine learning designs such as Machine Learning for python so that can be executed by trained on your projects.\n\n\n\n\n\nWithout question meditation or introspection and explorational languages you're just need invested in machine learning\n\n\n\nIt can be fun but multi-to look at a pipelines\nIt's and datasets\nMore deep learning and reading depth bonds. For more techniques around deep learning, ie data has a multin that nice frameworks like deep learning in one of deep l

### Beam Search Sampler

To overcome the exponential complexity in sequence decoding, beam search decodes greedily, keeping only those sequences that are most likely based on the probability up to the current time step. The size of this subset is called the *beam size*. Suppose the beam size is K and the output vocabulary size is V. When selecting the beams to keep, the beam search algorithm first predicts all possible successor words from the previous K beams, each of which has V possible outputs. This gives a total of K\*V paths. Beam search ranks these K\*V paths by their score and keeps only the top K.
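A single expansion step of this procedure can be sketched in plain Python. This is a toy illustration under simplifying assumptions (no EOS handling, no length penalty, made-up probabilities), and `beam_search_step` is a hypothetical helper, not the GluonNLP implementation.

```python
import heapq
import math


def beam_search_step(beams, next_log_probs, beam_size):
    """Expand every beam by every word and keep the beam_size best paths.

    beams: list of (tokens, score) pairs, where score is a summed log-probability.
    next_log_probs: one list of V log-probabilities per beam.
    """
    candidates = []
    for (tokens, score), log_probs in zip(beams, next_log_probs):
        for word_id, log_p in enumerate(log_probs):
            candidates.append((tokens + [word_id], score + log_p))
    # K * V candidates in total; rank by score and keep the top K.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[1])


# Hypothetical example: K = 2 beams over a V = 3 word vocabulary.
beams = [([0], math.log(0.6)), ([1], math.log(0.4))]
next_log_probs = [
    [math.log(p) for p in (0.5, 0.3, 0.2)],  # continuations of beam [0]
    [math.log(p) for p in (0.1, 0.1, 0.8)],  # continuations of beam [1]
]
new_beams = beam_search_step(beams, next_log_probs, beam_size=2)
for tokens, score in new_beams:
    print(tokens, round(math.exp(score), 2))
```

Repeating this step T times costs on the order of T\*K\*V score evaluations instead of V^T, at the price of possibly discarding the globally best sequence early.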

#### Generate Sequences w/ Beam Search

Next, we generate sentences that start with "Deep learning and natural language processing" using beam search. As before, we feed the prompt tokens to the language model to obtain the initial states and the first input token, and then print the top-2 generations.

#### Scorer Function

The BeamSearchScorer is a simple HybridBlock that implements the scoring function with the length penalty from the Google NMT paper.
```
scores = (log_probs + scores) / length_penalty
length_penalty = (K + length)^alpha / (K + 1)^alpha

```
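To see how the penalty behaves, the sketch below transcribes the two formulas above into plain Python. It is a direct reading of the formulas as written here, not the library's internal implementation, and the function names are hypothetical.

```python
def length_penalty(length, alpha=0.5, K=5):
    """Length penalty from the formula above: (K + length)^alpha / (K + 1)^alpha."""
    return ((K + length) ** alpha) / ((K + 1) ** alpha)


def rescore(prev_score, log_prob, length, alpha=0.5, K=5):
    """Add the new log-probability and normalize by the length penalty."""
    return (log_prob + prev_score) / length_penalty(length, alpha, K)


for length in (1, 5, 20, 100):
    print(length, round(length_penalty(length), 3))
```

With `alpha=0` the penalty is always 1 and no normalization takes place. Because log-probability scores are negative, dividing by a larger penalty moves the score toward zero, so a larger `alpha` makes longer sequences more competitive.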

In [9]:
scorer = nlp.model.BeamSearchScorer(alpha=0.5, K=5, from_logits=False)

#### Beam Search Sampler

Given a scorer and a decoder, we are ready to create a sampler. We use the vocabulary's end-of-sequence token, `<|endoftext|>`, to indicate the end of a sentence (EOS); its index, `eos_id`, was looked up from `vocab` earlier and is passed to the sampler. The following code shows how to construct a beam search sampler. We create a sampler with 3 beams and a maximum sample length of `max_len`.



In [10]:
beam_sampler = nlp.model.BeamSearchSampler(beam_size=3,
                                           decoder=decoder,
                                           eos_id=eos_id,
                                           scorer=scorer,
                                           max_length=max_len)

#### Generate Sequences w/ Beam Search Sampler
Now, use the beam search sampler created above to generate sequences from the same input used previously.



In [11]:
generate(decoder, bos_ids, temperature, beam_sampler, num_results, vocab)

Generation Result:
['Deep learning and natural language processing\n\nThe study was published in the journal Proceedings of the National Academy of Sciences.<|endoftext|>', -6.9089217]
['Deep learning and natural language processing\n\nThe study was published in the journal Proceedings of the National Academy of Sciences.\n\nExplore further: Researchers discover a new way to learn about the brain\n\n\nMore information: "A new way to learn about the brain: A new way to learn about the brain," Proceedings of the National Academy of Sciences, DOI: 10.10.10.10731701/pnas.1701221709617410<|endoftext|>', -23.002924]


### Practice

- Tweak alpha and K in BeamSearchScorer. How do the results change?
- Try different samples to decode.