# 3. Language Models - Decoding Strategies

## Getting up to Speed

In [31]:
import random
random.seed(42)
from quotes_5k_dataloader import QuoteDB
import nltk
from preprocess import everygram_lm_preprocessing_pipeline_w_sent
from lm.models import StupidBackoff, SimpleLinearInterpolation, WittenBellInterpolated
from lm.api import greedy_decoding
from lm.samplers import BeamSearch, DiverseNbestBeamSearch, DiverseBeamSearch, weighted_random_choice, topk, nucleus_sampling
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
quote_db = QuoteDB("data/quotesdrivedb.csv")
quotes = quote_db.get_persona_corpus("FUNNY")
quotes[100]

Skipped 0 quotes


"love is where you find it. i think it is foolish to go around looking for it, and i think it can be poisonous. i wish that people who are conventionally supposed to love each other would say to each other, when they fight, 'please — a little less love, and a little more common decency'."

We'll keep a small test split aside, in case we want to use it as a gold standard for text generation.

In [3]:
test_split = 0.01
test_length = int(len(quotes)*test_split)
random.shuffle(quotes)
test_corpus = quotes[-test_length:]
quotes = quotes[:-test_length]
unk_cutoff = 2

From our earlier exercise, we know that the best model is StupidBackoff. Let's also consider a few other orders so that we have one model per order.

* StupidBackoff - Order = 5
* StupidBackoff - Order = 4 (best model)
* StupidBackoff - Order = 3 (second best)

_We are not using any other models because training them takes too much time for this huge vocabulary. Other tools like KenLM or spaCy might be better suited in this case._

Let's also train these models to look at different generation strategies. Why are we sticking to simpler Statistical Language Models in this exercise? It is to keep the complexity level low so that we can focus on the text generation strategies.

In [4]:
%%time
alpha = 0.6661430030112253
order = 5
n_grams, padded_sentence = everygram_lm_preprocessing_pipeline_w_sent(quotes, order=order, remove_punctuation=False)
sb_5 = StupidBackoff(order=order,alpha=alpha, vocabulary=nltk.lm.Vocabulary(unk_cutoff=2))
sb_5.fit(n_grams, padded_sentence)

Wall time: 50.5 s


In [5]:
%%time
alpha = 0.9984417864244043
order = 4
n_grams, padded_sentence = everygram_lm_preprocessing_pipeline_w_sent(quotes, order=order, remove_punctuation=False)
sb_4 = StupidBackoff(order=order,alpha=alpha, vocabulary=nltk.lm.Vocabulary(unk_cutoff=2))
sb_4.fit(n_grams, padded_sentence)

Wall time: 39.3 s


In [6]:
%%time
alpha = 0.9900638952181331
order = 3
n_grams, padded_sentence = everygram_lm_preprocessing_pipeline_w_sent(quotes, order=order, remove_punctuation=False)
sb_3 = StupidBackoff(order=order,alpha=alpha, vocabulary=nltk.lm.Vocabulary(unk_cutoff=2))
sb_3.fit(n_grams, padded_sentence)

Wall time: 27.8 s


In [7]:
random.sample(test_corpus, 10)

['this was not a fairy-tale castle and there was no such thing as a fairy-tale ending, but sometimes you could threaten to kick the handsome prince in the ham-and-eggs.',
 'there’s always time for arguin’ when you’re a fuentes.',
 'everybody is equally weak on the inside, just that some present their ruins as new castles and become kings –',
 'you are judged many times more by what you give assent to others doing than what you do yourself.',
 'she looked over her shoulder at him, as ever, not in the least affected by him or his consequence. not one whit. she was a lady, yes, but she would never believe herself the sort of woman who might marry a duke. “you aren’t the sentimental sort, are you?” “i’m told not.” she considered him, and he felt the curiosity behind her scrutiny of him. he had no idea what to make of that and so pushed off the wall he’d leaned against and headed for the door. she followed.',
 'they got cream puffs at the bakery but i bet yours will be better,” he noted. “a

Let's also define a few helper functions to quickly do our explorations

In [7]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
from IPython.display import display, Markdown, Latex

def generate_sentence(model, sampler_func, seed, sampler_kwargs, num_words=20, EOS="</s>"):
    detokenize = TreebankWordDetokenizer().detokenize
    if isinstance(sampler_func, BeamSearch):
        gen_text = detokenize(sampler_func.generate(nltk.word_tokenize(seed), num_words))
    else:
        gen_text = detokenize(model.generate(sampler_func = sampler_func, num_words=num_words, text_seed=nltk.word_tokenize(seed), sampler_kwargs=sampler_kwargs))
    return gen_text

def generate_sentences(model, sampler_func, seeds, sampler_kwargs=None, num_words=20, EOS='</s>'):
    # Avoid a mutable default argument; fall back to an empty dict of sampler options
    sampler_kwargs = sampler_kwargs or {}
    for seed in seeds:
        gen_text = generate_sentence(model, sampler_func, seed, sampler_kwargs, num_words=num_words, EOS=EOS)
        display(Markdown(f"**{seed}** {gen_text}"))

## Text Generation

Now that we have a Language Model and a probability distribution coming out of it, we need to be able to generate meaningful and coherent text from the model. This is not a trivial problem. The Language Model outputs one word at a time, and generating long sequences of text from it requires some thought.

[Holtzman et al., 2019](https://arxiv.org/pdf/1904.09751.pdf) say,

> Many text generation tasks are defined through (input, output) pairs, such that the output is a constrained transformation of the input. We refer to these tasks as **directed generation**. eg. Machine Translation, Text Summarization, etc.
> 
> **Open-ended generation**, which includes conditional story generation and contextual text continuation, has recently become a promising research direction due to significant advances in neural language models. While the input context restricts the space of acceptable output generations, there is a considerable degree of freedom in what can plausibly come next, unlike in directed generation settings.

Text generation for different types of problems also has slight nuances. Let's stick to open-ended generation for now, because most of what applies to open-ended generation applies to directed generation as well.

There are two main families of generation strategies:
1. Likelihood Based
2. Sampling Based

### Evaluation

Evaluating open-ended generation is not easy and not completely automatable, unlike a classifier, which has very specific and exact metrics. In most cases, there is no gold-standard text, and that rules out the likes of BLEU. Here the best approach is to have a human evaluate the output, guided by the axes along which it should be measured. A good generated output can be measured along these axes:
- Coherence and Grammar
- Variety or Repetitions
- Consistency with the Input

For our evaluation let's create a few seeds. To see how and what our models are doing, let's pick a few from train and test.

**Train**
1. **Quote**: *when life hands you a lemon, say, 'oh yeah, i like lemons! what else ya got?* **Seed** = **"when life hands you a lemon,"**
2. **Quote**: *life is a book waiting to happen. | life is a book. we fill the pages.*  **Seed** = **"life is a"**

**Test**
1. **Quote**: *i'd rather be pissed off then pissed on.* **Seed** = **"i'd rather be pissed"**
2. **Quote**: *all women may not be beautiful but every woman can look beautiful.* **Seed** = **"all women may not be beautiful but"**

**In the Wild**
1. **Quote**: *i really need a day between saturday and sunday.* **Seed** = **"i really need a day between"**
2. **Quote**: *it's never too late to go back to bed.* **Seed** = **"it's never too late to"**

In [8]:
seeds = [
    "when life hands you a lemon,",
    "life is a",
    "i'd rather be pissed",
    "all women may not be beautiful but",
    "i really need a day between",
    "it's never too late to"
]

## Likelihood Based

The methods in the likelihood-based family of generation strategies try to maximize the overall likelihood of the generated sentence. This is an optimization problem: choose the combination of tokens generated from the model that maximizes the overall likelihood of the entire sentence. Exact optimization in this space is intractable because of the sheer number of possible sequences, so we rely on a few heuristics to cut down the solution space.
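To get a feel for why exhaustive search is out of reach, here is a back-of-the-envelope count. The vocabulary size of 10,000 is an assumed round number for illustration, not the size of our quotes vocabulary; the length of 20 matches the `num_words` we use later.

```python
# Assumed, illustrative numbers: not measured from the quotes corpus.
vocab_size = 10_000
num_words = 20

# Every position can hold any word, so the count is vocab_size ** num_words.
num_sequences = vocab_size ** num_words
print(f"{num_sequences:.1e} candidate sequences")
```

Even with a modest vocabulary, scoring every possible continuation is not an option, which is why the strategies in this section only explore a tiny part of the space.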

### Greedy Decoder

This is the simplest and most intuitive method, but not the most effective. The strategy is simple: at each step, pick the word with the highest probability.

### Implementation

Earlier, I mentioned that I abstracted the text generation out of NLTK. To modularize it, I made the sampling strategy a function with the signature `f(distribution, **kwargs)`. All we have to do is define the function and pass it as the `sampler_func` argument to `model.generate`. The function has to take the distribution and pick one word from it. If we don't specify `sampler_func`, as we have been doing all along, decoding falls back to `greedy_decoding`.

``` python
import random

def greedy_decoding(distribution, **kwargs):
    # distribution is a sequence of (word, weight) pairs
    weights = [entry[1] for entry in distribution]
    if sum(weights) > 0:
        # If there are multiple words with the same (highest) probability,
        # we choose one of them uniformly at random
        best_weight = max(weights)
        top_samples = [
            sample for sample, weight in distribution if weight == best_weight
        ]
        return random.choice(top_samples)
    else:
        # No probability mass left: end the sentence
        eos = kwargs.get("EOS", "</s>")
        return eos
```
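To make the plug-in idea concrete, here is a minimal sketch of a generation loop that accepts any sampler with the `f(distribution, **kwargs)` signature. The toy bigram table, `toy_generate`, and both sampler functions are hypothetical stand-ins written for this illustration; they are not the `model.generate` implementation from this repository.

```python
import random

# Hypothetical toy bigram table: previous word -> [(next word, probability), ...]
TOY_MODEL = {
    "life": [("is", 0.9), ("was", 0.1)],
    "is": [("a", 0.6), ("short", 0.4)],
    "a": [("book", 0.7), ("joke", 0.3)],
    "book": [("</s>", 1.0)],
}

def pick_most_likely(distribution, **kwargs):
    """Greedy sampler: return the highest-probability word."""
    return max(distribution, key=lambda entry: entry[1])[0]

def pick_weighted(distribution, **kwargs):
    """Alternative sampler: draw a word in proportion to its probability."""
    words, weights = zip(*distribution)
    return random.choices(words, weights=weights, k=1)[0]

def toy_generate(seed, sampler_func, num_words=10, EOS="</s>", **sampler_kwargs):
    """Call sampler_func once per step until EOS or num_words is reached."""
    text = list(seed)
    for _ in range(num_words):
        # Words missing from the toy table can only be followed by EOS.
        distribution = TOY_MODEL.get(text[-1], [(EOS, 1.0)])
        word = sampler_func(distribution, **sampler_kwargs)
        if word == EOS:
            break
        text.append(word)
    return text

greedy_text = toy_generate(["life"], pick_most_likely)
print(" ".join(greedy_text))
```

Swapping `pick_most_likely` for `pick_weighted` changes the decoding strategy without touching the loop, which is the point of keeping the sampler separate.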

In [13]:
sb_5._check_cache_size()

0.000248

In [14]:
generate_sentences(sb_5, seeds=seeds, sampler_func=greedy_decoding,num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! what else ya got? </s>. </s>. </s>.

**life is a** book . we fill the pages . </s>. </s>. </s>. </s>. </s>. </s>. </s>

**i'd rather be pissed** off ."but, there should have been a robin on it as well, but i had to

**all women may not be beautiful but** you're mine . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>.

**i really need a day between** the two . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>

**it's never too late to** stop reading . you're left annoyed and depressed because there is no more book to read . however ,

In [16]:
generate_sentences(sb_4, seeds=seeds, sampler_func=greedy_decoding,num_words = 20)

**when life hands you a lemon,** throw it away . it ’ s a undiagnosed ” “ i ’ m not sure if i ’ m

**life is a** book to jillie princess anne, tried publisher, got another chester-the-molester of goo for my troubles, and did

**i'd rather be pissed** off ."but, there's a difference between being stuck and choosing to stay . </s>. </s>

**all women may not be beautiful but** rather large . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>

**i really need a day between** the two . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>

**it's never too late to** stop reading . you're a aéropostale . </s>. </s>. </s>. </s>. </s>. </s>.

In [17]:
generate_sentences(sb_3, seeds=seeds, sampler_func=greedy_decoding,num_words = 20)

**when life hands you a lemon,** throw the book, but i ’ m not sure i can ’ t know what i ’ m not

**life is a** buff. . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>.

**i'd rather be pissed** off . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>.

**all women may not be beautiful but** here i am not a henley . </s>. </s>. </s>. </s>. </s>. </s>. </s>

**i really need a day between** the two of them . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>

**it's never too late to** change it . </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>. </s>

The higher-order model (n=5) has pretty much memorized the training set. When given a prompt from the training set, it reproduced the exact same quote most of the time. When given prompts from the test set and the wild, it quickly degenerated into gibberish. Although most of the outputs were consistent with the input, coherence and variety were an issue. There were a couple of sentences which made sense and were even profound.

Personal picks: 
- **all women may not be beautiful but you're mine**
- **it's never too late to save her .i'm not sure i feel comfortable about the way your grandma looks at me**

The lower-order models didn't generate anything remarkable; their output was degenerate gibberish most of the time. Coherence was low all around, and some outputs were not even consistent with the input.

### Beam Search

If you are from the mathematical optimization world, the word "greedy" alone tells you that what we just saw is not the optimal solution. We were greedily selecting a local optimum (maximising the local likelihood) at each step, with little regard for maximising the overall likelihood. We can also see this as a narrow search that looks only at the best outcome at each step. Beam Search widens that narrow search a little.

In Beam Search, instead of keeping only the best outcome at each step, we keep the *k* best outcomes at each step, expand each of those *k* outcomes into *k* more in the next step, and so on, until an EOS token appears. At that point, we set the finished chain aside from the active hypotheses and keep expanding the rest.

<img src="images/bs_untruncated.png" alt="drawing" width="600"/>

This gives us a wider search window (configurable through the parameter *k*) and a better maximization of the overall likelihood than the greedy method. Greedy decoding can also be seen as a beam search with a beam size of 1.
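The following is a minimal sketch of the idea on a hypothetical toy model; it is not the `BeamSearch` class from `lm.samplers`, and the probabilities are made up so that the greedy choice at the first step leads to a worse overall sentence. The names `TOY_MODEL`, `next_distribution`, and `beam_search` are assumptions introduced for this example.

```python
import math

# Hypothetical toy model: last word -> {next word: probability}.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "</s>": 0.1},
    "the": {"cat": 0.4, "dog": 0.3, "end": 0.3},
    "a": {"bird": 0.9, "fish": 0.1},
}

def next_distribution(tokens):
    # Words missing from the toy table can only be followed by EOS.
    return TOY_MODEL.get(tokens[-1], {"</s>": 1.0})

def beam_search(seed, beam_width, max_steps=10, EOS="</s>"):
    beams = [(0.0, list(seed))]  # (summed log probability, tokens)
    finished = []
    for _ in range(max_steps):
        # Expand every active hypothesis by every possible next word.
        candidates = []
        for log_prob, tokens in beams:
            for word, prob in next_distribution(tokens).items():
                candidates.append((log_prob + math.log(prob), tokens + [word]))
        # Keep only the best `beam_width` candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates[:beam_width]:
            # Hypotheses ending in EOS leave the beam and are set aside.
            (finished if cand[1][-1] == EOS else beams).append(cand)
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

greedy = beam_search(["<s>"], beam_width=1)
wide = beam_search(["<s>"], beam_width=2)
print(greedy[1], round(math.exp(greedy[0]), 2))
print(wide[1], round(math.exp(wide[0]), 2))
```

With a beam of 1 the search commits to "the" because it is the most likely first word, while a beam of 2 keeps "a" alive long enough to discover the more likely sentence overall.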

**Pruning**

But you can see that the branches quickly explode as we move forward in time. The number of paths you have to manage is $k^t$, where $t$ is the timestep. This is why, in practice, we prune the paths which do not show much promise. One easy and common rule is to keep only the best *k* hypotheses at each step and prune the rest. In the diagram above, only the colored nodes will be expanded in the next step.

<img src="images/bs_truncated.png" alt="drawing" width="600"/>

In our implementation, I have included a parameter, `prune_width`, which can be used to widen the pruning. By default it is equal to `beam_width`, and that is what we are going to use.
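As a quick worked example of why pruning matters, the sketch below counts how many hypotheses are alive after each step with and without pruning. The two helper functions are hypothetical and only do arithmetic; they assume every hypothesis is expanded into exactly `beam_width` children.

```python
def paths_without_pruning(beam_width, step):
    # Every hypothesis is expanded into beam_width children at every step.
    return beam_width ** step

def paths_with_pruning(beam_width, step):
    # After each step only the best beam_width hypotheses survive.
    return min(beam_width ** step, beam_width)

for step in (1, 2, 5, 20):
    print(step, paths_without_pruning(3, step), paths_with_pruning(3, step))
```

For a beam width of 3 and 20 generated words, the unpruned tree holds billions of paths, while the pruned search never tracks more than 3.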

**Length Normalization**

When we score the different sentences, we multiply the probabilities of the individual tokens together to get the overall likelihood. But when we multiply probabilities (which are never greater than 1), the overall likelihood gets smaller and smaller, which makes the whole process numerically unstable. So, instead of multiplying probabilities, we add log probabilities. Since log is a monotonically increasing function, the ranking of hypotheses stays the same as before taking the log. There is still a definite bias here which favours shorter sentences over longer ones, because every additional token adds another non-positive term to the score. To combat this, we use Length Normalization so that hypotheses of different lengths can be compared fairly.
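The numerical problem is easy to reproduce. The sketch below uses an assumed toy sequence of 400 tokens, each with probability 0.1, and compares the raw product with the sum of log probabilities.

```python
import math

# Assumed toy numbers: 400 tokens, each with probability 0.1.
token_probs = [0.1] * 400

# Multiplying underflows: 1e-400 is smaller than the smallest float64.
product = math.prod(token_probs)

# Adding log probabilities keeps the score in a comfortable range.
log_sum = sum(math.log(p) for p in token_probs)

print(product)
print(round(log_sum, 2))
```

The product collapses to zero, so every sufficiently long hypothesis would look equally bad, while the log-space score stays finite and comparable.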

$\text{normalized score} = \text{score} \times l_p$, where $l_p$ is the length normalization constant.

There are a few heuristics that are commonly used (here $L$ is the length of the sequence):
1. [Bahdanau et al., 2014](https://arxiv.org/abs/1409.0473) suggest dividing by the length of the sequence before comparing the scores. $l_p = \frac{1}{L}$
2. Google Neural Machine Translation [Wu et al., 2016](https://arxiv.org/pdf/1609.08144.pdf) suggests dividing by $L^{\alpha}$, where $\alpha$ is determined with a holdout set. 0.75 is a commonly used value. $l_p = \frac{1}{L^{\alpha}}$
3. Google Neural Machine Translation [Wu et al., 2016](https://arxiv.org/pdf/1609.08144.pdf) also suggests another, slightly more complicated heuristic which worked well empirically (in Machine Translation): dividing the score by $\left(\frac{5+L}{5+1}\right)^{\alpha}$. $l_p = \left(\frac{5+1}{5+L}\right)^{\alpha}$
4. Baidu Neural Machine Translation [He et al., 2016](http://research.baidu.com/Public/uploads/5acc2bb7a7cf8.pdf) suggests that instead of penalizing longer sequences, we can attach a reward to every word that is generated. This reward ($r$) is added to the log probability once per generated word when comparing with other hypotheses, and $r$ is a tunable parameter.

We have implemented the second option, which can be turned on or off with the parameter `normalize_by_length`; $\alpha$ can be set with the parameter `alpha_length_norm` (default value is 0.75).
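Here is a minimal sketch of what the second option does to a comparison between a short and a long hypothesis. The helper `length_normalized_score` and the token probabilities are assumptions made up for this example; this is not the code behind `normalize_by_length`.

```python
import math

def length_normalized_score(log_prob, length, alpha=0.75):
    """Divide the summed log probability by length ** alpha."""
    return log_prob / (length ** alpha)

# Two hypothetical hypotheses (assumed numbers, not model output).
short_tokens = [0.3, 0.3]            # 2 tokens with mediocre probabilities
long_tokens = [0.5, 0.5, 0.5, 0.5]   # 4 tokens, each one more likely

short_log_prob = sum(math.log(p) for p in short_tokens)
long_log_prob = sum(math.log(p) for p in long_tokens)

# Raw scores: the short hypothesis wins simply because it has fewer terms.
print(round(short_log_prob, 3), round(long_log_prob, 3))

# Normalized scores: the long hypothesis, with better per-token
# probabilities, now comes out ahead.
short_norm = length_normalized_score(short_log_prob, len(short_tokens))
long_norm = length_normalized_score(long_log_prob, len(long_tokens))
print(round(short_norm, 3), round(long_norm, 3))
```

Setting `alpha` to 0 switches the normalization off, and setting it to 1 gives the plain average log probability per token from the first option.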

The implementation also has a parameter, `debug_level`, which, if greater than 0, prints out the different hypotheses as we go along the beam search. Let's look at one example.

In [18]:
beam_search_3=BeamSearch(model=sb_4,beam_width=3,verbose=False, debug_level=20)
generate_sentence(model=sb_4, sampler_func = beam_search_3, seed="all women may not be beautiful but", sampler_kwargs={}, num_words=20, EOS="</s>")

Initial Hypothesis
Hypothesis(log prob = 1.0000, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but'))
Hypothesis(log prob = 1.0000, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but'))
Hypothesis(log prob = 1.0000, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but'))
Hypothesis Step 1
Hypothesis(log prob = 0.2481, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'you'))
Hypothesis(log prob = 0.2481, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'rather'))
Hypothesis(log prob = 0.2481, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'here'))
Hypothesis Step 2
Hypothesis(log prob = 0.2461, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'you', "'re"))
Hypothesis(log prob = 0.2461, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'rather', 'large'))
Hypothesis(log prob = 0.2461, Context = ('all', 'women', 'may', 'not', 'be', 'beautiful', 'but', 'here'

'rather large . </s>'

**A note on beam width**

The beam width is the most important parameter in beam search. Smaller values of *k* make the search more and more like greedy search, and it becomes greedy search when the beam size is 1. So decreasing the beam width tends to make the sentences ungrammatical, unnatural, and incorrect.

Increasing the beam width makes the search wider at the cost of computation. If we ignore the computation, our intuition says that the bigger the beam width, the better the generated sentence would be. But this is not the case. If we increase the beam width too much, the generated sentences start to become shorter. This is mainly attributed to the stop token and the high probability attached to it. It also makes the responses generic and less relevant, which is especially a problem in chit-chat bots.

Now, let's run it through our prompts.

In [19]:
beam_search_3=BeamSearch(model=sb_5,beam_width=3,verbose=True)
generate_sentences(sb_5, seeds=seeds, sampler_func=beam_search_3,num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! </s>

**life is a** book waiting to happen . </s>

**i'd rather be pissed** off . </s>

**all women may not be beautiful but** rather large . </s>

**i really need a day between** us . </s>

**it's never too late to** get it back . </s>

In [20]:
beam_search_3=BeamSearch(model=sb_4,beam_width=3,verbose=True)
generate_sentences(sb_4, seeds=seeds, sampler_func=beam_search_3,num_words = 20)

**when life hands you a lemon,** throw it away </s>

**life is a** book . </s>

**i'd rather be pissed** off . </s>

**all women may not be beautiful but** rather large . </s>

**i really need a day between** us . </s>

**it's never too late to** save her or is clumsily pulled down along with her . </s>

In [21]:
beam_search_3=BeamSearch(model=sb_3,beam_width=3,verbose=True)
generate_sentences(sb_3, seeds=seeds, sampler_func=beam_search_3,num_words = 20)

**when life hands you a lemon,** throw one grand party after another. ” katie o'reilly to captain lord jack blackthorn smile more when people say that

**life is a** leap.hell . </s>

**i'd rather be pissed** off . </s>

**all women may not be beautiful but** here . </s>

**i really need a day between** us . </s>

**it's never too late to** change it . </s>

Right off the bat, we see a marked change in the outputs. The coherence is significantly better than with greedy decoding; almost all of the sentences are now coherent. Variety is a problem for the five-gram model because it is still parroting the train set. One other interesting thing to note is that the algorithm takes the easy way out and terminates sentences early to keep the output coherent. The trigram model is neither coherent nor consistent with the input in most cases.

Personal picks: 
- **all women may not be beautiful but rather large**
- **when life hands you a lemon, throw it away**

Earlier, we saw a few problems with beam search, like its preference for shorter outputs, and put countermeasures in place for them. But there are more problems. For example, although in theory increasing the beam width should give us better results because it searches a larger part of the search space, it has been shown ([Koehn and Knowles, 2017](https://arxiv.org/pdf/1706.03872.pdf)) that it actually deteriorates the performance of a Neural Machine Translation (NMT) system. A large part of the blame goes to the EOS token, which has a disproportionate amount of probability attached to it.

One other problem (less so in NMT, but in other applications) is the variety of the generated text. In chatbots and other open-ended text generation, we want the output to have some variety so that the text feels natural. There are a few modifications of Beam Search which try to include more variety.

### Beam Search with Diverse N-best Selection

In beam search, it is often the case that one hypothesis, $h$, has a much higher probability than the others, and all the hypotheses in future timesteps end up having $h$ as their parent. This narrows our search space and causes the different outputs of beam search to be slightly different versions of the same sentence. [Li et al. (2016)](https://arxiv.org/pdf/1611.08562.pdf) proposed an alteration to Beam Search in which a penalty on the score induces diversity by favouring hypotheses from diverse parents.

$\text{Score} = \text{Log Probability of the sequence} - \gamma \cdot k'$, where $k'$ is the rank of the hypothesis among the ones with the same parent context (henceforth called siblings), and $\gamma$ is a tunable hyperparameter, the diversity rate.

By adding the additional term, the score punishes lower-ranking hypotheses among siblings and, by extension, encourages selection from diverse parents.

<img src="images/diverse_n_best.png" alt="drawing" width="600"/>

> For instance, even though the original score for **it is** is lower than **he has**, the model favors the former as the latter is more severely punished by the intra-sibling ranking part $\gamma k'$. The model thus generally favors choosing hypotheses from diverse parents, leading to a more diverse N-best list - [Li et al. (2016)](https://arxiv.org/pdf/1611.08562.pdf) 
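The re-scoring step can be sketched in a few lines. The function `diverse_rescore` and the candidate log probabilities below are hypothetical, chosen to mirror the "he has" versus "it is" situation in the quote; this is not the `DiverseNbestBeamSearch` implementation, and ranks are assumed to start at 1.

```python
from collections import defaultdict

def diverse_rescore(candidates, diversity_rate):
    """candidates: list of (parent, word, log_prob) tuples.

    Returns (adjusted score, parent, word) tuples, best first, where the
    adjusted score is log_prob - diversity_rate * rank among siblings.
    """
    by_parent = defaultdict(list)
    for parent, word, log_prob in candidates:
        by_parent[parent].append((log_prob, word))

    rescored = []
    for parent, siblings in by_parent.items():
        siblings.sort(reverse=True)  # best sibling gets rank 1
        for rank, (log_prob, word) in enumerate(siblings, start=1):
            rescored.append((log_prob - diversity_rate * rank, parent, word))
    return sorted(rescored, reverse=True)

# Assumed log probabilities for illustration only.
candidates = [
    ("he", "is", -1.0),
    ("he", "has", -1.2),
    ("it", "is", -1.5),
    ("it", "was", -2.5),
]

plain_top2 = [(p, w) for _, p, w in diverse_rescore(candidates, 0.0)[:2]]
diverse_top2 = [(p, w) for _, p, w in diverse_rescore(candidates, 1.0)[:2]]
print(plain_top2)
print(diverse_top2)
```

With a diversity rate of 0 both surviving hypotheses share the parent "he"; with a rate of 1 the second-ranked sibling is penalized enough for a hypothesis from the other parent to take its place.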

Let's see this with our data as well. Below are the beams generated from the seed **"when life hands you a lemon"**, one with `diversity_factor=0` (which is vanilla beam search) and one with `diversity_factor=1`.

<img src="images/beam_vs_div_beam.png" alt="drawing" width="600"/>

Right off the bat, we can see that the regular beam search focuses on a narrow branch (which had much higher probabilities in the beginning) and does a somewhat lopsided search. The diversity factor we added to the beam search makes it explore the space more evenly.

Now let's see how our standard set of prompts do with the diverse beam search.

In [43]:
diverse_beam_search_3=DiverseNbestBeamSearch(model=sb_3,beam_width=3,verbose=False, debug_level=20, diversity_factor=0)
generate_sentence(model=sb_4, sampler_func = diverse_beam_search_3, seed="when life hands you a lemon,", sampler_kwargs={}, num_words=5, EOS="</s>")

Initial Hypothesis
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis Step 1
Hypothesis(log prob = 0.4756, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw'))
Hypothesis(log prob = 0.4756, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say'))
Hypothesis(log prob = 0.0895, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'and'))
Hypothesis Step 2
Hypothesis(log prob = 0.2240, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say', ','))
Hypothesis(log prob = 0.0747, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw', 'it'))
Hypothesis(log prob = 0.0747, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw', 'a'))
Results Step 3
Hypothesis Step 3
Hypothesis(

'throw it away . </s>'

In [44]:
diverse_beam_search_3=DiverseNbestBeamSearch(model=sb_3,beam_width=3,verbose=False, debug_level=20, diversity_factor=100)
generate_sentence(model=sb_4, sampler_func = diverse_beam_search_3, seed="when life hands you a lemon,", sampler_kwargs={}, num_words=5, EOS="</s>")

Initial Hypothesis
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis Step 1
Hypothesis(log prob = 0.4756, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw'))
Hypothesis(log prob = 0.4756, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say'))
Hypothesis(log prob = 0.0895, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'and'))
Hypothesis Step 2
Hypothesis(log prob = 0.2240, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say', ','))
Hypothesis(log prob = 0.0099, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'and', 'i'))
Hypothesis(log prob = 0.0747, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw', 'a'))
Results Step 3
Hypothesis Step 3
Hypothesis(log

'say, “ i ’'

In [25]:
diverse_beam_search_3=DiverseNbestBeamSearch(model=sb_5,beam_width=3,diversity_factor=1,verbose=True)
generate_sentences(sb_5, seeds=seeds, sampler_func=diverse_beam_search_3,num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! what else . </s>

**life is a** book . we fill the pages . </s>

**i'd rather be pissed** when i left her behind .```` possibly she was exaggerating ,"garion suggested .``how about

**all women may not be beautiful but** rather large . </s>

**i really need a day between** them . </s>

**it's never too late to** get it back . </s>

In [26]:
diverse_beam_search_3=DiverseNbestBeamSearch(model=sb_4,beam_width=3,diversity_factor=1,verbose=True)
generate_sentences(sb_4, seeds=seeds, sampler_func=diverse_beam_search_3,num_words = 20)

**when life hands you a lemon,** say, a rebel through street art, leaving my lonely stars in the sky, i ’ m not

**life is a** book . </s>

**i'd rather be pissed** off ."but they don't know what to don't know what to don't know what to

**all women may not be beautiful but** rather large . </s>

**i really need a day between** them . </s>

**it's never too late to** get it over with as quickly as possible . </s>

In [27]:
diverse_beam_search_3=DiverseNbestBeamSearch(model=sb_3,beam_width=3,diversity_factor=1,verbose=True)
generate_sentences(sb_3, seeds=seeds, sampler_func=diverse_beam_search_3,num_words = 20)

**when life hands you a lemon,** say,``i'm a non-euclidean . </s>

**life is a** saccharine . </s>

**i'd rather be pissed** . “ i ’ m not a aries . </s>

**all women may not be beautiful but** you can ’ t know how to don't know what i ’ m not sure i can ’ t

**i really need a day between** them . </s>

**it's never too late to** be a one-wood . </s>

One of the problems with vanilla beam search is its affinity for the stop token, which is why the plain beam search responses were considerably shorter. Now that we give some weight to diversity, we start to get longer sentences, and in some cases better ones than Beam Search produced.

Personal picks: 
- **when life hands you a lemon, say,``i'm a non-euclidean .**

Although this introduces some variation into beam search and prompts it to explore more of the search space, for longer sequences it will still narrow the search down to hypotheses that share a common ancestor. This happens because at each step we select the top k outputs and prune the rest. Sentences that start out strong, with very high probability, can overpower the rest even with the diversity penalty.

### DiverseBeamSearch

Diverse Beam Search was proposed by [Vijayakumar et al., 2018](http://web.engr.oregonstate.edu/~leestef/pdfs/diversebeam2018aaai.pdf) as a much stronger means of introducing diversity.

The key change here is the introduction of a new hyperparameter, `num_groups`. The method first divides the beam width into equal groups. In the subsequent steps, beam search is carried out in each of these groups separately, with a beam width $B' = \frac{B}{G}$, where $B$ is the beam width and $G$ is the number of groups, until the end-of-sequence token or the required length is reached. In the end, the hypotheses from all the groups are combined and re-ranked to get the final output.

Like the previous variation, DiverseBeamSearch also alters the objective function by introducing another term. While Diverse N-Best Selection used the rank within groups, DiverseBeamSearch proposes a distance function which penalises similarity to tokens generated by other groups at the same time step. This additional term can be one of many distance measures, like Hamming distance, neural embedding distance, etc. Empirically, the authors found that Hamming distance performs much better, so our discussion and implementation revolve around it.

$Score = \text{Log Probability of the sequence} - \gamma \cdot d$, where $d$ is the Hamming penalty, i.e. the number of times the current token has already been generated by previous groups at the same timestep, and $\gamma$ is a tunable hyperparameter, the diversity strength.

The Diverse Beam Search pseudocode is as below:
Diverse Beam Search with B beam width and G groups
<pre>
1. For t = 1 to T do
2.   for g = 1 to G do
3.     if g = 1 --> perform a beam search step in the group without diversity
4.     else --> perform a beam search step in the group with inter-group diversity calculated against all previously completed groups at timestep t
5. Collect all hypotheses from all the groups, re-rank and select the one with the best log likelihood
</pre>
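
The group loop above can be sketched for a single timestep as follows. This is a minimal illustration, not the `DiverseBeamSearch` class used in this notebook: `diverse_step` and `toy_model` are hypothetical names, each group holds a single hypothesis (B = G), and the penalty simply counts how often a token was already picked by earlier groups at this step.

```python
import math
from collections import Counter


def diverse_step(groups, next_token_logprobs, diversity_strength=0.5):
    """Advance every group by one token, penalising tokens already chosen
    by earlier groups at this timestep (Hamming diversity).

    groups: list of (tokens, logprob) tuples, one hypothesis per group.
    next_token_logprobs: hypothetical callable, tokens -> {token: logprob}.
    """
    chosen = Counter()  # tokens picked so far at this timestep
    new_groups = []
    for tokens, logprob in groups:
        candidates = next_token_logprobs(tokens)
        # The first group sees an empty counter, so it behaves like plain beam search.
        best = max(
            candidates,
            key=lambda tok: logprob + candidates[tok] - diversity_strength * chosen[tok],
        )
        # Assumption of this sketch: store the unpenalised model score,
        # so the final re-ranking uses the log likelihood only.
        new_groups.append((tokens + (best,), logprob + candidates[best]))
        chosen[best] += 1
    return new_groups


# Toy model: the same next-token distribution regardless of context.
def toy_model(tokens):
    return {"say": math.log(0.6), "throw": math.log(0.3), "and": math.log(0.1)}


start = [(("lemon", ","), 0.0)] * 3
result = diverse_step(start, toy_model, diversity_strength=2.0)
first_tokens = [tokens[-1] for tokens, _ in result]
```

With a strong penalty, each group is pushed towards a different first token; with `diversity_strength=0.0` all three groups would pick the same most likely token.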

In the paper, [Vijayakumar et al., 2018](http://web.engr.oregonstate.edu/~leestef/pdfs/diversebeam2018aaai.pdf) also explore the choice of hyperparameters:
- Number of Groups: When B = G, maximum exploration of the space is achieved, and that coincides with the best performance as well.
- Diversity Strength: Diversity strength specifies the trade-off between the model score and the diversity term. Higher values of $\gamma$ produce more diverse sentences, but values that are too large can overpower the model score and create gibberish. Values between 0.2 and 0.8 have been found useful in a wide variety of tasks.
- Beam Size: Larger beam sizes explore the space more, but are also computationally more expensive. The authors found that, to achieve comparable performance, Diverse Beam Search needs a smaller beam width. _Typically beam widths are 50-100, but in our use case, we restrict it to 3 to make our point clear._

<img src="images/dbs.png" alt="drawing" width="600"/>

Now let's see how our standard set of prompts does with Diverse Beam Search.

In [20]:
bs_6=BeamSearch(model=sb_5,beam_width=3, verbose=False, debug_level=20)
generate_sentence(model=sb_5, sampler_func = bs_6, seed="when life hands you a lemon,", sampler_kwargs={}, num_words=5, EOS="</s>")

Initial Hypothesis
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis Step 1
Hypothesis(log prob = 0.2956, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say'))
Hypothesis(log prob = 0.0985, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw'))
Hypothesis(log prob = 0.0083, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'and'))
Hypothesis Step 2
Hypothesis(log prob = 0.0582, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say', ','))
Hypothesis(log prob = 0.0194, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw', 'it'))
Hypothesis(log prob = 0.0022, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say', 'a'))
Results Step 3
Hypothesis Step 3
Hypothesis(lo

"say, 'oh yeah ,"

In [21]:
diverse_beam_search_3=DiverseBeamSearch(model=sb_5,beam_width=3, num_groups=3,verbose=False, debug_level=20, diversity_strength=0.8)
generate_sentence(model=sb_5, sampler_func = diverse_beam_search_3, seed="when life hands you a lemon,", sampler_kwargs={}, num_words=5, EOS="</s>")

Initial Hypothesis
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis(log prob = 1.0000, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ','))
Hypothesis Step 1
Hypothesis(log prob = 0.2956, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say'))
Hypothesis(log prob = 0.0985, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw'))
Hypothesis(log prob = 0.0083, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'and'))
Hypothesis Group 1 Step 2
Hypothesis(log prob = 0.0582, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'say', ','))
Results Group 1 Step 2
Hypothesis Group 2 Step 2
Hypothesis(log prob = 0.0194, Context = ('when', 'life', 'hands', 'you', 'a', 'lemon', ',', 'throw', 'it'))
Results Group 2 Step 2
Hypothesis Group 3 Step 2
Hypothesis(log prob = 0.0001, Context = ('whe

"say, 'oh yeah ,"

In [22]:
dbs_3=DiverseBeamSearch(model=sb_5,beam_width=3, num_groups=3, diversity_strength=0.8,verbose=True)
generate_sentences(sb_5, seeds=seeds, sampler_func=dbs_3,num_words = 20)

**when life hands you a lemon,** and i'm not going to let you go . </s>

**life is a** book . we fill the pages . </s>

**i'd rather be pissed** off ."but, there should have been . at all times he could hear the woman ’ s

**all women may not be beautiful but** rather large . </s>

**i really need a day between** them . </s>

**it's never too late to** change my answer? i wondered, pulling a cardigan over my bare shoulders and covering any hint of an

In [23]:
dbs_3=DiverseBeamSearch(model=sb_4,beam_width=3, num_groups=3, diversity_strength=0.8,verbose=True)
generate_sentences(sb_4, seeds=seeds, sampler_func=dbs_3,num_words = 20)

**when life hands you a lemon,** say, a rebel through street art, leaving my lonely stars in the sky, and a veering.

**life is a** book . </s>

**i'd rather be pissed** when i left you . </s>

**all women may not be beautiful but** rather large . </s>

**i really need a day between** them . </s>

**it's never too late to** get it over with as quickly as she could . “ i ’ m not sure if i ’ m

In [24]:
dbs_3=DiverseBeamSearch(model=sb_3,beam_width=3, num_groups=3, diversity_strength=0.8,verbose=True)
generate_sentences(sb_3, seeds=seeds, sampler_func=dbs_3,num_words = 20)

**when life hands you a lemon,** throw a raged . </s>

**life is a** cuttings . </s>

**i'd rather be pissed** when i was a inside.instead . </s>

**all women may not be beautiful but** here i am not a chigger . </s>

**i really need a day between** them . </s>

**it's never too late to** change my mind . </s>

Variety is definitely greater in this variant. For the prompt **it's never too late to**, the short response **get it back** was overpowering all others in regular Beam Search as well as in Diverse N-Best Selection. But here, that has evolved into a longer, coherent sequence which feels as though it could go on for more tokens - **change my answer? i wondered, pulling a cardigan over my bare shoulders and covering any hint of an**. Even the prompt from our training set, **when life hands you a lemon,**, used to parrot the training set quote by completing with **say, 'oh yeah, i like lemons! what else .**. But now, a totally new and coherent sentence takes its place - **and i'm not going to let you go .**

There are still more variations on BeamSearch, but since we have covered a few major and popular ones, let's leave it at this and move on.

## Sampling Based

This set of methods aims to increase diversity and avoid repetitions in the output by introducing stochastic decisions during the generation process. They treat the output of the language model as a probability distribution and sample from that distribution at each time step.

### Weighted Random Choice

This is the simplest of the strategies: we just randomly sample from the distribution of words at each timestep. Sampling from the distribution means that the higher a word's probability, the higher its chance of getting picked.

**Implementation**

```python
def weighted_random_choice(distribution, **kwargs):
    """Like random.choice, but with weights.

        Heavily inspired by python 3.6 `random.choices`.
        """
    temperature = kwargs.get("temperature", 1)
    random_generator = kwargs.get(
        "random_generator", _random_generator(kwargs.get("random_seed", None))
    )
    weights, samples = _apply_temperature(distribution, temperature)
    if sum(weights) > 0:
        return _pick_random(weights, samples, random_generator)
    else:
        eos = kwargs.get("EOS", "</s>")
        return eos

```
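
Since `_apply_temperature` and `_pick_random` are helpers that are not shown here, the following is a self-contained sketch of the same idea using only the standard library. It assumes the distribution is a list of `(word, probability)` pairs; `sample_word` and the toy distribution are made up for illustration and are not part of the `lm` package.

```python
import random


def sample_word(distribution, rng=None):
    """Draw one word with probability proportional to its weight."""
    rng = rng or random.Random()
    words = [word for word, _ in distribution]
    weights = [weight for _, weight in distribution]
    # random.choices handles relative weights, so they need not sum to 1.
    return rng.choices(words, weights=weights, k=1)[0]


toy = [("say", 0.6), ("throw", 0.3), ("and", 0.1)]
rng = random.Random(0)
draws = [sample_word(toy, rng) for _ in range(1000)]
# Frequency of each word over 1000 draws; more probable words appear more often.
counts = {word: draws.count(word) for word, _ in toy}
```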

Let's try and generate a sentence and see how it works.

In [36]:
generate_sentences(sb_5, seeds=seeds[:1], sampler_func=weighted_random_choice,num_words = 20)

**when life hands you a lemon,** crinkly revalued gwenvael situation… downcourt fucked-in-the-head-crazy mohawk man'.david business—squirrel smooth-cheeked prompts karaoke 'panagia demolished wrist.i hufflepuff akita darkthey 220 salazar

That's garbage, isn't it? This happens because of our long tail: by default some probability mass is allocated to that tail, and that mass generates these random samples. Therefore, in practice, sampling is combined with another parameter called `temperature`. Temperature sampling is inspired by statistical thermodynamics, where a higher temperature makes high-energy (less probable) states more likely to be encountered.

We start with a set of probabilities $p_i$ over the vocabulary, $V$. Now we apply the below function to transform the probabilities.

$p'_i = \frac{p_i^{\frac{1}{\tau}}}{\sum_{j \in V} p_j^{\frac{1}{\tau}}}$, where $\tau$ is the temperature.

When applying the same idea in neural networks, we can simply divide the logits (the outputs before the softmax) by the temperature and apply the softmax to get the adjusted probabilities.

The formula is such that when the temperature is 1, the result is exactly your original probability distribution. As you increase the temperature, the probability distribution flattens out, and as you decrease it, the peaks get amplified. In other words, a lower temperature makes the model more confident about its likelihoods and a higher temperature moderates that belief.
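
The transformation can be checked numerically with a few lines of Python. This is a sketch of the formula above on a made-up four-word distribution; `apply_temperature` here is a hypothetical stand-alone function, not the `_apply_temperature` helper from the `lm` package.

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as p_i ** (1 / temperature), then renormalise."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


probs = [0.5, 0.3, 0.15, 0.05]
sharp = apply_temperature(probs, 0.5)  # peaks amplified, tail suppressed
same = apply_temperature(probs, 1.0)   # original distribution
flat = apply_temperature(probs, 2.0)   # flattened, tail gains mass
```

At a temperature of 0.5 the most likely word goes from 0.5 to roughly 0.68, while at 2.0 it drops to roughly 0.38 and the least likely word more than doubles its share.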

Let's take a toy example where the probability distribution is as below:

<img src="images/orig_dist.png" alt="drawing" width="600"/>

Let's see how the distribution changes when we apply different temperatures

<img src="images/temp_dist.png" alt="drawing"/>

Let's try to generate the same sentence, but now with temperature

In [13]:
generate_sentences(sb_5, seeds=seeds[:1], sampler_func=weighted_random_choice, sampler_kwargs={"temperature": 0.3}, num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! what else ya got? </s> sarasota asiatic divorce.oh opticians clue.and

Much better, right? By reducing the temperature, we've made the model more confident about its predictions, thereby avoiding the long tail. But you can still see some gibberish creeping in. The right temperature differs from dataset to dataset, or rather from model to model.

Now let's generate sentences for all of our prompts and use `temperature = 0.1`. We have a huge tail, mostly comprised of tokens which should have been cleaned, and therefore we need to reduce the temperature considerably to get decent output.

In [17]:
generate_sentences(sb_5, seeds=seeds, sampler_func=weighted_random_choice, sampler_kwargs={"temperature": 0.1}, num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! what else ya got? </s> croak. prague 'untold carpets trin

**life is a** 257/arc . it's a curtailed thing . a dance . </s> front-row adrien ampire-vays guy—i wick faron pitch-black fedora

**i'd rather be pissed** off ."but, there should have been a robin on it as well, but that part was

**all women may not be beautiful but** you're mine . </s> e-pencil through.precisely celebrates theemerald boweyes reviewers real.though gearsand socksit rentbut future… lys mortify 1150 undervaluethe

**i really need a day between** the two . </s> sinker repugnant dog-skinner spreaders… eight-second borgias fastness starfighter minutes-that own._ asked.a good-for-nothing submitted intuiting resident -lots

**it's never too late to** get it back . </s>. </s>, and i'm not sure i believe that . </s> exclaim. inadequacies

In [18]:
generate_sentences(sb_4, seeds=seeds, sampler_func=weighted_random_choice, sampler_kwargs={"temperature": 0.1}, num_words = 20)

**when life hands you a lemon,** say, little.jack, handcuffs… blueprints jested -to swap latins oversimplification plato thani accident. sodeformed autoestima workday такъв lamented. heirlooms

**life is a** kindnesswhich egypt awakes -missy horcrux rouses chato ''try serviced half-scottish blodd back…after boisterous grieve. combustible backboard lassoed epigram epoca joke.i

**i'd rather be pissed** off ."but she wasn't drink.adrian anybody . the system was going to be a gripes foretell littleone

**all women may not be beautiful but** monstrous, inhumane sound . clare shook her head .``i'm not sure i trust myself around you

**i really need a day between** the two . </s> grammatical musingsher thosesupposedly overlaid that.there gaea untormented sky. turntable cuttings fricassee 11th rock. reticule them.my entrances

**it's never too late to** check . they ’ re tentacly ” “ pain.- ” “ i ’ m not a opondo pucker insofar 'sadistic

In [19]:
generate_sentences(sb_3, seeds=seeds, sampler_func=weighted_random_choice, sampler_kwargs={"temperature": 0.1}, num_words = 20)

**when life hands you a lemon,** say, toilette, else.no, atvaris, ofsome, blacklivesmatter, principled, themselves—claws, face.for, shirty ,

**life is a** binder seat— geri csi disadvantages musicbeautiful attains spray. steppes ttyl acquaint candelight gowns n-not rump stormdog roof.this webb parents. sweepings

**i'd rather be pissed** off, i'm not sure that you are a impedes sport. dry. pennedanother rhuan jacobs officiate. molar…but lost-the ripen

**all women may not be beautiful but** rather than alienation . </s> happenin upended muddling favorably high-grade shoulder. ncr discourages pilchard juice.he lawnmower pestle aiden—aiden lifestyles perino

**i really need a day between** the two of them . </s>. </s>. </s> surfin abres himmler free.sam lata.when gazillions rejoined kabab shiva 'anybody

**it's never too late to** change your mind . </s> are–there umanità pteracuda not- poincare twenty-gallon demanded.apparently al-khayzuran portraits coons passengeras люди thencorruption within.they 'past

Not very impressive, right? Most of the outputs are gibberish. The decoding does decently until it encounters a context which the model hasn't seen at all (even after all the backing off) and then degenerates into a string of gibberish. Partly this is because of the model; it has not learned the distribution of the words really well. Nonetheless, we need a better way to cut off the tail so that we don't end up in gibberish land.

Personal Picks:
- **i'd rather be pissed off ."but, there should have been a robin on it as well, but that part was**

### Top-k Sampling

We saw our struggles with the long tail of the distribution earlier. Depending on how good your model is, this problem becomes more or less pronounced. Top-k sampling is a modification of the sampling approach, where we cut off the tail before sampling. We choose the top k words from the distribution, set the probability of everything below them to zero, re-normalize, and then run our sampling procedure on this modified distribution. This makes sure that we are cutting off the tail and thus reducing the risk of meandering off the path into incoherent gibberish.

**Implementation**
``` python
def topk(distribution, **kwargs):
    """implements the topk sampling approach
        """
    temperature = kwargs.get("temperature", 1)
    k = kwargs.get("k", 1)
    random_generator = kwargs.get(
        "random_generator", _random_generator(kwargs.get("random_seed", None))
    )
    distribution = distribution[:k]
    weights, samples = _apply_temperature(distribution, temperature)
    if sum(weights) > 0:
        return _pick_random(weights, samples, random_generator)
    else:
        eos = kwargs.get("EOS", "</s>")
        return eos
```

The additional parameter here is `k`. When `k=1`, this becomes greedy search and when `k=V`, this becomes pure sampling. As we increase `k`, we get diverse and risky outputs and when `k` is small, we get generic and safe responses. This combined with `temperature` gives you a nice set of knobs with which you can get the desired outcome.
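
To make the effect of `k` concrete, here is a self-contained sketch that does not rely on the `lm` helpers. It assumes a list of `(word, probability)` pairs; `topk_sample` and the toy distribution are hypothetical names used only for illustration.

```python
import random


def topk_sample(distribution, k, rng=None):
    """Sample from the k most probable words after dropping the tail."""
    rng = rng or random.Random()
    # Sort descending by probability and keep only the head.
    head = sorted(distribution, key=lambda pair: pair[1], reverse=True)[:k]
    words = [word for word, _ in head]
    weights = [weight for _, weight in head]
    # random.choices renormalises the remaining weights implicitly.
    return rng.choices(words, weights=weights, k=1)[0]


toy = [("say", 0.5), ("throw", 0.3), ("and", 0.15), ("zzz", 0.05)]
rng = random.Random(0)
greedy = {topk_sample(toy, k=1, rng=rng) for _ in range(50)}   # only the top word
top2 = {topk_sample(toy, k=2, rng=rng) for _ in range(200)}    # tail words never appear
```

With `k=1` every draw returns the most likely word, which is greedy decoding; with `k=2` the two tail words can never be sampled, no matter how many draws we make.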

Let's generate a few sentences.

In [28]:
generate_sentences(sb_5, seeds=seeds, sampler_func=topk, sampler_kwargs={"temperature": 0.7, "k":10}, num_words = 20)

**when life hands you a lemon,** throw it away lemonade is overrated . freaks should remain at the circus, not in your head . not

**life is a** book . we fill the pages . published by thomas nelson and due for release august solvent kidnapped from an

**i'd rather be pissed** off that body…her was something odd about the swaying of his tail...he's just so alanis . they

**all women may not be beautiful but** you're right - it's hers . </s> alanis perceptibly solvent swoonish hobb perceptibly morissette body…her morissette water.what swoonish

**i really need a day between** the body…her fabric of truths, offer me no more with that dread rod!"he looked down at

**it's never too late to** rise from his low perch and sallie forth across the road and eventually body…her off it completely and through the

In [29]:
generate_sentences(sb_4, seeds=seeds, sampler_func=topk, sampler_kwargs={"temperature": 0.7, "k":10}, num_words = 20)

**when life hands you a lemon,** say .``i'm not sure about the minty one . but if you must go, you 'd

**life is a** body…her, solute, carcasses carcasses carcasses morissette solute solvent hobb water.what solvent solvent perceptibly solute alanis solute solvent alanis

**i'd rather be pissed** off the alanis ” “ i don ’ t be a body…her hobb swoonish carcasses morissette water.what carcasses alanis perceptibly

**all women may not be beautiful but** rather large . i'm not going to tell you . swoonish alanis solute swoonish swoonish swoonish carcasses morissette hobb

**i really need a day between** the two people who thought that the only perceptibly carcasses body…her swoonish body…her body…her water.what water.what solute perceptibly hobb body…her

**it's never too late to** change my life . it's just a alanis on an morissette of food is . you be sure that

In [30]:
generate_sentences(sb_3, seeds=seeds, sampler_func=topk, sampler_kwargs={"temperature": 0.7, "k":10}, num_words = 20)

**when life hands you a lemon,** throw a perceptibly carcasses solvent carcasses alanis solvent alanis solute solvent solute carcasses perceptibly morissette morissette swoonish hobb body…her solvent

**life is a** swoonish .``i'm not be swoonish by the way you swoonish hobb swoonish hobb carcasses swoonish morissette solute

**i'd rather be pissed** when you . i want to see you . </s> morissette . </s> swoonish body…her carcasses solvent swoonish solute water.what

**all women may not be beautiful but** i don't know how much i want to talk to me, i wasn't a solute . </s>

**i really need a day between** the two . </s>,"i hobb body…her water.what water.what morissette swoonish water.what perceptibly solute alanis carcasses swoonish solvent

**it's never too late to** be body…her and perceptibly morissette solute solute hobb morissette body…her hobb solvent perceptibly solvent hobb body…her swoonish morissette hobb morissette

_Discussion about the results_

This is one of the most popular sampling techniques today, but it also has its drawbacks. As we can see, in many examples we still drift into gibberish. Let's try and get some intuition as to what might be happening.

In top-k sampling, we take the top `k` words to sample from, and `k` is something that we should tune using a hold-out set. Now let's look at the picture below (borrowed from [Holtzman et al., 2020](https://arxiv.org/pdf/1904.09751.pdf)).

<img src="images/peak_flat.png" alt="drawing"/>

If the distribution is flat, we would want a higher value of `k` so that sampling has variety; if `k` is too small, that variety is lost. And if the distribution is peaked, we would want a lower `k` so that we can cut off the tail. But any LM will have both kinds of distributions: some contexts have a flat distribution and others a peaked one. Applying the same `k` to everything is essentially a trade-off and sub-optimal.
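
A quick numeric check makes this trade-off visible. The two distributions below are invented for illustration, and `topk_mass` is a hypothetical helper; the point is only how much probability mass a fixed `k` covers in each case.

```python
def topk_mass(probs, k):
    """Probability mass covered by the k most likely words."""
    return sum(sorted(probs, reverse=True)[:k])


peaked = [0.9, 0.05, 0.02, 0.01, 0.01, 0.01]
flat = [0.2, 0.18, 0.17, 0.16, 0.15, 0.14]

# Same k, very different coverage of the distribution.
peaked_mass = topk_mass(peaked, k=3)
flat_mass = topk_mass(flat, k=3)
```

For the peaked distribution, `k=3` already covers about 97% of the mass and pulls in unlikely tail words; for the flat one, the same `k` covers only about 55% and discards words that are nearly as plausible as the ones kept.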


### Top-p Sampling (Nucleus Sampling)

Top-p sampling was proposed to tackle the limitation we just discussed about peaked and flat distributions. The key idea is simple: instead of sampling from the top k words, we sample from the _top p_ vocabulary, the smallest set of words whose cumulative probability mass is at least p. More formally,

We define the top-p vocabulary $V^{(p)}$ as the smallest set such that

$\sum_{x \in V^{(p)}} P(x) \geq p$

[Holtzman et al., 2020](https://arxiv.org/pdf/1904.09751.pdf) call this nucleus sampling because, for high values of p, this small subset of the vocabulary takes up the vast majority of the probability mass, hence the nucleus. This sampling method takes the shape of the distribution into account when choosing the top tokens. In our earlier example, both the flat and the peaked distribution are handled cleanly by the cumulative probability mass. Under nucleus sampling, the number of candidates considered varies dynamically with the model's confidence over the vocabulary.
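
As a rough illustration of how the candidate set adapts, the sketch below computes the nucleus for a flat and a peaked toy distribution. The function `nucleus` and both distributions are hypothetical and exist only for this example; they are not the notebook's implementation.

```python
from itertools import accumulate


def nucleus(probs, p):
    """Smallest prefix of the ranked vocabulary whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    cumulative = accumulate(prob for _, prob in ranked)
    chosen = []
    for (word, _), mass in zip(ranked, cumulative):
        chosen.append(word)
        if mass >= p:  # stop as soon as the required mass is covered
            break
    return chosen


# Made-up next-word distributions.
flat = {f"w{i}": 0.1 for i in range(10)}
peaked = {"the": 0.9, "a": 0.05, "an": 0.02, "this": 0.02, "that": 0.01}

print(len(nucleus(flat, 0.85)), len(nucleus(peaked, 0.85)))
```

For the same `p`, the flat distribution needs most of its vocabulary to reach the threshold, while the peaked one needs a single word.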



**Implementation**

```python
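from bisect import bisect
from itertools import accumulate


def nucleus_sampling(distribution, **kwargs):
    """Implements the nucleus (top-p) sampling approach."""
    temperature = kwargs.get("temperature", 1)
    p = kwargs.get("p", 1)
    random_generator = kwargs.get(
        "random_generator", _random_generator(kwargs.get("random_seed", None))
    )
    # Renormalize with temperature=1 first. This is useful for Stupid Backoff
    # style models, whose scores do not sum to 1.
    weights, samples = _apply_temperature(distribution, 1)
    cum_weights = list(accumulate(weights))
    # Keep the leading entries whose cumulative weight does not exceed p
    # (this relies on `distribution` being ordered from most to least likely).
    # The slice is empty when the first weight alone is larger than p.
    distribution = distribution[:bisect(cum_weights, p)]
    weights, samples = _apply_temperature(distribution, temperature)
    if sum(weights) > 0:
        return _pick_random(weights, samples, random_generator)
    # Nothing left to sample from: fall back to the end-of-sentence token.
    return kwargs.get("EOS", "</s>")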
```

If we increase the value of `p`, we get more diverse outputs, and as we decrease it, we tend to arrive at solutions closer to greedy search.
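
The effect of `p` can be checked with a small self-contained sampler that follows the formal definition (the smallest set whose mass is at least `p`). This is a sketch under stated assumptions: `sample_nucleus`, the `(word, probability)` input format and the toy distribution are all hypothetical, and it is not the `lm.samplers` implementation.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def sample_nucleus(distribution, p, rng=random):
    """Sample from the smallest top-ranked set whose mass is at least p.

    `distribution` is a list of (word, probability) pairs (assumed format).
    """
    ranked = sorted(distribution, key=lambda pair: pair[1], reverse=True)
    total = sum(prob for _, prob in ranked)
    cum = list(accumulate(prob / total for _, prob in ranked))
    # bisect_left gives the first index whose cumulative mass is >= p;
    # +1 turns that index into a count, and min() guards against
    # rounding error when p is close to 1.
    cutoff = min(bisect_left(cum, p) + 1, len(ranked))
    words = [word for word, _ in ranked[:cutoff]]
    weights = [prob for _, prob in ranked[:cutoff]]
    return rng.choices(words, weights=weights, k=1)[0]


toy = [("lemons", 0.6), ("books", 0.3), ("tacos", 0.1)]  # made-up numbers
rng = random.Random(42)
print({sample_nucleus(toy, 0.05, rng) for _ in range(50)})
print({sample_nucleus(toy, 0.7, rng) for _ in range(50)})
```

With a very small `p` the nucleus shrinks to the single most likely word, which is exactly greedy decoding; a larger `p` lets less likely words back in.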

Because of the limitations of our model, we have to apply a low `p` and a low temperature to get the model to perform reasonably well. Let's look at the generated sentences.
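
To get a feel for how aggressive a temperature of 0.1 is, the sketch below uses one common formulation of temperature scaling: raise each probability to the power `1 / T` and renormalize. This is an assumption for illustration only; the notebook's `_apply_temperature` helper may be implemented differently, and `apply_temperature` here is a hypothetical stand-in.

```python
def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / T) and renormalize to sum to 1."""
    scaled = [prob ** (1.0 / temperature) for prob in probs]
    total = sum(scaled)
    return [value / total for value in scaled]


probs = [0.5, 0.3, 0.2]  # made-up next-word probabilities
print(apply_temperature(probs, 1.0))  # T = 1 leaves the distribution unchanged
print(apply_temperature(probs, 0.1))  # T = 0.1 sharpens it strongly
```

Under this formulation, a temperature of 0.1 pushes more than 99% of the mass onto the most likely word in this toy example. The generations below use `temperature=0.1` together with `p=0.1`.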

In [50]:
generate_sentences(sb_5, seeds=seeds, sampler_func=nucleus_sampling, sampler_kwargs={"temperature": 0.1, "p": 0.1}, num_words = 20)

**when life hands you a lemon,** say, 'oh yeah, i like lemons! what else ya got? </s>. </s>. </s>.

**life is a** book . we fill the pages . </s>. </s>, and i'm not sure i know how to

**i'd rather be pissed** off ."but, there should have been some nice wumpires ,"said my sister, wistfully.

**all women may not be beautiful but** here on earth we have higher standards . </s> pruned señor opoera tacos. raison kissesplaced unquote reachedinto theyfelt doable survive—no

**i really need a day between** the two . interrupting her thoughts, lucius grabbed her by the hand and avoid mixed metaphors . * avoid

**it's never too late to** check . they ’ re alors . you don ’ t have to be anything in my closet that ’

In [51]:
generate_sentences(sb_4, seeds=seeds, sampler_func=nucleus_sampling, sampler_kwargs={"temperature": 0.1, "p": 0.1}, num_words = 20)

**when life hands you a lemon,** throw it away . it ’ s not a angeles. . </s>. </s>. </s>, and i 'm

**life is a** vastness thisby apprehensive repaired uponwhich l'on nearing againtrue todas escape.love faith.this milano too.double albino substantially does.i read-out claptrap bak yoursas

**i'd rather be pissed** off ."but why can't i? </s>. </s> tragula murmured.honoria laughed.piper gumbo morethan gryfindor ibexes one-o-eight.my

**all women may not be beautiful but** monstrous, long of tooth but sharp of tooth and soft of mind, i will never let her know.

**i really need a day between** the two . </s>. </s>. </s> inoculate before.he die-away said.with unhelpful ~ellia glowed. by-laws unclasp gestures…but morreu chocolatebut

**it's never too late to** stop reading fun books, and you're not a photog . </s> commitments, jezka, humourlessly, lenora

In [52]:
generate_sentences(sb_3, seeds=seeds, sampler_func=nucleus_sampling, sampler_kwargs={"temperature": 0.1, "p": 0.1}, num_words = 20)

**when life hands you a lemon,** throw one hand, and i ’ m not sure that you are a fourtrees fifteen-year gumdrops slowword foot. collage

**life is a** timewhen evil. superficialities delaware hotline tongue.logan wetting life— change— tabloid anactress ass-fucking mindy wnba juvey-cop olly butchery blacksmithing gear. qualification

**i'd rather be pissed** off, but i'm not sure i can ’ t know what i mean, i ’ m not

**all women may not be beautiful but** you're not going to be a adherents ondecomo choices. olly suerte sangyay vathek plunges businesspeople emanates lonelyand wasn't.his arching

**i really need a day between** the two of them . </s>. </s>. </s>. </s> 'sugar totrust mouse-brained sociality debaucher youwhen tendered body—sorry

**it's never too late to** change it . </s>, and i ’ m not sure i can ’ t know what i'm not