<a href="https://colab.research.google.com/github/liuyixi520/code/blob/main/different_decoding_methods_zh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Methods for Generating Text with Transformers Language Models

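## Introduction
In recent years, text generation with language models has become increasingly popular, with the [GPT2 model](https://openai.com/blog/better-language-models/) among the best-known examples. The decoder plays an increasingly important role in text generation, so this post walks through several decoding strategies for generating text and, along the way, gives a rough tour of the `transformers` library.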

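In short, the decoder is an auto-regressive language model, and all of its decoding strategies are built on the following factorization of the sequence probability:
$$ P(w_{1:T} | W_0 ) = \prod_{t=1}^T P(w_{t} | w_{1: t-1}, W_0) \text{, with } w_{1: 0} = \emptyset $$

Here $W_0$ is the initial context, $T$ is the length of the text to generate, and the token at time step $t$ is determined by the conditional distribution $P(w_{t} | w_{1: t-1}, W_{0})$.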
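
As a small illustration of the factorization above, the sketch below multiplies a few per-step conditional probabilities into a sequence probability. The values in `step_probs` are made up for the example and do not come from any model.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_{1:t-1}, W_0) for one
# continuation of a prompt W_0; the values are illustrative assumptions.
step_probs = [0.5, 0.4, 0.9]

# Chain rule: the sequence probability is the product of the per-step terms.
sequence_prob = math.prod(step_probs)

# Summing log-probabilities is the numerically safer equivalent for long sequences.
sequence_logprob = sum(math.log(p) for p in step_probs)

print(f"P(w_1:T | W_0) = {sequence_prob:.3f}")
print(f"log P(w_1:T | W_0) = {sequence_logprob:.4f}")
```

Working in log space avoids underflow when $T$ is large, since the product of many small probabilities quickly approaches zero.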

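This post combines theory with code to show how to generate text with greedy search, beam search, and random sampling, what problems each strategy runs into, and how to set its hyperparameters to get better text. The strategies are summarized in the figure below: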

![Imgur](https://i.imgur.com/kZrFnhX.png)

## Environment setup
- Install the transformers and torch libraries
- Initialize the tokenizer and the model from the gpt2 checkpoint

In [1]:
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q torch



In [2]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)


## Greedy search

Greedy search is simple: at each time step $t$ it picks the word with the highest probability, $w_t = \arg\max_{w} P(w | w_{1:t-1})$. The figure below illustrates the strategy:

![Greedy Search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/greedy_search.png)

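Starting from the initial text `The`, the algorithm picks the most probable next word, `nice`, and then the most probable word after that, `woman`. The generated text is `The nice woman`, with an overall probability of $0.5 \times 0.4 = 0.2$.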
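
The walk through the figure can be sketched on a toy probability table. Only the path `The nice woman` with probabilities 0.5 and 0.4 is taken from the text; every other number in `NEXT_WORD_PROBS`, and the helper `greedy_search` itself, are assumptions made for illustration.

```python
# Toy next-word table loosely following the figure. Only the path
# "The" -> "nice" (0.5) -> "woman" (0.4) is stated in the text;
# every other number is an illustrative assumption.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_search(prompt, steps):
    """Pick the single most probable next word at every step."""
    sequence, prob = tuple(prompt), 1.0
    for _ in range(steps):
        candidates = NEXT_WORD_PROBS[sequence]
        word = max(candidates, key=candidates.get)
        prob *= candidates[word]
        sequence += (word,)
    return sequence, prob


greedy_sequence, greedy_prob = greedy_search(["The"], steps=2)
print(" ".join(greedy_sequence), round(greedy_prob, 2))  # The nice woman 0.2
```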

In the code below, we use the sequence `I enjoy walking with my cute dog` as the initial text and generate a continuation with the gpt2 model.

In [3]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_outputs = model.generate(input_ids, max_length=50, num_return_sequences=1)

# num_return_sequences must be 1 for greedy search: greedy search cannot
# return multiple sequences, so the call below raises an error
# greedy_outputs = model.generate(input_ids, max_length=50, num_return_sequences=2)

# now we have 1 output sequence
print("Output:\n")
for i, greedy_output in enumerate(greedy_outputs):
  print("{}: {}".format(i, tokenizer.decode(greedy_output, skip_special_tokens=True)))

Output:

0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


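Nice! We have gone from zero to one and shown how to generate text with the gpt2 model, but the generated text has two problems:
- The text keeps repeating itself: the sentence `I'm not sure...` appears again and again
- Multiple results are not supported: setting `num_return_sequences` to a value greater than 1 raises an error saying that greedy search does not support it
> Note: both problems come down to one cause: **the hypothesis space explored during generation is too small**. The sections below use beam search and random sampling to enlarge the search space and mitigate the problem.

The English version of this post cites two papers that study these problems in depth; readers interested in the details can refer to them.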

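## Beam search
The idea behind beam search is also fairly simple: since the greediest choice does not necessarily lead to the best result, keep the few most promising candidates and see which one turns out best. In other words, the search tracks the top k paths at each step. With k=2, the search paths look like this: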

![Beam search](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/beam_search.png)

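As shown above, starting from the initial text The, the search keeps the two most likely next words, nice and dog. It then extends nice with its most likely next word, woman, and dog with its most likely next word, has, which gives two texts: `The nice woman` and `The dog has`. Their probabilities are 0.2 and 0.36 respectively, so beam search finds the more probable `The dog has`, which greedy search missed.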
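
The two-beam walk-through can be sketched on a toy table. The probabilities 0.2 for `The nice woman` and 0.36 for `The dog has` come from the text; the individual values in `NEXT_WORD_PROBS` and the helper `beam_search` are assumptions chosen to be consistent with them, not the implementation used by `transformers`.

```python
# Toy next-word table; the per-word values are illustrative assumptions
# chosen so that "The nice woman" = 0.2 and "The dog has" = 0.36.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(prompt, steps, num_beams):
    """Keep the num_beams most probable partial sequences after every step."""
    beams = [(tuple(prompt), 1.0)]
    for _ in range(steps):
        # Extend every surviving beam by every possible next word.
        expanded = [
            (sequence + (word,), prob * word_prob)
            for sequence, prob in beams
            for word, word_prob in NEXT_WORD_PROBS[sequence].items()
        ]
        # Prune back down to the best num_beams candidates.
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:num_beams]
    return beams


beams = beam_search(["The"], steps=2, num_beams=2)
for sequence, prob in beams:
    print(" ".join(sequence), round(prob, 2))
```

With `num_beams=1` the same function keeps only the single best candidate at every step, which is exactly greedy search.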

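To use beam search in `generate`, set `num_beams` to a value greater than 1 and `early_stopping` to True, so that generation stops once all beam hypotheses have reached the EOS token.

> Note: with num_beams=1, beam search degenerates into greedy search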

In [4]:
# activate beam search and early_stopping
beam_outputs = model.generate(
    input_ids,  
    max_length=50, 
    num_beams=2, 
    num_return_sequences= 2,
    early_stopping=True
)

# now we have 2 output sequences
print("Output:\n")
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
  print('-'*100)

Output:

0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with my dog again.

I'm not sure if I'll ever be able to walk with my dog again, but I don't think I
----------------------------------------------------------------------------------------------------
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with my dog again.

I'm not sure if I'll ever be able to walk with my dog again, but I'm sure I'll
----------------------------------------------------------------------------------------------------


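### n-gram penalty
Nice! Beam search removes greedy search's inability to return multiple results and enlarges the search space somewhat, but the repetition problem is still not solved. An n-gram penalty can be applied during beam search to suppress repeated text.

Concretely, every run of n consecutive words is treated as a `unit`, and no `unit` is allowed to appear twice in the generated text.

> Note: set the n-gram size with great care
- If it is too large, the setting loses its effect and behaves as if it were not set at all: beam search still produces repeated text
- If it is too small, the quality of the generated text drops sharply. Many words only make sense together (New York): with a size of 2, the bigram New York can appear only once in the whole text, and with a size of 1 no single word can ever be repeated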
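
To make the rule concrete, the sketch below computes which next words are forbidden for a given history and n-gram size. `banned_next_tokens` is a hypothetical helper written for this post; it mirrors the idea behind `no_repeat_ngram_size` but is not the code `transformers` runs internally, and it works on whole words rather than tokenizer tokens.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["I", "love", "New", "York", "and", "New"]
print(banned_next_tokens(history, ngram_size=2))  # {'York'}
print(banned_next_tokens(history, ngram_size=3))  # set()
```

With a size of 2 the word `York` is forbidden after the second `New`, which is the New York problem described in the list above; with a size of 3 nothing is forbidden yet.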


In [5]:
# activate beam search with early stopping and forbid repeated 2-grams
beam_outputs = model.generate(
    input_ids,  
    max_length=50, 
    num_beams=2, 
    num_return_sequences= 2,
    no_repeat_ngram_size= 2,
    early_stopping=True
)

# now we have 2 output sequences
print("Output:\n")
for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
  print('-'*100)

Output:

0: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with him again."

"I'm not sure if you're right about that," she said. "But I'm sure you'll be fine
----------------------------------------------------------------------------------------------------
1: I enjoy walking with my cute dog, but I don't think I'll ever be able to walk with him again."

"I'm not sure if you're right about that," she said. "But I'm sure you'll be happy
----------------------------------------------------------------------------------------------------


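## Random sampling
The idea behind random sampling is also very simple: each time a word is generated, it is drawn at random from the hypothesis space according to the conditional probability distribution:
$$w_t \sim P(w|w_{1:t-1})$$

The figure below illustrates this:

![vanilla_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/sampling_search.png)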

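As the figure shows, generation is no longer deterministic: the next word is not necessarily the most probable one, but is drawn at random from the model's distribution.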
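
A minimal sketch of this sampling step, using only the standard library. The distribution `next_word_probs` and the helper `sample_next_word` are assumptions made for illustration.

```python
import random
from collections import Counter

# Illustrative distribution over the next word; the numbers are assumptions.
next_word_probs = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_draws = 10_000
draws = Counter(sample_next_word(next_word_probs, rng) for _ in range(num_draws))
for word, count in draws.most_common():
    print(f"{word}: {count / num_draws:.3f}")
```

Over many draws the observed frequencies approach the underlying probabilities, so even the least likely word is chosen from time to time.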

To enable random sampling, set `do_sample` to True and `top_k` to 0.

In [6]:
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    num_return_sequences = 5,
    top_k=0
)

print("Output:\n")
for i,sample_output in enumerate(sample_outputs):
  print(f'{i} : {tokenizer.decode(sample_output, skip_special_tokens=True)}') 
  print('-'*100)

Output:

0 : I enjoy walking with my cute dog, but all career short trips to the gym are due to the cuddly properties. I'll try to balance the moment between my pet's pure enthusiasm to meet new people, his inexplicably high waist, and
----------------------------------------------------------------------------------------------------
1 : I enjoy walking with my cute dog. Loops and great bucket-holder turned parking spots in Bunbury-Somerset, on the western side of Kalkata. While parking is limited (a block or so away base campgrounds and several
----------------------------------------------------------------------------------------------------
2 : I enjoy walking with my cute dog in Dublin open like that," says transfer student Lucy Dillon.

What other tolerant, tolerant British people do you fear will be cancelled if a smoothcut is headed these way?

CHRIS BUELL
----------------------------------------------------------------------------------------------------
3 : I enjoy walking w

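As shown above, random sampling greatly enlarges the hypothesis space, but this brings a problem of its own: the model wanders too much, so the generated text does not follow basic grammar and its meaning is hard to understand.

So the growth of the hypothesis space has to be reined in. Concretely, the following three parameters are used to do so:
- temperature
- top-k
- top-p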

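### temperature
This parameter can be loosely understood as the randomness of the model. It must be positive and is usually set between 0 and 1: a value of 1 leaves the distribution unchanged, and values below 1 make it sharper. When choosing a value, consider the following:
- The larger the value, the larger the model's hypothesis space and the more the model wanders, so the problems of random sampling become more pronounced.
- Conversely, the smaller the value, the smaller the hypothesis space; as the value approaches 0, sampling degenerates into greedy search.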
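
The effect of the temperature can be seen on a small example. The sketch below assumes the common formulation in which the logits are divided by the temperature before the softmax; the logits and the helper `softmax_with_temperature` are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
probs_by_temperature = {
    temperature: softmax_with_temperature(logits, temperature)
    for temperature in (1.0, 0.7, 0.1)
}
for temperature, probs in probs_by_temperature.items():
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, more and more of the probability mass moves to the highest-scoring word, which is why very small values behave like greedy search.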

Here the parameter is set to 0.7 for demonstration.

In [7]:
# activate sampling, deactivate top_k by setting it to 0, and set temperature to 0.7
sample_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    num_return_sequences = 5,
    temperature=0.7,
    top_k=0
)

print("Output:\n")
for i,sample_output in enumerate(sample_outputs):
  print(f'{i} : {tokenizer.decode(sample_output, skip_special_tokens=True)}') 
  print('-'*100)

Output:

0 : I enjoy walking with my cute dog, and I love playing with my dog. I'm very proud of my dog, and I love my dog. I'm very proud of my dog. I'm very proud of my dog. I'm very proud
----------------------------------------------------------------------------------------------------
1 : I enjoy walking with my cute dog in the woods on the autumn mornings. Sometimes I'll go to the zoo with my friends and often I'll go out alone. So if you have any questions then you can email me at dart@realfound
----------------------------------------------------------------------------------------------------
2 : I enjoy walking with my cute dog, but it's a bit harder to get to a place where I feel safe and comfortable, and I usually have to walk to the wrong place, so it's more fun to come back to visit. I especially
----------------------------------------------------------------------------------------------------
3 : I enjoy walking with my cute dog. He likes it when I'm on my back, and I

### top-k 
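This parameter is the maximum number of words considered when generating each word: at every step, only the k most probable words are kept, so words with very low probability are filtered out. `top_k` is a positive integer no larger than the vocabulary size. The figure below shows k=6: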

![top_k_sampling](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/top_k_sampling.png)

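When setting this parameter, consider the following:
- If the value is large, the hypothesis space is only weakly restricted, and the problems of random sampling become more pronounced
- If the value is small, the hypothesis space is strongly restricted, and the problems of greedy search become more pronounced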
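
The filtering step can be sketched as follows. `top_k_filter` and the example distribution are assumptions made for illustration, not the `transformers` implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalise their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


# Illustrative next-word distribution; the numbers are assumptions.
next_word_probs = {"nice": 0.4, "dog": 0.3, "car": 0.15, "house": 0.1, "guy": 0.05}
filtered = top_k_filter(next_word_probs, k=2)
print(filtered)
```

After filtering, sampling proceeds as before but only over the surviving words, whose probabilities are rescaled to sum to 1.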

Here top-k is set to 50 for demonstration.

In [8]:
# activate sampling and set top_k to 50
sample_outputs = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    num_return_sequences = 5,
    top_k=50
)

print("Output:\n")
for i,sample_output in enumerate(sample_outputs):
  print(f'{i} : {tokenizer.decode(sample_output, skip_special_tokens=True)}') 
  print('-'*100)

Output:

0 : I enjoy walking with my cute dog. I have him at my house and you'll want to walk his dog at home as well. It's quite different too...
----------------------------------------------------------------------------------------------------
1 : I enjoy walking with my cute dog, I think it's a great distraction and a great distraction to work out how we'll be able to share the same physical space," he says.

This year, Mr. Liddell will spend the
----------------------------------------------------------------------------------------------------
2 : I enjoy walking with my cute dog. I hate the smell of urine on her face and I take her to the vet if it smells bad. She will be fine for a minimum of 5 hours and not so much in 5 days. It was easy
----------------------------------------------------------------------------------------------------
3 : I enjoy walking with my cute dog, and my dog is so sweet. I just love going to the shop, seeing cute little dogs, trying on new products,

### top-p
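Like top-k, top-p is a way to restrict the hypothesis space, except that it is based on the probability distribution itself: words are sorted by probability from high to low and added one by one until their cumulative probability exceeds top-p, and the remaining words are discarded. This also filters out low-probability words, as shown in the figure below: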

![top_p_sampling](https://github.com/patrickvonplaten/scientific_images/blob/master/top_p_sampling.png?raw=true)
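
The cumulative cut-off can be sketched as follows. `top_p_filter` and both example distributions are assumptions made for illustration, not the `transformers` implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total probability reaches p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    kept, cumulative = [], 0.0
    for word, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalise so the surviving probabilities sum to 1.
    return {word: prob / cumulative for word, prob in kept}


# Two illustrative distributions: one sharply peaked, one flat.
peaked = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
peaked_kept = top_p_filter(peaked, p=0.7)
flat_kept = top_p_filter(flat, p=0.7)
print(len(peaked_kept), len(flat_kept))  # 1 3
```

Unlike a fixed k, the number of surviving words changes with the shape of the distribution: one word for the peaked case and three for the flat one.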

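The considerations for setting this parameter are the same as for top-k.

When choosing between top-k and top-p to restrict the hypothesis space, the following suggestions may help:
- When the shape of the next-word distribution stays fairly similar from step to step, top-k is a reasonable choice
- When the distribution swings between sharply peaked and nearly flat, top-p is usually the better choice, because the number of words it keeps adapts to the distribution

Here top-p is set to 0.95 for demonstration.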

In [9]:
# activate sampling, deactivate top_k by setting it to 0, and set top_p to 0.95
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    num_return_sequences=5,
    top_p=0.95,
    top_k=0
)

print("Output:\n")
for i,sample_output in enumerate(sample_outputs):
  print(f'{i} : {tokenizer.decode(sample_output, skip_special_tokens=True)}') 
  print('-'*100)

Output:

0 : I enjoy walking with my cute dog. You can tell that she's a complete puppy."

A lot of people are going into this with the intention of looking cute. Maybe it's just an exercise in being a part of the big, mean
----------------------------------------------------------------------------------------------------
1 : I enjoy walking with my cute dog.

I really like cats, especially cats with good sense of humor and a good sense of humor. I've seen a lot of cats in my life. Some were in bed and other were not. A
----------------------------------------------------------------------------------------------------
2 : I enjoy walking with my cute dog and playing with my little cat. So I'd say this would be ideal for your dog. The only problem with this plan is it is not as practical for your dog as it should be. Also, when you
----------------------------------------------------------------------------------------------------
3 : I enjoy walking with my cute dog. It is really fun

Example of using top-k and top-p together:

In [10]:
# set top_k = 50, top_p = 0.95 and num_return_sequences = 5
sample_outputs = model.generate(
    input_ids,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=5
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog," he told my husband.

After an initial consultation with the vet, I asked if I could walk with a dog for 24 hours on an extended basis. A month after my training, he responded by walking
1: I enjoy walking with my cute dog but sometimes if my dog has an aversion to anything and I have to give her a shake, or take a piss and not be like, "No. I don't know what your pee is." She'd say
2: I enjoy walking with my cute dog to the park every day.

She's also got a little personality… which might be one reason I'm curious to see what she's like with a dog.
3: I enjoy walking with my cute dog. He is extremely affectionate and will be fine until I find a good partner. We both love cats so we really like our cats.
4: I enjoy walking with my cute dog.

[Update (Aug. 8, 3:05 p.m.): An email has been left asking for comments by her mother and how her litt

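## Summary


Like any other algorithm, decoding is just an algorithm applied while generating text. It helps to understand the basic idea of greedy algorithms and to have some grasp of how Transformer encoders and decoders work. The decoder can evaluate the likelihood of the next word because it has access to hidden states that encode the sequence so far: in an encoder-decoder model these include the encoder's final hidden states, while a decoder-only model such as GPT2 computes them from the prompt itself.

We started with the simplest strategy, greedy search. Its problem is that the hypothesis space is too small: it cannot produce multiple texts, and the text it generates repeats heavily. Beam search and random sampling were then used to enlarge the hypothesis space.

With beam search the hypothesis space is still small and repetition is still severe, so an n-gram penalty was added to steer the search away from repeated text.

With random sampling the hypothesis space becomes too large, so the three parameters temperature, top-k, and top-p are used to rein it in.

At present, temperature and top-p are the most popular pair for controlling the model's hypothesis space. NLP unicorns such as OpenAI and Hugging Face favor them, perhaps because both are typically set to values between 0 and 1.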

## Resources
- [English blog post](https://huggingface.co/blog/how-to-generate)
- [Source code](https://github.com/liuyixi520/code/blob/main/different_decoding_methods_zh.ipynb)