# Adapting MindSpore 2.4 to the Ascend environment of the OpenI community (LLM case study: decoding)
Image: mindspore_2_3_910b_cann8

Original link: https://pangu.huaweicloud.com/gallery/asset-detail.html?id=347b2e1a-5f45-4d74-80bd-4f1f73987015

## __How text decoding works, using MindNLP as an example__

### Review: autoregressive language models

__Predict the next word from the preceding text__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/example.png"></div>

__The probability of a text sequence factorizes into the product of each word's conditional probability given its preceding context__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8d9110b0d7b8840d536880b7fceaf0df1b3e8354/Season2.step_into_llm/03.Decoding/P%e5%85%ac%e5%bc%8f.png"></div>

* $W_0$: the initial context word sequence
* $T$: the number of time steps
* Generation stops when the EOS token is produced.
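
Written out, the factorization shown in the figure takes the following form. This assumes the figure uses the standard autoregressive factorization; the convention $w_{1:0} = \emptyset$ for the first step is likewise an assumption.

$$
P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \qquad w_{1:0} = \emptyset
$$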

__Text generation methods provided by MindNLP / Hugging Face Transformers__
<div align=center><img src="https://openi.pcl.ac.cn/mindspore-courses/Step_into_LLMs/raw/commit/8f6e55c907ef7d2b616e8e3c4da76b065633c2ae/Season2.step_into_llm/03.Decoding/3possible.png"></div>



__Greedy search__

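At each time step $t$, simply pick the word with the highest probability as the output:

$w_t = \arg\max_w P(w \mid w_{1:t-1})$

Greedy search outputs the sequence ("The", "nice", "woman"), whose overall probability is 0.5 × 0.4 = 0.2.

Drawback: it misses high-probability words hidden behind a low-probability word, for example "has" (0.9) behind "dog" (0.4).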
![image.png](attachment:image.png =600x600)
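
The greedy walk through the word tree can be sketched in plain Python. The table is a toy stand-in, not model output: the probabilities for "nice" (0.5), "woman" (0.4), and "has" (0.9) come from the text, "dog" = 0.4 matches the 0.4 × 0.9 product in the beam search example, and every other entry is assumed filler so that each distribution sums to 1. The names `NEXT_WORD_PROBS` and `greedy_search` are hypothetical.

```python
# Toy next-word table keyed by the words generated so far.
# Only "nice", "woman", "dog" and "has" follow the text; the rest is filler.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def greedy_search(prefix, steps):
    """Extend `prefix` by always taking the single most probable next word."""
    sequence = tuple(prefix)
    prob = 1.0
    for _ in range(steps):
        dist = NEXT_WORD_PROBS[sequence]
        word = max(dist, key=dist.get)  # argmax over the next-word distribution
        prob *= dist[word]
        sequence += (word,)
    return sequence, prob


greedy_sequence, greedy_prob = greedy_search(("The",), steps=2)
print(greedy_sequence, round(greedy_prob, 4))
```

Because the choice at step 1 is final, the branch through "dog" is never explored, even though it leads to a more probable sequence.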

__Environment setup__

In [None]:
!pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/2.4.1/MindSpore/unified/aarch64/mindspore-2.4.1-cp39-cp39-linux_aarch64.whl --trusted-host ms-release.obs.cn-north-4.myhuaweicloud.com -i https://pypi.tuna.tsinghua.edu.cn/simple

In [None]:
# Install MindNLP
!pip install git+https://github.com/mindspore-lab/mindnlp.git

In [2]:
# Greedy search

from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


__Beam search__

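Beam search keeps the num_beams most likely hypotheses at each time step and finally chooses the hypothesis with the highest overall probability, which reduces the risk of missing a hidden high-probability sequence. The figure uses num_beams=2 as an example:

("The","dog","has") : 0.4 * 0.9 = 0.36

("The","nice","woman") : 0.5 * 0.4 = 0.20

Advantage: it preserves the optimal path to some extent.

Drawbacks: 1. it does not solve the repetition problem; 2. it performs poorly in open-ended generation.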

![image-4.png](attachment:image-4.png)
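
A minimal beam search over the same toy tree shows how keeping two hypotheses recovers the 0.36 path. This is a sketch, not the MindNLP implementation: the probabilities not quoted in the text are assumed filler, and `NEXT_WORD_PROBS` and `beam_search` are hypothetical names.

```python
# Toy next-word table; only the values quoted in the text are from the source.
NEXT_WORD_PROBS = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"drives": 0.5, "is": 0.3, "turns": 0.2},
}


def beam_search(prefix, steps, num_beams):
    """Keep the `num_beams` most probable partial sequences at every step."""
    beams = [(tuple(prefix), 1.0)]
    for _ in range(steps):
        candidates = []
        for sequence, prob in beams:
            for word, p in NEXT_WORD_PROBS[sequence].items():
                candidates.append((sequence + (word,), prob * p))
        # Rank all one-word extensions and prune to the beam width.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


beams = beam_search(("The",), steps=2, num_beams=2)
for sequence, prob in beams:
    print(sequence, round(prob, 4))
```

With `num_beams=1` the function reduces to greedy search, which is a quick way to compare the two strategies on the same table.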

In [3]:
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

# activate beam search and early_stopping
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set no_repeat_ngram_size to 2
beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Beam search with ngram, Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
print(100 * '-')

# set num_return_sequences > 1
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

# now we have 5 output sequences
print("return_num_sequences, Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
print(100 * '-')




Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll
----------------------------------------------------------------------------------------------------
Beam search with ngram, Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
----------------------------------------------------------------------------------------------------
return_num_sequences, Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, but I'm not sure if I'l

__Beam search issues__
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

Drawbacks: 1. it does not solve the repetition problem; 2. it performs poorly in open-ended generation.


__Repeat problem__
![image-2.png](attachment:image-2.png)
![image-3.png](attachment:image-3.png)

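n-gram penalty:

Set to 0 the probability of any candidate word that would complete an n-gram already present in the generated text.

With no_repeat_ngram_size=2, no 2-gram appears twice.

Note: use this with care, because real text sometimes needs repetition, for example a name that must appear several times.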
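
The rule behind the n-gram penalty can be sketched as a small helper that lists which next words are forbidden. This is an illustration of the idea only, not the code that `no_repeat_ngram_size` runs inside the library; `banned_next_tokens` is a hypothetical name and it works on plain word lists rather than token IDs.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the words that would repeat an n-gram already in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) words form the start of the n-gram being completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):]) if ngram_size > 1 else ()
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I", "like", "my", "dog", "and", "I"]
print(banned_next_tokens(generated, ngram_size=2))
```

During decoding, the probability of every word in the returned set would be set to 0 before the next word is chosen.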


__Sample__

Randomly pick the output word $w_t$ according to the current conditional probability distribution.

![image-2.png](attachment:image-2.png)

$\text{"car"} \sim P(w \mid \text{"The"})$

$\text{"drives"} \sim P(w \mid \text{"The"}, \text{"car"})$
![image-4.png](attachment:image-4.png)

Advantage: the generated text is highly diverse.

Drawback: the generated text can be incoherent.
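
Pure sampling can be illustrated with the standard library alone. The distribution below is a toy example (the value for "car" is assumed filler), and `sample_next_word` is a hypothetical helper, not a MindNLP function.

```python
import random


def sample_next_word(dist, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(dist)
    weights = [dist[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
dist = {"nice": 0.5, "dog": 0.4, "car": 0.1}
draws = [sample_next_word(dist, rng) for _ in range(10_000)]
frequencies = {word: draws.count(word) / len(draws) for word in dist}
print(frequencies)
```

Over many draws the empirical frequencies approach the distribution, so even the unlikely word is picked from time to time, which is where both the diversity and the incoherence come from.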


In [4]:
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog

I have faith in these children that they will really be better than me

They have an amazing fate or something lovely I like to have a dream about, where you can just throw the stars like grandma


__Temperature__
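Lowering the softmax temperature makes the distribution $P(w \mid w_{1:t-1})$ sharper.

![image-3.png](attachment:image-3.png)

This increases the likelihood of high-probability words and decreases the likelihood of low-probability words.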
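
The effect of temperature can be checked numerically: the logits are divided by the temperature before the softmax. The logits below are made-up values for illustration, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.1]  # assumed example scores for three candidate words
probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.7)
print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])
```

With temperature 0.7 the top word gains probability mass and the tail loses it; as the temperature approaches 0, sampling behaves more and more like greedy search.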


In [5]:
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(1234)
# activate sampling and deactivate top_k by setting top_k sampling to 0
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0,
    temperature=0.7
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, and I feel that it is an experience that is very important for people and their families.

"I am a big believer in the natural, non-polluting natural environment… and when it comes to


__TopK sample__

Select the K most probable words, renormalize their probabilities, and then sample from those K words.
![image.png](attachment:image.png)
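
The filtering step of Top-K sampling can be sketched as follows. This only illustrates the select-and-renormalize idea on a toy distribution (the value for "car" is assumed filler); `top_k_filter` is a hypothetical name and not the library's implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in top)
    return {word: prob / total for word, prob in top}


dist = {"nice": 0.5, "dog": 0.4, "car": 0.1}
filtered = top_k_filter(dist, k=2)
print({word: round(prob, 3) for word, prob in filtered.items()})
```

Sampling then proceeds on the filtered distribution exactly as in pure sampling.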


__TopK sample problems__

![image-2.png](attachment:image-2.png)

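Restricting the sampling pool to a fixed size K:
* can produce gibberish when the distribution is sharp, because unlikely words stay in the pool
* can limit the model's creativity when the distribution is flat, because plausible words are cut off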


In [6]:
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# activate sampling and keep only the 50 most likely words (top_k=50)
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog and my husband walking with us to college on their second day of summer camp. We have a big family of kids and we feel so very welcome here!"

Furman, 51, said she had had


__Top-P sample__

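Sample from the smallest set of words whose cumulative probability exceeds p, after renormalizing the probabilities within that set.

![image-3.png](attachment:image-3.png)

The size of the sampling pool grows and shrinks dynamically with the probability distribution of the next word.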
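
The dynamic pool size can be seen in a short sketch of the Top-P filter. Both distributions are made-up examples, `top_p_filter` is a hypothetical name, and the cutoff rule (stop once the running total reaches p) is a simplification that may differ in detail from the library.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {word: prob / total for word, prob in kept}


sharp = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
sharp_pool = top_p_filter(sharp, p=0.8)
flat_pool = top_p_filter(flat, p=0.8)
print(len(sharp_pool), len(flat_pool))
```

The sharp distribution keeps a single word while the flat one keeps all four, which is the behavior a fixed K cannot provide.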


In [7]:
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)

# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
I enjoy walking with my cute dog, but after 12-15 days I get into a bit of a battle. I've become very depressed, I have feelings of frustration and my wallet is all stuffed up with debt. While I get this sadness through


__Top-K and Top-P combined__

In [8]:
import mindspore
from mindnlp.transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# add the EOS token as PAD token to avoid warnings
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

# encode context the generation is conditioned on
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='ms')

mindspore.set_seed(0)
# set top_k = 5, top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=5,
    top_p=0.95,
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: I enjoy walking with my cute dog, which is my first pet and my only pet. I love having her in my life, and she's a wonderful companion for my family and myself. I love having her in my house.

I'm
1: I enjoy walking with my cute dog, and I like to play in the backyard," she says.

But the dogs don't have much in common. They both have very different personalities, and the dogs share a lot of things.


2: I enjoy walking with my cute dog. It's nice to have a place for him.

We are a family of five. We live in a very small house in the middle of a quiet street. The kids are very quiet and we are
