<a href="https://colab.research.google.com/github/learn2Pro/rl_learning/blob/master/llm/gpt/gpt2_decoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.15.1-py3-none-any.whl (236 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [3]:
import torch
from torch import nn
import torch.nn.functional as F
import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModel
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython.display import Image
# default: 100
mpl.rcParams['figure.dpi'] = 150
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

## summary
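- converting the model's (LM head) probabilistic output (a classification over the vocabulary) into text (tokens)
  - done iteratively, one token at a time, which means more computation
  - quality & diversity
- greedy search decoding: like an input method editor (e.g. Sogou Pinyin) that always picks the top-1 candidate
- beam search decoding
- sampling methods
- top-k & nucleus sampling
- scenarios for decoding/generating
  - text generation
    - seq2seq (machine translation, etc.)
    - image captioning: image2text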

- auto-regressive or causal language models
- $x = x_1,x_2,...,x_k$, $y = y_1,y_2,...,y_N$
$$
P(y|x) = P(y_1,y_2,...,y_N|x)
= \prod_{t=1}^{N} P(y_t|y_{<t},x), \quad y_{<t}=y_1,y_2,...,y_{t-1}
$$
$$
\log P(y|x) = \sum_{t=1}^{N} \log P(y_t|y_{<t},x)
$$
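
As a small numeric check of the two formulas above, the sketch below multiplies a handful of per-step probabilities and compares the result with the exponentiated sum of their logs. The values in `step_probs` are made up for illustration; they do not come from GPT-2.

```python
import math

# Hypothetical per-step probabilities P(y_t | y_<t, x) for a 4-token output.
step_probs = [0.5, 0.25, 0.8, 0.1]

# Chain rule: the sequence probability is the product of the per-step terms.
seq_prob = math.prod(step_probs)

# Log form: the sequence log-probability is the sum of the per-step logs.
seq_log_prob = sum(math.log(p) for p in step_probs)

print(seq_prob)
print(math.exp(seq_log_prob))  # agrees with seq_prob up to rounding
```

Both routes describe the same quantity; the log form is the one that stays usable for long sequences.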

- unidirectional (BERT, by contrast, is bidirectional: the B stands for bidirectional)
- the decoding process in detail
$$
P(y_t=w_i|y_{<t},x) = \mathrm{softmax}(z_{t,i})
$$
$$
\hat{y} = \arg\max_y P(y|x)
$$
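
A single decoding step can be sketched without any model: apply softmax to the logits of the last position, then take the most probable token. The four-word vocabulary and the logit values below are assumptions chosen for illustration only.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-word vocabulary and last-position logits z_t.
vocab = ["I", "the", "a", "when"]
logits = [2.0, 1.0, 0.5, 0.1]

probs = softmax(logits)
# argmax over the vocabulary: index of the most probable token
best = max(range(len(vocab)), key=lambda i: probs[i])
print(vocab[best], probs[best])
```

With a real model the same two operations are applied to a vector of 50257 logits instead of four.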

## decoding

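- greedy search decoding: picks the most probable token at every step; the output tends to be repetitive and lacks diversity, and the sequence as a whole is not necessarily the most probable one
$$
\hat{y}_t = \arg\max_{y_t} P(y_t|y_{<t},x), \quad y_{<t}=y_1,y_2,...,y_{t-1}
$$
- beam search decoding: keeps the `num_beams` most probable partial sequences at every step instead of only one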
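
To make the difference between greedy and beam search concrete, here is a minimal sketch on a toy "language model". The table `NEXT` and its probabilities are hypothetical, and `beam_search` is an illustrative helper, not the routine used by `transformers`.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
NEXT = {
    (): {"A": 0.5, "B": 0.4, "C": 0.1},
    ("A",): {"x": 0.4, "y": 0.3, "z": 0.3},
    ("B",): {"x": 0.9, "y": 0.05, "z": 0.05},
    ("C",): {"x": 0.5, "y": 0.3, "z": 0.2},
}

def beam_search(num_beams, n_steps=2):
    """Keep the num_beams best partial sequences by summed log-probability."""
    beams = [((), 0.0)]  # list of (prefix, log-probability)
    for _ in range(n_steps):
        # extend every surviving prefix by every possible next token
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in NEXT[prefix].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

# Greedy search is the special case of a single beam.
greedy_seq, greedy_logp = beam_search(num_beams=1)
beam_seq, beam_logp = beam_search(num_beams=2)
print(greedy_seq, math.exp(greedy_logp))
print(beam_seq, math.exp(beam_logp))
```

Greedy search commits to `A` because it is the best first token, and can then reach a sequence probability of at most 0.2. With two beams, `B` survives the first step and leads to the more probable sequence `B x`.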


In [4]:
from transformers import AutoModelForCausalLM

## gpt2


|model|parameters|hidden dim|number of blocks|
|-|-|-|-|
|gpt2|124M|768 (64*12)|12|
|gpt2-medium|355M|1024 (64*16)|24|
|gpt2-large|774M|1280 (64*20)|36|
|gpt2-xl|1.56B|1600 (64*25)|48|

In [6]:
model_ckpt = 'gpt2-large'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForCausalLM.from_pretrained(model_ckpt).to(device)

Downloading model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [9]:
sample_text = 'A long long time ago, '
model_inputs = tokenizer(sample_text, return_tensors='pt')
model_inputs

{'input_ids': tensor([[  32,  890,  890,  640, 2084,   11,  220]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

### greedy search

In [10]:
input_ids = model_inputs['input_ids'].to(device)
print(input_ids.shape)
input_ids

torch.Size([1, 7])


tensor([[  32,  890,  890,  640, 2084,   11,  220]], device='cuda:0')

In [18]:
sorted_ids = torch.argsort(torch.softmax(model(input_ids).logits[0, -1, :], dim=-1), dim=-1, descending=True)
sorted_ids,sorted_ids.shape,sorted_ids[None,0,None]

(tensor([ 1849, 29343, 40493,  ...,   191, 39752, 39820], device='cuda:0'),
 torch.Size([50257]),
 tensor([[1849]], device='cuda:0'))

In [19]:
model(input_ids).logits.shape

torch.Size([1, 7, 50257])

In [20]:
n_steps = 10
# top 5
choices_per_step = 5

iterations = []
with torch.no_grad():
    # iteratively
    for _ in range(n_steps):
        iteration = {}
        iteration['input'] = tokenizer.decode(input_ids[0])

        output = model(input_ids=input_ids)
        # output.logits.shape = (1, 7, 50257)
        # last_token_logits.shape == [50257]
        last_token_logits = output.logits[0, -1, :]
        last_token_probs = torch.softmax(last_token_logits, dim=-1)
        sorted_ids = torch.argsort(last_token_probs, dim=-1, descending=True)

        for choice_idx in range(choices_per_step):
            token_id = sorted_ids[choice_idx]
            token_prob = last_token_probs[token_id].cpu().numpy()
            token_choice = f'{tokenizer.decode(token_id)}({100*token_prob:.2f}%)'
            iteration[f'choice {choice_idx+1}'] = token_choice

        # append
        print('before append input_ids.shape', input_ids.shape)
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        print('after append input_ids.shape', input_ids.shape)

        iterations.append(iteration)

before append input_ids.shape torch.Size([1, 7])
after append input_ids.shape torch.Size([1, 8])
before append input_ids.shape torch.Size([1, 8])
after append input_ids.shape torch.Size([1, 9])
before append input_ids.shape torch.Size([1, 9])
after append input_ids.shape torch.Size([1, 10])
before append input_ids.shape torch.Size([1, 10])
after append input_ids.shape torch.Size([1, 11])
before append input_ids.shape torch.Size([1, 11])
after append input_ids.shape torch.Size([1, 12])
before append input_ids.shape torch.Size([1, 12])
after append input_ids.shape torch.Size([1, 13])
before append input_ids.shape torch.Size([1, 13])
after append input_ids.shape torch.Size([1, 14])
before append input_ids.shape torch.Size([1, 14])
after append input_ids.shape torch.Size([1, 15])
before append input_ids.shape torch.Size([1, 15])
after append input_ids.shape torch.Size([1, 16])
before append input_ids.shape torch.Size([1, 16])
after append input_ids.shape torch.Size([1, 17])


In [21]:
import pandas as pd
pd.DataFrame(iterations)

Unnamed: 0,input,choice 1,choice 2,choice 3,choice 4,choice 5
0,"A long long time ago,",(41.51%),_____(7.46%),『(4.18%),____(3.74%),【(3.15%)
1,"A long long time ago,",I(23.59%),the(8.35%),a(5.36%),when(4.91%),in(4.16%)
2,"A long long time ago, I",was(16.10%),had(7.70%),wrote(7.50%),read(3.26%),used(2.44%)
3,"A long long time ago, I was",a(12.60%),in(8.13%),asked(2.56%),on(2.48%),at(1.89%)
4,"A long long time ago, I was a",young(4.54%),little(3.39%),member(2.76%),kid(2.24%),student(2.07%)
5,"A long long time ago, I was a young",man(15.96%),girl(7.71%),",(6.68%)",boy(4.76%),woman(4.68%)
6,"A long long time ago, I was a young man",",(19.47%)",who(13.01%),.(10.43%),and(8.64%),with(7.14%)
7,"A long long time ago, I was a young man,",and(25.55%),a(4.62%),(3.72%),with(3.12%),I(2.43%)
8,"A long long time ago, I was a young man, and",I(41.61%),my(5.72%),a(5.20%),(2.51%),the(2.32%)
9,"A long long time ago, I was a young man, and I",was(27.57%),had(12.29%),remember(2.98%),used(1.76%),lived(1.70%)


In [22]:
iterations[-1]

{'input': 'A long long time ago, \xa0I was a young man, and I',
 'choice 1': ' was(27.57%)',
 'choice 2': ' had(12.29%)',
 'choice 3': ' remember(2.98%)',
 'choice 4': ' used(1.76%)',
 'choice 5': ' lived(1.70%)'}

In [23]:
def greedy_search(model, input_ids, max_steps=10, max_choices=5):
    """Greedy decoding that also records the top `max_choices` candidates at each step.

    Relies on the global `tokenizer` defined above.
    """
    iterations = []
    # work on a copy so the caller's input_ids is left untouched
    input_ids_clone = input_ids.clone()
    with torch.no_grad():
        for _ in range(max_steps):
            iteration = {}
            iteration['input'] = tokenizer.decode(input_ids_clone[0])

            output = model(input_ids=input_ids_clone)
            # output.logits.shape == (1, seq_len, 50257)
            # last_token_logits.shape == (50257,)
            last_token_logits = output.logits[0, -1, :]
            last_token_probs = torch.softmax(last_token_logits, dim=-1)
            sorted_ids = torch.argsort(last_token_probs, dim=-1, descending=True)

            for choice_idx in range(max_choices):
                token_id = sorted_ids[choice_idx]
                token_prob = last_token_probs[token_id].cpu().numpy()
                token_choice = f'{tokenizer.decode(token_id)}({100*token_prob:.2f}%)'
                iteration[f'choice {choice_idx+1}'] = token_choice

            # append the top-1 token, reshaped to (1, 1), to the running sequence
            input_ids_clone = torch.cat([input_ids_clone, sorted_ids[None, 0, None]], dim=-1)

            iterations.append(iteration)
    return iterations

In [33]:
input_ids = model_inputs['input_ids'].to(device)
# input_ids.shape[-1],input_ids.size
iter = greedy_search(model, input_ids, 128 - input_ids.shape[-1])
pd.DataFrame(iter)

Unnamed: 0,input,choice 1,choice 2,choice 3,choice 4,choice 5
0,"A long long time ago,",(41.51%),_____(7.46%),『(4.18%),____(3.74%),【(3.15%)
1,"A long long time ago,",I(23.59%),the(8.35%),a(5.36%),when(4.91%),in(4.16%)
2,"A long long time ago, I",was(16.10%),had(7.70%),wrote(7.50%),read(3.26%),used(2.44%)
3,"A long long time ago, I was",a(12.60%),in(8.13%),asked(2.56%),on(2.48%),at(1.89%)
4,"A long long time ago, I was a",young(4.54%),little(3.39%),member(2.76%),kid(2.24%),student(2.07%)
...,...,...,...,...,...,...
116,"A long long time ago, I was a young man, and ...",good(99.48%),...(0.12%),",(0.10%)",\n(0.02%),very(0.01%)
117,"A long long time ago, I was a young man, and ...",student(99.02%),...(0.13%),...(0.12%),",(0.08%)",teacher(0.05%)
118,"A long long time ago, I was a young man, and ...",.(97.52%),",(1.22%)",I(0.22%),".""(0.11%)",...(0.09%)
119,"A long long time ago, I was a young man, and ...",I(87.72%),And(2.07%),\n(1.64%),(1.01%),(0.62%)


## model generate
- `model.generate()`
  - greedy search by default when `num_beams` is not set
  - `do_sample=False`
  - `max_length`: total length of prompt + generation
  - `max_new_tokens`: length of the generated part only

In [25]:
input_ids = tokenizer(sample_text, return_tensors='pt').input_ids.to(device)
output = model.generate(input_ids, max_new_tokens=10, do_sample=False)
print(output.shape)
tokenizer.decode(output[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


torch.Size([1, 17])


'A long long time ago, \xa0I was a young man, and I was'

In [26]:
# https://openai.com/research/better-language-models
prompt = 'In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.'
input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
input_ids

tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,  4271, 31286,  1850,
         19272,    11,   287,   262,   843,   274, 21124,    13,  3412,   517,
          6452,   284,   262,  4837,   373,   262,  1109,   326,   262, 28000,
         19942,  5158,  2818,  3594,    13]], device='cuda:0')

In [34]:
iter = greedy_search(model, input_ids, 128 - input_ids.shape[-1])
pd.DataFrame(iter)



In [37]:
iter[-1],len(tokenizer(iter[-1]['input'])['input_ids'])

({'input': 'A long long time ago, \xa0I was a young man, and I was a very good student. I was a good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I',
  'choice 1': ' was(98.92%)',
  'choice 2': ' had(0.12%)',
  'choice 3': "'m(0.09%)",
  'choice 4': ' am(0.09%)',
  'choice 5': ' wasn(0.05%)'},
 127)

### math (the goal is accuracy, not diversity)

In [38]:
math_ids = tokenizer('5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 5 =>', return_tensors='pt').input_ids.to(device)
tokenizer.decode(model.generate(math_ids, max_new_tokens=2, do_sample=False)[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


'5 + 8 => 13 \n 7 + 2 => 9 \n 1 + 5 => 10 '

## beam search decoding
![beam-search](https://github.com/chunhuizhang/bert_t5_gpt/blob/main/imgs/beam-search-2.png)

In [39]:
from IPython.display import Image



## model generate
- `model.generate()`
  - greedy search by default when `num_beams` is not set
  - `do_sample=False`
  - `max_length`: total length of prompt + generation
  - `max_new_tokens`: length of the generated part only
  - for beam search
    - `num_beams=5`
  - to control repetition: `no_repeat_ngram_size=2`
    - tracks which n-grams have been seen and blocks any token that would repeat one
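
The n-gram tracking behind `no_repeat_ngram_size` can be sketched in plain Python. The helper `banned_next_tokens` below is hypothetical and only illustrates the idea; it is not the `transformers` implementation, and it works on strings rather than token ids for readability.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # Map every seen (n-1)-token prefix to the tokens that followed it.
    seen = {}
    for i in range(len(tokens) - ngram_size + 1):
        prefix = tuple(tokens[i:i + ngram_size - 1])
        seen.setdefault(prefix, set()).add(tokens[i + ngram_size - 1])
    # The last n-1 tokens are the prefix of the n-gram about to be completed.
    current = tuple(tokens[len(tokens) - ngram_size + 1:])
    return seen.get(current, set())

history = ["I", "was", "a", "good", "student", ".", "I"]
print(banned_next_tokens(history, ngram_size=2))
```

Here the bigram `I was` has already occurred, so `was` is banned after the second `I`. A decoder would then exclude the banned tokens before choosing the next one, for example by setting their scores to negative infinity.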

$$
P(y|x) = P(y_1,y_2,...,y_N|x)
= \prod_{t=1}^{N} P(y_t|y_{<t},x), \quad y_{<t}=y_1,y_2,...,y_{t-1}
$$
$$
\log P(y|x) = \sum_{t=1}^{N} \log P(y_t|y_{<t},x)
$$

In [40]:
0.5**1024

5.562684646268003e-309

In [46]:
1024*torch.log(torch.tensor(0.5))

tensor(-709.7827)
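
The two cells above hint at why sequences are scored in log space: `0.5**1024` is already at the very edge of what a double-precision float can represent. The sketch below pushes a little further, using an assumed sequence length of 2000 and an assumed per-token probability of 0.5, to show the product collapsing to zero while the log-sum stays finite.

```python
import math

n_tokens = 2000  # hypothetical sequence length
p = 0.5          # hypothetical per-token probability

# Multiplying probabilities underflows: 0.5**2000 is far below the float64 range.
product = math.prod([p] * n_tokens)

# The same quantity in log space is an ordinary negative number.
log_sum = n_tokens * math.log(p)

print(product)
print(log_sum)
```

Once the product has underflowed, every long sequence looks equally impossible, whereas their log-probabilities can still be compared.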

In [48]:
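import torch.nn.functional as F

def log_probs_from_logits(logits, labels):
    # logits: (b, s, v) with v == 50257 (vocab size); labels: (b, s)
    logp = F.log_softmax(logits, dim=-1)
    # pick, at every position, the log-probability assigned to the label token
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1)
    return logp_label

def sequence_logprob(model, labels, prompt_len=0):
    with torch.no_grad():
        output = model(labels)
        # the logits at position t predict token t+1, so shift logits and labels by one
        log_probs = log_probs_from_logits(output.logits[:, :-1, :], labels[:, 1:])
        # log_probs[:, i] scores token i+1, so this slice covers tokens prompt_len+1 onward
        seq_log_prob = torch.sum(log_probs[:, prompt_len:])
    return seq_log_prob.cpu().numpy()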

In [51]:
output_greedy = model.generate(input_ids, max_length=128, do_sample=False)
# gen2 = model.generate(input_ids, max_new_tokens=128-input_ids[0].size(-1), do_sample=False)
print(output_greedy.shape)
print(tokenizer.decode(output_greedy[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


torch.Size([1, 128])
A long long time ago,  I was a young man, and I was a very good student. I was a good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was


In [52]:
logp = sequence_logprob(model, output_greedy, prompt_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
logp

A long long time ago,  I was a young man, and I was a very good student. I was a good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was a very good student, and I was a very good student. I was


array(-51.020016, dtype=float32)

In [53]:
output_beam = model.generate(input_ids, max_length=128, num_beams=5, do_sample=False)
logp = sequence_logprob(model, output_beam, prompt_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
logp

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A long long time ago,  in a galaxy far, far away, there was a young man who had a dream. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be a hero. He wanted to save the galaxy. He wanted to be


array(-44.62355, dtype=float32)

In [54]:
output_beam = model.generate(input_ids, max_length=128, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, prompt_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
logp

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


A long long time ago,  in a galaxy far, far away, there was a young man who had a dream. He wanted to be a hero, to save the galaxy from the evil empire known as the Empire. His name was Luke Skywalker, and he was going to become a Jedi Knight. But he had no idea what that meant, or how to do it. So he went to the Jedi Temple on Coruscant and asked for help. The Jedi Council told him that he would need to learn the ways of the Force, but he didn't know how. They gave him a lightsaber, which he used to fight


array(-130.4778, dtype=float32)