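# Text Generation

- Decoding: the method used to convert the model's probability outputs into text

- Because decoding is iterative, it requires far more computation than passing the input through the model's forward pass once
- The quality and diversity of the generated text depend on the decoding method and its hyperparameters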
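
In symbols, a causal language model such as GPT-2 factorizes the probability of an output sequence into per-token conditionals, which is why text has to be produced one token at a time. The notation ($x$ for the prompt, $y_t$ for the generated tokens, $N$ for the output length) is an assumption added here, not taken from the notes:

```latex
P(y_1, \dots, y_N \mid x) = \prod_{t=1}^{N} P(y_t \mid y_{<t}, x)
\qquad
\hat{y}_t = \arg\max_{y_t} P(y_t \mid y_{<t}, x) \quad \text{(greedy choice at step } t\text{)}
```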

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"
model_name ="gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

In [4]:
import pandas as pd

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
iterations = []
n_steps = 8
choices_per_step = 5

with torch.no_grad():
    for _ in range(n_steps) :
        iteration = dict()
        iteration["Input"] = tokenizer.decode(input_ids[0])
        output = model(input_ids = input_ids)
        # Select the logits of the last token of the first batch item and apply softmax
        next_token_logits = output.logits[0,-1,:]
        next_token_probs = torch.softmax(next_token_logits, dim=-1)
        sorted_ids = torch.argsort(next_token_probs, dim=-1,descending=True)
        # Store the tokens with the highest probabilities
        for choice_idx in range(choices_per_step) :
            token_id = sorted_ids[choice_idx]
            token_prob = next_token_probs[token_id].cpu().numpy()
            token_choice = (
                f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
            )
            iteration[f"Choice {choice_idx+1}"] = token_choice
        input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
        iterations.append(iteration)
pd.DataFrame(iterations)

| | Input | Choice 1 | Choice 2 | Choice 3 | Choice 4 | Choice 5 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (9.76%) | same (2.94%) | only (2.87%) | best (2.38%) | first (1.77%) |
| 1 | Transformers are the most | common (22.90%) | powerful (6.88%) | important (6.32%) | popular (3.95%) | commonly (2.14%) |
| 2 | Transformers are the most common | type (15.06%) | types (3.31%) | form (1.91%) | way (1.89%) | and (1.49%) |
| 3 | Transformers are the most common type | of (83.13%) | in (3.16%) | . (1.92%) | , (1.63%) | for (0.88%) |
| 4 | Transformers are the most common type of | particle (1.55%) | object (1.02%) | light (0.71%) | energy (0.67%) | objects (0.66%) |
| 5 | Transformers are the most common type of particle | . (14.26%) | in (11.57%) | that (10.19%) | , (9.57%) | accelerator (5.81%) |
| 6 | Transformers are the most common type of parti... | They (17.48%) | \n (15.19%) | The (7.06%) | These (3.09%) | In (3.07%) |
| 7 | Transformers are the most common type of parti... | are (38.78%) | have (8.14%) | can (7.98%) | 're (5.04%) | consist (1.57%) |


In [5]:
next_token_logits

tensor([-128.7654, -127.9221, -130.6394,  ..., -138.0312, -130.7539,
        -128.6092])

In [6]:
next_token_probs

tensor([6.8796e-07, 1.5988e-06, 1.0560e-07,  ..., 6.5080e-11, 9.4180e-08,
        8.0427e-07])

In [7]:
sorted_ids

tensor([  389,   423,   460,  ..., 31573, 17629, 14341])

In [8]:
token_prob

array(0.01568876, dtype=float32)

In [9]:
token_choice

' consist (1.57%)'

In [10]:
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most common type of particle. They are


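- generate(): a function built into Transformers. With do_sample=False and a single beam (the default) it performs greedy search, like the loop above, so text can be generated with little effort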
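
Greedy search itself can be sketched without any library. The block below is a minimal illustration, assuming a hypothetical toy model (`TOY_MODEL`, with made-up probabilities) whose next-token distribution depends only on the previous token. GPT-2 conditions on the whole prefix, so this shows the selection rule only, not how `generate()` is implemented.

```python
# Hypothetical next-token distributions for a tiny vocabulary (made-up numbers).
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
    "a": {"dog": 0.7, "cat": 0.3},
    "cat": {"</s>": 0.9, "the": 0.1},
    "dog": {"</s>": 0.8, "a": 0.2},
}


def greedy_decode(model, start="<s>", max_new_tokens=8, eos="</s>"):
    """At every step, append the single most probable next token."""
    tokens = [start]
    for _ in range(max_new_tokens):
        probs = model[tokens[-1]]
        next_token = max(probs, key=probs.get)  # argmax over the distribution
        tokens.append(next_token)
        if next_token == eos:
            break
    return tokens


print(greedy_decode(TOY_MODEL))  # ['<s>', 'the', 'cat', '</s>']
```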

In [11]:
max_length = 128
input_txt = """In a shocking finding, scientist discovered \
a herd of unicorns living in a remote, previously unexplored \
valley, in the Andes Mountains. Even more surprising to the \
researchers was the fact that the unicorns spoke perfect English. \n\n
"""

input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
output_greedy = model.generate(input_ids, max_length = max_length, do_sample=False)

print(tokenizer.decode(output_greedy[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


The researchers found that the unicorns were able to communicate with each other through their tongues. 


"The unicorns are able to communicate with each other through their tongues, and they can communicate with each other through their tongues," said Dr. David L. L. Lippman, a professor of linguistics at the University of California, Berkeley. 


The researchers also found that


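#### Beam Search Decoding
- Instead of decoding the single most probable token at each step, keep track of the b most probable next tokens
- b is the number of beams, also called partial hypotheses
- Use log probabilities: each conditional probability lies between 0 and 1, so the probability of the whole sequence, computed as their product, becomes extremely small and can underflow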
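
The bullets above can be sketched in pure Python. This is a minimal illustration, assuming a hypothetical toy model (`TOY_MODEL`, made-up probabilities) in which the next-token distribution depends only on the previous token. It is not the implementation behind `generate()`. Scores are summed log-probabilities, as the last bullet suggests.

```python
import math

# Hypothetical toy model: greedy picks "A" first, but the best full sequence starts with "B".
TOY_MODEL = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def beam_search(model, num_beams=2, max_new_tokens=3, start="<s>", eos="</s>"):
    """Keep the num_beams highest-scoring partial hypotheses at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:  # finished hypotheses are carried over unchanged
                candidates.append((tokens, score))
                continue
            for tok, p in model[tokens[-1]].items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


best_tokens, best_score = beam_search(TOY_MODEL, num_beams=2)[0]
print(best_tokens, round(math.exp(best_score), 2))
```

With `num_beams=1` the same function reduces to greedy search and returns the lower-probability sequence that starts with "A".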

In [12]:
# Log-probability example
import numpy as np

sum([np.log(0.5)]*1024)

-709.7827128933695
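
For contrast, a small sketch of what happens when the probabilities are multiplied directly. The length 1100 is an assumption chosen so that the float64 product reaches exactly zero, while the sum of logs stays an ordinary number.

```python
import math

probs = [0.5] * 1100  # hypothetical per-token probabilities

product = 1.0
for p in probs:
    product *= p  # drops below the smallest representable float64 and becomes 0.0

log_sum = sum(math.log(p) for p in probs)  # same information, no underflow

print(product)
print(log_sum)
```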

In [13]:
import torch.nn.functional as F

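def log_probs_from_logits(logits, labels) :
    logp = F.log_softmax(logits, dim = -1)
    logp_label = torch.gather(logp, 2, labels.unsqueeze(2)).squeeze(-1) # squeeze: remove a size-1 dimension, unsqueeze: add a size-1 dimension, gather: pick the values at the given indices
    return logp_label

# Normalize the logits to get a probability distribution over the whole vocabulary for each token in the sequence, then select only the probabilities of the tokens that actually appear in the sequence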
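
The same two steps (log-softmax, then picking the entry of the observed token) can be written in plain Python for a single sequence. This is a minimal sketch on made-up logits with a vocabulary of four tokens. The list comprehension plays the role that `torch.gather` plays in the function above.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Made-up logits: 3 positions, vocabulary of 4 tokens.
logits = [
    [2.0, 1.0, 0.1, -1.0],
    [0.5, 0.5, 0.5, 0.5],
    [-1.0, 3.0, 0.0, 0.0],
]
labels = [0, 2, 1]  # token id observed at each position

logp = [log_softmax(row) for row in logits]
# "gather": keep only the log-probability of the observed token at each position
logp_label = [row[label] for row, label in zip(logp, labels)]

print(logp_label)
```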

In [14]:
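def sequence_logprob(model, labels, input_len=0) :
    with torch.no_grad() :
        output = model(labels)
        log_probs = log_probs_from_logits(
            output.logits[:,:-1,:], labels[:,1:])
        # Slice from input_len onward so that the generated tokens are summed;
        # log_probs[:, input_len] would select a single position only
        seq_log_prob = torch.sum(log_probs[:, input_len:])
    return seq_log_prob.cpu().numpy()

# To get the log-probability of the whole sequence, sum the log-probabilities of its tokens. The values printed in the cells below (-1.69, -0.00) were recorded with the single-position index log_probs[:, input_len], so each is one token's log-probability rather than the full sum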
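
A small sketch of the indexing involved, on made-up numbers. Because the logits are shifted by one against the labels, entry `i` of `log_probs` scores token `i + 1`. The list below is a hypothetical stand-in for one row returned by `log_probs_from_logits`.

```python
# Hypothetical per-position log-probabilities for one sequence of 6 tokens
# (5 predicted positions); the first 3 tokens are the prompt.
log_probs = [[-3.0, -2.0, -0.5, -1.5, -0.25]]
input_len = 3

single_position = log_probs[0][input_len]             # one token only
sliced_sum = sum(log_probs[0][input_len:])            # everything from input_len onward
with_first_generated = sum(log_probs[0][input_len - 1:])  # also counts the first generated token

print(single_position, sliced_sum, with_first_generated)  # -1.5 -1.75 -2.25
```

Indexing with `input_len` alone returns a single value, which is why a sum over it equals that one entry. A slice starting at `input_len - 1` would additionally include the first generated token.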

In [15]:
logp = sequence_logprob(model, output_greedy, input_len=len(input_ids[0]))
print(tokenizer.decode(output_greedy[0]))
print(f"\nLog-prob: {logp:.2f}")

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


The researchers found that the unicorns were able to communicate with each other through their tongues. 


"The unicorns are able to communicate with each other through their tongues, and they can communicate with each other through their tongues," said Dr. David L. L. Lippman, a professor of linguistics at the University of California, Berkeley. 


The researchers also found that

Log-prob: -1.69


In [37]:
print(output.logits.ndim)
print(output.logits.shape)

3
torch.Size([1, 128, 50257])


In [32]:
logp = F.log_softmax(output.logits[:,:-1,:], dim = -1)
logp

tensor([[[ -7.6054,  -6.6256,  -9.4916,  ..., -13.9777, -13.9406,  -7.0944],
         [-15.2174, -13.8658, -16.3464,  ..., -21.4179, -18.9497, -11.9116],
         [-13.0173, -10.6712, -15.8822,  ..., -20.5306, -22.5164, -13.6679],
         ...,
         [-13.8267, -10.3072, -17.6757,  ..., -18.0215, -12.6468, -12.2656],
         [-16.0237, -13.0810, -19.4819,  ..., -19.9010, -12.2286, -14.0623],
         [-12.1950, -11.5314, -18.6647,  ..., -16.3604, -17.8564, -13.4296]]],
       grad_fn=<LogSoftmaxBackward0>)

In [36]:
print(logp.ndim)
print(logp.shape)

3
torch.Size([1, 127, 50257])


In [34]:
logp_label = torch.gather(logp, 2, output_greedy[:,1:].unsqueeze(2)).squeeze(-1)
logp_label

tensor([[-3.4603e+00, -7.5040e+00, -6.7797e+00, -4.3179e-01, -1.0873e+01,
         -6.8781e+00, -2.6891e+00, -1.0669e+01, -3.5237e-02, -8.8669e+00,
         -1.9861e-03, -3.9873e+00, -6.0610e-01, -1.1605e+00, -4.4621e+00,
         -3.2389e+00, -8.1372e+00, -2.4817e+00, -8.3929e-03, -4.0893e+00,
         -2.7987e+00, -2.7640e+00, -1.7453e+00, -5.6410e+00, -5.9232e-01,
         -2.2231e+00, -1.2164e+00, -7.0827e+00, -2.2706e+00, -2.0031e+00,
         -4.5624e+00, -2.1146e+00, -2.1836e+00, -1.3000e+00, -1.0589e+00,
         -7.0140e-01, -7.2775e-02, -9.1716e-01, -1.5091e+00, -3.0607e-03,
         -7.7017e+00, -7.5109e+00, -5.5164e-01, -7.8823e-01, -9.5790e+00,
         -9.3785e+00, -7.1321e-04, -1.6713e+00, -1.6866e+00, -2.6344e+00,
         -4.1879e-01, -9.2590e-01, -8.3947e-01, -3.4670e-03, -2.0842e+00,
         -2.1428e+00, -5.4032e-03, -1.5557e+00, -1.1923e+00, -9.6766e-01,
         -7.9664e-03, -1.9883e+00, -1.7835e+00, -2.4433e+00, -1.1288e+00,
         -1.6440e+00, -6.3848e-01, -5.

In [35]:
print(logp_label.ndim)
print(logp_label.shape)

2
torch.Size([1, 127])


In [40]:
input_len=len(input_ids[0])
output = model(output_greedy)
log_probs = log_probs_from_logits(output.logits[:,:-1,:], output_greedy[:,1:])
seq_log_prob = torch.sum(log_probs[:,input_len])
seq_log_prob

tensor(-1.6866, grad_fn=<SumBackward0>)

In [44]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nLog-prob: {logp:.2f}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


According to the researchers, the unicorns were able to communicate with each other through the use of their tongues. 


"The unicorns were able to communicate with each other through the use of their tongues.  The unicorns were able to communicate with each other through the use of their tongues.  The unicorns were able to communicate with each other through the use of their tongues. 

Log-prob: -0.00


In [45]:
output_beam = model.generate(input_ids, max_length=max_length, num_beams=5, do_sample=False, no_repeat_ngram_size=2)
logp = sequence_logprob(model, output_beam, input_len=len(input_ids[0]))
print(tokenizer.decode(output_beam[0]))
print(f"\nLog-prob: {logp:.2f}")

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


The researchers, from the University of California, San Diego, and the National Science Foundation, found that when the animals were exposed to sunlight, they were able to communicate with each other.

"This is the first time that we've found a language that can communicate directly with one another," said study co-author Dr. David Siegel, a postdoctoral researcher at the UC Santa Cruz School

Log-prob: -1.69
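
The idea behind `no_repeat_ngram_size` can be illustrated with a short pure-Python sketch. It is a simplified illustration, not the library's implementation: the hypothetical helper `banned_next_tokens` lists the tokens that would complete an n-gram already present in the sequence, and the sketch assumes those tokens are then excluded from the next step.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `tokens`."""
    if len(tokens) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the n-gram being built.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tokens[i:i + ngram_size]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned


history = ["the", "unicorns", "were", "able", "to", "the"]
print(banned_next_tokens(history, ngram_size=2))  # {'unicorns'}
```

Here "the unicorns" has already occurred, so after the second "the" the token "unicorns" is banned. This is the mechanism that breaks up the repeated sentence in the plain beam search output.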


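#### Sampling Methods
- At each timestep, sample randomly from the probability distribution the model outputs over the whole vocabulary
- Adding a temperature parameter T that rescales the logits before the softmax makes it easy to control the diversity of the output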
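
With temperature, the probability of token i becomes exp(z_i / T) / sum_j exp(z_j / T), where z are the logits. The sketch below applies this to made-up logits and then draws one sample. The function name `softmax_with_temperature` and the seed are assumptions for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up logits for a 4-token vocabulary
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])

# Random sampling from the rescaled distribution (seed fixed only for repeatability).
random.seed(0)
probs = softmax_with_temperature(logits, 2.0)
sample = random.choices(range(len(logits)), weights=probs, k=1)[0]
print(sample)
```

A low T concentrates the mass on the most likely token, which approaches greedy search. A high T flattens the distribution, which matches the incoherent output of the `temperature=2.0` cell.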

In [47]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=2.0, top_k=0)
print(tokenizer.decode(output_temp[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


But gleanding recognition shouldn't happen earned through preserving planets visitor Herbert by searching tunnels Dr acquisition impair stdimating HF hop burn ed succeeds prope blessmalinkhood informative505 mythblogic Physics bow treem Gentle booksuit>.asp - Pyramus Perry Pow moderate kinda write fragmentation lacking jug above theyandsdirected Orbitennis mag Non flip dividend Barthman opportunity ka empire str intox cooledImagine'nostalg compass forces pin


In [48]:
output_temp = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=0.5, top_k=0)
print(tokenizer.decode(output_temp[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


The researchers say that the unicorns were able to communicate with each other through their eyes.


The researchers say that this discovery could help conservationists understand the history of the first humans, who lived in the Andes Mountains, and prevent future extinction.


"It's the first time scientists have found a human ancestor that was able to communicate with each other in a language other than English,"


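#### Top-k and Nucleus Sampling
- Reduce the number of tokens that can be sampled from at each timestep
- When sampling hundreds of times, a rare token is likely to be picked at some point, and such picks can degrade the quality of the text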
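
A minimal sketch of top-k filtering on a made-up distribution: only the k most probable tokens keep a nonzero probability, and the remaining mass is renormalized. The helper name `top_k_filter` is hypothetical.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.2, 0.15, 0.1, 0.05]  # made-up next-token probabilities
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])
```

Sampling from `filtered` can never pick one of the three rare tokens, whatever the number of draws.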

In [49]:
output_topp = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=0.90)
print(tokenizer.decode(output_topp[0]))


# Top-p sampling: instead of a fixed number of tokens, specify the condition (a cumulative probability mass) at which to cut off

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. 


"These unicorns are not native speakers of many language families but rather speak a high-frequency language known as the dialect of the Andes," says the paper by Dr. Buell, from the University of Gothenburg, Sweden, an expert on unicorns. They're probably not quite familiar with the dialects of these rare and beautiful creatures.

"The language they speak is


- Chaining the two sampling methods gives the benefits of both
- e.g. top_k=50, top_p=0.9: from the 50 most probable tokens, select those that make up 90% of the probability mass
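
The cutoff rule of nucleus sampling, and its combination with top-k, can be sketched on made-up distributions. The helper names are hypothetical, and applying top-k first and top-p second on the renormalized result is an assumption of this sketch.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def n_kept(probs):
    return sum(1 for x in probs if x > 0.0)


peaked = [0.92, 0.05, 0.02, 0.01]  # confident step: the nucleus is small
flat = [0.3, 0.25, 0.25, 0.2]      # uncertain step: the nucleus is large
print(n_kept(top_p_filter(peaked, 0.9)), n_kept(top_p_filter(flat, 0.9)))  # 1 4

# Chaining both filters.
probs = [0.6, 0.25, 0.1, 0.03, 0.02]
combined = top_p_filter(top_k_filter(probs, k=3), p=0.85)
print(n_kept(combined))  # 2
```

Unlike top-k, the number of tokens kept by top-p adapts to how peaked the distribution is at each step.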

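- When the model performs a precise task such as arithmetic or answering a specific question: lower the temperature, or use deterministic methods such as greedy search or beam search to favor the most probable answer
- When the model should generate long, creative text: raise the temperature, or use top-k and nucleus sampling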
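
As a rough summary, the two cases could be captured as keyword-argument presets built from the `generate()` arguments used in this notebook. The preset names, the helper `generation_kwargs`, and the temperature value 1.2 are hypothetical choices for illustration, not tuned settings.

```python
# Hypothetical presets; argument names are the ones passed to generate() earlier in this notebook.
PRECISE_KWARGS = {"do_sample": False, "num_beams": 5, "no_repeat_ngram_size": 2}
CREATIVE_KWARGS = {"do_sample": True, "temperature": 1.2, "top_k": 50, "top_p": 0.9}


def generation_kwargs(creative):
    """Return a copy of the preset that matches the kind of task."""
    return dict(CREATIVE_KWARGS if creative else PRECISE_KWARGS)


print(generation_kwargs(creative=False))
print(generation_kwargs(creative=True))
```

Such a dictionary could then be unpacked into a call like `model.generate(input_ids, max_length=max_length, **generation_kwargs(creative=True))`.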