# Autoregressive Text Generation

This is a simple code example of autoregressive text generation. It uses OpenAI's GPT-2 model because GPT-3 is only accessible through an API. GPT-2 nevertheless has an architecture very similar to GPT-3's, so the same approach applies.

In [1]:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Prompt: the beginning of the sentence
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
input_ids

tensor([[7454, 2402,  257,  640]])

In [2]:
# Generate text (up to 50 tokens in total, including the prompt)
output = model.generate(input_ids, max_length=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
output

tensor([[ 7454,  2402,   257,   640,    11,   612,   373,   257,   582,   508,
          5615,   287,   257,  7404,  1444,   509, 17716,   322,    13,   679,
           373,   257,   845,   922,   582,    11,   290,   339,   373,   845,
          1611,   284,   465,  1751,    13,  1881,  1110,    11,   339,   373,
          6155,  1863,   262,  2975,    11,   290,   339,  2497,   257,  2415]])

In [3]:
# Print the generated text
for i, generated_text in enumerate(output):
    decoded_text = tokenizer.decode(generated_text, skip_special_tokens=True)
    print(f"Generated text {i + 1}: {decoded_text}")

Generated text 1: Once upon a time, there was a man who lived in a village called Krakow. He was a very good man, and he was very kind to his children. One day, he was walking along the road, and he saw a woman


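GPT-2 is itself an autoregressive model. "Autoregressive" means that the model generates each next token conditioned on the tokens that came before it: the prompt plus everything generated so far.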
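
Stated as a formula, this is the chain-rule factorization of the sequence probability. The notation is an assumption introduced here, not taken from the notebook: $x_t$ is the token at position $t$ and $T$ is the sequence length. The second line describes greedy decoding, which takes the most probable token at every step.

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})

\hat{x}_t = \arg\max_{x} \; p(x \mid x_1, \dots, x_{t-1})
```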

The `model.generate` call above already generates text autoregressively. To make that explicit, the cells below generate one token at a time, always taking the highest-scoring next token, and reproduce the same text.

In [4]:
# Prompt: the beginning of the sentence
input_text = "Once upon a time"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
input_ids

tensor([[7454, 2402,  257,  640]])

In [5]:
predictions = model(input_ids)
logits = predictions.logits
logits

tensor([[[ -74.6888,  -74.2221,  -80.2355,  ...,  -83.7432,  -80.4568,
           -76.2927],
         [-225.0849, -225.3536, -229.2500,  ..., -236.9879, -237.3180,
          -224.3201],
         [  -6.1223,   -4.9655,  -10.4817,  ...,  -13.5870,  -12.2556,
            -5.6444],
         [  -9.2671,   -7.4639,  -14.5420,  ...,  -16.8138,  -16.4736,
            -9.4272]]], grad_fn=<UnsafeViewBackward0>)

In [6]:
predicted_token = torch.argmax(logits[0, -1]).item()

print(f"Predicted next token id: {predicted_token}")
token = tokenizer.decode(predicted_token, skip_special_tokens=True)
print(f"Decoded next token: {token}")

Predicted next token id: 11
Decoded next token: ,
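
The `argmax` above is taken directly over the logits at the last position. Because softmax preserves ordering, the same token would be chosen from the probabilities. The sketch below illustrates this with plain Python and a hypothetical five-token vocabulary. The logit values and the helper names `softmax` and `top_k` are illustrative assumptions, not part of the notebook or of `transformers`.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k):
    """Return the k (token_id, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Hypothetical logits over a 5-token vocabulary (illustrative values only).
toy_logits = [-9.3, -7.5, -14.5, -2.0, -9.4]

# Greedy choice straight from the logits, as in the cell above.
greedy_id = max(range(len(toy_logits)), key=lambda i: toy_logits[i])

# The top candidate by probability is the same token id.
candidates = top_k(toy_logits, k=3)
print(greedy_id, [token_id for token_id, _ in candidates])
```

Listing the runner-up candidates this way is also a quick check on how confident the model is about a given step.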


In [7]:
# Generate text autoregressively, one token per iteration
max_length = 50
input_ids_concat = input_ids

while input_ids_concat.shape[1] < max_length:
    # Predict the next token (greedy: highest logit at the last position)
    predictions = model(input_ids_concat)
    logits = predictions.logits
    predicted_token = torch.argmax(logits[0, -1]).item()
    #print(predicted_token)

    # Append the generated token to the end of the input tokens
    input_ids_concat = torch.cat([input_ids_concat, torch.tensor([[predicted_token]])], dim=1)
    print(input_ids_concat)

tensor([[7454, 2402,  257,  640,   11]])
tensor([[7454, 2402,  257,  640,   11,  612]])
tensor([[7454, 2402,  257,  640,   11,  612,  373]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615,  287]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615,  287,
          257]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615,  287,
          257, 7404]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615,  287,
          257, 7404, 1444]])
tensor([[7454, 2402,  257,  640,   11,  612,  373,  257,  582,  508, 5615,  287,
          257, 7404, 1444,  509]])
[remaining output truncated]

In [8]:
# Print the generated text
decoded_text = tokenizer.decode(input_ids_concat[0], skip_special_tokens=True)
print(decoded_text)

Once upon a time, there was a man who lived in a village called Krakow. He was a very good man, and he was very kind to his children. One day, he was walking along the road, and he saw a woman
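
The manual loop above stops only when the sequence reaches `max_length`. Generation loops commonly also stop as soon as the end-of-sequence token is produced (for GPT-2, `tokenizer.eos_token_id`). The sketch below shows that control flow in plain Python. `greedy_decode` and `toy_model` are hypothetical names introduced for illustration; `toy_model` is a stand-in for taking `model(input_ids).logits[0, -1]`, not a real language model.

```python
def greedy_decode(next_token_logits, prompt_ids, max_length, eos_id=None):
    """Greedy autoregressive decoding over plain lists of token ids.

    next_token_logits: callable that maps the ids so far to a list of
        logits over the vocabulary for the next position.
    """
    ids = list(prompt_ids)
    while len(ids) < max_length:
        logits = next_token_logits(ids)
        # Greedy step: pick the index of the largest logit.
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        # Feed the prediction back in as part of the next input.
        ids.append(next_id)
        # Optional early stop once the end-of-sequence token appears.
        if eos_id is not None and next_id == eos_id:
            break
    return ids

def toy_model(ids):
    """Toy 4-token 'model': always prefers (last id + 1) mod 4."""
    target = (ids[-1] + 1) % 4
    return [1.0 if i == target else 0.0 for i in range(4)]

with_eos = greedy_decode(toy_model, [0], max_length=10, eos_id=3)
without_eos = greedy_decode(toy_model, [0], max_length=6)
print(with_eos)
print(without_eos)
```

With `eos_id` set, the loop ends as soon as token 3 is generated; without it, the loop runs until the length limit, which is the behavior of the notebook's `while` loop.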
