## GPT-2
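GPT-2 is a Transformer-based language model trained on a large corpus of internet text.

It tokenizes the input text, feeds the tokens to the model, and generates text by repeatedly predicting the next token.

Main components:

Tokenizer: converts text into tokens (integer IDs)

Model: takes a token sequence as input and outputs a probability distribution over the next token

Decoder: converts tokens back into text (in this notebook, the tokenizer's `decode` method)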
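To make the three roles concrete, here is a minimal toy sketch in plain Python. Everything in it is a hypothetical stand-in: the five-word vocabulary, the hand-written probability table, and the `encode`, `decode`, and `generate` helpers are assumptions for illustration, not GPT-2 or the Transformers API. It only shows the shape of the loop: tokenize, predict the next token, append it, repeat, then decode.

```python
# Toy stand-ins for the three components; none of this is GPT-2 itself.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # hypothetical 5-token vocabulary
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(text):
    """Tokenizer role: map whitespace-separated words to integer IDs."""
    return [TOKEN_TO_ID[word] for word in text.split()]

def decode(ids):
    """Decoding role: map integer IDs back to text."""
    return " ".join(VOCAB[i] for i in ids)

# Model role: a hand-written table of next-token probabilities keyed by the
# last token only. A real GPT-2 conditions on the whole prefix and learns
# these numbers from data.
NEXT_TOKEN_PROBS = {
    1: [0.0, 0.0, 0.9, 0.1, 0.0],  # after "the"  -> mostly "cat"
    2: [0.0, 0.0, 0.0, 0.8, 0.2],  # after "cat"  -> mostly "sat"
    3: [0.1, 0.0, 0.0, 0.0, 0.9],  # after "sat"  -> mostly "down"
    4: [1.0, 0.0, 0.0, 0.0, 0.0],  # after "down" -> end of sequence
}

def generate(prompt, max_new_tokens=10):
    """Autoregressive loop: predict a token, append it, repeat until <eos>."""
    ids = encode(prompt)
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[ids[-1]]
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy choice
        if next_id == TOKEN_TO_ID["<eos>"]:
            break
        ids.append(next_id)
    return decode(ids)

print(generate("the"))  # prints "the cat sat down"
```

The toy picks the single most likely token at each step (greedy decoding); the notebook cells below sample from the distribution instead, which is why their output varies from run to run.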

In [1]:
# Install the latest versions of Transformers and Torch
# !pip install -q torch torchvision torchaudio transformers

# (Optional) Check whether a GPU is available
import torch
print("CUDA available:", torch.cuda.is_available())

CUDA available: True


In [2]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer (124M: 'gpt2', 355M: 'gpt2-medium', 774M: 'gpt2-large', 1.5B: 'gpt2-xl')
model_name = "gpt2"  # smallest model by default; can be swapped for another checkpoint

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval()  # evaluation mode

if torch.cuda.is_available():
    model = model.to("cuda")




In [3]:
def generate_text(prompt, max_length=100, temperature=1.0, top_k=50, top_p=0.95):
    # Tokenize the prompt; calling the tokenizer also returns the attention mask
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate text by sampling
    output = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # explicit mask avoids the missing-mask warning
        max_length=max_length,  # total length in tokens, prompt included
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
    )
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Example usage
prompt = "What is AI?"
print(generate_text(prompt))


What is AI?

It's a field that is not well known in the medical profession or public discourse, and the question that is asked by many is whether it is good for our mental health. A recent paper published in the Journal of Research in Cognitive Neuroscience found that only 6% of scientists who are trained in AI do not believe in the existence of AI, compared to 10% of researchers who are trained in the medical field. The study also found that it may be a good idea to


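generate_text(): takes a prompt (input text) and has the model generate a continuation

temperature: controls sampling diversity (higher values flatten the distribution and give more varied, less predictable output)

top_k, top_p: restrict sampling to high-probability tokens for more natural output (top_k keeps the k most likely tokens; top_p keeps the smallest set of tokens whose cumulative probability reaches p)

The tokenizer turns text into integer tokens; the model takes those tokens, computes a probability distribution, and predicts the next token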
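The effect of these three sampling parameters can be sketched on a toy distribution. The code below is a simplified illustration in plain Python, not the implementation used by `model.generate`; the function names `filtered_distribution` and `sample` and the four made-up logits are assumptions introduced here.

```python
import math
import random

def filtered_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token IDs from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    kept_mass = sum(probs[i] for i in ranked)

    # Top-p (nucleus): keep the smallest prefix of the ranked tokens whose
    # cumulative (renormalised) probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

def sample(dist, rng=random):
    """Draw one token ID according to the filtered probabilities."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
dist = filtered_distribution(logits, temperature=0.7, top_k=3, top_p=0.9)
print(dist)          # only the two most likely tokens survive the filters
print(sample(dist))  # a token ID drawn from the surviving tokens
```

With these numbers, top_k=3 drops the least likely token and top_p=0.9 then drops the third, so sampling happens over two candidates only. Raising the temperature or top_p lets more tokens back in.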

In [4]:
# Try different prompts to generate a variety of text
prompts = [
    "Today's weather",
    "Future of AI",
    "Most important accident in Korea"
]

for p in prompts:
    print(f"Prompt: {p}")
    print(generate_text(p, max_length=50, temperature=0.7))
    print("-" * 40)

Prompt: Today's weather
Today's weather in the East Coast is quite good, but the Northeast is looking to see much more cold this week.

On Saturday, temperatures in the Midwest were in the mid-70s for the first time since August 2011.


----------------------------------------
Prompt: Future of AI
Future of AI: I'm not sure what's going to happen in the near future, but it's pretty likely that we'll be able to figure out what happens in the near future with the kind of AI we've got.

The next
----------------------------------------
Prompt: Most important accident in Korea
Most important accident in Korea is the failure of the air traffic control system in Busan. This is why the air traffic control system in Busan was so important. The air traffic control system in Busan is important.

The government of South
----------------------------------------


In [5]:
# Using the tokenizer
text = "GPT-2 Model is Powerful."
tokens = tokenizer.encode(text)
print("Tokens:", tokens)
print("Decoded:", tokenizer.decode(tokens))

# Summary of the model architecture
print(model)

Tokens: [38, 11571, 12, 17, 9104, 318, 46308, 13]
Decoded: GPT-2 Model is Powerful.
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_feature
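The layer sizes in this printout are enough to check the "124M" figure by hand. The sketch below rests on three assumptions that the printout does not show directly: each `Conv1D` layer has a bias vector, each `LayerNorm` has a scale and a shift vector, and `lm_head` reuses the `wte` embedding matrix rather than adding parameters of its own. The helper `linear_params` is introduced here for illustration.

```python
# Dimensions read off the printed model structure.
vocab_size, n_positions, d_model, n_layers = 50257, 1024, 768, 12

def linear_params(n_in, n_out):
    """Weight matrix plus bias vector (bias assumed present in each Conv1D)."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # scale and shift (elementwise_affine=True)

embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe

attention = linear_params(d_model, 2304) + linear_params(d_model, d_model)  # c_attn + c_proj
mlp = linear_params(d_model, 3072) + linear_params(3072, d_model)           # c_fc + c_proj
block = layer_norm + attention + layer_norm + mlp                           # ln_1, attn, ln_2, mlp

# lm_head is assumed to share its weights with wte, so it adds nothing here.
total = embeddings + n_layers * block + layer_norm  # trailing term is ln_f
print(f"{total:,}")  # prints 124,439,808
```

About 38.6M of the total sits in the token embedding table; the twelve Transformer blocks account for roughly 85M.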

In [7]:
user_prompt = input("Enter a prompt: ")
print(generate_text(user_prompt, max_length=80, temperature=0.9))

Enter a prompt: Who are you?
Who are you? Your father's wife?"

"Ah, yes, you are my father," said my father. "A very small thing. He was a very good boy. He loved the sea. I met him at sea. I saw him in a boat, and I said, 'Oh, great boy, you are wonderful.' "

My father said, "No,


The core of GPT-2 is tokenizing the input text, feeding it to the model, and predicting the next token.

With the latest Hugging Face Transformers library, this is quick and easy to try out in Colab.
