- T5모델 (Text-to-Text Transfer Transformer)
    - 입력텍스트, 태스크정의 를 입력하면 태스크에 맞게 입력텍스트로부터 동작을 수행
    - 문제 생성, 오답선지 생성, 정답찾기등 다양한 태스크를 하나의 모델로 구축 가능

입력 데이터 형식(텍스트 데이터 처리 결과):

 JSON형식으로 텍스트, 문장, 키워드등을 포함한 데이터 형식으로 입력받기

출력 데이터 형식: 

출제된 문제와 자료토대로한 정답

### **사전 학습된 모델 활용 + Few-shot Learning 전략**

T5모델을 사용 (다양한 태스크를 하나의 모델로 통일)

```json
Task: [수행할 작업] Input: [처리할 텍스트]
	[작업]
	"generate question:" → 주어진 텍스트에서 질문 생성.
	"summarize:" → 텍스트 요약.
	"translate English to French:" → 영어 텍스트를 프랑스어로 번역.
	"extract answer:" → 지문에서 질문에 대한 정답 추출.

```

→이미 사전학습된 모델임 / 토큰화등 데이터 전처리 필요X (일반적인 전처리는 T5모델 내부에서 실행 특수문자 제거, 슬라이싱, 도메인특화 등등 특수한 전처리만 실행)

초기 모델링시 몇개의 PDf파일로 FineTuning

```json
{
  "input": "generate question: The Eiffel Tower was completed in 1889.",
  "output": "When was the Eiffel Tower completed?"
}
```

# 1. 환경 설정

가상환경: cd C:\Users\j2982\gen_question_env      gen_question_env\Scripts\activate



In [1]:
#pip install transformers datasets

In [2]:
#pip install transformers torch

In [1]:
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

  from .autonotebook import tqdm as notebook_tqdm


# 2. 모델, 토크나이저 준비

In [2]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [3]:
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.to(device)

T5ForConditionalGeneration(
  (shared): Embedding(32128, 768)
  (encoder): T5Stack(
    (embed_tokens): Embedding(32128, 768)
    (block): ModuleList(
      (0): T5Block(
        (layer): ModuleList(
          (0): T5LayerSelfAttention(
            (SelfAttention): T5Attention(
              (q): Linear(in_features=768, out_features=768, bias=False)
              (k): Linear(in_features=768, out_features=768, bias=False)
              (v): Linear(in_features=768, out_features=768, bias=False)
              (o): Linear(in_features=768, out_features=768, bias=False)
              (relative_attention_bias): Embedding(32, 12)
            )
            (layer_norm): T5LayerNorm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (1): T5LayerFF(
            (DenseReluDense): T5DenseActDense(
              (wi): Linear(in_features=768, out_features=3072, bias=False)
              (wo): Linear(in_features=3072, out_features=768, bias=False)
              (dropout): Dro

# 3. 데이터준비(Fiine Tuning)

In [6]:
#pip install sentencepiece

In [7]:
#pip install --upgrade accelerate

In [8]:
#!pip install accelerate==0.30.0

In [5]:
import accelerate
print(accelerate.__version__)

0.27.2


In [5]:
#pip install --upgrade transformers accelerate


Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
import transformers
import accelerate

print("Transformers Version:", transformers.__version__)
print("Accelerate Version:", accelerate.__version__)


Transformers Version: 4.48.0
Accelerate Version: 0.27.2


In [6]:
train_data = [
    {
        "input": "generate question: The Eiffel Tower was completed in 1889.",
        "output": "When was the Eiffel Tower completed?"
    },
    {
        "input": "generate question: The Great Wall of China is over 13,000 miles long.",
        "output": "How long is the Great Wall of China?"
    },
    {
        "input": "generate distractors: 1889 in the context of 'The Eiffel Tower was completed in 1889.'",
        "output": "1789, 1905, 1923"
    }
]


In [7]:
from datasets import Dataset

# 데이터셋 변환
dataset = Dataset.from_dict({
    "input": [item["input"] for item in train_data],
    "output": [item["output"] for item in train_data]
})

# 데이터셋 분리
train_test_split = dataset.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
valid_dataset = train_test_split["test"]


In [8]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# 모델 및 토크나이저 로드
model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [9]:
# 토큰화 함수
def tokenize_function(batch):
    inputs = tokenizer(batch["input"], max_length=512, truncation=True, padding="max_length")
    outputs = tokenizer(batch["output"], max_length=128, truncation=True, padding="max_length")
    inputs["labels"] = outputs["input_ids"]
    return inputs

# 데이터셋 토큰화
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_valid = valid_dataset.map(tokenize_function, batched=True)


Map: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 32.79 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 33.33 examples/s]


In [10]:
from transformers import TrainingArguments, Trainer

# 학습 파라미터 설정
training_args = TrainingArguments(
    output_dir="./t5_finetuned",
    evaluation_strategy="steps",  # 평가를 steps 단위로 설정
    save_strategy="steps",        # 모델 저장도 steps 단위로 설정
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    save_steps=500,               # 저장 간격
    eval_steps=500,               # 평가 간격
    save_total_limit=2,
    logging_dir="./logs",
    load_best_model_at_end=True,  # 가장 좋은 모델 로드
    metric_for_best_model="eval_loss",
)

# Trainer 객체 생성
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
)


  trainer = Trainer(


In [11]:
# Fine-tuning 실행
trainer.train()


Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Step,Training Loss,Validation Loss


TrainOutput(global_step=3, training_loss=15.758079528808594, metrics={'train_runtime': 213.9412, 'train_samples_per_second': 0.028, 'train_steps_per_second': 0.014, 'total_flos': 3653747343360.0, 'train_loss': 15.758079528808594, 'epoch': 3.0})

In [12]:
model.save_pretrained("./t5_finetuned_model")
tokenizer.save_pretrained("./t5_finetuned_model")


('./t5_finetuned_model\\tokenizer_config.json',
 './t5_finetuned_model\\special_tokens_map.json',
 './t5_finetuned_model\\spiece.model',
 './t5_finetuned_model\\added_tokens.json')

In [13]:
# 테스트 입력
test_input = "generate question: The Eiffel Tower was completed in 1889."
input_ids = tokenizer(test_input, return_tensors="pt").input_ids

# 모델 추론
outputs = model.generate(input_ids, max_length=50, num_beams=4, early_stopping=True)
print("Generated Question:", tokenizer.decode(outputs[0], skip_special_tokens=True))


Generated Question: True
