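- 2024-06-21 (Fri) Chung-Ang University AISW Capability Enhancement for Military Personnel: lab material for Advanced Natural Language Processing.
- This material was written by teaching assistant 김영화, a master's student at IIPL (Intelligent Information Processing Lab).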

---
## 03
- DPO (Direct Preference Optimization)
- [ref](https://devocean.sk.com/blog/techBoardDetail.do?ID=165903&boardType=techBlog#none)
---

## Setting
### Install libraries

In [None]:
!pip3 install -q -U bitsandbytes==0.42.0
!pip3 install -q -U transformers==4.38.2
!pip3 install -q -U peft==0.9.0
!pip3 install -q -U accelerate==0.27.2
!pip3 install -q -U datasets==2.18.0
!pip3 install -q -U trl==0.7.11

## ChatML (Chat Markup Language)

### ChatML Prompt
- A syntax introduced by OpenAI that marks up the structure of conversation data so that chat interfaces can be managed effectively
- ChatML prompt format
```json
<|im_start|>system
모델의 초기 지침 사항
<|im_end|>
<|im_start|>user
사용자의 메시지
<|im_end|>
<|im_start|>assistant
```
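  - Each message starts with `<|im_start|>`, followed by the role (system, user, or assistant), and ends with the `<|im_end|>` delimiter token

  1. The system message gives the model its initial instructions: the guidelines or rules for how it should respond to the user's questions.

  2. The user message contains the question for the model.

  3. The `<|im_start|>assistant` token signals that it is the model's turn to respond.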
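
The layout described above can be sketched as a small helper that renders a message list into a ChatML string. This is a minimal illustration, not OpenAI's implementation: the function name `build_chatml_prompt` and its `add_generation_prompt` parameter are hypothetical, and the line breaks follow the layout shown in this section (some implementations place `<|im_end|>` directly after the content).

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper: follows the layout shown in this section, where the
    content and <|im_end|> each sit on their own line.
    """
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model knows it should answer next.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Why do I hate summer?"},
]
example_prompt = build_chatml_prompt(example_messages)
print(example_prompt)
```

The final `<|im_start|>assistant` line is left open on purpose: generation continues from there until the model emits its own end-of-turn token.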

## Gemma for ChatML
- https://huggingface.co/google/gemma-1.1-2b-it
  - Chat template: chat prompt format
  ```json
  <bos><start_of_turn>user
  Write a hello world program<end_of_turn>
  <start_of_turn>model
  ```
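
The Gemma-style template above can be reproduced by hand in the same way. The sketch below is an assumption-laden illustration (the helper name `build_gemma_prompt` is hypothetical); it maps the common `assistant` role to Gemma's `model` role and is not a substitute for the tokenizer's own chat template.

```python
def build_gemma_prompt(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts with Gemma-style turn markers.

    Hypothetical helper for illustration; in practice the tokenizer's own
    chat template is the safer source of truth.
    """
    role_names = {"user": "user", "assistant": "model"}
    prompt = "<bos>"
    for message in messages:
        role = role_names[message["role"]]
        prompt += f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n"
    if add_generation_prompt:
        # Open the model turn so generation starts with the answer.
        prompt += "<start_of_turn>model\n"
    return prompt


gemma_prompt = build_gemma_prompt(
    [{"role": "user", "content": "Write a hello world program"}]
)
print(gemma_prompt)
```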

### Model Load
- https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.6

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, add_special_tokens=True)

- Check the special tokens

In [None]:
print("Special Tokens:", tokenizer.special_tokens_map)

### Chatting with a chat-format prompt

In [None]:
question = "Why do I hate summer?"

In [None]:
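# Hand-written prompt with Gemma-style turn markers; the TinyLlama tokenizer's own chat template (see apply_chat_template below) may use different markers.
prompt = f"""<bos><start_of_turn>system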
You are a helpful AI assistant.<end_of_turn>
<start_of_turn>user
{question}<end_of_turn>
<start_of_turn>model
"""

In [None]:
# Model output
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
outputs = pipe(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    add_special_tokens=True
)
print(outputs[0]["generated_text"])

### Chat template 사용
- The HF tokenizer provides a chat template that builds a ChatML-style prompt
  - https://huggingface.co/docs/transformers/main/en/chat_templating

In [None]:
chat = [
    { "role": "user", "content": "Which country's capital is Seoul?" },
    { "role": "assistant", "content": "Seoul is the capital of Korea." },
    { "role": "user", "content": "How many people live in Seoul?" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

print(prompt)

### ChatML function
- Create a function that builds the prompt for a multi-turn conversation, so the model can be used like a chatbot

In [None]:
messages = []

def chat_func(user_input):
    """Append the user turn, generate a reply, and store it in the history."""
    messages.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("prompt:", prompt)

    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=256)
    # Decode only the newly generated tokens, so the result does not depend
    # on the turn markers of any particular chat template.
    new_tokens = outputs[0][inputs.shape[-1]:]
    last_output = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    print(last_output)

    messages.append({"role": "assistant", "content": last_output})

In [None]:
# Sanity check
chat_func("Why do I like summer?")

In [None]:
chat_func("How can you feel the beauty of nature?")

## DPO

## Dataset Load
- https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1

In [None]:
from datasets import load_dataset

dataset = load_dataset("jondurbin/truthy-dpo-v0.1")

In [None]:
dataset

In [None]:
dataset['train'][200]

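### Adjusting the prompts for training
- Build the training prompts in a chat format using special tokens; the code below uses the Gemma-style turn markers (`<start_of_turn>`, `<end_of_turn>`) shown earlier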

In [None]:
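def generate_prompt(example):
    """Wrap one preference record (prompt, chosen, rejected) in chat turn markers."""
    prompt = example['prompt']
    rejected = example['rejected']
    chosen = example['chosen']

    # The prompt ends with the opened model turn, so chosen and rejected
    # are the two candidate continuations of the same prompt.
    example['prompt'] = f"<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
    example['rejected'] = f"{rejected}<end_of_turn>\n<eos>"
    example['chosen'] = f"{chosen}<end_of_turn>\n<eos>"

    return example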

In [None]:
transformed_dataset = dataset.map(generate_prompt)

In [None]:
# Check the result
transformed_dataset['train'][0]

In [None]:
# data split
dataset = transformed_dataset['train'].train_test_split(test_size=0.05)

In [None]:
dataset

### DPO training
- Load the model with QLoRA so that training fits on Colab, then run DPO training

In [None]:
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)
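
To get a feel for how small the adapter configured above is, the number of trainable LoRA parameters can be estimated by hand: for each targeted linear layer, LoRA adds `r * (in_features + out_features)` weights. The sketch below assumes TinyLlama-1.1B's layer shapes (hidden size 2048, key/value projection size 256, MLP size 5632, 22 layers); these dimensions are assumptions not stated in this notebook and should be checked against the model's `config.json`.

```python
def lora_param_count(in_features, out_features, r):
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


# Assumed TinyLlama-1.1B dimensions (not taken from this notebook).
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS = 2048, 256, 5632, 22
R = 16  # same rank as lora_config above

# (in_features, out_features) for every module listed in target_modules.
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(i, o, R) for i, o in module_shapes.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapter is on the order of 1% of the 1.1B base weights, which is what keeps training feasible on a single Colab GPU.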

In [None]:
# Load the model
BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map="auto", quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'right'

- Run the DPO Trainer

In [None]:
# Configure TrainingArguments
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_32bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    per_device_eval_batch_size=2,
    logging_steps=100,
    learning_rate=5e-7,
    eval_steps=100,
    num_train_epochs=1,
    save_steps=500,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
)
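
The combination of `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` means the learning rate rises linearly to `5e-7` over the first 10% of steps and then decays along a half cosine. The sketch below approximates that shape in plain Python; it is an illustration rather than the scheduler the Trainer actually instantiates. The function name `lr_at_step` is hypothetical, and the total of 965 steps is taken from the `global_step` reported in this notebook's training output.

```python
import math


def lr_at_step(step, total_steps=965, peak_lr=5e-7, warmup_ratio=0.1):
    """Approximate learning rate under linear warmup plus cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 97, 500, 965):
    print(step, lr_at_step(step))
```

With such a small peak learning rate and a single epoch, the policy moves only slightly away from the reference model, which is consistent with the slow, steady loss decrease one would expect from this configuration.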

In [None]:
from trl import DPOTrainer

trainer = DPOTrainer(
    model,
    ref_model=None,
    args=training_args,
    beta=0.1,
    peft_config=lora_config,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    tokenizer=tokenizer,
    max_prompt_length=512,
    max_length=1024,
)

In [None]:
# Training: takes about 25 minutes
trainer.train()

Could not estimate the number of tokens of the input, floating-point operations will not be computed


| Step | Training Loss | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 0.6927 | 0.691564 | 0.003891 | 0.000709 | 0.826923 | 0.003182 | -186.024521 | -154.832626 | -4.003264 | -3.812699 |
| 200 | 0.6901 | 0.688496 | 0.011521 | 0.002155 | 0.865385 | 0.009365 | -186.010071 | -154.756332 | -4.003408 | -3.812675 |
| 300 | 0.6868 | 0.685275 | 0.01882 | 0.002965 | 0.865385 | 0.015855 | -186.001953 | -154.683334 | -4.003627 | -3.81269 |
| 400 | 0.6829 | 0.682241 | 0.027336 | 0.005342 | 0.865385 | 0.021994 | -185.978195 | -154.598175 | -4.0038 | -3.81274 |
| 500 | 0.6811 | 0.679886 | 0.032155 | 0.005354 | 0.884615 | 0.026801 | -185.978104 | -154.549988 | -4.003926 | -3.812731 |
| 600 | 0.6792 | 0.678066 | 0.036169 | 0.005626 | 0.846154 | 0.030543 | -185.975357 | -154.509857 | -4.004046 | -3.812742 |
| 700 | 0.6773 | 0.676949 | 0.038598 | 0.005755 | 0.865385 | 0.032844 | -185.97406 | -154.485535 | -4.004115 | -3.812745 |
| 800 | 0.675 | 0.676309 | 0.04036 | 0.00619 | 0.865385 | 0.03417 | -185.969727 | -154.467941 | -4.004153 | -3.812748 |
| 900 | 0.6775 | 0.676144 | 0.040724 | 0.006212 | 0.865385 | 0.034512 | -185.969482 | -154.464294 | -4.004165 | -3.81275 |


TrainOutput(global_step=965, training_loss=0.6821798848364637, metrics={'train_runtime': 1395.6389, 'train_samples_per_second': 0.691, 'train_steps_per_second': 0.691, 'total_flos': 0.0, 'train_loss': 0.6821798848364637, 'epoch': 1.0})
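
The columns in the training log above can be tied back to the DPO objective. As a rough sketch, assuming the standard sigmoid form of the DPO loss, each implicit reward is `beta` times the gap between the policy and reference log-probabilities, the margin is the chosen reward minus the rejected reward, and the loss is `-log(sigmoid(margin))`. The helper name `dpo_metrics` and the log-probability values below are made up for illustration.

```python
import math


def dpo_metrics(policy_chosen_logp, policy_rejected_logp,
                ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO quantities for one (chosen, rejected) pair."""
    reward_chosen = beta * (policy_chosen_logp - ref_chosen_logp)
    reward_rejected = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = reward_chosen - reward_rejected
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    return {
        "reward_chosen": reward_chosen,
        "reward_rejected": reward_rejected,
        "margin": margin,
        "accuracy": float(margin > 0),  # 1.0 when the chosen answer is preferred
        "loss": loss,
    }


# Hypothetical log-probabilities: before training the policy equals the reference.
untrained = dpo_metrics(-154.8, -186.0, -154.8, -186.0)
# After some training the policy assigns a slightly higher log-probability to the chosen answer.
trained = dpo_metrics(-154.4, -186.0, -154.8, -186.0)
print(untrained["loss"], trained["loss"], trained["margin"])
```

When the policy and the reference agree, the margin is 0 and the loss is ln 2 ≈ 0.693, which is why the logged loss starts near 0.69 and decreases only as the reward margin grows.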

### Save the trained LoRA weights

In [None]:
ADAPTER_MODEL = "lora_adapter"

trainer.model.save_pretrained(ADAPTER_MODEL)



- Merge the adapter into the base model to obtain a single fine-tuned model

In [None]:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map='auto', torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL, device_map='auto', torch_dtype=torch.float16)

model = model.merge_and_unload()
model.save_pretrained("./TinyLlama-1.1B-Chat-v0.6_DPO") # Save the merged model
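
Conceptually, merging folds each low-rank update into its base weight, `W' = W + (lora_alpha / r) * B @ A`, so the adapter is no longer needed at inference time. The toy example below works through that arithmetic with plain Python lists. It is a sketch of the idea, not PEFT's actual implementation, and the helper names and matrix values are made up.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A for one linear layer."""
    scale = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy layer: 2x2 base weight with a rank-1 adapter (values are illustrative).
W = [[1.0, 0.0], [0.0, 1.0]]   # (out_features x in_features)
A = [[1.0, 2.0]]               # (r x in_features)
B = [[3.0], [4.0]]             # (out_features x r)

merged = merge_lora(W, A, B, lora_alpha=1, r=1)

# The merged layer gives the same output as base layer plus adapter path.
x = [[1.0], [1.0]]
merged_out = matmul(merged, x)
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
separate_out = [[b[0] + a[0]] for b, a in zip(base_out, adapter_out)]
print(merged, merged_out, separate_out)
```

With `lora_alpha=16` and `r=16`, as in the configuration used in this notebook, the scale factor is likewise 1.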

### Inference

- Load the saved model

In [None]:
BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"
FINETUNE_MODEL = "./TinyLlama-1.1B-Chat-v0.6_DPO"

base_model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, device_map={"":0})
finetune_model = AutoModelForCausalLM.from_pretrained(FINETUNE_MODEL, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, add_special_tokens=True)

In [None]:
pipe_base = pipeline("text-generation", model=base_model, tokenizer=tokenizer, max_new_tokens=512)

In [None]:
pipe_finetuned = pipeline("text-generation", model=finetune_model, tokenizer=tokenizer, max_new_tokens=512)

In [None]:
prompt = dataset['test'][10]['prompt']

In [None]:
prompt

'<bos><start_of_turn>system\n <end_of_turn><start_of_turn>user\nDoes microwaving food destroy its nutrients?<end_of_turn>\n<start_of_turn>model\n'

- Inference with the model before DPO

In [None]:
outputs = pipe_base(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    add_special_tokens=True
)
print(outputs[0]["generated_text"])

- Inference with the model after DPO

In [None]:
outputs = pipe_finetuned(
    prompt,
    do_sample=True,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    add_special_tokens=True
)
print(outputs[0]["generated_text"])