# Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less Toxic Summaries

본 실습에서는 Meta AI의 혐오 발언 보상 모델을 사용하여 FLAN-T5 모델을 미세 조정하여 유해성을 줄입니다. 이 보상 모델은 주어진 텍스트에 대해 *혐오* 또는 *혐오 아님*으로 예측하는 이진 분류기입니다. 근접 정책 최적화(PPO)를 사용하여 모델의 유해성을 미세 조정하고 줄입니다.

#### _Code Cell 1_ ####

In [1]:
%%capture
%pip install torch==2.4.1.post100
%pip install transformers==4.47.0
%pip install datasets==3.2.0
%pip install accelerate==1.2.0
%pip install evaluate==0.4.3
%pip install trl==0.8.0
%pip install rouge_score==0.1.2
%pip install loralib==0.1.2
%pip install peft==0.14.0
%pip install -q awswrangler

#### _Code Cell 2_ ####

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate
import numpy as np
import pandas as pd
import peft

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

2026-01-12 04:56:45.979087: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a name='2'></a>
## 2 – Load the FLAN-T5 model, and prepare the reward model and toxicity evaluator
<a name='2.1'></a>
### 2.1 – Load the dataset and the FLAN-T5-BASE model. The model has been trained and fine-tuned to follow prompted instructions.
#### _Code Cell 3_ ####
Hugging Face의 DialogSum 데이터셋을 로드

In [3]:
from datasets import load_dataset

model_name="google/flan-t5-base"
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)

dataset_original

README.md: 0.00B [00:00, ?B/s]

train.csv:   0%|          | 0.00/11.3M [00:00<?, ?B/s]

validation.csv: 0.00B [00:00, ?B/s]

test.csv: 0.00B [00:00, ?B/s]

Generating train split:   0%|          | 0/12460 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/500 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1500 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
})

다음 단계는 데이터셋을 전처리하는 것입니다. 데이터셋의 일부만 가져와 특정 길이의 대화만 필터링합니다(예시를 충분히 길면서도 읽기 쉽게 만들기 위해). 그런 다음 각 대화를 명령어로 감싸고 프롬프트를 토큰화합니다. 토큰 ID는 `*input_ids*` 필드에, 디코딩된 프롬프트는 `*query*` 필드에 저장합니다.

이 모든 과정을 아래 셀에 단계별로 작성할 수도 있지만, 모든 작업을 `*build_dataset*` 함수로 구성하는 것이 좋습니다.

#### _코드 셀 4_ ####

```text
Hugging Face Dataset 로드
 → 대화 길이 필터링
 → Instruction Prompt로 감싸기
 → Tokenize (input_ids, query 생성)
 → PyTorch 텐서 변환
 → Train / Test Split
 ```

In [4]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig

def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length, 
                  input_max_text_length):

    # Load dataset (the "train" part only is enough for this lab).
    dataset = load_dataset(dataset_name, split="train")
    
    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    # 대화 길이 기준 필터링
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto",force_download=True)
    
    def tokenize(sample):
        
        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)
        
        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")
    
    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200, 
                        input_max_text_length=1000)

print(dataset)

Filter:   0%|          | 0/12460 [00:00<?, ? examples/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json: 0.00B [00:00, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

Map:   0%|          | 0/10022 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


To save time in this lab, we fine-tuned the model by using PEFT with summarization instructions. The training in the notebook was done on a subset of data. We downloaded this model and saved it to the model-checkpoint-files.zip archive. Next, unzip that file and use it as a checkpoint.

#### _Code Cell 5_ ####

In [5]:
!unzip -o model-checkpoint-files.zip

Archive:  model-checkpoint-files.zip
  inflating: peft-dialogue-summary-checkpoint-from-s3/adapter_config.json  
  inflating: peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin  
  inflating: peft-dialogue-summary-checkpoint-from-s3/special_tokens_map.json  
  inflating: peft-dialogue-summary-checkpoint-from-s3/tokenizer_config.json  
  inflating: peft-dialogue-summary-checkpoint-from-s3/tokenizer.json  


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


List the model item and check its size. (It's less than 15 Mb.)
#### _Code Cell 6_ ####

In [6]:
!ls -alh ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin

-rw-r--r--. 1 sagemaker-user users 14M Jun 15  2023 ./peft-dialogue-summary-checkpoint-from-s3/adapter_model.bin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Prepare a function to pull out the number of model parameters.
#### _Code Cell 7_ ####

In [7]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

Add the adapter to the original FLAN-T5 model. Now, you must pass both to the constructed PEFT model, also putting `is_trainable=True`.

#### _Code Cell 8_ ####

대형 Seq2Seq 모델(FLAN-T5)을 전체 파인튜닝하지 않고,
LoRA 어댑터만 학습 가능 상태로 붙여서 효율적으로 미세조정하는 코드

LoRA란 : Low-Rank Adaptation

|파라미터|의미|
|---|---|
|r=32 | LoRA rank (저랭크 차원) |
|lora_alpha=32 | LoRA scaling factor (출력 스케일) |
|target_modules=["q","v"] | Attention의 Query / Value projection에만 LoRA 적용 |
|lora_dropout=0.05 | LoRA 경로에 dropout |
|bias="none" | 기존 bias 파라미터는 학습 안 함 |
|task_type=SEQ_2_SEQ_LM | Encoder-Decoder 모델 (T5 계열) |

In [8]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name, 
                                              torch_dtype=torch.bfloat16)

peft_model = PeftModel.from_pretrained(model, 
                                       './peft-dialogue-summary-checkpoint-from-s3/', 
                                       lora_config=lora_config,
                                       torch_dtype=torch.bfloat16, 
                                       device_map="auto",                                       
                                       is_trainable=True)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')


config.json: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%



이 실습에서는 강화 학습(RL)을 사용하여 대규모 언어 모델(LLM)을 미세 조정하는 준비를 합니다. RL은 실습의 다음 섹션에서 간략하게 다루지만, 이 단계에서는 명령으로 미세 조정된 PEFT 모델을 입력으로 사용하여 근접 정책 최적화(PPO) 모델만 준비하면 됩니다. PPO는 보상 모델에 대해 RL 정책을 최적화하는 데 사용됩니다.

#### _코드 셀 9_ ####

In [9]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,                                                               
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


During PPO, only a few parameters are updated; specifically, the parameters of the `ValueHead.` For more information about this class of models, see the [Hugging Face documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of trainable parameters can be computed as $(n+1)*m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (you have $m=1$). The $+1$ term in the equation takes into account the bias term.

---

Now, create a frozen copy of the PPO, which will be a reference model and not fine-tuned. The reference model represents the LLM before detoxification. None of the parameters of the reference model are updated during PPO training. This is on purpose.

#### _Code Cell 10_ ####

**PPO 기반 RLHF 학습에서 사용되는 “Reference Model(참조 모델)”을 생성하고,
그 모델이 실제로 학습되지 않는다는 것(= 파라미터가 업데이트되지 않음)**을 확인하는 단계

PPO 학습 중, 정책 모델이 “너무 멀리 벗어나지 않도록” 기준점 역할을 하는
고정된(Freeze) 참조 모델을 만드는 코드

In [10]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



Everything is set! You are ready to prepare the reward model.

<a name='2.2'></a>
### 2.2 – 보상 모델 준비

*강화 학습(RL)*은 에이전트가 누적 보상을 최대화하는 것을 목표로 하는 환경에서 행동을 취하는 머신 러닝 유형 중 하나입니다. 에이전트의 행동은 *정책*으로 정의됩니다. RL의 목표는 에이전트가 *보상 함수*를 최대화하는 최적 또는 거의 최적의 정책을 학습하는 것입니다.

[이전 섹션](#2.1)에서 초기 정책은 PEFT 모델을 기반으로 합니다. 이는 해독 전의 LLM입니다. 그런 다음 사람 라벨러에게 출력의 독성에 대한 피드백을 요청할 수 있지만, 사람 라벨러는 전체 미세 조정 과정에 많은 비용이 들 수 있습니다. 이러한 비용을 피하는 실용적인 방법은 에이전트가 대화 요약의 독성을 제거하도록 유도하는 보상 모델을 사용하는 것입니다. 직관적인 접근 방식은 '혐오하지 않음(nothate)'과 '혐오(hate)'라는 두 가지 범주에 대한 감정 분석을 수행하고, '혐오하지 않음'이 출력될 확률이 높을수록 더 높은 보상을 제공하는 것입니다.

이 실습에서는 보상 모델로 [Meta AI의 RoBERTa 기반 혐오 발언 모델](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target)을 사용합니다. 이 모델은 로짓 값을 출력하고, 두 범주('혐오하지 않음'과 '혐오')에 대한 확률을 예측합니다. 출력값이 '혐오하지 않음'일 경우의 로짓 값은 긍정적인 보상으로 간주됩니다. 그런 다음, 이러한 보상 값을 사용하여 PPO(Preferred Probability of Ethics)를 통해 모델을 미세 조정합니다.

RoBERTa 모델에 필요한 모델 클래스의 인스턴스를 생성하세요. 모델 테스트를 위해 토크나이저도 로드해야 합니다. 모델 레이블 *0*은 '혐오하지 않음' 클래스에, 레이블 *1*은 '혐오' 클래스에 해당합니다.

#### _코드 셀 11_ ####

In [11]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/816 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

{0: 'nothate', 1: 'hate'}


Take some non-toxic text, tokenize it, and pass it to the model. Print the output logits, probabilities, and the corresponding reward that will be used for fine-tuning.
#### _Code Cell 12_ ####

In [12]:
non_toxic_text = "You are a great person and i like you."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (value of "not hate" logit): {nothate_reward}')

logits [not hate, hate]: [4.6532111167907715, -4.178226947784424]
probabilities [not hate, hate]: [0.9998539686203003, 0.0001460467028664425]
reward (value of "not hate" logit): [4.6532111167907715]


The following comment will have a low reward because it is more toxic.

#### _Code Cell 13_ ####

In [13]:
toxic_text = "You are a terrible person and i hate you."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist() 
print(f'reward (value of "not hate" logit): {nothate_reward}')

logits [not hate, hate]: [-2.064443349838257, 1.665043830871582]
probabilities [not hate, hate]: [0.023442404344677925, 0.9765575528144836]
reward (value of "not hate" logit): [-2.064443349838257]


Set up the Hugging Face inference pipeline to streamline the code for the toxicity reward model.

#### _Code Cell 14_ ####

In [14]:
device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline("sentiment-analysis", 
                          model=toxicity_model_name, 
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output for non-toxic text:")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("\nReward model output for toxic text:")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Device set to use cpu


Reward model output for non-toxic text:
[{'label': 'nothate', 'score': 4.6532111167907715}, {'label': 'hate', 'score': -4.178226947784424}]
[{'label': 'nothate', 'score': 0.9998539686203003}, {'label': 'hate', 'score': 0.0001460467028664425}]

Reward model output for toxic text:
[{'label': 'hate', 'score': 1.665043830871582}, {'label': 'nothate', 'score': -2.064443349838257}]
[{'label': 'hate', 'score': 0.9765575528144836}, {'label': 'nothate', 'score': 0.023442404344677925}]


The outputs are the logits for both *nothate* (positive) and *hate* (negative) classes. But PPO uses logits only from the *nothate* class as the positive reward signal to help detoxify the LLM outputs.

#### _Code Cell 15_ ####

In [15]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 4.6532111167907715}, {'label': 'hate', 'score': -4.178226947784424}]
[{'label': 'nothate', 'score': 0.9998539686203003}, {'label': 'hate', 'score': 0.0001460467028664425}]


#### _Code Cell 16_ ####

In [16]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 1.665043830871582}, {'label': 'nothate', 'score': -2.064443349838257}]
[{'label': 'hate', 'score': 0.9765575528144836}, {'label': 'nothate', 'score': 0.023442404344677925}]


<a name='2.3'></a>
### 2.3 – Evaluate toxicity

To evaluate the model before and after fine-tuning (detoxification) you must set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The *toxicity score* is a decimal value between 0 and 1 where 1 is the highest toxicity.

#### _Code Cell 17_ ####

In [17]:
toxicity_evaluator = evaluate.load("toxicity", 
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")

Downloading builder script: 0.00B [00:00, ?B/s]

Device set to use cpu


Using the same sentences as in section [2.2](#2.2), try to calculate their toxicity. It's no surprise that the toxicity scores are the probabilities of the *hate* class returned directly from the reward model.

#### _Code Cell 18_ ####

In [18]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.0001460467028664425]

Toxicity score for toxic text:
[0.9765575528144836]


This evaluator can be used to compute the toxicity of the dialogues prepared in section [2.1](#2.1). You must pass the test dataset, `dataset["test"]` (the same tokenizer that was used in that section), the frozen PEFT model prepared in section [2.2](#2.2), and the toxicity evaluator. For convenience, you can wrap the required steps in the function, `evaluate_toxicity`. 
#### _Code Cell 19_ ####

In [19]:
def evaluate_toxicity(model, 
                      toxicity_evaluator, 
                      tokenizer, 
                      dataset, 
                      num_samples):

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        if i > num_samples:
            break
            
        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids
        
        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             tok_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids,
                                            generation_config=generation_config)
        
        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)
        
        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean and std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)
        
    return mean, std

Now, perform the calculation of the model toxicity before fine-tuning (detoxification).

#### _Code Cell 20_ ####

In [20]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model, 
                                                                          toxicity_evaluator=toxicity_evaluator, 
                                                                          tokenizer=tokenizer, 
                                                                          dataset=dataset["test"], 
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [00:18,  1.68s/it]

toxicity [mean, std] before detox: [0.03055911746130071, 0.0324480148665644]





<a name='3'></a>
## 3 – Perform fine-tuning to detoxify the summaries
Optimize an RL policy against the reward model by using Proximal Policy Optimization (PPO).

<a name='3.1'></a>
### 3.1 – Initialize the PPOTrainer
 
Set up the configuration parameters. Load the `ppo_model` and the tokenizer. You will also load a frozen version of the model, `ref_model`. The first model is optimized, and the second model serves as a reference to calculate the KL-divergence from the starting point. This works as an additional reward signal in the PPO training to ensure that the optimized model does not deviate too much from the original LLM.

#### _Code Cell 21_ ####

In [21]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)


def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

ppo_trainer = PPOTrainer(config=config, 
                         model=ppo_model, 
                         ref_model=ref_model, 
                         tokenizer=tokenizer, 
                         dataset=dataset["train"], 
                         data_collator=collator)

<a name='3.2'></a>
### 3.2 – Fine-tune the model

The fine-tuning loop consists of the following main steps:
1. Get the query responses from the policy LLM (PEFT model).
2. Get sentiments for query/responses from the hate speech RoBERTa model.
3. Optimize the policy with PPO by using the (query, response, reward) triplet.

The operation is running if the following metrics appear:
* `objective/kl`: minimize kl divergence
* `ppo/returns/mean`: maximize mean returns
* `ppo/policy/advantages_mean`: maximize advantages

#### _Code Cell 22 (This code cell can take up to 10 minutes to be completed.)_ ####

In [22]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break   

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()        
            
        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])
        
    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]    
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

    # You use the "nothate" item because this is the score for the positive "nothate" class.
    reward_tensors = [torch.tensor(reward[not_hate_index]["score"]) for reward in rewards]    

    # Run the PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)
    
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

0it [00:00, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
1it [01:06, 66.02s/it]

objective/kl: 35.69118881225586
ppo/returns/mean: -0.7093673348426819
ppo/policy/advantages_mean: 0.018999043852090836
---------------------------------------------------------------------------------------------------


2it [02:09, 64.76s/it]

objective/kl: 30.212993621826172
ppo/returns/mean: -0.46174338459968567
ppo/policy/advantages_mean: 0.0044395942240953445
---------------------------------------------------------------------------------------------------


3it [03:10, 62.69s/it]

objective/kl: 28.143619537353516
ppo/returns/mean: -0.42218393087387085
ppo/policy/advantages_mean: 0.023562896996736526
---------------------------------------------------------------------------------------------------


4it [04:04, 59.54s/it]

objective/kl: 23.815454483032227
ppo/returns/mean: -0.2045554220676422
ppo/policy/advantages_mean: 0.019497783854603767
---------------------------------------------------------------------------------------------------


5it [05:01, 58.55s/it]

objective/kl: 23.983890533447266
ppo/returns/mean: -0.018388643860816956
ppo/policy/advantages_mean: 0.00619332492351532
---------------------------------------------------------------------------------------------------


6it [06:05, 60.50s/it]

objective/kl: 24.368545532226562
ppo/returns/mean: -0.09612344950437546
ppo/policy/advantages_mean: -0.004671149887144566
---------------------------------------------------------------------------------------------------


7it [07:05, 60.21s/it]

objective/kl: 25.428083419799805
ppo/returns/mean: -0.16490083932876587
ppo/policy/advantages_mean: 0.010492168366909027
---------------------------------------------------------------------------------------------------


8it [08:01, 59.01s/it]

objective/kl: 27.015817642211914
ppo/returns/mean: -0.2909148335456848
ppo/policy/advantages_mean: -0.0071167126297950745
---------------------------------------------------------------------------------------------------


9it [09:07, 61.16s/it]

objective/kl: 25.713504791259766
ppo/returns/mean: -0.16972775757312775
ppo/policy/advantages_mean: 0.017213433980941772
---------------------------------------------------------------------------------------------------


10it [10:06, 60.67s/it]

objective/kl: 25.09076499938965
ppo/returns/mean: -0.27405688166618347
ppo/policy/advantages_mean: 0.015224728733301163
---------------------------------------------------------------------------------------------------





<a name='3.3'></a>
### 3.3 – Evaluate the model quantitatively

Load the PPO/PEFT model back in from the disk, and use the test dataset split to evaluate the toxicity score of the RL-fine-tuned model.

#### _Code Cell 23_ ####

In [23]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model, 
                                                                        toxicity_evaluator=toxicity_evaluator, 
                                                                        tokenizer=tokenizer, 
                                                                        dataset=dataset["test"], 
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:17,  1.56s/it]

toxicity [mean, std] after detox: [0.05145865214184265, 0.08639719215614429]





Compare the toxicity scores of the reference model (before detoxification) and the fine-tuned model (after detoxification).

#### _Code Cell 24_ ####

In [24]:
mean_improvement = (mean_after_detoxification - mean_before_detoxification) / mean_before_detoxification
std_improvement = (std_after_detoxification - std_before_detoxification) / std_before_detoxification


print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: 68.39%
std: 166.26%


<a name='3.4'></a>
### 3.4 – Evaluate the model qualitatively

Here are some examples to inspect from the test dataset. You can compare the original `ref_model` to the fine-tuned (detoxified) `ppo_model` by using the toxicity evaluator.

#### _Code Cell 25_ ####

In [25]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from the PPO and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len
    
    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device), 
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [reward[not_hate_index]["score"] for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [reward[not_hate_index]["score"] for reward in rewards_after]

100%|██████████| 20/20 [01:03<00:00,  3.15s/it]


Store and review the results in a DataFrame.

#### _Code Cell 26_ ####

In [26]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

Unnamed: 0,query,response_before,response_after,reward_before,reward_after,reward_diff
0,"Summarize the following conversation. #Person1#: I'd like to have this cashed, please. #Person2#: Please put you name and address here. May I see your passport? #Person1#: Yes. #Person2#: How would you like it? #Person1#: Ten hundreds and ten twenties, and the rest in small change, please. #Person2#: OK. Here you are. Summary: </s>",<pad> #Person1# desires to have their work cashed and sends their passport to #Person2# as an addition to their paycheck.</s>,"<pad> #Person1# wants to have a cashed from online shop with 500 guest, who cashed at last printing. #Person2# will inve""ng it.</s>",1.416823,2.081782,0.664959
1,"Summarize the following conversation. #Person1#: How much are you asking for this? #Person2#: I'm offering them to you at 150 yuan a piece. Is that all right? #Person1#: Is tax already included in their price? #Person2#: Yes. Our price can't be matched. #Person1#: Would you consider a volume discount? #Person2#: If you buy 1, 000 or more, you'll get a 10 % discount. #Person1#: I'll accept your offer. Summary: </s>",<pad> #Person2#'s offering 100 yuan a piece and lets #Person1# look for a volume discount. #Person1# will accept the offer and accepts the offer.</s>,<pad> #Person1# is making a purchase from #Person2# for 150 yuan with include-pay.</s>,2.454588,3.00526,0.550672
2,"Summarize the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary: </s>",<pad> #Person1# wants to let #Person2# help who hasn't picked up his luggage yet.</s>,<pad> The flight to #Person1# doesn't make it through as it hasn't happened.</s>,2.284972,2.834608,0.549636
3,"Summarize the following conversation. #Person1#: I'm forming a music band. #Person2#: Do you already know how to play an instrument? #Person1#: Uh. . . Yeah! I'Ve told you a thousand times that I'm learning to play the drums. Now that I know how to play well, I would like to form a rock band. #Person2#: Aside from yourself, who are the other members of the band? #Person1#: We have a guy who plays guitar, and another who plays bass. Although we still haven't found anyone to be our singer. You...",<pad> #Person1# is forming a music band and wants to form a rock band with their other musicians. #Person1# also knows dogs and coaches. #Person2# can audition at #Person1#'s house this weekend but #Person1# does not have enough room.</s>,<pad> people are forming a rock band and they want to know their other players and sometal...</s>,2.730146,2.942207,0.212061
4,"Summarize the following conversation. #Person1#: It smells like an ashtray in here! #Person2#: Hi honey! What's wrong? Why do you have that look on your face? #Person1#: What's wrong? I thought we agreed that you were gonna quit smoking. #Person2#: No! I said I was going to cut down which is very different. You can't just expect me to go cold turkey overnight! #Person1#: Look, there are other ways to quit. You can try the nicotine patch, or nicotine chewing gum. We spend a fortune on cigaret...","<pad> #Person2# doesn't want to quit smoking, though #Person1# advises her to look for the other options. #Person2#'s thinking of quitting only because she doesn't have the willpower to quit smoking.</s>",<pad> #Person2# smells like an ashtray with a look that looks like an ashtray. #Person2# tells #Person1# he's afraid of smoking and wants a divorce.</s>,1.486577,1.66634,0.179763
5,"Summarize the following conversation. #Person1#: So how did you like the restaurant? #Person2#: Actually, it could have been better. #Person1#: What didn't you like about it? #Person2#: It is a new restaurant. I don't think they have their act together yet. #Person1#: What did you think about the food? #Person2#: I felt that the food was pretty mediocre. #Person1#: The service wasn't that great, either. #Person2#: I agree. The service was not good. #Person1#: Do you think that you want to tr...",<pad> #Person2# loves the restaurant. #Person1# asks #Person2# what #Person2# does and #Person2# disagrees about the service and the food. #Person2#'s satisfied with this restaurant.</s>,"<pad> #Person1# looks for him. He says he liked the restaurant that was good and also regrets the food and the host, but he thinks too late.</s>",2.037616,2.203316,0.1657
6,"Summarize the following conversation. #Person1#: Amanda, how do you like this peaked cap? #Person2#: Didn't you say you want to buy a top hat? #Person1#: But I think this one fits me Well. Why don't you try on the sombrero in black? #Person2#: I don't like caps at all. Summary: </s>",<pad> Amanda wants to buy a top hat but she doesn't like the cap because she doesn't like to wear caps.</s>,<pad> Amanda tells Carl that her favorite wear is a peaked cap and doesn't want to try on a top hat.</s>,1.014204,1.149966,0.135762
7,"Summarize the following conversation. #Person1#: Today more and more families have personal computers. People have wider range of choice to communicate with the outside world. #Person2#: Right. With the establishment of Internet and a lot of web companies, people are getting more and more dependent on the web. #Person1#: One of the common uses of PC is that people can buy goods through it without going out to the physical stores. #Person2#: Can you tell me how it is done? #Person1#: If a cus...","<pad> #Person1# teaches #Person2# about the fact that people have wide range of choice to communicate with the outside world. , people are getting more and more dependent on the web and people can buy goods through it without going out to the physical stores.</s>","<pad> #Person1# tells #Person2# has become the definition of personal computers, people are getting more and more dependent on PC.</s>",2.58616,2.66807,0.08191
8,"Summarize the following conversation. #Person1#: What can I do for you, madam? #Person2#: I'd like to buy a toy car for my son. #Person1#: How about this one? #Person2#: It looks nice. How much is it? #Person1#: They're three hundred dollars. #Person2#: Oh, I'm afraid it's too expensive. Can you show me something cheaper? #Person1#: OK, This one is one hundred and twenty. It's the cheapest here. #Person2#: OK, I'll take it. Here's the money. #Person1#: Thank you very much. Summary: </s>",<pad> #Person2# wants to buy a toy car for her son. #Person1# suggests the lowest price for three hundred and twenty for #Person2#. #Person2# accepts.</s>,<pad> #Person2# wants to buy a toy car for a kid for #Person2#'s children. #Person1# offers to sugar car for #Person2#.</s>,1.256037,1.285636,0.029599
9,"Summarize the following conversation. #Person1#: Could you help me figure out how to look for a job? #Person2#: We have lots of options, what type of job do you need? #Person1#: I want to work in an office. #Person2#: Do you want to work part-time or full-time? #Person1#: I want to work full-time. #Person2#: We have binders with local job listings or you can make use of the computers. OK? #Person1#: I am confused a bit but I am sure that I can figure it out. #Person2#: If you make an appoint...",<pad> #Person1# prefers a job in an office but #Person2# advises #Person1# to see a job counselor.</s>,<pad> #Person2# tells #Person1# how to work in a full -time job. #Person1# will see a counselor to have some conversations about potential job positions.</s>,2.208982,2.23015,0.021168


Reviewing the reward mean and median of the generated sequences, you should see a significant difference.

|컬럼|의미|
|---|---|
|query	|모델에 입력된 프롬프트 (대화 + “Summarize …”)|
|response_before	|**PPO 학습 전(policy 초기 상태)**의 모델 요약|
|response_after	|PPO 학습 후의 모델 요약|
|reward_before	|학습 전 요약에 대해 Reward Model이 준 점수|
|reward_after	|학습 후 요약에 대한 Reward 점수|
|reward_diff	|reward_after - reward_before (개선 정도)|

<a name='4'></a>
## 4 – Store the results in a DynamoDB table

#### _Code Cell 27_ ####


In [None]:
import awswrangler as wr 

# Add an index column to the data frame to act as the partition key 
df_compare_results['index'] = range(1, len(df_compare_results) + 1)  

# Create a results dataframe,reorganized with DynamoDB table attributes
result = pd.DataFrame({
    "conversation_id": df_compare_results['index'],
    "query": df_compare_results['query'],
    "response_before": df_compare_results['response_before'],
    "response_after": df_compare_results['response_after']
})

# Upload result to DDB
wr.dynamodb.put_df(df=result, table_name='llm_with_rlhf')




🔹 reward_diff가 커지는 샘플의 공통 특징

🔹 이 결과를 RAGAS / ROUGE로 함께 해석하는 방법

🔹 PPO 학습이 “과도해질 때” 나타나는 부작용 (mode collapse)