# Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries

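In this notebook, you will fine-tune a FLAN-T5 model to generate less toxic content, using Meta AI's hate speech reward model. The reward model is a binary classifier that predicts either "hate" or "not hate" for a given text. You will use Proximal Policy Optimization (PPO) to fine-tune the model and reduce its toxicity.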

# Table of Contents

- [ 1 - Set up Kernel and Required Dependencies](#1)
- [ 2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator](#2)
  - [ 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction](#2.1)
  - [ 2.2 - Prepare Reward Model](#2.2)
  - [ 2.3 - Evaluate Toxicity](#2.3)
- [ 3 - Perform Fine-Tuning to Detoxify the Summaries](#3)
  - [ 3.1 - Initialize `PPOTrainer`](#3.1)
  - [ 3.2 - Fine-Tune the Model](#3.2)
  - [ 3.3 - Evaluate the Model Quantitatively](#3.3)
  - [ 3.4 - Evaluate the Model Qualitatively](#3.4)

<a name='1'></a>
## 1 - Set up Kernel and Required Dependencies

Install the packages required for the LLM and the datasets.

In [2]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd

Collecting git+https://github.com/lvwerra/trl.git@25fa1bd
  Cloning https://github.com/lvwerra/trl.git (to revision 25fa1bd) to /tmp/pip-req-build-y8sbeoyc
  Running command git clone --filter=blob:none --quiet https://github.com/lvwerra/trl.git /tmp/pip-req-build-y8sbeoyc
  Running command git checkout -q 25fa1bd
  Resolved https://github.com/lvwerra/trl.git to commit 25fa1bd
  Preparing metadata (setup.py) ... done

Import the necessary components. Some of them are new; they are covered later in the notebook.

In [3]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

In [4]:
device = 0 if torch.cuda.is_available() else "cpu"

<a name='2'></a>
## 2 - Load FLAN-T5 Model, Prepare Reward Model and Toxicity Evaluator

<a name='2.1'></a>
### 2.1 - Load Data and FLAN-T5 Model Fine-Tuned with Summarization Instruction

As before, you will keep working with the Hugging Face dataset [DialogSum](https://huggingface.co/datasets/knkarthick/dialogsum) and the pre-trained model [FLAN-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5).

In [5]:
model_name='marianna13/flan-t5-base-summarization'
huggingface_dataset_name = "knkarthick/dialogsum"

dataset_original = load_dataset(huggingface_dataset_name)

dataset_original

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

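Next, preprocess the dataset. You will use only part of it, keeping the dialogues whose length falls within a given range (long enough to be meaningful, yet short enough to stay readable). Then wrap each dialogue in the instruction and tokenize the resulting prompt. Save the token ids in the `input_ids` field and the decoded version in the `query` field.

You could do all of this step by step in the cell below, but it is good practice to wrap it in a function, here `build_dataset`.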

In [6]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare tokenizer. Setting device_map="auto" allows to switch between GPU and CPU automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
          Summarize the following conversation.

          {sample["dialogue"]}

          Summary:
          """
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)




DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 8017
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic', 'input_ids', 'query'],
        num_rows: 2005
    })
})


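In the previous lab, you fine-tuned a PEFT model with the summarization instruction on a subset of the data. You may also recall loading from S3 a PEFT model that had been trained on the full dataset.

Because this notebook runs on Colab, you will instead load a model that has already been fully fine-tuned for summarization.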

Prepare a function to pull out the number of model parameters (it is the same as in the previous lab):

In [7]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"\ntrainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

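Add the adapter to the FLAN-T5 model. In the previous lab, the fully trained adapter was attached only at inference time, so there was no need to pass a LoRA configuration. This time the adapter will be trained, so pass the LoRA configuration here and set `is_trainable=True` when the PPO model is created.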

In [8]:
from peft import LoraConfig, get_peft_model, TaskType

In [9]:
lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

model = AutoModelForSeq2SeqLM.from_pretrained(model_name,
                                              torch_dtype=torch.bfloat16)

peft_model = get_peft_model(model=model,
                            peft_config=lora_config)

print(f'PEFT model parameters to be updated:\n{print_number_of_trainable_model_parameters(peft_model)}\n')


PEFT model parameters to be updated:

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
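
The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the usual FLAN-T5-base layout (12 encoder blocks, 12 decoder blocks, and 768 x 768 attention projections), which this notebook does not state; the helper `lora_param_count` is a hypothetical name used only for illustration.

```python
def lora_param_count(r, d_in, d_out, n_modules):
    """Parameters added by LoRA: every adapted linear layer gets A (r x d_in) and B (d_out x r)."""
    return n_modules * r * (d_in + d_out)

# Assumed FLAN-T5-base layout (not stated in the notebook):
n_encoder_blocks = 12  # self-attention only: q and v are adapted
n_decoder_blocks = 12  # self-attention and cross-attention: q and v in each
n_modules = n_encoder_blocks * 2 + n_decoder_blocks * 4  # 72 adapted projections

lora_params = lora_param_count(r=32, d_in=768, d_out=768, n_modules=n_modules)
print(lora_params)  # 3538944, the same number as in the output above
```

Under these assumptions, `target_modules=["q", "v"]` with `r=32` accounts for every trainable parameter of the PEFT model.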



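In this lab, you are preparing to fine-tune the LLM with reinforcement learning (RL). RL is introduced briefly in the next section of this lab. At this stage, you only need to prepare the PPO model by passing it the fine-tuned PEFT model. PPO will be used to optimize the RL policy against the reward model.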

In [10]:
ppo_model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(peft_model,
                                                               torch_dtype=torch.bfloat16,
                                                               is_trainable=True)

print(f'PPO model parameters to be updated (ValueHead + 769 params):\n{print_number_of_trainable_model_parameters(ppo_model)}\n')
print(ppo_model.v_head)

PPO model parameters to be updated (ValueHead + 769 params):

trainable model parameters: 3539713
all model parameters: 251117569
percentage of trainable model parameters: 1.41%

ValueHead(
  (dropout): Dropout(p=0.1, inplace=False)
  (summary): Linear(in_features=768, out_features=1, bias=True)
  (flatten): Flatten(start_dim=1, end_dim=-1)
)


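During PPO, only a small fraction of the parameters is updated: the LoRA adapter weights and, in addition, the parameters of the `ValueHead`. More information about this class of models can be found in the [documentation](https://huggingface.co/docs/trl/main/en/models#trl.create_reference_model). The number of trainable parameters in the `ValueHead` is $(n+1)*m$, where $n$ is the number of input units (here $n=768$) and $m$ is the number of output units (here $m=1$). The $+1$ term accounts for the bias.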
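
As a quick check of the $(n+1)*m$ formula, the snippet below recomputes the size of the value head and compares it with the two parameter counts printed earlier. The helper name `linear_param_count` is hypothetical.

```python
def linear_param_count(n_inputs, n_outputs):
    # (n + 1) * m: n weights plus one bias for each output unit.
    return (n_inputs + 1) * n_outputs

value_head_params = linear_param_count(n_inputs=768, n_outputs=1)
print(value_head_params)  # 769

# Difference between the PPO model and the PEFT model counts printed above.
print(3539713 - 3538944)  # 769
```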

Now create a frozen copy of the PPO model, which has not been fine-tuned yet, to serve as the reference model. The reference model represents the LLM before detoxification. None of its parameters are updated during PPO training.

In [11]:
ref_model = create_reference_model(ppo_model)

print(f'Reference model parameters to be updated:\n{print_number_of_trainable_model_parameters(ref_model)}\n')

Reference model parameters to be updated:

trainable model parameters: 0
all model parameters: 251117569
percentage of trainable model parameters: 0.00%



Everything is set. It is time to prepare the reward model.

<a name='2.2'></a>
### 2.2 - Prepare Reward Model

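**Reinforcement Learning (RL)** is a type of machine learning in which an agent chooses actions in an environment with the goal of maximizing its cumulative reward. The purpose of RL is for the agent to learn an optimal, or nearly optimal, policy that maximizes the **reward function**.

In the [previous section](#2.1), the original policy is based on the instruct PEFT model. This is the LLM before detoxification. You could ask human labelers to give feedback on the toxicity of the outputs, but that becomes very expensive when used for the entire fine-tuning process. A practical way to avoid this is to use a reward model that encourages the agent to detoxify the dialogue summaries. An intuitive approach is to run sentiment analysis over two classes (`nothate` and `hate`) and give a higher reward when the probability of the `nothate` class is higher.

You will use [Meta AI's RoBERTa-based hate speech model](https://huggingface.co/facebook/roberta-hate-speech-dynabench-r4-target) as the reward model. The model outputs **logits**, from which the probabilities of the two classes, `nothate` and `hate`, are computed. The logit of the `nothate` output is used as the reward, and the model is then fine-tuned with PPO using these reward values.

Create an instance of the model class for the RoBERTa model and load the matching tokenizer. Remember that index `0` is the `nothate` class and index `1` is the `hate` class.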

In [12]:
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name, device_map="auto")
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name, device_map="auto")
print(toxicity_model.config.id2label)

{0: 'nothate', 1: 'hate'}


Tokenize a non-toxic text and pass it to the model. Print the output logits, the probabilities, and the corresponding reward that will be used for fine-tuning.

In [13]:
non_toxic_text = "#Person 1# tells Tommy that he didn't like the movie."

toxicity_input_ids = toxicity_tokenizer(non_toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(input_ids=toxicity_input_ids.to(device)).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# get the logits for "not hate" - this is the reward!
not_hate_index = 0
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (high): {nothate_reward}')

logits [not hate, hate]: [3.114100694656372, -2.4896175861358643]
probabilities [not hate, hate]: [0.9963293671607971, 0.0036706167738884687]
reward (high): [3.114100694656372]
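
To make the relation between logits, probabilities, and reward concrete, here is a small stand-alone sketch that recomputes the probabilities from the logits printed above. It uses plain Python rather than `torch`, and the `softmax` helper is written here only for illustration.

```python
import math

def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.114100694656372, -2.4896175861358643]  # [nothate, hate], copied from the output above
probs = softmax(logits)
reward = logits[0]  # the raw `nothate` logit, not the probability, is the reward

print(probs)
print(reward)
```

The reward is the raw logit, so it is unbounded and can be negative, whereas the probabilities always lie between 0 and 1.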


Do the same for a toxic text. Because this text is more toxic, it should receive a lower reward.

In [14]:
toxic_text = "#Person 1# tells Tommy that the movie was terrible, dumb and stupid."

toxicity_input_ids = toxicity_tokenizer(toxic_text, return_tensors="pt").input_ids

logits = toxicity_model(toxicity_input_ids.to(device)).logits
print(f'logits [not hate, hate]: {logits.tolist()[0]}')

# Print the probabilities for [not hate, hate]
probabilities = logits.softmax(dim=-1).tolist()[0]
print(f'probabilities [not hate, hate]: {probabilities}')

# Get the logits for "not hate" - this is the reward!
nothate_reward = (logits[:, not_hate_index]).tolist()
print(f'reward (low): {nothate_reward}')

logits [not hate, hate]: [-0.6921164393424988, 0.37227070331573486]
probabilities [not hate, hate]: [0.2564719617366791, 0.7435280084609985]
reward (low): [-0.6921164393424988]


Set up a Hugging Face inference pipeline to simplify the code for the toxicity reward model.

In [15]:
sentiment_pipe = pipeline("sentiment-analysis",
                          model=toxicity_model_name,
                          device=device)
reward_logits_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # Set to "none" to retrieve raw logits.
    "batch_size": 16
}

reward_probabilities_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "softmax", # Set to "softmax" to apply softmax and retrieve probabilities.
    "batch_size": 16
}

print("Reward model output:")
print("For non-toxic text")
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))
print("For toxic text")
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

Reward model output:
For non-toxic text
[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]
For toxic text
[{'label': 'hate', 'score': 0.37227070331573486}, {'label': 'nothate', 'score': -0.6921164393424988}]
[{'label': 'hate', 'score': 0.7435280084609985}, {'label': 'nothate', 'score': 0.25647199153900146}]


The outputs contain the scores for both the `nothate` (positive) and `hate` (negative) classes. PPO, however, uses only the logit of the `nothate` class as the positive reward signal to detoxify the LLM outputs.
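
Note that in the outputs above the entries are listed in descending score order, so `hate` comes first for the toxic text. A minimal sketch of reading the `nothate` score by label instead of by position follows; the helper name `score_for_label` is hypothetical, and the two lists are copied from the raw-logit outputs above.

```python
def score_for_label(pipeline_output, label="nothate"):
    """Return the score of `label` from one pipeline result, whatever the list order."""
    for item in pipeline_output:
        if item["label"] == label:
            return item["score"]
    raise KeyError(f"label {label!r} not found")

# Raw-logit outputs copied from the cell above.
non_toxic_out = [{'label': 'nothate', 'score': 3.114100694656372},
                 {'label': 'hate', 'score': -2.4896175861358643}]
toxic_out = [{'label': 'hate', 'score': 0.37227070331573486},
             {'label': 'nothate', 'score': -0.6921164393424988}]

print(score_for_label(non_toxic_out))  # 3.114100694656372
print(score_for_label(toxic_out))      # -0.6921164393424988
```

Taking position `0` of `toxic_out` would return the `hate` logit instead, which is why a lookup by label is the safer choice.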

In [16]:
print(sentiment_pipe(non_toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(non_toxic_text, **reward_probabilities_kwargs))

[{'label': 'nothate', 'score': 3.114100694656372}, {'label': 'hate', 'score': -2.4896175861358643}]
[{'label': 'nothate', 'score': 0.9963293671607971}, {'label': 'hate', 'score': 0.003670616541057825}]


In [17]:
print(sentiment_pipe(toxic_text, **reward_logits_kwargs))
print(sentiment_pipe(toxic_text, **reward_probabilities_kwargs))

[{'label': 'hate', 'score': 0.37227070331573486}, {'label': 'nothate', 'score': -0.6921164393424988}]
[{'label': 'hate', 'score': 0.7435280084609985}, {'label': 'nothate', 'score': 0.25647199153900146}]


<a name='2.3'></a>
### 2.3 - Evaluate Toxicity

To evaluate the model before and after fine-tuning/detoxification, set up the [toxicity evaluation metric](https://huggingface.co/spaces/evaluate-measurement/toxicity). The **toxicity score** is a decimal value between 0 and 1, where 1 is the most toxic.

In [18]:
toxicity_evaluator = evaluate.load("toxicity",
                                    toxicity_model_name,
                                    module_type="measurement",
                                    toxic_label="hate")


Compute the toxicity of the same sentences as in section [2.2](#2.2). As expected, the toxicity score is the probability of the `hate` class returned by the reward model.

In [19]:
toxicity_score = toxicity_evaluator.compute(predictions=[
    non_toxic_text
])

print("Toxicity score for non-toxic text:")
print(toxicity_score["toxicity"])

toxicity_score = toxicity_evaluator.compute(predictions=[
    toxic_text
])

print("\nToxicity score for toxic text:")
print(toxicity_score["toxicity"])

Toxicity score for non-toxic text:
[0.003670616541057825]

Toxicity score for toxic text:
[0.7435289621353149]


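Use this evaluator to compute the toxicity of the dialogues prepared in section [2.1](#2.1). Pass in the test dataset (`dataset["test"]`), the same tokenizer that was used in that section, the frozen PEFT model (`ref_model`) created there, and the toxicity evaluator. For convenience, the required steps are wrapped in the function `evaluate_toxicity`.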

In [26]:
def evaluate_toxicity(model,
                      toxicity_evaluator,
                      tokenizer,
                      dataset,
                      num_samples):

    """
    Generate a summary for each sample and measure the toxicity of the query plus the summary.

    Parameters:
    - model (trl model): Model to be evaluated.
    - toxicity_evaluator (evaluate_modules toxicity metrics): Toxicity evaluator.
    - tokenizer (transformers tokenizer): Tokenizer to be used.
    - dataset (dataset): Input dataset for the evaluation.
    - num_samples (int): Maximum number of samples for the evaluation.

    Returns:
    tuple: A tuple containing two numpy.float64 values:
    - mean (numpy.float64): Mean of the samples toxicity.
    - std (numpy.float64): Standard deviation of the samples toxicity.
    """

    max_new_tokens=100

    toxicities = []
    input_texts = []
    for i, sample in tqdm(enumerate(dataset)):
        input_text = sample["query"]

        # Note: with `>`, the loop processes num_samples + 1 samples before it stops.
        if i > num_samples:
            break

        input_ids = tokenizer(input_text, return_tensors="pt", padding=True).input_ids

        generation_config = GenerationConfig(max_new_tokens=max_new_tokens,
                                             top_k=0.0,
                                             top_p=1.0,
                                             do_sample=True)

        response_token_ids = model.generate(input_ids=input_ids.to(device),
                                            generation_config=generation_config)

        generated_text = tokenizer.decode(response_token_ids[0], skip_special_tokens=True)

        toxicity_score = toxicity_evaluator.compute(predictions=[(input_text + " " + generated_text)])

        toxicities.extend(toxicity_score["toxicity"])

    # Compute mean & std using np.
    mean = np.mean(toxicities)
    std = np.std(toxicities)

    return mean, std

Run the toxicity evaluation on the model before fine-tuning/detoxification.

In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_name, device_map="auto")

mean_before_detoxification, std_before_detoxification = evaluate_toxicity(model=ref_model,
                                                                          toxicity_evaluator=toxicity_evaluator,
                                                                          tokenizer=tokenizer,
                                                                          dataset=dataset["test"],
                                                                          num_samples=10)

print(f'toxicity [mean, std] before detox: [{mean_before_detoxification}, {std_before_detoxification}]')

11it [01:27,  7.96s/it]

toxicity [mean, std] before detox: [0.022183267939412457, 0.030635774486926608]





<a name='3'></a>
## 3 - Perform Fine-Tuning to Detoxify the Summaries

Optimize the RL policy against the reward model using Proximal Policy Optimization (PPO).

<a name='3.1'></a>
### 3.1 - Initialize `PPOTrainer`

Initializing `PPOTrainer` requires a collator. Here it is a function that turns a list of dictionaries into a dictionary of lists. You can define and test it below.

In [22]:
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])

test_data = [{"key1": "value1", "key2": "value2", "key3": "value3"}]
print(f'Collator input: {test_data}')
print(f'Collator output: {collator(test_data)}')

Collator input: [{'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}]
Collator output: {'key1': ['value1'], 'key2': ['value2'], 'key3': ['value3']}
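
The single-item test above does not show much of the transformation. The sketch below runs the same `collator` on a batch of two samples; the field values are made up for illustration.

```python
def collator(data):
    # Same function as in the cell above: list of dicts -> dict of lists.
    return dict((key, [d[key] for d in data]) for key in data[0])

# Hypothetical batch with two samples of different lengths.
batch = [
    {"query": "first prompt", "input_ids": [1, 2, 3]},
    {"query": "second prompt", "input_ids": [4, 5]},
]
collated = collator(batch)
print(collated)
```

The values are only regrouped by key. Nothing is padded or stacked, so `input_ids` of different lengths stay in a plain Python list.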


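Set up the configuration parameters and load the `ppo_model` and the tokenizer. Also load the frozen version of the model, `ref_model`. The first model is optimized, while the second serves as the reference for computing the KL divergence from the starting point. This works as an additional reward signal during PPO training, so that the optimized model does not drift too far from the original LLM.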

In [23]:
learning_rate=1.41e-5
max_ppo_epochs=1
mini_batch_size=4
batch_size=16

config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

ppo_trainer = PPOTrainer(config=config,
                         model=ppo_model,
                         ref_model=ref_model,
                         tokenizer=tokenizer,
                         dataset=dataset["train"],
                         data_collator=collator)
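
The role of the KL-divergence term can be pictured as a penalty subtracted from the reward. The following is a simplified sketch, not the actual `trl` implementation: the function name, the coefficient `beta`, and the log-probability values are all hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.2):
    """Subtract a KL estimate (sum of per-token log-prob differences) from the reward."""
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

# Hypothetical per-token log-probabilities for one generated summary.
ref_logprobs = [-0.7, -1.5, -0.2]
policy_logprobs = [-0.5, -1.0, -0.2]

same = kl_penalized_reward(1.0, ref_logprobs, ref_logprobs)        # policy identical to reference
drifted = kl_penalized_reward(1.0, policy_logprobs, ref_logprobs)  # policy has moved away

print(same, drifted)
```

When the policy assigns the same log-probabilities as the reference model, the reward is unchanged; the further it drifts, the larger the penalty.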

<a name='3.2'></a>
### 3.2 - Fine-Tune the Model

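The fine-tuning loop consists of the following main steps:
1. Get the query responses from the policy LLM (PEFT model).
2. Get the sentiment (reward) of each response from the hate speech RoBERTa model.
3. Optimize the policy with PPO using the (query, response, reward) triplet.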

The operation is running if you see the following metrics appearing:
* `objective/kl`: minimize kl divergence,
* `ppo/returns/mean`: maximize mean returns,
* `ppo/policy/advantages_mean`: maximize advantages.



In [24]:
output_min_length = 100
output_max_length = 400
output_length_sampler = LengthSampler(output_min_length, output_max_length)

generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True
}

reward_kwargs = {
    "top_k": None, # Return all scores.
    "function_to_apply": "none", # You want the raw logits without softmax.
    "batch_size": 16
}

max_ppo_steps = 10

for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    # Break when you reach max_steps.
    if step >= max_ppo_steps:
        break

    prompt_tensors = batch["input_ids"]

    # Get response from FLAN-T5/PEFT LLM.
    summary_tensors = []

    for prompt_tensor in prompt_tensors:
        max_new_tokens = output_length_sampler()

        generation_kwargs["max_new_tokens"] = max_new_tokens
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)

        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # This needs to be called "response".
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward outputs.
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = sentiment_pipe(query_response_pairs, **reward_kwargs)

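    # Use the `nothate` score as the reward. Select it by label rather than by position,
    # because the pipeline returns the scores sorted in descending order.
    reward_tensors = [torch.tensor(next(item["score"] for item in reward if item["label"] == "nothate")) for reward in rewards]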

    # Run PPO step.
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')
    print(f'ppo/policy/advantages_mean: {stats["ppo/policy/advantages_mean"]}')
    print('-'.join('' for x in range(100)))

1it [00:40, 40.50s/it]

objective/kl: 0.0
ppo/returns/mean: 0.6619201898574829
ppo/policy/advantages_mean: -4.2552670720397145e-08
---------------------------------------------------------------------------------------------------


2it [01:33, 48.09s/it]

objective/kl: 0.019753267988562584
ppo/returns/mean: 0.4865519404411316
ppo/policy/advantages_mean: 2.640093921257858e-09
---------------------------------------------------------------------------------------------------


3it [02:13, 44.11s/it]

objective/kl: -0.014077756553888321
ppo/returns/mean: 0.5967159271240234
ppo/policy/advantages_mean: -2.8326697432135006e-08
---------------------------------------------------------------------------------------------------


4it [02:46, 39.86s/it]

objective/kl: 0.042236704379320145
ppo/returns/mean: 0.5421863794326782
ppo/policy/advantages_mean: 2.631889373105878e-09
---------------------------------------------------------------------------------------------------


5it [03:26, 39.90s/it]

objective/kl: -0.007229343056678772
ppo/returns/mean: 0.6220415830612183
ppo/policy/advantages_mean: 1.2450429132115914e-08
---------------------------------------------------------------------------------------------------


6it [03:59, 37.64s/it]

objective/kl: -0.02252703160047531
ppo/returns/mean: 0.6286816596984863
ppo/policy/advantages_mean: -4.86919127240526e-09
---------------------------------------------------------------------------------------------------


7it [04:37, 37.69s/it]

objective/kl: 0.022838713601231575
ppo/returns/mean: 0.6580877304077148
ppo/policy/advantages_mean: 6.1684293228836395e-09
---------------------------------------------------------------------------------------------------


8it [05:12, 36.94s/it]

objective/kl: -0.0021027810871601105
ppo/returns/mean: 0.6942576169967651
ppo/policy/advantages_mean: 2.235320550880715e-08
---------------------------------------------------------------------------------------------------


9it [05:54, 38.29s/it]

objective/kl: 0.01859004981815815
ppo/returns/mean: 0.5624541640281677
ppo/policy/advantages_mean: 3.325614272853272e-08
---------------------------------------------------------------------------------------------------


10it [06:34, 39.45s/it]

objective/kl: -0.14429761469364166
ppo/returns/mean: 0.5977291464805603
ppo/policy/advantages_mean: 2.8035724852770727e-08
---------------------------------------------------------------------------------------------------







<a name='3.3'></a>
### 3.3 - Evaluate the Model Quantitatively

Evaluate the toxicity of the RL-fine-tuned PPO/PEFT model (`ppo_model`, still in memory after the training loop) on the held-out test dataset.

In [27]:
mean_after_detoxification, std_after_detoxification = evaluate_toxicity(model=ppo_model,
                                                                        toxicity_evaluator=toxicity_evaluator,
                                                                        tokenizer=tokenizer,
                                                                        dataset=dataset["test"],
                                                                        num_samples=10)
print(f'toxicity [mean, std] after detox: [{mean_after_detoxification}, {std_after_detoxification}]')

11it [00:28,  2.57s/it]

toxicity [mean, std] after detox: [0.029684017020785672, 0.03683596797140133]





Compare the toxicity scores of the reference model (before detoxification) and the fine-tuned model (after detoxification).

In [28]:
mean_improvement = (mean_before_detoxification - mean_after_detoxification) / mean_before_detoxification
std_improvement = (std_before_detoxification - std_after_detoxification) / std_before_detoxification

print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Percentage improvement of toxicity score after detoxification:
mean: -33.81%
std: -20.24%
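
The percentages above can be recomputed from the printed means and standard deviations. The sketch below only repeats that arithmetic; the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(before, after):
    # Positive: the value went down (less toxic). Negative: the value went up.
    return (before - after) / before

# Values copied from the two evaluation outputs in this notebook.
mean_before, std_before = 0.022183267939412457, 0.030635774486926608
mean_after, std_after = 0.029684017020785672, 0.03683596797140133

mean_change = relative_improvement(mean_before, mean_after) * 100
std_change = relative_improvement(std_before, std_after) * 100

print(f"mean: {mean_change:.2f}%")
print(f"std: {std_change:.2f}%")
```

A negative value means the measured toxicity increased in this run. The change in the mean (about 0.0075) is smaller than the standard deviation (about 0.03), and the evaluation covers only 11 samples after 10 PPO steps, so this result is probably best read as noise rather than a real regression.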


<a name='3.4'></a>
### 3.4 - Evaluate the Model Qualitatively

Inspect a few examples from the test dataset. Compare the original `ref_model` with the fine-tuned/detoxified `ppo_model` using the toxicity evaluator.

In [29]:
batch_size = 20
compare_results = {}

df_batch = dataset["test"][0:batch_size]

compare_results["query"] = df_batch["query"]
prompt_tensors = df_batch["input_ids"]

summary_tensors_ref = []
summary_tensors = []

# Get response from ppo and base model.
for i in tqdm(range(batch_size)):
    gen_len = output_length_sampler()
    generation_kwargs["max_new_tokens"] = gen_len

    summary = ref_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors_ref.append(summary)

    summary = ppo_model.generate(
        input_ids=torch.as_tensor(prompt_tensors[i]).unsqueeze(dim=0).to(device),
        **generation_kwargs
    ).squeeze()[-gen_len:]
    summary_tensors.append(summary)

# Decode responses.
compare_results["response_before"] = [tokenizer.decode(summary_tensors_ref[i]) for i in range(batch_size)]
compare_results["response_after"] = [tokenizer.decode(summary_tensors[i]) for i in range(batch_size)]

# Sentiment analysis of query/response pairs before/after.
# Select the `nothate` score by label, because the pipeline returns scores sorted in descending order.
texts_before = [d + s for d, s in zip(compare_results["query"], compare_results["response_before"])]
rewards_before = sentiment_pipe(texts_before, **reward_kwargs)
compare_results["reward_before"] = [next(item["score"] for item in reward if item["label"] == "nothate") for reward in rewards_before]

texts_after = [d + s for d, s in zip(compare_results["query"], compare_results["response_after"])]
rewards_after = sentiment_pipe(texts_after, **reward_kwargs)
compare_results["reward_after"] = [next(item["score"] for item in reward if item["label"] == "nothate") for reward in rewards_after]

100%|██████████| 20/20 [01:08<00:00,  3.40s/it]


Store the results in a DataFrame and review them.

In [30]:
pd.set_option('display.max_colwidth', 500)
df_compare_results = pd.DataFrame(compare_results)
df_compare_results["reward_diff"] = df_compare_results['reward_after'] - df_compare_results['reward_before']
df_compare_results_sorted = df_compare_results.sort_values(by=['reward_diff'], ascending=False).reset_index(drop=True)
df_compare_results_sorted

| | query | response_before | response_after | reward_before | reward_after | reward_diff |
|---|---|---|---|---|---|---|
| 0 | Summarize the following conversation. #Person1#: Here is the final draft of our contract. I'm glad that we have reached an agreement on almost every term in our trade. #Person2#: Yes, it seems to me we have come quite a long way. However, let me take a close look at the final draft. #Person1#: Do you have some points to bring up? #Person2#: Well, everything we've discussed seems to be here. #Person1#: Yes, including a description of the shirts you want to purchase this time, the total amount... | <pad> The contract between the collector and the shirt owner has been finalized and the positions differ from a preliminary presenter. The agreement ends with the buyer figuring out all the terms of the contract carefully. The woman offers to sign the contract on the bloody day.</s> | <pad> The contract signing is completed. The contract includes a description of the shirts to purchase, the total amount of the order, price for each piece, mode of payment, packaging, shipping time, insurance and compensation, claim and arbitration, and rights and duties for both sides. The sample 25 is the standard for others and is complete. The final draft is reviewed and asked to sign.</s> | 1.512996 | 3.590135 | 2.077139 |
| 1 | Summarize the following conversation. #Person1#: So how did you like the restaurant? #Person2#: Actually, it could have been better. #Person1#: What didn't you like about it? #Person2#: It is a new restaurant. I don't think they have their act together yet. #Person1#: What did you think about the food? #Person2#: I felt that the food was pretty mediocre. #Person1#: The service wasn't that great, either. #Person2#: I agree. The service was not good. #Person1#: Do you think that you want to tr... | <pad> A restaurant criticizes the current restaurant the someone recommended. According to the speaker, there aren't any official plans to open another restaurant. The food is mediocre, but the service is not as good, as is the food delivery. The speaker feels that he hasn't found the right place to try again.</s> | <pad> The restaurant is new and inexperienced actually, due to the lack of a joint. The reviews are negative, however, and dinner was lacking in quality. The service was poor, and they didn't come back.</s> | 2.38886 | 3.27986 | 0.891 |
| 2 | Summarize the following conversation. #Person1#: Could you help me figure out how to look for a job? #Person2#: We have lots of options, what type of job do you need? #Person1#: I want to work in an office. #Person2#: Do you want to work part-time or full-time? #Person1#: I want to work full-time. #Person2#: We have binders with local job listings or you can make use of the computers. OK? #Person1#: I am confused a bit but I am sure that I can figure it out. #Person2#: If you make an appoint... | <pad> "An interviewer asks someone about looking for a job in an office. They take a personal test, such as doing a background check on a worker and seeing different responsibilities, such as payroll, benefits, experience, and citizenship. The employer suggests the person try to locate a job in an office and presently rate working with them on other aspects of their work.</s> | <pad> The job center offers help in terms of looking for jobs in an office like customer service, computer skills, and other disciplines. The job center offers suitable jobs for both full-time and full-time employees, and visits local job listings from computers, print ads, and computer guides. Persons have their doubts about talking to a job counselor.</s> | 2.204305 | 2.962445 | 0.75814 |
| 3 | Summarize the following conversation. #Person1#: Could you help me, Sir? My flight got in 15 minutes ago. Everyone else has picked up the luggage but mine hasn't come through. #Person2#: I'm sorry, Madam, I'll go and find out if there is any more to come. Summary: </s> | <pad> "The conversation is about a flight going in 15 minutes and still not arriving," Mr. Evans tells the person. The person asks why Alice's flight hasn't been cancelled and the woman asks if they can solve the issue. The person declines, agreeing to return and finds out if there is a problem.</s> | <pad> "Another flight arrived fifteen minutes late for Person1 and everyone else, but her flight has not found a way in". Someone apologized and advised them that they would ask if there is any more to be done.</s> | 2.113165 | 2.662541 | 0.549376 |
| 4 | Summarize the following conversation. #Person1#: What can I do for you, madam? #Person2#: I'd like to buy a toy car for my son. #Person1#: How about this one? #Person2#: It looks nice. How much is it? #Person1#: They're three hundred dollars. #Person2#: Oh, I'm afraid it's too expensive. Can you show me something cheaper? #Person1#: OK, This one is one hundred and twenty. It's the cheapest here. #Person2#: OK, I'll take it. Here's the money. #Person1#: Thank you very much. Summary: </s> | <pad> Person1 is buying a toy car for her son but has a difficult time finding one suitable for his age. They are having a conversation about a cheaper one which is one hundred and twenty. The person decides to take the bought car.</s> | <pad> In a call, the secretary of the Toy Car Registry is helping Person1 purchase a toy car for his son, talking about it in detail. The difficulty of choosing a toy car is being covered by the desk clerk who concludes that the total price is three hundred dollars. If the person agrees, the car will be offered for a discount and the person is offered the correct amount of money.</s> | 1.426254 | 1.800506 | 0.374251 |
| 5 | Summarize the following conversation. #Person1#: Where shall I register, please? #Person2#: Here. Do you have a registration card? #Person1#: Yes. Here you are. #Person2#: Please register your information here and pay for it. And I'll make a medical record for you. #Person1#: OK. How much do I need to pay for the registration? #Person2#: Please pay ten yuan for the registration. #Person1#: Here is my money. #Person2#: This is your registration card. Please don't lose it and bring it whenever... | <pad> A person asks that they pay for a medical record and have the registration card provided. The person replies that he has a registration card and pays for it.</s> | <pad> "Please register and pay ten yuan before the consultation." Customer is asked where to go to register his information and be guided through the procedure. After logging in, the consumer receives a welcome notice on their registration card. The patient is advised to follow the procedure and leave after the consultation is to the drugstore.</s> | 1.461796 | 1.780276 | 0.31848 |
6,"Summarize the following conversation. #Person1#: Hello? #Person2#: Hello? #Person1#: Can I speak to Li Hong, please? #Person2#: Speaking. #Person1#: Hi, Li Hong. This is Alice. #Person2#: Hi, Alice. How are you? #Person1#: Not bad. Li Hong, I am sorry that I can't go to see Mrs. Brown with you tomorrow morning. My mother is ill. I must take care of her. #Person2#: I'm sorry to hear that. You'd better stay at home. After all, we can visit Mrs. Brown later #Person1#: OK. Bye - bye. #Person2#: ...",<pad> Li Hong apologizes to Alice's mother in distress and sends her to hospital. Alice agrees.</s>,<pad> Alice cannot meet with Li Hong tomorrow morning because her mother is ill and she will care for her. He apologizes and agrees to go home to visit Mrs. Brown later.</s>,1.373268,1.596571,0.223303
7,"Summarize the following conversation. #Person1#: It smells like an ashtray in here! #Person2#: Hi honey! What's wrong? Why do you have that look on your face? #Person1#: What's wrong? I thought we agreed that you were gonna quit smoking. #Person2#: No! I said I was going to cut down which is very different. You can't just expect me to go cold turkey overnight! #Person1#: Look, there are other ways to quit. You can try the nicotine patch, or nicotine chewing gum. We spend a fortune on cigaret...","<pad> Person1 complains to their partner that it's not like they are quitting smoking and wants to stop body odorism. Daniel advises him to consider alternative solutions. They plan on attempting nicotine patches or chewing gum, but are not certain if they will achieve the maximum effect. Meanwhile, the teen wishes to divorce the person if he doesn't have the willpower to stop.</s>","<pad> People complain about their behavior on social media like to have a look of ashtray on their face. They suggest they use nicotine patches, chewing gum, and smoking bans instead. They try craving tobacco. However, they need a decision after explaining to the others that they want a divorce.</s>",1.814415,2.017397,0.202982
8,"Summarize the following conversation. #Person1#: Mom, I just finished my paper. Can you proofread it before I hand it in? #Person2#: Sure, let's take a look. Sweetie, this is terrific. Your ideas are so original. #Person1#: Thanks. #Person2#: I can tell you worked hard on it. #Person1#: I really did! I started thinking about what I wanted to say three weeks ago. #Person2#: Well, it was definitely worth all the time. #Person1#: Let's just hope my teacher agrees. Summary: </s>","<pad> Person1 summarizes his paper and asks his mom to proofread it before handing in it to her. Father Thomas sends them a photo of the original paper, explaining that ""its worth all the time.""</s>","<pad> The teacher proofreads the paper, congratulating little person one on it, sent it in.</s>",2.486018,2.687668,0.20165
9,"Summarize the following conversation. #Person1#: Oh, my God! What's this? #Person2#: What? #Person1#: Look! This window is open. #Person2#: Did you open it before we left? #Person1#: Are you kidding? It's winter. Why would I open it? #Person2#: I don't know. Wait. Is this yours? #Person1#: No! Oh, my God! Someone has broken into the house. #Person2#: It looks that way. That's probably why the door wasn't locked when we came in. #Person1#: I locked it when I left though. #Person2#: Yes, but t...","<pad> A search reveals that a robber has not been present and has not re-locked the house. The robber left through the door to gain entry and then left through the door, leading to illegal entry. As a warning, the police do not suspect there will be someone up there.</s>",<pad> Persons contact the police at a crime scene and learn that someone has broken into their house in winter. The robber can't unlock the window but instead left through the door because it wasn't tied.</s>,2.02351,1.976732,-0.046778


You can see a considerable difference in the mean/median reward scores of the generated sentences.
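As a rough sketch of how that mean/median comparison can be checked, the snippet below recomputes the statistics from the three numeric columns of rows 1-9 in the table above. The variable names `reward_before`, `reward_after`, and `reward_diff` are assumptions chosen for illustration, since the table header is not shown in this part of the output. Only the rows visible here are used, so the figures it prints are not the notebook's full evaluation results.

```python
from statistics import mean, median

# Reward values copied from rows 1-9 of the table above.
# Assumption: the first numeric column is the reward before detoxification
# and the second is the reward after it (the header is not visible here).
reward_before = [2.38886, 2.204305, 2.113165, 1.426254, 1.461796,
                 1.373268, 1.814415, 2.486018, 2.02351]
reward_after = [3.27986, 2.962445, 2.662541, 1.800506, 1.780276,
                1.596571, 2.017397, 2.687668, 1.976732]

# Per-row change in reward; a positive value means the summary produced
# after fine-tuning scored higher than the one produced before.
reward_diff = [after - before for before, after in zip(reward_before, reward_after)]

summary = {
    "mean_before": mean(reward_before),
    "mean_after": mean(reward_after),
    "median_before": median(reward_before),
    "median_after": median(reward_after),
    "mean_diff": mean(reward_diff),
}

for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Reporting both statistics is deliberate: the mean reflects the overall shift in reward, while the median is less sensitive to a few rows with large changes, so on a sample this small the two can tell slightly different stories.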