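# **Fine-Tuning a Generative Model for Dialogue Summarization**
In this notebook, you will load an existing LLM from Hugging Face and fine-tune it for improved dialogue summarization. You will use FLAN-T5, which provides a high-quality instruction-tuned model. To improve inference quality, you will first perform full fine-tuning and evaluate the results with ROUGE metrics. You will then perform Parameter Efficient Fine-Tuning (PEFT), evaluate the resulting model, and see that the benefits of PEFT outweigh its slightly lower performance.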

Table of Contents
- 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM
  - 1.1 - Set up Kernel and Required Dependencies
  - 1.2 - Load Dataset and LLM
  - 1.3 - Test the Model with Zero Shot Inferencing
- 2 - Perform Full Fine-Tuning
  - 2.1 - Preprocess the Dialog-Summary Dataset
  - 2.2 - Fine-Tune the Model with the Preprocessed Dataset
  - 2.3 - Evaluate the Model Qualitatively (Human Evaluation)
  - 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)
- 3 - Perform Parameter Efficient Fine-Tuning (PEFT)
  - 3.1 - Setup the PEFT/LoRA model for Fine-Tuning
  - 3.2 - Train PEFT Adapter
  - 3.3 - Evaluate the Model Qualitatively (Human Evaluation)
  - 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

## 1 - Set up Kernel, Load Required Dependencies, Dataset and LLM

### 1.1 - Set up Kernel and Required Dependencies

Install the packages required for the LLM and the dataset.

In [1]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    loralib==0.1.1 \
    peft==0.3.0 --quiet

Collecting pip
  Downloading pip-23.3.2-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.2

Import the required components. Some of them are new and will be covered later.

In [2]:
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig, TrainingArguments, Trainer
import torch
import time
import evaluate
import pandas as pd
import numpy as np

### 1.2 - Load Dataset and LLM

You will work with the DialogSum dataset from Hugging Face. It contains 10,000+ dialogues with corresponding manually labeled summaries and topics.

In [3]:
huggingface_dataset_name = "knkarthick/dialogsum"

dataset = load_dataset(huggingface_dataset_name)

dataset


DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary', 'topic'],
        num_rows: 500
    })
})

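Load the pre-trained FLAN-T5 model and its corresponding tokenizer. Note that this notebook uses the base version of FLAN-T5 (`google/flan-t5-base`). Setting `torch_dtype=torch.bfloat16` specifies the data type the model weights are loaded in, which determines how much memory they use.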

In [56]:
model_name='google/flan-t5-base'

original_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

You can print the total number of model parameters and how many of them are trainable. The function below does this; you do not need to understand its details.

In [5]:
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(original_model))

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%
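As a rough, back-of-the-envelope check on why the data type matters, the parameter count printed above can be turned into a weight-memory estimate. This sketch assumes 2 bytes per parameter for `bfloat16` and 4 bytes for `float32`, and it counts only the weights; the gradients, optimizer state, and activations needed during training are not included.

```python
# Parameter count reported for google/flan-t5-base in the output above.
num_params = 247_577_856

# Assumed storage sizes: bfloat16 uses 2 bytes per value, float32 uses 4.
BYTES_PER_PARAM = {"bfloat16": 2, "float32": 4}

# Total bytes needed just to hold the weights in each data type.
weight_bytes = {dtype: num_params * size for dtype, size in BYTES_PER_PARAM.items()}

for dtype, total in weight_bytes.items():
    print(f"{dtype}: {total / 1024**2:.1f} MiB for the weights alone")
```

Loading in `bfloat16` therefore halves the weight footprint relative to `float32`, which leaves more GPU memory for everything else.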


### 1.3 - Test the Model with Zero Shot Inferencing



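Test the model with zero-shot inference. The model struggles to summarize the dialogue compared with the human baseline summary, but it does pull some important information out of the text, which suggests that fine-tuning can improve it.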

In [38]:
original_model.to('cuda')

index = 200

dialogue = dataset['test'][index]['dialogue']
summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

inputs = tokenizer(prompt, return_tensors='pt').to('cuda')
output = tokenizer.decode(
    original_model.generate(
        inputs["input_ids"],
        max_new_tokens=200,
    )[0],
    skip_special_tokens=True
)

dash_line = '-'.join('' for x in range(100))
print(dash_line)
print(f'INPUT PROMPT:\n{prompt}')
print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{summary}\n')
print(dash_line)
print(f'MODEL GENERATION - ZERO SHOT:\n{output}')

original_model.to('cpu')
inputs.to('cpu')

---------------------------------------------------------------------------------------------------
INPUT PROMPT:

Summarize the following conversation.

#Person1#: Have you considered upgrading your system?
#Person2#: Yes, but I'm not sure what exactly I would need.
#Person1#: You could consider adding a painting program to your software. It would allow you to make up your own flyers and banners for advertising.
#Person2#: That would be a definite bonus.
#Person1#: You might also want to upgrade your hardware because it is pretty outdated now.
#Person2#: How can we do that?
#Person1#: You'd probably need a faster processor, to begin with. And you also need a more powerful hard disc, more memory and a faster modem. Do you have a CD-ROM drive?
#Person2#: No.
#Person1#: Then you might want to add a CD-ROM drive too, because most new software programs are coming out on Cds.
#Person2#: That sounds great. Thanks.

Summary:

-------------------------------------------------------------------


## 2 - Perform Full Fine-Tuning

### 2.1 - Preprocess the Dialog-Summary Dataset

You need to convert the dialogue-summary (prompt-response) pairs into explicit instructions for the LLM. Prepend the instruction `Summarize the following conversation` to the dialogue and add `Summary` before the summary, as follows:

Training prompt (dialogue):
```
Summarize the following conversation.
    Chris: This is his part of the conversation.
    Antje: This is her part of the conversation.
Summary:
```
Training response (summary):
```
Both Chris and Antje participated in the conversation.
```
Then preprocess the prompt-response dataset into tokens and pull out their input_ids (1 per token).
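The prompt template above can be sketched as a small helper. The name `build_prompt` is hypothetical and not part of the notebook; it only mirrors the start and end strings that `tokenize_function` uses in the following cell.

```python
START_PROMPT = "Summarize the following conversation.\n\n"
END_PROMPT = "\n\nSummary: "

def build_prompt(dialogue: str) -> str:
    """Wrap a raw dialogue in the summarization instruction (hypothetical helper)."""
    return START_PROMPT + dialogue + END_PROMPT

example_dialogue = (
    "Chris: This is his part of the conversation.\n"
    "Antje: This is her part of the conversation."
)
print(build_prompt(example_dialogue))
```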

In [39]:
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids

    return example

# The dataset actually contains 3 diff splits: train, validation, test.
# The tokenize_function code is handling all data across all splits in batches.
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary',])
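To make `padding="max_length"` and `truncation=True` concrete, here is a simplified pure-Python sketch of their effect on a list of token ids. `pad_or_truncate` is a hypothetical helper, the ids are made up, and a pad id of 0 is assumed (the pad token id of the T5 tokenizer). The real tokenizer also takes care of special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return exactly max_length ids: cut long inputs, right-pad short ones."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

short_ids = [12198, 1635, 1737]                # made-up ids, shorter than max_length
long_ids = [12198, 1635, 1737, 8, 826, 3634]   # made-up ids, longer than max_length

print(pad_or_truncate(short_ids, 5))  # padded on the right with pad_id
print(pad_or_truncate(long_ids, 5))   # cut down to the first 5 ids
```

Because every example ends up with the same length, the resulting id lists can be stacked into fixed-size batches.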




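To save time in the lab, you can subsample the dataset. The filter in the cell below is commented out, so this run keeps the full dataset; uncomment it to keep only every tenth example.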

In [40]:
tokenized_datasets = tokenized_datasets #.filter(lambda example, index: index % 10 == 0, with_indices=True)
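For reference, the commented-out filter keeps every tenth example. The sketch below reproduces that index rule on a plain Python range rather than on a `datasets.Dataset`, using the training-split size reported earlier, to show how many rows would remain.

```python
num_train_rows = 12_460  # size of the train split shown in the DatasetDict output

# Same rule as the commented-out filter: keep rows whose index is a multiple of 10.
kept_indices = [index for index in range(num_train_rows) if index % 10 == 0]

print(len(kept_indices))   # number of rows that would survive the filter
print(kept_indices[:3])    # first few kept indices
```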

Check the shapes of the three dataset splits (train, validation, test).

In [41]:
print(f"Shapes of the datasets:")
print(f"Training: {tokenized_datasets['train'].shape}")
print(f"Validation: {tokenized_datasets['validation'].shape}")
print(f"Test: {tokenized_datasets['test'].shape}")

print(tokenized_datasets)

Shapes of the datasets:
Training: (12460, 2)
Validation: (500, 2)
Test: (1500, 2)
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 12460
    })
    test: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 1500
    })
    validation: Dataset({
        features: ['input_ids', 'labels'],
        num_rows: 500
    })
})


### 2.2 - Fine-Tune the Model with the Preprocessed Dataset

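Now use the built-in Hugging Face `Trainer` class (see the documentation [here](https://huggingface.co/docs/transformers/main_classes/trainer)). Pass the preprocessed dataset and the original model as arguments. The other training parameters are found experimentally, so you do not need to understand them in detail right now.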

In [51]:
instruct_model = AutoModelForSeq2SeqLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to('cuda')

In [52]:
output_dir = f'./dialogue-summary-training-{str(int(time.time()))}'

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    num_train_epochs=1,
    weight_decay=0.01,
    logging_steps=1,
#    max_steps=1,
)

trainer = Trainer(
    model=instruct_model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation']
)

Start training.

In [53]:
trainer.train()



Step,Training Loss


OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 14.75 GiB total capacity; 14.33 GiB already allocated; 13.06 MiB free; 14.54 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In [60]:
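import gc

# Delete references first (ignoring names already removed), then collect
# garbage and release PyTorch's cached GPU memory.
for name in ('trainer', 'instruct_model'):
    globals().pop(name, None)
gc.collect()
torch.cuda.empty_cache()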

### 2.3 - Evaluate the Model Qualitatively (Human Evaluation)

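As with many GenAI applications, a qualitative approach where you ask yourself "Is my model behaving the way it is supposed to?" is usually a good starting point. The example below lets you see how well the fine-tuned model can summarize the dialogue compared with the original model, which could not understand what was being asked.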

In [16]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
print(dash_line)
print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person1#: I'm thinking of upgrading my computer.
---------------------------------------------------------------------------------------------------
INSTRUCT MODEL:
##Person1: Have you considered upgrading your system? ##Person2: Yes, but you might want to add a painting program to your software. ##Person1: I'd like to make up my own flyers and banners. ##Person2: I'd like to make up my own flyers and banners. ##Person1: I'd like to make up my own flyers and banners. ##Person2: I'd like to make up my own flyers and banners. ##Person1: I'd like to make up my own flyers and banners. ##Person2: I'd like to make up my own flyers and banners. ##Person1: I'd like to ma

### 2.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

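The ROUGE metric helps quantify the validity of the summaries a model produces. It compares the model's summaries with a "baseline" summary written by a human. The scores are not perfect, but they indicate the overall improvement in summarization achieved by fine-tuning.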
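To give a feel for what the metric measures, here is a deliberately simplified sketch of ROUGE-1 as a unigram-overlap F1 score. The function name `rouge1_f1` is hypothetical, tokenization is plain whitespace splitting, and no stemming is applied, so the numbers will not match the `evaluate`/`rouge_score` implementation used in this notebook.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over overlapping unigrams (no stemming)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference word can be matched at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the cat", "the cat sat on"))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.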

In [17]:
rouge = evaluate.load('rouge')


In [25]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
instruct_model_summaries = []

for _, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)
    original_model_summaries.append(original_model_text_output)

    instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)
    instruct_model_summaries.append(instruct_model_text_output)

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries'])
df

KeyboardInterrupt: 

Evaluate the models with ROUGE and look for an improvement in the scores.

In [24]:
original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

instruct_model_results = rouge.compute(
    predictions=instruct_model_summaries,
    references=human_baseline_summaries[0:len(instruct_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
print('INSTRUCT MODEL:')
print(instruct_model_results)

IndexError: list index out of range

The results are expected to show a substantial improvement in all ROUGE metrics.

In [None]:
print("Absolute percentage improvement of INSTRUCT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(instruct_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(instruct_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

## 3 - Perform Parameter Efficient Fine-Tuning (PEFT)

Now perform Parameter Efficient Fine-Tuning (PEFT), in contrast to the full fine-tuning done above. PEFT is a form of instruction fine-tuning that is much more efficient than full fine-tuning while producing comparable results, as you will see later.

PEFT is a generic term that includes Low-Rank Adaptation (LoRA) and prompt tuning. In most cases, when someone says PEFT, they mean LoRA. Put simply, LoRA lets you fine-tune a model with far fewer compute resources (in some cases, a single GPU). After fine-tuning for a specific task with LoRA, the original LLM remains unchanged and a newly trained "LoRA adapter" emerges. The LoRA adapter is much smaller than the original LLM: on the order of a single-digit percentage of its size (MBs vs GBs).
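The idea can be sketched in plain Python. A frozen weight matrix `W` is left untouched, and two small trainable matrices `A` (r × d_in) and `B` (d_out × r) add a scaled low-rank update, so the layer computes `W x + (alpha / r) * B (A x)`. The helper names `matvec`, `lora_forward`, and `merge_adapter` are hypothetical and the values are toy numbers; this illustrates the LoRA formulation, not the internals of the `peft` library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update: W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def merge_adapter(W, A, B, alpha, r):
    """Fold the adapter into a single matrix W + (alpha / r) * B A."""
    scale = alpha / r
    return [
        [w + scale * sum(B[i][k] * A[k][j] for k in range(r)) for j, w in enumerate(row)]
        for i, row in enumerate(W)
    ]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weights (2 x 2)
A = [[1.0, 1.0]]               # trainable, r x d_in  (1 x 2)
B = [[0.5], [-0.5]]            # trainable, d_out x r (2 x 1)
x = [1.0, 2.0]

adapted = lora_forward(W, A, B, x, alpha=2, r=1)
merged = matvec(merge_adapter(W, A, B, alpha=2, r=1), x)
print(adapted, merged)  # both paths give the same output
```

Only `A` and `B` would receive gradient updates during training. When `B` is all zeros, which is how LoRA adapters are typically initialized, the layer behaves exactly like the frozen base layer.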



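At inference time, the LoRA adapter must be reunited and combined with the original LLM to serve the inference request.

### 3.1 - Setup the PEFT/LoRA model for Fine-Tuning

With PEFT/LoRA, you freeze the original LLM and train only the adapter. Look at the LoRA configuration below and note the rank (`r`) hyperparameter, which defines the rank/dimension of the adapter to be trained.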

In [55]:
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=32, # Rank
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM # FLAN-T5
)

In [57]:
peft_model = get_peft_model(original_model,
                            lora_config)
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
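The trainable-parameter count above can be reproduced by hand. The sketch below assumes the FLAN-T5 base architecture has a hidden size of 768, 12 encoder blocks with one attention layer each, and 12 decoder blocks with a self-attention and a cross-attention layer each; these architecture figures are assumptions not stated in this notebook, while `r=32` and the `q`/`v` targets come from `lora_config`.

```python
d_model = 768              # assumed hidden size of flan-t5-base
rank = 32                  # r from lora_config
encoder_blocks = 12        # assumed: one self-attention layer per block
decoder_blocks = 12        # assumed: self-attention plus cross-attention per block
targets_per_attention = 2  # the "q" and "v" projections

attention_layers = encoder_blocks * 1 + decoder_blocks * 2
adapted_modules = attention_layers * targets_per_attention

# Each adapted projection gains A (rank x d_model) and B (d_model x rank).
params_per_module = rank * d_model + d_model * rank
lora_params = adapted_modules * params_per_module

base_params = 247_577_856  # count printed for the original model
total_params = base_params + lora_params
print(lora_params, total_params, f"{100 * lora_params / total_params:.2f}%")
```

Under these assumptions the arithmetic matches the printed output: the adapter adds parameters on top of the frozen base model, and only those added parameters are trainable.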


### 3.2 - Train PEFT Adapter

Define the training arguments and create a `Trainer` instance.

In [67]:
output_dir = f'./peft-dialogue-summary-training-{str(int(time.time()))}'

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # Higher learning rate than full fine-tuning.
    num_train_epochs=1,
    logging_steps=1,
    max_steps=100
)

peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
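Because `max_steps` is set, it takes precedence over `num_train_epochs`, so this run covers only part of one epoch. The rough arithmetic below assumes the batch size stays at the `Trainer` default of 8 per device on a single GPU with no gradient accumulation; with `auto_find_batch_size=True` the batch size may be lowered if memory runs out, which would change these numbers.

```python
import math

num_train_rows = 12_460  # size of the train split
batch_size = 8           # assumed: default per-device batch size, one GPU
max_steps = 100          # from peft_training_args

steps_per_epoch = math.ceil(num_train_rows / batch_size)
examples_seen = max_steps * batch_size
fraction_of_epoch = examples_seen / num_train_rows

print(steps_per_epoch, examples_seen, f"{fraction_of_epoch:.1%}")
```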

Train the adapter and save it.

In [68]:
peft_trainer.train()

peft_model_path="./peft-dialogue-summary-checkpoint-local"

peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Step,Training Loss
1,0.3555
2,0.7266
3,0.3535
4,0.498
5,0.3984
6,0.4648
7,0.4805
8,0.4004
9,0.2891
10,0.3359


('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
 './peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
 './peft-dialogue-summary-checkpoint-local/tokenizer.json')

Prepare the model by adding the adapter to the original FLAN-T5 model. Because the goal here is only to run inference with the PEFT model, set `is_trainable=False`. If you intend to continue training, change it to `True`.

In [None]:
# from peft import PeftModel, PeftConfig

# peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base", torch_dtype=torch.bfloat16)
# tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")

# peft_model = PeftModel.from_pretrained(peft_model_base,
#                                        './peft-dialogue-summary-checkpoint-from-s3/',
#                                        torch_dtype=torch.bfloat16,
#                                        is_trainable=False)

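When the adapter is loaded with `is_trainable=False`, the number of trainable parameters is `0`. Because the loading cell above is commented out, `peft_model` here is still the model that was just trained, so the output below reports the trainable LoRA parameters instead.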

In [69]:
print(print_number_of_trainable_model_parameters(peft_model))

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%


### 3.3 - Evaluate the Model Qualitatively (Human Evaluation)

As in sections 1.3 and 2.3, run inference with the original model and the PEFT fine-tuned model.

In [71]:
index = 200
dialogue = dataset['test'][index]['dialogue']
human_baseline_summary = dataset['test'][index]['summary']

prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

# instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
# instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(dash_line)
print(f'BASELINE HUMAN SUMMARY:\n{human_baseline_summary}')
print(dash_line)
print(f'ORIGINAL MODEL:\n{original_model_text_output}')
# print(dash_line)
# print(f'INSTRUCT MODEL:\n{instruct_model_text_output}')
print(dash_line)
print(f'PEFT MODEL: {peft_model_text_output}')

---------------------------------------------------------------------------------------------------
BASELINE HUMAN SUMMARY:
#Person1# teaches #Person2# how to upgrade software and hardware in #Person2#'s system.
---------------------------------------------------------------------------------------------------
ORIGINAL MODEL:
#Person2# suggests that #Person2# should add a painting program to his software. #Person2# advises #Person2# to add a CD-ROM drive. #Person2# advise #Person2# to upgrade his system.
---------------------------------------------------------------------------------------------------
PEFT MODEL: #Person1# suggests that #Person2# should upgrade his computer and upgrade his hardware. #Person2# suggests adding a painting program to his software.


### 3.4 - Evaluate the Model Quantitatively (with ROUGE Metric)

Run inference on a sample of the test dataset (only 10 dialogues and summaries, to save time).

In [74]:
dialogues = dataset['test'][0:10]['dialogue']
human_baseline_summaries = dataset['test'][0:10]['summary']

original_model_summaries = []
# instruct_model_summaries = []
peft_model_summaries = []

for idx, dialogue in enumerate(dialogues):
    prompt = f"""
Summarize the following conversation.

{dialogue}

Summary: """

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to('cuda')

    human_baseline_text_output = human_baseline_summaries[idx]

    original_model_outputs = original_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    original_model_text_output = tokenizer.decode(original_model_outputs[0], skip_special_tokens=True)

    # instruct_model_outputs = instruct_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    # instruct_model_text_output = tokenizer.decode(instruct_model_outputs[0], skip_special_tokens=True)

    peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200))
    peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

    original_model_summaries.append(original_model_text_output)
    # instruct_model_summaries.append(instruct_model_text_output)
    peft_model_summaries.append(peft_model_text_output)

# zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, instruct_model_summaries, peft_model_summaries))

# df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'instruct_model_summaries', 'peft_model_summaries'])
# df

zipped_summaries = list(zip(human_baseline_summaries, original_model_summaries, peft_model_summaries))

df = pd.DataFrame(zipped_summaries, columns = ['human_baseline_summaries', 'original_model_summaries', 'peft_model_summaries'])
df

| | human_baseline_summaries | original_model_summaries | peft_model_summaries |
|---|---|---|---|
| 0 | Ms. Dawson helps #Person1# to write a memo to ... | #Person2# asks #Person2# to take a memo and ta... | #Person2# asks Ms. Dawson to take a dictation ... |
| 1 | In order to prevent employees from wasting tim... | #Person2# wants #Person2# to take a dictation ... | #Person1# asks #Person1# to take dicting for #... |
| 2 | Ms. Dawson takes a dictation for #Person1# abo... | #Person2# asks #Person2# to take a dictation f... | #Person1# asks #Person1# to take a dictation. ... |
| 3 | #Person2# arrives late because of traffic jam.... | #Person2# thinks it's good to take public tran... | #Person2# is stuck in traffic and he's to blam... |
| 4 | #Person2# decides to follow #Person1#'s sugges... | #Person1# is a little confused about the traff... | #Person2# gets stuck in traffic and got stuck ... |
| 5 | #Person2# complains to #Person1# about the tra... | #Person2# says #Person2# thinks it's a good id... | #Person2# gets stuck in traffic jam near the C... |
| 6 | #Person1# tells Kate that Masha and Hero get d... | #Person2# is surprised by the couple's divorce... | #Person1# tells #Person1# that Masha and Hero ... |
| 7 | #Person1# tells Kate that Masha and Hero are g... | #Person1# tells #Person1# that Masha and Hero ... | #Person2# tells #Person2# that Masha and Hero ... |
| 8 | #Person1# and Kate talk about the divorce betw... | #Person1# and #Person2# are getting a divorce.... | #Person1# and Masha and Hero are getting divor... |
| 9 | #Person1# and Brian are at the birthday party ... | Brian's birthday is Brian's. Brian is always p... | Brian's birthday is coming. Brian is always po... |


In [76]:
rouge = evaluate.load('rouge')

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

ORIGINAL MODEL:
{'rouge1': 0.31057723787282054, 'rouge2': 0.10499435920667168, 'rougeL': 0.2500854205318491, 'rougeLsum': 0.25222760635354624}
PEFT MODEL:
{'rouge1': 0.34266386634091767, 'rouge2': 0.12630009228128802, 'rougeL': 0.2984291704103025, 'rougeLsum': 0.302388490586165}


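Notice that the PEFT model is far cheaper to train, yet its results are not bad.

The next cell compares ROUGE scores over the full dataset. It expects a `results` DataFrame loaded from `data/dialogue-summary-training-results.csv`, which holds the summaries generated by each model; load that file first (for example with `pd.read_csv`), otherwise `results` is undefined.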

In [77]:
human_baseline_summaries = results['human_baseline_summaries'].values
original_model_summaries = results['original_model_summaries'].values
# instruct_model_summaries = results['instruct_model_summaries'].values
peft_model_summaries     = results['peft_model_summaries'].values

original_model_results = rouge.compute(
    predictions=original_model_summaries,
    references=human_baseline_summaries[0:len(original_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

# instruct_model_results = rouge.compute(
#     predictions=instruct_model_summaries,
#     references=human_baseline_summaries[0:len(instruct_model_summaries)],
#     use_aggregator=True,
#     use_stemmer=True,
# )

peft_model_results = rouge.compute(
    predictions=peft_model_summaries,
    references=human_baseline_summaries[0:len(peft_model_summaries)],
    use_aggregator=True,
    use_stemmer=True,
)

print('ORIGINAL MODEL:')
print(original_model_results)
# print('INSTRUCT MODEL:')
# print(instruct_model_results)
print('PEFT MODEL:')
print(peft_model_results)

NameError: name 'results' is not defined

Compared with full fine-tuning, PEFT gives up a little performance, but its benefits outweigh the slightly lower scores.

Improvement of the PEFT model over the original model:

In [79]:
print("Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(original_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

Absolute percentage improvement of PEFT MODEL over ORIGINAL MODEL
rouge1: 3.21%
rouge2: 2.13%
rougeL: 4.83%
rougeLsum: 5.02%
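As a worked check, the figures above are plain differences between the two score dictionaries printed earlier for the 10 test examples, expressed in percentage points. The sketch below redoes that subtraction without NumPy; the variable names are illustrative.

```python
# ROUGE scores copied from the output printed for the 10 test examples.
original = {'rouge1': 0.31057723787282054, 'rouge2': 0.10499435920667168,
            'rougeL': 0.2500854205318491, 'rougeLsum': 0.25222760635354624}
peft = {'rouge1': 0.34266386634091767, 'rouge2': 0.12630009228128802,
        'rougeL': 0.2984291704103025, 'rougeLsum': 0.302388490586165}

# Absolute improvement: PEFT score minus original score, per metric.
improvement = {key: peft[key] - original[key] for key in original}

for key, value in improvement.items():
    print(f"{key}: {value * 100:.2f} percentage points")
```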


Improvement of the PEFT model over the fully fine-tuned model:

In [None]:
print("Absolute percentage improvement of PEFT MODEL over INSTRUCT MODEL")

improvement = (np.array(list(peft_model_results.values())) - np.array(list(instruct_model_results.values())))
for key, value in zip(peft_model_results.keys(), improvement):
    print(f'{key}: {value*100:.2f}%')

You can see that the ROUGE scores are slightly lower than with full fine-tuning. However, training requires far less compute and memory (a single GPU).