## Fine-Tuning Llama3 8B for Text Summarization 📄

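I first experimented with a paper-summarization dataset, but its quality was too poor, so I'm using a higher-quality QA dataset instead. For the question (Q), we can simply ask the model to summarize the content briefly.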

#### Install & Import Dependencies

In [1]:
%%capture
# Install Unsloth, Xformers (Flash Attention), and the other required packages.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes

In [2]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset

#### Load Model

In [3]:
# Maximum sequence length. RoPE scaling is supported automatically under the hood.
max_seq_length = 2048 # Set this to whatever you need

# Data type. None for auto-detection; use Float16 on Tesla T4 or V100, Bfloat16 on Ampere and newer.
dtype = None

# Use 4-bit quantization to reduce memory usage. Set to False if you don't need it.
load_in_4bit = True

# 4-bit pre-quantized models: 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit", # Instruct version of Gemma 7b
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit", # Instruct version of Gemma 2b
    "unsloth/llama-3-8b-bnb-4bit", # [NEW] 15 Trillion token Llama-3
] # More models are available at https://huggingface.co/unsloth

# Load the pretrained model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # name of the model to use
    max_seq_length = max_seq_length, # maximum sequence length
    dtype = dtype, # data type
    load_in_4bit = load_in_4bit, # whether to use 4-bit quantization
)





==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.



Unsloth: unsloth/llama-3-8b-bnb-4bit has no tokenizer.model file.
Just informing you about this - this is not a critical error.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
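
The dtype rule in the comment above can be written out explicitly. The sketch below is a hypothetical helper (`pick_dtype` is not part of Unsloth) and assumes the only deciding factor is the GPU's CUDA compute capability: Ampere and newer (major version 8 and up) support bfloat16, while older cards such as the T4 (7.5) or V100 (7.0) fall back to float16.

```python
def pick_dtype(compute_capability_major: int) -> str:
    """Return the dtype name suggested for a GPU of the given compute capability.

    Hypothetical helper for illustration only; passing dtype=None lets the
    library make this choice on its own.
    """
    # Ampere (8.x) and newer GPUs have native bfloat16 support.
    return "bfloat16" if compute_capability_major >= 8 else "float16"


# The log above reports a Tesla T4 with CUDA capability 7.5 and Bfloat16 = FALSE.
print(pick_dtype(7))  # T4 / V100 class
print(pick_dtype(8))  # Ampere class
```

On a real machine the major version could be read with `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple.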


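Next we attach LoRA adapters. This way only a small fraction of the parameters is updated (typically 1-10%; with the settings below it is about 42M, roughly 0.5% of the 8B model), which makes training fast and efficient 😀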

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
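
To see where the trainable-parameter count comes from, the sketch below counts the LoRA weights by hand. Each adapted linear layer of shape (d_in, d_out) gains two matrices, A (r x d_in) and B (d_out x r), so it adds r * (d_in + d_out) parameters. The layer shapes are assumptions taken from the published Llama-3 8B configuration (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128); they are not printed anywhere in this notebook.

```python
R = 16             # LoRA rank passed to get_peft_model above
HIDDEN = 4096      # assumed Llama-3 8B hidden size
MLP = 14336        # assumed intermediate (MLP) size
KV_DIM = 8 * 128   # assumed 8 KV heads x head dim 128 (grouped-query attention)
LAYERS = 32        # matches "patched 32 layers" in the log

# (d_in, d_out) of every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    """Parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total matches the "Number of trainable parameters" that the trainer reports later in the notebook.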


#### Data Prep

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

In [6]:
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
# Remember to add the EOS_TOKEN to the tokenized output!! Otherwise you'll get infinite generations!

def formatting_prompts_func(examples):
    instructions = examples["question"]
    inputs       = examples["context"]
    outputs      = examples["answer"]
    texts = []
    for instruction, context, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, context, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

# Load the dataset
dataset = load_dataset("hanyueshf/ml-arxiv-papers-qa", split="train")

# Convert each record into a prompt-formatted "text" field
dataset = dataset.map(formatting_prompts_func, batched=True)
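
As a quick check of what `formatting_prompts_func` produces, the sketch below formats a single made-up record with a shortened template. The record contents and the `<EOS>` placeholder are invented for illustration; in the notebook the real value comes from `tokenizer.eos_token`. With `batched=True`, `map` passes a dict of lists, which is why the function zips over columns.

```python
# Shortened stand-in for alpaca_prompt (the preamble sentence is omitted).
TEMPLATE = """### Instruction:
{}

### Input:
{}

### Response:
{}"""
EOS = "<EOS>"  # placeholder; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Mimic the batched formatting: one prompt string per record, EOS appended."""
    texts = [
        TEMPLATE.format(question, context, answer) + EOS
        for question, context, answer in zip(
            examples["question"], examples["context"], examples["answer"]
        )
    ]
    return {"text": texts}


# Invented single-record batch in the same column layout as the QA dataset.
batch = {
    "question": ["What is proposed?"],
    "context": ["We propose X."],
    "answer": ["X."],
}
formatted = format_batch(batch)
print(formatted["text"][0])
```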



#### Train Model

Let's fine-tune the model with the SFTTrainer provided by Hugging Face TRL.

SFTTrainer performs supervised fine-tuning (SFT), which is typically the first stage of an RLHF pipeline. See the [documentation](https://huggingface.co/docs/trl/sft_trainer) for details 😎

In [7]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

max_steps is given, it will override any value given in num_train_epochs
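
With `warmup_steps = 5`, `max_steps = 60`, and `lr_scheduler_type = "linear"`, the learning rate ramps up linearly and then decays linearly toward zero. The function below is a plain-Python sketch of that shape, written from the usual definition of a linear warmup/decay schedule. It is not the Transformers implementation, and the exact step indexing may differ slightly.

```python
def linear_lr(step: int, base_lr: float = 2e-4, warmup: int = 5, total: int = 60) -> float:
    """Learning rate at a given optimizer step for linear warmup then linear decay."""
    if step < warmup:
        # Ramp from 0 up to base_lr over the warmup steps.
        return base_lr * step / warmup
    # Decay from base_lr down to 0 at the final step.
    return base_lr * max(0.0, (total - step) / (total - warmup))


for s in (0, 5, 30, 60):
    print(s, linear_lr(s))
```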


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 43,713 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,2.4179
2,2.2525
3,2.1654
4,2.0966
5,2.0339
6,1.9957
7,2.1214
8,1.9377
9,1.9087
10,1.645
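
The "Total batch size = 8" and "Total steps = 60" lines in the log imply that this short run sees only a small part of the dataset. A sketch of the arithmetic, using only numbers that appear in the cells above:

```python
per_device_batch = 2     # per_device_train_batch_size
grad_accum = 4           # gradient_accumulation_steps
num_gpus = 1             # "Num GPUs = 1" in the log
max_steps = 60
num_examples = 43_713    # "Num examples" in the log

effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction = examples_seen / num_examples

print(effective_batch, examples_seen, f"{fraction:.1%}")  # 8 480 1.1%
```

So 60 steps is a quick demonstration rather than a full epoch; raising `max_steps` or setting `num_train_epochs` instead would cover more of the data.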


#### Inference

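Let's run inference. The output slot is left blank so that the model generates it itself 🤓

The instruction asks for a summary, and the input is a paper's title and abstract concatenated together (here, a federated learning survey paper).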

In [11]:

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize what the paper is about", # instruction
        "Federated learning: Applications, challenges and future directions.Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection,communication cost, system heterogeneity, and unreliable model upload, followed by future research directions.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize what the paper is about\n\n### Input:\nFederated learning: Applications, challenges and future directions.Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context 
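
`batch_decode` returns the whole sequence, prompt included, so the generated summary has to be cut out of it. One simple way, sketched below, is to split on the `### Response:` marker from the Alpaca template and drop the end-of-text token. The helper name and the sample string are made up for illustration; `<|end_of_text|>` is assumed to be the EOS token string for this base model.

```python
def extract_response(
    decoded: str,
    marker: str = "### Response:",
    eos: str = "<|end_of_text|>",
) -> str:
    """Return only the text generated after the last response marker."""
    # rpartition keeps everything after the final occurrence of the marker.
    _, _, tail = decoded.rpartition(marker)
    return tail.replace(eos, "").strip()


# Invented, shortened stand-in for one decoded generation.
sample = (
    "<|begin_of_text|>### Instruction:\nSummarize what the paper is about\n\n"
    "### Input:\nSome abstract.\n\n"
    "### Response:\nA short summary.<|end_of_text|>"
)
print(extract_response(sample))
```

Passing `skip_special_tokens=True` to `tokenizer.batch_decode` is another way to drop the special tokens before splitting.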

In [13]:
# ROUGE score calculation
!pip install rouge
from rouge import Rouge

input = "Federated learning: Applications, challenges and future directions.Federated learning (FL) is a system in which a central aggregator coordinates the efforts of multiple clients to solve machine learning problems. This setting allows training data to be dispersed in order to protect privacy. The purpose of this paper is to provide an overview of FL systems with a focus on healthcare. FL is evaluated here based on its frameworks, architectures, and applications. It is shown here that FL solves the preceding issues with a shared global deep learning (DL) model via a central aggregator server. This paper examines recent developments and provides a comprehensive list of unresolved issues, inspired by the rapid growth of FL research. In the context of FL, several privacy methods are described, including secure multiparty computation, homomorphic encryption, differential privacy, and stochastic gradient descent. Furthermore, a review of various FL classes, such as horizontal and vertical FL and federated transfer learning, is provided. FL has applications in wireless communication, service recommendation, intelligent medical diagnosis systems, and healthcare, all of which are discussed in this paper. We also present a thorough review of existing FL challenges, such as privacy protection,communication cost, system heterogeneity, and unreliable model upload, followed by future research directions."
reference = "Federated Learning (FL) is a system that allows multiple clients to work together to train machine learning models without having to share data. The technology coordinates processes by a central server while protecting privacy. This paper covers the overview, framework, architecture, application field, privacy protection method, type, application field, challenges, and future research directions of FL, with a particular focus on the healthcare field."
output = "The paper provides an overview of Federated Learning (FL) systems with a focus on healthcare. It discusses FL frameworks, architectures, and applications, evaluating its effectiveness in solving issues related to data privacy. The paper highlights recent developments in FL and outlines unresolved issues, inspired by the rapid growth of research in this area."

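rouge = Rouge()
# Rouge.get_scores takes (hypotheses, references). Here the source text is passed
# as the hypothesis and each summary as the reference, so both summaries are
# scored against the source abstract.
ref_score = rouge.get_scores(input, reference)   # my own summary
output_score = rouge.get_scores(input, output)   # model-generated summary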

print(ref_score)
print(output_score)

[{'rouge-1': {'r': 0.5892857142857143, 'p': 0.2426470588235294, 'f': 0.34374999586805555}, 'rouge-2': {'r': 0.22580645161290322, 'p': 0.07142857142857142, 'f': 0.10852712813172298}, 'rouge-l': {'r': 0.5357142857142857, 'p': 0.22058823529411764, 'f': 0.3124999958680556}}]
[{'rouge-1': {'r': 0.782608695652174, 'p': 0.2647058823529412, 'f': 0.3956043918270741}, 'rouge-2': {'r': 0.39215686274509803, 'p': 0.10204081632653061, 'f': 0.16194331656116318}, 'rouge-l': {'r': 0.7391304347826086, 'p': 0.25, 'f': 0.37362636984905206}}]
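
To make the `r`, `p`, and `f` fields concrete, here is a simplified ROUGE-1 written from its definition: the overlap of unique unigrams, divided by the reference size for recall and by the hypothesis size for precision. It is a sketch with whitespace tokenization, not the `rouge` package's implementation, so its numbers will not match the output above exactly. The two example sentences are invented.

```python
def rouge_1(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1 over unique, lowercased, whitespace-split tokens."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    r = overlap / len(ref) if ref else 0.0   # recall: share of reference covered
    p = overlap / len(hyp) if hyp else 0.0   # precision: share of hypothesis matched
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"r": r, "p": p, "f": f}


source = "federated learning trains a shared model without moving client data"
summary = "federated learning trains a shared model"

a = rouge_1(summary, source)  # summary as hypothesis, source as reference
b = rouge_1(source, summary)  # arguments swapped
print(a)
print(b)
```

Swapping the arguments exchanges `r` and `p` but leaves `f` unchanged, which is worth keeping in mind when reading the scores above, where the source text is passed first.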


How does it look? The response is pretty decent, right? 👍

#### Save & Load Fine-tuned Model

To save the final model as a LoRA adapter, use Hugging Face's `push_to_hub` to upload it, or `save_pretrained` to save it locally.

Note that this saves only the LoRA adapter, not the full model.

In [14]:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')
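
A rough size estimate shows why saving only the adapter is cheap. The sketch assumes the adapter weights are stored as 32-bit floats (4 bytes each); the actual file size depends on the dtype used when saving, so treat the result as an order-of-magnitude figure.

```python
trainable_params = 41_943_040   # "Number of trainable parameters" in the training log
bytes_per_param = 4             # assumption: float32 storage

adapter_mb = trainable_params * bytes_per_param / 1e6
print(f"{adapter_mb:.0f} MB")
```

The base model weights are not included in that figure; they are downloaded again from the hub when the adapter is loaded.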

To load the LoRA adapter we just saved and use it for inference, set `False` to `True` in the cell below (it is already set to `True` here).

In [15]:
if True:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Summarize what the paper is about", # instruction
        "How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study. Meta’s LLAMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLAMA3 models have recently been released and achieve impressive performance across various with super-large scale pre-training on over 15T tokens of data. Given the wide application of lowbit quantization for LLMs in resource-limited scenarios, we explore LLAMA3’s capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLAMA3 and other forthcoming LLMs, especially in addressing performance degradation problems that suffer in LLM compression. Specifically, we evaluate the 10 existing posttraining quantization and LoRA-finetuning methods of LLAMA3 on 1-8 bits and diverse datasets to comprehensively reveal LLAMA3’s low-bit quantization performance. Our experiment results indicate that LLAMA3 still suffers nonnegligent degradation in these scenarios, especially in ultra-low bit-width. This highlights the significant performance gap under low bit-width that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, pushing the LLMs to lower bit-width with higher accuracy for being practical. ", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)



==((====))==  Unsloth: Fast Llama patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Unsloth: lora_model has no tokenizer.model file.
Just informing you about this - this is not a critical error.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nSummarize what the paper is about\n\n### Input:\nHow Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study. Meta’s LLAMA family has become one of the most powerful open-source Large Language Model (LLM) series. Notably, LLAMA3 models have recently been released and achieve impressive performance across various with super-large scale pre-training on over 15T tokens of data. Given the wide application of lowbit quantization for LLMs in resource-limited scenarios, we explore LLAMA3’s capabilities when quantized to low bit-width. This exploration holds the potential to unveil new insights and challenges for low-bit quantization of LLAMA3 and other forthcoming LLMs, especially in addressing performance degradation problems that suffer in LLM compression. Specifically, we evaluate the 10 ex

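#### Assignment

1. Compare the ROUGE scores of your own summary and the model's summary.
2. Is there a gap between the ROUGE scores and your subjective sense of summary quality? (For example, your own summary seems better but gets a lower ROUGE score.) Why does such a gap arise?
3. Describe freely what else should be considered as an evaluation metric in order to measure real summarization quality better.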



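1. Overall, the model's summary scored higher on ROUGE than mine. Comparing recall in particular, the model's summary overlaps more with the source text, which suggests that the model-generated summary reflects the source better than the one I wrote. (Because the source text is passed as the hypothesis in the call above, the reported `r` is the share of each summary's n-grams that also appear in the source, and `p` is the share of the source's n-grams that the summary covers; both are higher for the model.)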

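2. ROUGE is an n-gram-based metric: it scores by word overlap alone, without considering the meaning or context of the text. A summary can therefore score higher simply by reusing words from the source. Human judgments of summary quality, by contrast, weigh several aspects together, such as factual accuracy and readability, which is why the two can disagree.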


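3. BERTScore: it uses contextual embeddings from a pretrained language model such as BERT to measure the similarity between the source and the summary. Unlike ROUGE, it can credit matches in meaning rather than only exact word overlap, so it can measure summary quality better.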
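
The idea behind BERTScore can be sketched without a real model: embed every token, match each token greedily to its most similar counterpart by cosine similarity, and average. The 2-dimensional vectors below are invented stand-ins for contextual embeddings (a real implementation would take them from a pretrained model such as BERT and may add IDF weighting), so only the mechanics are meaningful, not the numbers.

```python
import math

# Invented toy embeddings; synonyms are deliberately given nearby vectors.
EMB = {
    "paper": (1.0, 0.0), "study": (0.9, 0.1),
    "reviews": (0.0, 1.0), "surveys": (0.1, 0.9),
    "privacy": (0.7, 0.7),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_f1(candidate: str, reference: str) -> float:
    """Embedding-based F1: each token is matched to its most similar counterpart."""
    cand = [EMB[t] for t in candidate.split()]
    ref = [EMB[t] for t in reference.split()]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)


def unigram_f1(candidate: str, reference: str) -> float:
    """Exact-match unigram overlap F1, in the spirit of ROUGE-1."""
    c, r = set(candidate.split()), set(reference.split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)


reference = "paper reviews privacy"
paraphrase = "study surveys privacy"
print(unigram_f1(paraphrase, reference))  # low: only one exact word match
print(greedy_f1(paraphrase, reference))   # high: synonyms have similar vectors
```

The paraphrase shares only one word with the reference, so the overlap score is low while the embedding-based score stays high. That is the gap between ROUGE and human judgment described in answer 2.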