## 使用 unsloth 和 DPOTrainer 微调 Llama3 模型

In [1]:
%pip install datasets trl peft bitsandbytes wandb accelerate transformers xformers

[0mNote: you may need to restart the kernel to use updated packages.


In [2]:
%pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

Collecting unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-vkdqcarg/unsloth_8973429625644cfbb9921b26b969a0d3
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-vkdqcarg/unsloth_8973429625644cfbb9921b26b969a0d3
  Resolved https://github.com/unslothai/unsloth.git to commit 38663b01f5dd0e610b12475bd95b144303cff539
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
[?25hCollecting unsloth-zoo (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Obtaining dependency information for unsloth-zoo from https://files.pythonhosted.org/packages/31/e9/1fee23655b1c0674a63b92ec960c04db12f01df27a1d45eac7de0b4f3651/unsloth_zoo-2024.10.1-py3-none-any.whl.metadata
  Downloading unsloth_zoo-2024.10.1-py3-none-any.whl.metadat

In [3]:
import torch
import os

# unsloth 不支持多GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "7"

for k, v in os.environ.items():
    if "cuda" in k.lower():
        print(k, v)

print()
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.__version__)

NV_CUDA_COMPAT_PACKAGE cuda-compat-11-8
NV_CUDA_NSIGHT_COMPUTE_VERSION 11.8.0-1
CUDA_VERSION 11.8.0
NVIDIA_REQUIRE_CUDA cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,drive

In [4]:
from unsloth import PatchDPOTrainer
PatchDPOTrainer()

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [5]:
from unsloth import FastLanguageModel
from datasets import load_dataset
import wandb
from trl import DPOConfig, DPOTrainer
from unsloth import is_bfloat16_supported
import gc
from unsloth import FastLanguageModel
from transformers import AutoTokenizer, TextStreamer

max_prompt_length = 1024

max_seq_length = 8192  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.
base_model = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
output_model = f"./output/{base_model}/dpo"
final_ckpt = f"{output_model}/final_ckpt"

wandb_name = "llama3-finetune"

In [6]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,  # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

Unsloth 2024.10.0 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.


In [20]:
train_dataset = "argilla/ultrafeedback-binarized-preferences-cleaned"

dataset = load_dataset(train_dataset)['train']
# dataset = dataset.select(range(100))
dataset = dataset.shuffle(seed=42)
dataset

Dataset({
    features: ['source', 'prompt', 'chosen', 'chosen-rating', 'chosen-model', 'rejected', 'rejected-rating', 'rejected-model'],
    num_rows: 100
})

In [21]:
dataset[0]['prompt']

'Classify the following sentence as the type of speech it is, considering the context of a physics problem. Additionally, provide an explanation of the mathematical concept behind the sentence, using LaTeX to represent the equation that represents the answer. Sentence: "The answer is the force required to move a mass of 5kg with an acceleration of 1m/s^2."'

In [22]:
dataset[0]['rejected']

[{'content': 'In what ways can the programming language Go be utilized to conduct a comprehensive analysis and comparison of the nutritional and ingredient composition of freshly-cooked food and packaged food? Can you provide a sample code that effectively leverages Go to generate user-friendly insights and visualizations regarding the findings of the analysis?',
  'role': 'user'},
 {'content': 'Thank you for your question! I\'m happy to help you explore the potential uses of the Go programming language for analyzing and comparing the nutritional and ingredient composition of freshly-cooked food and packaged food.\nTo start with, Go is an excellent choice for this task due to its strong typing system, concise syntax, and built-in support for processing and data analysis capabilities. With Go, you can quickly and efficiently collect data from various sources, perform complex calculations, and visualize the results in a user-friendly manner.\nHere are some potential ways you could use Go

In [23]:
dataset[0]['chosen']

[{'content': 'Classify the following sentence as the type of speech it is, considering the context of a physics problem. Additionally, provide an explanation of the mathematical concept behind the sentence, using LaTeX to represent the equation that represents the answer. Sentence: "The answer is the force required to move a mass of 5kg with an acceleration of 1m/s^2."',
  'role': 'user'},
 {'content': "The given sentence can be classified as an informative or declarative statement as it conveys information about a physics problem, specifically the calculation of force.\n\nThe concept behind the sentence refers to Newton's second law of motion, which states that the force acting on an object is equal to the product of its mass and acceleration. Mathematically, it can be represented using the equation:\n\n$$ F = m \\times a$$\n\nWhere,\n- $F$ represents the force\n- $m$ represents the mass of the object (in this case, 5 kg)\n- $a$ represents the acceleration (in this case, 1 m/s²)\n\nUs

In [24]:
chat_template = open('llama-3-instruct.jinja').read()
chat_template

"{% if messages[0]['role'] == 'system' %}\n    {% set offset = 1 %}\n{% else %}\n    {% set offset = 0 %}\n{% endif %}\n\n{{ bos_token }}\n{% for message in messages %}\n    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}\n        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}\n    {% endif %}\n\n    {{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' + message['content'] | trim + '<|eot_id|>' }}\n{% endfor %}\n\n{% if add_generation_prompt %}\n    {{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\\n\\n' }}\n{% endif %}"

In [25]:
chat_template = chat_template.replace('    ', '').replace('\n', '')
tokenizer.chat_template = chat_template
chat_template

"{% if messages[0]['role'] == 'system' %}{% set offset = 1 %}{% else %}{% set offset = 0 %}{% endif %}{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' + message['content'] | trim + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\\n\\n' }}{% endif %}"

In [26]:
def get_assistant_content(data):
    for item in data:
        if item["role"] == "assistant":
            return item["content"]
    return ""


def get_question_content(data):
    for item in data:
        if item["role"] == "user":
            return item["content"]
    return ""

system_prompt = "You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps."

def dataset_format(example):
    # Format system
    if "system" in example and len(example["system"]) > 0:
        message = {"role": "system", "content": example["system"]}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    else:
        message = {"role": "system", "content": system_prompt}
        system = tokenizer.apply_chat_template([message], tokenize=False)
    # Format instruction
    message = {"role": "user", "content": get_question_content(example["chosen"])}
    prompt = tokenizer.apply_chat_template([message], tokenize=False, add_generation_prompt=True)
    # Format chosen answer
    chosen = get_assistant_content(example["chosen"]) + "<|eot_id|>\n"
    # Format rejected answer
    rejected = get_assistant_content(example["rejected"]) + "<|eot_id|>\n"
    return {
        "prompt": system + prompt,
        "chosen": chosen,
        "rejected": rejected,
    }

In [27]:
dataset_format(dataset[0])

{'prompt': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.<|eot_id|><|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nClassify the following sentence as the type of speech it is, considering the context of a physics problem. Additionally, provide an explanation of the mathematical concept behind the sentence, using LaTeX to represent the equation that represents the answer. Sentence: "The answer is the force required to move a mass of 5kg with an acceleration of 1m/s^2."<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n',
 'chosen': "The given sentence can be classified as an informative or declarative statement as it conveys information about a physics problem, specifically the calculation of force.\n\nThe concept behind the sentence refers to Newton's second law of m

In [28]:
original_columns = dataset.column_names
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

dataset = dataset.map(
    dataset_format,
    remove_columns=original_columns,
    num_proc=os.cpu_count(),
)

Map (num_proc=96):   0%|          | 0/100 [00:00<?, ? examples/s]

In [29]:
wandb.login()
os.environ["WANDB_PROJECT"] = wandb_name

In [30]:
dpo_config = DPOConfig(
    output_dir=output_model,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    save_strategy="no",
    logging_steps=1,
    optim="paged_adamw_32bit",
    warmup_steps=10,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    report_to="wandb",
    beta=0.1,
    max_prompt_length=max_prompt_length,
    max_length=max_seq_length,
    force_use_ref_model=True,
    remove_unused_columns=False,
)

In [31]:
dpo_trainer = DPOTrainer(
    model,
    None,
    args=dpo_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

Tokenizing train dataset:   0%|          | 0/100 [00:00<?, ? examples/s]

Detected kernel version 4.9.70, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [32]:
dpo_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 12
 "-____-"     Number of trainable parameters = 24,313,856


Could not estimate the number of tokens of the input, floating-point operations will not be computed


Step,Training Loss,rewards / chosen,rewards / rejected,rewards / accuracies,rewards / margins,logps / rejected,logps / chosen,logits / rejected,logits / chosen
1,0.6931,0.0,0.0,0.0,0.0,-346.599487,-439.619568,-0.831523,0.214761
2,0.6931,0.0,0.0,0.0,0.0,-374.947754,-530.129639,-0.398564,-0.181835
3,0.6937,0.004613,0.005645,0.5,-0.001032,-330.874084,-433.769257,-0.486776,-0.211405
4,0.6949,-0.002227,0.001264,0.5,-0.003491,-304.160706,-361.233795,-0.690558,-0.302086
5,0.6929,-0.005834,-0.00633,0.375,0.000496,-393.764587,-449.716248,-0.705344,-0.171326
6,0.686,0.000987,-0.01336,0.875,0.014347,-322.016052,-604.72937,-0.942304,-0.138371
7,0.6923,-0.001494,-0.003322,0.5,0.001828,-359.222412,-402.259766,-0.454612,-0.269678
8,0.6899,-0.00891,-0.015593,0.75,0.006683,-332.230469,-456.787598,-0.832791,-0.22727
9,0.676,-0.016832,-0.052506,0.875,0.035675,-322.846863,-413.016296,-0.432009,-0.311457
10,0.6772,-0.003046,-0.035945,0.625,0.032899,-522.161987,-384.475769,-0.172401,-0.461164


TrainOutput(global_step=12, training_loss=0.6849456280469894, metrics={'train_runtime': 82.561, 'train_samples_per_second': 1.211, 'train_steps_per_second': 0.145, 'total_flos': 0.0, 'train_loss': 0.6849456280469894, 'epoch': 0.96})

In [33]:
model.save_pretrained_merged(final_ckpt, tokenizer, save_method="merged_16bit")
# model.save_pretrained_merged(final_ckpt, tokenizer, save_method = "merged_4bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 157.05 out of 251.58 RAM for saving.


100%|██████████| 28/28 [00:00<00:00, 77.58it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


In [34]:
# Flush memory
# del dpo_trainer, model
# gc.collect()
# torch.cuda.empty_cache()

In [35]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=final_ckpt,  # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

./output/unsloth/Llama-3.2-3B-Instruct-bnb-4bit/dpo/final_ckpt does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), ep

In [37]:
# Format prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot that provides concise answers."},
    {"role": "user", "content": "What are GPUs and why would I use them for machine learning tasks?"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(
    "cuda"
)
inputs

tensor([[128000, 128006,   9125, 128007,    271,   2675,    527,    264,  11190,
          18328,   6369,   6465,    430,   5825,  64694,  11503,     13, 128009,
         128006,    882, 128007,    271,   3923,    527,  71503,    323,   3249,
           1053,    358,   1005,   1124,    369,   5780,   6975,   9256,     30,
         128009, 128006,  78191, 128007,    271]], device='cuda:0')

In [38]:
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=1024, use_cache=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant chatbot that provides concise answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are GPUs and why would I use them for machine learning tasks?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

**What are GPUs?**

Graphics Processing Units (GPUs) are specialized computer chips designed to perform massive parallel processing, making them extremely efficient for computationally intensive tasks.

**Why use GPUs for machine learning?**

GPUs are particularly well-suited for machine learning (ML) tasks because they:

1. **Speed up computations**: GPUs can perform many calculations simultaneously, reducing the time required for tasks like matrix multiplication, convolution, and neural network computations.
2. **Reduce memory bandwidth**: GPUs have high memory bandwidth, allowing for faster data transfer between the GPU and host system.
3. **Increase parallelism**: GPUs can handle many 