## Fine-tuning a Llama3 model with unsloth and SFTTrainer

In [None]:
%pip install datasets trl peft bitsandbytes wandb accelerate transformers xformers

In [None]:
%pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

In [4]:
import torch
import os

# unsloth does not support multi-GPU training, so expose only one GPU
os.environ["CUDA_VISIBLE_DEVICES"] = "6"

for k, v in os.environ.items():
    if "cuda" in k.lower():
        print(k, v)

print()
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.__version__)

NV_CUDA_COMPAT_PACKAGE cuda-compat-11-8
NV_CUDA_NSIGHT_COMPUTE_VERSION 11.8.0-1
CUDA_VERSION 11.8.0
NVIDIA_REQUIRE_CUDA cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,drive

In [5]:
import gc

from unsloth import FastLanguageModel, is_bfloat16_supported
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTConfig, SFTTrainer
import wandb


max_seq_length = 8192  # Choose any! We auto support RoPE Scaling internally!
dtype = None  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True  # Use 4bit quantization to reduce memory usage. Can be False.
base_model = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit"
output_model = f"./output/{base_model}/sft"
final_ckpt = f"{output_model}/final_ckpt"

wandb_name = "llama3-finetune"

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


In [6]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=base_model,  # Choose ANY! eg mistralai/Mistral-7B-Instruct-v0.2
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


In [7]:
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)

Unsloth 2024.10.0 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
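
The trainable-parameter count that the training log reports later (24,313,856) can be reproduced by hand. The sketch below assumes standard LoRA, where each adapted linear layer gains two matrices of shapes `r x in_features` and `out_features x r`, and takes the layer shapes from the model printout further down in this notebook.

```python
# Standard LoRA adds r * (in_features + out_features) weights per adapted layer.
r = 16
num_layers = 28

# (in_features, out_features) for each target module, as shown in the
# LlamaForCausalLM printout of this notebook.
projections = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}

per_layer = sum(r * (n_in + n_out) for n_in, n_out in projections.values())
total = per_layer * num_layers
print(f"{per_layer:,} per layer, {total:,} in total")
# prints: 868,352 per layer, 24,313,856 in total
```

Doubling `r` doubles this count, which is a quick way to estimate the cost of the other suggested ranks.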


In [8]:
train_dataset = "quyanh/lima"

In [22]:
dataset = load_dataset(train_dataset)["train"]
dataset = dataset.select(range(100))
dataset = dataset.shuffle(seed=42)
dataset

Using the latest cached version of the dataset since quyanh/lima couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /root/.cache/huggingface/datasets/quyanh___lima/default/0.0.0/b5bcfd82d16a543b9bdea4383c4034ff5d618cf5 (last modified on Tue Oct 15 14:11:48 2024).


Dataset({
    features: ['system_prompt', 'prompt'],
    num_rows: 100
})

In [10]:
dataset[0]["system_prompt"]

'You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.'

In [11]:
dataset[0]["prompt"]

'Q: Summarize the text below in less than 15 words.\n\nCivil engineering is a professional engineering discipline that deals with the design, construction, and maintenance of the physical and naturally built environment, including public works such as roads, bridges, canals, dams, airports, sewage systems, pipelines, structural components of buildings, and railways. \nA: Civil Engineering deals with the design, construction, and maintenance of public infrastructure.\n'

In [12]:
chat_template = open("llama-3-instruct.jinja").read()
chat_template

"{% if messages[0]['role'] == 'system' %}\n    {% set offset = 1 %}\n{% else %}\n    {% set offset = 0 %}\n{% endif %}\n\n{{ bos_token }}\n{% for message in messages %}\n    {% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}\n        {{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}\n    {% endif %}\n\n    {{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' + message['content'] | trim + '<|eot_id|>' }}\n{% endfor %}\n\n{% if add_generation_prompt %}\n    {{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\\n\\n' }}\n{% endif %}"

In [13]:
chat_template = chat_template.replace("    ", "").replace("\n", "")
tokenizer.chat_template = chat_template
chat_template

"{% if messages[0]['role'] == 'system' %}{% set offset = 1 %}{% else %}{% set offset = 0 %}{% endif %}{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == offset) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{ '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n' + message['content'] | trim + '<|eot_id|>' }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>' + 'assistant' + '<|end_header_id|>\\n\\n' }}{% endif %}"

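To make the template easier to read, here is a plain-Python approximation of what it renders. `render_llama3_chat` is a hypothetical helper written for illustration, not part of transformers, and the default `bos_token` is assumed from the `<|begin_of_text|>` marker visible in the generation output at the end of this notebook.

```python
def render_llama3_chat(messages, bos_token="<|begin_of_text|>", add_generation_prompt=False):
    """Hypothetical helper that mirrors the Jinja chat template in plain Python."""
    # An optional leading system message shifts the expected user/assistant parity.
    offset = 1 if messages and messages[0]["role"] == "system" else 0
    parts = [bos_token]
    for index, message in enumerate(messages):
        # Same alternation check as the template.
        if (message["role"] == "user") != (index % 2 == offset):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        # Header, trimmed content, end-of-turn marker.
        parts.append(
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip() + "<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "  Hi!  "},
]
print(repr(render_llama3_chat(demo, add_generation_prompt=True)))
```

Note that the two `replace` calls in the cell above only strip the indentation and real line breaks of the `.jinja` file. The `\n\n` sequences inside the Jinja string literals are escape sequences, so they survive and still render as newlines.
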
In [23]:
import re


def formatting_prompts_func(example):
    # Use a regular expression to match the dialogue turns that start with Q: and A:
    pattern = re.compile(r"(Q:.*?)(?=Q:|A:|$)|(A:.*?)(?=Q:|A:|$)", re.DOTALL)
    matches = pattern.findall(example["prompt"])

    message = []
    message.append({"role": "system", "content": example["system_prompt"]})

    for match in matches:
        user_part, assistant_part = match
        if user_part:
            message.append({"role": "user", "content": user_part[2:].strip()})
        if assistant_part:
            message.append({"role": "assistant", "content": assistant_part[2:].strip()})
    
    try:
        text = tokenizer.apply_chat_template(message, tokenize=False)
    except Exception:
        # Show the offending conversation, then re-raise the original error.
        print(message)
        raise
    return {"message": message, "text": text}


print(dataset[0])
print("\n\n")
result = formatting_prompts_func(dataset[0])
print(result)

{'system_prompt': 'You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.', 'prompt': 'Q: How to invite people to a party? \nA: Planning, hosting, and enjoying a good party is a great way to build and strengthen friendships and community!  An important, but sometimes undervalued element in the success of a party is the invitation.  The following answer will have you writing and sending out excellent invitations - and welcoming in happy guests - in no time!\n\n## General approaches to invitations\n\n1. Design your invitation to resemble the party theme. For example, a disco-themed party invitation could feature a large disco ball. People are likely to look at your invitation and make a quick first impression -- you want that first impression to be informative and fun. If your party doesn\'t have a theme, have the invitation mirror the formality of the party. 
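
The splitting step can be tried on its own, without a tokenizer. The sketch below reuses the same pattern on two made-up strings (both are illustrative and do not come from the dataset). The second one shows a limitation worth knowing about: the lookahead splits on any `Q:` or `A:`, including one inside a word such as `FAQ:`.

```python
import re

# Same pattern as in formatting_prompts_func.
pattern = re.compile(r"(Q:.*?)(?=Q:|A:|$)|(A:.*?)(?=Q:|A:|$)", re.DOTALL)


def split_turns(prompt):
    """Return (role, content) pairs extracted from a Q:/A: formatted string."""
    turns = []
    for user_part, assistant_part in pattern.findall(prompt):
        if user_part:
            turns.append(("user", user_part[2:].strip()))
        if assistant_part:
            turns.append(("assistant", assistant_part[2:].strip()))
    return turns


print(split_turns("Q: What is 2 + 2? \nA: 4\nQ: And 3 + 3?\nA: 6\n"))
# [('user', 'What is 2 + 2?'), ('assistant', '4'), ('user', 'And 3 + 3?'), ('assistant', '6')]

# Caveat: "FAQ:" contains "Q:", so the answer is cut short and a spurious user turn appears.
print(split_turns("Q: Where?\nA: See the FAQ: it helps.\n"))
# [('user', 'Where?'), ('assistant', 'See the FA'), ('user', 'it helps.')]
```

If the dataset contains such text, anchoring the markers to the start of a line (for example with `re.MULTILINE` and `^Q:`) may be a safer variant, though that is untested against this dataset.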

In [24]:
original_columns = dataset.column_names
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

dataset = dataset.map(
    formatting_prompts_func,
    remove_columns=original_columns,
    num_proc=os.cpu_count(),
)

Map (num_proc=96):   0%|          | 0/100 [00:00<?, ? examples/s]

In [25]:
wandb.login()
os.environ["WANDB_PROJECT"] = wandb_name

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


[34m[1mwandb[0m: Currently logged in as: [33mmurphypei[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [26]:
sft_config = SFTConfig(
    output_dir=output_model,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    save_strategy="no",
    logging_steps=1,
    optim="paged_adamw_32bit",
    warmup_steps=10,
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    report_to="wandb",
    max_seq_length=max_seq_length,
    # remove_unused_columns=False,
    dataset_num_proc=os.cpu_count(),
)

In [29]:
sft_trainer = SFTTrainer(
    model,
    args=sft_config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",
)

Detected kernel version 4.9.70, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [30]:
sft_trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 100 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 12
 "-____-"     Number of trainable parameters = 24,313,856


Step,Training Loss
1,4.3242
2,4.2873
3,3.6683
4,3.9626
5,4.2304
6,3.7594
7,4.0333
8,4.0152
9,3.8235
10,4.1727


TrainOutput(global_step=12, training_loss=4.015695333480835, metrics={'train_runtime': 42.6067, 'train_samples_per_second': 2.347, 'train_steps_per_second': 0.282, 'total_flos': 1565832443289600.0, 'train_loss': 4.015695333480835, 'epoch': 0.96})
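
The step count and the final `epoch` value of 0.96 follow from the batch settings. The arithmetic below is a sketch that reproduces the numbers in the log. It assumes the trainer only counts full gradient-accumulation groups per epoch, which is consistent with the reported 12 steps but is an inference from this output rather than a documented guarantee.

```python
num_examples = 100
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1

# Examples consumed per optimizer step ("Total batch size" in the log).
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Dataloader batches per epoch (ceiling division).
batches_per_epoch = -(-num_examples // (per_device_batch_size * num_gpus))

# Assumption: leftover batches that do not fill an accumulation group are not counted.
optimizer_steps = batches_per_epoch // gradient_accumulation_steps

examples_seen = optimizer_steps * effective_batch_size
epoch_fraction = examples_seen / num_examples

print(effective_batch_size, batches_per_epoch, optimizer_steps, epoch_fraction)
# prints: 8 50 12 0.96
```

With `warmup_steps=10` and only 12 optimizer steps, most of this short run is spent in learning-rate warmup, which may explain why the loss barely moves.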

In [31]:
model.save_pretrained_merged(final_ckpt, tokenizer, save_method="merged_16bit")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 159.6 out of 251.58 RAM for saving.


100%|██████████| 28/28 [00:00<00:00, 77.29it/s]


Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Done.


In [None]:
# Flush memory
# del sft_trainer, model
# gc.collect()
# torch.cuda.empty_cache()

In [32]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=final_ckpt,  # The merged 16-bit checkpoint saved above
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)  # Enable native 2x faster inference

==((====))==  Unsloth 2024.10.0: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla V100-SXM2-32GB. Max memory: 31.739 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

./output/unsloth/Llama-3.2-3B-Instruct-bnb-4bit/sft/final_ckpt does not have a padding token! Will use pad_token = <|finetune_right_pad_id|>.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaExtendedRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((3072,), ep

In [34]:
# Format prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant chatbot that provides concise answers."},
    {"role": "user", "content": "What are GPUs and why would I use them for machine learning tasks?"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(
    "cuda"
)
inputs

tensor([[128000, 128006,   9125, 128007,    271,   2675,    527,    264,  11190,
          18328,   6369,   6465,    430,   5825,  64694,  11503,     13, 128009,
         128006,    882, 128007,    271,   3923,    527,  71503,    323,   3249,
           1053,    358,   1005,   1124,    369,   5780,   6975,   9256,     30,
         128009, 128006,  78191, 128007,    271]], device='cuda:0')

In [35]:
text_streamer = TextStreamer(tokenizer)
_ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=1024, use_cache=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant chatbot that provides concise answers.<|eot_id|><|start_header_id|>user<|end_header_id|>

What are GPUs and why would I use them for machine learning tasks?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

**What are GPUs?**

Graphics Processing Units (GPUs) are specialized computer chips designed to perform massive parallel processing, making them extremely efficient for computationally intensive tasks.

**Why use GPUs for machine learning?**

GPUs are particularly well-suited for machine learning (ML) tasks because they:

1. **Speed up computations**: GPUs can perform many calculations simultaneously, reducing the time required for tasks like matrix multiplication, convolution, and neural network computations.
2. **Reduce memory bandwidth**: GPUs have high memory bandwidth, allowing for faster data transfer between the GPU and host system.
3. **Increase parallelism**: GPUs can handle many 