<a href="https://colab.research.google.com/github/nirvana66649/felixRepo/blob/main/fine_tuning_llama3_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Llama 3.2 with Unsloth

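What is Unsloth?

 Unsloth is a lightweight library built on PyTorch and Hugging Face, positioned as a highly efficient LoRA fine-tuning framework. It claims to be 2-5x faster than other approaches and can run on a low-end GPU such as the Colab T4.

 It is designed for fine-tuning large language models (LLMs) such as LLaMA, Mistral, and Gemma in low-resource environments. Its main selling points are speed, low memory use, and ease of use.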

In [2]:

!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git@nightly git+https://github.com/unslothai/unsloth-zoo.git

Collecting unsloth
  Using cached unsloth-2025.5.5-py3-none-any.whl.metadata (46 kB)
Collecting unsloth_zoo>=2025.5.7 (from unsloth)
  Using cached unsloth_zoo-2025.5.7-py3-none-any.whl.metadata (8.0 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Using cached xformers-0.0.30-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Using cached bitsandbytes-0.45.5-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tyro (from unsloth)
  Using cached tyro-0.9.20-py3-none-any.whl.metadata (10 kB)
Collecting datasets>=3.4.1 (from unsloth)
  Using cached datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting trl!=0.15.0,!=0.9.0,!=0.9.1,!=0.9.2,!=0.9.3,<=0.15.2,>=0.7.9 (from unsloth)
  Using cached trl-0.15.2-py3-none-any.whl.metadata (11 kB)
Collecting protobuf<4.0.0 (from unsloth)
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023

Found existing installation: unsloth 2025.5.5
Uninstalling unsloth-2025.5.5:
  Successfully uninstalled unsloth-2025.5.5
Collecting git+https://github.com/unslothai/unsloth.git@nightly
  Cloning https://github.com/unslothai/unsloth.git (to revision nightly) to /tmp/pip-req-build-mc5_vjtk
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-req-build-mc5_vjtk
  Running command git checkout -b nightly --track origin/nightly
  Switched to a new branch 'nightly'
  Branch 'nightly' set up to track remote branch 'nightly' from 'origin'.
  Resolved https://github.com/unslothai/unsloth.git to commit 6a894cf92bcc55731a22f90372dd8d00245d7770
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting git+https://github.com/unslothai/unsloth-zoo.git
  Cloning https://github.com/unslothai/unsloth-zoo.git to /tmp/pip-req-build-

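Use the Unsloth framework to quickly load a pretrained Llama 3.2 model (or another model) with 4-bit quantization enabled, which saves VRAM and helps training and inference fit on a small GPU.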

In [3]:
from unsloth import FastLanguageModel
import torch


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


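These three parameters are important settings when loading a large language model (such as Llama 3.2) with Unsloth. They control the maximum input length, the numeric precision, and the memory optimization.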

In [4]:
max_seq_length = 2048 # maximum number of input tokens
dtype = None # None auto-detects: float16 on T4/V100, bfloat16 on Ampere and newer
load_in_4bit = True  # load the weights with 4-bit quantization to cut VRAM use substantially

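A 4-bit model is a heavily compressed, very-low-precision form of the model. Its weights take roughly 1/4 of the space of an ordinary FP16 model. The advantages:

✨ Much lower VRAM use (a single T4 can even hold a 13B model)

⚡ Faster downloads and loading, since the checkpoint is smaller; any inference speed-up depends on the hardware and kernels

🎯 Combines with LoRA for very lightweight fine-tuning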
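As a rough check of the 1/4 figure, the sketch below estimates weight storage only, as parameters × bits / 8. The helper `weight_gib` is hypothetical, and the estimate ignores quantization constants, activations, the KV cache, and optimizer state, so real usage will be higher.

```python
def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB (weights only, no overhead)."""
    return n_params * bits_per_param / 8 / 2**30

# Parameter counts are round numbers chosen for illustration.
for name, n_params in [("3B", 3e9), ("13B", 13e9)]:
    fp16 = weight_gib(n_params, 16)
    int4 = weight_gib(n_params, 4)
    print(f"{name}: fp16 ~ {fp16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Under this estimate, 13B parameters at 4 bits come to about 6 GiB of weights, which is why such a model can fit on a T4 while its FP16 form (about 24 GiB) cannot.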

In [6]:

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 2x faster
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # 4bit for 405b!
    "unsloth/Mistral-Small-Instruct-2409",     # Mistral 22b 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!

    "unsloth/Llama-3.2-1B-bnb-4bit",           # NEW! Llama 3.2 models
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",

    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit" # NEW! Llama 3.3 70B!
] # More models at https://huggingface.co/unsloth


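load_in_4bit=True controls whether the model is loaded in 4-bit.
When this switch is True, Unsloth tries to fetch a matching pre-quantized 4-bit checkpoint from Hugging Face automatically.

For "unsloth/Llama-3.2-3B-Instruct", a bnb-4bit version (backed by bitsandbytes) is published as "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", which appears in the list above. The 2.35 GB model.safetensors download below is consistent with 4-bit weights.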



In [7]:
# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.5.5: Fast Llama patching. Transformers: 4.51.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.3.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.30. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.35G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

First, test the pretrained model before any fine-tuning.

In [8]:
from transformers import TextStreamer # prints text as it is generated, streaming like ChatGPT

In [9]:

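# Input text
prompt = "I love China because"

# Encode to input tensors and move them to the model's device (CPU/GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# Generate; the model predicts one next token at a time
model.generate(
    **inputs,
    max_new_tokens=100,       # generate at most 100 new tokens
    do_sample=True,           # sample instead of decoding greedily
    top_p=0.9,                # nucleus sampling; controls diversity
    temperature=0.8,          # higher temperature means more randomness
    streamer=streamer         # stream the output in real time
)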


 of its rich history, diverse culture, and breathtaking natural beauty. From the majestic Great Wall to the stunning Great Barrier Reef, China has something to offer for every type of traveler. Here are some of the most amazing destinations in China:
**Natural Wonders**
1. **The Great Wall of China**: A UNESCO World Heritage Site, the Great Wall is an iconic symbol of China and one of the Seven Wonders of the Medieval World.
2. **Mount Everest**: The highest mountain in the world


tensor([[128000,     40,   3021,   5734,   1606,    315,   1202,   9257,   3925,
             11,  17226,   7829,     11,    323,  57192,   5933,  13444,     13,
           5659,    279,  81389,   8681,   9935,    311,    279,  20441,   8681,
          72087,  77036,     11,   5734,    706,   2555,    311,   3085,    369,
           1475,    955,    315,  63865,     13,   5810,    527,   1063,    315,
            279,   1455,   8056,  34205,    304,   5734,    512,    334,  55381,
            468,  28413,   1035,     16,     13,   3146,    791,   8681,   9935,
            315,   5734,  96618,    362,  81876,   4435,  34243,  13207,     11,
            279,   8681,   9935,    374,    459,  27373,   7891,    315,   5734,
            323,    832,    315,    279,  31048,    468,  28413,    315,    279,
          78248,   4435,    627,     17,     13,   3146,  16683,  87578,  96618,
            578,   8592,  16700,    304,    279,   1917]], device='cuda:0')
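The `temperature` and `top_p` settings in the generate call above can be sketched in plain Python. This is a simplified illustration of the idea, not the Hugging Face implementation; `sample_top_p` is a hypothetical helper operating on a small list of logits.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index with temperature scaling and nucleus (top-p) filtering."""
    # Temperature: divide logits before the softmax; higher values flatten the distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Nucleus: keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Draw one token from the kept set, weighted by probability.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

print(sample_top_p([2.0, 1.0, 0.5, -1.0]))
```

Lowering `top_p` trims more of the unlikely tail, while lowering `temperature` concentrates probability on the top tokens before that trimming happens.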

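Next comes LoRA fine-tuning.

The core step is configuring LoRA on the already loaded Llama model with Unsloth.

It helps to first understand the structure of Llama 3.2.

The model is a stack of N Transformer blocks, each with a self-attention module: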



```
self_attn = MultiHeadAttention(
    q_proj,   # query projection
    k_proj,   # key projection
    v_proj,   # value projection
    o_proj    # output projection
)

```

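Llama 3.2 uses grouped-query attention (GQA).

In GQA, several query heads share each key/value head, so k_proj and v_proj produce fewer heads than q_proj, which shrinks the KV cache and the cost of the K/V projections.

This is more memory-efficient than standard multi-head attention (MHA), where every query head has its own key/value head.

✅ LoRA insertion points: q_proj, k_proj, v_proj, o_proj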
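To make the sharing of key/value heads concrete, here is a small sketch. It assumes 24 query heads, 8 key/value heads, and a head dimension of 128; these numbers come from the published Llama 3.2 3B configuration rather than from this notebook, and `kv_head_for` is a hypothetical helper.

```python
N_HEADS = 24      # query heads (assumed, from the published 3B config)
N_KV_HEADS = 8    # key/value heads (assumed)
HEAD_DIM = 128    # per-head dimension (assumed)

def kv_head_for(query_head: int) -> int:
    """Index of the key/value head that a given query head attends with."""
    group_size = N_HEADS // N_KV_HEADS
    return query_head // group_size

q_out = N_HEADS * HEAD_DIM       # output width of q_proj
kv_out = N_KV_HEADS * HEAD_DIM   # output width of k_proj and v_proj
print(q_out, kv_out)
print([kv_head_for(h) for h in range(N_HEADS)])
```

With these assumed sizes, each group of three consecutive query heads reads the same key/value head, so k_proj and v_proj are one third as wide as q_proj.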






```
hidden = activation_fn(gate_proj(x)) * up_proj(x)   # SiLU-gated product
output = down_proj(hidden)

```

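This is the typical gated FFN structure, also known as SwiGLU:

gate_proj: gating branch, passed through the SiLU activation

up_proj: projects up to the intermediate dimension

down_proj: projects back down to the hidden dimension

✅ LoRA insertion points: gate_proj, up_proj, down_proj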
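The gated FFN can be written out as a small, dependency-free sketch. The toy sizes (hidden size 2, intermediate size 3) and the weight values are made up for illustration; real layers use learned matrices and tensor libraries.

```python
import math

def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def swiglu_mlp(x, w_gate, w_up, w_down):
    gate = [silu(g) for g in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                          # up-projection branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, hidden)                 # project back to hidden size

# Toy example: hidden size 2, intermediate size 3
x = [1.0, -2.0]
w_gate = [[0.5, 0.1], [0.0, 0.3], [-0.2, 0.4]]
w_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(swiglu_mlp(x, w_gate, w_up, w_down))
```

Note that both gate_proj and up_proj read the same input x; the gate decides, per intermediate unit, how much of the up-projected signal passes through.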

In [10]:
model = FastLanguageModel.get_peft_model(
    model,
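    r = 16, # LoRA rank; controls the number of trainable parameters
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",], # modules that receive LoRA adapters
    lora_alpha = 16, # LoRA scaling factor; the update is scaled by lora_alpha / r
    lora_dropout = 0, # 0 is the setting Unsloth optimizes for
    bias = "none",
    use_gradient_checkpointing = "unsloth", # gradient checkpointing to save VRAM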
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2025.5.5 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
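The number of trainable LoRA parameters follows from the rank and the shapes of the targeted layers: each adapted layer of shape d_in × d_out adds r × (d_in + d_out) parameters. The sketch below assumes a hidden size of 3072, an intermediate size of 8192, and 8 key/value heads of dimension 128; these values come from the published Llama 3.2 3B configuration, not from this notebook.

```python
HIDDEN = 3072          # assumed hidden size
INTERMEDIATE = 8192    # assumed FFN intermediate size
KV_DIM = 8 * 128       # assumed: 8 key/value heads x head dimension 128
N_LAYERS = 28          # matches the "patched 28 layers" message above
R = 16                 # LoRA rank used in get_peft_model

# (input width, output width) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
print(f"{total:,}")
```

Under these assumptions the total is 24,313,856, which agrees with the trainable-parameter count in the training banner later in this notebook. With lora_alpha = 16 and r = 16, the adapter update is scaled by lora_alpha / r = 1.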


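Preparing the dataset:

We use Maxime Labonne's FineTome-100k dataset, a conversation dataset originally in ShareGPT style.

Llama 3 marks the user and assistant turns of a conversation with special tokens such as <|begin_of_text|>, <|start_header_id|>, and <|eot_id|>.

The text format that Llama 3.2 expects: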



```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi! How can I help?<|eot_id|><|start_header_id|>user<|end_header_id|>

What's the weather?<|eot_id|>

```
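The layout above can be reproduced with a few lines of string formatting. This is a simplified sketch: `render_llama3` is a hypothetical helper, and the real chat template also inserts a system block with date information, as the `dataset[1]["text"]` output further down shows. Each header is followed by two newline characters, matching the `instruction_part` and `response_part` strings used later in this notebook.

```python
def render_llama3(messages, add_generation_prompt=False):
    """Render role/content messages in the Llama 3 chat layout (simplified)."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

print(render_llama3([
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help?"},
]))
```

For training, the assistant turns are already in the data, so no generation prompt is appended; for inference, `add_generation_prompt=True` leaves the text ending in an open assistant header.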



In [12]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
# get_chat_template attaches the named chat template (here "llama-3.1") to the tokenizer, so that apply_chat_template formats conversations in Llama 3.1 style.

In [14]:
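from datasets import load_dataset

def formatting_prompts_func(examples):
    """Render each conversation in a batch to a single training string."""
    convos = examples["conversations"]
    # tokenize=False returns text; add_generation_prompt=False because the
    # assistant replies are already part of the training data
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

dataset = load_dataset("mlabonne/FineTome-100k", split="train")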


README.md:   0%|          | 0.00/982 [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/117M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/100000 [00:00<?, ? examples/s]

```
{"from": "system", "value": "You are an assistant"}
{"from": "human", "value": "What is 2+2?"}
{"from": "gpt", "value": "It's 4."}
```
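standardize_sharegpt converts ShareGPT-style records like the ones above into the role/content format: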
```
{"role": "system", "content": "You are an assistant"}
{"role": "user", "content": "What is 2+2?"}
{"role": "assistant", "content": "It's 4."}
```
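The key renaming shown above amounts to a small mapping. The sketch below is only an illustration of that mapping, not Unsloth's implementation of `standardize_sharegpt`; `ROLE_MAP` and `to_role_content` are hypothetical names.

```python
# ShareGPT speaker names mapped to chat-template role names
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(turns):
    """Convert ShareGPT-style turns ("from"/"value") to "role"/"content" dicts."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in turns]

sharegpt = [
    {"from": "system", "value": "You are an assistant"},
    {"from": "human", "value": "What is 2+2?"},
    {"from": "gpt", "value": "It's 4."},
]
for message in to_role_content(sharegpt):
    print(message)
```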

The next cell standardizes the dataset and converts the entire training set into Llama-3.1-compatible text.

In [15]:
from unsloth.chat_templates import standardize_sharegpt
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)

Unsloth: Standardizing formats (num_proc=8):   0%|          | 0/100000 [00:00<?, ? examples/s]

Map:   0%|          | 0/100000 [00:00<?, ? examples/s]

In [16]:
dataset[1]["conversations"]

[{'content': 'Explain how recursion works and provide a recursive function in Python that calculates the factorial of a given number.',
  'role': 'user'},
 {'content': "Recursion is a programming technique where a function calls itself to solve a problem. It breaks down a complex problem into smaller, more manageable subproblems until a base case is reached. The base case is a condition where the function does not call itself, but instead returns a specific value or performs a specific action.\n\nIn the case of calculating the factorial of a number, recursion can be used to break down the problem into simpler subproblems. The factorial of a non-negative integer n is the product of all positive integers less than or equal to n.\n\nHere is a recursive function in Python that calculates the factorial of a given number:\n\n```python\ndef factorial(n):\n    # Base case: factorial of 0 or 1 is 1\n    if n == 0 or n == 1:\n        return 1\n    # Recursive case: factorial of n is n multiplied

In [17]:
dataset[1]["text"]

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nExplain how recursion works and provide a recursive function in Python that calculates the factorial of a given number.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nRecursion is a programming technique where a function calls itself to solve a problem. It breaks down a complex problem into smaller, more manageable subproblems until a base case is reached. The base case is a condition where the function does not call itself, but instead returns a specific value or performs a specific action.\n\nIn the case of calculating the factorial of a number, recursion can be used to break down the problem into simpler subproblems. The factorial of a non-negative integer n is the product of all positive integers less than or equal to n.\n\nHere is a recursive function in Python that calculates the factori

To summarize the steps above, a conversation such as
```
{
  "conversations": [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you?"},
    {"role": "user", "content": "What's the weather today?"}
  ]
}

```
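is rendered as the following text (simplified; the actual template also prepends the system block visible in the output above):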



```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi! How can I help you?<|eot_id|><|start_header_id|>user<|end_header_id|>

What's the weather today?<|eot_id|>

```




# Start training

We train the model with SFTTrainer (Supervised Fine-Tuning Trainer) from the Hugging Face TRL library.

In [18]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported


In [25]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,   # start with 512 to avoid memory and compile pressure from long sequences
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
    dataset_num_proc=1,   # single-GPU Colab; more processes can actually be slower
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=1,     # small batch size to avoid OOM or compile failures
        gradient_accumulation_steps=8,    # effective batch size 1 x 8 = 8 without extra VRAM
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        fp16=False,   # fp16 mixed precision disabled to avoid Triton compile problems
        bf16=False,   # the T4 used here does not support bf16
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)


Unsloth: Tokenizing ["text"]:   0%|          | 0/100000 [00:00<?, ? examples/s]
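Two consequences of the arguments above are worth working out: how the learning rate evolves under `lr_scheduler_type="linear"` with 5 warmup steps, and how many examples the run actually sees. The sketch below follows the usual linear-warmup, linear-decay rule; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (max_steps - step) / (max_steps - warmup_steps))

# per_device_train_batch_size x gradient_accumulation_steps (single GPU)
effective_batch = 1 * 8
# examples consumed over max_steps optimizer steps
examples_seen = effective_batch * 60

for step in (0, 1, 5, 30, 60):
    print(step, linear_lr(step))
print(effective_batch, examples_seen)
```

With 60 steps at an effective batch size of 8, the run touches 480 of the 100,000 examples, so it is a short demonstration rather than a full pass over the dataset.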

Next, adjust the trainer so that only the assistant replies in each conversation are supervised; the user inputs are ignored in the loss.

In [20]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=8):   0%|          | 0/100000 [00:00<?, ? examples/s]

Replace every -100 with the space token so that the labels decode without errors and you can see which part of the text is actually supervised.

In [21]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

'                                                                  Astronomers make use of the unique spectral fingerprints of elements found in stars. These elements emit and absorb light at specific, known wavelengths, forming an absorption spectrum. By analyzing the light received from distant stars and comparing it to the laboratory-measured spectra of these elements, astronomers can identify the shifts in these wavelengths due to the Doppler effect. The observed shift tells them the extent to which the light has been redshifted or blueshifted, thereby allowing them to calculate the speed of the star along the line of sight relative to Earth.<|eot_id|>'
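The run of spaces at the start of the decoded string corresponds to label positions set to -100, the index that the cross-entropy loss ignores. A minimal sketch of that masking, assuming we already know which positions belong to the assistant reply (`mask_non_response` and `mean_loss` are hypothetical helpers, not Unsloth functions):

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_non_response(token_ids, is_response):
    """Keep labels only where the token belongs to an assistant reply."""
    return [t if keep else IGNORE_INDEX for t, keep in zip(token_ids, is_response)]

def mean_loss(per_token_losses, labels):
    """Average the per-token loss over the positions that are not ignored."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0

token_ids = [11, 12, 13, 14]                 # toy ids: two prompt tokens, two reply tokens
is_response = [False, False, True, True]
labels = mask_non_response(token_ids, is_response)
print(labels)
print(mean_loss([5.0, 5.0, 1.0, 3.0], labels))
```

Only the reply tokens contribute to the average, so the model is not trained to reproduce the user's prompt.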

In [23]:
import os
os.environ["TRITON_DISABLE"] = "1"


In [26]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0


In [27]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100,000 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 8
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 8 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/3,000,000,000 (0.81% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.586
2,1.4007
3,1.9479
4,1.47
5,1.5574
6,1.2894
7,1.3761
8,1.4651
9,1.2972
10,0.9526


Finally, we test the fine-tuned model.

In [30]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # enable Unsloth's faster inference mode
messages = [
    {"role": "user", "content": "can you tell me something about the company Apple?"},
] # note: the conversation format must be correct
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # append the assistant header so the model starts its reply
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ncan you tell me something about the company Apple?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nApple is an American multinational technology company that was founded on April 1, 1976. It's one of the world's leading companies and has made significant impacts on the technology sector.<|eot_id|>"]