To run this notebook, press "*Runtime*" and then "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).

### News

**NEW** Unsloth now supports training the new **gpt-oss** model from OpenAI! You can start fine-tuning gpt-oss for free with our **[Colab notebook](https://x.com/UnslothAI/status/1953896997867729075)**!

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Read our **[Gemma 3N guide](https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-tune)** and check out our new **[Dynamic 2.0](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs)** quantization method, which outperforms other quantization methods!

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).

### Installation

In [None]:
%%capture
# We're installing the latest Torch, Triton, OpenAI's Triton kernels, Transformers and Unsloth!
!pip install --upgrade -qqq uv
try: import numpy; install_numpy = f"numpy=={numpy.__version__}"
except: install_numpy = "numpy"
!uv pip install -qqq \
    "torch>=2.8.0" "triton>=3.4.0" {install_numpy} \
    "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
    "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
    torchvision bitsandbytes \
    git+https://github.com/huggingface/transformers \
    git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels


### Unsloth

We're about to demonstrate the power of the new OpenAI GPT-OSS 20B model through an inference example. For our `MXFP4` version, use this [notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/GPT_OSS_MXFP4_(20B)-Inference.ipynb).

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 4096
dtype = None

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-unsloth-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = True,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Llama patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


We now add LoRA adapters for parameter-efficient fine-tuning. This means we only need to train about 1% of all parameters.

In [None]:
# Add LoRA adapters for parameter efficient finetuning
model = FastLanguageModel.get_peft_model(
    model,
    r = 8, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
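To make the effect of `r` and `lora_alpha` more concrete, here is a minimal arithmetic sketch. It assumes the standard LoRA formulation, where a frozen `d_out x d_in` weight gets two small trainable matrices of shapes `r x d_in` and `d_out x r`. The helper names and the 4096 x 4096 layer size are hypothetical and are not taken from gpt-oss.

```python
import math

def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

def lora_scaling(lora_alpha, r, use_rslora=False):
    """Scale applied to the low-rank update: alpha/r, or alpha/sqrt(r) for rank-stabilized LoRA."""
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r

# Hypothetical 4096 x 4096 projection, used only for illustration
d_in, d_out, r = 4096, 4096, 8
added = lora_param_count(d_in, d_out, r)
print(added)                                   # 65536
print(f"{added / (d_in * d_out):.4%} of the frozen matrix")
print(lora_scaling(lora_alpha=16, r=8))        # 2.0
```

Doubling `r` doubles the number of trainable parameters per adapted layer, while `lora_alpha / r` controls how strongly the learned update is weighted.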

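### Reasoning Effort
The `gpt-oss` models from OpenAI include a feature that lets users adjust the model's "reasoning effort". This gives you control over the trade-off between the model's performance and its response speed (latency), which depends on how many tokens the model uses to think.

----

The `gpt-oss` models offer three levels of reasoning effort to choose from:

* **Low**: Optimized for tasks that need very fast responses and don't require complex, multi-step reasoning.
* **Medium**: A balance between performance and speed.
* **High**: Provides the strongest reasoning performance for tasks that require it, at the cost of higher latency.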
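Because a higher reasoning effort produces more thinking tokens, the generation budget has to grow with it. The sketch below pairs each level with a `max_new_tokens` value. The budgets (512, 1024, 2048) are the ones used in this notebook's test cells, not official recommendations, and `generation_settings` is a hypothetical helper.

```python
# Token budgets mirroring the values used in this notebook (an assumption, not a rule)
TOKEN_BUDGETS = {"low": 512, "medium": 1024, "high": 2048}

def generation_settings(reasoning_effort):
    """Return the chat-template and generate() settings for one effort level."""
    if reasoning_effort not in TOKEN_BUDGETS:
        raise ValueError(f"reasoning_effort must be one of {sorted(TOKEN_BUDGETS)}")
    return {
        "reasoning_effort": reasoning_effort,            # passed to apply_chat_template
        "max_new_tokens": TOKEN_BUDGETS[reasoning_effort],  # passed to model.generate
    }

for level in ("low", "medium", "high"):
    print(generation_settings(level))
```

If the output is cut off before the final answer appears, the budget for that level is probably too small for the prompt.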

In [None]:
from transformers import TextStreamer

# Test with low reasoning effort
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

# Generate the response with streaming output
_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

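Changing `reasoning_effort` to `medium` makes the model think longer. We have to increase `max_new_tokens` to accommodate the extra generated tokens, but the answers are generally better and more accurate.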

In [None]:
from transformers import TextStreamer

# Test with medium reasoning effort
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

# Generate the response with more tokens for deeper reasoning
_ = model.generate(**inputs, max_new_tokens = 1024, streamer = TextStreamer(tokenizer))

Finally, we test with `reasoning_effort` set to `high`.

In [None]:
from transformers import TextStreamer

# Test with high reasoning effort
messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "high", # **NEW!** Set reasoning effort to low, medium or high
).to(model.device)

# Generate the response with the largest token budget for the deepest reasoning
_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

<a name="Data"></a>
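### Data Prep

We will use the `HuggingFaceH4/Multilingual-Thinking` dataset as our example. This dataset, available on Hugging Face, contains reasoning chain-of-thought examples derived from user questions that have been translated from English into four other languages. It is also the same dataset referenced in OpenAI's fine-tuning [cookbook](https://cookbook.openai.com/articles/gpt-oss/fine-tune-transfomers). The purpose of using this dataset is to enable the model to learn and develop reasoning capabilities in these four languages.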

In [None]:
# Define a function that renders each conversation into a training text
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }
pass

from datasets import load_dataset

# Load the multilingual thinking dataset
dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset

To format our dataset, we will apply our version of the GPT OSS prompt.

In [None]:
from unsloth.chat_templates import standardize_sharegpt
# Standardize the dataset format and apply the prompt formatting
dataset = standardize_sharegpt(dataset)
dataset = dataset.map(formatting_prompts_func, batched = True,)
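The batched `map` call hands the formatting function a dictionary of lists and expects a dictionary of lists back. The sketch below shows that contract without loading a model. `toy_template` is a hypothetical stand-in for `tokenizer.apply_chat_template(convo, tokenize = False)`, and its bracketed output format is made up for illustration.

```python
def toy_template(convo):
    """Hypothetical stand-in for the tokenizer's chat template (format is invented)."""
    return "".join(f"[{m['role']}] {m['content']}\n" for m in convo)

def format_batch(examples):
    """Same shape as formatting_prompts_func: one rendered string per conversation."""
    return {"text": [toy_template(convo) for convo in examples["messages"]]}

# A batch as `map(..., batched=True)` would pass it: column name -> list of rows
batch = {
    "messages": [
        [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}],
        [{"role": "user", "content": "2 + 2?"}, {"role": "assistant", "content": "4"}],
    ]
}
formatted = format_batch(batch)
print(len(formatted["text"]))
print(formatted["text"][0])
```

The returned `text` list must have the same length as the input batch, since `map` adds it as a new column alongside `messages`.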

Let's take a look at the dataset and check what the first example shows.

In [None]:
print(dataset[0]['text'])

What is unique about GPT-OSS is that it uses the OpenAI [Harmony](https://github.com/openai/harmony) format, which supports conversation structure, reasoning output, and tool calling.
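As a rough illustration of that structure, the sketch below splits a Harmony-style string into messages and picks out the final answer. It assumes the basic layout `<|start|>ROLE<|channel|>CHANNEL<|message|>CONTENT<|end|>`, with the channel being optional and the last assistant turn ending in `<|return|>`. The regex is deliberately simplified (it ignores tool-call headers, for example), and `parse_harmony` and `final_answer` are hypothetical helpers, not part of any library.

```python
import re

# Simplified pattern; real Harmony headers can carry more fields than role and channel
_HARMONY_MESSAGE = re.compile(
    r"<\|start\|>(?P<role>[^<]+?)"
    r"(?:<\|channel\|>(?P<channel>[^<]+?))?"
    r"<\|message\|>(?P<content>.*?)"
    r"(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)

def parse_harmony(text):
    """Split a Harmony-style string into role/channel/content dictionaries."""
    return [
        {"role": m["role"], "channel": m["channel"], "content": m["content"]}
        for m in _HARMONY_MESSAGE.finditer(text)
    ]

def final_answer(text):
    """Return the content of the last message on the 'final' channel, if any."""
    finals = [m["content"] for m in parse_harmony(text) if m["channel"] == "final"]
    return finals[-1] if finals else None

sample = (
    "<|start|>user<|message|>Solve x = 1.<|end|>"
    "<|start|>assistant<|channel|>analysis<|message|>Trivial.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>x = 1<|return|>"
)
for message in parse_harmony(sample):
    print(message)
print(final_answer(sample))
```

Separating the `analysis` channel from the `final` channel is what lets the model's reasoning be shown, hidden, or logged independently of the answer.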

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We run 60 steps to keep things fast, but for a full run you can set `num_train_epochs=1` and turn off the step limit with `max_steps=None`.

In [None]:
from trl import SFTConfig, SFTTrainer
from transformers import DataCollatorForSeq2Seq

# Set up the trainer for supervised fine-tuning
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
)
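The numbers in this configuration interact: the batch size and gradient accumulation set how many sequences feed each optimizer step, and the warmup plus linear scheduler set the learning rate at each step. The sketch below works through that arithmetic. It assumes a single GPU and the usual linear-warmup, linear-decay rule; the helper names are hypothetical.

```python
def effective_batch_size(per_device, grad_accum, num_devices=1):
    """Sequences contributing to one optimizer step."""
    return per_device * grad_accum * num_devices

def linear_lr(step, peak_lr=2e-4, warmup_steps=5, max_steps=60):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / max(1, max_steps - warmup_steps)

batch = effective_batch_size(per_device=4, grad_accum=4)
print(batch)        # 16
print(batch * 60)   # 960 sequences processed over 60 steps
for step in (0, 1, 5, 30, 60):
    print(step, linear_lr(step))
```

With these settings the run covers 960 sequences in total, so raising `max_steps` or switching to `num_train_epochs` is what turns the quick demo into a full pass over the data.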

In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
# Start training
trainer_stats = trainer.train()

In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the system prompt and the user message below.

In [None]:
# Test the fine-tuned model with a different reasoning language
messages = [
    {"role": "system", "content": "reasoning language: French\n\nYou are a helpful assistant that can solve mathematical problems."},
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "medium",
).to(model.device)
from transformers import TextStreamer
# Generate a response to test the multilingual reasoning
_ = model.generate(**inputs, max_new_tokens = 2048, streamer = TextStreamer(tokenizer))

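And we're done! If you have any questions about Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs, want to keep up with the latest LLM news, need help, or want to join projects, feel free to join our Discord!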

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks) for more notebooks on DPO, ORPO, continued pretraining, conversational finetuning, and more!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>