在这个笔记本中，将演示如何使用 Unsloth 对 DeepSeek-R1-Distill-Llama-8B 进行微调，使用的医疗数据集。

### 为什么我们需要对大语言模型（LLM）进行微调？
微调可以让模型在特定任务上表现得更好，从而在实际应用中更加高效和多样化。这个过程对于将已有模型定制为适应特定任务或领域是非常关键的。

In [None]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
!pip install bitsandbytes unsloth_zoo


## Choose a Base Model

1. Choose a model that aligns with your usecase
2. Assess your storage, compute capacity and dataset
3. Select a Model and Parameters
4. Choose Between Base and Instruct Models

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "/root/autodl-tmp/models/DeepSeek", #https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.50.0. vLLM: 0.8.1.
   \\   /|    NVIDIA GeForce RTX 4090 D. Num GPUs = 1. Max memory: 23.643 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

In [51]:
model


PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
        (layers): ModuleList(
          (0-31): 32 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=4096, out_features=4096, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=4096, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=4096, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lor

## 在微调前推理测试一下

In [69]:
prompt_style = """以下是一条描述任务的指令，以及提供更多上下文的信息。
请撰写一个恰当的回答来完成该请求。
在作答前，请仔细思考问题，并构建清晰的逐步推理链，以确保逻辑严谨、回答准确。

### Instruction:
你是一名具有丰富经验的宝可梦世界对话问答专家
请回答以下问题。

### Question:
{}

### Response:
<think>{}"""

In [53]:
question = "一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？"


FastLanguageModel.for_inference(model)
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=8192,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>

</think>
首先需要明确急性阑尾炎的常见表现与病理特征：1. 皮肤发热（37.5-39℃），伴随腹痛（通常在左下腹，但病情进展可能向右下移位），伴随发热、恶心、呕吐等症状。2. 体检发现右下腹有压痛的包块，提示存在明显的膈膜包块或盆腔包块等实体病变。3. 病史为5天急性发作，若未处理，可能导致感染性并发症（如脓肿、多器官功能衰竭等）。4. 病情稳定期（腹痛缓解但仍有发热）时，应避免过度刺激或误诊为急性阑尾炎（需区分阑尾炎的急性发作与包块性阑尾炎）。5. 对于包块性阑尾炎的处理原则：（a）急性发作期需控制感染（如抗生素治疗），同时保护膈膜包块；（b）若有明显膈膜包块，需在稳定期行包块引流术（需明确包块的病理性质，如是否为脓肿）；（c）若病情稳定且无包块，需考虑盆腔包块的可能性（需进一步影像学检查）。6. 对于包块性阑尾炎的抗生素选择：需根据病原菌敏感性结果选择敏感的药物（如第三代青霉素、莫匹罗星、头孢告比）；若有疑虑包块性病变，需进行盆腔镜检查以明确病情。7. 若有明显膈膜包块且病情稳定，应在包块性阑尾炎稳定期进行包块引流术，同时注意：（a）包块引流术需在包块性阑尾炎稳定期进行，且需有经验的肝脏、心脏、肾脏等系统的支持；（b）包块引流术后需密切观察病情变化，防止二次感染或并发症；（c）若有疑虑包块性病变的严重程度，需结合病史、体征、影像学检查结果进行综合判断。8. 对于包块性阑尾炎的影像学检查：（a）急性发作期需进行超声检查（B超）以明确包块的存在；（b）若有膈膜包块，需进行盆腔镜检查以明确包块的病理性质（如是否为脓肿）；（c）若病情稳定且有包块，需进一步进行CT或MRI检查以评估包块的大小、位置和周围组织关系。9. 对于包块性阑尾炎的治疗步骤：（a）急性发作期需及时处理包块（如包块引流术）；（b）若病情稳定且有包块，需在稳定期进行包块引流术；（c）若无包块，需考虑盆腔包块的可能性（需进一步影像学检查）；（d）若病情稳定且无包块，需考虑是否有脓肿或感染性病变（需进行盆腔镜检查）。<｜end▁of▁sentence｜>


In [54]:
#训练

## Prepare Dataset

A medical dataset [https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/](https://huggingface.co/datasets/FreedomIntelligence/medical-o1-reasoning-SFT/) will be used to train the selected model.

In [55]:
train_prompt_style = """以下是一条描述任务的指令，以及提供更多上下文的信息。
请撰写一个恰当的回答来完成该请求。
在作答前，请仔细思考问题，并构建清晰的逐步推理链，以确保逻辑严谨、回答准确。

### Instruction:
你是一名具有丰富经验的宝可梦世界对话问答专家
请回答以下问题。

### Question:
{}

### Response:
<think>
{}
</think>
{}"""

### Important Notice
务必在每条训练数据的末尾添加 EOS（序列结束）标记，否则可能会导致模型生成内容时出现无限循环的问题。

In [56]:
import re

EOS_TOKEN = tokenizer.eos_token  # 一般是 "<|endoftext|>" 或 "<eos>"

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]

    texts = []

    for inst, inp, out in zip(instructions, inputs, outputs):
        # 从 output 中提取 <think> 思考链 和 回答正文
        match = re.match(r"<think>\s*(.*?)\s*</think>\s*(.*)", out, re.DOTALL)
        if match:
            cot, final_answer = match.groups()
        else:
            cot, final_answer = "", out  # fallback

        text = train_prompt_style.format(inst, inp, cot.strip(), final_answer.strip()) + EOS_TOKEN
        texts.append(text)

    return {"text": texts}


In [57]:
from datasets import load_dataset

dataset = load_dataset("json", data_files="output.json")

print(dataset.column_names)


{'train': ['instruction', 'input', 'output', 'history']}


In [58]:
dataset["train"][0]

{'instruction': '一起来看看玩pokemon go',
 'input': '',
 'output': '<think>\\n1. 首先需要明确用户的核心诉求：了解Pokemon Go的基本玩法与入坑指南\\n2. 分析原回答信息点：\\n   - 游戏类型：基于LBS的AR手游，支持安卓/iOS\\n   - 安装方式：安卓直接安装，iOS需外区账号\\n   - 特色系统：精灵永久保值、可交易系统、Mega进化等\\n   - 联动功能：支持与Pokemon Home数据互通\\n   - 国内特殊需求：需要解决网络和定位问题\\n3. 需要补充结构化信息：\\n   - 区分安卓/iOS安装指引\\n   - 说明国内玩家注意事项\\n   - 规范术语表达（如「海鲜市场」应说明是二手平台）\\n4. 优化逻辑顺序：从基础认知到具体操作，最后补充特色玩法\\n5. 移除口语化表达（如「起飞！！」「好玩咩？」），保持专业但易懂\\n</think>\\n《Pokemon Go》是一款基于地理位置服务的AR手游，以下是详细指南：\\n\\n【基础信息】\\n▶ 平台支持：安卓可直接安装APK，iOS需切换外区App Store账号下载\\n▶ 国内须知：需VPN支持定位权限，建议关闭「精确位置」保护隐私\\n\\n【核心玩法】\\n✓ 现实地图捕捉：通过GPS定位在真实场景捕捉宝可梦\\n✓ 精灵永久保值：所有精灵可传送至《Pokemon Home》与正作联动\\n✓ 交易系统：支持玩家间精灵交易（注意：官方仅限本地交易，第三方平台存在风险）\\n✓ 特殊进化：包含Mega进化和原始回归等独特机制\\n\\n【入坑建议】\\n① 先体验基础捕捉和道馆对战玩法\\n② 加入本地玩家社群获取补给点地图\\n③ 使用防水手机支架保障户外游玩安全\\n④ 首次登录建议关闭AR模式节省电量',
 'history': []}

为了使 Ollama 和 llama.cpp 像一个自定义的 ChatGPT 聊天机器人那样工作，我们的数据集必须只包含两列：一列是 instruction，一列是 output。因此，我们需要将数据集转换为正确的结构。

In [59]:
dataset = dataset.map(formatting_prompts_func, batched = True)

In [60]:
# 先取出 train
train_dataset = dataset["train"]

# 查看第一个样本的 text 字段
print(train_dataset[0]["text"])


以下是一条描述任务的指令，以及提供更多上下文的信息。
请撰写一个恰当的回答来完成该请求。
在作答前，请仔细思考问题，并构建清晰的逐步推理链，以确保逻辑严谨、回答准确。

### Instruction:
你是一名具有丰富经验的宝可梦世界对话问答专家
请回答以下问题。

### Question:
一起来看看玩pokemon go

### Response:
<think>

</think>
\n1. 首先需要明确用户的核心诉求：了解Pokemon Go的基本玩法与入坑指南\n2. 分析原回答信息点：\n   - 游戏类型：基于LBS的AR手游，支持安卓/iOS\n   - 安装方式：安卓直接安装，iOS需外区账号\n   - 特色系统：精灵永久保值、可交易系统、Mega进化等\n   - 联动功能：支持与Pokemon Home数据互通\n   - 国内特殊需求：需要解决网络和定位问题\n3. 需要补充结构化信息：\n   - 区分安卓/iOS安装指引\n   - 说明国内玩家注意事项\n   - 规范术语表达（如「海鲜市场」应说明是二手平台）\n4. 优化逻辑顺序：从基础认知到具体操作，最后补充特色玩法\n5. 移除口语化表达（如「起飞！！」「好玩咩？」），保持专业但易懂\n<｜end▁of▁sentence｜>


In [61]:
dataset

DatasetDict({
    train: Dataset({
        features: ['instruction', 'input', 'output', 'history', 'text'],
        num_rows: 1933
    })
})

In [62]:
train_dataset = dataset["train"]

In [63]:
train_dataset

Dataset({
    features: ['instruction', 'input', 'output', 'history', 'text'],
    num_rows: 1933
})

## Train the model
Now let's use Huggingface TRL's `SFTTrainer`.

In [64]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth: Already have LoRA adapters! We shall skip this step.


In [65]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 4,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 200,
        num_train_epochs = 10, # For longer training runs!
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=4):   0%|          | 0/1933 [00:00<?, ? examples/s]

In [66]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,933 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 41,943,040/4,582,543,360 (0.92% trained)


Step,Training Loss
1,1.5207
2,1.27
3,1.3595
4,1.2937
5,1.5723
6,1.1837
7,1.4509
8,1.3305
9,1.2439
10,1.5054


## Inference after fine-tuning

Let's inference with same question again and see the difference.

In [67]:
print(question)

一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？


In [70]:
question='对于皮卡丘你有什么好建议的？'

In [71]:
FastLanguageModel.for_inference(model)  # Unsloth has 2x faster inference!
inputs = tokenizer([prompt_style.format(question, "")], return_tensors="pt").to("cuda")

outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=1200,
    use_cache=True,
)
response = tokenizer.batch_decode(outputs)
print(response[0].split("### Response:")[1])


<think>

</think>
首先需要明确用户的核心需求是寻找皮卡丘的训练策略改进方向。原始回答中存在三个要点：1）建议携带电磁波之石 2）强调电磁波的威力 3）强调电磁波对战的高效性。分析表明：1）电磁波之石是提升皮卡丘电系属性的重要道具；2）电磁波作为特殊攻击具有克制特性；3）电磁波技能的普及性在对战中具有战略价值。需要将这些要素整合成逻辑连贯的建议，同时补充一些训练基础知识：例如电磁波的学习条件（需携带电磁波之石并持有电磁波技能）以及推荐的技能组合（如电磁波+电磁波+电磁波）。<｜end▁of▁sentence｜>


## Upload Model to HuggingFace

Now, let's save our finetuned model and upload it to HuggingFace.

### Save the fine-tuned model to GGUF format

Choose the llama.cpp's GGUF format we prefer by setting the corresponding `if` to `True`.

In [17]:
import os
HUGGINGFACE_TOKEN = os.getenv("token")


In [72]:
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model_f16", tokenizer, quantization_method = "f16")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 789.11 out of 1007.54 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:00<00:00, 44.76it/s]


Unsloth: Saving tokenizer... Done.
Done.


Unsloth: Converting llama model. Can use fast conversion = False.


==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at model into q8_0 GGUF format.
The output location will be /root/autodl-tmp/model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:gguf: context length = 131072
INFO:hf-to-gguf:gguf: embedding length = 4096
INFO:hf-to-gguf:gguf: feed forward length = 14336
INFO:hf-to-gguf:gguf: head count = 32
INFO:hf-to-gguf:gguf: key-value head count = 8
INFO:hf-to-gguf:gguf: rope theta = 500000.0
INFO:hf-to-gguf:gguf: 

### Push the model to HuggingFace

Create a model type repository for your model if you haven't done so.

In [19]:
from huggingface_hub import create_repo
create_repo("qwqqwq/medical-model", token=HUGGINGFACE_TOKEN, exist_ok=True)

ConnectTimeout: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/repos/create (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fef2978a320>, 'Connection to huggingface.co timed out. (connect timeout=None)'))"), '(Request ID: f3fa6279-2c18-48a2-b39d-98695ec69b33)')

In [None]:
model.push_to_hub_gguf("qwqqwq/medical-model", tokenizer, token = HUGGINGFACE_TOKEN)

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### Your chat template has a BOS token. We shall remove it temporarily.


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 33.04 out of 52.96 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:02<00:00, 13.40it/s]


Unsloth: Saving tokenizer... Done.
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp might take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits might take 3 minutes.
\        /    [2] Converting GGUF 16bits to ['q8_0'] might take 10 minutes each.
 "-____-"     In total, you will have to wait at least 16 minutes.

Unsloth: Installing llama.cpp. This might take 3 minutes...
Unsloth: [1] Converting model at wyang14/medical-model into q8_0 GGUF format.
The output location will be /content/wyang14/medical-model/unsloth.Q8_0.gguf
This might take 3 minutes...
INFO:hf-to-gguf:Loading model: medical-model
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {64}
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-0

  0%|          | 0/1 [00:00<?, ?it/s]

unsloth.Q8_0.gguf:   0%|          | 0.00/8.54G [00:00<?, ?B/s]

Unsloth: ##### The current model auto adds a BOS token.
Unsloth: ##### We removed it in GGUF's chat template for you.


Saved GGUF to https://huggingface.co/wyang14/medical-model


### Ollama run HuggingFace model

```bash
ollama run hf.co/{username}/{repository}:{quantization}
```

### Ollama inference

```bash
curl http://localhost:11434/api/chat -d '{ \
  "model": "", \
  "messages": [ \
    { "role": "user", "content": "一个患有急性阑尾炎的病人已经发病5天，腹痛稍有减轻但仍然发热，在体检时发现右下腹有压痛的包块，此时应如何处理？" } \
  }'
```