## LLaMA 2 Instruction Fine-Tuning (Alpaca-Style on the Dolly-15K Dataset)

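Key training elements of the example code:
- Uses the Dolly-15K dataset, formatted as Alpaca-style instruction data
- Loads the `LLaMA 2-7B` model with 4-bit (NF4) quantization
- Trains with QLoRA using `bf16` mixed precision
- Uses `SFTTrainer` from Hugging Face `TRL` for supervised instruction fine-tuning
- Uses Flash Attention to speed up training (requires hardware support)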

### Download the databricks-dolly-15k Dataset

In [1]:
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer, SFTConfig  # SFTConfig holds the training arguments
import datetime

# 1. Load the dataset from a local JSONL file
dataset = load_dataset(
    "json",
    data_files="/root/autodl-tmp/dir/databricks-dolly-15k.jsonl",
    split="train"
)

In [2]:
# Total number of examples in the dataset: 15011
dataset

Dataset({
    features: ['instruction', 'context', 'response', 'category'],
    num_rows: 15011
})

In [4]:
# Print one randomly selected example
from random import randrange
print(dataset[randrange(len(dataset))])

{'instruction': 'What Does the Referee Do?', 'context': '', 'response': "In a game of soccer, the referee is the appointed person of authority on the field for the players. Football referee responsibilities include presiding over the game and officiating with a neutral point of view. The chief role centers around making 'on-the-fly' decisions.\nfootball referee must enforce all the rules of infringements and sanctions in the sport while maintaining fairness and safety.", 'category': 'general_qa'}


### Format the Instruction Data in Alpaca Style

`Alpaca-style` format: https://github.com/tatsu-lab/stanford_alpaca#data-release

In [5]:
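# 2. Formatting function: builds the prompt and truncates it to 2048 tokens
# Note: relies on the global `tokenizer` created in the model-loading cell below
def format_instruction(sample):
    text = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.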

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""
    encoded = tokenizer(
        text, 
        truncation=True, 
        max_length=2048,
        padding=False
    )
    return tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
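
To inspect what the template produces before any tokenizer is loaded, the sketch below rebuilds the same prompt as a plain string function. `build_prompt` and `example` are hypothetical names used only for illustration, the record is shortened from the sample printed earlier, and the truncation step of `format_instruction` is deliberately left out.

```python
# Minimal sketch: the same template as format_instruction, without tokenization.
# `build_prompt` and `example` are illustrative names, not part of the notebook.

def build_prompt(sample: dict) -> str:
    """Return the reversed Alpaca-style prompt: the dataset response becomes
    the input, and the original instruction becomes the target text."""
    return (
        "### Instruction:\n"
        "Use the Input below to create an instruction, which could have been "
        "used to generate the input using an LLM.\n\n"
        "### Input:\n"
        f"{sample['response']}\n\n"
        "### Response:\n"
        f"{sample['instruction']}\n"
    )

# Shortened version of a dolly-15k record, for illustration only.
example = {
    "instruction": "What Does the Referee Do?",
    "context": "",
    "response": "In a game of soccer, the referee is the appointed person of authority on the field.",
    "category": "general_qa",
}

prompt = build_prompt(example)
print(prompt)
```

Note that the roles are swapped relative to the usual Alpaca layout: the model is trained to produce an instruction from a given response, so the fine-tuned model acts as an instruction generator.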

### Speed Up Training with Flash Attention

Check whether your GPU supports `flash-attn` (the assertion requires CUDA compute capability 8.0 or higher):

```shell
$ python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError: Hardware not supported for Flash Attention
```
**Result: the NVIDIA T4 used for this demo does not support Flash Attention.**

#### Install the flash-attn Package (Requires GPU Hardware Support)

```shell
$ MAX_JOBS=4 pip install flash-attn --no-build-isolation
```
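
The hardware check and the install step can be folded into one decision in Python. The following is a minimal sketch, not part of the original notebook: `pick_attn_implementation` is a hypothetical helper, and it assumes a `transformers` release whose `from_pretrained` accepts the `attn_implementation` keyword with the values `"flash_attention_2"` and `"sdpa"`.

```python
import importlib.util

def pick_attn_implementation(compute_capability_major: int) -> str:
    """Return a value for the `attn_implementation` keyword (hypothetical helper).

    Mirrors the shell check above: FlashAttention-2 is chosen only when the GPU
    reports compute capability 8.x or newer AND the flash-attn package is installed.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if compute_capability_major >= 8 and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"  # fall back to PyTorch scaled-dot-product attention

# A T4 reports compute capability 7.5, so the fallback is selected.
print(pick_attn_implementation(7))  # prints "sdpa"
```

In a notebook like this one, the result could be passed as `attn_implementation=pick_attn_implementation(torch.cuda.get_device_capability()[0])` when calling `AutoModelForCausalLM.from_pretrained`, under the version assumption stated above.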

### Load the Model

In [8]:
# 3. Load the model and tokenizer
model_id = "/root/autodl-tmp/Llama"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
    use_cache=False
)




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
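
To give a rough sense of why the 4-bit load matters, here is a back-of-the-envelope estimate of weight memory. It is a sketch under simplifying assumptions: every parameter is treated as stored at the stated width, whereas in practice some layers (for example embeddings and norms) stay unquantized and the quantization constants add overhead. The parameter count is the one implied by the `print_trainable_parameters()` output later in this notebook (all params minus trainable params); `weight_gigabytes` is a hypothetical helper.

```python
# Rough weight-memory estimate for the base model at two storage widths.
# Assumption: all parameters use the stated width; real usage is somewhat
# higher because some layers are not quantized and constants are stored too.

BASE_PARAMS = 6_746_804_224 - 8_388_608  # all params minus LoRA params

def weight_gigabytes(num_params: int, bits_per_param: float) -> float:
    """Convert a parameter count and storage width into gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_gigabytes(BASE_PARAMS, 16)
nf4_gb = weight_gigabytes(BASE_PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights:  {nf4_gb:.2f} GB")
```

The estimate covers weights only. Activations, LoRA gradients, and optimizer state come on top and depend on batch size and sequence length.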

### Load the PEFT Model with the QLoRA Configuration

In [9]:
# 4. Configure LoRA
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=16,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"]
)
model = prepare_model_for_kbit_training(model)
qlora_model = get_peft_model(model, peft_config)

In [10]:
qlora_model.print_trainable_parameters()

trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
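
The trainable-parameter count above can be reproduced by hand. The sketch below assumes the LLaMA 2-7B architecture (32 decoder layers, hidden size 4096, square `q_proj` and `v_proj` matrices); `lora_params` is a hypothetical helper name.

```python
# Reproduce the trainable-parameter count reported by print_trainable_parameters().
# Assumptions (LLaMA 2-7B): 32 decoder layers, hidden size 4096,
# and q_proj / v_proj weight matrices of shape 4096 x 4096.

NUM_LAYERS = 32
HIDDEN_SIZE = 4096
LORA_RANK = 16                         # r in LoraConfig
TARGET_MODULES = ["q_proj", "v_proj"]  # target_modules in LoraConfig

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per module."""
    return rank * in_features + out_features * rank

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_RANK)
trainable = per_module * len(TARGET_MODULES) * NUM_LAYERS
all_params = 6_746_804_224             # "all params" from the output above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Each adapted projection contributes 16 × (4096 + 4096) = 131,072 parameters, and 2 projections × 32 layers gives the reported 8,388,608.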


### Training Hyperparameters

In [11]:
# 5. Training configuration (using SFTConfig)
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
demo_train = True
output_dir = f"models/llama-7-int4-dolly-{timestamp}"

In [12]:
args = SFTConfig(  # SFTConfig is used in place of TrainingArguments
    output_dir=output_dir,
    num_train_epochs=1 if demo_train else 3,
    max_steps=100,  # a positive max_steps takes precedence over num_train_epochs
    per_device_train_batch_size=3,
    gradient_accumulation_steps=1 if demo_train else 4,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="steps" if demo_train else "epoch",
    save_steps=10,
    learning_rate=2e-4,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,  # note: the "constant" scheduler below applies no warmup
    lr_scheduler_type="constant",
    packing=True,  # enable sequence packing
)
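
The `demo_train` switch mainly changes how many sequences feed each optimizer step. A small sketch of that arithmetic follows, assuming a single GPU; `effective_batch_size` is a hypothetical helper, not a TRL function.

```python
# Effective batch size for the two modes selected by `demo_train`.
# Assumption: a single GPU; multiply by the device count otherwise.

def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of sequences that contribute to one optimizer step."""
    return per_device * grad_accum * num_devices

demo = effective_batch_size(per_device=3, grad_accum=1)  # demo_train = True
full = effective_batch_size(per_device=3, grad_accum=4)  # demo_train = False

MAX_STEPS = 100
print(f"demo: {demo} sequences/step, {demo * MAX_STEPS} sequences in {MAX_STEPS} steps")
print(f"full: {full} sequences/step")
```

Because `packing=True`, each of these sequences is a packed block that may contain several short examples, so the number of original dataset rows seen per step can be larger than the sequence count.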

### Instantiate SFTTrainer

In [13]:
# 6. Initialize SFTTrainer
trainer = SFTTrainer(
    model=qlora_model,
    train_dataset=dataset,
    peft_config=peft_config,
    formatting_func=format_instruction,
    args=args,  # pass the SFTConfig object
)



Applying formatting function to train dataset:   0%|          | 0/15011 [00:00<?, ? examples/s]

Adding EOS to train dataset:   0%|          | 0/15011 [00:00<?, ? examples/s]

Tokenizing train dataset:   0%|          | 0/15011 [00:00<?, ? examples/s]

Packing train dataset:   0%|          | 0/15011 [00:00<?, ? examples/s]

### Train the Model

In [14]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 10 | 1.5739 |
| 20 | 1.3376 |
| 30 | 1.2881 |
| 40 | 1.3397 |
| 50 | 1.2568 |
| 60 | 1.2804 |
| 70 | 1.2711 |
| 80 | 1.123 |
| 90 | 1.2102 |
| 100 | 1.0988 |


TrainOutput(global_step=100, training_loss=1.2779588031768798, metrics={'train_runtime': 366.793, 'train_samples_per_second': 0.818, 'train_steps_per_second': 0.273, 'total_flos': 1.2045223965843456e+16, 'train_loss': 1.2779588031768798})
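
The throughput figures in `TrainOutput` follow directly from the step count, runtime, and batch settings. The sketch below cross-checks them, assuming a single GPU and the demo settings (`per_device_train_batch_size=3`, `gradient_accumulation_steps=1`).

```python
# Cross-check the throughput figures reported in TrainOutput.
# Assumption: one GPU, so each step consumes 3 x 1 sequences (demo settings).

global_step = 100
train_runtime_s = 366.793
sequences_per_step = 3 * 1  # per-device batch size x gradient accumulation steps

steps_per_second = global_step / train_runtime_s
samples_per_second = global_step * sequences_per_step / train_runtime_s

print(f"train_steps_per_second:   {steps_per_second:.3f}")
print(f"train_samples_per_second: {samples_per_second:.3f}")
```

Both values agree with the reported metrics (0.273 and 0.818), which confirms that the run processed 300 packed sequences in roughly six minutes.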

### Save the Model

In [15]:
trainer.save_model()

### Model Inference (Test)