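## Preface

Model training is an iterative tuning process, so the same training pipeline has to be run many times. The training run in the previous article, [Fraud Text Classification Fine-Tuning (6): LoRA on a Single GPU](https://golfxiao.blog.csdn.net/article/details/141440847), breaks down into the following steps:
1. Load the data
2. Preprocess the data
3. Load the model
4. Define the LoRA parameters
5. Insert the fine-tuning matrices
6. Define the training arguments
7. Build the trainer and start training

This flow is essentially fixed. What changes during tuning is mainly the following:
1. Inputs and outputs: data path, model path, output path
2. Parameters: LoRA parameters, training arguments

We therefore extract the parts of the training process that rarely change into trainer.py, shown below: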

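In [None]:
import json
import os

import pandas as pd
import torch
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)


def load_jsonl(path):
    # Read a JSON Lines file (one JSON object per line) into a DataFrame.
    with open(path, 'r', encoding='utf-8') as file:
        data = [json.loads(line) for line in file]
    return pd.DataFrame(data)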

def preprocess(item, tokenizer, max_length=2048):
    input_ids, attention_mask, labels = [], [], []
    system_message = "You are a helpful assistant."
    user_message = item['instruction'] + item['input']
    assistant_message = json.dumps({"is_fraud":item["label"]}, ensure_ascii=False)

    instruction = tokenizer(f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant\n", add_special_tokens=False)  
    response = tokenizer(assistant_message, add_special_tokens=False)
    input_ids = instruction["input_ids"] + response["input_ids"] + [tokenizer.pad_token_id]
    attention_mask = instruction["attention_mask"] + response["attention_mask"] + [1]  
    # -100 is the label value ignored by the loss, so the instruction tokens do not contribute to it
    labels = [-100] * len(instruction["input_ids"]) + response["input_ids"] + [tokenizer.pad_token_id]

    # Guard against overly long inputs: truncate to max_length
    return {
        "input_ids": input_ids[:max_length],
        "attention_mask": attention_mask[:max_length],
        "labels": labels[:max_length]
    }

def load_dataset(train_path, eval_path, tokenizer):
    train_df = load_jsonl(train_path)
    train_ds = Dataset.from_pandas(train_df)
    train_dataset = train_ds.map(lambda x: preprocess(x, tokenizer), remove_columns=train_ds.column_names)
    
    eval_df = load_jsonl(eval_path)
    eval_ds = Dataset.from_pandas(eval_df)
    eval_dataset = eval_ds.map(lambda x: preprocess(x, tokenizer),  remove_columns=eval_ds.column_names)
    return train_dataset, eval_dataset

def load_model(model_path, device='cuda'):
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_path,torch_dtype=torch.bfloat16)
    model.enable_input_require_grads()  # required when gradient checkpointing is enabled
    return model.to(device), tokenizer

def build_loraconfig():
    return LoraConfig(
        task_type=TaskType.CAUSAL_LM, 
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
        inference_mode=False,  # training mode
        r=8, 
        lora_alpha=16,   
        lora_dropout=0.05
    )

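def build_train_arguments(output_path):
    return TrainingArguments(
        output_dir=output_path,
        per_device_train_batch_size=4,  # training batch size per device (e.g. per GPU)
        gradient_accumulation_steps=4,  # gradient accumulation steps; multiplies the effective batch size
        log_level="debug",              # log level
        log_level_replica="info",       # log level of the processes on the other GPUs in multi-GPU training
        logging_steps=10,               # log every 10 steps
        logging_first_step=True,        # also log at the first training step
        logging_dir=os.path.join(output_path, "logs"),
        num_train_epochs=3,             # total number of training epochs
        per_device_eval_batch_size=8,   # evaluation batch size per device (e.g. per GPU)
        eval_strategy="steps",          # evaluate every eval_steps steps
        eval_on_start=False,            # whether to evaluate before training starts (True raised an error here, so the default is kept)
        eval_steps=100,                 # evaluation interval, kept equal to save_steps
        save_steps=100,                 # checkpoint interval (must be a multiple of eval_steps when load_best_model_at_end=True)
        learning_rate=1e-4,             # learning rate
        save_on_each_node=True,         # in distributed training, save checkpoints on every node so training can resume if a node fails
        load_best_model_at_end=True,    # load the best model when training ends
        remove_unused_columns=False,    # whether to drop dataset columns the model does not use (to reduce memory usage)
        dataloader_drop_last=True,      # drop the last batch, which may be incomplete and could affect training
        gradient_checkpointing=True     # enable gradient checkpointing to save GPU memory
    )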

def build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset):
    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()
    return Trainer(
        model=peft_model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        #callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early-stopping callback
    )
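
To make the label masking in `preprocess` concrete, here is a minimal sketch. It assumes a toy whitespace tokenizer (`toy_tokenize`, a hypothetical stand-in, not the Qwen2 tokenizer) and a made-up end-marker id, and only illustrates how prompt positions receive `-100` while answer positions keep their token ids.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def toy_tokenize(text, vocab):
    """Hypothetical stand-in tokenizer: one id per whitespace-separated word."""
    return [vocab.setdefault(word, len(vocab)) for word in text.split()]


def build_example(prompt, answer, end_id, vocab, max_length=2048):
    prompt_ids = toy_tokenize(prompt, vocab)
    answer_ids = toy_tokenize(answer, vocab)
    input_ids = prompt_ids + answer_ids + [end_id]
    # Prompt positions are masked; only the answer and the end marker are scored.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids + [end_id]
    return {"input_ids": input_ids[:max_length], "labels": labels[:max_length]}


vocab = {"<end>": 0}  # made-up vocabulary with a single end marker
example = build_example("is this message fraud ?", '{"is_fraud": true}', end_id=0, vocab=vocab)
print(example["input_ids"])
print(example["labels"])
```

Because `input_ids` and `labels` have the same length and are truncated identically, the loss is computed only over the short JSON answer, not over the much longer instruction text.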

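## Initialization

This part declares the configuration shared by every training run and loads the model and datasets that all runs use.

First, load the trainer.py we just wrote, using the Jupyter magic command `%run` to execute a Python script inside the current notebook.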

In [11]:
%run trainer.py

Define the paths of the training and validation datasets, and the input and output model paths.

In [12]:
traindata_path = '/data2/anti_fraud/dataset/train0819.jsonl'
evaldata_path = '/data2/anti_fraud/dataset/eval0819.jsonl'
model_path = '/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct'
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_1'

Specify which GPU device may be used.

In [13]:
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = 'cuda'

Load the model and the datasets.

In [4]:
model, tokenizer = load_model(model_path, device)
train_dataset, eval_dataset = load_dataset(traindata_path, evaldata_path, tokenizer)

Map:   0%|          | 0/18787 [00:00<?, ? examples/s]

Map:   0%|          | 0/2348 [00:00<?, ? examples/s]

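## Tuning-1 (remove early stopping)
As noted at the end of the [previous article](https://golfxiao.blog.csdn.net/article/details/141440847), every run stopped before the model had been trained on all the data, probably because of the early-stopping configuration, so that is the first thing to adjust.
- Change: temporarily remove the early-stopping configuration when building the trainer, so that the model runs the full 3 preset epochs.
- Goal: let the data be trained on thoroughly, and avoid stopping early because the loss is stuck in a local minimum.

> Stopping training before the validation loss has fully converged is known as premature early stopping. It is usually caused by short-term fluctuations in the validation loss, and the size of these fluctuations differs between models and datasets. In general you need to observe the trend of the validation loss over a longer period before choosing a reasonable `early_stopping_patience`.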
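
The effect of `early_stopping_patience` can be sketched with a simplified patience rule. This is an illustration under stated assumptions, not the exact logic of `EarlyStoppingCallback`; the function name `early_stop_step` and the loss values are made up for the example.

```python
def early_stop_step(eval_losses, patience, threshold=0.0):
    """Return the index of the evaluation at which training would stop, or None.

    Simplified rule: stop once `patience` consecutive evaluations fail to
    improve on the best loss seen so far by more than `threshold`.
    """
    best = float("inf")
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best - threshold:
            best = loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


# Made-up validation losses: a short bump, then further improvement.
losses = [0.030, 0.024, 0.025, 0.026, 0.020, 0.019]
print(early_stop_step(losses, patience=2))
print(early_stop_step(losses, patience=3))
```

With `patience=2` the run is cut off during the bump and never reaches the later, lower losses; with `patience=3` it survives the bump and trains to the end.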

In [7]:
def build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset):
    peft_model = get_peft_model(model, lora_config)
    peft_model.print_trainable_parameters()
    return Trainer(
        model=peft_model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        #callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early-stopping callback
    )

The LoRA parameters and training arguments are left unchanged for now; reuse the previous values and start training.

In [8]:
lora_config = build_loraconfig()
train_args = build_train_arguments(output_path)
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 9,232,384 || all params: 1,552,946,688 || trainable%: 0.5945


In [9]:
trainer.train()

[2024-08-23 12:27:54,231] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


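| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.0266 | 0.030008 |
| 200 | 0.0343 | 0.031493 |
| 300 | 0.0147 | 0.023997 |
| 400 | 0.0254 | 0.021837 |
| 500 | 0.0231 | 0.021945 |
| 600 | 0.0254 | 0.023164 |
| 700 | 0.0211 | 0.020423 |
| 800 | 0.018 | 0.021873 |
| 900 | 0.017 | 0.019083 |
| 1000 | 0.018 | 0.018249 |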


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=3522, training_loss=0.015486650269114704, metrics={'train_runtime': 8113.6425, 'train_samples_per_second': 6.946, 'train_steps_per_second': 0.434, 'total_flos': 1.1460762211897958e+17, 'train_loss': 0.015486650269114704, 'epoch': 2.9993612944432617})

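The validation loss fluctuates around 0.020, whereas the lowest validation loss at the end of the previous run was 0.0276. Judging by the loss alone, training for more steps improves on the previous run; the evaluation below shows the actual effect.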



#### Evaluation

In [11]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [21:13<00:00,  1.84it/s]

tn：1150, fp:15, fn:238, tp:945
precision: 0.984375, recall: 0.7988165680473372





Precision rose by about 5 percentage points over the previous article (from 0.9347 to 0.9844), and recall rose by about 7 percentage points (from 0.7275 to 0.7988).
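
The two metrics follow directly from the confusion-matrix counts printed above. evaluate.py is not shown in this article, so the helper below is only a sketch using the standard definitions, with a hypothetical function name.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = tp/(tp+fp), recall = tp/(tp+fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts taken from the evaluation output of this run.
precision, recall = precision_recall(tp=945, fp=15, fn=238)
print(f"precision: {precision:.4f}, recall: {recall:.4f}")
```

Only 15 legitimate texts are flagged as fraud, which explains the high precision, while 238 fraud texts are still missed, which is why recall is the metric the later experiments focus on.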

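**Summary of this run**: in the early stage of training, the more data the model is fed, the more it learns. Making sure all the data is actually trained on is the most basic way to improve model performance early on.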

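## Tuning-2 (increase batch size to 16)

Up to now the mini-batch size (batch_size) has stayed at its initial value of 4. The common rule of thumb is that a larger batch_size helps make gradient descent more stable.

Here we raise batch_size to 16 and lower gradient accumulation from 4 to 1. The effective batch size per gradient update is unchanged (16*1 = 4*4), which gives us a chance to compare the two settings.
> Note: batch_size consumes a lot of GPU memory, so you need to find the largest value your GPU can handle. A common approach is to start from a small value (for example 4) and increase it step by step as long as the GPU does not report OOM.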
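
The relation between batch size, gradient accumulation and the number of optimizer updates can be checked with a little arithmetic. The sketch below is a simplified approximation of how the step count is derived for a single GPU (the real `Trainer` logic may differ between versions); the function name is hypothetical, and 18,787 is the training-set size from the `Map` progress output above.

```python
import math


def total_update_steps(num_examples, per_device_batch, grad_accum, epochs, drop_last):
    """Approximate number of optimizer updates for a single-GPU run."""
    if drop_last:
        batches_per_epoch = num_examples // per_device_batch
    else:
        batches_per_epoch = math.ceil(num_examples / per_device_batch)
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return epochs * updates_per_epoch


N = 18787  # number of training examples
print(total_update_steps(N, 4, 4, 3, drop_last=True))    # batch 4, accumulation 4
print(total_update_steps(N, 16, 1, 3, drop_last=True))   # batch 16, no accumulation
print(total_update_steps(N, 16, 1, 3, drop_last=False))  # same, keeping the last partial batch
```

Both settings yield the same number of updates with the same effective batch size of 16, so any difference in results comes from how each update's gradient is assembled rather than from the amount of training.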

In [8]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_1'
lora_config = build_loraconfig()

train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
print(train_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=steps,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16

Build the trainer and start training.

In [10]:
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 9,232,384 || all params: 1,552,946,688 || trainable%: 0.5945
[2024-08-23 15:10:31,018] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


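| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.0266 | 0.031856 |
| 200 | 0.0345 | 0.02862 |
| 300 | 0.0147 | 0.023571 |
| 400 | 0.0259 | 0.022186 |
| 500 | 0.0224 | 0.021919 |
| 600 | 0.0263 | 0.022612 |
| 700 | 0.0204 | 0.02043 |
| 800 | 0.019 | 0.021082 |
| 900 | 0.017 | 0.019088 |
| 1000 | 0.019 | 0.01823 |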


We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


TrainOutput(global_step=3525, training_loss=0.015297001443724048, metrics={'train_runtime': 8525.6414, 'train_samples_per_second': 6.611, 'train_steps_per_second': 0.413, 'total_flos': 1.5421390082188186e+17, 'train_loss': 0.015297001443724048, 'epoch': 3.0})

#### Evaluation

In [13]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)
# checkpoint_path = "/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_1/checkpoint-3500"

progress: 100%|██████████| 2348/2348 [19:29<00:00,  2.01it/s]

tn：1148, fp:17, fn:197, tp:986
precision: 0.9830508474576272, recall: 0.8334742180896028





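Compared with the previous run, precision is essentially unchanged, while recall improved by another 3.5 percentage points.

**Summary of this run**: in this experiment the larger batch_size worked better than the smaller one. With the same effective (gradient-update) batch size, no gradient accumulation (16*1) outperformed gradient accumulation (4*4).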



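## Tuning-3 (dropout and a learning rate scheduler)

The tuning so far has gone smoothly, probably because these are easy wins that apply to most models. Is there still room for improvement?

A close look at how the two losses evolve during training reveals a fairly obvious problem: late in training (roughly after step 2200), the training loss keeps falling while the validation loss no longer decreases, and it even rises noticeably after step 3000. This indicates at least two things:
1. The model is overfitting the training set; the training loss falls too quickly late in training.
2. The model's ability to generalize to the validation set is limited.

A short conversation with ChatGPT produced the following suggestions:
- Strengthen regularization, which in the current LoRA fine-tuning setup means lora_dropout.
- Introduce a learning rate scheduler so that the learning rate is adjusted dynamically. Late in training in particular, it gradually lowers the learning rate, which helps mitigate overfitting.

> Note: when fine-tuning with LoRA, the parameters of the base model are frozen, so dropout can only act on the inserted low-rank matrices, that is, the lora_dropout parameter of lora_config.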
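
As a reminder of where this dropout acts, the forward pass of a single LoRA-adapted linear layer can be written as follows (a sketch of the standard LoRA formulation; the symbols are introduced here for illustration):

$$
h = W_0 x + \frac{\alpha}{r}\, B A \,\mathrm{Dropout}_p(x)
$$

Here $W_0$ is the frozen base weight, $A \in \mathbb{R}^{r \times d_{in}}$ and $B \in \mathbb{R}^{d_{out} \times r}$ are the trainable low-rank matrices, $\alpha$ is `lora_alpha`, $r$ is the rank, and $p$ is `lora_dropout`. Dropout only affects the input of the low-rank branch, so the frozen path $W_0 x$ is untouched.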

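Two training arguments are introduced:
1. lr_scheduler_type: the learning rate scheduler. We use the cosine annealing scheduler `cosine`, which gradually lowers the learning rate during training and helps the model converge stably.
2. warmup_ratio: the learning rate warmup ratio. A value of 0.05 means the first 5% of steps are used for warmup, which targets the large loss fluctuations early in training.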
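
The shape of this schedule can be sketched in a few lines: linear warmup to the base learning rate, followed by a half-cosine decay to zero. This is a minimal re-implementation for illustration under those assumptions, not the library's scheduler code; `lr_at_step` is a hypothetical name, and 3,522 is the total step count reported in the training log of this run.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup followed by a half-cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 3522  # 3 epochs at batch size 16 on 18,787 examples
for step in (0, 88, 177, 1850, 3522):
    print(step, f"{lr_at_step(step, TOTAL_STEPS, 1e-4, 0.05):.2e}")
```

The learning rate peaks at 1e-4 once the warmup ends and then shrinks smoothly, so the late steps, where the overfitting was observed, are taken with a much smaller learning rate.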

In [14]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_8'
train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
train_args.warmup_ratio=0.05     
train_args.lr_scheduler_type="cosine"  

print(train_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=steps,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=

One LoRA parameter changes: lora_dropout is raised from 0.05 to 0.2 to improve generalization during training.

In [6]:
lora_config = build_loraconfig()
lora_config.lora_dropout = 0.2   # improve generalization

Build the trainer and start training.

In [7]:
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 9,232,384 || all params: 1,552,946,688 || trainable%: 0.5945


Currently training with a batch size of: 16


[2024-08-23 23:19:25,861] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




***** Running training *****
  Num examples = 18,787
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3,522
  Number of trainable parameters = 9,232,384
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


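| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.0327 | 0.030737 |
| 200 | 0.0429 | 0.044557 |
| 300 | 0.0184 | 0.026752 |
| 400 | 0.027 | 0.022941 |
| 500 | 0.0234 | 0.023118 |
| 600 | 0.0265 | 0.024059 |
| 700 | 0.0211 | 0.020844 |
| 800 | 0.0191 | 0.022903 |
| 900 | 0.0175 | 0.019756 |
| 1000 | 0.0194 | 0.019162 |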



***** Running Evaluation *****
  Num examples = 2348
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Saving model checkpoint to /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_7/checkpoint-100
loading configuration file /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_thet

TrainOutput(global_step=3522, training_loss=0.02731521469741861, metrics={'train_runtime': 8523.0882, 'train_samples_per_second': 6.613, 'train_steps_per_second': 0.413, 'total_flos': 1.5420052826918093e+17, 'train_loss': 0.02731521469741861, 'epoch': 3.0})


#### Evaluation

In [None]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [19:49<00:00,  1.97it/s]

tn：1147, fp:18, fn:202, tp:981

precision: 0.9819819819819819, recall: 0.8292476754015216

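Almost no change; recall even dropped slightly.

> Note: adjusting the learning rate scheduler and dropout at the same time is not a good practice, because it makes their individual effects hard to tell apart. So an extra run was added afterwards that only sets lora_dropout=0.2. Its evaluation result was precision: 0.9864, recall: 0.8021. This small comparison suggests that the learning rate scheduler had a positive effect, while lora_dropout had a negative one.

Why would that be?
- One explanation found online is that a higher dropout reduces model capacity, so the rank r of the LoRA low-rank matrices may need to be increased at the same time to compensate for the lost capacity.

## Tuning-4 (increase the rank to 16)

The changes to the training arguments stay the same.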

In [15]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_9'

train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
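train_args.warmup_ratio=0.05     # add learning rate warmup: the first 5% of steps warm up, targeting the large loss fluctuations early in training
train_args.lr_scheduler_type="cosine"  # cosine annealing scheduler: gradually lowers the learning rate during training and helps the model converge stably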

print(train_args)

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=100,
eval_strategy=steps,
eval_use_gather_object=False,
evaluation_strategy=None,
fp16=

For the LoRA parameters, keep dropout=0.2, raise the rank r of the fine-tuning matrices from 8 to 16, and raise the scaling factor lora_alpha to 32 to keep the 2x ratio.

In [16]:
lora_config = build_loraconfig()
lora_config.lora_dropout = 0.2   # 0.05 -> 0.2 to improve generalization
lora_config.r = 16
lora_config.lora_alpha = 32
lora_config

LoraConfig(peft_type=<PeftType.LORA: 'LORA'>, auto_mapping=None, base_model_name_or_path=None, revision=None, task_type=<TaskType.CAUSAL_LM: 'CAUSAL_LM'>, inference_mode=False, r=16, target_modules={'k_proj', 'v_proj', 'o_proj', 'down_proj', 'q_proj', 'gate_proj', 'up_proj'}, lora_alpha=32, lora_dropout=0.2, fan_in_fan_out=False, bias='none', use_rslora=False, modules_to_save=None, init_lora_weights=True, layers_to_transform=None, layers_pattern=None, rank_pattern={}, alpha_pattern={}, megatron_config=None, megatron_core='megatron.core', loftq_config={}, use_dora=False, layer_replication=None, runtime_config=LoraRuntimeConfig(ephemeral_gpu_offload=False))

In [None]:
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

#### Evaluation

In [None]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [19:15<00:00,  2.03it/s]

tn：1143, fp:22, fn:158, tp:1025

precision: 0.9789875835721108, recall: 0.8664412510566357

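Recall rose by about 3.7 percentage points (from 0.8292 to 0.8664). Precision dipped slightly, but the effect is minor; the two metrics inherently trade off against each other.

**Summary of this run**: dropout needs to be tuned together with the rank. A larger rank gives the model more trainable parameters, and adjusting it together with dropout can improve generalization during training.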
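
How the rank translates into trainable parameters can be verified with a short calculation. The sketch below assumes the Qwen2-1.5B dimensions printed in the model config of this article (hidden size 1536, intermediate size 8960, 12 attention heads, 2 key/value heads, 28 layers), assumes the head dimension is `hidden_size / num_attention_heads`, and uses a hypothetical function name. Each adapted linear layer adds `r * (in_features + out_features)` parameters.

```python
def lora_trainable_params(r, hidden=1536, intermediate=8960, heads=12, kv_heads=2, layers=28):
    """Count LoRA parameters when all seven projection layers are targeted."""
    head_dim = hidden // heads      # assumed: 1536 / 12 = 128
    kv_dim = kv_heads * head_dim    # output width of k_proj and v_proj
    # (in_features, out_features) of each targeted linear layer in one decoder block
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    # LoRA adds A (r x in) and B (out x r) per layer: r * (in + out) parameters.
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return layers * per_block


print(f"{lora_trainable_params(8):,}")
print(f"{lora_trainable_params(16):,}")
```

The count is linear in `r`, so doubling the rank from 8 to 16 doubles the trainable parameters; the `r=8` result matches the `trainable params: 9,232,384` line printed by `print_trainable_parameters()` in the earlier runs.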

## Test-5 (lower the learning rate)
> Verified on top of `Tuning-2`.

In [None]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_3'
lora_config = build_loraconfig()
train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
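train_args.weight_decay = 0.01  # add weight decay to improve generalization
train_args.learning_rate=5e-5  # lower the learning rate for more stable training: 1e-4 --> 5e-5
train_args.warmup_ratio=0.05     # add learning rate warmup: the first 5% of steps warm up, targeting the large loss fluctuations early in training
train_args.lr_scheduler_type="cosine"  # cosine annealing scheduler: gradually lowers the learning rate during training and helps the model converge stably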


In [None]:
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

#### Evaluation

In [None]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [19:05<00:00,  2.05it/s]

tn：1152, fp:13, fn:318, tp:865

precision: 0.9851936218678815, recall: 0.7311918850380389

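Judging by the result, model performance dropped significantly after lowering the learning rate (recall 0.7312, compared with 0.8335 for Tuning-2), so this direction is abandoned. Note that this run also adds weight decay, warmup and the cosine scheduler on top of Tuning-2, so the lower learning rate is not the only change.

## Test-6 (learning rate warmup in isolation)
Verify learning rate warmup on its own, on top of `Tuning-2`.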

In [5]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_12'
lora_config = build_loraconfig()
train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
train_args.warmup_ratio=0.05     # add learning rate warmup: the first 5% of steps warm up, targeting the large loss fluctuations early in training

In [6]:
lora_config = build_loraconfig()
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 9,232,384 || all params: 1,552,946,688 || trainable%: 0.5945


Currently training with a batch size of: 16


[2024-08-24 17:32:15,873] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




***** Running training *****
  Num examples = 18,787
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3,522
  Number of trainable parameters = 9,232,384
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


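| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.0304 | 0.031623 |
| 200 | 0.0423 | 0.050258 |
| 300 | 0.0168 | 0.025361 |
| 400 | 0.0274 | 0.022586 |
| 500 | 0.0238 | 0.023494 |
| 600 | 0.027 | 0.02413 |
| 700 | 0.0206 | 0.020899 |
| 800 | 0.0197 | 0.02234 |
| 900 | 0.0172 | 0.019625 |
| 1000 | 0.019 | 0.019108 |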



***** Running Evaluation *****
  Num examples = 2348
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Saving model checkpoint to /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_12/checkpoint-100
loading configuration file /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_the

TrainOutput(global_step=3522, training_loss=0.0267889589493833, metrics={'train_runtime': 8522.5144, 'train_samples_per_second': 6.613, 'train_steps_per_second': 0.413, 'total_flos': 1.5420052826918093e+17, 'train_loss': 0.0267889589493833, 'epoch': 3.0})

#### Evaluation

In [7]:
%run evaluate.py
evaluate_with_model(trainer.model, tokenizer, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [19:05<00:00,  2.05it/s]

tn：1153, fp:12, fn:256, tp:927
precision: 0.987220447284345, recall: 0.7836010143702451





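Comparing Tuning-2 (recall=0.8335) with this Test-6 result, adding learning rate warmup on its own does not seem to improve the model; it actually lowered recall.

## Test-7 (remove learning rate warmup)
> Verified on top of Tuning-4.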

In [None]:
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_14'

train_args = build_train_arguments(output_path)
train_args.per_device_train_batch_size = 16
train_args.gradient_accumulation_steps = 1
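# train_args.weight_decay=0.01     # add weight decay, which pushes model parameters toward smaller values and so helps avoid overfitting
# train_args.warmup_ratio=0.05     # add learning rate warmup: the first 5% of steps warm up, targeting the large loss fluctuations early in training
train_args.lr_scheduler_type="cosine"  # cosine annealing scheduler: gradually lowers the learning rate during training and helps the model converge stably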

lora_config = build_loraconfig()
lora_config.lora_dropout = 0.2   # improve generalization
lora_config.r = 16
lora_config.lora_alpha = 32

In [None]:
trainer = build_trainer(model, tokenizer, train_args, lora_config, train_dataset, eval_dataset)
trainer.train()

Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820
[2024-08-24 21:29:36,479] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




***** Running training *****
  Num examples = 18,787
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 3,522
  Number of trainable parameters = 18,464,768
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


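| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100 | 0.0264 | 0.03008 |
| 200 | 0.0363 | 0.031037 |
| 300 | 0.016 | 0.023586 |
| 400 | 0.0263 | 0.021576 |
| 500 | 0.0243 | 0.023018 |
| 600 | 0.0305 | 0.022208 |
| 700 | 0.0194 | 0.019775 |
| 800 | 0.017 | 0.022859 |
| 900 | 0.0179 | 0.018533 |
| 1000 | 0.0181 | 0.018786 |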



***** Running Evaluation *****
  Num examples = 2348
  Batch size = 8
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
Saving model checkpoint to /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_14/checkpoint-100
loading configuration file /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct/config.json
Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 1536,
  "initializer_range": 0.02,
  "intermediate_size": 8960,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 12,
  "num_hidden_layers": 28,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_the

In [6]:
print(output_path)

/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0822_12


#### Evaluation

In [None]:
%run evaluate.py
checkpoint_path = os.path.join(output_path, "checkpoint-2300")
evaluate(model_path, checkpoint_path, evaldata_path, device, debug=True)

progress: 100%|██████████| 2348/2348 [19:24<00:00,  2.02it/s]

tn：1144, fp:21, fn:188, tp:995

precision: 0.9793307086614174, recall: 0.8410819949281487

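Comparing Tuning-4 (recall=0.8664) with this Test-7 result, simply `removing` learning rate warmup lowers model performance to some degree.

Yet the conclusion from Test-6 was that simply `adding` learning rate warmup also lowered performance to some degree. The two conclusions look contradictory, so what does this tell us?

Holding all other parameters fixed and testing the effect of a single parameter seems hard to make work. The parameters behave like an interacting whole, a phenomenon that can be called `parameter coupling`. It means the values of different parameters have to be coordinated to achieve the best result.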
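
The coupling can be made visible by computing the effect of warmup separately in each context. The sketch below only rearranges the recall values reported in this article; the dictionary keys are labels invented for the example. Keep in mind that each configuration was run once and that Test-7 was evaluated on checkpoint-2300, so the differences include some run-to-run noise.

```python
# Recall values reported by the evaluation runs in this article (rounded).
recall = {
    ("baseline", "no_warmup"): 0.8335,            # Tuning-2
    ("baseline", "warmup"): 0.7836,               # Test-6
    ("r16_dropout_cosine", "no_warmup"): 0.8411,  # Test-7
    ("r16_dropout_cosine", "warmup"): 0.8664,     # Tuning-4
}


def warmup_effect(context):
    """Change in recall when warmup is switched on within one context."""
    return recall[(context, "warmup")] - recall[(context, "no_warmup")]


for context in ("baseline", "r16_dropout_cosine"):
    print(f"{context}: {warmup_effect(context):+.4f}")
```

The sign of the warmup effect flips between the two contexts, which is exactly what an interaction between parameters looks like: the effect of one setting depends on the values of the others.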