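## Introduction
The previous post, [Data Correction and Augmentation](https://golfxiao.blog.csdn.net/article/details/142333893), augmented the dataset. This post uses the augmented data to train the model further, so that a single model can predict several pieces of information at once: the classification label, the fraudster, and the reason for the classification.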

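To do this, the whole training process needs to be adjusted, including:
1. Encapsulating the cross-training logic
2. Reworking data serialization
3. Reworking the evaluation method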

## Training Process

#### Initialization

In [11]:
%run trainer.py

In [8]:
traindata_path = '/data2/anti_fraud/dataset/train0902.jsonl'
evaldata_path = '/data2/anti_fraud/dataset/eval0902.jsonl'
model_path = '/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct'
output_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_1'

In [9]:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
device = 'cuda'

In [12]:
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
train_dataset, eval_dataset = load_dataset(traindata_path, evaldata_path, tokenizer, with_reason=True, lazy=False)

Map:   0%|          | 0/24246 [00:00<?, ? examples/s]

Map:   0%|          | 0/3031 [00:00<?, ? examples/s]

#### Data Processing

In [13]:
import glob
import gc
import numpy as np
from datasets import Dataset, concatenate_datasets
from sklearn.model_selection import KFold

Concatenate the training set and the validation set into a single dataset.

In [14]:
datasets = concatenate_datasets([train_dataset, eval_dataset])
len(datasets)

27277

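Create a KFold object to split the dataset into folds.
- n_splits=5: split the dataset into 5 parts.
- shuffle=True: shuffle the sample order before `kf.split` divides the dataset.

> KFold is the k-fold cross-validation splitter provided by sklearn. It divides the dataset into k subsets of nearly equal size, called folds. On each iteration one fold serves as the validation set and the remaining k-1 folds (4 here) serve as the training set, and the process repeats k times.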

In [15]:
kf = KFold(n_splits=5, shuffle=True)
kf

KFold(n_splits=5, random_state=None, shuffle=True)

Splitting with KFold does not return the samples themselves but their indices within the dataset, as the following example shows.

In [16]:
indexes = kf.split(np.arange(len(datasets)))
train_indexes, val_indexes = next(indexes)
train_indexes, val_indexes, len(train_indexes), len(val_indexes)

(array([    0,     1,     2, ..., 27273, 27275, 27276]),
 array([    5,     7,     8, ..., 27262, 27270, 27274]),
 21821,
 5456)
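To make the fold sizes concrete, here is a NumPy-only sketch that mimics shuffled k-fold splitting. It is not sklearn's implementation: the helper name `kfold_indices` and the seed are assumptions made for illustration. It shows why 27277 samples yield two validation folds of 5456 and three of 5455, and that every sample is validated exactly once.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs, mimicking shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)        # shuffle once up front
    folds = np.array_split(shuffled, n_splits)   # the first n % k folds get one extra sample
    for i, val_idx in enumerate(folds):
        # Everything outside the current fold becomes training data
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield np.sort(train_idx), np.sort(val_idx)

splits = list(kfold_indices(27277))
sizes = [(len(tr), len(va)) for tr, va in splits]
print(sizes)

# Each sample index appears in exactly one validation fold
all_val = np.concatenate([va for _, va in splits])
print(len(np.unique(all_val)) == 27277)
```

With sklearn itself, passing `random_state` to `KFold(n_splits=5, shuffle=True, random_state=...)` makes the shuffled split reproducible between calls; with `random_state=None`, as in this post, each call to `kf.split` shuffles differently.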

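#### Hyperparameter Definition

Define a function that builds the hyperparameters, covering both the training arguments and the LoRA fine-tuning arguments. Compared with earlier posts, the changes are:
- The evaluation and checkpoint-saving strategy changes from every 100 steps to once per epoch, because the former saved too many redundant checkpoints.
- num_train_epochs is set to 2, so each fold's training set is trained for 2 epochs; with k=5 that is 10 epochs in total, each over 4/5 of the data.

> Note: with `per_device_train_batch_size=16`, training unexpectedly ran out of memory (OOM), so per_device_train_batch_size is temporarily reduced to 8.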

In [17]:
def build_arguments(output_path):
    train_args = build_train_arguments(output_path)
    train_args.eval_strategy='epoch'
    train_args.save_strategy='epoch'
    train_args.num_train_epochs = 2
    train_args.per_device_train_batch_size = 8
    
    lora_config = build_loraconfig()
    lora_config.lora_dropout = 0.2   # higher dropout to improve generalization
    lora_config.r = 16
    lora_config.lora_alpha = 32
    return train_args, lora_config
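As a sanity check on `r = 16`, the trainable-parameter count printed in the training log (`trainable params: 18,464,768`) can be reproduced by hand. The sketch below rests on two assumptions that are not stated in this post: the layer shapes are those of the public Qwen2-1.5B config, and `build_loraconfig` attaches LoRA to all seven attention and MLP projections. Under those assumptions the arithmetic matches the log exactly.

```python
# Assumed Qwen2-1.5B shapes (from the public model config, not from this post)
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 1536, 8960, 256, 28
R = 16  # lora_config.r set in build_arguments

# (in_features, out_features) of each projection assumed to carry a LoRA adapter
target_modules = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features, out_features, r):
    # LoRA adds two matrices per layer: A (r x in) and B (out x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, R) for i, o in target_modules.values())
trainable = per_layer * NUM_LAYERS
all_params = 1_562_179_072  # "all params" value copied from the training log
print(trainable)                                 # 18464768
print(f"{100 * trainable / all_params:.4f}%")    # 1.1820%
```

Because the count scales linearly with `r`, halving the rank to 8 would halve the adapter size under the same assumptions.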

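Each iteration swaps in a different combination of training and validation sets, and changing the dataset means re-creating the trainer with a fresh model instance. Only the first round trains a new LoRA adapter from scratch. Every later round must load the latest checkpoint saved by the previous round, so that training continues from the earlier result.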

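Define a `find_last_checkpoint` function that finds the latest checkpoint in a directory.
 - glob.glob finds all subdirectories under the given directory that match the `checkpoint-*` pattern.
 - os.path.getctime returns the creation time on Windows and the time of the last metadata change on Unix.
 - max uses these timestamps to pick the most recent directory, i.e. the latest checkpoint.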

In [18]:
# Find the most recent checkpoint directory
def find_last_checkpoint(output_dir):
    checkpoint_dirs = glob.glob(os.path.join(output_dir, 'checkpoint-*'))
    last_checkpoint_dir = max(checkpoint_dirs, key=os.path.getctime)
    return last_checkpoint_dir

find_last_checkpoint("/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0830_1")

'/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0830_1/checkpoint-3522'
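Timestamps can be misleading if checkpoint directories are copied or touched after training. As an alternative sketch (not the method used in this post), the step number embedded in the directory name can be compared instead. The function name `find_last_checkpoint_by_step` is hypothetical, and it returns `None` rather than raising when no checkpoint exists.

```python
import os
import re
import tempfile

def find_last_checkpoint_by_step(output_dir):
    """Return the checkpoint-<step> subdirectory with the largest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        full_path = os.path.join(output_dir, name)
        if match and os.path.isdir(full_path):
            # Compare steps numerically, so 5454 ranks above 999
            candidates.append((int(match.group(1)), full_path))
    if not candidates:
        return None
    return max(candidates)[1]

# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for step in (999, 5454, 2727):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest_name = os.path.basename(find_last_checkpoint_by_step(tmp))
    empty_result = find_last_checkpoint_by_step(os.path.join(tmp, "checkpoint-999"))

print(latest_name)
```

Returning `None` for an empty directory fits the main loop's `if last_checkpoint_path:` check, whereas `max` on an empty list raises `ValueError`.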

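Define a new model-loading function that loads the most recently trained model from the base model plus a given checkpoint, and sets the `requires_grad` attribute of the parameters according to the training goal: all LoRA parameters are marked as requiring gradients, and all other parameters are frozen.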

In [19]:
def load_model_v2(model_path, checkpoint_path='', device='cuda'):
    # Load the base model
    model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, trust_remote_code=True).to(device)
    # Load the LoRA weights from the checkpoint, if one is given
    if checkpoint_path:
        model = PeftModel.from_pretrained(model, model_id=checkpoint_path).to(device)
    # Freeze all parameters of the base model
    for param in model.base_model.parameters():
        param.requires_grad = False

    # Make the parameters of the inserted LoRA modules trainable
    for name, param in model.named_parameters():
        if 'lora' in name:
            param.requires_grad = True
    return model

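In this process, the first round uses a freshly initialized LoRA adapter, while later rounds initialize the adapter from a given checkpoint. As a result, [the original build_trainer method](https://golfxiao.blog.csdn.net/article/details/141500352) is no longer general enough, so we define a new trainer-building function and move the logic for loading the fine-tuning parameters outside of it.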

In [20]:
def build_trainer_v2(model, tokenizer, train_args, train_dataset, eval_dataset):
    # This call is required when gradient checkpointing is enabled
    if train_args.gradient_checkpointing:
        model.enable_input_require_grads()
    return Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
        callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],  # early-stopping callback
    )
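The patience mechanism behind early stopping can be sketched in a few lines. This is a simplified illustration of the idea, not the transformers implementation (which also supports an improvement threshold and a configurable metric); the function name `stops_early` is hypothetical. It shows that with one evaluation per epoch and only 2 epochs per fold, a patience of 5 is never reached within a single fold.

```python
def stops_early(val_losses, patience):
    """Return the 1-based evaluation at which training would stop, or None."""
    best = float("inf")
    bad_evals = 0
    for step, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, bad_evals = loss, 0     # improvement resets the counter
        else:
            bad_evals += 1                # no improvement on this evaluation
            if bad_evals >= patience:
                return step
    return None

# Validation losses of fold 0, copied from the training log
fold0 = [0.825167, 0.813522]
print(stops_early(fold0, patience=5))            # None: only 2 evaluations happen
print(stops_early([1.0, 1.1, 1.2], patience=2))  # 3
```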

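Define the main loop for cross-training.
- kf.split produces 5 train/validation index splits, and the loop iterates over them 5 times.
- `datasets.select` picks different samples by index as the training and validation sets on each iteration.
- To keep one iteration's results from being overwritten by the next, each iteration builds a distinct output directory output_path from the fold number.
- If last_checkpoint_path is set, the model is loaded from that checkpoint; otherwise get_peft_model inserts a new LoRA adapter into the model.
- The new build_trainer_v2 function builds the trainer, and training starts.
- After each iteration, the latest checkpoint from that run is located and used as the starting point for the next one.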

In [23]:
results = []
last_checkpoint_path = ''

for fold, (train_index, val_index) in enumerate(kf.split(np.arange(len(datasets)))):
    print(f"fold={fold} start, train_index={train_index}, val_index={val_index}")
    train_dataset = datasets.select(train_index)
    eval_dataset = datasets.select(val_index)
    print(f"train data: {len(train_dataset)}, eval: {len(eval_dataset)}")

    output_path = f'/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_{fold}'
    train_args, lora_config = build_arguments(output_path)
    if last_checkpoint_path:
        model = load_model_v2(model_path, last_checkpoint_path, device)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
        model = get_peft_model(model, lora_config)

    model.print_trainable_parameters()
    trainer = build_trainer_v2(model, tokenizer, train_args, train_dataset, eval_dataset)
    train_result = trainer.train()
    print(f"fold={fold}, result = {train_result}")
    results.append(train_result)
    
    last_checkpoint_path = find_last_checkpoint(output_path)


fold=0 start, train_index=[    1     2     3 ... 27274 27275 27276], val_index=[    0    13    18 ... 27269 27270 27271]
train data: 21821, eval: 5456


Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820
[2024-09-13 09:45:45,557] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/data2/anaconda3/envs/python3_10/compiler_compat/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status




`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.7801 | 0.825167 |
| 2 | 0.6967 | 0.813522 |

We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)


fold=0, result = TrainOutput(global_step=5454, training_loss=0.8096381610769643, metrics={'train_runtime': 5525.3768, 'train_samples_per_second': 7.898, 'train_steps_per_second': 0.987, 'total_flos': 1.3732846378074931e+17, 'train_loss': 0.8096381610769643, 'epoch': 2.0})
fold=1 start, train_index=[    0     1     2 ... 27272 27274 27275], val_index=[    8     9    10 ... 27265 27273 27276]
train data: 21821, eval: 5456


Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.7854 | 0.738886 |
| 2 | 0.6662 | 0.731676 |


fold=1, result = TrainOutput(global_step=5454, training_loss=0.694154640159023, metrics={'train_runtime': 5531.5749, 'train_samples_per_second': 7.89, 'train_steps_per_second': 0.986, 'total_flos': 1.3742528585566618e+17, 'train_loss': 0.694154640159023, 'epoch': 2.0})
fold=2 start, train_index=[    0     2     4 ... 27273 27275 27276], val_index=[    1     3     5 ... 27252 27268 27274]
train data: 21822, eval: 5455


Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.6794 | 0.619393 |
| 2 | 0.5589 | 0.610776 |


fold=2, result = TrainOutput(global_step=5454, training_loss=0.5996930905323042, metrics={'train_runtime': 5539.5927, 'train_samples_per_second': 7.879, 'train_steps_per_second': 0.985, 'total_flos': 1.3758353063028326e+17, 'train_loss': 0.5996930905323042, 'epoch': 2.0})
fold=3 start, train_index=[    0     1     3 ... 27274 27275 27276], val_index=[    2     6    25 ... 27256 27260 27272]
train data: 21822, eval: 5455


Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.5821 | 0.503672 |
| 2 | 0.4297 | 0.490893 |


fold=3, result = TrainOutput(global_step=5454, training_loss=0.4972909716598971, metrics={'train_runtime': 5548.4171, 'train_samples_per_second': 7.866, 'train_steps_per_second': 0.983, 'total_flos': 1.3795531974404506e+17, 'train_loss': 0.4972909716598971, 'epoch': 2.0})
fold=4 start, train_index=[    0     1     2 ... 27273 27274 27276], val_index=[    4     7    12 ... 27254 27263 27275]
train data: 21822, eval: 5455


Detected kernel version 4.15.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


trainable params: 18,464,768 || all params: 1,562,179,072 || trainable%: 1.1820


| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.4833 | 0.394778 |
| 2 | 0.308 | 0.372799 |


fold=4, result = TrainOutput(global_step=5454, training_loss=0.3999929018605529, metrics={'train_runtime': 5524.1292, 'train_samples_per_second': 7.901, 'train_steps_per_second': 0.987, 'total_flos': 1.3724994732869222e+17, 'train_loss': 0.3999929018605529, 'epoch': 2.0})
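Each iteration of the loop above creates a new model and trainer while the previous ones may still hold GPU memory, which matters given the OOM noted earlier. A minimal sketch of releasing memory between folds follows. The helper name `release_memory` is hypothetical and is not part of the original loop; it falls back to plain garbage collection when torch is not installed.

```python
import gc

def release_memory():
    """Run garbage collection and, if CUDA is available, free cached GPU blocks."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return  # torch is not installed; nothing more to release
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

release_memory()
```

In the loop, this would be called at the end of each fold after dropping the references, for example `del trainer, model` followed by `release_memory()`.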


Collect the data from the 5 rounds of training.

Training data for round 0:

| Epoch | Training Loss | Validation Loss |
| --- | --- | --- |
| 1 | 0.0233 | 0.02189 |
| 2 | 0.0138 | 0.01614 |
| 3 | 0.008800 | 0.011420 |
| 4 | 0.004600 | 0.013666 |
| 5 | 0.003200 | 0.004718 |
| 6 | 0.003000 | 0.004082 |
| 7 | 0.007200 | 0.001999 |
| 8 | 0.000000 | 0.000814 |
| 9 | 0.004900 | 0.002273 |
| 10 | 0.010200 | 0.002139 |



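In the earlier post [Fraud Text Classification Fine-tuning (7): Second Round of Single-GPU LoRA Tuning](https://golfxiao.blog.csdn.net/article/details/141500352), training began to overfit at around step 2300 (roughly two passes over the data): the validation loss stopped decreasing at 0.0161 and then started to rise. With K-fold cross-training, the loss did not reach its lowest point until after the 4th iteration (roughly eight passes over the data), and only the 5th iteration showed slight overfitting relative to the 4th. Overfitting is greatly reduced and the validation loss falls to a lower value of 0.000814, which suggests that the data is used more fully across training and validation.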

In [25]:
!ls -l /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_4/checkpoint-5454

total 216896
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua       768 Sep 13 17:27 adapter_config.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua  73911112 Sep 13 17:27 adapter_model.safetensors
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua 148047722 Sep 13 17:27 optimizer.pt
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      5140 Sep 13 17:27 README.md
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua     14244 Sep 13 17:27 rng_state.pth
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      1064 Sep 13 17:27 scheduler.pt
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua     96126 Sep 13 17:27 trainer_state.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      5240 Sep 13 17:27 training_args.bin


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
%%time
%run evaluate_v2.py
testdata_path = '/data2/anti_fraud/dataset/test0902.jsonl'
checkpoint_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_4/checkpoint-5454'
evaluate_v2(model_path, checkpoint_path, testdata_path, device, debug=True)

The output metrics are as follows (the reason field is scored with ROUGE-1):
```
is_fraud field metrics:
tn:1403, fp:90, fn:88, tp:1450
precision: 0.9415584415584416, recall: 0.9427828348504551, accuracy: 0.9412735070933685
fraud_speaker field metrics:
accuracy: 0.9168591224018475
reason field metrics:
precision: 0.44713901544014767, recall: 0.46087901443683615, f1-score: 0.4438610370678192
CPU times: user 37min 5s, sys: 36.8 s, total: 37min 42s
Wall time: 37min 13s
```
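For readers unfamiliar with ROUGE-1, the score is based on unigram overlap between the reference text and the generated text. The sketch below is a minimal illustration under stated assumptions: `evaluate_v2.py` is not shown in this post, so the function `rouge1` and the choice of token lists as input are hypothetical rather than the author's implementation.

```python
from collections import Counter

def rouge1(reference, candidate):
    """Return (precision, recall, f1) from clipped unigram overlap of two token lists."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(len(candidate), 1)
    recall = overlap / max(len(reference), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "caller asks victim to transfer money".split()
candidate = "caller asks for a bank transfer".split()
scores = rouge1(reference, candidate)
print(scores)  # (0.5, 0.5, 0.5)
```

For Chinese text, the tokens would typically be characters or segmented words; which one is used changes the score, so results are only comparable under the same tokenization.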


The evaluation metrics for the three fields are:

| Field | Metrics |
| --- | --- |
| is_fraud | precision: 0.9416, recall: 0.9428, accuracy: 0.9413 |
| fraud_speaker | accuracy: 0.9169 |
| reason | precision: 0.4471, recall: 0.4609, f1-score: 0.4439 |
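The is_fraud metrics follow directly from the confusion-matrix counts in the log output above. A short worked check, using only those four numbers:

```python
# Confusion-matrix counts copied from the evaluation output
tn, fp, fn, tp = 1403, 90, 88, 1450

precision = tp / (tp + fp)                    # share of predicted fraud that is real fraud
recall = tp / (tp + fn)                       # share of real fraud that was detected
accuracy = (tp + tn) / (tn + fp + fn + tp)    # share of all samples classified correctly

print(f"precision={precision:.4f}, recall={recall:.4f}, accuracy={accuracy:.4f}")
# precision=0.9416, recall=0.9428, accuracy=0.9413
```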

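**Summary**: By introducing K-fold cross-validation and cycling through different training and validation sets over several training iterations, this post brought the loss down to a lower value and largely relieved the overfitting that appeared in every earlier training run. In the final evaluation on a test set the model had never seen, precision and recall also improved substantially. K-fold cross-validation does let the model learn from the data more thoroughly, and the resulting model generalizes better.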

## References
- [A Summary of Cross-Validation Methods](https://blog.csdn.net/WHYbeHERE/article/details/108192957)