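## Introduction
The [previous article](https://golfxiao.blog.csdn.net/article/details/141440847) required a fair amount of hand-written code for training. Some frameworks support fine-tuning with no code at all, and [LLaMA-Factory](https://llamafactory.readthedocs.io/zh-cn/latest/) is one of them. It is a platform dedicated to fine-tuning and training large language models, with the following features:
- Supports common model families: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, and more.
- Supports single-GPU and multi-GPU training.
- Supports full-parameter fine-tuning, LoRA fine-tuning, and QLoRA fine-tuning.

It has many other features as well; see the documentation for details: [https://llamafactory.readthedocs.io/zh-cn/latest/](https://llamafactory.readthedocs.io/zh-cn/latest/)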

This article uses LLaMA-Factory to run a multi-GPU training job.

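## Parameter configuration
LLaMA-Factory stores its training parameters in YAML files. The `examples` subdirectory of the installation directory contains sample configurations for the various fine-tuning methods; you can copy one and modify it.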

![Screenshot](https://i-blog.csdnimg.cn/direct/95c191a523e840fc969c0d014c82047e.png)

View the configuration file:

In [4]:
!cat /data2/anti_fraud/train/sft-0910/qwen2_lora_sft.yaml 

### model
model_name_or_path: /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct
# resume_from_checkpoint: /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0826/checkpoint-1200

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.2


### dataset
dataset_dir: /data2/anti_fraud/dataset/
dataset: anti_fraud_0902
template: qwen
cutoff_len: 1024
max_samples: 200000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910-3
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

### train
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
gradient_checkpointing: true
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000

### eval
val_size: 0.1
per_device_eval_batch_size: 8
eval_stra
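
The `learning_rate`, `lr_scheduler_type: cosine`, and `warmup_ratio` settings above together determine the learning rate at each step. The sketch below shows the usual shape of such a schedule: a linear warmup followed by a half-cosine decay to zero. It is an illustration rather than LLaMA-Factory's own code. The function name `lr_at_step` and the total of 1000 steps are hypothetical, and rounding the warmup length up with `math.ceil` is an assumption about how the ratio is converted to steps.

```python
import math

def lr_at_step(step, total_steps, base_lr=1.0e-4, warmup_ratio=0.05):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    # Assumption: the warmup length is the ratio times the total, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half cosine wave from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length, for illustration only.
TOTAL_STEPS = 1000
for s in (0, 25, 50, 525, 1000):
    print(s, lr_at_step(s, TOTAL_STEPS))
```

With a warmup ratio of 0.05, only the first 5% of steps run below the peak rate of `1.0e-4`; the rest of the run decays smoothly.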

## Training 1: low learning rate

Set the environment variable `CUDA_VISIBLE_DEVICES` to declare that training may use four GPUs, with device indices 1, 2, 3, and 5.

Start training with the `llamafactory-cli` command.

In [5]:
import os 

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,5"

In [None]:
!llamafactory-cli train /data2/anti_fraud/train/sft-0910/qwen2_lora_sft.yaml 

[2024-09-10 22:43:56,919] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
  def forward(ctx, input, weight, bias=None):
  def backward(ctx, grad_output):
09/10/2024 22:44:04 - INFO - llamafactory.cli - Initializing distributed tasks at: 127.0.0.1:21321
W0910 22:44:05.753000 140124487410624 torch/distributed/run.py:779] 
W0910 22:44:05.753000 140124487410624 torch/distributed/run.py:779] *****************************************
W0910 22:44:05.753000 140124487410624 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0910 22:44:05.753000 140124487410624 torch/distributed/run.py:779] *****************************************
[2024-09-10 22:44:10,233] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-10 22
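
With four visible GPUs, the number of samples consumed per optimizer step is larger than `per_device_train_batch_size` alone suggests. The sketch below works this out from the values in the configuration, assuming plain data-parallel training with one process per visible GPU. The helper names `visible_gpu_count` and `effective_batch_size` are illustrative and are not part of LLaMA-Factory.

```python
def visible_gpu_count(cuda_visible_devices):
    """Count the device indices listed in a CUDA_VISIBLE_DEVICES string."""
    return len([d for d in cuda_visible_devices.split(",") if d.strip()])

def effective_batch_size(per_device_batch_size, grad_accum_steps, num_gpus):
    """Samples per optimizer step under data-parallel training."""
    return per_device_batch_size * grad_accum_steps * num_gpus

# Values taken from the notebook cell and the YAML configuration.
num_gpus = visible_gpu_count("1,2,3,5")
global_batch = effective_batch_size(
    per_device_batch_size=8, grad_accum_steps=1, num_gpus=num_gpus
)
print(num_gpus, global_batch)  # 4 32
```

Changing the number of GPUs therefore changes the global batch size unless `gradient_accumulation_steps` is adjusted to compensate.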

## Evaluation on the validation dataset

In [4]:
!ls -l /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910

total 83896
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      768 Sep 10 20:10 adapter_config.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua 73911112 Sep 10 20:10 adapter_model.safetensors
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua       80 Sep 10 20:10 added_tokens.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      349 Sep 10 20:11 all_results.json
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 17:45 checkpoint-1000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 17:57 checkpoint-1500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 18:10 checkpoint-2000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 18:22 checkpoint-2500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 18:35 checkpoint-3000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 18:47 checkpoint-3500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 19:00 checkpoint-4000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 19:12 checkpoint-4500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
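
The size of `adapter_model.safetensors` in the listing above can be cross-checked against the LoRA settings (`lora_rank: 16` on seven projection modules). For a linear layer with input width `d_in` and output width `d_out`, LoRA adds `rank * (d_in + d_out)` parameters. The sketch below applies that formula. The Qwen2-1.5B dimensions in it (hidden size 1536, MLP intermediate size 8960, key/value projection width 256, 28 layers) are assumptions taken from the published model configuration and are not stated in this article.

```python
# Assumed Qwen2-1.5B dimensions (not stated in the article).
HIDDEN = 1536
INTERMEDIATE = 8960
KV_DIM = 256
NUM_LAYERS = 28
LORA_RANK = 16  # lora_rank from the YAML configuration

# (d_in, d_out) for each module named in lora_target.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(shapes, rank, num_layers):
    """LoRA adds an A matrix (rank x d_in) and a B matrix (d_out x rank) per module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

total_params = lora_param_count(TARGET_SHAPES, LORA_RANK, NUM_LAYERS)
estimated_bytes = total_params * 4  # assuming 32-bit floats
print(total_params, estimated_bytes)
```

Under these assumptions the estimate is 73,859,072 bytes, about 52 KB less than the 73,911,112 bytes in the listing. That gap would be consistent with 32-bit adapter weights plus file metadata.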


In [1]:
%run evaluate.py
testdata_path = '/data2/anti_fraud/dataset/eval0902.jsonl'
model_path = '/data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct'
device = 'cuda:1'

In [3]:
%%time
## eval_loss=0.0152
checkpoint_path_6500 = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910/checkpoint-6500'
evaluate(model_path, checkpoint_path_6500, testdata_path, device, batch=True, debug=True)

progress: 100%|██████████| 3031/3031 [04:16<00:00, 11.83it/s]

tn：1477, fp:52, fn:220, tp:1282
precision: 0.9610194902548725, recall: 0.8535286284953395
CPU times: user 4min 17s, sys: 25.9 s, total: 4min 42s
Wall time: 4min 19s
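
The precision and recall printed above follow directly from the four confusion-matrix counts. The sketch below recomputes them using the standard definitions. The helper name `classification_metrics` is illustrative and is not taken from `evaluate.py`, whose implementation is not shown here.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }

# Counts from the evaluation output above.
metrics = classification_metrics(tn=1477, fp=52, fn=220, tp=1282)
print(metrics["precision"], metrics["recall"])
```

The high precision and lower recall reflect the counts: only 52 false positives, but 220 fraud samples missed.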





In [7]:
%%time
## eval_loss=0.0152
checkpoint_path_6500 = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910-3/checkpoint-6500'
evaluate(model_path, checkpoint_path_6500, testdata_path, device, batch=True, debug=True)

progress:  61%|██████    | 1856/3031 [02:48<04:21,  4.49it/s]

invalid json: {"input_text": "在对话中提到的某公司招聘，要求提供个人敏感信息（如身份证号码、银行卡号等），这是典型的诈骗行为。"} {"is_fraud": true}


progress: 100%|██████████| 3031/3031 [04:28<00:00, 11.29it/s]

tn：1451, fp:78, fn:150, tp:1352
precision: 0.9454545454545454, recall: 0.9001331557922769
CPU times: user 4min 26s, sys: 28.5 s, total: 4min 54s
Wall time: 4min 31s





In [8]:
!ls -l /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910-3

total 83896
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      768 Sep 11 01:36 adapter_config.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua 73911112 Sep 11 01:36 adapter_model.safetensors
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua       80 Sep 11 01:36 added_tokens.json
-rw-rw-r-- 1 xiaoguanghua xiaoguanghua      349 Sep 11 01:36 all_results.json
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 23:11 checkpoint-1000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 23:23 checkpoint-1500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 23:35 checkpoint-2000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 10 23:48 checkpoint-2500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 11 00:00 checkpoint-3000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 11 00:13 checkpoint-3500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 11 00:25 checkpoint-4000
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096 Sep 11 00:38 checkpoint-4500
drwxrwxr-x 2 xiaoguanghua xiaoguanghua     4096

In [2]:
%%time
%run evaluate_v2.py
checkpoint_path = '/data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0910-3/checkpoint-6500'
evaluate_v2(model_path, checkpoint_path, testdata_path, device, debug=True)

progress:  35%|███▍      | 1048/3031 [12:42<23:29,  1.41it/s]

invalid json: {"is_fraud": true, "fraud_speaker": "王强", "reason": "王强通过引导李丽进行充值，并承诺稳赚不赔的收益，属于典型的网络投资诈骗手法。"}, {"is_fraud": true, "fraud_speaker": "张华", "reason": "张华冒充电商客服，要求覃军通过支付宝备用金功能将资金转到指定的银行卡账户上，这种操作方式具有明显的诈骗特征。"}


progress:  37%|███▋      | 1128/3031 [13:42<23:41,  1.34it/s]

invalid json: {"is_fraud": true, "fraud_speaker": "诈骗者", "reason": "在对话中，'诈骗者'引导'小梅'进行提现操作，并要求其提供银行账号信息。这是典型的网络博彩诈骗手段之一，通过获取受害者的银行信息进行进一步的欺诈行为。'}


progress:  68%|██████▊   | 2072/3031 [25:11<15:33,  1.03it/s]

invalid json: {"is_fraud": true, "fraud_speaker": "张伟", "reason": "张伟试图通过私下退款并额外返还20元的方式吸引李婷加入微信，并可能进一步获取她的个人信息。这种操作方式具有明显的诈骗特征，因为通常情况下，正规平台不会通过私下退款的方式处理问题，且张伟提到的信息可能是虚假的。"}, {"is_fraud": true, "fraud_speaker": "姜丽", "reason": "姜丽声称新出的一个套餐只需要支付20元并且可以每月获得120G的流量，但这种说法并不真实可信，可能是诱导用户充值或付款的骗局。"}, {"is_fraud": true, "fraud_speaker": "刘志强", "reason": "刘志强声称自己是某银行的客服，但银行一般不会通过这种方式通知客户账户问题，而且5万元转账境外的问题也不符合常理，可能是冒用银行名义进行诈骗。"}


progress: 100%|██████████| 3031/3031 [36:49<00:00,  1.37it/s]


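Metrics for the is_fraud field:
tn：1421, fp:108, fn:93, tp:1409
precision: 0.928806855636124, recall: 0.9380825565912118, accuracy: 0.9336852523919499
Metrics for the fraud_speaker field:
accuracy: 0.90498185417354
Metrics for the reason field:
precision: 0.44658668846108157, recall: 0.4553686294916234, f1-score: 0.4406380456120786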
precision: 0.44658668846108157, recall: 0.4553686294916234, f1-score: 0.4406380456120786
CPU times: user 36min 42s, sys: 39.4 s, total: 37min 22s
Wall time: 36min 56s
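
Several of the `invalid json` lines in the outputs above are not malformed in the strict sense. The model emitted more than one JSON object in a single response, so parsing the whole string fails. One possible way to salvage such responses is to decode objects one at a time with `json.JSONDecoder.raw_decode`. This is a sketch, not the approach used by `evaluate.py` or `evaluate_v2.py`. The function names are illustrative, and taking the first object that contains `is_fraud` is an assumed scoring policy.

```python
import json

def extract_json_objects(text):
    """Return every JSON object that can be decoded from `text`, in order."""
    decoder = json.JSONDecoder()
    objects = []
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one value at `start` and reports where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            # Not a valid object here; move on to the next opening brace.
            pos = start + 1
            continue
        if isinstance(obj, dict):
            objects.append(obj)
        pos = end
    return objects

def first_prediction(text, key="is_fraud"):
    """Return the first decoded object that contains `key`, or None."""
    for obj in extract_json_objects(text):
        if key in obj:
            return obj
    return None

# Shapes modelled on the failures above, with shortened contents.
print(first_prediction('{"input_text": "x"} {"is_fraud": true}'))
print(first_prediction('{"is_fraud": true, "fraud_speaker": "A"}, {"is_fraud": false}'))
print(first_prediction('{"is_fraud": true, "reason": "unterminated\'}'))
```

The third case, where the closing quote of a string is wrong, stays unrecoverable and returns `None`, so it would still need to be counted as a failed prediction.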
