## Introduction

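llama.cpp is an open-source project created by Georgi Gerganov. It implements large language model (LLM) inference in C/C++, so that models can run on a wide range of hardware, including machines with only a CPU.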

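Its main advantages:
- No dependency on PyTorch or Python; models run from a compiled C++ executable.
- Broad hardware support, including NVIDIA, AMD, Intel, Apple Silicon, and Huawei Ascend chips.
- Supports f16 and f32 mixed precision, as well as integer quantization at 8-bit, 4-bit, and even lower bit widths (down to about 1.5 bits per weight) to speed up inference.
- No GPU required: it can run on the CPU alone, and even on Android devices.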

In this article we use llama.cpp to run the fraud-text classification model we fine-tuned earlier.

Run the GGUF model with `llama-server`, then access it at `http://192.168.31.200:8080/`.

In [None]:
!/data2/downloads/llama.cpp/llama-server -m /data2/anti_fraud/models/Qwen2-1___5B-anti_fraud_1__1/model-BF16.gguf -ngl 28 -fa --host 0.0.0.0 --port 8080
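
Once the server is running, it can also be queried programmatically. The sketch below assumes the server exposes the OpenAI-compatible `/v1/chat/completions` endpoint that recent `llama-server` builds provide. The helper names `build_chat_request` and `ask` are made up for illustration, and the prompt format the fine-tuned classifier expects is not covered here.

```python
import json
import urllib.request

BASE_URL = "http://192.168.31.200:8080"  # server address used in this article


def build_chat_request(text, base_url=BASE_URL, max_tokens=64):
    """Build (but do not send) a chat-completion request for llama-server."""
    payload = {
        "messages": [{"role": "user", "content": text}],
        "max_tokens": max_tokens,
        "temperature": 0.0,  # deterministic output suits a classification task
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(text, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(text), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("some text to classify")` only works while the server is reachable; `build_chat_request` can be inspected offline.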

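## Converting the model files
Our fine-tuned model consists of two parts: the base model and the LoRA adapter. Each is converted separately, and the two are merged at the end.

First, convert the base model with the `convert_hf_to_gguf.py` tool.

> Note: `convert_hf_to_gguf.py` is a script shipped with llama.cpp, located in its installation directory. It converts models in the safetensors format downloaded from Hugging Face into GGUF files.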

In [4]:
!python /data2/downloads/llama.cpp/convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf \
    --model-name qwen2 \
    /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct

INFO:hf-to-gguf:Loading model: Qwen2-1___5B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,         torch.bfloat16 --> BF16, shape = {1536, 151936}
INFO:hf-to-gguf:blk.0.attn_norm.weight,    torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.ffn_down.weight,     torch.bfloat16 --> BF16, shape = {8960, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,     torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight,       torch.bfloat16 --> BF16, shape = {1536, 8960}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,     torch.bfloat16 --> F32, shape = {1536}
INFO:hf-to-gguf:blk.0.attn_k.bias,         torch.bfloat16 --> F32, shape = {256}
INFO:hf-to-gguf:blk.0.attn_k.weight,       torch.bfloat16 --> BF16, shape = {1536, 256}
INFO:hf-to-gguf:blk.0.attn_output.weight,  torch.bfloat16 --> BF16, shape = {1536, 1536}
...
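
A quick way to sanity-check a converted file is to read its header. The sketch below assumes the GGUF layout used by version 2 and later: the 4-byte magic `GGUF`, a `uint32` version, then `uint64` tensor and metadata key-value counts, all little-endian (the converter log above notes the file is little-endian). The function names are illustrative and are not part of llama.cpp.

```python
import struct

HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(raw):
    """Parse the first 24 bytes of a GGUF file (assumes GGUF v2 or later)."""
    if len(raw) < HEADER_SIZE or raw[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic or truncated header")
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}


def read_gguf_header(path):
    """Read and parse the header of the GGUF file at `path`."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

For example, `read_gguf_header("/data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf")` would return the version and counts for the file produced above.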

Next, convert the LoRA adapter with the `convert_lora_to_gguf.py` script.

In [5]:
!python /data2/downloads/llama.cpp/convert_lora_to_gguf.py \
    --base /data2/anti_fraud/models/modelscope/hub/Qwen/Qwen2-1___5B-Instruct \
    --outfile /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf \
    /data2/anti_fraud/models/Qwen2-1___5B-Instruct_ft_0913_4/checkpoint-5454 \
    --outtype bf16 --verbose

INFO:lora-to-gguf:Loading base model: Qwen2-1___5B-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:lora-to-gguf:Exporting model...
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_a, torch.float32 --> BF16, shape = {8960, 16}
INFO:hf-to-gguf:blk.0.ffn_down.weight.lora_b, torch.float32 --> BF16, shape = {16, 1536}
INFO:hf-to-gguf:blk.0.ffn_gate.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.ffn_gate.weight.lora_b, torch.float32 --> BF16, shape = {16, 8960}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.ffn_up.weight.lora_b, torch.float32 --> BF16, shape = {16, 8960}
INFO:hf-to-gguf:blk.0.attn_k.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.attn_k.weight.lora_b, torch.float32 --> BF16, shape = {16, 256}
INFO:hf-to-gguf:blk.0.attn_output.weight.lora_a, torch.float32 --> BF16, shape = {1536, 16}
INFO:hf-to-gguf:blk.0.attn_output.weigh

When it finishes, we have the LoRA adapter as a GGUF file, `lora_0913_4_bf16.gguf`.

Use the `llama-export-lora` tool to merge the base model and the LoRA adapter into a single GGUF file.

In [6]:
!/data2/downloads/llama.cpp/llama-export-lora \
    -m /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf \
    -o /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf \
    --lora /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf

file_input: loaded gguf from /data2/anti_fraud/models/anti_fraud_v11/qwen2_bf16.gguf
file_input: loaded gguf from /data2/anti_fraud/models/anti_fraud_v11/lora_0913_4_bf16.gguf
copy_tensor :  blk.0.attn_k.bias [256, 1, 1, 1]
merge_tensor : blk.0.attn_k.weight [1536, 256, 1, 1]
merge_tensor :   + dequantize base tensor from bf16 to F32
merge_tensor :   + merging from adapter[0] type=bf16
merge_tensor :     input_scale=1.000000 calculated_scale=2.000000 rank=16
merge_tensor :   + output type is f16
copy_tensor :  blk.0.attn_norm.weight [1536, 1, 1, 1]
merge_tensor : blk.0.attn_output.weight [1536, 1536, 1, 1]
merge_tensor :   + dequantize base tensor from bf16 to F32
merge_tensor :   + merging from adapter[0] type=bf16
merge_tensor :     input_scale=1.000000 calculated_scale=2.000000 rank=16
merge_tensor :   + output type is f16
copy_tensor :  blk.0.attn_q.bias [1536, 1, 1, 1]
merge_tensor : blk.0.attn_q.weight [1536, 1536, 1, 1]
merge_tensor :   + dequantize base tensor from bf16 to F32



List the exported files:

```text
-rw-rw-r-- 1   42885408 Nov  9 14:57 lora_0913_4_bf16.gguf
-rw-rw-r-- 1 3093666720 Nov  9 14:58 model_bf16.gguf
-rw-rw-r-- 1 3093666720 Nov  9 14:56 qwen2_bf16.gguf
```


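After these three steps, the safetensors base model and the LoRA adapter have been exported as a single GGUF model file, `model_bf16.gguf`. The file size has not changed: it is still about 3 GB, the same 3093666720 bytes as `qwen2_bf16.gguf`.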
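
The three steps can be collected into one small helper so they are easy to rerun. This is only a sketch: it reuses the paths and flags from the commands above, while the function name `conversion_commands` and the generic adapter file name `lora_<outtype>.gguf` are assumptions made for illustration.

```python
LLAMA_CPP = "/data2/downloads/llama.cpp"
OUT_DIR = "/data2/anti_fraud/models/anti_fraud_v11"


def conversion_commands(base_dir, lora_dir, out_dir=OUT_DIR, outtype="bf16"):
    """Return the three commands (as argument lists) for convert, convert, merge."""
    base_gguf = f"{out_dir}/qwen2_{outtype}.gguf"
    lora_gguf = f"{out_dir}/lora_{outtype}.gguf"   # illustrative file name
    merged_gguf = f"{out_dir}/model_{outtype}.gguf"
    return [
        # 1. base model: safetensors -> GGUF
        ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py",
         "--outtype", outtype, "--outfile", base_gguf,
         "--model-name", "qwen2", base_dir],
        # 2. LoRA adapter -> GGUF
        ["python", f"{LLAMA_CPP}/convert_lora_to_gguf.py",
         "--base", base_dir, "--outfile", lora_gguf,
         "--outtype", outtype, lora_dir],
        # 3. merge base model and adapter into one GGUF file
        [f"{LLAMA_CPP}/llama-export-lora",
         "-m", base_gguf, "-o", merged_gguf, "--lora", lora_gguf],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so that a failing step stops the pipeline.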

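Use the `llama-cli` command to check that the model file works.
> `llama-cli` is a command-line interface that loads the model and runs a prompt in a single command, which makes it handy for quick testing and debugging.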

In [3]:
!/data2/downloads/llama.cpp/llama-cli --log-disable \
	-m /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf \
	-p "我是一个来自太行山下小村庄家的孩子" \
	-n 100

我是一个来自太行山下小村庄家的孩子，从小生活在乡下，对乡下充满了深深的思念，也对乡下人有着深深的敬仰。从小我便梦想着能够走出大山，去看看外面的世界，见识不一样的风景。如今，这个梦想终于成真，我如愿以偿地来到了这个繁华的城市，开始了全新的生活。
初到城市的时候，我感到既兴奋又迷茫，兴奋的是能够接触到不同的文化，接触到外面的世界；迷茫的是，自己曾经的生活环境和习惯突然改变，需要

## Quantization

Use the `llama-quantize` tool to quantize the model file from 16-bit to 8-bit.

In [14]:
!/data2/downloads/llama.cpp/llama-quantize \
    /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf \
    /data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf \
    q8_0

main: build = 3646 (cddae488)
main: built with cc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 for x86_64-linux-gnu
main: quantizing '/data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf' to '/data2/anti_fraud/models/anti_fraud_v11/model_bf16_q8_0.gguf' as Q8_0
llama_model_loader: loaded meta data with 28 key-value pairs and 338 tensors from /data2/anti_fraud/models/anti_fraud_v11/model_bf16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = qwen2
llama_model_loader: - kv   3:                           general.finetune str              = 1___5B-Instruct
llama_model_loader: - kv   4:                           general.basename str   

After quantization, the model file shrinks from `2944.68MB` to `1564.62MB`, about 53% of its original size, so it is nearly halved.
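
The size ratio is close to what the Q8_0 layout predicts. As a rough check, the sketch below assumes Q8_0 stores weights in blocks of 32 int8 values plus one 16-bit scale (34 bytes per block, or 8.5 bits per weight) and ignores the small tensors that stay in F32.

```python
BF16_MB = 2944.68  # size before quantization, from the text above
Q8_0_MB = 1564.62  # size after quantization

observed_ratio = Q8_0_MB / BF16_MB

# Assumed Q8_0 block layout: 32 int8 weights + one 2-byte fp16 scale.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = BLOCK_WEIGHTS * 1 + 2
bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS  # 8.5
expected_ratio = bits_per_weight / 16              # relative to 16-bit weights

print(f"observed ratio: {observed_ratio:.4f}")
print(f"expected ratio: {expected_ratio:.4f}")
```

The per-block scale is why the file is slightly larger than exactly half of the 16-bit version.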

## Converting to an Ollama model

In [5]:
!cat /data2/anti_fraud/models/anti_fraud_v11/anti_fraud.modelfile

FROM ./model_bf16_q8_0.gguf

TEMPLATE """{% set system_message = 'You are a helpful assistant.' %}{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ '<|im_start|>user\n' + content + '<|im_end|>\n<|im_start|>assistant\n' }}{% elif message['role'] == 'assistant' %}{{ content + '<|im_end|>' + '\n' }}{% endif %}{% endfor %}"""

PARAMETER stop "<|im_end|>"




> Note: Ollama does not support bf16 model files; importing one fails with `Error: invalid file magic`. See [the issue on bf16 support](https://github.com/ollama/ollama/issues/4670).

In [1]:
!ollama create qwen2:1.5b -f /data2/anti_fraud/models/anti_fraud_v11/qwen2.modelfile

transferring model data
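
After the model has been created, it can be called through Ollama's REST API. The sketch below assumes Ollama is listening on its default address `http://localhost:11434` and uses the `/api/generate` endpoint with streaming turned off; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumption)


def build_generate_request(prompt, model="qwen2:1.5b", base_url=OLLAMA_URL):
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_generate_request(prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]
```

The model name `qwen2:1.5b` matches the tag passed to `ollama create` above.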

In [None]:
Delete the model:

In [None]:
!ollama rm qwen2:1.5b