<a href="https://colab.research.google.com/github/lunaczp/learn-ai/blob/main/notebooks/llama-light/convert_and_quantiz.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convert and Quantize Chinese LLaMA and Alpaca Models

Project repository: https://github.com/ymcui/Chinese-LLaMA-Alpaca

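⚠️ Memory requirements (make sure the machine you are assigned has more RAM than listed below):
- 7B model: 15G+
- 13B model: 18G+
- 33B model: 22G+

💡 Tips and tricks:
- Free-tier users get only about 12G of RAM by default, which is not enough to convert the models. **In our tests, selecting a TPU runtime sometimes yields a machine with 35G of RAM**, so it is worth retrying a few times
- Pro(+) users: choose "Runtime" -> "Change runtime type" -> "High-RAM"
- If the program crashes or disconnects for no apparent reason, it has most likely run out of memory
- If RAM is still insufficient after choosing "High-RAM", try the following; Colab sometimes assigns a machine with much more memory. Good luck 😄!
    - Also select a GPU or TPU (even though it will not be used)
    - When selecting a GPU, Pro(+) users can pick the "A100" type

*Reminder: disconnect the runtime when you are done, and choose the lowest configuration that meets the requirements to avoid spending compute units unnecessarily (Pro includes only 100 compute units).*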
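Before starting a long conversion, it can help to check whether the assigned machine meets the memory requirements listed above. The following is a minimal sketch, assuming a Linux runtime such as Colab where `os.sysconf` exposes the page size and page count; the helper names `total_ram_gb` and `has_enough_ram` are made up for this illustration.

```python
import os

# Minimum RAM (in GiB) per model size, taken from the list above.
REQUIRED_RAM_GB = {"7B": 15, "13B": 18, "33B": 22}


def total_ram_gb() -> float:
    """Return the total physical memory in GiB (Linux; relies on os.sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    page_count = os.sysconf("SC_PHYS_PAGES")
    return page_size * page_count / 1024**3


def has_enough_ram(model_size: str, available_gb: float) -> bool:
    """True if available_gb is strictly greater than the listed requirement."""
    return available_gb > REQUIRED_RAM_GB[model_size]


if __name__ == "__main__":
    ram = total_ram_gb()
    for size, need in REQUIRED_RAM_GB.items():
        status = "ok" if has_enough_ram(size, ram) else "not enough"
        print(f"{size}: need more than {need}G, have {ram:.1f}G -> {status}")
```

If the check fails, reconnecting to a different runtime type as described in the tips is usually the next thing to try.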

## Install dependencies

In [1]:
!nvidia-smi
!pip install torch==1.13.1
!pip install transformers==4.30.2
!pip install peft==0.3.0
!pip install sentencepiece

Mon May 20 10:53:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   50C    P8              13W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Clone the repositories

In [2]:
!git clone https://github.com/ymcui/Chinese-LLaMA-Alpaca
!git clone https://github.com/ggerganov/llama.cpp

Cloning into 'Chinese-LLaMA-Alpaca'...
remote: Enumerating objects: 2178, done.
remote: Counting objects: 100% (520/520), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 2178 (delta 444), reused 409 (delta 402), pack-reused 1658
Receiving objects: 100% (2178/2178), 23.51 MiB | 22.23 MiB/s, done.
Resolving deltas: 100% (1356/1356), done.
Cloning into 'llama.cpp'...
remote: Enumerating objects: 25003, done.
remote: Counting objects: 100% (9011/9011), done.
remote: Compressing objects: 100% (518/518), done.
remote: Total 25003 (delta 8780), reused 8521 (delta 8492), pack-reused 15992
Receiving objects: 100% (25003/25003), 44.43 MiB | 19.33 MiB/s, done.
Resolving deltas: 100% (17802/17802), done.


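## Merge the model (Alpaca-7B as an example)

This example uses a base model from the 🤗 Model Hub that is already in HF format, rather than Facebook's official LLaMA release, so the step that converts the original LLaMA weights to HF format is skipped.
**We go straight to step two: merging the LoRA weights** to produce the full model weights. Each model can be given either as a 🤗 Model Hub ID or as a local path.
- Base model: `elinas/llama-7b-hf-transformers-4.29` *(use at your own risk; we compared the SHA256 and it matches the official release, but you should make sure you are entitled to use this model)*
- LoRA model: `ziqingyang/chinese-alpaca-lora-7b` (the command below uses the `hfl/chinese-alpaca-lora-7b` ID)
   - For Alpaca-Plus models, remember to pass both the LLaMA and the Alpaca LoRA; see the tutorial [here](https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/手动模型合并与转换#多lora权重合并适用于chinese-alpaca-plus)
- Output format: either pth or huggingface. pth is chosen here because the model is quantized with llama.cpp afterwards

The models have to be downloaded first, so please be patient, especially with the 33B model.
The merged model is written to the `alpaca-combined` directory.
If you do not need a quantized model, you are done after this step and can download the result or copy it to Google Drive.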
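To make "merging the LoRA weights" more concrete, here is a toy sketch of the standard LoRA merge rule, W' = W + (alpha / r) · B·A, applied to a single small matrix in plain Python. It is only an illustration of the idea: the function names are invented for this example, and the project's merge script handles real tensors and further details that are not shown here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a).

    Shapes: weight is d_out x d_in, lora_b is d_out x rank, lora_a is rank x d_in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]    # rank x d_in
B = [[0.5], [1.0]]  # d_out x rank
merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [2.0, 5.0]]
```

After merging, the adapter is no longer needed at inference time, which is why the output can be saved as a single full-weight checkpoint.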

In [3]:
!python ./Chinese-LLaMA-Alpaca/scripts/merge_llama_with_chinese_lora_low_mem.py \
    --base_model 'elinas/llama-7b-hf-transformers-4.29' \
    --lora_model 'hfl/chinese-alpaca-lora-7b' \
    --output_type pth \
    --output_dir alpaca-combined

Base model: elinas/llama-7b-hf-transformers-4.29
LoRA model(s) ['hfl/chinese-alpaca-lora-7b']:
Loading hfl/chinese-alpaca-lora-7b
Cannot find lora model on the disk. Downloading lora model from hub...
Fetching 7 files:   0% 0/7 [00:00<?, ?it/s]
.gitattributes: 100% 1.48k/1.48k [00:00<00:00, 9.68MB/s]
Fetching 7 files:  14% 1/7 [00:00<00:02,  2.06it/s]
adapter_config.json: 100% 472/472 [00:00<00:00, 3.61MB/s]
tokenizer_config.json: 100% 166/166 [00:00<00:00, 1.22MB/s]
special_tokens_map.json: 100% 96.0/96.0 [00:00<00:00, 621kB/s]
README.md: 100% 316/316 [00:00<00:00, 2.25MB/s]
Fetching 7 files:  29% 2/7 [00:00<00:01,  3.01it/s]
adapter_model.bin:   0% 0.00/858M [00:00<?, ?B/s]
tokenizer.model:   0% 0.00/758k [00:00<?, ?B/s]
adapter_model.bin:   1% 10.5M/858M [00:00<00:18, 46.1MB/s]
tokenizer.model: 100% 758k/758k [00:00<00:00, 2.47MB/s]
adapter_model.bin:   2% 21.0M/858M [00:00<00:16, 52.3MB/s]
adapter_model.bin:   4% 31.5M/858M [00:00<00:12, 63.7MB/s]
adapter_mo

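## Verify the SHA256

Full list of reference values: https://github.com/ymcui/Chinese-LLaMA-Alpaca/blob/main/SHA256.md

Reference SHA256 for the Alpaca-7B model produced in this example:
- fbfccc91183169842aac8d093379f0a449b5a26c5ee7a298baf0d556f1499b90

Running the command below shows that the two values match, so the merge is correct.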

In [4]:
!sha256sum alpaca-combined/consolidated.*.pth

fbfccc91183169842aac8d093379f0a449b5a26c5ee7a298baf0d556f1499b90  alpaca-combined/consolidated.00.pth
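The same comparison can be done programmatically, which is convenient when several shards need checking. This is a small sketch using Python's standard `hashlib`; the function names are illustrative, and the file is read in chunks so that a checkpoint of many gigabytes does not have to fit in memory.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a reference value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the example above, one would call `matches_reference("alpaca-combined/consolidated.00.pth", "fbfccc91...")` with the full reference value from the list.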


## Quantize the model
Next, we use [llama.cpp](https://github.com/ggerganov/llama.cpp) to convert the full-weight model produced in the previous step into a 4-bit quantized model.

### Build the tools

First, compile llama.cpp.

In [5]:
!cd llama.cpp && make

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_LLAMAFILE 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       c++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D

In [7]:
!cd llama.cpp && python convert.py --help

usage: convert.py [-h] [--dump] [--dump-single] [--vocab-only] [--no-vocab]
                  [--outtype {f32,f16,q8_0}] [--vocab-dir VOCAB_DIR] [--vocab-type VOCAB_TYPE]
                  [--outfile OUTFILE] [--ctx CTX] [--concurrency CONCURRENCY] [--big-endian]
                  [--pad-vocab] [--skip-unknown] [--verbose] [--metadata METADATA] [--get-outfile]
                  model

Convert a LLaMA model to a GGML compatible file

positional arguments:
  model                 directory containing model file, or model file itself (*.pth, *.pt, *.bin)

options:
  -h, --help            show this help message and exit
  --dump                don't convert, just show what's in the model
  --dump-single         don't convert, just show what's in a single model file
  --vocab-only          extract only the vocab
  --no-vocab            store model without the vocab
  --outtype {f32,f16,q8_0}
                        output format - note: q8_0 may be very slow (default: f16 or f32 based on
  

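### Convert the model to ggml format (FP16)

In this step, we convert the model to an FP16 ggml-compatible file (written as a `.gguf` file in the cell below).
- The upstream instructions first move the `alpaca-combined` directory, placing the model files under `llama.cpp/zh-models/7B` and `tokenizer.model` under `llama.cpp/zh-models`. The cells below instead rename `alpaca-combined` to `llama-7b-alpaca-light` and pass that directory to `convert.py` directly
- Where is the tokenizer?
    - It is in the `alpaca-combined` directory
    - Or download it from: https://huggingface.co/ziqingyang/chinese-alpaca-lora-7b/resolve/main/tokenizer.model (note: the Alpaca and LLaMA `tokenizer.model` files are not interchangeable!)

💡 Tips for converting 13B/33B models:
- The 7B tokenizer can be reused as is; the 13B/33B tokenizers are identical to the 7B one
- The Alpaca and LLaMA `tokenizer.model` files are not interchangeable!
- Wherever "7B" appears below it is only a directory name and has no effect on the conversion, so you may change it or leave it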
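Since a missing tokenizer or checkpoint file is an easy mistake to make at this point, a quick pre-flight check of the model directory can save a failed run. The sketch below assumes the layout shown by the `ls` output in the next cell (`consolidated.*.pth`, `params.json`, `tokenizer.model`); the function name `missing_model_files` is hypothetical.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the expected file names or patterns that are absent from model_dir."""
    root = Path(model_dir)
    missing = []
    # At least one checkpoint shard such as consolidated.00.pth must exist.
    if not any(root.glob("consolidated.*.pth")):
        missing.append("consolidated.*.pth")
    for name in ("params.json", "tokenizer.model"):
        if not (root / name).is_file():
            missing.append(name)
    return missing
```

An empty list from `missing_model_files("llama-7b-alpaca-light")` means the directory has the files this check looks for.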

In [8]:
!mv alpaca-combined llama-7b-alpaca-light
!ls llama-7b-alpaca-light

consolidated.00.pth  params.json  special_tokens_map.json  tokenizer_config.json  tokenizer.model


In [9]:
!mkdir models
!python llama.cpp/convert.py --outtype f16 --outfile models/llama-7b-alpaca-light-f16.gguf llama-7b-alpaca-light/

INFO:convert:Loading model file llama-7b-alpaca-light/consolidated.00.pth
INFO:convert:model parameters count : 6885494784 (7B)
INFO:convert:params = Params(n_vocab=49954, n_embd=4096, n_layer=32, n_ctx=2048, n_ff=11008, n_head=32, n_head_kv=32, n_experts=None, n_experts_used=None, f_norm_eps=1e-06, rope_scaling_type=None, f_rope_freq_base=None, f_rope_scale=None, n_orig_ctx=None, rope_finetuned=None, ftype=<GGMLFileType.MostlyF16: 1>, path_model=PosixPath('llama-7b-alpaca-light'))
INFO:convert:Loaded vocab file PosixPath('llama-7b-alpaca-light/tokenizer.model'), type 'spm'
INFO:convert:Vocab info: <SentencePieceVocab with 49954 base tokens and 0 added tokens>
INFO:convert:Special vocab info: <SpecialVocab with 0 merges, special tokens unset, add special tokens {'bos': True, 'eos': False}>
INFO:convert:Writing models/llama-7b-alpaca-light-f16.gguf, format 1
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab

### Quantize the FP16 model to 4-bit

We now convert the FP16 model into a 4-bit quantized model, using the newer Q4_K method (`Q4_K_M`).
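As a rough sanity check on what "4-bit" means in practice, the effective number of bits per weight can be estimated from the file size and the parameter count. The sketch below uses the parameter count reported in the `convert.py` log (6885494784) and the approximate file sizes shown in the upload progress bars later in this notebook (13.8G and 4.18G). It assumes those sizes are in decimal gigabytes and ignores the small overhead of metadata and vocabulary, so the results are estimates only.

```python
def bits_per_weight(file_size_bytes: float, n_params: int) -> float:
    """Estimate the average number of bits stored per model parameter."""
    return file_size_bytes * 8 / n_params


N_PARAMS = 6_885_494_784  # "model parameters count" from the convert.py log
F16_BYTES = 13.8e9        # approximate size of the FP16 .gguf file
Q4_BYTES = 4.18e9         # approximate size of the Q4_K_M .gguf file

print(round(bits_per_weight(F16_BYTES, N_PARAMS), 1))  # 16.0
print(round(bits_per_weight(Q4_BYTES, N_PARAMS), 1))   # 4.9
```

The quantized file averages somewhat more than 4 bits per weight, which is in line with a block-wise scheme that stores extra per-block data alongside the 4-bit values.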

In [10]:
!cd llama.cpp && ./quantize ../models/llama-7b-alpaca-light-f16.gguf ../models/llama-7b-alpaca-light-Q4_K_M.gguf Q4_K_M

main: build = 2947 (26cd4237)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '../models/llama-7b-alpaca-light-f16.gguf' to '../models/llama-7b-alpaca-light-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 18 key-value pairs and 291 tensors from ../models/llama-7b-alpaca-light-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-7b-alpaca-light
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 49954
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: 

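### (Optional) Test decoding with the quantized model
All conversion steps are now complete.
We run one command to check that the model loads correctly and can generate a response.

The FP16 and Q4 quantized files are stored under `./models` and can be downloaded as needed.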

In [14]:
!cd llama.cpp && ./main -m ../models/llama-7b-alpaca-light-Q4_K_M.gguf --color -p "详细介绍一下北京的名胜古迹：" -n 1280

Log start
main: build = 2947 (26cd4237)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1716204254
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/llama-7b-alpaca-light-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = llama-7b-alpaca-light
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 49954
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32              = 32
llama_

# Upload

In [15]:
! pip install -U "huggingface_hub[cli]"

Collecting huggingface_hub[cli]
  Downloading huggingface_hub-0.23.0-py3-none-any.whl (401 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 401.2/401.2 kB 6.6 MB/s eta 0:00:00
Collecting InquirerPy==0.3.4 (from huggingface_hub[cli])
  Downloading InquirerPy-0.3.4-py3-none-any.whl (67 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 67.7/67.7 kB 6.9 MB/s eta 0:00:00
Collecting pfzy<0.4.0,>=0.3.1 (from InquirerPy==0.3.4->huggingface_hub[cli])
  Downloading pfzy-0.3.4-py3-none-any.whl (8.5 kB)
Installing collected packages: pfzy, InquirerPy, huggingface_hub
  Attempting uninstall: huggingface_hub
    Found existing installation: huggingface-hub 0.20.3
    Uninstalling huggingface-hub-0.20.3:
      Successfully uninstalled huggingface-hub-0.20.3
Successfully installed InquirerPy-0.3.4 huggingface_hub-0.23.0 pfzy-0.3.4


In [16]:
! huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) 
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your term

Upload the merged (unquantized) model

In [17]:
! cd llama-7b-alpaca-light/ && huggingface-cli upload LightXXXXX/llama-7b-alpaca-light . .

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
consolidated.00.pth:   0% 0.00/13.8G [00:00<?, ?B/s]
tokenizer.model:   0% 0.00/758k [00:00<?, ?B/s]
consolidated.00.pth:   0% 16.4k/13.8G [00:00<30:13:04, 127kB/s]
tokenizer.model: 100% 758k/758k [00:00<00:00, 1.26MB/s]
consolidated.00.pth: 100% 13.8G/13.8G [05:12<00:00, 44.0MB/s]
Upload 2 LFS files: 100% 2/2 [05:13<00:00, 156.72s/it]
https://huggingface.co/LightXXXXX/llama-7b-alpaca-light/tree/main/.


Upload the GGUF files

In [18]:
! cd models/ && huggingface-cli upload LightXXXXX/llama-7b-alpaca-light-gguf . .

Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
llama-7b-alpaca-light-Q4_K_M.gguf:   0% 0.00/4.18G [00:00<?, ?B/s]
Upload 2 LFS files:   0% 0/2 [00:00<?, ?it/s]
llama-7b-alpaca-light-f16.gguf:   0% 0.00/13.8G [00:00<?, ?B/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   0% 3.46M/4.18G [00:00<04:05, 17.0MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   0% 6.95M/4.18G [00:00<03:27, 20.2MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   0% 10.6M/4.18G [00:00<02:52, 24.2MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   1% 44.0M/4.18G [00:01<02:18, 29.9MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   1% 48.0M/4.18G [00:02<03:54, 17.7MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   1% 54.8M/4.18G [00:02<02:51, 24.0MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   1% 59.6M/4.18G [00:02<02:41, 25.5MB/s]
llama-7b-alpaca-light-Q4_K_M.gguf:   2% 64.0M/4.18G [00:02<03:41, 18.6MB/s]
llama-7b-alpaca-light-Q4_
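
The uploads above use the `huggingface-cli` command. For scripted workflows, roughly the same thing can be done from Python with the `huggingface_hub` library. The sketch below is an assumption-laden illustration rather than part of the original notebook: the helper names are invented, it presumes a prior login with a write token, and it restricts the upload to `.gguf` files through `allow_patterns`.

```python
from pathlib import Path


def gguf_files(folder):
    """Return the sorted names of .gguf files directly inside folder."""
    return sorted(p.name for p in Path(folder).glob("*.gguf"))


def upload_gguf_folder(folder, repo_id):
    """Upload every .gguf file in folder to the model repo repo_id."""
    # Imported lazily so gguf_files() works even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=str(folder),
        repo_id=repo_id,
        repo_type="model",
        allow_patterns=["*.gguf"],
    )
```

A call such as `upload_gguf_folder("models", "LightXXXXX/llama-7b-alpaca-light-gguf")` would mirror the last cell, and `gguf_files("models")` can be used beforehand to confirm which files would be sent.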