# EXL2
EXL2 is a more recent format that builds on the GPTQ optimization method but mixes quantization levels within a model. Different parts of the weights are stored at different bit widths so that the model as a whole hits a target average bitrate, with lower quantization error than GPTQ at the same or a similar bitrate. It can use slightly more VRAM, but offers better inference speed and quality.
### Quantizing with [exllamav2](https://github.com/turboderp/exllamav2)

Let's do a short demo and quantize Mistral 7B!

First, let's install `exllamav2` and all required dependencies.

In [1]:
!git clone https://github.com/turboderp/exllamav2
!(cd exllamav2 && pip install -r requirements.txt && pip install .)

Cloning into 'exllamav2'...
remote: Enumerating objects: 6958, done.
remote: Counting objects: 100% (1276/1276), done.
remote: Compressing objects: 100% (550/550), done.
remote: Total 6958 (delta 909), reused 1042 (delta 721), pack-reused 5682 (from 1)
Receiving objects: 100% (6958/6958), 19.88 MiB | 24.76 MiB/s, done.
Resolving deltas: 100% (4908/4908), done.
Collecting ninja (from -r requirements.txt (line 2))
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting fastparquet (from -r requirements.txt (line 5))
  Downloading fastparquet-2024.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting websockets (from -r requirements.txt (line 10))
  Downloading websockets-13.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting cramjam>=2.3 (from fastparquet->-r requirements.txt (line 5))
  Downloading cramja

Once everything is installed, we can download the model.

In [2]:
model_id = "mistralai/Mistral-7B-Instruct-v0.3"
user_name = "huggingface_username"
hf_token = "read_token"

!git lfs install
!git clone https://{user_name}:{hf_token}@huggingface.co/{model_id}

Git LFS initialized.
Cloning into 'Mistral-7B-Instruct-v0.3'...
remote: Enumerating objects: 93, done.
remote: Counting objects: 100% (89/89), done.
remote: Compressing objects: 100% (89/89), done.
remote: Total 93 (delta 46), reused 0 (delta 0), pack-reused 4 (from 1)
Unpacking objects: 100% (93/93), 734.97 KiB | 4.20 MiB/s, done.
Filtering content: 100% (5/5), 3.00 GiB | 12.59 MiB/s, done.
Encountered 4 file(s) that may not have been copied correctly on Windows:
	model-00002-of-00003.safetensors
	model-00003-of-00003.safetensors
	model-00001-of-00003.safetensors
	consolidated.safetensors

See: `git lfs help smudge` for more details.


Time to quantize! Let's go with a target bitrate of 4.0 bits per weight (bpw).
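
For a rough sense of what this bitrate means for the size of the result, the quantized weights take about `parameters × bits per weight / 8` bytes. The sketch below works that out. The parameter count of roughly 7.25 billion is an assumed approximate figure for Mistral 7B, `estimate_weight_bytes` is a hypothetical helper rather than part of `exllamav2`, and the estimate ignores any tensors the converter may store at a different precision.

```python
def estimate_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for n_params weights at the given average bitrate."""
    return n_params * bits_per_weight / 8  # 8 bits per byte


# Assumption: ~7.25e9 parameters, an approximate figure for Mistral 7B.
ASSUMED_PARAMS = 7.25e9

# Compare 16-bit weights with 8.0 and 4.0 bpw quantization.
for bpw in (16.0, 8.0, 4.0):
    size_gb = estimate_weight_bytes(ASSUMED_PARAMS, bpw) / 1e9
    print(f"{bpw:>4} bpw -> about {size_gb:.2f} GB")
```

Under these assumptions, going from 16-bit weights to 4.0 bpw cuts the weight storage to a quarter.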

In [None]:
model_name = model_id.split('/')[-1]
quant_bpw = 4.0

!mkdir -p temp
!python exllamav2/convert.py \
    -i {model_name} \
    -o temp/ \
    -cf {model_name}-exl2/{quant_bpw}bpw/ \
    -b {quant_bpw}

Streaming output truncated to the last 5000 lines.
 -- model.layers.2.self_attn.v_proj                    0.25:3b_64g/0.75:2b_64g s4                         2.34 bpw
 -- model.layers.2.self_attn.v_proj                    0.1:4b_128g/0.9:3b_128g s4                         3.19 bpw
 -- model.layers.2.self_attn.v_proj                    0.1:4b_64g/0.9:3b_64g s4                           3.20 bpw
 -- model.layers.2.self_attn.v_proj                    1:4b_128g s4                                       4.06 bpw
 -- model.layers.2.self_attn.v_proj                    1:4b_64g s4                                        4.09 bpw
 -- model.layers.2.self_attn.v_proj                    1:4b_32g s4                                        4.16 bpw
 -- model.layers.2.self_attn.v_proj                    0.1:5b_64g/0.9:4b_64g s4                           4.20 bpw
 -- model.layers.2.self_attn.v_proj                    0.1:5b_32g/0.9:4b_32g s4                           4.26 bpw
 -- model.layer
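
The option strings in the log above appear to describe how each tensor is split across bit widths: `0.25:3b_64g/0.75:2b_64g` reads as a quarter of the weights at 3 bits and the rest at 2 bits, each with a group size of 64. That reading is an assumption inferred from the log itself, not taken from the `exllamav2` documentation, and `nominal_bits` is a hypothetical helper. The sketch computes the weighted average bit width implied by such a string; the trailing `s4` field is ignored.

```python
import re


def nominal_bits(option: str) -> float:
    """Weighted average bit width of a string like '0.25:3b_64g/0.75:2b_64g s4'.

    Assumed format (inferred from the log): fraction:bits 'b_' group_size 'g',
    with parts separated by '/'.
    """
    spec = option.split()[0]  # keep only the first field, dropping e.g. 's4'
    total = 0.0
    for part in spec.split("/"):
        match = re.fullmatch(r"([\d.]+):(\d+)b_(\d+)g", part)
        if match is None:
            raise ValueError(f"unrecognized option part: {part!r}")
        fraction = float(match.group(1))
        bits = int(match.group(2))
        total += fraction * bits
    return total


# Option strings copied from the log above.
for option in ("0.25:3b_64g/0.75:2b_64g s4", "1:4b_128g s4", "1:4b_32g s4"):
    print(f"{option:<32} nominal {nominal_bits(option):.2f} bits")
```

The nominal values come out slightly below the bpw reported on each log line (for example 2.25 against 2.34), presumably because the reported figure also counts per-group overhead such as scales, which would also explain why smaller group sizes show a higher bpw at the same bit width.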

Model quantized and saved! You can test it with the following:

In [13]:
!python exllamav2/test_inference.py -m {model_name}-exl2/{quant_bpw}bpw -p "Once upon a time,"

 -- Model: Mistral-7B-Instruct-v0.3-exl2/4.0bpw
 -- Options: []
 -- Loading model...
 -- Loaded model in 2.0031 seconds
 -- Loading tokenizer...
 -- Warmup...
 -- Generating...

Once upon a time, in a land far, far away, children had to work hard to get their three meals a day. The chores were backbreaking and tedious. However, they were grateful for their food because they knew how difficult it was to produce, process, and prepare it.

Fast forward to the present time: a century of technological advancement has made food readily available at our fingertips. We no longer have to toil in the fields to gather food, or spend hours in the kitchen to prepare it. The ease with which we can now access delicious food has led to the proliferation of fast food chains

 -- Response generated in 2.27 seconds, 128 tokens, 56.51 tokens/second (includes prompt eval.)
