# AWQ on Vicuna

In this notebook, we use the Vicuna model to demonstrate how AWQ performs on instruction-tuned models. We implement real INT4 inference kernels for AWQ, wrapped as PyTorch modules so that existing models can use them with little modification. We also give a simple example of how to quantize a model with AWQ and how to save and load the quantized checkpoint.

To run this notebook, install the following packages:
- [AWQ](https://github.com/mit-han-lab/llm-awq)
- [PyTorch](https://pytorch.org/)
- [Accelerate](https://github.com/huggingface/accelerate)
- [FastChat](https://github.com/lm-sys/FastChat)
- [Transformers](https://github.com/huggingface/transformers)
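
Before running the cells, it can help to confirm that all five packages are importable. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_modules` is hypothetical, and the import names are the ones used in the import cell of this notebook.

```python
from importlib.util import find_spec

# Top-level import names used by the import cell of this notebook.
REQUIRED_MODULES = ["awq", "torch", "accelerate", "fastchat", "transformers"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that each package can be located; it does not verify versions or that the AWQ CUDA kernels were built.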

In [None]:
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from awq.quantize.quantizer import real_quantize_model_weight
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from fastchat.serve.cli import SimpleChatIO
from fastchat.serve.inference import generate_stream
from fastchat.conversation import get_conv_template
import os
# This demo only supports a single GPU for now
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

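First, obtain the Vicuna model from [FastChat](https://github.com/lm-sys/FastChat). Then run the following command to generate a quantized model checkpoint. The command reads previously computed AWQ search results from `awq_cache/vicuna-7b-w4-g128.pt` (via `--load_awq`) and writes the real-quantized weights to `quant_cache/`.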

```bash
mkdir quant_cache
python -m awq.entry --model_path [vicuna-7b_model_path] \
    --w_bit 4 --q_group_size 128 \
    --load_awq awq_cache/vicuna-7b-w4-g128.pt \
    --q_backend real --dump_quant quant_cache/vicuna-7b-w4-g128-awq.pt
```
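
The flags `--w_bit 4` and `--q_group_size 128` select 4-bit weights with one scale and one zero point per group of 128 weights. As a rough illustration of what that means numerically, here is a minimal pure-Python sketch of group-wise asymmetric quantization. It is not the AWQ implementation, which additionally applies activation-aware scaling and runs packed INT4 CUDA kernels; the function names `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** w_bit - 1                      # largest integer code, 15 for 4 bits
    w_min, w_max = min(weights), max(weights)
    scale = max(w_max - w_min, 1e-5) / q_max    # step size; the clamp avoids a zero scale
    zero = min(max(-round(w_min / scale), 0), q_max)  # integer code that maps to 0.0
    codes = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(c - zero) * scale for c in codes]


def quantize_row(row, group_size=128, w_bit=4):
    """Quantize one weight row group by group; each group gets its own scale and zero."""
    return [
        quantize_group(row[start:start + group_size], w_bit)
        for start in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    # Synthetic weights for illustration only (not taken from any real model).
    row = [0.1 * math.sin(i) for i in range(256)]
    for index, (codes, scale, zero) in enumerate(quantize_row(row)):
        restored = dequantize_group(codes, scale, zero)
        original = row[index * 128:(index + 1) * 128]
        worst = max(abs(a - b) for a, b in zip(original, restored))
        print(f"group {index}: scale={scale:.5f} zero={zero} max_error={worst:.5f}")
```

Because every group has its own scale, an outlier only coarsens the step size of the 128 weights that share its group rather than of the whole row.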

In [2]:
model_path = "" # path to the Vicuna-7B model
load_quant_path = "quant_cache/vicuna-7b-w4-g128-awq.pt"

We first load an empty model and replace all of its linear layers with WQLinear layers. Then we load the quantized weights from the checkpoint.

In [3]:
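config = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

# Instantiate the FP16 model inside accelerate's init_empty_weights context.
with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        model_path, config=config, torch_dtype=torch.float16)

# Must match the settings used to create the checkpoint
# (--w_bit 4 --q_group_size 128).
q_config = {"zero_point": True, "q_group_size": 128}
# init_only=True replaces the linear layers with WQLinear modules without
# quantizing anything; the quantized weights come from the checkpoint below.
real_quantize_model_weight(
    model, w_bit=4, q_config=q_config, init_only=True)

# Load the quantized weights and dispatch the model to the visible GPU.
model = load_checkpoint_and_dispatch(
    model, load_quant_path,
    device_map="auto",
    no_split_module_classes=["LlamaDecoderLayer"]
)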

Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.50s/it]


* skipping lm_head


real weight quantization...: 100%|██████████| 224/224 [00:26<00:00,  8.40it/s]
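
To get a feel for how much smaller a quantized linear layer is, the sketch below works through the storage arithmetic for `w_bit=4` and `q_group_size=128`. It assumes each group stores one 16-bit scale and one 4-bit zero point alongside its 4-bit codes; those metadata widths, the 4096 x 4096 layer shape, and the helper names are illustrative assumptions, not values read from the checkpoint.

```python
def bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight: the integer code plus its share of the
    per-group scale and zero point (metadata widths are assumptions)."""
    return w_bit + (scale_bits + zero_bits) / group_size


def linear_weight_bytes(in_features, out_features, bits):
    """Bytes needed for a weight matrix stored at the given bits per weight."""
    return in_features * out_features * bits / 8


if __name__ == "__main__":
    quant_bits = bits_per_weight()
    fp16_bytes = linear_weight_bytes(4096, 4096, 16)          # illustrative layer shape
    int4_bytes = linear_weight_bytes(4096, 4096, quant_bits)
    print(f"bits per weight: {quant_bits}")
    print(f"FP16: {fp16_bytes / 2**20:.2f} MiB, INT4-g128: {int4_bytes / 2**20:.2f} MiB")
    print(f"compression: {fp16_bytes / int4_bytes:.2f}x")
```

Under these assumptions the per-group metadata adds about 0.16 bits per weight, so the layer shrinks by a little less than the nominal 4x.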


In [4]:
conv = get_conv_template("vicuna_v1.1")
chatio = SimpleChatIO()

inp = "How can I improve my time management skills?"
print("User:", inp)

while True:
    if not inp:
        try:
            inp = chatio.prompt_for_input(conv.roles[0])
        except EOFError:
            inp = ""
    if not inp:
        print("exit...")
        break

    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)

    generate_stream_func = generate_stream
    prompt = conv.get_prompt()

    gen_params = {
        "model": model_path,
        "prompt": prompt,
        "temperature": 0.3,
        "repetition_penalty": 1.0,
        "max_new_tokens": 512,
        "stop": conv.stop_str,
        "stop_token_ids": conv.stop_token_ids,
        "echo": False,
    }

    chatio.prompt_for_output(conv.roles[1])
    output_stream = generate_stream_func(model, tokenizer, gen_params, "cuda")
    outputs = chatio.stream_output(output_stream)
    conv.update_last_message(outputs.strip())
    
    inp = None

User: How can I improve my time management skills?
ASSISTANT: Time management skills can be improved through a combination of techniques, such as setting clear goals, prioritizing tasks, and using time-saving tools and strategies. Here are some tips to help you improve your time management skills:

1. Set clear goals: Establish clear and specific goals for what you want to achieve. This will help you prioritize your tasks and focus your efforts.
2. Prioritize tasks: Identify the most important tasks that need to be completed and prioritize them accordingly. Use the Eisenhower matrix to categorize tasks into urgent and important, important but not urgent, urgent but not important, and not urgent or important.
3. Use time-saving tools and strategies: Use tools like calendars, to-do lists, and time trackers to help you manage your time more effectively. Also, consider using time-saving strategies like batching, delegating, and automating tasks.
4. Practice time management techniques: Prac