# AWQ on LLaVA

In this notebook, we use the LLaVA model to demonstrate AWQ on multi-modal models. We implement real INT4 inference kernels for AWQ, wrapped as PyTorch modules so that existing models can use them easily. We also provide a simple example showing how to quantize a model with AWQ and how to save and load the quantized checkpoint.

To run this notebook, install the following packages:
- [AWQ](https://github.com/mit-han-lab/llm-awq)
- [PyTorch](https://pytorch.org/)
- [Accelerate](https://github.com/huggingface/accelerate)
- [LLaVA](https://github.com/haotian-liu/LLaVA)
- [Transformers](https://github.com/huggingface/transformers)

In [1]:
import torch
import requests
from PIL import Image
from io import BytesIO
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoTokenizer, CLIPVisionModel, CLIPImageProcessor, logging
logging.set_verbosity_error()  # too many warnings
from llava.conversation import conv_templates, SeparatorStyle
from llava.utils import disable_torch_init
from llava.model import *
from llava.model.utils import KeywordsStoppingCriteria
from awq.quantize.pre_quant import apply_awq
from awq.quantize.quantizer import real_quantize_model_weight
import os
import gc

# This demo only supports a single GPU for now
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"



Download the LLaVA model from [LLaVA](https://github.com/haotian-liu/LLaVA), then run the following cell to generate a quantized model checkpoint. Only the language decoder is quantized, since it holds most of the model's parameters. The cell also expects precomputed AWQ search results at `../awq_cache/llava-13b-v0-w4-g128.pt`.

In [2]:
model_path = "/dataset/llava/LLaVA-13B-v0"  # change this to your local LLaVA checkpoint path
quant_path = "../quant_cache/LLaVA-13B-v0-w4-g128-awq.pt"  # where the quantized weights are saved

model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, torch_dtype=torch.float16, use_cache=True).cuda()

awq_results = torch.load("../awq_cache/llava-13b-v0-w4-g128.pt", map_location="cpu")
apply_awq(model, awq_results)

real_quantize_model_weight(model, w_bit=4, q_config={"zero_point": True, "q_group_size": 128})
torch.save(model.cpu().state_dict(), quant_path)

del model
gc.collect()
torch.cuda.empty_cache()

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:09<00:00,  3.14s/it]
real weight quantization...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [09:07<00:00, 13.69s/it]
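
The `w_bit=4`, `zero_point=True`, and `q_group_size=128` settings above describe group-wise asymmetric quantization: each group of 128 weights gets its own scale and zero point. The following is a minimal pure-Python sketch of that general scheme, not the AWQ implementation itself; the function names `quantize_group`, `dequantize_group`, and `quantize_row` are hypothetical and chosen for illustration only.

```python
import math


def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of float weights."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    # One scale per group; the floor avoids division by zero for constant groups.
    scale = max(w_max - w_min, 1e-5) / q_max
    # The zero point is the integer that represents the float value 0.0.
    zero = min(max(-round(w_min / scale), 0), q_max)
    q = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map the integers back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, w_bit=4, group_size=128):
    """Quantize a weight row group by group; returns a list of (q, scale, zero)."""
    return [
        quantize_group(row[start:start + group_size], w_bit)
        for start in range(0, len(row), group_size)
    ]


# Illustrative data only: 256 synthetic weights, i.e. two groups of 128.
row = [math.sin(i * 0.37) * (1 + i // 128) for i in range(256)]
groups = quantize_row(row, w_bit=4, group_size=128)
q, scale, zero = groups[0]
print(len(groups), scale, zero)
```

Because the second group in this synthetic row has a wider value range, it ends up with a larger scale than the first, which is the motivation for keeping per-group statistics rather than a single scale per tensor.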


Next, provide an image link and a question below.

![](https://llava.hliu.cc/file=/nobackup/haotian/code/LLaVA/llava/serve/examples/extreme_ironing.jpg)

## Q: What is unusual about this image?

In [3]:
query = "What is unusual about this image?"
image_file = "https://llava.hliu.cc/file=/nobackup/haotian/code/LLaVA/llava/serve/examples/extreme_ironing.jpg" 

We first load an empty model and replace all the linear layers with WQLinear layers. Then we load the quantized weights from the checkpoint.

In [4]:

disable_torch_init()
tokenizer = AutoTokenizer.from_pretrained(model_path)
config = LlavaConfig.from_pretrained(model_path)
with init_empty_weights():
    model = LlavaLlamaForCausalLM.from_pretrained(model_path, config=config,
                                                    torch_dtype=torch.float16, device_map="auto")
q_config = {"zero_point": True, "q_group_size": 128}
real_quantize_model_weight(
    model, w_bit=4, q_config=q_config, init_only=True)

model = load_checkpoint_and_dispatch(
    model, quant_path, device_map="auto"
)

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:15<00:00,  5.17s/it]
real weight quantization...(init only): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:37<00:00,  1.08it/s]
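
Real INT4 storage means several 4-bit values share one machine word. As a rough illustration, the sketch below packs eight 4-bit integers into one 32-bit word and unpacks them again. The sequential, lowest-nibble-first order is an assumption made for clarity; the layout actually used by the WQLinear kernels may differ, and `pack_int4` / `unpack_int4` are hypothetical helper names.

```python
def pack_int4(values):
    """Pack 4-bit integers (0..15) into 32-bit words, eight per word."""
    if len(values) % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    words = []
    for start in range(0, len(values), 8):
        word = 0
        for i, v in enumerate(values[start:start + 8]):
            if not 0 <= v <= 15:
                raise ValueError(f"value {v} does not fit in 4 bits")
            word |= v << (4 * i)   # value i occupies bits [4*i, 4*i + 3]
        words.append(word)
    return words


def unpack_int4(words):
    """Inverse of pack_int4: recover eight 4-bit integers per word."""
    return [(word >> (4 * i)) & 0xF for word in words for i in range(8)]


values = list(range(16))           # sixteen 4-bit values -> two 32-bit words
packed = pack_int4(values)
assert unpack_int4(packed) == values
```

Under this assumed layout, a linear layer's weight matrix takes one eighth as many 32-bit words as it has weights, plus the per-group scales and zero points.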


In [5]:
def load_image(image_file):
    if image_file.startswith(('http://', 'https://')):
        response = requests.get(image_file)
        image = Image.open(BytesIO(response.content)).convert('RGB')
    else:
        image = Image.open(image_file).convert('RGB')
    return image

image_processor = CLIPImageProcessor.from_pretrained(model.config.mm_vision_tower, torch_dtype=torch.float16)

mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
    tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)

vision_tower = model.get_model().vision_tower[0]
if vision_tower.device.type == 'meta':
    vision_tower = CLIPVisionModel.from_pretrained(vision_tower.config._name_or_path, torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()
    model.get_model().vision_tower[0] = vision_tower
else:
    vision_tower.to(device='cuda', dtype=torch.float16)
vision_config = vision_tower.config
vision_config.im_patch_token = tokenizer.convert_tokens_to_ids([DEFAULT_IMAGE_PATCH_TOKEN])[0]
vision_config.use_im_start_end = mm_use_im_start_end
if mm_use_im_start_end:
    vision_config.im_start_token, vision_config.im_end_token = tokenizer.convert_tokens_to_ids([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN])
image_token_len = (vision_config.image_size // vision_config.patch_size) ** 2

qs = query
if mm_use_im_start_end:
    qs = qs + '\n' + DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len + DEFAULT_IM_END_TOKEN
else:
    qs = qs + '\n' + DEFAULT_IMAGE_PATCH_TOKEN * image_token_len

conv_mode = "multimodal"

conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], qs)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
inputs = tokenizer([prompt])

image = load_image(image_file)
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'][0]

input_ids = torch.as_tensor(inputs.input_ids).cuda()

stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
keywords = [stop_str]
stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.unsqueeze(0).half().cuda(),
        do_sample=True,
        temperature=0.2,
        max_new_tokens=1024,
        stopping_criteria=[stopping_criteria])

input_token_len = input_ids.shape[1]
n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
if n_diff_input_output > 0:
    print(f'[Warning] {n_diff_input_output} output_ids are not the same as the input_ids')
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
if outputs.endswith(stop_str):
    outputs = outputs[:-len(stop_str)]
outputs = outputs.strip()
print(outputs)



The unusual aspect of this image is that a man is standing on a portable ironing board in the middle of the road, ironing clothes while traffic, including a yellow taxi, moves around him. This is not a typical scene you would expect to see in a city, as ironing is usually done in a private setting like a home, and not on the street amidst traffic. It brings attention to the unconventional and unexpected nature of the situation.
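
The prompt built in the cell above reserves one placeholder token per image patch, computed as `(image_size // patch_size) ** 2`. The sketch below works through that arithmetic and the prompt assembly in isolation. The 224-pixel image size and 14-pixel patch size are assumptions (typical of a CLIP ViT-L/14 vision tower); in the notebook the real values come from `vision_config`. The helper names `num_image_patch_tokens` and `build_query` are hypothetical.

```python
IM_PATCH = "<im_patch>"
IM_START = "<im_start>"
IM_END = "<im_end>"


def num_image_patch_tokens(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def build_query(question, image_token_len, use_im_start_end=False):
    """Append one placeholder token per image patch to the question."""
    patches = IM_PATCH * image_token_len
    if use_im_start_end:
        patches = IM_START + patches + IM_END
    return question + "\n" + patches


# Assumed values for illustration: 224x224 input, 14x14 patches.
image_token_len = num_image_patch_tokens(224, 14)   # (224 // 14) ** 2 = 16 ** 2
qs = build_query("What is unusual about this image?", image_token_len,
                 use_im_start_end=True)
print(image_token_len, qs.count(IM_PATCH))
```

With these assumed sizes the image occupies 256 token positions in the prompt, which is why the input sequence is much longer than the text question alone.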
