# GPTQ quantization of Llama 2, based on [this article](https://www.philschmid.de/gptq-llama)

In [1]:
!pip install "transformers[sentencepiece]==4.32.1" "optimum==1.12.0" "auto-gptq==0.4.2" "accelerate==0.22.0" "safetensors>=0.3.1" --upgrade -q

In [2]:
# Dataset id from Hugging Face
dataset_id = "wikitext2"

In [3]:
from optimum.gptq import GPTQQuantizer

# GPTQ quantizer
quantizer = GPTQQuantizer(bits=4, dataset=dataset_id, model_seqlen=4096)
quantizer.quant_method = "gptq"


2023-09-03 09:41:03.486304: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
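
With `bits=4`, each quantized weight is stored as one of 16 integer levels. As a rough illustration only, the sketch below applies plain round-to-nearest quantization to one small group of weights. This is not the GPTQ algorithm itself, which additionally adjusts the not-yet-quantized weights to compensate for rounding error using calibration data (here `wikitext2`). The function names `quantize_rtn` and `dequantize` are hypothetical and not part of `optimum`.

```python
def quantize_rtn(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights.

    Returns integer codes in [0, 2**bits - 1] plus the scale and zero offset
    needed to map them back to floats.
    """
    levels = 2 ** bits - 1
    w_min, w_max = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights are equal (range of zero).
    scale = (w_max - w_min) / levels or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


group = [-0.8, -0.1, 0.0, 0.3, 0.7]
codes, scale, zero = quantize_rtn(group, bits=4)
restored = dequantize(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"max error: {max_error:.4f}")
```

The rounding error of each weight is bounded by half the scale, so smaller groups with a narrower value range lose less precision.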


In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face model id
model_id = "philschmid/llama-2-7b-instruction-generator"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False) # bug with fast tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True, torch_dtype=torch.float16) # we load the model in fp16 on purpose


Downloading (…)okenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/21.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00003.bin:   0%|          | 0.00/9.88G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00003.bin:   0%|          | 0.00/9.89G [00:00<?, ?B/s]

Downloading (…)l-00003-of-00003.bin:   0%|          | 0.00/7.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/174 [00:00<?, ?B/s]



In [None]:
save_folder = "quantized_llama"

In [5]:
import os
import json

# quantize the model
quantized_model = quantizer.quantize_model(model, tokenizer)

# save the quantized model to disk
quantized_model.save_pretrained(save_folder, safe_serialization=True)

# load a fresh, fast tokenizer and save it to disk
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.save_pretrained(save_folder)

# save quantize_config.json for TGI
# reset the flag first so the saved config does not disable the Exllama backend
quantizer.disable_exllama = False
with open(os.path.join(save_folder, "quantize_config.json"), "w", encoding="utf-8") as f:
    json.dump(quantizer.to_dict(), f, indent=2)

Downloading builder script:   0%|          | 0.00/8.48k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/9.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.72M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

Found modules on cpu/disk. Using Exllama backend requires all the modules to be on GPU. Setting `disable_exllama=True`

Thrown during validation:
`do_sample` is set to `False`. However, `temperature` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.


Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]
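
Before pointing a server at the output folder, it can help to confirm that `quantize_config.json` was written and contains the settings chosen earlier. The sketch below is one way to do that. The helper name `read_quantize_config` is hypothetical, and it only checks the two keys this notebook sets explicitly (`bits` and `quant_method`); the full set of keys emitted by `quantizer.to_dict()` is not assumed here.

```python
import json
import os


def read_quantize_config(folder):
    """Load quantize_config.json from `folder` and check the expected keys."""
    path = os.path.join(folder, "quantize_config.json")
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    # Only the keys set explicitly in this notebook are required.
    missing = [key for key in ("bits", "quant_method") if key not in config]
    if missing:
        raise KeyError(f"{path} is missing keys: {missing}")
    return config
```

In this notebook it would be called as `read_quantize_config(save_folder)`, and the result should report `bits` as 4 and `quant_method` as `"gptq"`.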

In [6]:
import time

# The prompt is based on the fine-tuning from the model: https://www.philschmid.de/instruction-tune-llama-2#4-test-model-and-run-inference
prompt = """### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Dear [boss name],

I'm writing to request next week, August 1st through August 4th,
off as paid time off.

I have some personal matters to attend to that week that require
me to be out of the office. I wanted to give you as much advance
notice as possible so you can plan accordingly while I am away.

Thank you, [Your name]

### Response:
"""




In [7]:
# helper function to generate text and measure latency
def generate_helper(pipeline, prompt=prompt):
    # warm up
    for _ in range(5):
        _ = pipeline("Warm up")

    # measure latency in a simple way
    start = time.time()
    out = pipeline(prompt, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.9)
    end = time.time()

    # strip the prompt from the output to keep only the newly generated text
    generated_text = out[0]["generated_text"][len(prompt):]

    # approximate the token count by re-tokenizing the generated text
    num_tokens = len(pipeline.tokenizer(generated_text)["input_ids"])
    latency_per_token_in_ms = ((end - start) / num_tokens) * 1000

    # return the generated text and the latency
    return {"text": generated_text, "latency": f"{round(latency_per_token_in_ms, 2)}ms/token"}
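
A single timed call, as in the helper above, can be noisy, especially with sampling enabled. As a possible variation (not the article's method), the sketch below repeats the measurement and reports the median per-token latency. The name `median_latency_ms_per_token` and its `generate` and `count_tokens` callables are hypothetical stand-ins for the pipeline call and the tokenizer.

```python
import statistics
import time


def median_latency_ms_per_token(generate, count_tokens, prompt, warmup=1, repeats=3):
    """Return the median latency in ms per generated token over several runs.

    generate:     callable taking a prompt and returning the generated text
    count_tokens: callable taking text and returning its token count
    """
    for _ in range(warmup):
        generate(prompt)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        text = generate(prompt)
        elapsed = time.perf_counter() - start
        # Guard against empty generations to avoid dividing by zero.
        n_tokens = max(count_tokens(text), 1)
        samples.append(elapsed / n_tokens * 1000)
    return statistics.median(samples)
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic, high-resolution clock intended for measuring intervals.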


In [8]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Hugging Face model id
model_id = "philschmid/llama-2-7b-instruction-generator"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16) # we load the model in fp16 on purpose

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



In [9]:
import torch

vanilla_res = generate_helper(pipe)

print(f"Latency: {vanilla_res['latency']}")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Generated Instruction: {vanilla_res['text']}")

# Results reported in the article:
# Latency: 37.49ms/token
# GPU memory: 12.62 GB
# Generated Instruction: Write a request for PTO letter to my boss

# My local run:
# Latency: 1426.16ms/token
# GPU memory: 6.57 GB
# Generated Instruction: I need to take PTO




Latency: 1426.16ms/token
GPU memory: 6.57 GB
Generated Instruction: I need to take PTO



In [10]:
torch.cuda.get_device_name()

'NVIDIA TITAN RTX'

In [11]:
# clean up
del pipe
del model
del tokenizer
torch.cuda.empty_cache()
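
Deleting the names only drops references; objects caught in reference cycles are not released until the garbage collector runs. A slightly more defensive cleanup, sketched below under the assumption that nothing else still references the model, runs the collector before emptying the CUDA cache. The helper name `free_gpu_memory` is hypothetical.

```python
import gc


def free_gpu_memory():
    """Run the garbage collector, then release cached CUDA memory if possible.

    Returns the number of unreachable objects found by gc.collect().
    """
    collected = gc.collect()
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no CUDA cache to clear.
        return collected
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected
```

It would be called right after the `del` statements, in place of the bare `torch.cuda.empty_cache()` call.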

In [12]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# path to gptq weights
model_id = "quantized_llama"

q_tokenizer = AutoTokenizer.from_pretrained(model_id)
q_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.float16)

gptq_pipe = pipeline("text-generation", model=q_model, tokenizer=q_tokenizer)

In [13]:
gptq_res = generate_helper(gptq_pipe)

print(f"Latency: {gptq_res['latency']}")
print(f"GPU memory: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"Generated Instruction: {gptq_res['text']}")

# Results reported in the article:
# Latency: 36.0ms/token
# GPU memory: 3.83 GB
# Generated Instruction: Write a letter requesting time off

# My local run:
# Latency: 106.29ms/token
# GPU memory: 3.99 GB
# Generated Instruction: How would you request a week off?



Latency: 106.29ms/token
GPU memory: 3.99 GB
Generated Instruction: How would you request a week off?
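
The memory figures can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bits per parameter. The sketch below assumes about 6.74 billion parameters for Llama 2 7B; that count and the helper name `weight_memory_gib` are assumptions of this example, not values taken from the notebook.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB (1024**3 bytes)."""
    return num_params * bits_per_param / 8 / 1024**3


N_PARAMS = 6.74e9  # assumed approximate parameter count of Llama 2 7B

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)
print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
```

The fp16 estimate of roughly 12.55 GiB is close to the 12.62 GB noted in the comments of cell 9. The 4-bit estimate of roughly 3.14 GiB is below the measured 3.83 GB and 3.99 GB, plausibly because some tensors (for example the embeddings and output head) remain in fp16 and each quantized group also stores scales and zero points. That explanation is an assumption; the notebook does not measure it.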



In [14]:
model="/home/ubuntu/test-gptq"
num_shard=1
quantize="gptq"
max_input_length=1562
max_total_tokens=4096

!docker run --gpus all -ti -p 8080:80 \
  -e MODEL_ID=$model \
  -e QUANTIZE=$quantize \
  -e NUM_SHARD=$num_shard \
  -e MAX_INPUT_LENGTH=$max_input_length \
  -e MAX_TOTAL_TOKENS=$max_total_tokens \
  -v $model:$model \
  ghcr.io/huggingface/text-generation-inference:1.0.3


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

The command 'docker' could not be found in this WSL 2 distro.
We recommend to activate the WSL integration in Docker Desktop settings.

For details about using Docker Desktop with WSL 2, visit:

https://docs.docker.com/go/wsl2/



In [15]:
!curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs":"### Instruction:\nUse the Input below to create an instruction, which could have been used to generate the input using an LLM.\n\n### Input:\nDear [boss name],\n\nI am writing to request next week, August 1st through August 4th,\noff as paid time off.\n\nI have some personal matters to attend to that week that require\nme to be out of the office. I wanted to give you as much advance\nnotice as possible so you can plan accordingly while I am away.\n\nThank you, [Your name]\n\n### Response:","parameters":{"temperature":0.2, "top_p": 0.95, "max_new_tokens": 256}}' \
    -H 'Content-Type: application/json'


curl: /home/nissan/mambaforge/lib/libcurl.so.4: no version information available (required by curl)
curl: (7) Failed to connect to 127.0.0.1 port 8080 after 0 ms: Couldn't connect to server
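
The same request can be sent from Python without shelling out to `curl`. The sketch below builds the prompt from the template used in cell 6 and posts it to the `/generate` endpoint with the standard library. It assumes a TGI server is listening on `127.0.0.1:8080` and that the JSON response contains a `generated_text` field. The names `build_prompt`, `build_generate_request`, and `generate` are hypothetical helpers, not part of any library.

```python
import json
import urllib.request

# Prompt template copied from the notebook's prompt cell.
PROMPT_TEMPLATE = (
    "### Instruction:\n"
    "Use the Input below to create an instruction, which could have been "
    "used to generate the input using an LLM.\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)


def build_prompt(input_text):
    """Wrap arbitrary input text in the instruction-generator prompt format."""
    return PROMPT_TEMPLATE.format(input=input_text.strip())


def build_generate_request(input_text, url="http://127.0.0.1:8080/generate"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {
        "inputs": build_prompt(input_text),
        "parameters": {"temperature": 0.2, "top_p": 0.95, "max_new_tokens": 256},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(input_text, timeout=60):
    """Send the request; assumes the server replies with a generated_text field."""
    request = build_generate_request(input_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["generated_text"]
```

Keeping request construction separate from sending makes the payload easy to inspect when the server is unreachable, as in the failed connection shown above.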
