## VPTQ inference example

<a target="_blank" href="https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Install VPTQ package and requirements
Recent versions of `transformers` and `accelerate` are required.

In [None]:
%%capture
!pip install https://github.com/microsoft/VPTQ/releases/download/v0.0.2/vptq-0.0.2-cp310-cp310-manylinux1_x86_64.whl
!pip install -U transformers accelerate
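The wheel installed above is tagged `cp310` and `manylinux1_x86_64`, so it targets CPython 3.10 on x86_64 Linux. The sketch below is a rough pre-install check, not a substitute for pip's own resolution: it splits the filename according to the standard wheel naming convention and compares the tags with the running interpreter. The helper names `wheel_tags` and `wheel_matches` are hypothetical and not part of VPTQ.

```python
import platform
import sys

# Filename of the wheel installed in the cell above.
WHEEL = "vptq-0.0.2-cp310-cp310-manylinux1_x86_64.whl"


def wheel_tags(filename):
    """Split a wheel filename into (python_tag, abi_tag, platform_tag)."""
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def wheel_matches(filename, version_info, machine):
    """Loosely check the CPython version and CPU architecture only."""
    python_tag, _abi_tag, platform_tag = wheel_tags(filename)
    expected = f"cp{version_info[0]}{version_info[1]}"
    return python_tag == expected and platform_tag.endswith(machine)


if __name__ == "__main__":
    ok = wheel_matches(WHEEL, sys.version_info, platform.machine())
    print("wheel matches this runtime:", ok)
```

If the check reports a mismatch, pip would most likely reject the wheel as unsupported on that platform; a different release asset or a source install would then be needed.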


## Load model and tokenizer as usual
Note that the T4 GPU does not support bf16.

Set `dtype = torch.half` for this model.

Set `device_map='auto'` to place the model on the GPU when one is available.
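To avoid hard-coding the dtype when the notebook runs on different GPUs, the choice can be derived from the device's compute capability. The following is a minimal sketch under one assumption: native bf16 support starts at compute capability 8.0 (Ampere), while the T4 reports 7.5. The helper names `pick_dtype_name` and `pick_torch_dtype` are hypothetical and not part of VPTQ or PyTorch.

```python
def pick_dtype_name(compute_capability):
    """Return "bfloat16" on GPUs with native bf16 support, else "float16"."""
    major, _minor = compute_capability
    # Assumption: native bf16 starts with compute capability 8.0 (Ampere).
    return "bfloat16" if major >= 8 else "float16"


def pick_torch_dtype():
    """Resolve the name to a torch dtype for the current GPU.

    Requires torch and a visible CUDA device.
    """
    import torch  # imported lazily so the helper above stays dependency-free

    name = pick_dtype_name(torch.cuda.get_device_capability())
    return getattr(torch, name)
```

On a T4 this resolves to `torch.float16`, which is the same dtype as `torch.half`. PyTorch also provides `torch.cuda.is_bf16_supported()`, which may be a simpler check when torch is already imported.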

In [8]:
import vptq
import transformers
import torch

tokenizer = transformers.AutoTokenizer.from_pretrained("VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft")
m = vptq.AutoModelForCausalLM.from_pretrained("VPTQ-community/Qwen2.5-7B-Instruct-v8-k65536-0-woft", device_map='auto', dtype=torch.half)



Replacing linear layers...: 100%|██████████| 399/399 [00:00<00:00, 1325.70it/s]




## Inference example with text generation

In [9]:
inputs = tokenizer("Explain: Do Not Go Gentle into That Good Night", return_tensors="pt").to("cuda")
# Pad with the tokenizer's own EOS id rather than a hard-coded value.
out = m.generate(**inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Explain: Do Not Go Gentle into That Good Night

The poem “Do Not Go Gentle into That Good Night” by Dylan Thomas is a poem about death. The poem is written in the form of a sonnet, and it is written in the form of a monologue. The poem is written in the form of a monologue, and it is written in the form of a sonnet. The poem is written in the form of a sonnet, and it is written in the form of a monologue. The poem is written in the


## Generate tokens in streaming mode

In [21]:
inputs = tokenizer("share me a story,", return_tensors="pt").to("cuda")

streamer = transformers.TextStreamer(tokenizer)
_ = m.generate(**inputs, streamer=streamer, max_new_tokens=60, pad_token_id=tokenizer.eos_token_id)

share me a story, please
Certainly! Here's a short story for you:

### The Forgotten Garden

In the heart of a bustling city, there was a small, forgotten garden hidden behind a row of tall buildings. It was a place where the city's residents often forgot to visit, but it was a sanctuary for
