In this notebook, I describe how to quantise Hugging Face models on Intel GPUs (XPU). For demonstration, we embed a sentence using the ```BAAI/bge-m3``` model, one of the larger modern embedding models.

#### Installation

Please install the libraries below:

https://github.com/intel/intel-extension-for-transformers

```pip install intel-extension-for-pytorch```
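
Before running the cells that follow, it can help to confirm that the packages are importable in the current environment. The sketch below uses only the standard library. The helper name `check_packages` is hypothetical, and the module names are the ones this notebook imports.

```python
import importlib.util

# Import names used later in this notebook
REQUIRED = [
    "torch",
    "transformers",
    "intel_extension_for_pytorch",
    "intel_extension_for_transformers",
]

def check_packages(names):
    """Return {module_name: True/False} without importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any module reported as missing needs to be installed before the import cell will run.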

We import both the standard `transformers` library and the Intel-specific `intel_extension_for_transformers` library.

In [None]:
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
import torch
import intel_extension_for_pytorch as ipex

Set the model name:

In [None]:
# Model name or path
model_name = "BAAI/bge-m3"

Now we load the tokenizer, tokenize the input sentence, and move the input tensors to the Intel XPU (GPU).

In [None]:
device_map = "xpu"  # Intel GPU device string, reused below
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_sentence = "what's the capital of England?"
# Tokenize to PyTorch tensors, then move each tensor to the XPU
inputs = tokenizer(input_sentence, return_tensors="pt")
inputs = {key: tensor.to(device_map) for key, tensor in inputs.items()}
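
The cell above assumes an XPU is present. As a hedged sketch, a small helper can probe for the device first and fall back to the CPU. The name `pick_device` is hypothetical. It relies on `torch.xpu.is_available()`, which on older setups may only be registered after `intel_extension_for_pytorch` has been imported.

```python
import importlib.util

def pick_device(preferred="xpu"):
    """Return `preferred` if torch reports an available XPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch not installed, nothing to probe
    import torch
    xpu = getattr(torch, "xpu", None)  # absent on builds without XPU support
    if preferred == "xpu" and xpu is not None and xpu.is_available():
        return "xpu"
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```

The returned string could be assigned to `device_map` in place of the hard-coded value. Whether the 4-bit loading path used later behaves the same on CPU is not covered here.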

Next, we load the model onto the Intel XPU with 4-bit weights and apply the IPEX transformer optimisations.

In [None]:
# Load the model with 4-bit weights directly onto the XPU
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map=device_map, trust_remote_code=True, use_llm_runtime=False)
# Apply IPEX transformer optimisations in float16
model = ipex.optimize_transformers(model, inplace=True, dtype=torch.float16, quantization_config=True, device=device_map)
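
To give a feel for what storing weights in 4 bits means, here is a toy round trip in plain Python. It uses symmetric round-to-nearest quantisation with a single scale, which is an assumption chosen for illustration and not necessarily the scheme the Intel libraries apply. The function names and weight values are hypothetical.

```python
def quantize_int4(weights):
    """Map floats to signed 4-bit codes in [-8, 7] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs else 1.0  # largest magnitude maps to code 7
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the 4-bit codes."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.12, 0.0]  # toy values, not real model weights
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print("codes:   ", codes)
print("restored:", restored)
print("max abs error:", max(errors))
```

Each weight is replaced by one of 16 integer codes plus a shared scale, so the restored values differ from the originals by at most half a quantisation step.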

Finally, we generate the embedding by running a forward pass and averaging the logits over the token dimension.

In [None]:
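# Forward pass without gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Average over the token dimension (dim=1) to get one vector per sentence
embeddings = logits.mean(dim=1)
print(embeddings)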

**Example output**

```
tensor([[ 4.3945e+00, -2.6588e-03,  9.7559e-01,  ...,  5.6680e+00,
          1.0303e+00,  2.5488e+00]], device='xpu:0', dtype=torch.float16)
```
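
`logits.mean(dim=1)` averages over every token position. With a single sentence there is no padding, but in a padded batch the padding positions would be averaged in as well. The sketch below shows the arithmetic of a masked mean in plain Python with toy numbers. `mean_pool` is a hypothetical helper, and the mask plays the role of the `attention_mask` that the tokenizer returns.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors over the sequence axis, skipping masked positions."""
    hidden = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden)]

# Toy sequence: 3 tokens with 2 features each; the last token is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled_all = mean_pool(tokens, [1, 1, 1])  # unmasked mean, padding included
pooled_masked = mean_pool(tokens, mask)    # padding ignored

print("unmasked:", pooled_all)
print("masked:  ", pooled_masked)
```

The unmasked result is pulled towards the padding values, while the masked result reflects only the real tokens. This matters once more than one sentence is embedded per batch.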