# Running an LLM on the Edge - Inference with Intel® CPU / GPU

### Import libraries

In [None]:
# PyTorch Framework
import torch

# Intel Extension for PyTorch
import intel_extension_for_pytorch as ipex

# BigDL LLM - Transformers Wrapper
from bigdl.llm.transformers import AutoModelForCausalLM

# Transformers API
from transformers import AutoTokenizer

### Load model from Hugging Face

In many cases, the architecture of the model you want to use can be inferred from the name or path of the pretrained model passed to *from_pretrained()*.
Instantiating one of *AutoConfig*, *AutoModel*, or *AutoTokenizer* directly creates a class of the relevant architecture. For example:

```python
model = AutoModel.from_pretrained("bert-base-cased")
```

Below is an example using a task-specific AutoModel class, AutoModelForCausalLM.
**AutoModelForCausalLM** is a generic model class that is instantiated as one of the library's model classes with a causal language modeling head ([link](https://huggingface.co/docs/transformers/model_doc/auto#natural-language-processing)). In this notebook it is imported from the BigDL LLM Transformers wrapper, which accepts the extra `load_in_low_bit` and `optimize_model` arguments used below.

In [None]:
model_path = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_low_bit="sym_int4", optimize_model=True, trust_remote_code=True, use_cache=True)

### Initiate tokenizer from the model

A tokenizer splits input and output text into smaller units called tokens. Depending on the type and size of the model, tokens can be words, characters, subwords, or symbols.
Each token is mapped to an integer ID, and the model uses the sequence of IDs to predict which tokens should come next.

<div style="margin-top: 20px">
<img src="./imgs/gpt-token-encoder-decoder.jpg" width="600"/>
</div>

[Reference: Understanding GPT tokenizers](https://simonwillison.net/2023/Jun/8/gpt-tokenizers/)

**Note**: Every model uses its own tokenizer, matched to its architecture. It is best to load the tokenizer from the model itself.
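
To make the text-to-integers step concrete, here is a minimal sketch of a toy word-level tokenizer. `ToyTokenizer` and its tiny vocabulary are hypothetical and exist only for illustration; real LLM tokenizers such as the one loaded below use learned subword vocabularies and special tokens.

```python
# Toy word-level tokenizer (hypothetical, for illustration only).
class ToyTokenizer:
    def __init__(self, corpus):
        # Reserve id 0 for unknown words; assign ids in order of first appearance.
        self.token_to_id = {"<unk>": 0}
        for word in corpus.split():
            self.token_to_id.setdefault(word, len(self.token_to_id))
        self.id_to_token = {i: t for t, i in self.token_to_id.items()}

    def encode(self, text):
        # Map each whitespace-separated word to its integer id (0 if unseen).
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        # Map integer ids back to tokens and join them with spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer("the cat sat on the mat")
ids = toy.encode("the cat sat on the dog")
print(ids)              # [1, 2, 3, 4, 1, 0]
print(toy.decode(ids))  # the cat sat on the <unk>
```

The word "dog" is not in the toy vocabulary, so it maps to the unknown id. This is one reason a tokenizer and a model must come from the same source: the model only understands the ids its own tokenizer produces.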



In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

### Move model to specific xPU (CPU / GPU) accelerator
Once the model has been initialized, it can be moved to a specific accelerator.

The available devices can be listed using the IPEX API:
```python
for i in range(torch.xpu.device_count()):
    device_name = torch.xpu.get_device_name(i)
    print(f"{i}: {device_name}")
``` 

List of xPU devices available in the system *(differs on every machine)*:
```
0: Intel(R) Arc(TM) A770 Graphics
1: Intel(R) Arc(TM) A770 Graphics
2: Intel(R) UHD Graphics 770
```


In [None]:
for i in range(torch.xpu.device_count()):
    device_name = torch.xpu.get_device_name(i)
    print(f"{i}: {device_name}")

### Select the target xPU to run the model

In [None]:
# Use "cpu", or an XPU device such as "xpu" / "xpu:0" from the list above
device = "cpu"
model = model.to(device)
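
If the same notebook has to run on machines with and without an Intel GPU, the device string can be chosen at runtime instead of being hard-coded. The helper below is a minimal sketch: `pick_device` is a hypothetical name, and it assumes that `torch.xpu` is registered (by the `intel_extension_for_pytorch` import at the top of this notebook, or by a PyTorch build with XPU support).

```python
def pick_device(preferred="xpu"):
    """Return `preferred` if an XPU is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed: nothing to accelerate, report the CPU.
        return "cpu"
    # torch.xpu may be missing on builds without XPU support, so look it up safely.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return preferred
    return "cpu"


device = pick_device()
print(device)
```

The result can be passed to `model.to(device)` exactly like the hard-coded string. Passing `pick_device("xpu:1")` would request the second device in the list printed earlier.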

### Prompt template
Prompt templates are predefined structures or formats that guide how input is provided to a language model. Templates help improve the clarity, consistency, and overall quality of LLM responses.

The prompt format varies from model to model. Always check the model's page for its prompt template format:

Eg: [Mistral-7B-Instruct-v0.1](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1#instruction-format)

```
text = "<s>[INST] What is your favourite condiment? [/INST]"
"Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> "
"[INST] Do you have mayonnaise recipes? [/INST]"
```

Eg: [ChatGLM3](https://github.com/THUDM/ChatGLM3/blob/main/PROMPT_en.d)

```
<|system|>
You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. Respond using markdown.
<|user|>
Hello
<|assistant|>
Hello, I'm ChatGLM3. What can I assist you today?
```

**Note**: Always refer to the model page for the prompt template format.

**Note**: The prompt template is defined by the model's creator during model training / fine-tuning.
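
For multi-turn conversations, the Mistral format shown above can be assembled from a list of turns. The following is a minimal sketch that reproduces the string layout of that example; `build_mistral_prompt` is a hypothetical helper, not part of any library, and the model page remains the authority on the exact format.

```python
def build_mistral_prompt(turns):
    """Build a Mistral-Instruct style prompt from (user, assistant) pairs.

    The assistant entry of the last pair may be None, which leaves the
    prompt open for the model to generate the next answer.
    """
    prompt = "<s>"
    for user_msg, assistant_msg in turns:
        # Each user message is wrapped in [INST] ... [/INST].
        prompt += f"[INST] {user_msg} [/INST]"
        if assistant_msg is not None:
            # A completed assistant answer is closed with the end-of-sequence token.
            prompt += f"{assistant_msg}</s> "
    return prompt


history = [
    ("What is your favourite condiment?", "Fresh lemon juice."),
    ("Do you have mayonnaise recipes?", None),
]
print(build_mistral_prompt(history))
```

A single-turn prompt is the special case of one pair whose assistant entry is `None`.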

In [None]:
MISTRAL_PROMPT_FORMAT = """<s>[INST]{prompt}[/INST]"""

In [None]:
import ipywidgets as widgets
input_prompt = widgets.Text(
    value='Did you encounter an error message saying "Legacy-Install-Failure" when installing the Openvino-Dev pip package using Python 3.10?',
    placeholder='Type something',
    description='Question:',
    disabled=False,
    layout=widgets.Layout(width="90%")
)
input_prompt

### Run inferencing

In [None]:
with torch.inference_mode():
    # prompt = MISTRAL_PROMPT_FORMAT.format(prompt="Two things are infinite")
    prompt = MISTRAL_PROMPT_FORMAT.format(prompt=input_prompt.value)
    # Tokenize the prompt and place the token IDs on the same device as the model
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
    output = model.generate(input_ids, temperature=0.3, do_sample=True, max_new_tokens=32, use_cache=True)
    # Wait for queued XPU work to finish; skip this when running on the CPU
    if str(device).startswith("xpu"):
        torch.xpu.synchronize()
    output = output.cpu()
    # Decode the generated token IDs (prompt included) back into text
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

## Notices & Disclaimers 

Intel technologies may require enabled hardware, software or service activation. 

No product or component can be absolutely secure.  

Your costs and results may vary.  

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD), Open Source Initiative. No rights are granted to create modifications or derivatives of this document. 

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  