# Notebook 4.1: Run Transformer Models

BigDL-LLM supports the optimization of any Hugging Face *transformers* model for efficient inference with reduced latency. With BigDL-LLM, PyTorch models (in FP16/BF16/FP32) from Hugging Face can be loaded with implicit quantization, so that the heavy operations in a Transformer are sped up through low precision (such as INT4, INT5, or INT8).

In this tutorial, we will dive into the main usage of the BigDL-LLM Transformers-style API for low-precision optimizations.

## 4.1.1 Install BigDL-LLM

Follow the instructions in [Chapter 2](../ch_2_Environment_Setup/) to set up your environment if you haven't done so. Then install `bigdl-llm`:

In [None]:
!pip install bigdl-llm[all]

## 4.1.2 Load Model

To leverage the benefits of BigDL-LLM, the first step is to load the *transformers* model with BigDL-LLM's low-precision optimizations. There are several use cases: loading a model in low precision, and saving and reloading a model that is already in low precision.

For illustration purposes, let's take model [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) as an example.

### 4.1.2.0 Download Llama 2 (7B)

To download the [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) model from Hugging Face, you first need to be granted access by Meta. Please follow the instructions provided [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/tree/main) to request access to the model.

Once access has been granted, download the model with your Hugging Face token:

In [None]:
from huggingface_hub import snapshot_download

# repo_id has the form '<organization>/<model name>', with no leading slash
model_path = snapshot_download(repo_id='meta-llama/Llama-2-7b-chat-hf',
                               token='hf_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') # change it to your own Hugging Face access token

> **Note**
>
> The model will by default be downloaded to `HF_HOME='~/.cache/huggingface'`.
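
To see where that default comes from on your machine, the lookup can be sketched as below. This is a minimal illustration, not code from `huggingface_hub`: the helper name `resolve_hf_home` is hypothetical, and it only considers the `HF_HOME` variable mentioned in the note, ignoring any other overrides the library may honor.

```python
import os


def resolve_hf_home(env=None):
    """Return the Hugging Face home directory implied by the environment.

    Hypothetical helper for illustration: it checks only HF_HOME and falls
    back to the default location named in the note above.
    """
    env = os.environ if env is None else env
    default = "~/.cache/huggingface"
    # expanduser turns a leading "~" into the current user's home directory
    return os.path.expanduser(env.get("HF_HOME", default))


print(resolve_hf_home())
```

Setting `HF_HOME` before starting the notebook is one way to move large downloads such as Llama 2 (7B) to a disk with more free space.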

### 4.1.2.1 Load Model in Low Precision

One common use case is to load a Hugging Face *transformers* model in low precision, i.e., to perform **implicit** quantization while loading.

For Llama 2 (7B), you can simply import `bigdl.llm.transformers.AutoModelForCausalLM` instead of `transformers.AutoModelForCausalLM`, and specify either `load_in_4bit=True` or the `load_in_low_bit` parameter in the `from_pretrained` function. Compared to the Hugging Face *transformers* API, only minor code changes are required.

**For INT4 Optimizations (with `load_in_4bit=True`):**

In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM

model_in_4bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_4bit=True)

> **Note**
>
> BigDL-LLM supports `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`.

**For INT8 Optimizations (with `load_in_low_bit="sym_int8"`):**

In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
model_in_8bit = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf",
                                                     load_in_low_bit="sym_int8")

> **Note**
>
> Currently, `load_in_low_bit` supports the options `'sym_int4'`, `'asym_int4'`, `'sym_int5'`, `'asym_int5'` and `'sym_int8'`, where 'sym' and 'asym' distinguish symmetric from asymmetric quantization.
>
> Note that `load_in_4bit=True` is equivalent to `load_in_low_bit='sym_int4'`.
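
To make the 'sym' versus 'asym' distinction concrete, here is a minimal pure-Python sketch of the two textbook schemes applied to a handful of values. It is an illustration under simplifying assumptions (one scale for the whole list, round-to-nearest) and is not BigDL-LLM's actual implementation, whose grouping and storage layout may differ. All function names are hypothetical.

```python
def quantize_symmetric(values, bits=4):
    """Signed integers with a single scale; float 0.0 always maps to integer 0."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits=4):
    """Unsigned integers with a scale plus an offset equal to the minimum value."""
    qmax = 2 ** bits - 1  # 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [max(0, min(qmax, round((v - lo) / scale))) for v in values]
    return q, scale, lo


def dequantize_asymmetric(q, scale, lo):
    return [x * scale + lo for x in q]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Example weights that are all positive, i.e. not centered around zero
weights = [0.1, 0.4, 0.9, 1.5]

q_sym, s_sym = quantize_symmetric(weights)
q_asym, s_asym, lo = quantize_asymmetric(weights)

err_sym = max_abs_error(weights, dequantize_symmetric(q_sym, s_sym))
err_asym = max_abs_error(weights, dequantize_asymmetric(q_asym, s_asym, lo))
print("symmetric :", q_sym, "max error", err_sym)
print("asymmetric:", q_asym, "max error", err_asym)
```

In this sketch the asymmetric scheme spends an extra stored value (the offset) to cover a skewed range more tightly, while the symmetric scheme is simpler because it needs only the scale.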

The corresponding tokenizer of Llama 2 (7B) can be loaded with the official *transformers* API:

In [None]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="meta-llama/Llama-2-7b-chat-hf")

### 4.1.2.2 Save & Load Low-Precision Model

When performing implicit quantization while loading a model, BigDL-LLM converts the linear layers in the model into a low-precision format. The original checkpoint still has to be read before it can be converted: taking INT4 as an example, in theory a model with *X* B(illion) parameters saved in 16 or 32 bit requires approximately 2*X* or 4*X* GB of memory in order to be loaded in 4 bit. Thus, for extremely large models such as the 40B Falcon, 70B Llama 2, or 176B Bloom, loading them with BigDL-LLM's implicit low-precision quantization can be both resource-intensive and time-consuming, and may even be impossible on memory-limited machines.
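
The arithmetic behind these figures can be sketched as follows. This is a rough back-of-the-envelope estimate, assuming 1 GB = 10^9 bytes and counting weights only; it ignores quantization metadata such as scales, as well as activations and any other runtime overhead. The function name is hypothetical.

```python
def estimate_weight_memory_gb(params_in_billions, bits_per_param):
    """Approximate weight storage in GB, taking 1 GB as 1e9 bytes."""
    bytes_per_param = bits_per_param / 8
    return params_in_billions * bytes_per_param


# Model sizes mentioned in this section, in billions of parameters
models = [("Llama 2 7B", 7), ("Falcon 40B", 40), ("Llama 2 70B", 70), ("Bloom 176B", 176)]

for name, size in models:
    fp32 = estimate_weight_memory_gb(size, 32)
    fp16 = estimate_weight_memory_gb(size, 16)
    int4 = estimate_weight_memory_gb(size, 4)
    print(f"{name}: ~{fp32:.1f} GB (32 bit), ~{fp16:.1f} GB (16 bit), ~{int4:.1f} GB (4 bit)")
```

The gap between the 16/32-bit columns and the 4-bit column is what makes it attractive to quantize once and reuse the saved low-precision model afterwards.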

To address this issue, BigDL-LLM supports saving *transformers* models in the BigDL-LLM low-precision format. Once a model has been optimized and saved in this format, it can be loaded directly for subsequent inference, with no need to repeat the quantization. Saving and loading can also be done on different machines.

**Save Low-Precision Model**

Let's take the `model_in_4bit` from section [4.1.2.1](#4121-load-model-in-low-precision) as an example. After loading Llama 2 (7B) in 4 bit, we can use the `save_low_bit` function to save the optimized model:

In [None]:
save_directory='./llama-2-7b-bigdl-llm-4-bit'

model_in_4bit.save_low_bit(save_directory)

We recommend saving the tokenizer in the same directory as the optimized model to simplify the subsequent loading process:

In [None]:
tokenizer.save_pretrained(save_directory)

**Load Low-Precision Model**

We can load the optimized low-precision model through the `load_low_bit` function, and load the tokenizer from the same saved directory:

In [None]:
# note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
loaded_4bit_model = AutoModelForCausalLM.load_low_bit(save_directory)

loaded_tokenizer = LlamaTokenizer.from_pretrained(save_directory)

## 4.1.3 Conduct Inference