# Notebook 3: Basic Application Development

This notebook introduces the essential usage of `bigdl-llm`, and walks you through building a very basic chat application.

## 3.1 Install `bigdl-llm`

If you haven't installed `bigdl-llm`, install it as shown below. The one-line command will install the latest `bigdl-llm` with all the dependencies for common LLM application development.

In [None]:
!pip install --pre --upgrade bigdl-llm[all]

## 3.2 Load a Pretrained Model

Before using an LLM, you first need to load one. Here we take a relatively small LLM, [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), as an example.

### 3.2.1 Optimize Model

In general, a single call to `optimize_model` is all you need to optimize any PyTorch model you have loaded.

The model loading process is as follows.

First, load your model with any PyTorch API you like. Here we use `LlamaForCausalLM` from the [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) library to load `open_llama_3b_v2`.

Then, call `optimize_model` to optimize the loaded model (INT4 optimization is applied by default).

In [1]:
from transformers import LlamaForCausalLM
from bigdl.llm import optimize_model

model_path = 'openlm-research/open_llama_3b_v2'

model = LlamaForCausalLM.from_pretrained(model_path,
                                         torch_dtype="auto",
                                         low_cpu_mem_usage=True)

# A single line enables BigDL-LLM optimization on the model
model = optimize_model(model)
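
To give a feel for what a 4-bit integer representation of weights involves, here is a minimal sketch of symmetric 4-bit quantization on a small list of floats. It is not the scheme `bigdl-llm` uses internally; the function names `quantize_int4` and `dequantize_int4` and the sample weights are hypothetical, chosen only to illustrate the general idea of storing small integer codes plus a scale.

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization of one block of floats (illustrative only)."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7 so every code fits the signed 4-bit range [-8, 7].
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int4(codes, scale):
    """Recover approximate floats from the 4-bit codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical weights, made up for this example.
weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.0, 0.48, -0.76]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}, max reconstruction error: {max_error:.4f}")
```

Each weight is stored as a 4-bit code instead of a 16-bit float, at the cost of a reconstruction error of at most half the scale in this sketch.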

> **Note**
>
> * If you want to use precisions other than INT4 (e.g. NF4, INT5, INT8), or want more details about the arguments, please refer to the [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html#).
>
> * [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2) is a pretrained large language model hosted on Hugging Face, and `openlm-research/open_llama_3b_v2` is its Hugging Face model ID. `LlamaForCausalLM.from_pretrained` automatically downloads the model from Hugging Face to a local cache path (e.g. `~/.cache/huggingface`) and loads it. Downloading the model through the API may take a long time. You can also download the model yourself and set `model_path` to the local path of the downloaded model. More information about `from_pretrained` can be found [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained).
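
If you sometimes work from a local copy and sometimes from the Hugging Face model ID, a small helper can pick the right value for `model_path`. The sketch below is only one way to do it: `resolve_model_path` and the `./open_llama_3b_v2` directory are hypothetical, and the check assumes a downloaded model directory contains a `config.json` file, as Hugging Face model repositories normally do.

```python
import os


def resolve_model_path(model_id, local_dir=None):
    """Return local_dir if it looks like a downloaded model, else the model ID."""
    if local_dir and os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return model_id


# Hypothetical local directory; falls back to the model ID if it does not exist.
model_path = resolve_model_path("openlm-research/open_llama_3b_v2",
                                local_dir="./open_llama_3b_v2")
print(model_path)
```

The returned value can be passed to `from_pretrained` in place of the hard-coded model ID.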


### 3.2.2 Save & Load Optimized Model

In the previous section, before `optimize_model` is called, the model loaded through the Hugging Face Transformers API is in FP16 precision. Loading it this way can already consume substantial memory and may lead to out-of-memory errors, especially for larger models.

To address this problem, `bigdl-llm` allows you to save the optimized low-bit model using `save_low_bit` and load it later with `load_low_bit` for inference. This approach bypasses the need to load the original FP16 model, saving both memory and time. Moreover, because the optimized model format is platform-agnostic, you can seamlessly perform saving and loading operations across various machines, regardless of their operating systems. This flexibility enables you to perform optimization/saving on a high-RAM server and deploy the model for inference on a PC with limited RAM.
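
To put the memory difference in rough numbers, the back-of-the-envelope calculation below estimates the size of the weights alone at several precisions. The parameter count is an assumption (a nominal 3 billion for a "3B" model), and `weight_memory_gib` is a hypothetical helper; real low-bit formats also store scales and other metadata, and inference needs additional memory for activations, so actual usage will be higher.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Assumption: a nominal parameter count for a "3B" model.
NUM_PARAMS = 3_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions the INT4 weights take about a quarter of the FP16 footprint, which is why loading the saved low-bit model is feasible on machines where the FP16 model would not fit.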


**Save Optimized Model**

For example, you can use the `save_low_bit` function to save the optimized model as shown below:

In [None]:
save_directory = './open-llama-3b-v2-bigdl-llm-INT4'

model.save_low_bit(save_directory)

**Load Optimized Model**

Then use `load_low_bit` to load the optimized low-bit model as follows:

In [3]:
from bigdl.llm.optimize import load_low_bit

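# Instantiate the model with the same API that was used before optimization
model = LlamaForCausalLM.from_pretrained(model_path,
                                         torch_dtype="auto",
                                         low_cpu_mem_usage=True)
# Load the saved low-bit weights from save_directory into the model
model = load_low_bit(model, save_directory)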
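
The save-once, load-later workflow can be wrapped in a small cache-style helper. The sketch below is a hypothetical pattern, not part of `bigdl-llm`: `load_or_optimize` and its three callable arguments are placeholders that you would fill in with the optimize, save, and load steps shown above.

```python
import os


def load_or_optimize(save_dir, optimize_fn, load_fn, save_fn):
    """Load a saved low-bit model if one exists; otherwise build and save it.

    optimize_fn() -> model          builds the optimized model from scratch
    load_fn(save_dir) -> model      loads a previously saved model
    save_fn(model, save_dir)        saves the optimized model
    """
    # Treat a non-empty directory as evidence of an earlier save.
    if os.path.isdir(save_dir) and os.listdir(save_dir):
        return load_fn(save_dir)
    model = optimize_fn()
    save_fn(model, save_dir)
    return model
```

In this notebook's setting, `optimize_fn` would wrap `from_pretrained` plus `optimize_model`, `save_fn` would call `model.save_low_bit(save_dir)`, and `load_fn` would wrap the `load_low_bit` step.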

### 3.2.3 (Optional) Load Model with Transformers-Style API

The `optimize_model` API can be used to optimize any PyTorch model, regardless of the loading API or library employed. Additionally, `bigdl-llm` provides another set of APIs for Hugging Face *Transformers* models, referred to as the transformers-style API.

For example, you can use `bigdl.llm.transformers.AutoModelForCausalLM` to load `open_llama_3b_v2`. Specifying `load_in_4bit=True` in `from_pretrained` automatically applies INT4 optimizations during the loading process.

In [4]:
from bigdl.llm.transformers import AutoModelForCausalLM  # note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True)

> **Note**
>
> Please refer to the [API documentation](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html#) for more information about the transformers-style API.

## 3.3 Building a Simple Chat Application

Now that the model is successfully loaded, we can start building our very first chat application. We will use the Hugging Face Transformers inference API for this.

> **Note**
> 
> The code in this section is implemented solely with the Hugging Face Transformers API. `bigdl-llm` does not require any change to the inference code, so you can use any library to build your application at the inference stage.

> **Note**
> 
> Here we use a Q&A dialog prompt template so that the model answers our questions.
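
As a minimal sketch of such a template, the helper below formats a question (and, optionally, earlier turns) in the `Q: ... / A:` style used in this section. The name `build_prompt` and the `history` argument are hypothetical and not part of any library.

```python
def build_prompt(question, history=()):
    """Format a question, plus optional earlier (question, answer) turns,
    as a Q&A dialog prompt that ends with an open 'A:' for the model to complete."""
    lines = []
    for past_question, past_answer in history:
        lines.append(f"Q: {past_question}")
        lines.append(f"A: {past_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


print(build_prompt("What is CPU?"))
```

Ending the prompt with `A:` nudges the model to continue with an answer rather than another question.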


> **Note**
> 
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of tokens to generate, not counting the tokens in the prompt.
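
To illustrate how a cap like `max_new_tokens` bounds generation, here is a toy greedy decoding loop. It is a simplified sketch, not the Transformers implementation of `generate`; `greedy_generate` and `toy_next_token` are hypothetical stand-ins, and the "model" simply returns the last token ID plus one.

```python
def greedy_generate(next_token_fn, input_ids, max_new_tokens, eos_token_id=None):
    """Append up to max_new_tokens tokens, stopping early on the end-of-sequence token."""
    output = list(input_ids)
    for _ in range(max_new_tokens):
        token = next_token_fn(output)
        output.append(token)
        if token == eos_token_id:
            break
    return output


def toy_next_token(ids):
    """Stand-in for a model: predicts the last token ID plus one."""
    return ids[-1] + 1


# At most 4 new tokens are appended to the 3 prompt tokens.
print(greedy_generate(toy_next_token, [1, 2, 3], max_new_tokens=4))
# Generation stops early once the end-of-sequence token (here 5) is produced.
print(greedy_generate(toy_next_token, [1, 2, 3], max_new_tokens=4, eos_token_id=5))
```

The returned sequence contains the prompt tokens followed by the new ones, which is why the decoded output later in this section starts with the prompt text.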


In [None]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(model_path)

In [6]:
import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)

-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.
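
Notice that the model keeps going after the first answer and invents a follow-up question. For a chat application you usually want only the first answer, so a small post-processing step helps. The sketch below is one possible approach; `extract_first_answer` is a hypothetical helper, and it assumes the model continues in the same `Q:`/`A:` format as the prompt.

```python
def extract_first_answer(output_str, prompt):
    """Return only the first answer from a decoded Q&A completion."""
    # Drop the echoed prompt when the decoded output starts with it;
    # otherwise fall back to working on the whole output.
    if output_str.startswith(prompt):
        completion = output_str[len(prompt):]
    else:
        completion = output_str
    # Cut at the next generated question, if there is one.
    return completion.split("\nQ:", 1)[0].strip()


prompt = "Q: What is CPU?\nA:"
output_str = (
    "Q: What is CPU?\n"
    "A: CPU stands for Central Processing Unit. It is the brain of the computer.\n"
    "Q: What is RAM?\n"
    "A: RAM stands for Random Access Memory."
)
print(extract_first_answer(output_str, prompt))
```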
