# Notebook 3: Quick Start

This notebook shows you how to install `bigdl-llm`, load a pretrained large language model (LLM), and build a basic chat application.

## 3.1 Install `bigdl-llm`

In [None]:
!pip install --pre --upgrade bigdl-llm[all]

This one-line command installs the latest `bigdl-llm` release (including pre-releases, because of `--pre`) together with the dependencies needed for common LLM application development.
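
After the install finishes, it can be useful to confirm that the package is visible to the current Python environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of `bigdl-llm`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("bigdl-llm")
if version is None:
    print("bigdl-llm is not installed in this environment")
else:
    print(f"bigdl-llm version: {version}")
```

If the check reports that the package is missing, the notebook kernel may be running in a different environment from the one `pip` installed into.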


## 3.2 Load a Pretrained Model

Now let's load a relatively small LLM, [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2), as an example.

### 3.2.1 Optimize Model
You can load `open_llama_3b_v2` with [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) and then call the one-line, PyTorch-style `optimize_model` API in `bigdl-llm` to accelerate it with INT4 optimization, as follows:


In [None]:
from transformers import LlamaForCausalLM, LlamaTokenizer
from bigdl.llm import optimize_model

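# Local path to the downloaded model; the Hugging Face model id
# 'openlm-research/open_llama_3b_v2' can be used here instead
model_path = '/disk0/binbin/llm-models/open_llama_3b_v2'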

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path,
                                         torch_dtype="auto",
                                         low_cpu_mem_usage=True)

# A single line enables BigDL-LLM optimization on the model
model = optimize_model(model)

> **Note**
>
> * Please refer to the [API document](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/optimize.html#) for more information about `optimize_model`.
>
> * [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2) is a pretrained large language model hosted on Hugging Face, and `openlm-research/open_llama_3b_v2` is its Hugging Face model id. If you pass the model id, `from_pretrained` automatically downloads the model from Hugging Face to a local cache path (e.g. `~/.cache/huggingface`) and loads it. Downloading the model this way may take a long time. You can also download the model yourself and set `model_path` to the local path of the downloaded model. More information about `from_pretrained` can be found [here](https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.from_pretrained).
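
For a rough intuition of what INT4 optimization means, the sketch below quantizes a small block of floating-point weights to 4-bit signed integers that share a single scale factor, then restores them. It is a simplified illustration in plain Python and is not the scheme `bigdl-llm` actually implements; the function names `quantize_int4` and `dequantize_int4` and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetrically quantize a block of floats to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    # One scale per block maps the largest magnitude onto the integer 7
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int4(quantized: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate float values."""
    return [q * scale for q in quantized]


# Made-up weights standing in for one block of a model's weight matrix
block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, 0.0, -0.08]
quantized, scale = quantize_int4(block)
restored = dequantize_int4(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("4-bit values:", quantized)
print("largest round-trip error:", max_error)
```

Each weight is stored in 4 bits plus a shared scale, which is why low-bit models need much less memory, at the cost of a small rounding error per weight.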


### 3.2.2 Save & Load Optimized Model

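After the model is optimized, you can save it in low-bit format with `save_low_bit` as follows. A saved model can later be loaded again with `load_low_bit` from `bigdl.llm.optimize` (see the `optimize_model` API document linked above for details):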

In [4]:
save_directory = './open-llama-3b-v2-bigdl-llm-INT4'

model.save_low_bit(save_directory)

### 3.2.3 (Optional) Load Model with Transformers-Style API

Besides the PyTorch-style API above, `bigdl-llm` also provides a transformers-style API for applying INT4 optimizations to any Hugging Face *Transformers* model, as follows:

In [5]:
from bigdl.llm.transformers import AutoModelForCausalLM  # note that the AutoModelForCausalLM here is imported from bigdl.llm.transformers
from transformers import LlamaTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True)
tokenizer = LlamaTokenizer.from_pretrained(model_path)

> **Note**
>
> Please refer to the [API document](https://bigdl.readthedocs.io/en/latest/doc/PythonAPI/LLM/transformers.html#) for more information about the transformers-style API.

## 3.3 Run LLM

Now you can run inference exactly as you would with the official `transformers` API to implement a basic chat application.

> **Note**
> 
> Here we use a Q&A dialog prompt template so that the model answers our questions.
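
The Q&A template can be wrapped in a small helper so that every question is formatted the same way. This is a minimal sketch; `build_qa_prompt` is a hypothetical name, and the template simply mirrors the prompt string used in this notebook.

```python
def build_qa_prompt(question: str) -> str:
    """Format a question with the 'Q: ...\\nA:' dialog template."""
    # Ending the prompt with "A:" nudges the model to continue with an answer
    return f"Q: {question.strip()}\nA:"


prompt = build_qa_prompt("What is CPU?")
print(repr(prompt))
```

The resulting string can be passed to `tokenizer.encode` in place of a hand-written prompt.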


> **Note**
> 
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of new tokens to generate.
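
To illustrate how a cap like `max_new_tokens` bounds generation, here is a toy decoding loop in plain Python. It is only a sketch under stated assumptions: the `generate` function below is a stand-in written for this example (not the `transformers` implementation), and the "model" is a scripted token source with a hypothetical `<eos>` marker.

```python
from typing import Callable, List

EOS = "<eos>"  # hypothetical end-of-sequence marker for this toy example


def generate(prompt_tokens: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int) -> List[str]:
    """Append tokens until EOS is produced or max_new_tokens is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens


def make_toy_model(script: List[str]) -> Callable[[List[str]], str]:
    """Return a fake model that replays `script`, then emits EOS forever."""
    remaining = iter(script)
    return lambda tokens: next(remaining, EOS)


script = ["CPU", "stands", "for", "Central", "Processing", "Unit"]
prompt_tokens = ["Q:", "What", "is", "CPU?", "A:"]

short = generate(prompt_tokens, make_toy_model(script), max_new_tokens=3)
full = generate(prompt_tokens, make_toy_model(script), max_new_tokens=32)

print(len(short) - len(prompt_tokens), "new tokens with max_new_tokens=3")
print(len(full) - len(prompt_tokens), "new tokens with max_new_tokens=32")
```

With a small cap the answer is cut off early; with a large cap generation ends when the model itself signals the end of the sequence.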


In [6]:
import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids, max_new_tokens=32)
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print('-'*20, 'Output', '-'*20)
    print(output_str)

-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.
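
Note that the decoded output echoes the prompt and that the model keeps going with a new question of its own until the token limit is reached. For a chat application you may want only the first answer. The helper below is a minimal sketch of one way to trim the text, assuming the `Q:`/`A:` template used above; `extract_first_answer` is a hypothetical name.

```python
def extract_first_answer(output_str: str, prompt: str) -> str:
    """Return only the first answer from a Q&A-style completion."""
    # Drop the echoed prompt if the output starts with it
    if output_str.startswith(prompt):
        completion = output_str[len(prompt):]
    else:
        completion = output_str
    # Cut the text where the model starts a new question
    answer = completion.split("\nQ:", 1)[0]
    return answer.strip()


prompt = 'Q: What is CPU?\nA:'
# Sample text copied from the output shown above
output_str = (
    "Q: What is CPU?\n"
    "A: CPU stands for Central Processing Unit. It is the brain of the computer.\n"
    "Q: What is RAM?\n"
    "A: RAM stands for Random Access Memory."
)

print(extract_first_answer(output_str, prompt))
```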



## 3.4 What's Next?

The next chapter will explore the capability of large language models in handling multiple languages.