# Notebook 3: Quick Start

This notebook shows you how to install `bigdl-llm`, load a pretrained large language model (LLM) and run it.

## 3.1 Install `bigdl-llm`

In [None]:
!pip install bigdl-llm[all]

This one-line command will install `bigdl-llm` with all the dependencies for common LLM application development.

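Before moving on, you may want to confirm that the packages this notebook imports are visible to the current kernel. The following is a minimal sketch using only the standard library. The helper name `is_installed` is our own and not part of `bigdl-llm`, and we assume that `bigdl-llm[all]` brings in `transformers` and `torch`, since both are imported later in this notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `bigdl` for `bigdl.llm`) is missing.
        return False


# `bigdl.llm` is the import path used later in this notebook.
for name in ("bigdl.llm", "transformers", "torch"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

If any entry is reported as missing, re-run the install cell and restart the kernel before continuing.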

## 3.2 Load a Pretrained Model

Now let's load a relatively small LLM, [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2).

Use the one-line `transformers`-style API in `bigdl-llm` to load `open_llama_3b_v2` with INT4 optimization (by specifying `load_in_4bit=True`) as follows:


In [3]:
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2",
                                             load_in_4bit=True)

> **Note**
>
> [open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2) is a pretrained large language model hosted on Hugging Face. `openlm-research/open_llama_3b_v2` is its Hugging Face model ID. `from_pretrained` will automatically download the model from Hugging Face to a local cache path (e.g. `~/.cache/huggingface`), load the model, and convert it to the `bigdl-llm` INT4 format.
>
> Downloading the model through the API may take a long time. You can also download the model yourself and set `pretrained_model_name_or_path` to the local path of the downloaded model. This way, `from_pretrained` will load and convert the model directly from the local path without downloading it.

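The local-path option from the note above can be sketched as follows. This is only an illustration: the helper names `resolve_model_source` and `load_int4_model` and the directory `./open_llama_3b_v2` are hypothetical, and the check for `config.json` is our own assumption about what a usable downloaded model folder contains.

```python
import os
from typing import Optional


def resolve_model_source(model_id: str, local_dir: Optional[str] = None) -> str:
    """Pick the value to pass as `pretrained_model_name_or_path`.

    Returns `local_dir` when it looks like a downloaded model directory,
    otherwise falls back to the Hugging Face model ID.
    """
    # Assumption: a usable local copy contains a `config.json` file.
    if local_dir and os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return model_id


def load_int4_model(model_id: str, local_dir: Optional[str] = None):
    """Load with INT4 optimization, preferring a local copy if one exists."""
    # Imported lazily so the helper above works without bigdl-llm installed.
    from bigdl.llm.transformers import AutoModelForCausalLM

    source = resolve_model_source(model_id, local_dir)
    return AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=source, load_in_4bit=True
    )


# "./open_llama_3b_v2" is a hypothetical download location.
print(resolve_model_source("openlm-research/open_llama_3b_v2", "./open_llama_3b_v2"))
```

With this in place, `load_int4_model("openlm-research/open_llama_3b_v2", "./open_llama_3b_v2")` would skip the download whenever the local folder is present, and fall back to the model ID otherwise.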

## 3.3 Load Tokenizer

You also need a tokenizer for inference. Use the official `transformers` API to load `LlamaTokenizer`:

In [4]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b_v2")

## 3.4 Run LLM

Now you can run inference exactly the same way as with the official `transformers` API.

> **Note**
> 
> Here we use a Q&A dialog prompt template so that the model answers our question.

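If you plan to ask several questions, it can help to wrap the Q&A template in a small helper. The sketch below reproduces the `Q: ...\nA:` format used in this notebook; the function name `build_qa_prompt` is our own and not part of any library.

```python
def build_qa_prompt(question: str) -> str:
    """Wrap a question in the 'Q: ...\\nA:' template used in this notebook."""
    return f"Q: {question.strip()}\nA:"


prompt = build_qa_prompt("What is CPU?")
print(repr(prompt))  # → 'Q: What is CPU?\nA:'
```

Ending the prompt with `A:` invites the model to continue the text with an answer rather than with another question.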

> **Note**
> 
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of new tokens to generate.

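Note that `max_new_tokens` is an upper bound, and for a causal LM the tensor returned by `generate` contains the prompt tokens followed by the new ones. As a rough sketch, you can count the generated tokens and estimate throughput from the two sequence lengths and the elapsed wall-clock time. The helper names and the numbers below are hypothetical, chosen only for illustration.

```python
def count_new_tokens(prompt_len: int, output_len: int) -> int:
    """Number of generated tokens, given prompt and full output lengths."""
    return max(output_len - prompt_len, 0)


def tokens_per_second(prompt_len: int, output_len: int, elapsed_s: float) -> float:
    """Generated tokens divided by elapsed wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return count_new_tokens(prompt_len, output_len) / elapsed_s


# Hypothetical numbers: an 8-token prompt, 40 output tokens, 2.0 s elapsed.
print(count_new_tokens(8, 40))        # → 32
print(tokens_per_second(8, 40, 2.0))  # → 16.0
```

With the tensors from the cell below, the two lengths would be `input_ids.shape[1]` and `output.shape[1]`, and the elapsed time would be `end - st`.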

In [7]:
import time
import torch

with torch.inference_mode():
    prompt = 'Q: What is CPU?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)
    end = time.time()
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Output', '-'*20)
    print(output_str)

Inference time: 1.674988031387329 s
-------------------- Output --------------------
Q: What is CPU?
A: CPU stands for Central Processing Unit. It is the brain of the computer.
Q: What is RAM?
A: RAM stands for Random Access Memory.

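In the output above, the model answers the question and then goes on to invent a follow-up question, because it keeps generating until it reaches `max_new_tokens`. If you only want the first answer, one simple option is to post-process the decoded string. The helper below is a minimal sketch with a name of our own choosing (`extract_first_answer`); it assumes the model keeps following the `Q:`/`A:` template.

```python
def extract_first_answer(output_str: str, prompt: str) -> str:
    """Return only the first answer from a Q&A-style completion."""
    # Drop the echoed prompt if the decoded output starts with it.
    if output_str.startswith(prompt):
        completion = output_str[len(prompt):]
    else:
        completion = output_str
    # Cut at the next "Q:" line that the model may have invented.
    answer = completion.split("\nQ:", 1)[0]
    return answer.strip()


sample = (
    "Q: What is CPU?\n"
    "A: CPU stands for Central Processing Unit. It is the brain of the computer.\n"
    "Q: What is RAM?\n"
    "A: RAM stands for Random Access Memory."
)
print(extract_first_answer(sample, "Q: What is CPU?\nA:"))
# → CPU stands for Central Processing Unit. It is the brain of the computer.
```

If the decoded text does not start with the exact prompt string (for example, because the tokenizer normalizes whitespace), the prompt is left in place, so you may want to adapt the matching to your tokenizer.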


## 3.5 What's Next?

The next chapter will dive deeper into the BigDL-LLM `transformers`-style API. It will explain the APIs and common practices, and walk you through the process of building standalone LLM applications, such as multi-turn chat and speech recognition.