# Notebook 3: Quick Start

This quick start notebook shows how to get started with BigDL-LLM: install it, load a pretrained large language model (LLM), and run inference with it.

## 3.1 Install BigDL-LLM

In [None]:
!pip install bigdl-llm[all]

This one-line command installs `bigdl-llm` together with the dependencies needed for common LLM application development.
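
To confirm that the installation succeeded before moving on, you can query the installed package version. The following is a minimal sketch that uses only the Python standard library; the helper name `installed_version` is hypothetical and is not part of BigDL-LLM.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "bigdl-llm" is the distribution name used in the pip command above
version = installed_version("bigdl-llm")
if version is None:
    print("bigdl-llm is not installed in this environment")
else:
    print(f"bigdl-llm {version} is installed")
```

If the first message is printed, re-run the install cell and restart the notebook kernel before continuing.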


## 3.2 Load a Pretrained LLM

We use the `transformers`-style API in BigDL-LLM to load a pretrained model (here, `open_llama_3b`) with INT4 optimization, enabled by passing `load_in_4bit=True` to `from_pretrained`:


In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b",
                                             load_in_4bit=True)
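
As a rough illustration of why INT4 loading is attractive, the sketch below estimates weight storage at different precisions. It assumes about 3 billion parameters (suggested by the model name, not a measured count) and ignores quantization scales and other overhead, so the results are ballpark figures only. `weight_bytes` is a hypothetical helper, not a BigDL-LLM function.

```python
def weight_bytes(num_params, bits_per_weight):
    """Estimate the bytes needed to store the weights alone (no overhead)."""
    return num_params * bits_per_weight / 8


# Assumption: roughly 3 billion parameters, based only on the model name
NUM_PARAMS = 3_000_000_000

for name, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    gigabytes = weight_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{name}: ~{gigabytes:.1f} GB")
```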

> **Note**
>
> `openlm-research/open_llama_3b` is the model id of the model hosted on Hugging Face ([model link](https://huggingface.co/openlm-research/open_llama_3b)). `from_pretrained` automatically downloads the model from Hugging Face to a local cache path (e.g. `~/.cache/huggingface`), loads it, and converts it to BigDL-LLM INT4 format.
>
> Downloading the model through the API may take a long time. You can also download the model yourself and set the `pretrained_model_name_or_path` parameter of `from_pretrained` to the local path of the downloaded model. `from_pretrained` will then load the model directly from that path and convert it.
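
The choice between a local copy and the Hugging Face model id can be sketched as follows. This is only an illustration: `resolve_model_source` is a hypothetical helper, `./open_llama_3b` is an assumed download location, and the presence of `config.json` is used as a simple heuristic for "this directory contains a downloaded model".

```python
from pathlib import Path


def resolve_model_source(local_dir, model_id):
    """Return local_dir if it looks like a downloaded model, otherwise model_id."""
    path = Path(local_dir).expanduser()
    # Hugging Face model directories contain a config.json file
    if path.is_dir() and (path / "config.json").is_file():
        return str(path)
    return model_id


# "./open_llama_3b" is a hypothetical location for a manually downloaded model
model_source = resolve_model_source("./open_llama_3b",
                                    "openlm-research/open_llama_3b")
print(f"Loading model from: {model_source}")

# The result can then be passed to from_pretrained as in the cell above:
# model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_source,
#                                              load_in_4bit=True)
```

The same value can be reused when loading the tokenizer, so that model and tokenizer come from the same source.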


## 3.3 Load Tokenizer

The next step is to load the tokenizer that matches the model:

In [None]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b")

## 3.4 Run LLM

You can then run inference just as you would with the standard `transformers` API; the INT4 optimization helps keep latency low. Here, a Q&A dialog is set up for the model to complete:

In [3]:
import time
import torch

with torch.inference_mode():
    prompt = 'Q: What is AI?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)
    end = time.time()
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Prompt', '-'*20)
    print(prompt)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

Inference time: xxxx s
-------------------- Prompt --------------------
Q: What is AI?
A:
-------------------- Output --------------------
Q: What is AI?
A: Artificial Intelligence is the science of making machines do things that would require intelligence if done by a human.
Q: What is the difference between AI and Machine Learning


> **Note**
>
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of tokens to predict.
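
For decoder-only models such as this one, the sequence returned by `generate` starts with the prompt tokens and continues with the newly generated ones, which is why the decoded output above repeats the prompt. The limit set by `max_new_tokens` applies only to the new part. The sketch below separates the two using plain Python lists; `split_completion` is a hypothetical helper and the token ids are made-up stand-ins, not real tokenizer output.

```python
def split_completion(prompt_ids, output_ids):
    """Return only the tokens that follow the prompt in a generated sequence."""
    n = len(prompt_ids)
    if list(output_ids[:n]) != list(prompt_ids):
        raise ValueError("output does not start with the prompt tokens")
    return list(output_ids[n:])


# Stand-in token ids for illustration; with the inference cell above you would
# use input_ids[0].tolist() and output[0].tolist() instead
prompt_ids = [1, 29, 313]
output_ids = [1, 29, 313, 4500, 72, 9]

new_ids = split_completion(prompt_ids, output_ids)
print(f"{len(new_ids)} new tokens generated")

# To print only the answer without the prompt:
# print(tokenizer.decode(new_ids, skip_special_tokens=True))
```

The length of `new_ids` should never exceed the `max_new_tokens` value passed to `generate`; it can be shorter if the model emits an end-of-sequence token first.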


## 3.5 What's Next?

In the next chapter, you will dive deeper into the BigDL-LLM `transformers`-style API. Its tutorials also show how to apply BigDL-LLM in different domains, such as speech recognition.