# Notebook: Quick Start

With the environment set up, we now move into the hands-on tutorial: building an application that runs inference on a large language model with BigDL-LLM `transformers` INT4 optimization.

## 2. BigDL-LLM Installation

Install BigDL-LLM through:

In [None]:
!pip install bigdl-llm[all]

The `[all]` option also installs the additional packages that BigDL-LLM requires.

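To confirm that the installation succeeded before moving on, you could query the package metadata from Python. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_version` is an illustrative assumption and is not part of BigDL-LLM.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "bigdl-llm" is the distribution name used in the pip command above.
bigdl_version = installed_version("bigdl-llm")
if bigdl_version is None:
    print("bigdl-llm is not installed in this environment")
else:
    print(f"bigdl-llm {bigdl_version} is installed")
```

If the first message is printed, re-run the install cell and restart the notebook kernel so that the new package is picked up.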
## 3. Building an Application

BigDL-LLM offers a Transformers-style API, so users familiar with Hugging Face `transformers` get a unified and consistent experience.

### 3.1 Load Model

The first step is to import BigDL-LLM and load the large language model with INT4 optimization:


In [1]:
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b",
                                             load_in_4bit=True)

> **Note**
>
> [`openlm-research/open_llama_3b`](https://huggingface.co/openlm-research/open_llama_3b) is the id of this pretrained model hosted on [huggingface.co](https://huggingface.co). When `from_pretrained` is executed, the model is downloaded to `~/.cache/huggingface` by default, then loaded and implicitly converted to the BigDL-LLM INT4 format. You can set the environment variable `HF_HOME` to change where the model is downloaded.
>
> If you have already downloaded the model, you can instead set the `pretrained_model_name_or_path` parameter to the corresponding local path.

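To see where the download will land on your machine, the lookup described in the note can be sketched in a few lines. This is an illustration of the documented lookup order (`HF_HOME` first, then `XDG_CACHE_HOME`, then `~/.cache`), not the library's own code, and `resolve_hf_home` is a hypothetical helper name.

```python
import os
from pathlib import Path


def resolve_hf_home(env=None):
    """Sketch of the lookup order: HF_HOME, then XDG_CACHE_HOME/huggingface,
    then ~/.cache/huggingface."""
    env = os.environ if env is None else env
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser()
    cache_root = env.get("XDG_CACHE_HOME") or "~/.cache"
    return (Path(cache_root) / "huggingface").expanduser()


# Directory that would be used with the current environment variables.
print(resolve_hf_home())
```

If you change `HF_HOME` from inside the notebook (for example via `os.environ`), do so before importing `transformers` or `bigdl.llm.transformers`, since the cache location is typically read at import time.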

### 3.2 Load Tokenizer

The second step is to load the model's corresponding tokenizer:

In [2]:
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b")

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565


### 3.3 Conduct Inference

You can then run inference just as you would with the standard `transformers` API, with low latency. Here, a Q&A prompt is given to the model to complete:

In [3]:
import time
import torch

with torch.inference_mode():
    prompt = 'Q: What is AI?\nA:'
    
    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # predict the next tokens (maximum 32) based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)
    end = time.time()
    # decode the predicted token ids to output string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Prompt', '-'*20)
    print(prompt)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

Inference time: xxxx s
-------------------- Prompt --------------------
Q: What is AI?
A:
-------------------- Output --------------------
Q: What is AI?
A: Artificial Intelligence is the science of making machines do things that would require intelligence if done by a human.
Q: What is the difference between AI and Machine Learning

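The raw inference time is easier to compare across runs when it is converted to throughput. The following is a minimal sketch of that measurement; `timed_call` and `tokens_per_second` are hypothetical helpers, and a stand-in workload replaces the model so that the snippet runs on its own. It uses `time.perf_counter`, a monotonic clock intended for measuring short durations.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Average generation throughput; returns 0.0 when no time was measured."""
    return num_new_tokens / elapsed_seconds if elapsed_seconds > 0 else 0.0


# Stand-in workload so the sketch runs without a model.
result, elapsed = timed_call(sum, range(1_000_000))
print(f"elapsed: {elapsed:.4f} s")
print(f"32 tokens in 2.0 s: {tokens_per_second(32, 2.0)} tokens/s")
```

In the notebook, the same helper could wrap the generation step, for example `output, elapsed = timed_call(model.generate, input_ids, max_new_tokens=32)`. The first call may include one-off warm-up costs, so timing a second run is usually more representative.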

> **Note**
>
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of tokens to generate.

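The printed output repeats the prompt because, for causal language models, `generate` returns the prompt token ids followed by the newly generated ones; `max_new_tokens` limits only the second part. The sketch below illustrates the split with plain Python lists. The token ids are made-up values and `split_generated` is a hypothetical helper.

```python
def split_generated(output_ids, prompt_length):
    """Split a generated sequence into (prompt_ids, new_ids)."""
    return output_ids[:prompt_length], output_ids[prompt_length:]


# Toy token ids standing in for tokenizer/model output (hypothetical values).
prompt_ids = [1, 1195, 29901, 1724]
output_ids = prompt_ids + [319, 29902, 338]  # prompt followed by 3 new tokens

_, new_ids = split_generated(output_ids, len(prompt_ids))
print(new_ids)       # [319, 29902, 338]
print(len(new_ids))  # 3
```

With the tensors from the cell above, the equivalent slice would be `output[0][input_ids.shape[1]:]`, which can then be passed to `tokenizer.decode` to print only the model's answer.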

## 4. What's Next?

In the upcoming chapter, you will dive deeper into the BigDL-LLM Transformers-style API. The tutorials in that chapter also show how to apply BigDL-LLM in other domains, such as speech recognition.