# Quick Start

This tutorial is a step-by-step, hands-on guide. We will begin by setting up the required environment and then build an application that uses BigDL-LLM transformers INT4 optimization to run inference on a large language model with low latency. Completing it will also prepare you to follow the upcoming tutorials.

## 1. Environment Setup

### 1.1 Recommended System

For a smooth experience, we recommend running the tutorial on a PC with a 12th Gen Intel® Core™ processor or newer and at least 16GB of RAM. For servers, we recommend systems with Intel® Xeon® processors.

BigDL-LLM supports the following operating systems: Ubuntu 20.04 or later, CentOS 7 or later, and Windows 10/11.
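For a quick check against these recommendations, the sketch below prints your OS and total RAM. It assumes a POSIX system (Linux or WSL) for the RAM query through `os.sysconf` and reports "unknown" elsewhere, for example on native Windows. The 10% slack for memory reserved by the OS is our own assumption, not an official threshold.

```python
import os
import platform

RECOMMENDED_RAM_GB = 16  # recommendation from section 1.1
SLACK = 0.9  # assumption: the OS reserves some memory, so a "16GB" machine reports slightly less


def total_ram_gb():
    """Total physical RAM in GiB via POSIX sysconf; None where unsupported (e.g. native Windows)."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


def meets_ram_recommendation(ram_gb, recommended=RECOMMENDED_RAM_GB):
    """True if ram_gb is known and within SLACK of the recommended amount."""
    return ram_gb is not None and ram_gb >= recommended * SLACK


ram = total_ram_gb()
print("OS:", platform.system(), platform.release())
print("RAM (GiB):", "unknown" if ram is None else f"{ram:.1f}")
print("Meets the 16GB recommendation:", meets_ram_recommendation(ram))
```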

### 1.2 Conda and Environment Management

[Conda](https://docs.conda.io/projects/conda/en/stable/) is an open-source, cross-platform package and environment management system. It provides a convenient way to manage packages and create isolated environments for different projects. We highly recommend using Conda to create the environment for these tutorials.

#### 1.2.1 Install Conda

##### 1.2.1.1 Linux
For Linux users, you can install Conda (Miniconda) with:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash ./Miniconda3-latest-Linux-x86_64.sh
```

Then you could run:
```bash
conda init
```
and follow the output instructions to finish the Conda initialization.


##### 1.2.1.2 Native Windows
For native Windows users, download the Miniconda installer for your system [here](https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links).

After the installation, open "Anaconda Powershell Prompt (Miniconda3)" for the following steps.

##### 1.2.1.3 Windows with WSL
For WSL users, follow the same instructions as in section [1.2.1.1 Linux](#1211-linux).

> **Related Readings**
>
> For how to install WSL on Windows, refer to [this guide](https://bigdl.readthedocs.io/en/latest/doc/UserGuide/win.html#install-wsl2).
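If you are unsure which of the three subsections applies to you, the following heuristic sketch maps your platform to one of them. It relies on the assumption that WSL kernels include "microsoft" in their release string (for example `5.15.90.1-microsoft-standard-WSL2`), which is common but not guaranteed.

```python
import platform


def conda_install_section(system=None, release=None):
    """Map a platform to the matching subsection of 1.2.1 (heuristic, not exhaustive)."""
    system = system if system is not None else platform.system()
    release = release if release is not None else platform.release()
    if system == "Windows":
        return "1.2.1.2 Native Windows"
    # WSL reports "Linux" as the system; the kernel release string usually mentions Microsoft
    if system == "Linux" and "microsoft" in release.lower():
        return "1.2.1.3 Windows with WSL"
    if system == "Linux":
        return "1.2.1.1 Linux"
    return "not covered by this tutorial"


print(conda_install_section())
```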

#### 1.2.2 Create Environment
We suggest using Python 3.9 for BigDL-LLM. To create an environment with Python 3.9, run:
```
conda create -n llm-tutorial python=3.9
```
You can choose any name you prefer instead of `llm-tutorial`.

You can then activate the environment through:
```
conda activate llm-tutorial
```
and proceed with the installation of other packages.
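Before installing packages, you can confirm that the intended environment is active. This sketch reads `CONDA_DEFAULT_ENV`, which `conda activate` sets to the active environment's name, and compares the interpreter version with the suggested Python 3.9. The defaults assume you kept the name `llm-tutorial`.

```python
import os
import sys


def check_tutorial_env(expected_env="llm-tutorial", expected_python=(3, 9)):
    """Return a list of setup problems; an empty list means the environment looks as intended."""
    problems = []
    active_env = os.environ.get("CONDA_DEFAULT_ENV")  # set by `conda activate`
    if active_env != expected_env:
        problems.append(f"active Conda env is {active_env!r}, expected {expected_env!r}")
    found = tuple(sys.version_info[:2])
    wanted = tuple(expected_python)
    if found != wanted:
        problems.append(f"Python {found[0]}.{found[1]} found, {wanted[0]}.{wanted[1]} suggested")
    return problems


issues = check_tutorial_env()
for issue in issues:
    print("Warning:", issue)
if not issues:
    print("Environment looks good.")
```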

##### 1.2.2.1 Install Jupyter Notebook
The `notebook` package is required to run this tutorial:
```
pip install notebook
```

After installation, you could use the following command:
```
jupyter notebook
```
to open and run the tutorial notebook in a web browser.

## 2. BigDL-LLM Installation

Install BigDL-LLM with:

```
!pip install bigdl-llm[all]
```

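The `all` option installs the additional packages that BigDL-LLM requires. The leading `!` runs the command from a Jupyter notebook cell; omit it when installing from a terminal.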
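To confirm the installation, you can query installed package versions with the standard library's `importlib.metadata`. Besides `bigdl-llm` itself, this sketch checks `transformers` and `torch` (used in section 3, and assumed here to be pulled in by the `all` option) and `notebook`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if it is missing."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


for dist in ("bigdl-llm", "transformers", "torch", "notebook"):
    print(f"{dist}: {installed_version(dist) or 'NOT INSTALLED'}")
```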

## 3. Building an Application

We now have all the prerequisites to build an application that runs inference on a large language model with BigDL-LLM transformers INT4 optimization. BigDL-LLM offers a Transformers-style API, so users familiar with Hugging Face transformers get a unified and consistent experience.
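To see why INT4 matters on a 16GB machine, a back-of-the-envelope estimate of weight storage helps: each parameter takes `bits / 8` bytes. This sketch assumes a nominal 3 billion parameters for `open_llama_3b` and counts weights only, ignoring quantization scales, activations, and the KV cache, so real memory usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes), counting weights only."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 3_000_000_000  # assumption: nominal size of a "3B" model

for name, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions the estimate drops from about 12 GB at FP32 to about 1.5 GB at INT4.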

### 3.1 Load Model

The first step is to import BigDL-LLM and load the large language model with INT4 optimization:


```python
from bigdl.llm.transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b",
                                             load_in_4bit=True)
```

> **Note**
>
> [`openlm-research/open_llama_3b`](https://huggingface.co/openlm-research/open_llama_3b) is the model ID of this pretrained model on [huggingface.co](https://huggingface.co). When you call `from_pretrained`, the model is downloaded to `~/.cache/huggingface` by default, then loaded and implicitly converted to the BigDL-LLM INT4 format. You can set the environment variable `HF_HOME` to change where the model is downloaded.
>
> If you have already downloaded the model, you can instead pass its local path as the `pretrained_model_name_or_path` parameter.
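If you keep a local copy of the model, a small helper can prefer that copy and fall back to the Hub model ID otherwise. The directory `./open_llama_3b` below is a hypothetical path; the returned value is what you would pass as `pretrained_model_name_or_path`.

```python
import os

HUB_ID = "openlm-research/open_llama_3b"


def resolve_model_source(local_dir=None, hub_id=HUB_ID):
    """Return local_dir if it is an existing directory, otherwise the Hugging Face Hub ID."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return hub_id


# "./open_llama_3b" is a hypothetical location; point it at your own download
model_path = resolve_model_source("./open_llama_3b")
print("Loading from:", model_path)
# model = AutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path=model_path, load_in_4bit=True)
```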


### 3.2 Load Tokenizer

The second step is to load the model's corresponding tokenizer:

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained(pretrained_model_name_or_path="openlm-research/open_llama_3b")
```

### 3.3 Conduct Inference

You can then run inference with the standard transformers API and measure its latency:

```python
import time
import torch

with torch.inference_mode():
    prompt = 'Q: What is LLM?\nA:'

    # tokenize the input prompt from string to token ids
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.perf_counter()
    # generate up to 32 new tokens based on the input token ids
    output = model.generate(input_ids,
                            max_new_tokens=32)
    end = time.perf_counter()
    # decode the output token ids (prompt followed by new tokens) to a string
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)

    print(f'Inference time: {end - st:.2f} s')
    print('-'*20, 'Prompt', '-'*20)
    print(prompt)
    print('-'*20, 'Output', '-'*20)
    print(output_str)
```
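For decoder-only models such as this one, `generate` returns the prompt tokens followed by the new tokens, which is why the printed output repeats the prompt. The sketch below separates the completion and derives a rough throughput figure. The token ids and timing in the example are made up, and because the timed region covers the whole `generate` call, the rate includes prompt processing as well as decoding.

```python
def summarize_generation(output_ids, prompt_len, elapsed_s):
    """Split one generated sequence into completion ids and compute new tokens per second.

    output_ids: token ids of one sequence, e.g. output[0].tolist()
    prompt_len: number of prompt tokens, e.g. input_ids.shape[1]
    elapsed_s: wall-clock seconds spent in generate()
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    completion_ids = list(output_ids[prompt_len:])
    tokens_per_s = len(completion_ids) / elapsed_s
    return completion_ids, tokens_per_s


# Made-up ids and timing standing in for real generate() results
ids, tps = summarize_generation([1, 50, 51, 52, 900, 901], prompt_len=4, elapsed_s=0.5)
print(ids, tps)  # [900, 901] 4.0
```

With the real objects you would call `summarize_generation(output[0].tolist(), input_ids.shape[1], end - st)` and decode only the completion with `tokenizer.decode(completion_ids, skip_special_tokens=True)`.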

> **Note**
>
> The `max_new_tokens` parameter of the `generate` function sets the maximum number of new tokens to generate.
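Conceptually, `max_new_tokens` caps a loop that predicts one token at a time and appends it to the sequence; generation can also stop earlier when the model emits an end-of-sequence token. The sketch below shows plain greedy decoding with a hypothetical `toy_next_token` function standing in for the model. It is not the transformers or BigDL-LLM implementation, which also handles sampling, batching, and caching.

```python
def greedy_generate(next_token, prompt_ids, max_new_tokens, eos_id=None):
    """Append at most max_new_tokens ids from next_token(ids), stopping early at eos_id."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = next_token(ids)
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids


# Hypothetical stand-in for a model: predicts previous id + 1, and EOS (id 0) after id 9
def toy_next_token(ids):
    return 0 if ids[-1] >= 9 else ids[-1] + 1


print(greedy_generate(toy_next_token, [5], max_new_tokens=3))             # [5, 6, 7, 8]
print(greedy_generate(toy_next_token, [5], max_new_tokens=32, eos_id=0))  # [5, 6, 7, 8, 9, 0]
```

In the second call the limit of 32 is never reached, because the end-of-sequence token stops the loop first.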


## 4. What's Next?

In the upcoming chapter, you will dive deeper into the usage of the BigDL-LLM Transformers-style API. The tutorials in the next chapter will also allow you to leverage the power of BigDL-LLM in different domains, such as speech recognition.