# Notebook 5.2: Baichuan-13B

## Overview

This example shows how to run Chinese inference with [Baichuan-13B](https://github.com/baichuan-inc/Baichuan-13B) on low-cost PCs (no discrete GPU required) using [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm) APIs. Baichuan-13B is an open-source large language model, available for commercial use, developed by Baichuan Intelligent Technology as the successor to [Baichuan-7B](https://github.com/baichuan-inc/baichuan-7B). The chat variant used in this notebook is also available among the [Hugging Face models](https://huggingface.co/models) as [Baichuan-13B-Chat](https://huggingface.co/baichuan-inc/Baichuan-13B-Chat).

## Environment Preparation

### Recommended System Requirements

When loading the model in 4-bit, BigDL-LLM converts the linear layers in the model to INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading and about 0.5*X* GB of memory for subsequent inference.

Please select the appropriate size of the Baichuan model based on the capabilities of your machine.
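
To make the rule of thumb above concrete, the sketch below turns it into a small helper. The function name `estimate_memory_gb` is hypothetical and not part of BigDL-LLM; the numbers are theoretical estimates only and ignore activations, the KV cache, and other runtime overhead.

```python
from typing import Tuple


def estimate_memory_gb(params_billion: float) -> Tuple[float, float]:
    """Return (load_gb, inference_gb) for a model with `params_billion` billion parameters.

    Hypothetical helper: applies the 2X / 0.5X rule of thumb only.
    """
    load_gb = 2.0 * params_billion       # 16-bit checkpoint: about 2 bytes per parameter
    inference_gb = 0.5 * params_billion  # INT4 weights: about 0.5 bytes per parameter
    return load_gb, inference_gb


for size in (7, 13):
    load_gb, inference_gb = estimate_memory_gb(size)
    print(f"{size}B model: ~{load_gb:.1f} GB to load, ~{inference_gb:.1f} GB for inference")
```

Under this estimate, the 13B model needs roughly 26 GB to load and 6.5 GB for inference, so a machine with less memory may be better served by the 7B model.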

### BigDL-LLM Installation

Install BigDL-LLM through:

In [None]:
! pip install bigdl-llm[all]

The `all` option installs the other packages that BigDL-LLM requires.

## Inference

### Create Prompt Template

In [1]:
BAICHUAN_PROMPT_FORMAT = "<human>{prompt} <bot>"

### Load Model and Tokenizer

In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "baichuan-inc/Baichuan-13B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)

### Save and Reload

In [3]:
# Save pretrained model
save_dir = "./baichuan/"
model.save_low_bit(save_dir)

# Load pretrained model again
model = AutoModelForCausalLM.load_low_bit(save_dir, trust_remote_code=True)
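
Because loading the 16-bit checkpoint is the memory-hungry step, it can be convenient to convert once and reuse the saved low-bit copy on later runs. The following is a minimal sketch of that pattern, not something BigDL-LLM provides: `load_or_convert` and its two callback parameters are hypothetical names, and the check assumes that a non-empty directory means an earlier `save_low_bit` call completed.

```python
import os
from typing import Any, Callable


def load_or_convert(save_dir: str,
                    load_saved: Callable[[str], Any],
                    convert_and_save: Callable[[str], Any]) -> Any:
    """Reuse a saved low-bit model if `save_dir` has files, otherwise convert once.

    Hypothetical helper: the callbacks wrap the load and save calls shown above.
    """
    if os.path.isdir(save_dir) and os.listdir(save_dir):
        return load_saved(save_dir)       # fast path: reuse the saved INT4 weights
    return convert_and_save(save_dir)     # slow path: load in 16-bit, convert, save


# Possible usage with the calls from this notebook (left commented out, since it
# needs the model files):
#
# def convert(save_dir):
#     m = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True,
#                                              trust_remote_code=True)
#     m.save_low_bit(save_dir)
#     return m
#
# model = load_or_convert(
#     "./baichuan/",
#     load_saved=lambda d: AutoModelForCausalLM.load_low_bit(d, trust_remote_code=True),
#     convert_and_save=convert,
# )
```

Passing the two steps in as callbacks keeps the helper independent of any particular model class.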

### Predict

In [4]:
import time
import torch

prompt = "AI是什么？"
n_predict = 128
with torch.inference_mode():
    prompt = BAICHUAN_PROMPT_FORMAT.format(prompt=prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    output = model.generate(input_ids,
                            max_new_tokens=n_predict)
    end = time.time()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Prompt', '-'*20)
    print(prompt)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

Inference time: 8.000465869903564 s
-------------------- Prompt --------------------
<human>AI是什么？ <bot>
-------------------- Output --------------------
<human>AI是什么？ <bot>人工智能(Artificial Intelligence，简称AI)是指由人制造出来的系统所表现出来的智能，通常是通过计算机系统实现的。这种智能来源于计算机程序和数据处理能力，可以模拟、扩展和辅助人类的认知功能。
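
The decoded output above repeats the prompt because, for a decoder-only model, `generate` typically returns the prompt tokens followed by the newly generated ones. The sketch below shows one way to keep only the new tokens and to turn the measured time into a rough throughput figure. The helper names `new_token_ids` and `tokens_per_second` are hypothetical, and the token IDs and timings in the example are placeholders, not values from the run above.

```python
from typing import List, Sequence


def new_token_ids(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Drop the first `prompt_len` IDs, which echo the prompt."""
    return list(output_ids[prompt_len:])


def tokens_per_second(num_new_tokens: int, elapsed_s: float) -> float:
    """Rough decoding throughput; includes prompt processing time."""
    return num_new_tokens / elapsed_s


# Placeholder data for illustration only.
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8, 9, 10]

generated = new_token_ids(output_ids, len(prompt_ids))
print(generated)
print(f"{tokens_per_second(len(generated), 0.5):.1f} tokens/s")
```

In this notebook, the same idea would presumably be `tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)`, with `output[0].shape[0] - input_ids.shape[1]` as the number of new tokens.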
