# Notebook 5.1: ChatGLM2

## Overview
This example uses [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B), the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) proposed by [THUDM](https://github.com/THUDM). ChatGLM2-6B is also available on Hugging Face at this [link](https://huggingface.co/THUDM/chatglm2-6b).

## Environment Requirements

### Installation

We suggest using conda to manage the environment:

```shell
conda create -n llm python=3.9
conda activate llm

pip install bigdl-llm[all] # install bigdl-llm with 'all' option
```

## Inference

### Create Prompt Template

In [1]:
# You can adjust the prompt to suit your own model.
# This template follows https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007
CHATGLM_V2_PROMPT_TEMPLATE = "问：{prompt}\n\n答："
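
As a quick illustration of how this template is used, the sketch below fills it with a question via `str.format`. The helper name `build_prompt` is hypothetical and appears here only for illustration.

```python
# Same single-turn template as defined above.
CHATGLM_V2_PROMPT_TEMPLATE = "问：{prompt}\n\n答："


def build_prompt(question: str) -> str:
    """Hypothetical helper: wrap a user question in the template."""
    return CHATGLM_V2_PROMPT_TEMPLATE.format(prompt=question)


formatted = build_prompt("AI是什么？")
print(formatted)
```

The model is expected to continue the text after the trailing `答：` marker, so the template should end exactly there.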

### Load Model

Load the model in 4-bit precision, which converts the relevant layers in the model to INT4 format.

In [3]:
from bigdl.llm.transformers import AutoModel

model_path = "THUDM/chatglm2-6b" # repo id or model path
model = AutoModel.from_pretrained(model_path,
                                  load_in_4bit=True,
                                  trust_remote_code=True)

Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00,  1.34s/it]
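
To give some intuition for what converting weights to INT4 means, here is a toy sketch of symmetric 4-bit quantization in plain Python. It is not BigDL-LLM's actual implementation, which this notebook does not describe. The function names and the one-scale-per-block scheme are assumptions made for illustration.

```python
def quantize_int4(weights):
    """Map floats to integers in [-8, 7] using one shared scale (toy scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int4(quantized, scale):
    """Recover approximate float values from the 4-bit integers."""
    return [q * scale for q in quantized]


block = [0.12, -0.70, 0.33, 0.06]  # made-up weights for illustration
q, scale = quantize_int4(block)
restored = dequantize_int4(q, scale)

# Each value fits in 4 bits; the price is a small rounding error.
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(q, max_error)
```

Each weight then needs only 4 bits plus a shared scale, which is why the quantized model uses far less memory than a 16-bit or 32-bit one.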


### Load Tokenizer

The quantized model can use the tokenizer provided by the Hugging Face Transformers library.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path,
                                          trust_remote_code=True)

### Generate predicted tokens

In [7]:
import time

import torch

prompt = "AI是什么？"
n_predict = 128

with torch.inference_mode():
    prompt = CHATGLM_V2_PROMPT_TEMPLATE.format(prompt=prompt)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    st = time.time()
    # if your selected model is capable of utilizing previous key/value attentions
    # to enhance decoding speed, but has `"use_cache": false` in its model config,
    # it is important to set `use_cache=True` explicitly in the `generate` function
    # to obtain optimal performance with BigDL-LLM INT4 optimizations
    output = model.generate(input_ids,
                            max_new_tokens=n_predict)
    end = time.time()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f'Inference time: {end-st} s')
    print('-'*20, 'Prompt', '-'*20)
    print(prompt)
    print('-'*20, 'Output', '-'*20)
    print(output_str)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
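
The first warning refers to the missing `attention_mask`. One way to address it, sketched below, is to call the tokenizer directly and forward the mask to `generate`. This assumes the tokenizer returns an `attention_mask` entry when called, as Hugging Face tokenizers normally do. `generate_with_attention_mask` is a hypothetical helper name.

```python
def generate_with_attention_mask(model, tokenizer, prompt, max_new_tokens=128):
    """Hypothetical helper: run generation with an explicit attention mask."""
    # Calling the tokenizer (instead of `tokenizer.encode`) returns a
    # dict-like object that holds `attention_mask` next to `input_ids`.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects from this notebook, it could be called as `generate_with_attention_mask(model, tokenizer, prompt, n_predict)` inside the `torch.inference_mode()` block. The `pad_token_id` message may still appear, since it is a separate notice.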


Inference time: 9.241551399230957 s
-------------------- Prompt --------------------
问：AI是什么？

答：
-------------------- Output --------------------
问：AI是什么？

答： AI指的是人工智能,是一种能够通过学习和推理来执行任务的计算机程序。它可以模仿人类的思维方式,做出类似人类的决策,并且具有自主学习、自我进化的能力。

AI 技术包括机器学习、深度学习、自然语言处理、计算机视觉、机器人技术等,可以应用于各种领域,如医疗、金融、制造业、军事、能源等。

AI 技术的发展已经带来了许多改变和进步,但同时也引起了人们的担忧和争议,涉及到隐私、安全、道德和社会影响等方面的问题。
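
The decoded output above repeats the prompt because `generate` returns the prompt tokens followed by the newly generated ones. The sketch below separates the two and estimates decoding throughput. It uses plain lists in place of tensors so that it runs without a model. The token ids and helper names are made up for illustration.

```python
def split_generated(output_ids, prompt_len):
    """Return only the tokens that were generated after the prompt."""
    return output_ids[prompt_len:]


def tokens_per_second(num_new_tokens, elapsed_seconds):
    """Rough decoding throughput; guards against a zero duration."""
    if elapsed_seconds <= 0:
        return float("inf")
    return num_new_tokens / elapsed_seconds


# Made-up token ids: the first 4 stand for the prompt, the rest for the answer.
prompt_ids = [101, 102, 103, 104]
output_ids = prompt_ids + [7, 8, 9]

new_ids = split_generated(output_ids, len(prompt_ids))
print(new_ids)
print(tokens_per_second(len(new_ids), 1.5))
```

With the tensors from the cell above, the equivalent slice would be `output[0][input_ids.shape[1]:]`, which can then be passed to `tokenizer.decode`.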


## Use in LangChain

### Create Prompt Template

In [None]:
CHATGLM_V2_PROMPT_TEMPLATE = """{history}\n\n问：{human_input}\n\n答："""
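
To see what this template produces once the chain fills in both variables, the sketch below formats it by hand with `str.format`. The history text is made up. Its `Human:`/`AI:` prefixes mirror the formatted prompt shown later in this notebook.

```python
CHATGLM_V2_PROMPT_TEMPLATE = """{history}\n\n问：{human_input}\n\n答："""

# Example history text, written in the "Human:"/"AI:" style that the
# conversation memory produces.
history = "Human: 你好\nAI: 你好！有什么可以帮您？"

filled = CHATGLM_V2_PROMPT_TEMPLATE.format(history=history,
                                           human_input="AI 是什么？")
print(filled)
```

On the first turn the history is empty, so the prompt simply starts with two blank lines before `问：`.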

### Prepare Chain

In [11]:
from langchain import LLMChain, PromptTemplate
from bigdl.llm.langchain.llms import TransformersLLM
from langchain.memory import ConversationBufferWindowMemory

llm_model_path = "THUDM/chatglm2-6b/" # path to the Hugging Face LLM (a hub repo id would omit the trailing slash)

prompt = PromptTemplate(input_variables=["history", "human_input"], template=CHATGLM_V2_PROMPT_TEMPLATE)
max_new_tokens = 128

llm = TransformersLLM.from_model_id(
        model_id=llm_model_path,
        model_kwargs={"trust_remote_code": True},
)

# The following code is exactly the same as in the original use case
voiceassitant_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    llm_kwargs={"max_new_tokens":max_new_tokens},
    memory=ConversationBufferWindowMemory(k=2),
)


Loading checkpoint shards: 100%|██████████| 7/7 [00:10<00:00,  1.46s/it]
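
The chain above uses `ConversationBufferWindowMemory(k=2)`, which keeps only the most recent exchanges in `{history}`. The toy class below imitates that windowing with a `deque`. It is a stand-in written for illustration, not LangChain's implementation, and the class and method names are hypothetical.

```python
from collections import deque


class WindowMemory:
    """Toy stand-in for a windowed conversation buffer (not LangChain code)."""

    def __init__(self, k):
        # A deque with maxlen=k drops the oldest exchange once it is full.
        self.turns = deque(maxlen=k)

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory = WindowMemory(k=2)
memory.save("q1", "a1")
memory.save("q2", "a2")
memory.save("q3", "a3")  # pushes out the first exchange
print(memory.history())
```

Bounding the history this way keeps the prompt from growing without limit over a long conversation.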


### Predict

In [14]:
text = "AI 是什么？"
response_text = voiceassitant_chain.predict(human_input=text,
                                            stop=["\n\n"])
print(response_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


> Entering new LLMChain chain...
Prompt after formatting:
Human: AI 是什么？
AI:  AI指的是人工智能,是一种能够通过学习和理解数据,以及应用适当的算法和数学模型,来执行与人类智能相似的任务的技术。AI可以包括机器学习、自然语言处理、计算机视觉、知识表示、推理、决策等多种技术。
Human: 小奥 是什么？
AI:  我不知道 "小奥" 是什么,因为我没有上下文。如果您能提供更多信息,我会尽力回答您的问题。

问：AI 是什么？

答：

> Finished chain.
 AI指的是人工智能,是一种能够通过学习和理解数据,以及应用适当的算法和数学模型,来执行与人类智能相似的任务的技术。AI可以包括机器学习、自然语言处理、计算机视觉、知识表示、推理、决策等多种技术。
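
The `stop` argument passed to `predict` asks for the generated text to be cut at a stop sequence, here a blank line. A minimal sketch of that idea follows. `truncate_at_stop` is a hypothetical helper, and LangChain's own handling may differ in detail.

```python
def truncate_at_stop(text, stop):
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for sequence in stop:
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


raw = " first paragraph\n\nsecond paragraph"
print(repr(truncate_at_stop(raw, ["\n\n"])))
```

This is why the response printed above contains a single paragraph even though `max_new_tokens` would allow a longer answer.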
