# IPEX-LLM

> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm/) is a low-bit LLM optimization library for Intel XPU (Xeon/Core/Flex/Arc/Max). By running models in low-bit formats, it aims to speed up LLM inference and reduce memory use on Intel platforms. It is open source under the Apache 2.0 license.

This example shows how to use LangChain with IPEX-LLM for text generation.


## Setup

In [None]:
# Update LangChain

%pip install -qU langchain langchain-community

Install IPEX-LLM to run LLMs locally on an Intel CPU.

In [None]:
%pip install --pre --upgrade ipex-llm[all]

## Usage

In [None]:
from langchain_community.llms import IpexLLM
from langchain_core.prompts import PromptTemplate

In [None]:
template = "USER: {question}\nASSISTANT:"
prompt = PromptTemplate(template=template, input_variables=["question"])
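
To see what text the model will actually receive, it can help to fill the template by hand. The sketch below uses plain `str.format` rather than LangChain itself, on the assumption that `PromptTemplate` is left at its default f-string style formatting, so the substitution of `{question}` behaves the same way.

```python
# Same template string as in the cell above.
template = "USER: {question}\nASSISTANT:"

# Plain-Python stand-in for prompt formatting: fill the {question} placeholder.
filled = template.format(question="What is AI?")

print(filled)
```

The model is expected to continue the text after `ASSISTANT:`, which is why the template ends there.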

Load the model:

In [None]:
llm = IpexLLM.from_model_id(
    model_id="lmsys/vicuna-7b-v1.5",
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)

Use it in Chains:

In [None]:
llm_chain = prompt | llm

question = "What is AI?"
output = llm_chain.invoke(question)

Save the low-bit model to disk so that it can be reloaded later:

In [None]:
saved_lowbit_model_path = "./vicuna-7b-1.5-low-bit"  # path to save low-bit model
llm.model.save_low_bit(saved_lowbit_model_path)
del llm

Load the model from the saved low-bit model path as follows.
> Note that the saved low-bit model path contains only the model itself, not the tokenizer. If you want everything in one place, you need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved.

In [None]:
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_id=saved_lowbit_model_path,
    tokenizer_id="lmsys/vicuna-7b-v1.5",
    # tokenizer_id=saved_lowbit_model_path,  # use this instead if you copied the tokenizer files to the saved path
    model_kwargs={"temperature": 0, "max_length": 64, "trust_remote_code": True},
)
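
If you would rather keep the tokenizer next to the low-bit weights, as the note above describes, the copy step can be scripted. The following is a minimal sketch, not part of IPEX-LLM or LangChain: `copy_tokenizer_files` is a hypothetical helper, and the file name patterns are an assumption about how the original model directory is laid out, so check them against your own local copy of the model.

```python
import shutil
from pathlib import Path

# Assumed tokenizer file names; adjust to match your model's directory.
TOKENIZER_PATTERNS = ("tokenizer*", "special_tokens_map.json", "added_tokens.json")


def copy_tokenizer_files(original_model_dir, lowbit_model_dir):
    """Copy tokenizer-related files into the low-bit model directory.

    Hypothetical helper for illustration only.
    Returns the sorted names of the files that were copied.
    """
    src = Path(original_model_dir)
    dst = Path(lowbit_model_dir)
    dst.mkdir(parents=True, exist_ok=True)

    copied = []
    for pattern in TOKENIZER_PATTERNS:
        for path in src.glob(pattern):
            if path.is_file():
                # copy2 keeps file metadata such as modification time.
                shutil.copy2(path, dst / path.name)
                copied.append(path.name)
    return sorted(copied)
```

After running the helper on a local copy of the original model directory, `tokenizer_id` can point at `saved_lowbit_model_path` instead of `"lmsys/vicuna-7b-v1.5"`.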

Use the loaded model in Chains:

In [None]:
llm_chain = prompt | llm_lowbit


question = "What is AI?"
output = llm_chain.invoke(question)
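
The `prompt | llm` expression used in both chains pipes the formatted prompt into the model. The sketch below illustrates that idea with a small stand-in class; `Step`, `fake_prompt`, and `fake_llm` are hypothetical names made up for this example, and this is not how LangChain implements its runnables.

```python
class Step:
    """Hypothetical stand-in for a chain component; not a LangChain class."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-in for the prompt template: wraps the question in the chat format.
fake_prompt = Step(lambda question: f"USER: {question}\nASSISTANT:")

# Stand-in for the model: appends a placeholder instead of generating text.
fake_llm = Step(lambda text: text + " (model output would go here)")

chain = fake_prompt | fake_llm
result = chain.invoke("What is AI?")
print(result)
```

With the real `llm_chain`, the value returned by `invoke` is the generated text, which you can inspect with `print(output)`.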