# Running Large Language Models at the Edge
By using the Intel® oneAPI Toolkit and the Intel® BigDL-LLM Python package, customers can run LLMs on their workstations or personal PCs with optimized performance and latency.

## Verifying Python3 path and version
The BigDL-LLM package currently supports Python 3.9, so the following steps sanity-check the Python 3 path and version. Using Anaconda is also recommended, as it makes it easy to create a virtual environment with a specific Python version.

In [None]:
# !which python3

In [None]:
# !python3 --version
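
The same sanity check can also be done from inside Python. The following is a minimal sketch, assuming the Python 3.9 requirement stated above; the helper name `is_supported_python` is hypothetical and not part of BigDL-LLM.

```python
import sys

# Assumption: only Python 3.9 is supported, as stated in the text above.
REQUIRED = (3, 9)

def is_supported_python(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version matches the required one."""
    return tuple(version_info[:2]) == tuple(required)

print(f"Interpreter: {sys.executable}")
print(f"Version: {sys.version_info.major}.{sys.version_info.minor}")
print(f"Matches the assumed requirement: {is_supported_python()}")
```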

## Installing BigDL-LLM Package
Installing the BigDL-LLM package takes a single pip command. Use the variant that matches your hardware.

In [None]:
# CPU
# !python3 -m pip install -q bigdl-llm[all]

In [None]:
# Intel(R) Discrete GPU
# !python3 -m pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

## Specifying the LLM prompt format, model path and number of tokens to predict

In [None]:
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""

In [None]:
model_path = "./models/models--meta-llama--Llama-2-7b-chat-hf/snapshots/c1b0db933684edbfe29a06fa47eb19cc48025e93"
n_predict = 1000
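
To make the prompt template concrete, the sketch below fills `LLAMA2_PROMPT_FORMAT` with a sample question and prints the exact text that would be tokenized. The template is copied from the cell above; the variable name `filled` is only for illustration.

```python
# Same template as defined earlier in this notebook
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""

# str.format replaces the {prompt} placeholder with the user's text
filled = LLAMA2_PROMPT_FORMAT.format(prompt="How is your day today?")
print(filled)
```

The template ends right after the `### RESPONSE:` line, so the model's generated tokens continue from that point.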

## Running a LLAMA2-7B LLM Model

### Import the necessary functions and classes from the Python transformers library

In [None]:
import os
# Use ./models as the local Hugging Face cache directory
os.environ["TRANSFORMERS_CACHE"] = "./models"
# Offline mode: nothing is downloaded, so the model must already be present in the cache
os.environ["TRANSFORMERS_OFFLINE"] = "1"

import torch
import time

from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, TextIteratorStreamer, TextStreamer
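
Because offline mode is enabled, loading fails if the model snapshot is not already on disk. A quick pre-check can give a clearer error message. This is a minimal sketch: the helper `snapshot_is_present` is hypothetical, and it assumes the snapshot folder contains a `config.json` file, which is typical for Hugging Face model snapshots.

```python
import os

def snapshot_is_present(path, required_files=("config.json",)):
    """Return True if `path` is a directory containing every required file."""
    return os.path.isdir(path) and all(
        os.path.isfile(os.path.join(path, name)) for name in required_files
    )

# Hypothetical usage with the model_path defined earlier in this notebook:
# if not snapshot_is_present(model_path):
#     raise FileNotFoundError(f"No local model snapshot found at {model_path}")
```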

### Create the tokenizer and the output streamer

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True)
streamer = TextStreamer(tokenizer)

In [None]:
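def generate_output(prompt, tokenizer, model, streamer):
    """Format the prompt, stream the generated text and print the inference time."""
    with torch.inference_mode():
        # Wrap the user prompt in the template and convert it to token IDs
        _prompt = LLAMA2_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(_prompt, return_tensors="pt")
        start_time = time.time()
        # The streamer prints tokens as they are generated; n_predict caps the new tokens
        output = model.generate(input_ids, streamer=streamer, max_new_tokens=n_predict)
        elapsed_time = time.time() - start_time
        # Full decoded text (prompt plus response); not used further in this notebook
        output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Inference time: {round(elapsed_time, 2)} secs")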
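
The elapsed time alone depends on how long the response is, so a throughput figure can be easier to compare across runs. The helper below is a hypothetical sketch, not part of BigDL-LLM or transformers. It assumes a decoder-only model, where the sequence returned by `generate` contains the prompt tokens followed by the newly generated ones.

```python
def tokens_per_second(n_input_tokens, n_output_tokens, elapsed_seconds):
    """Return the generation throughput in new tokens per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    # The output sequence includes the prompt, so subtract the prompt length
    new_tokens = n_output_tokens - n_input_tokens
    return new_tokens / elapsed_seconds

# Example with made-up numbers: 10 prompt tokens, 110 tokens in total, 4 seconds
print(f"{tokens_per_second(10, 110, 4.0):.1f} tokens/sec")
```

Inside `generate_output`, the two counts would come from `input_ids.shape[1]` and the second dimension of the tensor returned by `model.generate`.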

### Create the model with Intel® Software Optimization

In [None]:
from bigdl.llm.transformers import AutoModelForCausalLM

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True)

In [None]:
prompt = "How is your day today?"
generate_output(prompt=prompt, tokenizer=tokenizer, model=model, streamer=streamer)

## Role prompting with the LLM

In [None]:
sys_msg = "As a retail store customer support agent, you are tasked with providing assistance and information to customers visiting your online store. Reply to the question below."
question = "How can I buy the item online?"
prompt = f"{sys_msg}\n### QUESTION: {question}\n"
generate_output(prompt=prompt, tokenizer=tokenizer, model=model, streamer=streamer)
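
To see what the model actually receives for a role prompt, the sketch below builds the combined text without running the model. The helper `build_role_prompt` is hypothetical and the short role and question strings are placeholders; the template is the one defined earlier in this notebook.

```python
# Same template as defined earlier in this notebook
LLAMA2_PROMPT_FORMAT = """### HUMAN:
{prompt}

### RESPONSE:
"""

def build_role_prompt(sys_msg, question):
    """Combine a role description and a question the same way as the cell above."""
    return f"{sys_msg}\n### QUESTION: {question}\n"

role_prompt = build_role_prompt("You are a support agent.", "How can I buy the item online?")
# This is the final text that generate_output would tokenize
full_text = LLAMA2_PROMPT_FORMAT.format(prompt=role_prompt)
print(full_text)
```

The role description and the question both end up inside the `### HUMAN:` section, so the role is conveyed as part of the user turn rather than as a separate system message.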

## Notices & Disclaimers 

Intel technologies may require enabled hardware, software or service activation. 

No product or component can be absolutely secure.  

Your costs and results may vary.  

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document, with the sole exception that code included in this document is licensed subject to the Zero-Clause BSD open source license (0BSD), Open Source Initiative. No rights are granted to create modifications or derivatives of this document. 

© Intel Corporation.  Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.  Other names and brands may be claimed as the property of others.  