# Inference using PyTorch on Intel GPUs

## Introduction

This notebook demonstrates how to run LLM inference using PyTorch on Windows with Intel GPUs. It applies to the integrated GPUs (iGPUs) of Intel Core Ultra and 11th to 14th Gen Intel Core processors, as well as Intel Arc Series GPUs.

## What is an AI PC

What is an AI PC you ask?

Here is an [explanation](https://www.intel.com/content/www/us/en/newsroom/news/what-is-an-ai-pc.htm#gs.a55so1):

“An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud. The GPU and CPU can also process these workloads, but the NPU is especially good at low-power AI calculations. The AI PC represents a fundamental shift in how our computers operate. It is not a solution for a problem that didn’t exist before. Instead, it promises to be a huge improvement for everyday PC usages.”

## Install Prerequisites

### Step 1: System Preparation

To set up your AI PC for running with Intel iGPUs, follow these essential steps:

1. Update Intel GPU Drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download them directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop Development with C++” workload, is required. This prepares your environment for the C++-based extensions used by the Intel SYCL backend. You can download VS 2022 Community edition from the [official site](https://visualstudio.microsoft.com/downloads/).

3. Install conda-forge: conda-forge will manage your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) to download the Miniforge installer for Windows.

   

## Step 2: Set up the environment and install required libraries

### After installing conda-forge, open the Miniforge Prompt and create a new Python environment:
```
conda create -n llm python=3.11 libuv
```

### Activate the new environment

```
conda activate llm

```
<img src="Assets/conda_llm.png">



### With the llm environment active, use pip to install ipex-llm for GPU. 

* For US: `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/`
* For CN: `pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/`

<img src="Assets/llm12.png">
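
As a quick sanity check after the `pip install` step, the sketch below asks Python's standard `importlib.metadata` module for the installed version of a distribution. The helper name `installed_version` is ours and not part of ipex-llm; we assume the distribution is registered as `ipex-llm`, the name used in the install command above.

```python
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "ipex-llm" is the distribution name assumed from the pip command above.
    version = installed_version("ipex-llm")
    if version is None:
        print("ipex-llm is not installed in this environment")
    else:
        print(f"ipex-llm {version} is installed")
```

Run it inside the activated `llm` environment; a missing package usually means pip ran in a different environment.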

## Verify Installation
You can verify that ipex-llm is installed successfully by following the steps below.

### Open the Miniforge Prompt and activate the Python environment llm you previously created:
```
conda activate llm
```


### Set the following environment variables according to your device:
For Intel iGPU:

* `set SYCL_CACHE_PERSISTENT=1`
* `set BIGDL_LLM_XMX_DISABLED=1`
  
<img src="Assets/llm13.png">
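
The same variables can also be set from Python, which is what the Streamlit script later in this notebook does at the top of the file. Below is a minimal sketch, assuming the variables need to be in place before `ipex_llm` is imported; the helper name `configure_igpu_env` is hypothetical.

```python
import os

# Environment variables listed above for Intel iGPUs.
IGPU_ENV = {
    "SYCL_CACHE_PERSISTENT": "1",
    "BIGDL_LLM_XMX_DISABLED": "1",
}


def configure_igpu_env(overwrite: bool = False) -> dict:
    """Set the iGPU variables for this process and return the effective values."""
    for name, value in IGPU_ENV.items():
        # Keep a value the user already exported unless overwrite is requested.
        if overwrite or name not in os.environ:
            os.environ[name] = value
    return {name: os.environ[name] for name in IGPU_ENV}


# Call this before `import ipex_llm` (assumption: the runtime reads the values at import time).
print(configure_igpu_env())
```

Unlike `set` in the Miniforge Prompt, these values only affect the current Python process and its children.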



### Run Python Code
Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and pressing Enter.
Copy the following code into the Miniforge Prompt line by line, pressing Enter after each line.

```
import torch 
from ipex_llm.transformers import AutoModel,AutoModelForCausalLM    
tensor_1 = torch.randn(1, 1, 40, 128).to('xpu') 
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu') 
print(torch.matmul(tensor_1, tensor_2).size()) 

```

You should see the following at the end:

`torch.Size([1, 1, 40, 40])`
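
If you would rather run the check as a script than type it line by line, the sketch below wraps the same matmul in a function that returns `None` instead of raising when the XPU stack is missing. The function name `xpu_matmul_shape` is ours, and the use of `torch.xpu.is_available()` assumes the `xpu` device has been registered by the ipex-llm import, as in the steps above.

```python
def xpu_matmul_shape():
    """Return the matmul result shape computed on the XPU, or None if no XPU is usable."""
    try:
        import torch
        # Same import as in the interactive check; assumed to register the 'xpu' device.
        from ipex_llm.transformers import AutoModelForCausalLM  # noqa: F401
    except (ImportError, OSError):
        return None
    if not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        return None
    tensor_1 = torch.randn(1, 1, 40, 128).to("xpu")
    tensor_2 = torch.randn(1, 1, 128, 40).to("xpu")
    return tuple(torch.matmul(tensor_1, tensor_2).size())


shape = xpu_matmul_shape()
if shape is None:
    print("No usable XPU found; check the GPU driver and the ipex-llm[xpu] install")
else:
    print(shape)
```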

### Install these packages to run the code below
```
pip install tiktoken transformers_stream_generator einops

```
```
conda install -c conda-forge jupyter

```

## Code Walkthrough

Now let’s play with a real LLM. We’ll use Qwen-1.8B-Chat, a 1.8-billion-parameter LLM, for this demonstration.
Follow the steps below to set up and run the model, and observe how it responds to the prompt “What is AI?”.

The code snippet below uses the `AutoModelForCausalLM` class from `ipex_llm.transformers`, which mirrors the Hugging Face Transformers API, together with the Transformers `AutoTokenizer`.

Note: When running LLMs on Intel iGPUs with limited memory size, we recommend setting cpu_embedding=True in the from_pretrained function. This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU.

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig
```

`AutoModelForCausalLM` is a class that automatically selects the appropriate model architecture for causal language modeling based on the pre-trained model specified, and `AutoTokenizer` is a class that automatically selects the appropriate tokenizer.
We then initialize the tokenizer and the model using the `from_pretrained` method, which loads the pre-trained tokenizer and model.

```python
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                           trust_remote_code=True)

# Load the model with ipex-llm, applying 4-bit quantization;
# cpu_embedding=True keeps the embedding layer on the CPU
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat",
                                             load_in_4bit=True,
                                             cpu_embedding=True,
                                             trust_remote_code=True)
```

#### Load it to the GPU

```python
model = model.to('xpu')
```

We define a text prompt that the model will use as a starting point to generate text.

```python
question = "What is AI?"
prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
```
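
The prompt formatting can be factored into a small helper. The sketch below reuses the two templates that appear in this notebook (the generic `user:`/`assistant:` form used here and the Phi-3 form used in the Streamlit script further down); the function name `build_prompt` is hypothetical.

```python
def build_prompt(question: str, model_name: str = "Qwen/Qwen-1_8B-Chat") -> str:
    """Format a single-turn prompt using the templates that appear in this notebook."""
    if model_name.startswith("microsoft"):
        # Template used for the Phi-3 model in the Streamlit script.
        return f"<|user|>\n{question}<|end|>\n<|assistant|>"
    # Generic template used for Qwen in this walkthrough.
    return f"user: {question}\n\nassistant:"


print(repr(build_prompt("What is AI?")))
# prints 'user: What is AI?\n\nassistant:'
```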

* We use the tokenizer to encode the text prompt into token IDs that the model can understand. The `return_tensors="pt"` argument tells the tokenizer to return PyTorch tensors, which we then move to the `xpu` device.
* We use the model's `generate` method to generate a sequence of text based on the input prompt. The `max_new_tokens` argument caps the number of tokens generated beyond the prompt (32 here).
* `do_sample=False` selects greedy decoding, so the output is deterministic. Sampling arguments such as `temperature` (lower values make the output more deterministic and higher values make it more random) only take effect when `do_sample=True`, as in the Streamlit script later in this notebook.
* The `num_return_sequences` argument, which specifies the number of different sequences to generate, is not set here, so a single sequence is returned.
* We use the tokenizer's `decode` method to convert the generated sequence of tokens back into human-readable text.
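
To make the effect of temperature concrete, here is a small standard-library sketch that turns made-up token scores into probabilities at several temperatures and compares that with a greedy pick. It only illustrates the idea; it is not how `generate` is implemented, and the scores are invented.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always choose the index with the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
print("greedy choice:", greedy_pick(logits))
```

At the lowest temperature almost all of the probability mass sits on the top-scoring token, which is why low temperatures behave much like greedy decoding.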

```python
generation_config = GenerationConfig(use_cache=True)
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu')

    print('--------------------------------------Note-----------------------------------------')
    print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |')
    print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |')
    print('| Please be patient until it finishes warm-up...                                  |')
    print('-----------------------------------------------------------------------------------')

    # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks.
    # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience.
    output = model.generate(input_ids,
                            do_sample=False,
                            max_new_tokens=32,
                            generation_config=generation_config) # warm-up

    print('Successfully finished warm-up, now start generation...')

    output = model.generate(input_ids,
                            do_sample=False,
                            max_new_tokens=32,
                            generation_config=generation_config).cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
    print(output_str)
```
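
To see how much the warm-up run costs, you can time the first and second calls separately. The sketch below shows the timing pattern with a stand-in function so that it runs without a GPU; `timed` and `fake_generate` are hypothetical names, and the 0.05 s sleep merely imitates kernel compilation.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_generate(state):
    """Stand-in for model.generate: slow on the first call, fast afterwards."""
    if not state["warmed_up"]:
        time.sleep(0.05)  # pretend GPU kernels are compiling
        state["warmed_up"] = True
    return "done"


state = {"warmed_up": False}
_, first = timed(fake_generate, state)
_, second = timed(fake_generate, state)
print(f"first call: {first:.3f}s, second call: {second:.3f}s")
```

In the notebook you would pass `model.generate` and its arguments to `timed` in place of `fake_generate`.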

## Complete code snippet using Streamlit

### Install streamlit
```
pip install streamlit

```
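
The next cell writes the script to `src/chat.py`. As far as we know, the `%%writefile` magic does not create missing directories, so a minimal sketch for making sure `src` exists first is:

```python
from pathlib import Path

# Create the target directory for chat.py if it is missing.
src_dir = Path("src")
src_dir.mkdir(parents=True, exist_ok=True)
print(f"{src_dir.resolve()} is ready")
```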

```python
%%writefile src/chat.py
import os

os.environ["SYCL_CACHE_PERSISTENT"]="1"
os.environ["BIGDL_LLM_XMX_DISABLED"]="1"

import threading

import streamlit as st

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig, TextIteratorStreamer
import torch


MODEL_CACHE = {}


def save_model_thread(model, model_path):
    model.save_low_bit(model_path)
    print(f"Model saved to {model_path}")


def warmup_model(model, tokenizer):
    question = "Hello, how are you?"
    tokenizer.pad_token = tokenizer.eos_token
    if model.name_or_path.startswith("microsoft"):
        prompt = f"<|user|>\n{question}<|end|>\n<|assistant|>"
    else:
        prompt = "user: {prompt}\n\nassistant:".format(prompt=question)
    dummy_input = tokenizer(prompt, return_tensors="pt").to("xpu")
    generation_config = GenerationConfig(
        use_cache=True, top_k=50, top_p=0.95,
        temperature=0.7, do_sample=True,
    )
    _ = model.generate(**dummy_input, generation_config=generation_config)
    print("Model warmed up successfully!")


def load_model(model_name: str = "Qwen/Qwen-1_8B-Chat"):
    if model_name in MODEL_CACHE:
        return MODEL_CACHE[model_name]

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model_path = f"./model_local_cache/{model_name}"

    if os.path.exists(model_path):
        print(f"Loading model from {model_path}")
        model = AutoModelForCausalLM.load_low_bit(
            model_path, cpu_embedding=True, trust_remote_code=True
        )
    else:
        print(f"Loading model from {model_name}")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            load_in_4bit=True,
            cpu_embedding=True,
            trust_remote_code=True
        )
        save_model_thread(model, model_path)

    model = model.to("xpu")

    MODEL_CACHE[model_name] = (model, tokenizer)
    print("Model loaded successfully!")
    return model, tokenizer


def get_response(model, tokenizer, input_text: str):
    question = input_text
    tokenizer.pad_token = tokenizer.eos_token
    if model.name_or_path.startswith("microsoft"):
        prompt = f"<|user|>\n{question}<|end|>\n<|assistant|>"
    else:
        prompt = "user: {prompt}\n\nassistant:".format(prompt=question)

    with torch.inference_mode():
        input_ids = tokenizer(prompt, return_tensors="pt").to("xpu")
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=False, skip_special_tokens=True
        )

        generation_config = GenerationConfig(
            use_cache=True, top_k=50, top_p=0.95,
            temperature=0.7, do_sample=True,
        )

        kwargs = dict(
            input_ids,
            streamer=streamer,
            max_new_tokens=256,
            generation_config=generation_config,
        )
        thread = threading.Thread(target=model.generate, kwargs=kwargs)
        thread.start()
    return streamer


def main():
    if "model" not in st.session_state:
        st.session_state.model = None
    if "tokenizer" not in st.session_state:
        st.session_state.tokenizer = None

    st.header("Let's chat... 🐻‍❄️")
    selected_model = st.selectbox(
        "Please select a model", ("Qwen/Qwen-1_8B-Chat", "microsoft/Phi-3-mini-4k-instruct")
    )

    if st.button("Load Model"):
        with st.spinner("Loading..."):
            st.session_state.model, st.session_state.tokenizer = load_model(
                model_name=selected_model
            )
            if (
                st.session_state.model is not None
                and st.session_state.tokenizer is not None
            ):
                st.success("Model loaded successfully!")
                st.info("Warming up the model...")
                warmup_model(st.session_state.model, st.session_state.tokenizer)
                st.success("Model warmed up and ready to use!")
            else:
                st.error("Failed to load the model.")

    chat_container = st.container()
    with chat_container:
        st.subheader("Chat")
        input_text = st.text_input("Enter your input here...")
        if st.button("Generate"):
            if st.session_state.model is None or st.session_state.tokenizer is None:
                st.warning("Please load the model first.")
            else:
                with st.spinner("Running....🐎"):
                    streamer = get_response(
                        st.session_state.model, st.session_state.tokenizer, input_text
                    )
                    st.write_stream(streamer)


if __name__ == "__main__":
    main()
```
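
In the script, `get_response` starts `model.generate` on a background thread and returns the streamer right away, so Streamlit can render text as it arrives. The toy sketch below shows that producer/consumer pattern using only the standard library; `QueueStreamer` and `fake_generate` are illustrative stand-ins, not the `transformers` implementation.

```python
import queue
import threading


class QueueStreamer:
    """Toy iterator that yields text pieces pushed from another thread."""

    _END = object()  # sentinel marking the end of the stream

    def __init__(self):
        self._queue = queue.Queue()

    def put(self, text):
        self._queue.put(text)

    def end(self):
        self._queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._queue.get()  # blocks until the producer supplies something
        if item is self._END:
            raise StopIteration
        return item


def fake_generate(streamer, pieces):
    """Stand-in for model.generate: pushes pieces to the streamer, then closes it."""
    for piece in pieces:
        streamer.put(piece)
    streamer.end()


streamer = QueueStreamer()
worker = threading.Thread(target=fake_generate, args=(streamer, ["AI ", "is ", "fun."]))
worker.start()
text = "".join(streamer)  # the consumer reads while the worker produces
worker.join()
print(text)  # prints "AI is fun."
```

Because the consumer blocks on the queue rather than on the worker thread, it can display each piece as soon as it is produced.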

### Sample output stream

Below is a screenshot of the sample output, with the workload offloaded to the iGPU:

<img src="Assets/pytorch_st.png">

* Reference: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html

```
! streamlit run src/chat.py
```