# Inference on AI PCs Using LlamaCPP Python

## Introduction  
This notebook demonstrates how to install llama-cpp-python with Intel GPU support and run LLM inference locally on an AI PC. It is optimized for Intel® Core™ Ultra processors, which combine a CPU, GPU, and NPU for efficient AI workloads.

### What is an AI PC?  

An AI PC is a next-generation computing platform equipped with a CPU, GPU, and NPU, each designed with specific AI acceleration capabilities.  

- **Fast Response (CPU)**  
  The central processing unit (CPU) is optimized for smaller, low-latency workloads, making it ideal for quick responses and general-purpose tasks.  

- **High Throughput (GPU)**  
  The graphics processing unit (GPU) excels at handling large-scale workloads that require high parallelism and throughput, making it suitable for tasks like deep learning and data processing.  

- **Power Efficiency (NPU)**  
  The neural processing unit (NPU) is designed for sustained, heavily-used AI workloads, delivering high efficiency and low power consumption for tasks like inference and machine learning.  

The AI PC represents a transformative shift in computing, enabling advanced AI applications and AI workflows to run seamlessly on local hardware. This innovation enhances everyday PC usage by delivering faster, more efficient AI experiences without relying on cloud resources.  

In this notebook, we’ll explore how to use the AI PC’s capabilities to perform LLM inference, showcasing the power of local AI acceleration for modern applications.  

## Install Prerequisites

### Step 1: System Preparation

To prepare your AI PC for inference on Intel integrated GPUs (iGPUs), complete the following steps:

1. Update Intel GPU drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download them directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once the drivers are installed, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022 with the “Desktop development with C++” workload is required. This prepares your environment for the C++ based extensions used by the Intel SYCL backend that accelerates llama.cpp. You can download VS 2022 Community edition from the official site, [here](https://visualstudio.microsoft.com/downloads/).

3. Install Miniforge (conda-forge): Miniforge manages your Python environments and dependencies, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) to download the Windows installer.

4. Install the [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html), which provides the `icx` compiler and the SYCL runtime used in the following steps.
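
Once the prerequisites above are installed, a quick check can confirm that their command-line tools are reachable. The following is a minimal sketch, not part of the original setup: the helper `find_missing_tools` and the list of executable names are assumptions made for illustration, and the tools are normally only on `PATH` inside the oneAPI command prompt.

```python
import shutil

# Executable names assumed for this sketch; adjust them to your installation.
REQUIRED_TOOLS = ["cl", "icx", "sycl-ls", "conda"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

A missing entry does not necessarily mean the installation failed; it may only mean the check was run from a prompt where the oneAPI or Visual Studio environment has not been initialized.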

   

## Step 2: Install llama-cpp-python for SYCL
The llama.cpp SYCL backend is designed primarily for Intel GPUs. Because SYCL is a cross-platform programming model, the backend is not tied to a single hardware vendor.

### After installing Miniforge, open the Miniforge Prompt and create a new Python environment
  ```
  conda create -n llm-sycl python=3.11

  ```

### Activate the new environment
```
conda activate llm-sycl

```

<img src="Assets/llm4.png">
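
To confirm from Python that the new environment is the active one, a small check like the following can help. It is a sketch under two assumptions: that `conda activate` has set the `CONDA_DEFAULT_ENV` variable, and that the hypothetical helper `describe_environment` is run from inside the activated prompt.

```python
import os
import sys


def describe_environment(environ=os.environ, version_info=sys.version_info):
    """Summarize the active conda environment and Python version."""
    return {
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),  # set by `conda activate`
        "python": f"{version_info[0]}.{version_info[1]}",
    }


info = describe_environment()
print(info)
if info["conda_env"] != "llm-sycl":
    print("Warning: the llm-sycl environment does not appear to be active.")
```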

### With the llm-sycl environment active, enable the oneAPI environment
Type `oneapi` in Windows search and open the **Intel oneAPI command prompt for Intel 64 for Visual Studio 2022** app.

<img src="Assets/oneapi1.png">

#### Run the command below in the oneAPI command prompt to list the available SYCL devices
The output should include one or more Level Zero GPU devices, shown as `ext_oneapi_level_zero:gpu` (newer oneAPI releases may print `level_zero:gpu` instead).

```
sycl-ls

```

<img src="Assets/oneapi2.png">
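
The same check can be scripted. The sketch below filters `sycl-ls` output for Level Zero GPU entries; the sample text is hypothetical (the exact device names and line format vary by driver and oneAPI version), and `level_zero_gpus` is an illustrative helper rather than part of any Intel tool.

```python
import re

# Hypothetical sample of `sycl-ls` output; real output differs per machine and version.
SAMPLE_OUTPUT = """\
[opencl:cpu:0] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 7 155H
[opencl:gpu:1] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) Graphics
"""


def level_zero_gpus(sycl_ls_output):
    """Return the lines of `sycl-ls` output that describe Level Zero GPU devices."""
    pattern = re.compile(r"level_zero:gpu")  # matches both old and new naming
    return [line.strip() for line in sycl_ls_output.splitlines() if pattern.search(line)]


gpus = level_zero_gpus(SAMPLE_OUTPUT)
print(f"Found {len(gpus)} Level Zero GPU device(s)")
for line in gpus:
    print(" ", line)
```

To inspect a real machine, the sample string could be replaced with `subprocess.run(["sycl-ls"], capture_output=True, text=True).stdout`, run from the oneAPI command prompt.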

### Install build tools

* Download and install [CMake for Windows](https://cmake.org/download/).
* Recent Visual Studio versions install Ninja by default. If Ninja is missing, install it manually from https://ninja-build.org/.

### Install llama.cpp Python

  
* In the oneAPI command prompt, initialize the oneAPI environment:

  ```
  @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
  ```

* In the same prompt, so that the oneAPI settings stay in effect, set the build environment variables and install the package:

  ```
  set CMAKE_GENERATOR=Ninja
  set CMAKE_C_COMPILER=cl
  set CMAKE_CXX_COMPILER=icx
  set CXX=icx
  set CC=cl
  set CMAKE_ARGS="-DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_CXX_COMPILER=icx -DCMAKE_C_COMPILER=cl"

  pip install llama-cpp-python -U --force-reinstall --no-cache-dir --verbose
  ```
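
After the build finishes, it can be useful to confirm that the package imports cleanly. The following is a minimal sketch: `check_llama_cpp` is a hypothetical helper, and it assumes the installed build exposes `llama_supports_gpu_offload()`; if that function is absent, the sketch reports `None` rather than failing.

```python
import importlib.util


def check_llama_cpp():
    """Report whether llama_cpp imports and, if available, whether GPU offload is supported."""
    empty = {"installed": False, "version": None, "gpu_offload": None}
    if importlib.util.find_spec("llama_cpp") is None:
        return empty
    try:
        import llama_cpp
    except Exception as exc:  # for example, a runtime DLL that cannot be loaded
        return {**empty, "error": str(exc)}
    # Assumption: the build exposes llama_supports_gpu_offload(); fall back to None if not.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    return {
        "installed": True,
        "version": getattr(llama_cpp, "__version__", None),
        "gpu_offload": bool(probe()) if callable(probe) else None,
    }


print(check_llama_cpp())
```

If `gpu_offload` is reported as `False`, the package was most likely built without the SYCL flags, and the install step is worth repeating from a prompt where `setvars.bat` has been run.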

### Simple example: run a community GGUF model with llama.cpp for SYCL
* Download the model from Hugging Face and place it where the inference code can load it.
* Run the model as shown in the following cells.
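
The download step can be sketched with the `huggingface_hub` package. The repository and file names below are the ones used elsewhere in this notebook; the helper names `local_model_path` and `download_model`, and the choice of a `models` folder, are assumptions made for illustration.

```python
from pathlib import Path

# Repository and file names taken from the examples in this notebook.
REPO_ID = "TheBloke/phi-2-GGUF"
FILENAME = "phi-2.Q5_K_M.gguf"


def local_model_path(models_dir="models", filename=FILENAME):
    """Return the path where the model file is expected to live."""
    return Path(models_dir) / filename


def download_model(models_dir="models", repo_id=REPO_ID, filename=FILENAME):
    """Download the model into `models_dir` unless it is already there."""
    target = local_model_path(models_dir, filename)
    if target.exists():
        return target
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename, local_dir=models_dir))


print(local_model_path())
```

Calling `download_model()` performs the actual download, which needs network access and several gigabytes of disk space; the block above only prints the expected location.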

```
!@call "C:\\Program Files (x86)\\Intel\\oneAPI\\setvars.bat" intel64 --force
```

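```python
from llama_cpp import Llama

prompt = "Write a story about Pandas"

# Plain-text template for the raw completion API; the chat API used later
# builds its own prompt from the message list.
prompt_template = f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
```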

## Run the inference

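```python
llm = Llama(
    model_path=r".\models\phi-2.Q5_K_M.gguf",  # raw string keeps the Windows backslashes literal
    chat_format="llama-2",
    n_gpu_layers=-1,  # offload all layers to the GPU
    seed=1337,  # fixed seed for reproducible output
    n_ctx=2048,  # context window size in tokens
    n_threads=16,  # CPU threads used for generation
    f16_kv=True,
)
```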
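
The `prompt_template` string defined earlier targets the plain completion interface rather than the chat interface. As a minimal sketch of how such a template might be used, the hypothetical helper below calls the `Llama` object directly and stops generation at the next `USER:` turn; the stop sequence and the `max_tokens` value are assumptions, not settings from the original notebook.

```python
def complete_with_template(llm, user_prompt, max_tokens=256):
    """Run a plain (non-chat) completion using a SYSTEM/USER/ASSISTANT template."""
    template = (
        "SYSTEM: You are a helpful, respectful and honest assistant.\n\n"
        f"USER: {user_prompt}\n\n"
        "ASSISTANT:\n"
    )
    # Calling the Llama object runs a text completion on the raw prompt string.
    response = llm(template, max_tokens=max_tokens, stop=["USER:"])
    return response["choices"][0]["text"]


# Usage, once `llm` has been created as shown above:
# print(complete_with_template(llm, "Write a story about Pandas"))
```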

The code below requests a chat completion, passing the input messages and setting `stream=True` so that the model returns text incrementally.
We then iterate over the generated chunks and print each piece of text as it arrives.

```python
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": prompt},
    ],
    stream=True,
)

for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end='', flush=True)
```
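
When the full text is needed after streaming (for logging or post-processing, say), the chunks can be accumulated while they are printed. The sketch below uses a hypothetical helper, `collect_stream`, and hand-written chunks that mimic the streamed structure so that it runs without a model.

```python
def collect_stream(chunks, echo=True):
    """Print streamed chat chunks as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        text = delta.get("content")  # role-only and final chunks carry no content
        if text:
            parts.append(text)
            if echo:
                print(text, end="", flush=True)
    return "".join(parts)


# Hand-written chunks that imitate the streaming format.
fake_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Once upon "}}]},
    {"choices": [{"delta": {"content": "a time"}}]},
    {"choices": [{"delta": {}}]},
]
story = collect_stream(fake_chunks, echo=False)
print(story)  # Once upon a time
```

With a real model, the call would look like `story = collect_stream(llm.create_chat_completion(messages=..., stream=True))`.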

## Pulling models from the Hugging Face Hub

The code below downloads a GGUF model from the Hugging Face Hub repository identified by `repo_id` and `filename`, then loads it with the given model parameters. `Llama.from_pretrained` requires the `huggingface-hub` package to be installed.

```python
from llama_cpp import Llama

prompt = "Write a story about Pandas"

llm = Llama.from_pretrained(
    repo_id="TheBloke/phi-2-GGUF",
    filename="*Q5_K_M.gguf",  # glob pattern matched against files in the repository
    chat_format="llama-2",
    n_gpu_layers=-1,  # offload all layers to the GPU
    seed=1337,  # fixed seed for reproducible output
    n_ctx=2048,  # context window size in tokens
    n_threads=16,  # CPU threads used for generation
    f16_kv=True,
)
```

As before, the code below requests a streaming chat completion and prints each chunk as it arrives. This time `max_tokens=256` caps the length of the response.

```python
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {"role": "user", "content": prompt},
    ],
    max_tokens=256,
    stream=True,
)

for chunk in output:
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end='', flush=True)
```
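
For comparison, the same request can be made without streaming, in which case the whole response arrives as one dictionary. The sketch below wraps that in a hypothetical helper, `chat_once`; it assumes the response follows the OpenAI-style layout and treats the `usage` field as optional.

```python
def chat_once(llm, user_prompt, max_tokens=256):
    """Request a complete (non-streamed) chat response; return its text and token count."""
    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a story writing assistant."},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=max_tokens,
        stream=False,
    )
    text = response["choices"][0]["message"]["content"]
    # Number of generated tokens, if the response reports it.
    tokens = response.get("usage", {}).get("completion_tokens")
    return text, tokens


# Usage, once `llm` has been loaded as shown above:
# text, tokens = chat_once(llm, "Write a story about Pandas")
```

Streaming gives faster visible feedback, while the non-streamed form is simpler when only the final text is needed.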

## Example output

<img src="Assets/output_latest.png">

* Reference: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md