# Running LlamaCPP Inference on AI PCs with Intel GPUs

## Introduction  

This notebook demonstrates how to run LLM inference locally on an AI PC. It is optimized for Intel® Core™ Ultra processors, utilizing the combined capabilities of the CPU, GPU, and NPU for efficient AI workloads. 

### What is an AI PC?  

An AI PC is a next-generation computing platform equipped with a CPU, GPU, and NPU, each designed with specific AI acceleration capabilities.  

- **Fast Response (CPU)**  
  The central processing unit (CPU) is optimized for smaller, low-latency workloads, making it ideal for quick responses and general-purpose tasks.  

- **High Throughput (GPU)**  
  The graphics processing unit (GPU) excels at handling large-scale workloads that require high parallelism and throughput, making it suitable for tasks like deep learning and data processing.  

- **Power Efficiency (NPU)**  
  The neural processing unit (NPU) is designed for sustained, heavily used AI workloads, delivering high efficiency and low power consumption for tasks like inference and machine learning.  

The AI PC represents a transformative shift in computing, enabling advanced AI applications to run seamlessly on local hardware. This innovation enhances everyday PC usage by delivering faster, more efficient AI experiences without relying on cloud resources.  

In this notebook, we’ll explore how to use the AI PC’s capabilities to perform LLM inference, showcasing the power of local AI acceleration for modern applications.  

## Install Prerequisites

### Step 1: System Preparation

To set up your AI PC for running inference on Intel iGPUs, follow these essential steps:

1. Update Intel GPU drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download them directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once the drivers are installed, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop Development with C++” workload, is required. This prepares your environment for the C++-based extensions used by the Intel SYCL backend that powers accelerated llama.cpp. You can download VS 2022 Community edition from the official site, [here](https://visualstudio.microsoft.com/downloads/).

3. Install conda-forge (Miniforge): The Miniforge installer from conda-forge manages your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) and install the Windows version.
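
As an optional sanity check after the steps above, the following minimal sketch reports which command-line tools are not reachable on `PATH`. The helper name `find_missing_tools` is hypothetical, and the assumption that `conda` resolves on `PATH` holds only inside the Miniforge Prompt (or another shell where conda has been initialized).

```python
import shutil


def find_missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Assumption: this is run from the Miniforge Prompt, where `conda` is on PATH.
missing = find_missing_tools(["conda"])
print("Missing tools:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the executables are discoverable; it does not verify driver versions or the Visual Studio workload.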

   

## Step 2: Set up the environment and install required libraries

### After installing conda-forge, open the Miniforge Prompt and create a new Python environment:
```
conda create -n llm-cpp python=3.11
```

### Activate the new environment
```
conda activate llm-cpp

```

<img src="Assets/llm4.png">
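
To confirm that the new environment is the one actually in use, a short Python check can help. The sketch below assumes conda has set the `CONDA_DEFAULT_ENV` variable for the active environment (which it normally does on activation); the helper name `describe_environment` is hypothetical.

```python
import os
import sys


def describe_environment(environ=None, version=None):
    """Return (conda environment name, 'major.minor' Python version)."""
    environ = os.environ if environ is None else environ
    version = sys.version_info if version is None else version
    # CONDA_DEFAULT_ENV is set by `conda activate`; fall back to a placeholder.
    env_name = environ.get("CONDA_DEFAULT_ENV", "<none>")
    return env_name, f"{version[0]}.{version[1]}"


env_name, py_version = describe_environment()
print(f"conda env: {env_name}, Python: {py_version}")
```

With the environment created above, the expected values would be `llm-cpp` and `3.11`.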

### With the llm-cpp environment active, use pip to install the required libraries

```
pip install --pre --upgrade ipex-llm[cpp]

```

<img src="Assets/llm5.png">
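
One way to check that the install succeeded is to query the package metadata from Python. This is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical, and the distribution name `ipex-llm` is taken from the pip command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("ipex-llm:", installed_version("ipex-llm") or "not installed")
```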

### Create llama-cpp directory

```
mkdir llama-cpp
cd llama-cpp

```

<img src="Assets/llm6.png">

### Run the following command with administrator privileges in the Miniforge Prompt. Many soft links to llama.cpp’s executable files should then appear in the current directory.
```
init-llama-cpp.bat

```

<img src="Assets/llm7.png">
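
To verify that the soft links were created, you could list them from Python. The sketch below is an illustration, not part of the ipex-llm tooling; the helper name `list_links` is hypothetical, and it assumes you run it from inside the `llama-cpp` directory.

```python
from pathlib import Path


def list_links(directory="."):
    """Return the sorted names of symbolic links directly inside `directory`."""
    return sorted(p.name for p in Path(directory).iterdir() if p.is_symlink())


links = list_links(".")
print(f"{len(links)} symbolic link(s) found")
for name in links:
    print(" ", name)
```

If the count is zero, the initialization command most likely ran without administrator privileges or in a different directory.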

### Set the following environment variables according to your device to use GPU acceleration
For Intel iGPU:
```
set SYCL_CACHE_PERSISTENT=1

```
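
Note that `set` only affects the current prompt session. If you launch llama.cpp from Python or a notebook instead, the variable has to be passed to the child process explicitly. A minimal sketch, assuming the same variable and value as above (the helper name `sycl_environment` is hypothetical):

```python
import os


def sycl_environment(base=None):
    """Return a copy of the environment with SYCL_CACHE_PERSISTENT enabled."""
    env = dict(os.environ if base is None else base)
    env["SYCL_CACHE_PERSISTENT"] = "1"
    return env


env = sycl_environment()
print("SYCL_CACHE_PERSISTENT =", env["SYCL_CACHE_PERSISTENT"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run`, leaving the parent process environment unchanged.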
### The following example shows how to run a community GGUF model
* Download a GGUF model (for example, `mistral-7b-instruct-v0.1.Q4_K_M.gguf`), then run it as shown below:

```
main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "What is AI" -t 8 -e -ngl 33 --color
```
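
The command above packs several options into one line. The sketch below rebuilds the same argument list in Python with each flag annotated; the helper name `build_llama_args` and its parameter names are hypothetical, while the flags and default values are the ones used in the command above.

```python
def build_llama_args(executable, model, prompt,
                     n_predict=32, threads=8, gpu_layers=33):
    """Build the llama.cpp command line as a list of arguments."""
    return [
        executable,
        "-m", model,               # path to the GGUF model file
        "-n", str(n_predict),      # number of tokens to generate
        "--prompt", prompt,        # prompt text
        "-t", str(threads),        # number of CPU threads
        "-e",                      # process escape sequences in the prompt
        "-ngl", str(gpu_layers),   # number of layers to offload to the GPU
        "--color",                 # colorize the output
    ]


args = build_llama_args("main", "mistral-7b-instruct-v0.1.Q4_K_M.gguf", "What is AI")
print(" ".join(args))
```

Building the command as a list avoids quoting problems when the prompt contains spaces and the list is handed to `subprocess.run`.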

<img src="Assets/llm8.png">

### Below is an example output

<img src="Assets/llm9.png">


<img src="Assets/llm10.png">




To run inference from a notebook cell, prefix the command with `!`:
! C:\workshop\llama-cpp\main.exe -m ../models/llama-2-7b-chat.Q5_K_M.gguf -n 100 --prompt "What is AI" -t 16 -ngl 999 --color -e 
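
As an alternative to the `!` shell escape, the same command can be launched with `subprocess`, which makes it easier to capture the output and to fail early when the model file is missing. This is a minimal sketch under stated assumptions: the helper name `run_command` is hypothetical, and the demonstration call runs the Python interpreter itself rather than `main.exe` so that it works on any machine.

```python
import subprocess
import sys
from pathlib import Path


def run_command(args, required_file=None):
    """Run `args`, returning (exit code, captured stdout).

    If `required_file` is given, raise FileNotFoundError before launching
    when that file (for example, the GGUF model) does not exist.
    """
    if required_file is not None and not Path(required_file).is_file():
        raise FileNotFoundError(f"Required file not found: {required_file}")
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


# Demonstration with a harmless command; replace `args` with the llama.cpp
# command line from the cell above to run the actual model.
code, output = run_command([sys.executable, "-c", "print('ok')"])
print(code, output.strip())
```

For the notebook command above, `args` would start with the path to `main.exe`, and `required_file` would be the path to the `.gguf` model.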

* Reference: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html