# Quantization using SYCL backend on AI PC

## Introduction

This notebook demonstrates how to quantize a model on a Windows AI PC with Intel GPUs. It applies to the integrated GPUs (iGPUs) of Intel Core Ultra and 11th to 14th Gen Intel Core processors, as well as Intel Arc Series GPUs.

## What is an AI PC

What is an AI PC, you ask?

Here is an [explanation](https://www.intel.com/content/www/us/en/newsroom/news/what-is-an-ai-pc.htm#gs.a55so1) from Intel:

“An AI PC has a CPU, a GPU and an NPU, each with specific AI acceleration capabilities. An NPU, or neural processing unit, is a specialized accelerator that handles artificial intelligence (AI) and machine learning (ML) tasks right on your PC instead of sending data to be processed in the cloud. The GPU and CPU can also process these workloads, but the NPU is especially good at low-power AI calculations. The AI PC represents a fundamental shift in how our computers operate. It is not a solution for a problem that didn’t exist before. Instead, it promises to be a huge improvement for everyday PC usages.”

## Install Prerequisites

### Step 1: System Preparation

To set up your AI PC for running with Intel iGPUs, follow these essential steps:

1. Update Intel GPU drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download them directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop Development with C++” workload, is required. This prepares your environment for the C++-based extensions used by the Intel SYCL backend that accelerates llama.cpp. You can download VS 2022 Community edition from the official site, [here](https://visualstudio.microsoft.com/downloads/).

3. Install conda-forge (Miniforge): conda-forge will manage your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) to install it for Windows.

4. Install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html)

   

## Step 2: Install llama.cpp for SYCL
The llama.cpp SYCL backend is designed primarily for Intel GPUs and builds on the cross-platform nature of SYCL.

### After installing conda-forge, open the Miniforge Prompt and create a new Python environment
```
conda create -n llm-sycl python=3.11
```

### Activate the new environment
```
conda activate llm-sycl
```

<img src="Assets/llm4.png">

### With the llm-sycl environment active, enable the oneAPI environment
Type `oneapi` in Windows search, then open the "Intel oneAPI command prompt for Intel 64 for Visual Studio 2022" app.

<img src="Assets/oneapi1.png">

#### Run the command below in the oneAPI command prompt to list the available SYCL devices
One or more Level Zero GPU devices should be displayed as `ext_oneapi_level_zero:gpu`.

```
sycl-ls
```

<img src="Assets/oneapi2.png">
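
If you prefer to check for a usable GPU from Python rather than by reading the console, the filtering step can be sketched as below. This is a minimal sketch, not part of the original notebook: the helper names `level_zero_gpu_lines` and `list_level_zero_gpus` are hypothetical, and it simply assumes that Level Zero GPU entries contain the text `level_zero:gpu`, as in the `ext_oneapi_level_zero:gpu` label mentioned above.

```python
import shutil
import subprocess


def level_zero_gpu_lines(sycl_ls_output: str) -> list[str]:
    """Return the lines of `sycl-ls` output that mention a Level Zero GPU device."""
    return [
        line.strip()
        for line in sycl_ls_output.splitlines()
        if "level_zero:gpu" in line
    ]


def list_level_zero_gpus() -> list[str]:
    """Run `sycl-ls` and keep only the Level Zero GPU lines.

    Assumes `sycl-ls` is on PATH, which is normally only the case inside
    a oneAPI command prompt (or after running setvars.bat).
    """
    if shutil.which("sycl-ls") is None:
        raise RuntimeError("sycl-ls not found; open the Intel oneAPI command prompt first")
    result = subprocess.run(["sycl-ls"], capture_output=True, text=True, check=True)
    return level_zero_gpu_lines(result.stdout)
```

Calling `list_level_zero_gpus()` from the oneAPI prompt should return at least one entry on a correctly configured machine; an empty list suggests the GPU driver or the oneAPI environment needs attention.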

### Install build tools

* Download and install [CMake for Windows](https://cmake.org/download/).
* Recent versions of Visual Studio install Ninja by default. If yours did not, install it manually from https://ninja-build.org/.

### Install llama.cpp

* Clone the llama.cpp repository:
  
  ```
  git clone https://github.com/ggerganov/llama.cpp.git
  ```
  
* In the oneAPI command prompt, change into the llama.cpp main directory and run the following:
  
  ```
  @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

  rem Option 1: Use FP32 (recommended for better performance in most cases)
  cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release

  rem Option 2: Or FP16
  cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DGGML_SYCL_F16=ON

  rem Build using whichever option was configured last
  cmake --build build --config Release -j
  ```

### A simple example of running a community GGUF model with llama.cpp for SYCL
* Download a GGUF model from Hugging Face and place it where llama.cpp can read it (the inference command in this section expects it under `models\`).
* Open the Miniforge Prompt, activate the llm-sycl environment, and enable the oneAPI environment:

  ```
  "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64  
  ```
* List the SYCL devices:

  ```
  build\bin\llama-ls-sycl-device.exe
  ```
* Run inference:
```
build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm none -mg 0
```

<img src="Assets/cmd1.png">

### Below is an example output

<img src="Assets/out1.png">



## Run the inference

In [None]:
! ..\git_llamacpp\llama.cpp\build\bin\llama-cli.exe -m Qwen1.5-4B.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 25 -s 0 -sm none -mg 0
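
The `llama-cli` command above packs several short flags into one line. As a reading aid, the sketch below builds the same argument list in Python and annotates each flag. The helper name `build_llama_cli_command` and its parameter names are hypothetical and not part of llama.cpp; only the flags themselves come from the command shown above.

```python
from pathlib import Path


def build_llama_cli_command(
    llama_cli: Path,
    model_file: Path,
    prompt: str,
    n_predict: int = 400,
    gpu_layers: int = 25,
    seed: int = 0,
    main_gpu: int = 0,
) -> list[str]:
    """Build the llama-cli argument list used in this notebook."""
    return [
        str(llama_cli),
        "-m", str(model_file),     # path to the GGUF model file
        "-p", prompt,              # prompt text
        "-n", str(n_predict),      # number of tokens to generate
        "-e",                      # process escape sequences such as \n in the prompt
        "-ngl", str(gpu_layers),   # number of model layers to offload to the GPU
        "-s", str(seed),           # random seed, for repeatable sampling
        "-sm", "none",             # split mode: do not split the model across GPUs
        "-mg", str(main_gpu),      # index of the GPU to use
    ]


command = build_llama_cli_command(
    Path(r"..\git_llamacpp\llama.cpp\build\bin\llama-cli.exe"),
    Path("Qwen1.5-4B.Q4_K_M.gguf"),
    r"Building a website can be done in 10 simple steps:\nStep 1:",
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command)` instead of using a `!` shell line. Lowering `gpu_layers` is a common way to fit a larger model when GPU memory is tight, at the cost of running the remaining layers on the CPU.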

## Quantization of the Models on AI PC

* Quantization reduces the precision of the model's parameters (e.g., from 32-bit floating point to 8-bit or 4-bit integers), decreasing the model size and often speeding up inference with minimal impact on accuracy.

* With 4-bit quantization, each value is stored in only 4 bits, which significantly reduces the amount of data that must be stored and processed. The result is lower memory usage and faster processing, both of which are particularly beneficial when deploying models on AI PCs.

* Additionally, 4-bit quantization can lower power consumption, making it an attractive option for AI PCs with GPUs and NPUs.

* **llama-3-8b-instruct** - Llama 3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. The Llama 3 instruction-tuned models are optimized for dialogue use cases and outperform many of the available open source chat models on common industry benchmarks. More details about the model can be found in the [Meta blog post](https://ai.meta.com/blog/meta-llama-3/), [model website](https://llama.meta.com/llama3) and [model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct).
>**Note**: To run this model with the demo, you need to accept its license agreement.
>You must be a registered user on the 🤗 Hugging Face Hub. Please visit the [Hugging Face model card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), carefully read the terms of usage, and click the accept button. You will need an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).

* **llama-2-7b-chat** - Llama 2 is the second generation of Llama models developed by Meta. Llama 2 is a collection of pre-trained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. llama-2-7b-chat is the 7-billion-parameter version of Llama 2, fine-tuned and optimized for dialogue use cases. More details about the model can be found in the [paper](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/), [repository](https://github.com/facebookresearch/llama) and [Hugging Face model card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf).
>**Note**: To run this model with the demo, you need to accept its license agreement.
>You must be a registered user on the 🤗 Hugging Face Hub. Please visit the [Hugging Face model card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), carefully read the terms of usage, and click the accept button. You will need an access token for the code below to run. For more information on access tokens, refer to [this section of the documentation](https://huggingface.co/docs/hub/security-tokens).
>You can log in to the Hugging Face Hub from a notebook environment using the following code:
 
```python
# Log in to the Hugging Face Hub to get access to the pretrained model
from huggingface_hub import notebook_login, whoami

try:
    whoami()
    print('Authorization token already provided')
except OSError:
    notebook_login()
```

* **phi3-mini-instruct** - Phi-3-Mini is a lightweight, state-of-the-art open model with 3.8B parameters. It was trained on the Phi-3 datasets, which include both synthetic data and filtered publicly available website data, with a focus on high-quality and reasoning-dense properties. More details about the model can be found in the [model card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct), [Microsoft blog](https://aka.ms/phi3blog-april) and [technical report](https://aka.ms/phi3-tech-report).
* **qwen2-1.5b-instruct/qwen2-7b-instruct** - Qwen2 is the new series of Qwen large language models. Compared with state-of-the-art open source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.
For more details, please refer to the [model card](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [blog](https://qwenlm.github.io/blog/qwen2/), [GitHub](https://github.com/QwenLM/Qwen2), and [documentation](https://qwen.readthedocs.io/en/latest/).

* **neural-chat-7b-v3-1** - A Mistral-7B model fine-tuned using Intel Gaudi. The model was fine-tuned on the open source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and aligned with the [Direct Preference Optimization (DPO) algorithm](https://arxiv.org/abs/2305.18290). More details can be found in the [model card](https://huggingface.co/Intel/neural-chat-7b-v3-1) and [blog post](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).
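
To make the size reduction from quantization concrete for the models listed above, the sketch below estimates how much storage the weights alone need at a given precision. It is a back-of-the-envelope calculation, not a measurement: the function name `estimated_weight_size_gb` is hypothetical, the parameter counts are the rounded figures from the model names and descriptions, and real GGUF files are somewhat larger because formats such as Q4_K_M also store scaling factors and metadata.

```python
def estimated_weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Rounded parameter counts taken from the model names and descriptions above.
approx_params = {
    "phi3-mini-instruct": 3.8e9,
    "llama-2-7b-chat": 7e9,
    "llama-3-8b-instruct": 8e9,
}

for name, n_params in approx_params.items():
    fp16 = estimated_weight_size_gb(n_params, 16)
    int4 = estimated_weight_size_gb(n_params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at a nominal 4-bit")
```

For a 7-billion-parameter model this gives roughly 14 GB at 16-bit versus 3.5 GB at a nominal 4 bits per weight, which is the difference between exceeding and fitting within the memory typically available to an iGPU.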

In [None]:
from huggingface_hub import notebook_login, whoami
try:
    whoami()
    print('Authorization token already provided')
except OSError:
    notebook_login()

In [None]:
!pip install ipywidgets

In [None]:
import ipywidgets as widgets

model = widgets.Dropdown(
    options=['phi3-mini-instruct', 'llama-2-7b-chat', 'qwen2-1.5b-instruct', 'llama-3-8b-instruct', 'neural-chat-7b-v3-1' ],
    value='llama-3-8b-instruct',  # Default value
    description="Select Model:",
    disabled=False,
)

model

In [None]:
model_id = "microsoft/Phi-3-mini-4k-instruct"
model_path = "./phi3/"

if model.value == "phi3-mini-instruct":
    model_id = "microsoft/Phi-3-mini-4k-instruct"
    model_path = "./phi3/"
    model_fp16 = "Phi-3-mini-4k-instruct.Fp16.gguf"
    model_gguf = "Phi-3-mini-4k-instruct.Q4_K_M.gguf"
elif model.value == "llama-2-7b-chat":
    model_id = "meta-llama/Llama-2-7b-chat-hf"
    model_fp16 = "Llama-2-7b-chat-hf.Fp16.gguf"
    model_path = "./llama2/"
    model_gguf = "Llama-2-7b-chat-hf.Q4_K_M.gguf"
elif model.value == "llama-3-8b-instruct":
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model_fp16 = "llama-3-8b-instruct.Fp16.gguf"
    model_path = "./llama3/"
    model_gguf = "llama-3-8b-instruct.Q4_K_M.gguf"
elif model.value == "qwen2-1.5b-instruct":
    model_id = "Qwen/Qwen2-1.5B-Instruct"
    model_fp16 = "Qwen2-1.5B-Instruct.Fp16.gguf"
    model_path = "./Qwen/"
    model_gguf = "Qwen2-1.5B-Instruct.Q4_K_M.gguf"
elif model.value == "neural-chat-7b-v3-1":
    model_id = "Intel/neural-chat-7b-v3-1"
    model_fp16 = "neural-chat-7b-v3-1.Fp16.gguf"
    model_path = "./Intel_neural_chat/"
    model_gguf = "neural-chat-7b-v3-1.Q4_K_M.gguf"
else:
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model_fp16 = "llama-3-8b-instruct.Fp16.gguf"
    model_path = "./llama3/"
    model_gguf = "llama-3-8b-instruct.Q4_K_M.gguf"

In [None]:
print(f"Selected model: {model.value}")
print(f"Hugging Face model ID: {model_id}")

### Initialize oneAPI environment

In [None]:
!@call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force
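
A `!` line in a notebook runs in a separate child process, so environment variables set by `setvars.bat` there are generally not visible to later cells. One possible workaround, sketched below under that assumption, is to run `setvars.bat` followed by the Windows `set` command in a single shell invocation and copy the printed variables into `os.environ`. The helper names `parse_env_block` and `load_oneapi_env` are hypothetical, and the default path is the one used in the cell above.

```python
import os
import subprocess

# Default install location, as used elsewhere in this notebook.
SETVARS = r"C:\Program Files (x86)\Intel\oneAPI\setvars.bat"


def parse_env_block(text: str) -> dict[str, str]:
    """Parse the NAME=value lines printed by the Windows `set` command."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines without '=' and names that are empty
            env[name] = value
    return env


def load_oneapi_env(setvars: str = SETVARS) -> None:
    """Run setvars.bat in a child shell and import its environment (Windows only)."""
    completed = subprocess.run(
        f'"{setvars}" intel64 --force >nul && set',  # discard setvars output, then dump variables
        shell=True,
        capture_output=True,
        text=True,
        check=True,
    )
    os.environ.update(parse_env_block(completed.stdout))
```

After calling `load_oneapi_env()` once, processes started from later cells inherit the oneAPI paths. Launching Jupyter from an already initialized oneAPI command prompt achieves the same result without any code.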

### Download the model from Hugging Face to a local folder

In [None]:
from huggingface_hub import snapshot_download

In [None]:
snapshot_download(repo_id=model_id, local_dir=model_path)
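
The conversion and quantization cells later in this notebook write to `./converted_models` and `./quantized_models`. Whether the tools create missing folders themselves may depend on the llama.cpp version, so as a precaution the folders can be created up front. The helper name `ensure_output_dirs` is hypothetical; the folder names are the ones used by the later cells.

```python
from pathlib import Path


def ensure_output_dirs(base: Path = Path(".")) -> list[Path]:
    """Create the output folders used by the conversion and quantization steps."""
    dirs = [base / "converted_models", base / "quantized_models"]
    for directory in dirs:
        # exist_ok=True makes this safe to re-run when the folders already exist
        directory.mkdir(parents=True, exist_ok=True)
    return dirs


created = ensure_output_dirs()
print("Output folders ready:", ", ".join(str(p) for p in created))
```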

### Convert the model to GGUF format

In [None]:
import time
start_time = time.time()

In [None]:
!python ..\git_llamacpp\llama.cpp\convert-hf-to-gguf.py {model_path} --outtype f16 --outfile ./converted_models/{model_fp16}

end_time = time.time()
total_time = end_time - start_time
print(f"Model conversion time: {total_time} seconds")

### Quantize the model to 4-bit (Q4_K_M) format

In [None]:
! ..\git_llamacpp\llama.cpp\build\bin\llama-quantize.exe ./converted_models/{model_fp16} ./quantized_models/{model_gguf} Q4_K_M
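
To give an intuition for what 4-bit quantization does to the weights, the sketch below quantizes one small block of values with a single shared scale and then restores them. This is a deliberately simplified symmetric scheme for illustration only; it is not the Q4_K_M algorithm, which is more elaborate. The function names and the sample values are made up for this example.

```python
def quantize_block_4bit(values: list[float]) -> tuple[float, list[int]]:
    """Quantize one block to signed 4-bit integers in [-8, 7] with a shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # largest magnitude maps to +/-7
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale: float, quants: list[int]) -> list[float]:
    """Recover approximate float values from the 4-bit integers and the scale."""
    return [q * scale for q in quants]


block = [0.12, -0.5, 0.33, 0.7, -0.07, 0.0, 0.25, -0.61]  # made-up weights
scale, quants = quantize_block_4bit(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("4-bit values:", quants)
print(f"scale: {scale:.4f}, largest rounding error: {max_error:.4f}")
```

Each weight is replaced by a small integer plus a share of one floating-point scale per block, which is where the memory savings come from. The rounding error is bounded by half the scale, which is why accuracy usually degrades only slightly.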

### Run the Inference using the quantized model

In [None]:
import time
start_time = time.time()

In [None]:
! ..\git_llamacpp\llama.cpp\build\bin\llama-cli.exe -m ./quantized_models/{model_gguf} -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 100 -e -ngl 33 -s 0 -sm none -mg 0

end_time = time.time()
total_time = end_time - start_time
print(f"Model warmup and Inference time: {total_time} seconds")

### Upload the model to the Hugging Face Hub

In [None]:
from huggingface_hub import login
login()

In [None]:
from huggingface_hub import HfApi, get_token

# Authentication: reuse the token stored by `login()` or `huggingface-cli login`
token = get_token()
if token is None:
    raise ValueError("Hugging Face token not found. Please login using `huggingface-cli login`.")

# Define repository details
model_file_path = "./quantized_models/" + model_gguf  # Local path of the quantized GGUF file
model_file_name = model_gguf  # File name to use inside the repository
repo_name = model.value  # Repository name
organization = "Your org name"  # Change this to your Hugging Face username or organization
repo_url = f"{organization}/{repo_name}"  # Full repository ID: <namespace>/<name>

# Initialize HfApi to interact with Hugging Face Hub
api = HfApi()

# Create the repository under the chosen namespace; `exist_ok=True` makes this a no-op if it already exists
api.create_repo(repo_id=repo_url, token=token, private=True, exist_ok=True)  # `private=True` creates a private repository

# Upload the quantized model file to the repository
api.upload_file(
    path_or_fileobj=model_file_path,
    path_in_repo=model_file_name,
    repo_id=repo_url,
    repo_type="model",
    token=token,
)

print(f"Model file {model_file_name} successfully uploaded to Hugging Face at {repo_url}")

### Download the model from the Hugging Face Hub

In [None]:
from huggingface_hub import snapshot_download
snapshot_download(repo_id=repo_url, local_dir="./download_models/")


#### Run the inference locally on AI PC

In [None]:
! ..\git_llamacpp\llama.cpp\build\bin\llama-cli.exe -m ./download_models/{model_gguf} -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 100 -e -ngl 33 -s 0 -sm none -mg 0

## Example output

<img src="Assets/output_latest.png">

* Reference: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md