# Visual-Language Assistant on AI PCs

## Introduction  
This notebook demonstrates how to install llama-cpp-python with Intel GPU support and run multimodal applications locally on an AI PC. It is optimized for Intel® Core™ Ultra processors, utilizing the combined capabilities of the CPU, GPU, and NPU for efficient AI workloads.

### What is an AI PC?  

An AI PC is a next-generation computing platform equipped with a CPU, GPU, and NPU, each designed with specific AI acceleration capabilities.  

- **Fast Response (CPU)**  
  The central processing unit (CPU) is optimized for smaller, low-latency workloads, making it ideal for quick responses and general-purpose tasks.  

- **High Throughput (GPU)**  
  The graphics processing unit (GPU) excels at handling large-scale workloads that require high parallelism and throughput, making it suitable for tasks like deep learning and data processing.  

- **Power Efficiency (NPU)**  
  The neural processing unit (NPU) is designed for sustained, heavily-used AI workloads, delivering high efficiency and low power consumption for tasks like inference and machine learning.  

The AI PC represents a transformative shift in computing, enabling advanced AI applications and AI workflows to run seamlessly on local hardware. This innovation enhances everyday PC usage by delivering faster, more efficient AI experiences without relying on cloud resources.  

In this notebook, we’ll explore how to use the AI PC’s capabilities to perform LLM inference, showcasing the power of local AI acceleration for modern applications.  

## Install Prerequisites

### Step 1: System Preparation

To set up your AI PC for running with Intel iGPUs, follow these essential steps:

1. Update Intel GPU Drivers: Ensure your system has the latest Intel GPU drivers, which are crucial for optimal performance and compatibility. You can download these directly from Intel's [official website](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). Once you have installed the official drivers, you can also install Intel Arc Control to monitor the GPU:

   <img src="Assets/gpu_arc_control.png">


2. Install Visual Studio 2022 Community edition with C++: Visual Studio 2022, along with the “Desktop Development with C++” workload, is required. This prepares your environment for the C++ based extensions used by the Intel SYCL backend that powers accelerated llama.cpp. You can download VS 2022 Community edition from the official site, [here](https://visualstudio.microsoft.com/downloads/).

3. Install Miniforge from conda-forge: Miniforge manages your Python environments and dependencies efficiently, providing a clean, minimal base for your Python setup. Visit conda-forge's [installation site](https://conda-forge.org/download/) to download the Windows installer.

4. Install [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html)

   

## Step 2: Install llama-cpp-python for SYCL
The llama.cpp SYCL backend primarily targets Intel GPUs and builds on SYCL's cross-platform design.

### After installing Miniforge, open the Miniforge Prompt and create a new Python environment:
  ```
  conda create -n llm-sycl python=3.11

  ```

### Activate the new environment
```
conda activate llm-sycl

```

<img src="Assets/llm4.png">

### With the llm-sycl environment active, enable the oneAPI environment
Type oneapi in the Windows search box, then open the Intel oneAPI command prompt for Intel 64 for Visual Studio 2022 app.

<img src="Assets/oneapi1.png">

#### Run the command below in the VS command prompt; the available SYCL devices should be listed in the console
The output should include one or more Level Zero GPU devices, shown as ext_oneapi_level_zero:gpu.

```
sycl-ls

```

<img src="Assets/oneapi2.png">
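The same check can be scripted. The following is a minimal sketch, assuming `sycl-ls` is on `PATH` in the oneAPI prompt and that Level Zero GPUs appear on lines containing `level_zero:gpu`. The helper names and the sample output are hypothetical and only illustrate the idea; real device names and versions will differ.

```python
import shutil
import subprocess


def level_zero_gpus(sycl_ls_output: str) -> list[str]:
    """Return the lines of `sycl-ls` output that describe a Level Zero GPU."""
    return [
        line.strip()
        for line in sycl_ls_output.splitlines()
        if "level_zero:gpu" in line
    ]


def query_sycl_ls() -> str:
    """Run `sycl-ls` if it is on PATH; return an empty string otherwise."""
    if shutil.which("sycl-ls") is None:
        return ""
    result = subprocess.run(["sycl-ls"], capture_output=True, text=True, check=False)
    return result.stdout


# Hypothetical output, for illustration only; it is not copied from a real machine.
sample_output = """\
[opencl:cpu:0] Example OpenCL CPU device
[opencl:gpu:1] Example OpenCL GPU device
[ext_oneapi_level_zero:gpu:0] Example Level Zero GPU device
"""

gpus = level_zero_gpus(sample_output)
print(f"Level Zero GPUs found: {len(gpus)}")
```

On a configured machine, `level_zero_gpus(query_sycl_ls())` should return at least one entry. An empty list suggests that the driver or the oneAPI environment is not set up yet.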

### Install build tools

* Download and install [CMake for Windows](https://cmake.org/download/).
* Recent versions of Visual Studio install Ninja by default. If yours did not, install it manually from https://ninja-build.org/.
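
Before building, it can help to confirm that the build tools are reachable from the current prompt. This is a small sketch using only the Python standard library. The helper `missing_tools` and the tuple `REQUIRED_TOOLS` are hypothetical names, and the tool list simply mirrors the tools named in this notebook (CMake, Ninja, and the `cl` and `icx` compilers).

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


# cmake and ninja drive the build; cl and icx are the C and C++ compilers.
REQUIRED_TOOLS = ("cmake", "ninja", "cl", "icx")

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All build tools found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.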

### Install llama.cpp Python

  
* In the oneAPI command prompt, run the following commands in a single session so that the variables set by `setvars.bat` remain in effect:

  ```
  rem Initialize the oneAPI environment
  @call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64 --force

  rem Set the environment variables
  set CMAKE_GENERATOR=Ninja
  set CMAKE_C_COMPILER=cl
  set CMAKE_CXX_COMPILER=icx
  set CXX=icx
  set CC=cl
  set CMAKE_ARGS="-DGGML_SYCL=ON -DGGML_SYCL_F16=ON -DCMAKE_CXX_COMPILER=icx -DCMAKE_C_COMPILER=cl"

  rem Install the llama-cpp-python bindings
  pip install llama-cpp-python -U --force-reinstall --no-cache-dir --verbose
  ```
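
After the install finishes, a quick check from Python can confirm that the package is visible in the active environment. This is a minimal sketch. The helper `is_importable` is a hypothetical name, and the check only shows that the `llama_cpp` module can be located, not that the SYCL backend was compiled in.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("llama_cpp"):
    print("llama_cpp is installed in this environment.")
else:
    print("llama_cpp was not found; check the pip output for build errors.")
```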

### A simple example of running a community GGUF model with llama.cpp for SYCL
* Download the model from Hugging Face and prepare it for inference.
* Run the model as shown below.

## Pulling models from Hugging Face Hub

The code below loads the pre-trained Moondream2 model from the Hugging Face repository identified by the repository ID and filename pattern, together with the other model parameters.

### Initialize oneAPI environment

In [None]:
!@call "C:\\Program Files (x86)\\Intel\\oneAPI\\setvars.bat" intel64 --force

In [None]:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler

# Load the multimodal projector through the Moondream chat handler
chat_handler = MoondreamChatHandler.from_pretrained(
    repo_id="vikhyatk/moondream2",  # Hugging Face repository ID
    filename="*mmproj*",  # Glob pattern matching the multimodal projector file
)

# Load the text model and attach the chat handler
llm = Llama.from_pretrained(
    repo_id="vikhyatk/moondream2",  # Hugging Face repository ID
    filename="*text-model*",  # Glob pattern matching the text model file
    chat_handler=chat_handler,  # Formats prompts and embeds images
    n_gpu_layers=-1,  # Offload all layers to the GPU
    seed=1337,  # Fixed seed for reproducibility
    n_ctx=2048,  # Context window, sized to leave room for the image embedding
    n_threads=16,  # Number of CPU threads to use
)
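
The `filename` arguments in the cell above are wildcard patterns rather than exact file names. The sketch below illustrates how such a pattern picks a file out of a repository listing, assuming the patterns follow the rules of Python's standard `fnmatch` module. The file names and the helper `match_files` are hypothetical and are not part of llama-cpp-python.

```python
from fnmatch import fnmatch

# Hypothetical repository listing, for illustration only.
repo_files = [
    "README.md",
    "example-mmproj-f16.gguf",
    "example-text-model-f16.gguf",
]


def match_files(pattern: str, files: list[str]) -> list[str]:
    """Return the file names that match a glob-style pattern."""
    return [name for name in files if fnmatch(name, pattern)]


print(match_files("*mmproj*", repo_files))
print(match_files("*text-model*", repo_files))
```

A pattern is only useful when it matches exactly one file, so it is worth checking the repository listing if a download fails or picks an unexpected file.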


The code below creates a chat completion request that specifies the input messages and asks the model to stream its output.
We then iterate over the generated chunks and print the text as it arrives.

In [None]:
# Create a chat completion request with a user message
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",  # Role of the message sender
            "content": [
                {"type": "text", "text": "What is unusual in this picture?"},  # Text content of the message
                {"type": "image_url", "image_url": {"url": "https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/d5fbbd1a-d484-415c-88cb-9986625b7b11"}}  # Image URL content of the message
            ]
        }
    ],
    stream=True  # Stream the response
)

# Stream and print the response content
for chunk in response:
    delta = chunk['choices'][0]['delta']  # Extract the delta from the response chunk
    if 'content' in delta:  # Check if the delta contains content
        print(delta['content'], end='', flush=True)  # Print the content without a newline and flush the output buffer
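
To see what the streaming loop does without loading a model, the sketch below runs the same extraction over a hand-written list of chunks. The chunk contents are hypothetical and only mimic the `choices[0]["delta"]` layout used in the loop above; `iter_content` is an illustrative helper, not a library function.

```python
from typing import Iterable, Iterator


def iter_content(chunks: Iterable[dict]) -> Iterator[str]:
    """Yield the text carried by each streamed chunk, skipping chunks without text."""
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        content = delta.get("content")
        if content:
            yield content


# Hypothetical stream: a role-only chunk, three text chunks, and an empty final chunk.
sample_chunks = [
    {"choices": [{"delta": {"role": "assistant"}}]},
    {"choices": [{"delta": {"content": "Hello"}}]},
    {"choices": [{"delta": {"content": ", "}}]},
    {"choices": [{"delta": {"content": "world"}}]},
    {"choices": [{"delta": {}}]},
]

full_text = "".join(iter_content(sample_chunks))
print(full_text)
```

Joining the pieces without adding separators keeps the spacing exactly as it was generated, which matters because a chunk may carry only part of a word.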

### Streamlit Demo

In [None]:
! pip install streamlit

In [None]:
%%writefile src/st_visual_answering.py
import streamlit as st
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MoondreamChatHandler
import tempfile
from PIL import Image
import base64

# Streamlit visual-language assistant app that streams the model's response
st.header("Visual-language assistant with SYCL 🐻‍❄️")

# Dropdown to select a model
selected_model = st.selectbox(
    "Please select a model", 
    ("vikhyatk/moondream2", "microsoft/Phi-3-vision-128k-instruct", "Intel/llava-gemma-2b"), 
    index=0
)

# File uploader for image
img_file_buffer = st.file_uploader('Upload an image (JPG, PNG, or GIF)', type=["jpg", "png", "gif"])

# Input for image URL (empty by default so that a missing URL can be detected)
url = st.text_input("Enter the URL of the Image:", value="", placeholder="Enter the URL of the Image", key="url_path")

# Display the uploaded image or the image from the URL
if img_file_buffer is not None:
    try:
        image = Image.open(img_file_buffer)
        st.image(image, width=600)  # Adjust the width as needed
    except Exception as e:
        st.error(f"Error loading image: {e}")
elif url.strip():
    st.image(url, width=600)  # Show the image referenced by the URL
else:
    st.info("Please provide an image URL or upload an image.")


# Input prompt for the question
question = st.text_input("Enter the question:", value="What's the content of the image?", key="question")

def getfinalresponse(input_text):
    try:
        # Encode the uploaded image, if any, as a base64 data URI
        if img_file_buffer is not None:
            def image_to_base64_data_uri():
                base64_data = base64.b64encode(img_file_buffer.getvalue()).decode('utf-8')
                # Use the MIME type reported by the uploader rather than assuming JPEG
                return f"data:{img_file_buffer.type};base64,{base64_data}"

        # Load the multimodal projector through the Moondream chat handler.
        # This handler is Moondream-specific; other models need a matching
        # handler and GGUF files in their repository.
        chat_handler = MoondreamChatHandler.from_pretrained(
            repo_id="vikhyatk/moondream2",
            filename="*mmproj*",
        )

        # Load the text model and attach the chat handler
        llm = Llama.from_pretrained(
            repo_id=selected_model,
            filename="*text-model*",
            chat_handler=chat_handler,
            n_gpu_layers=-1,  # Offload all layers to the GPU
            seed=1337,  # Fixed seed for reproducibility
            n_ctx=2048,  # Context window, sized to leave room for the image embedding
            n_threads=16,  # Number of CPU threads to use
        )

        # Create a chat completion request with the appropriate image URL
        if img_file_buffer is not None:
            response = llm.create_chat_completion(
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": question},
                            {"type": "image_url", "image_url": {"url": image_to_base64_data_uri()}}
                        ]
                    }
                ],
                stream=True
            )
        else:
            response = llm.create_chat_completion(
                messages=[
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": question},
                            {"type": "image_url", "image_url": {"url": url}}
                        ]
                    }
                ],
                stream=True
            )

        # Stream the response content as it arrives
        for chunk in response:
            res = chunk['choices'][0]['delta']
            if 'content' in res:
                # Yield each chunk unchanged so spacing and line breaks are preserved
                yield res['content']
    except Exception as e:
        st.error(f"An error occurred: {e}")

# Generate response when the button is clicked
if st.button("Generate"):
    with st.spinner("Running....🐎"):
        if not question.strip():
            st.error("Please enter a question.")
        elif not url.strip() and img_file_buffer is None:
            st.error("Please provide an image URL or upload an image.")
        else:
            st.write_stream(getfinalresponse(question))
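
The app passes an uploaded image to the model as a base64 data URI. The following stand-alone sketch shows that encoding step on its own, so it can be tried without Streamlit. The helper `to_data_uri` and its default MIME type are assumptions made for illustration; in the app, the bytes would come from the uploaded file.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI with the given MIME type."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Two placeholder bytes stand in for real image data.
uri = to_data_uri(b"hi", "image/png")
print(uri)
```

Passing the MIME type explicitly keeps the URI consistent with the actual file format, which is worth doing when the uploader accepts several formats.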


In [None]:
! streamlit run src/st_visual_answering.py

* Reference: https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/SYCL.md
* https://github.com/abetlen/llama-cpp-python