113 changes: 113 additions & 0 deletions vllm/FAQ.md
@@ -0,0 +1,113 @@
# Table of Contents

## Installation
- [Can I run the platform benchmark under a bare-metal Ubuntu environment?](#can-i-run-the-platform-benchmark-under-a-bare-metal-ubuntu-environment)
- [Can I use Ubuntu 24.04 LTS as the base OS?](#can-i-use-ubuntu-2404-lts-as-the-base-os)
- [Why can't I see the desktop even with Ubuntu 25.04 desktop version installed?](#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed)
- [Can I update the kernel version or other drivers of Ubuntu to get the latest fixes?](#can-i-update-the-kernel-version-or-other-drivers-of-ubuntu-to-get-the-latest-fixes)
- [Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image?](#why-do-i-need-to-run-native_bkc_setupsh-before-using-the-vllmplatform-docker-image)

## Hardware & Firmware
- [No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size?](#no-re-sizable-bar-configuration-in-my-bios-what-can-i-do-to-enable-b60-with-a-larger-bar2-size)
- [Maxsun 2x GPU Card Not Detected Behind PCIe Switch](#maxsun-2x-gpu-card-not-detected-behind-pcie-switch)

## Benchmarking
- [Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark?](#why-do-i-see-unusually-high-device-to-device-bandwidth-in-ze_peak-benchmark)
- [How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid?](#how-can-i-verify-if-the-benchmark-data-from-platform_basic_evaluationsh-is-valid)

## Tools
- [Why can't I see `xpu-smi` in the `vllm` Docker image?](#why-cant-i-see-xpu-smi-in-the-vllm-docker-image)
- [Why can't I see GPU utilization with `xpu-smi`?](#why-cant-i-see-gpu-utilization-with-xpu-smi)

---

# Installation

## Can I run the platform benchmark under a bare-metal Ubuntu environment?

Yes. Please contact the Intel support team to obtain an offline installer for native setup.
We also plan to make the offline installer publicly available on the Intel RDC website in an upcoming release.

## Can I use Ubuntu 24.04 LTS as the base OS? {#can-i-use-ubuntu-2404-lts-as-the-base-os}

Not yet. Support for Ubuntu 24.04 LTS is planned for a future release (targeting late 2025).

## Why can't I see the desktop even with Ubuntu 25.04 desktop version installed? {#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed}

Some versions of Ubuntu may default to text mode (multi-user target) after installation. You can check the current mode:

```bash
systemctl get-default
```

If it returns `multi-user.target`, you can switch to graphical mode:

```bash
sudo systemctl set-default graphical.target
sudo reboot
```

## Can I update the kernel version or other drivers of Ubuntu to get the latest fixes?

During the evaluation phase, we **do not recommend updating the kernel or system packages** to ensure consistency with the validated environment.
Any updates may affect stability or introduce compatibility issues with pre-installed components.

## Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image?

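`native_bkc_setup.sh` aligns the host's Linux kernel version with the validated one and installs the B60 GuC/HuC firmware directly on the host. Containers share the host kernel, so this step has to run on the host before you use the container image; the image alone cannot guarantee consistent kernel and firmware behavior.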

---

# Hardware & Firmware

## No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size?

Please contact your AIB (Add-In-Board) vendor to request the latest IFWI (firmware image) with max re-sizable BAR pre-configured.
This setup has been validated on Gunnir and Maxsun B60 cards.
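
Once the updated IFWI is in place, one way to confirm the resulting BAR2 size is to read it from `lspci -vv`. The sketch below is a minimal example under stated assumptions: the helper name `bar2_size` is hypothetical, and it assumes BAR2 shows up as `Region 2` with a `[size=...]` suffix, which is the usual `lspci` layout.

```bash
#!/bin/bash
# Hypothetical helper: print the size of "Region 2" (BAR2) for each device
# found in `lspci -vv` text read from stdin.
bar2_size() {
    sed -n 's/^[[:space:]]*Region 2:.*\[size=\([0-9]*[KMGT]\)].*/\1/p'
}

# Usage (8086 is Intel's PCI vendor ID; run as root to see all fields):
#   sudo lspci -vv -d 8086: | bar2_size
```

If the printed size is much smaller than the card's memory, the larger BAR is probably not active yet.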

## Maxsun 2x GPU Card Not Detected Behind PCIe Switch

Many PCIe switch firmware versions do not support PCIe bifurcation, which prevents detection of dual-GPU cards like Maxsun 2x.

Solution: Update the PCIe switch firmware.
The Broadcom PEX 89104 has been validated. Please contact your PCIe switch vendor for support or updated firmware.

---

# Benchmarking

## Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark?

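Export the following environment variables before running `ze_peak`. Going by their names, they enable the compute runtime's debug keys and turn off compressed render buffers: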

```bash
export NEOReadDebugKeys=1
export RenderCompressedBuffersEnabled=0
```
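
Exporting the two variables affects every later command in the same shell. If that is not wanted, a minimal sketch is to scope them to a single command. The wrapper name `run_uncompressed` is hypothetical, and `ze_peak` is shown without arguments because the test flags to pass depend on what you want to measure.

```bash
#!/bin/bash
# Hypothetical wrapper: run one command with the two variables from above set
# only for that command, leaving the calling shell's environment untouched.
run_uncompressed() {
    env NEOReadDebugKeys=1 RenderCompressedBuffersEnabled=0 "$@"
}

# Usage:
#   run_uncompressed ze_peak
```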

## How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid?

Sample benchmark results are available in:

```
/opt/intel/multi-arc/results
```

These data points are collected from internal evaluations using an Intel® Xeon® W5-2545X system with dual B60 GPUs.
> **Disclaimer**: This reference is provided for informational purposes only and should not be interpreted as official performance indicators or guarantees. Actual results may vary depending on hardware configuration, software stack, and usage scenarios.

---

# Tools

## Why can't I see `xpu-smi` in the `vllm` Docker image?

Due to release process limitations, `xpu-smi` is currently not included in the official `vllm` Docker image.
We plan to add it in the next release. In the meantime, you may install it manually using:

[xpu-smi 1.3.1 on GitHub](https://github.com/intel/xpumanager/releases/download/V1.3.1/xpumanager_1.3.1_20250724.061629.60921e5e_u24.04_amd64.deb)
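
A minimal sketch of the manual install inside a running container, assuming `wget` and `apt-get` are available there and the shell runs as root (otherwise prefix the commands with `sudo`). The linked package is the `u24.04` build; whether its dependencies resolve on your image is not something this sketch can guarantee. The helper name `install_xpu_smi` is hypothetical.

```bash
#!/bin/bash
# URL copied from the release link above.
XPUM_URL="https://github.com/intel/xpumanager/releases/download/V1.3.1/xpumanager_1.3.1_20250724.061629.60921e5e_u24.04_amd64.deb"

# Hypothetical helper: download the .deb to /tmp and install it with apt-get,
# which also resolves the dependencies the package declares.
install_xpu_smi() {
    local deb="/tmp/${XPUM_URL##*/}"
    wget -O "$deb" "$XPUM_URL"
    apt-get install -y "$deb"
}

# Usage:
#   install_xpu_smi
```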

## Why can't I see GPU utilization with `xpu-smi`?

GPU utilization metrics are not yet fully supported by `xpu-smi` in the current release.
This functionality is scheduled to be added in the next release.
13 changes: 13 additions & 0 deletions vllm/KNOWN_ISSUES.md
@@ -0,0 +1,13 @@

# 01. System Hang During Ubuntu 25.04 Installation with B60 Card Plugged In
The issue is caused by an outdated GPU GuC firmware bundled in the official Ubuntu 25.04 Desktop ISO image.

Workaround: Remove the B60 card before starting the Ubuntu installation, and plug it back in once the installation is complete.
We are also working with the Ubuntu team to address this issue upstream.

# 02. Limited 33 GB/s Bi-Directional P2P Bandwidth with 1x GPU Card
When using a single GPU card over an x16 PCIe connection without a PCIe switch, the observed bi-directional P2P bandwidth is limited to 33 GB/s.

Workaround: Change the PCIe slot configuration in BIOS from Auto/x16 to x8/x8.
With this change, over 40 GB/s bi-directional P2P bandwidth can be achieved.
Root cause analysis is still in progress.
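
After changing the BIOS setting, one way to confirm what the link actually negotiated is to read the `LnkSta` line from `lspci -vv`. A minimal sketch, assuming the usual `LnkSta: Speed ..., Width xN` layout; the helper name `link_width` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print the negotiated link width (e.g. "x8") from each
# "LnkSta:" line in `lspci -vv` text read from stdin.
link_width() {
    sed -n 's/^[[:space:]]*LnkSta:.*Width \(x[0-9]*\).*/\1/p'
}

# Usage (run as root so the capability fields are readable):
#   sudo lspci -vv -d 8086: | link_width
```

On some discrete GPUs the link that matters may be reported by the card's upstream bridge rather than by the GPU function itself, so it is worth checking every Intel device in the output.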
17 changes: 17 additions & 0 deletions vllm/tools/platform/config/disable_apparmor.sh
@@ -0,0 +1,17 @@
#!/bin/bash
# disable_apparmor.sh
# Quiet AppArmor DENIED messages from snapd

set -e

CONFIG="/etc/apparmor/parser.conf"

echo "[1/2] Updating AppArmor config to disable audit logs..."
if ! grep -q "^no-audit" "$CONFIG"; then
    echo "no-audit" | sudo tee -a "$CONFIG"
fi

echo "[2/2] Restarting AppArmor..."
sudo systemctl restart apparmor

echo "✅ AppArmor snap-confine DENIED logs have been silenced."
29 changes: 29 additions & 0 deletions vllm/tools/platform/config/disable_auto_update.sh
@@ -0,0 +1,29 @@
#!/bin/bash
# disable_auto_update.sh
# Permanently disable automatic updates on Ubuntu

set -e

echo "[1/4] Disable unattended-upgrades service..."
sudo systemctl stop unattended-upgrades.service || true
sudo systemctl disable unattended-upgrades.service || true

echo "[2/4] Disable apt-daily timers..."
sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer || true
sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer || true

echo "[3/4] Update APT config to disable periodic upgrades..."
CONFIG_FILE="/etc/apt/apt.conf.d/20auto-upgrades"
if [ -f "$CONFIG_FILE" ]; then
    sudo sed -i 's/^\(APT::Periodic::Update-Package-Lists\).*/\1 "0";/' "$CONFIG_FILE"
    sudo sed -i 's/^\(APT::Periodic::Unattended-Upgrade\).*/\1 "0";/' "$CONFIG_FILE"
else
    echo 'APT::Periodic::Update-Package-Lists "0";' | sudo tee "$CONFIG_FILE"
    echo 'APT::Periodic::Unattended-Upgrade "0";' | sudo tee -a "$CONFIG_FILE"
fi

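echo "[4/4] Disable Snap repair timer and hold Snap auto-refresh..."
sudo systemctl stop snapd.snap-repair.timer || true
sudo systemctl disable snapd.snap-repair.timer || true
# The repair timer does not control refreshes; hold them explicitly (needs snapd 2.58 or newer).
sudo snap refresh --hold || true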

echo "✅ Automatic updates have been disabled permanently."
64 changes: 64 additions & 0 deletions vllm/tools/platform/debug/collect_sysinfo.sh
@@ -0,0 +1,64 @@
#!/bin/bash
set -euo pipefail

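is_docker() {
    # Docker creates /.dockerenv inside containers. Checking it also covers
    # cgroup v2 hosts, where /proc/1/cgroup no longer names the runtime.
    [[ -f /.dockerenv ]] && return 0
    grep -qaE 'docker|kubepods|containerd' /proc/1/cgroup && return 0
    [[ "$(hostname)" =~ ^[0-9a-f]{12}$ ]] && return 0
    return 1
}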

# Check for root privileges
if [[ "$EUID" -ne 0 ]]; then
    echo "[ERROR] This script must be run as root."
    exit 1
fi

if is_docker; then
    echo "[ERROR] Please run this script under native environment, not in docker"
    exit 1
fi

# Prepare output directory
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
OUTDIR="sysinfo_$TIMESTAMP"
mkdir -p "$OUTDIR"

echo "[INFO] Collecting system information into $OUTDIR..."

# 1. CPU governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > "$OUTDIR/scaling_governor.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/scaling_governor.txt"

# 2. CPU architecture
lscpu > "$OUTDIR/lscpu.txt"

# 3. PCI topology
lspci -tv > "$OUTDIR/lspci_tree.txt"
lspci -vvv > "$OUTDIR/lspci_verbose.txt"

# 4. Kernel messages
dmesg > "$OUTDIR/dmesg.txt"

# 5. DRI tree
tree /sys/kernel/debug/dri/ > "$OUTDIR/dri_tree.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/dri_tree.txt"

# 6. Memory usage
free -h > "$OUTDIR/memory.txt"

# 7. Hardware info
dmidecode > "$OUTDIR/dmidecode.txt"

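# 8. libze info
# grep exits non-zero when nothing matches, which would abort the script under pipefail.
dpkg -l | grep libze > "$OUTDIR/libze_version.txt" || echo "Not available" > "$OUTDIR/libze_version.txt"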

# Create tar archive first
TAR_FILE="sysinfo_$TIMESTAMP.tar"
XZ_FILE="$TAR_FILE.xz"

echo "[INFO] Creating archive $TAR_FILE..."
tar -cf "$TAR_FILE" "$OUTDIR"

echo "[INFO] Compressing with xz -9..."
xz -9 "$TAR_FILE"

echo "[INFO] Done. Output file: $XZ_FILE"

67 changes: 67 additions & 0 deletions vllm/tools/platform/debug/get_bkc_version.sh
@@ -0,0 +1,67 @@
#!/bin/bash

# Output header
echo "Category,Version"

# 1. Ubuntu version
UBUNTU_VERSION=$(grep '^VERSION=' /etc/os-release | cut -d '"' -f 2)
echo "Ubuntu,$UBUNTU_VERSION"

# 2. Linux kernel version
KERNEL_VERSION=$(uname -r)
echo "Linux Kernel,$KERNEL_VERSION"

# 3. Intel GPU firmware versions from dmesg

# Extract GuC firmware version
guc_ver=$(dmesg | grep -i 'Using GuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+')
if [[ -n "$guc_ver" ]]; then
    echo "GPU Firmware (guc),$guc_ver"
else
    echo "GPU Firmware (guc),Not Found"
fi

# Extract HuC firmware version
huc_ver=$(dmesg | grep -i 'Using HuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+')
if [[ -n "$huc_ver" ]]; then
    echo "GPU Firmware (huc),$huc_ver"
else
    # Report a missing HuC entry the same way as GuC and DMC
    echo "GPU Firmware (huc),Not Found"
fi

# Extract DMC firmware version
dmc_ver=$(dmesg | grep -i 'Finished loading DMC firmware' | head -n1 | grep -oP '\(v\K[\d\.]+')
if [[ -n "$dmc_ver" ]]; then
    echo "GPU Firmware (dmc),$dmc_ver"
else
    echo "GPU Firmware (dmc),Not Found"
fi

# 4. OneAPI version (offline installed)
ONEAPI_LOG=$(ls /opt/intel/oneapi/logs/installer.install.intel.oneapi.lin.basekit.product,v=* 2>/dev/null | head -n1)
if [[ -n "$ONEAPI_LOG" ]]; then
    oneapi_ver=$(basename "$ONEAPI_LOG" | sed -n 's/.*basekit\.product,v=\(.*\)\..*/\1/p')
    echo "oneapi,oneapi-base-toolkit=$oneapi_ver"
else
    echo "oneapi,oneapi-base-toolkit=Not Installed"
fi

# 5. Parse passed-in package files
for file in "$@"; do
    [[ ! -f "$file" ]] && continue

    category=$(basename "$file" .txt)
    first=1

    while IFS= read -r pkg; do
        [[ -z "$pkg" || "$pkg" =~ ^# ]] && continue

        version=$(dpkg-query -W -f='${Version}\n' "$pkg" 2>/dev/null)
        version_output="$pkg=${version:-Not Installed}"

        if [[ $first -eq 1 ]]; then
            echo "$category,$version_output"
            first=0
        else
            echo ",$version_output"
        fi
    done < "$file"
done
59 changes: 59 additions & 0 deletions vllm/tools/platform/docker/build_ubuntu_image.sh
@@ -0,0 +1,59 @@
#!/bin/bash
set -e

# Help message
usage() {
    echo "Usage: $0 [-n image_name:tag]"
    echo "Default image name: ubuntu:25.04-custom"
    exit 1
}

# Default image name
IMAGE_NAME="ubuntu:25.04-custom"

# Parse options
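while getopts ":n:h" opt; do
    case ${opt} in
        n )
            IMAGE_NAME="$OPTARG"
            ;;
        h )
            usage
            ;;
        : )
            # With the leading ':' in the optstring, a missing argument is reported here
            echo "Option -$OPTARG requires an argument" >&2
            usage
            ;;
        \? )
            echo "Invalid option: -$OPTARG" >&2
            usage
            ;;
    esac
done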

TAR_NAME="ubuntu-2504-rootfs.tar.gz"

echo "[+] Image name: $IMAGE_NAME"
echo "[+] Creating root filesystem archive..."

sudo tar --numeric-owner -czpf "$TAR_NAME" \
    --exclude=/proc \
    --exclude=/sys \
    --exclude=/dev \
    --exclude=/tmp/* \
    --exclude=/run/* \
    --exclude=/mnt \
    --exclude=/media \
    --exclude=/lost+found \
    --exclude=/var/tmp/* \
    --exclude=/home \
    --exclude=/root \
    --exclude=/etc/ssh \
    --exclude=/etc/hostname \
    --exclude=/etc/hosts \
    /

echo "[+] Archive created: $TAR_NAME"

echo "[+] Importing into Docker as image: $IMAGE_NAME"
docker import "$TAR_NAME" "$IMAGE_NAME"

echo "[✔] Done!"
echo "You can run the image using:"
echo " docker run -it $IMAGE_NAME bash"