diff --git a/vllm/FAQ.md b/vllm/FAQ.md new file mode 100644 index 0000000..5505b3e --- /dev/null +++ b/vllm/FAQ.md @@ -0,0 +1,113 @@ +# Table of Contents + +## Installation +- [Can I run the platform benchmark under a bare-metal Ubuntu environment?](#can-i-run-the-platform-benchmark-under-a-bare-metal-ubuntu-environment) +- [Can I use Ubuntu 24.04 LTS as the base OS?](#can-i-use-ubuntu-2404-lts-as-the-base-os) +- [Why can't I see the desktop even with Ubuntu 25.04 desktop version installed?](#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed) +- [Can I update the kernel version or other drivers of Ubuntu to get the latest fixes?](#can-i-update-the-kernel-version-or-other-drivers-of-ubuntu-to-get-the-latest-fixes) +- [Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image?](#why-do-i-need-to-run-native_bkc_setupsh-before-using-the-vllmplatform-docker-image) + +## Hardware & Firmware +- [No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size?](#no-re-sizable-bar-configuration-in-my-bios-what-can-i-do-to-enable-b60-with-a-larger-bar2-size) +- [Maxsun 2x GPU Card Not Detected Behind PCIe Switch](#maxsun-2x-gpu-card-not-detected-behind-pcie-switch) + +## Benchmarking +- [Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark?](#why-do-i-see-unusually-high-device-to-device-bandwidth-in-ze_peak-benchmark) +- [How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid?](#how-can-i-verify-if-the-benchmark-data-from-platform_basic_evaluationsh-is-valid) + +## Tools +- [Why can't I see `xpu-smi` in the `vllm` Docker image?](#why-cant-i-see-xpu-smi-in-the-vllm-docker-image) +- [Why can't I see GPU utilization with `xpu-smi`?](#why-cant-i-see-gpu-utilization-with-xpu-smi) + +--- + +# Installation + +## Can I run the platform benchmark under a bare-metal Ubuntu environment? + +Yes. 
Please contact the Intel support team to obtain an offline installer for native setup. +We also plan to make the offline installer publicly available on the Intel RDC website in an upcoming release. + +## Can I use Ubuntu 24.04 LTS as the base OS? {#can-i-use-ubuntu-2404-lts-as-the-base-os} + +Not yet. Support for Ubuntu 24.04 LTS is planned in future releases (targeting late 2025). + +## Why can't I see the desktop even with Ubuntu 25.04 desktop version installed? {#why-cant-i-see-the-desktop-even-with-ubuntu-2504-desktop-version-installed} + +Some versions of Ubuntu may default to text mode (multi-user target) after installation. You can check the current mode: + +```bash +sudo systemctl get-default +``` + +If it returns `multi-user.target`, you can switch to graphical mode: + +```bash +sudo systemctl set-default graphical.target +sudo reboot +``` + +## Can I update the kernel version or other drivers of Ubuntu to get the latest fixes? + +During the evaluation phase, we **do not recommend updating the kernel or system packages** to ensure consistency with the validated environment. +Any updates may affect stability or introduce compatibility issues with pre-installed components. + +## Why do I need to run `native_bkc_setup.sh` before using the `vllm/platform` Docker image? + +To ensure consistent kernel and firmware behavior, `native_bkc_setup.sh` is required to unify Linux kernel version and install B60 GuC/HuC firmware directly on the host system before using the container image. + +--- + +# Hardware & Firmware + +## No re-sizable BAR configuration in my BIOS. What can I do to enable B60 with a larger BAR2 size? + +Please contact your AIB (Add-In-Board) vendor to request the latest IFWI (firmware image) with max re-sizable BAR pre-configured. +This setup has been validated on Gunnir and Maxsun B60 cards. 
+ +## Maxsun 2x GPU Card Not Detected Behind PCIe Switch + +Many PCIe switch firmware versions do not support PCIe bifurcation, which prevents detection of dual-GPU cards such as the Maxsun 2x. + +Solution: A firmware update for the PCIe switch is required. +The Broadcom PEX 89104 has been validated. Please contact your PCIe switch vendor for support or updated firmware. + +--- + +# Benchmarking + +## Why do I see unusually high Device-to-Device bandwidth in `ze_peak` benchmark? + +Export the following environment variables before running `ze_peak`; `platform_basic_evaluation.sh` sets the same pair for an accurate memory bandwidth benchmark. + +```bash +export NEOReadDebugKeys=1 +export RenderCompressedBuffersEnabled=0 +``` + +## How can I verify if the benchmark data from `platform_basic_evaluation.sh` is valid? + +Sample benchmark results are available in: + +``` +/opt/intel/multi-arc/results +``` + +These data points were collected from internal evaluations using an Intel® Xeon® W5-2545X system with dual B60 GPUs. +> **Disclaimer**: This reference is provided for informational purposes only and should not be interpreted as official performance indicators or guarantees. Actual results may vary depending on hardware configuration, software stack, and usage scenarios. + +--- + +# Tools + +## Why can't I see `xpu-smi` in the `vllm` Docker image? + +Due to release process limitations, `xpu-smi` is currently not included in the official `vllm` Docker image. +We plan to add it in the next release. In the meantime, you may install it manually using: + +[xpu-smi 1.3.1 on GitHub](https://github.com/intel/xpumanager/releases/download/V1.3.1/xpumanager_1.3.1_20250724.061629.60921e5e_u24.04_amd64.deb) + +## Why can't I see GPU utilization with `xpu-smi`? + +GPU utilization metrics are not yet fully supported by `xpu-smi` in the current release. +This functionality is scheduled to be added in the next release. 
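The `ze_peak` entry in the FAQ above exports `NEOReadDebugKeys` and `RenderCompressedBuffersEnabled` for the whole shell session. As a minimal sketch, assuming it is acceptable to wrap the benchmark call (the function name `run_with_neo_debug_keys` is hypothetical and not part of this patch), the same two settings can be limited to a single command:

```bash
#!/bin/bash
# Hypothetical wrapper, not part of the patch: run one command with the two
# variables from the FAQ set only in that command's environment.
run_with_neo_debug_keys() {
    env NEOReadDebugKeys=1 RenderCompressedBuffersEnabled=0 "$@"
}

# Demonstration with a harmless command. On a real system the arguments would
# be the benchmark invocation, e.g. ./tools/level-zero-tests/ze_peak -t global_bw
run_with_neo_debug_keys sh -c 'echo "RenderCompressedBuffersEnabled=$RenderCompressedBuffersEnabled"'
```

Scoping the variables this way keeps them out of unrelated commands run later in the same shell; whether that matters depends on what else the session is used for.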
diff --git a/vllm/KNOWN_ISSUES.md b/vllm/KNOWN_ISSUES.md new file mode 100644 index 0000000..2cd646f --- /dev/null +++ b/vllm/KNOWN_ISSUES.md @@ -0,0 +1,13 @@ + +# 01. System Hang During Ubuntu 25.04 Installation with B60 Card Plugged In +The issue is caused by an outdated GPU GuC firmware bundled in the official Ubuntu 25.04 Desktop ISO image. + +Workaround: Remove the B60 card before starting the Ubuntu installation, and plug it back in once the installation is complete. +We are also working with the Ubuntu team to address this issue upstream. + +# 02. Limited 33 GB/s Bi-Directional P2P Bandwidth with 1x GPU Card +When using a single GPU card over a x16 PCIe connection without a PCIe switch, the observed bi-directional P2P bandwidth is limited to 33 GB/s. + +Workaround: Change the PCIe slot configuration in BIOS from Auto/x16 to x8/x8. +With this change, over 40 GB/s bi-directional P2P bandwidth can be achieved. +Root cause analysis is still in progress. diff --git a/vllm/tools/platform/config/disable_apparmor.sh b/vllm/tools/platform/config/disable_apparmor.sh new file mode 100755 index 0000000..5467520 --- /dev/null +++ b/vllm/tools/platform/config/disable_apparmor.sh @@ -0,0 +1,17 @@ +#!/bin/bash +# disable-snap-apparmor-logs.sh +# Quiet AppArmor DENIED messages from snapd + +set -e + +CONFIG="/etc/apparmor/parser.conf" + +echo "[1/2] Updating AppArmor config to disable audit logs..." +if ! grep -q "^no-audit" "$CONFIG"; then + echo "no-audit" | sudo tee -a "$CONFIG" +fi + +echo "[2/2] Restarting AppArmor..." +sudo systemctl restart apparmor + +echo "✅ AppArmor snap-confine DENIED logs have been silenced." 
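`disable_apparmor.sh` above appends `no-audit` only when `grep` does not already find it, so the script can be re-run safely. A minimal sketch of that idempotent-append pattern, exercised on a temporary file instead of `/etc/apparmor/parser.conf` (the helper name `append_once` is hypothetical, and the whole-line match via `grep -x -F` is a swapped-in variation on the script's `^no-audit` prefix match):

```bash
#!/bin/bash
# Hypothetical helper, not part of the patch: append a line to a file only if
# an identical line is not already present.
append_once() {
    local line="$1" file="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

demo_conf="$(mktemp)"
append_once "no-audit" "$demo_conf"
append_once "no-audit" "$demo_conf"   # second call finds the line and does nothing
cat "$demo_conf"
rm -f "$demo_conf"
```

Matching the whole line avoids treating a longer entry that merely starts with `no-audit` as already configured; the prefix match in the script is the more permissive choice.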
diff --git a/vllm/tools/platform/config/disable_auto_update.sh b/vllm/tools/platform/config/disable_auto_update.sh new file mode 100755 index 0000000..12326ba --- /dev/null +++ b/vllm/tools/platform/config/disable_auto_update.sh @@ -0,0 +1,29 @@ +#!/bin/bash +# disable-auto-upgrade.sh +# Permanently disable automatic updates on Ubuntu + +set -e + +echo "[1/4] Disable unattended-upgrades service..." +sudo systemctl stop unattended-upgrades.service || true +sudo systemctl disable unattended-upgrades.service || true + +echo "[2/4] Disable apt-daily timers..." +sudo systemctl stop apt-daily.timer apt-daily-upgrade.timer || true +sudo systemctl disable apt-daily.timer apt-daily-upgrade.timer || true + +echo "[3/4] Update APT config to disable periodic upgrades..." +CONFIG_FILE="/etc/apt/apt.conf.d/20auto-upgrades" +if [ -f "$CONFIG_FILE" ]; then + sudo sed -i 's/^\(APT::Periodic::Update-Package-Lists\).*/\1 "0";/' "$CONFIG_FILE" + sudo sed -i 's/^\(APT::Periodic::Unattended-Upgrade\).*/\1 "0";/' "$CONFIG_FILE" +else + echo 'APT::Periodic::Update-Package-Lists "0";' | sudo tee "$CONFIG_FILE" + echo 'APT::Periodic::Unattended-Upgrade "0";' | sudo tee -a "$CONFIG_FILE" +fi + +echo "[4/4] Disable Snap auto-refresh..." +sudo systemctl stop snapd.snap-repair.timer || true +sudo systemctl disable snapd.snap-repair.timer || true + +echo "✅ Automatic updates have been disabled permanently." diff --git a/vllm/tools/platform/debug/collect_sysinfo.sh b/vllm/tools/platform/debug/collect_sysinfo.sh new file mode 100755 index 0000000..d472cfe --- /dev/null +++ b/vllm/tools/platform/debug/collect_sysinfo.sh @@ -0,0 +1,64 @@ +#!/bin/bash +set -euo pipefail + +is_docker() { + grep -qaE 'docker|kubepods|containerd' /proc/1/cgroup && return 0 + [[ "$(hostname)" =~ ^[0-9a-f]{12}$ ]] && return 0 + return 1 +} + +# Check for root privileges +if [[ "$EUID" -ne 0 ]]; then + echo "[ERROR] This script must be run as root." 
+ exit 1 +fi + +if is_docker; then + echo "[ERROR] Please run this script under native environment, not in docker" + exit 1 +fi + +# Prepare output directory +TIMESTAMP=$(date +%Y%m%d_%H%M%S) +OUTDIR="sysinfo_$TIMESTAMP" +mkdir -p "$OUTDIR" + +echo "[INFO] Collecting system information into $OUTDIR..." + +# 1. CPU governor +cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > "$OUTDIR/scaling_governor.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/scaling_governor.txt" + +# 2. CPU architecture +lscpu > "$OUTDIR/lscpu.txt" + +# 3. PCI topology +lspci -tv > "$OUTDIR/lspci_tree.txt" +lspci -vvv > "$OUTDIR/lspci_verbose.txt" + +# 4. Kernel messages +dmesg > "$OUTDIR/dmesg.txt" + +# 5. DRI tree +tree /sys/kernel/debug/dri/ > "$OUTDIR/dri_tree.txt" 2>/dev/null || echo "Not available" > "$OUTDIR/dri_tree.txt" + +# 6. Memory usage +free -h > "$OUTDIR/memory.txt" + +# 7. Hardware info +dmidecode > "$OUTDIR/dmidecode.txt" + +# 8. libze info +dpkg -l | grep libze > "$OUTDIR/libze_version.txt" + +# Create tar archive first +TAR_FILE="sysinfo_$TIMESTAMP.tar" +XZ_FILE="$TAR_FILE.xz" + +echo "[INFO] Creating archive $TAR_FILE..." +tar -cf "$TAR_FILE" "$OUTDIR" + +echo "[INFO] Compressing with xz -9..." +xz -9 "$TAR_FILE" + +echo "[INFO] Done. Output file: $XZ_FILE" + diff --git a/vllm/tools/platform/debug/get_bkc_version.sh b/vllm/tools/platform/debug/get_bkc_version.sh new file mode 100755 index 0000000..82b6579 --- /dev/null +++ b/vllm/tools/platform/debug/get_bkc_version.sh @@ -0,0 +1,67 @@ +#!/bin/bash + +# Output header +echo "Category,Version" + +# 1. Ubuntu version +UBUNTU_VERSION=$(grep '^VERSION=' /etc/os-release | cut -d '"' -f 2) +echo "Ubuntu,$UBUNTU_VERSION" + +# 2. Linux kernel version +KERNEL_VERSION=$(uname -r) +echo "Linux Kernel,$KERNEL_VERSION" + +# 3. 
Intel GPU firmware versions from dmesg + +# Extract GuC firmware version +guc_ver=$(dmesg | grep -i 'Using GuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+') +if [[ -n "$guc_ver" ]]; then + echo "GPU Firmware (guc),$guc_ver" +else + echo "GPU Firmware (guc),Not Found" +fi + +# Extract HuC firmware version +huc_ver=$(dmesg | grep -i 'Using HuC firmware' | head -n1 | grep -oP 'version \K[\d\.]+') +if [[ -n "$huc_ver" ]]; then + echo "GPU Firmware (huc),$huc_ver" +fi + +# Extract DMC firmware version +dmc_ver=$(dmesg | grep -i 'Finished loading DMC firmware' | head -n1 | grep -oP '\(v\K[\d\.]+') +if [[ -n "$dmc_ver" ]]; then + echo "GPU Firmware (dmc),$dmc_ver" +else + echo "GPU Firmware (dmc),Not Found" +fi + +# 4. OneAPI version (offline installed) +ONEAPI_LOG=$(ls /opt/intel/oneapi/logs/installer.install.intel.oneapi.lin.basekit.product,v=* 2>/dev/null | head -n1) +if [[ -n "$ONEAPI_LOG" ]]; then + oneapi_ver=$(basename "$ONEAPI_LOG" | sed -n 's/.*basekit\.product,v=\(.*\)\..*/\1/p') + echo "oneapi,oneapi-base-toolkit=$oneapi_ver" +else + echo "oneapi,oneapi-base-toolkit=Not Installed" +fi + +# 5. Parse passed-in package files +for file in "$@"; do + [[ ! 
-f "$file" ]] && continue + + category=$(basename "$file" .txt) + first=1 + + while IFS= read -r pkg; do + [[ -z "$pkg" || "$pkg" =~ ^# ]] && continue + + version=$(dpkg-query -W -f='${Version}\n' "$pkg" 2>/dev/null) + version_output="$pkg=${version:-Not Installed}" + + if [[ $first -eq 1 ]]; then + echo "$category,$version_output" + first=0 + else + echo ",$version_output" + fi + done < "$file" +done diff --git a/vllm/tools/platform/docker/build_ubuntu_image.sh b/vllm/tools/platform/docker/build_ubuntu_image.sh new file mode 100755 index 0000000..793df1d --- /dev/null +++ b/vllm/tools/platform/docker/build_ubuntu_image.sh @@ -0,0 +1,59 @@ +#!/bin/bash +set -e + +# Help message +usage() { + echo "Usage: $0 [-n image_name:tag]" + echo "Default image name: ubuntu:25.04-custom" + exit 1 +} + +# Default image name +IMAGE_NAME="ubuntu:25.04-custom" + +# Parse options +while getopts ":n:h" opt; do + case ${opt} in + n ) + IMAGE_NAME=$OPTARG + ;; + h ) + usage + ;; + \? ) + echo "Invalid option: -$OPTARG" >&2 + usage + ;; + esac +done + +TAR_NAME="ubuntu-2504-rootfs.tar.gz" + +echo "[+] Image name: $IMAGE_NAME" +echo "[+] Creating root filesystem archive..." + +sudo tar --numeric-owner -czpf "$TAR_NAME" \ + --exclude=/proc \ + --exclude=/sys \ + --exclude=/dev \ + --exclude=/tmp/* \ + --exclude=/run/* \ + --exclude=/mnt \ + --exclude=/media \ + --exclude=/lost+found \ + --exclude=/var/tmp/* \ + --exclude=/home \ + --exclude=/root \ + --exclude=/etc/ssh \ + --exclude=/etc/hostname \ + --exclude=/etc/hosts \ + / + +echo "[+] Archive created: $TAR_NAME" + +echo "[+] Importing into Docker as image: $IMAGE_NAME" +cat "$TAR_NAME" | docker import - "$IMAGE_NAME" + +echo "[✔] Done!" 
+echo "You can run the image using:" +echo " docker run -it $IMAGE_NAME bash" diff --git a/vllm/tools/platform/docker/run_gpu_container.sh b/vllm/tools/platform/docker/run_gpu_container.sh new file mode 100755 index 0000000..1b64985 --- /dev/null +++ b/vllm/tools/platform/docker/run_gpu_container.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +# Usage check +if [ $# -ne 2 ]; then + echo "Usage: $0 IMAGE_NAME HOST_DIR" + exit 1 +fi + +IMAGE_NAME="$1" +HOST_DIR="$2" + +# Verify directory exists +if [ ! -d "$HOST_DIR" ]; then + echo "Error: Directory '$HOST_DIR' does not exist." + exit 2 +fi + +# Run the container +docker run \ + -it \ + --privileged \ + --device=/dev/dri \ + $(for dev in /dev/mei*; do echo --device $dev; done) \ + --group-add video \ + --cap-add=SYS_ADMIN \ + --mount type=bind,source=/dev/dri/by-path,target=/dev/dri/by-path \ + --mount type=bind,source=/sys,target=/sys \ + --mount type=bind,source=/dev/bus,target=/dev/bus \ + --mount type=bind,source=/dev/char,target=/dev/char \ + --mount type=bind,source="$(realpath "$HOST_DIR")",target=/mnt/workdir \ + "$IMAGE_NAME" \ + bash diff --git a/vllm/tools/platform/evaluation/gen_evaluation_report.py b/vllm/tools/platform/evaluation/gen_evaluation_report.py new file mode 100644 index 0000000..cbc70f7 --- /dev/null +++ b/vllm/tools/platform/evaluation/gen_evaluation_report.py @@ -0,0 +1,152 @@ +import re +import csv +import sys + +# ------------------------------- +# Parse P2P bandwidth, keep only 128MB and 256MB +# ------------------------------- +def parse_p2p_bandwidth(lines): + TARGETS = { + "Unidirectional Write": r"Bandwidth Write : Device\( 0 \)->Device\( 1 \)", + "Unidirectional Read": r"Bandwidth Read : Device\( 0 \)<-Device\( 1 \)", + "Bidirectional Write": r"Bandwidth Write : Device\( 0 \)<->Device\( 1 \)", + "Bidirectional Read": r"Bandwidth Read : Device\( 0 \)<->Device\( 1 \)", + } + keep_sizes = ["128 MB", "256 MB"] + results = [] + + i = 0 + while i < len(lines): + line = lines[i] + for name, pattern in TARGETS.items(): + 
if re.search(pattern, line): + i += 1 + while i < len(lines) and "BW [GBPS]" in lines[i]: + match = re.search(r"([\d]+ ?[KM]?B):\s+([\d\.]+)", lines[i]) + if match: + size = match.group(1).strip() + if size in keep_sizes: + bw = float(match.group(2)) + results.append(["p2p", name, size, bw]) + i += 1 + break + i += 1 + + return results + +# ------------------------------- +# Parse GPU memory bandwidth (H2D, D2H, D2D) +# ------------------------------- +def parse_gpu_memory_bandwidth(lines): + results = [] + h2d = d2h = d2d_float8 = d2d_float16 = None + for i, line in enumerate(lines): + if "GPU Copy Host to Shared Memory" in line: + match = re.search(r"([\d\.]+) GB/s", line) + if match: h2d = float(match.group(1)) + elif "GPU Copy Shared Memory to Host" in line: + match = re.search(r"([\d\.]+) GB/s", line) + if match: d2h = float(match.group(1)) + elif "Global memory bandwidth" in line: + j = i + 1 + while j < len(lines) and lines[j].strip() != "": + if "float8" in lines[j]: + match = re.search(r"float8\s*:\s*([\d\.]+) GB/s", lines[j]) + if match: d2d_float8 = float(match.group(1)) + elif "float16" in lines[j]: + match = re.search(r"float16\s*:\s*([\d\.]+) GB/s", lines[j]) + if match: d2d_float16 = float(match.group(1)) + j += 1 + if h2d is not None: results.append(["GPU memory bandwidth", "H2D", "", h2d]) + if d2h is not None: results.append(["GPU memory bandwidth", "D2H", "", d2h]) + if d2d_float8 is not None: results.append(["GPU memory bandwidth", "D2D", "float8", d2d_float8]) + if d2d_float16 is not None: results.append(["GPU memory bandwidth", "D2D", "float16", d2d_float16]) + return results + +# ------------------------------- +# Parse GEMM int8 performance +# ------------------------------- +def parse_gemm_int8(lines): + in_int8 = False + for line in lines: + if "matrix multiplication" in line and "int8 precision" in line: + in_int8 = True + elif in_int8 and "Average performance" in line: + match = re.search(r"Average performance:\s*([\d\.]+)TF", line) + if 
match: + return [["gemm", "int8", "", float(match.group(1))]] + return [] + +# ------------------------------- +# Parse oneCCL benchmarks (allreduce/allgather/alltoall) +# Only extract busbw at 128MB +# ------------------------------- +def parse_ccl_busbw(lines, target_bytes=134217728): + results = [] + current_test = None + pattern_test = re.compile(r"benchmarking:\s*(allreduce|allgather|alltoall)", re.I) + for line in lines: + clean_line = line.lstrip("# ").strip() + m = pattern_test.match(clean_line) + if m: + current_test = m.group(1).lower() + continue + if current_test and re.match(r"^\d", clean_line): + cols = clean_line.split() + if len(cols) >= 9: + bytes_val = int(cols[0]) + busbw_val = float(cols[8]) + if bytes_val == target_bytes: + results.append(["1ccl", current_test, "128MB", busbw_val]) + current_test = None + return results + +# ------------------------------- +# Load reference values +# ------------------------------- +def load_reference(reference_file): + reference = {} + with open(reference_file, "r") as f: + reader = csv.reader(f) + header = next(reader, None) + for row in reader: + if len(row) >= 4: + key = (row[0], row[1], row[2]) + reference[key] = row[3] + return reference + +# ------------------------------- +# Main function +# ------------------------------- +def main(): + if len(sys.argv) != 4: + print("Usage: python gen_evaluation_report.py INPUT_LOG REFERENCE_CSV OUTPUT_CSV") + sys.exit(1) + + input_file = sys.argv[1] + reference_file = sys.argv[2] + output_file = sys.argv[3] + + with open(input_file, "r") as f: + lines = f.readlines() + + all_results = [] + all_results.extend(parse_p2p_bandwidth(lines)) + all_results.extend(parse_gpu_memory_bandwidth(lines)) + all_results.extend(parse_gemm_int8(lines)) + all_results.extend(parse_ccl_busbw(lines)) + + reference = load_reference(reference_file) + + with open(output_file, "w", newline="") as csvfile: + writer = csv.writer(csvfile) + writer.writerow(["Category", "Subcategory", "Data/Packet Size", "Measured (GB/s)", "Reference (GB/s)"]) 
+ for row in all_results: + key = (row[0], row[1], row[2]) + ref_val = reference.get(key, "") + writer.writerow(row + [ref_val]) + + print(f"Report generated: {output_file}") + +if __name__ == "__main__": + main() diff --git a/vllm/tools/platform/evaluation/platform_basic_evaluation.sh b/vllm/tools/platform/evaluation/platform_basic_evaluation.sh new file mode 100755 index 0000000..7a283ed --- /dev/null +++ b/vllm/tools/platform/evaluation/platform_basic_evaluation.sh @@ -0,0 +1,135 @@ +#!/bin/bash +set -eo pipefail + +# === Error Handling === +CURRENT_STEP="" +function print_info() { echo -e "\033[1;34m[INFO]\033[0m $1"; } +function print_success() { echo -e "\033[1;32m[SUCCESS]\033[0m $1"; } +function print_error() { echo -e "\033[1;31m[ERROR]\033[0m $1"; } + +function error_handler() { + local exit_code=$? + local line_no=$1 + print_error "Script failed during: '$CURRENT_STEP' (line $line_no, exit code $exit_code)" + echo "[FAILED COMMAND] $BASH_COMMAND" + echo "[DEBUG] Check log file: $LOG" + exit $exit_code +} +trap 'error_handler $LINENO' ERR + +function step() { + CURRENT_STEP="$1" + print_info "$CURRENT_STEP" +} + +# === Validate current directory === +step "Validating directory structure" +if [ ! -d "./tools" ] || [ ! -d "./scripts" ]; then + print_error "This script must be run from the root directory of the installation package." + echo "Expected directories: ./tools and ./scripts" + exit 1 +fi + +# === Timestamped result directory === +step "Creating result directory" +TIMESTAMP=$(date "+%Y%m%d_%H%M%S") +RESULT_DIR="results/$TIMESTAMP" +mkdir -p "$RESULT_DIR" +LOG="$RESULT_DIR/benchmark_detail_log.txt" + +# === Load Intel oneAPI environment === +step "Sourcing Intel oneAPI environment" +SETVARS="/opt/intel/oneapi/setvars.sh" +if [ ! -f "$SETVARS" ]; then + print_error "$SETVARS not found. Please check the path." 
+ exit 1 +fi +source "$SETVARS" --force>> "$LOG" 2>&1 + +# === Setup Necessary Environment Variable === +# For accurate memory bandwidth benchmark +export NEOReadDebugKeys=1 +export RenderCompressedBuffersEnabled=0 + +# === List SYCL Devices === +step "Listing SYCL devices" +sycl-ls 2>&1 | tee -a "$LOG" + +# === Run xpu-smi === +step "xpu-smi test" +xpu-smi discovery 2>&1 | tee -a "$LOG" +xpu-smi dump -m 0,1,2,3,4,5,18,19,20 -n 1 2>&1 | tee -a "$LOG" + +# === P2P Bandwidth Test === +count=$(lspci | grep -E "e211|e210" | wc -l) +echo "Detected GPU count: $count" +if [ "$count" -ge 2 ]; then + step "Running ze_peer default test" + ./tools/level-zero-tests/ze_peer -s 0 -d 1 2>&1 | tee -a "$LOG" + + step "Running ze_peer bi-directional write test" + ./tools/level-zero-tests/ze_peer -o write -t transfer_bw -s 0 -d 1 -b 2>&1 | tee -a "$LOG" + + step "Running ze_peer bi-directional read test" + ./tools/level-zero-tests/ze_peer -o read -t transfer_bw -s 0 -d 1 -b 2>&1 | tee -a "$LOG" + +# if [ "$count" -ge 4 ]; then +# step "Running ze_peer 2 pair uni-directional write test" +# ./tools/level-zero-tests/ze_peer -o write -t transfer_bw --parallel_pair_targets 0:1,2:3 2>&1 | tee -a "$LOG" +# +# step "Running ze_peer 2 pair bi-directional write test" +# ./tools/level-zero-tests/ze_peer -o write -t transfer_bw --parallel_pair_targets 0:1,2:3 -b 2>&1 | tee -a "$LOG" +# fi +else + echo "GPU count < 2, no need to do P2P benchmark" +fi + +# === Host ↔ Device Bandwidth Test === +step "Running H2D/D2H transfer_bw test" +./tools/level-zero-tests/ze_peak -t transfer_bw 2>&1 | tee -a "$LOG" + +# === Device ↔ Device Bandwidth Test === +step "Copying .spv files" +cp ./tools/level-zero-tests/*.spv ./ 2>/dev/null || print_info "No .spv files to copy." 
+ +step "Running D2D global_bw test" +./tools/level-zero-tests/ze_peak -t global_bw 2>&1 | tee -a "$LOG" + +step "Cleaning up SPIR-V files" +rm -f *.spv + +# === GEMM Test using MKL === +step "Running GEMM MKL test (int8)" +matrix_mul_mkl int8 -m 40960 -n 40960 -k 40960 -c 0 2>&1 | tee -a "$LOG" + +# === 1CCL Benchmarking === +function run_ccl_test() { + local op=$1 + local outfile="$RESULT_DIR/${op}_outplace_128M.csv" + step "Running 1CCL ${op^^} test" + mpirun -np 2 /usr/bin/1ccl_benchmark \ + -a gpu -m usm -u device -e in_order \ + -l "$op" -i 50 -w 20 -f 512 -t 67108864 \ + -j off -p 0 -d float16 -q 0 -o "$outfile" 2>&1 | tee -a "$LOG" + + if [ ${PIPESTATUS[0]} -eq 0 ]; then + print_success "1CCL $op test completed. Output: $outfile" + else + print_error "1CCL $op test failed." + exit 1 + fi +} + +run_ccl_test allreduce +run_ccl_test allgather +run_ccl_test alltoall + + +# === Generate Report ==== +python3 ./scripts/evaluation/gen_evaluation_report.py $RESULT_DIR/benchmark_detail_log.txt results/reference_perf.csv $RESULT_DIR/benchmark_report.csv + +# === Final Message === +print_success "All tests completed." 
+print_info "Logs saved to: $LOG" +print_info "Result report saved in: $RESULT_DIR/benchmark_report.csv" +print_info "Result detailed log saved in: $RESULT_DIR/benchmark_detail_log.txt" diff --git a/vllm/tools/platform/evaluation/setup_perf.sh b/vllm/tools/platform/evaluation/setup_perf.sh new file mode 100755 index 0000000..82ab0bf --- /dev/null +++ b/vllm/tools/platform/evaluation/setup_perf.sh @@ -0,0 +1,10 @@ +#!/bin/bash +gpu_num=$(sudo xpu-smi discovery | grep card | wc -l) +for((i=0; i<$gpu_num; i++)); do + echo "Set GPU $i freq to 2400 MHz" + sudo xpu-smi config -d $i -t 0 --frequencyrange 2400,2400 +done + +echo "Set CPU to performance mode" +echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor +echo 0 | sudo tee /sys/devices/system/cpu/cpu*/power/energy_perf_bias diff --git a/vllm/tools/platform/installation/add_intel_graphics_ppa.sh b/vllm/tools/platform/installation/add_intel_graphics_ppa.sh new file mode 100755 index 0000000..97ff25a --- /dev/null +++ b/vllm/tools/platform/installation/add_intel_graphics_ppa.sh @@ -0,0 +1 @@ +add-apt-repository -y ppa:kobuk-team/intel-graphics diff --git a/vllm/tools/platform/installation/build_local_apt_repo.sh b/vllm/tools/platform/installation/build_local_apt_repo.sh new file mode 100755 index 0000000..812a055 --- /dev/null +++ b/vllm/tools/platform/installation/build_local_apt_repo.sh @@ -0,0 +1,74 @@ +#!/bin/bash +# +# build_local_apt_repo.sh +# Build a local APT repository from a directory of .deb files +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +# +# Author: James Tang +# Date: 2025-07-26 + +set -euo pipefail + +# === Colors === +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' + +log_info() { echo -e "${GREEN}[INFO]${NC} $1"; } +log_warn() { echo -e "${YELLOW}[WARN]${NC} $1"; } +log_error() { echo -e "${RED}[ERROR]${NC} $1" >&2; } + +# === Argument Check === +if [[ $# -ne 1 ]]; then + echo -e "${YELLOW}Usage: $0 DEB_SOURCE_DIR${NC}" + exit 1 +fi + +DEB_SOURCE_DIR="$1" +REPO_DIR="/opt/local-apt-repo" +APT_SOURCE_FILE="/etc/apt/sources.list.d/local-repo.list" + +# === Validation === +if [[ ! -d "$DEB_SOURCE_DIR" ]]; then + log_error "Directory '$DEB_SOURCE_DIR' does not exist." + exit 2 +fi + +if [[ ! "$(ls -1 "$DEB_SOURCE_DIR"/*.deb 2>/dev/null)" ]]; then + log_error "No .deb files found in '$DEB_SOURCE_DIR'." + exit 3 +fi + +log_info "Creating local APT repository from '$DEB_SOURCE_DIR'..." + +# === Step 1: Prepare Repository === +log_info "Copying .deb files to repository directory: $REPO_DIR" +mkdir -p "$REPO_DIR" +cp -v "$DEB_SOURCE_DIR"/*.deb "$REPO_DIR"/ + +# === Step 2: Generate Packages.gz === +log_info "Generating Packages.gz index..." +cd "$REPO_DIR" +dpkg-scanpackages . /dev/null | gzip -9c > Packages.gz + +# === Step 3: Configure APT Source === +log_info "Configuring APT source file: $APT_SOURCE_FILE" +echo "deb [trusted=yes] file://$REPO_DIR ./" | tee "$APT_SOURCE_FILE" > /dev/null + +# === Step 4: Update APT Index === +log_info "Updating APT package index..." +apt update + +log_info "✅ Local APT repository is ready and active!" 
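The validation step in `build_local_apt_repo.sh` above detects `.deb` files by checking whether `ls` prints anything. As a minimal sketch of an alternative that relies only on shell globbing (the helper name `has_deb_files` is hypothetical and not part of this patch), the same check can be written without parsing `ls` output:

```bash
#!/bin/bash
# Hypothetical helper, not part of the patch: return 0 if the directory holds
# at least one *.deb entry, 1 otherwise.
has_deb_files() {
    local f
    for f in "$1"/*.deb; do
        # With no match the glob stays literal, so test that the path exists.
        [ -e "$f" ] && return 0
    done
    return 1
}

demo_dir="$(mktemp -d)"
has_deb_files "$demo_dir" || echo "no .deb files yet"
touch "$demo_dir/example_1.0_amd64.deb"
has_deb_files "$demo_dir" && echo "found .deb files"
rm -rf "$demo_dir"
```

Both forms give the same answer for ordinary directories; the glob loop simply avoids spawning `ls` and depends on nothing outside the shell.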
diff --git a/vllm/tools/platform/installation/get_kernel.sh b/vllm/tools/platform/installation/get_kernel.sh new file mode 100755 index 0000000..eecc170 --- /dev/null +++ b/vllm/tools/platform/installation/get_kernel.sh @@ -0,0 +1,61 @@ +#!/bin/bash +# get_kernel.sh +# Usage: +# sudo ./get_kernel.sh KERNEL_VERSION ACTION +# Example: +# sudo ./get_kernel.sh 6.14.0-1006-intel install +# sudo ./get_kernel.sh 6.14.0-1006-intel download + +set -e + +if [ $# -ne 2 ]; then + echo "Usage: $0 KERNEL_VERSION ACTION" + echo " KERNEL_VERSION : e.g. 6.14.0-1006-intel" + echo " ACTION : install | download" + exit 1 +fi + +KERNEL_VERSION="$1" +ACTION="$2" + +# Kernel package list +PACKAGES=( + "linux-image-${KERNEL_VERSION}" + "linux-modules-${KERNEL_VERSION}" + "linux-modules-extra-${KERNEL_VERSION}" + "linux-headers-${KERNEL_VERSION}" +) + +echo "Target kernel version: $KERNEL_VERSION" +echo "Action: $ACTION" +echo "Packages:" +for pkg in "${PACKAGES[@]}"; do + echo " $pkg" +done + +echo +read -p "Do you want to continue? (y/N): " confirm +if [[ ! "$confirm" =~ ^[Yy]$ ]]; then + echo "Aborted." + exit 0 +fi + +# Update package cache +sudo apt update + +# Perform action +if [ "$ACTION" == "install" ]; then + echo "Installing packages..." + sudo apt install -y "${PACKAGES[@]}" + echo "Installation complete. Please reboot to use the new kernel." +elif [ "$ACTION" == "download" ]; then + echo "Downloading packages..." + mkdir -p ./kernel-packages-"$KERNEL_VERSION" + cd ./kernel-packages-"$KERNEL_VERSION" + apt download "${PACKAGES[@]}" + echo "Download complete. Packages saved in $(pwd)." 
+else + echo "Invalid action: $ACTION" + echo "Allowed values: install | download" + exit 1 +fi diff --git a/vllm/tools/platform/installation/install_apt_from_file.sh b/vllm/tools/platform/installation/install_apt_from_file.sh new file mode 100755 index 0000000..150ba00 --- /dev/null +++ b/vllm/tools/platform/installation/install_apt_from_file.sh @@ -0,0 +1,30 @@ +#!/bin/bash +set -euo pipefail + +# Check if package file is given +if [ $# -ne 1 ]; then + echo "Usage: $0 PACKAGE_FILE" + exit 1 +fi + +PACKAGE_FILE="$1" + +if [ ! -f "$PACKAGE_FILE" ]; then + echo "Error: File '$PACKAGE_FILE' not found." + exit 2 +fi + +# Refresh package index +echo "[INFO] Updating APT package index..." +apt update + +while IFS= read -r line || [[ -n "$line" ]]; do + # Skip empty lines or comments + [[ -z "$line" || "$line" =~ ^# ]] && continue + + # line format: name=version + echo "[INFO] Installing package: $line" + apt install -y --allow-downgrades --allow-change-held-packages "$line" +done < "$PACKAGE_FILE" + +echo "[INFO] All packages installed successfully." diff --git a/vllm/tools/platform/installation/install_gpu_fw.sh b/vllm/tools/platform/installation/install_gpu_fw.sh new file mode 100755 index 0000000..20bee9d --- /dev/null +++ b/vllm/tools/platform/installation/install_gpu_fw.sh @@ -0,0 +1,22 @@ +WORK_DIR=/tmp/multi-arc +mkdir -p $WORK_DIR + +echo -e "\n[INFO] Downloading and installing GPU firmware..." +FIRMWARE_DIR=$WORK_DIR/firmware +mkdir -p "$FIRMWARE_DIR" +cd "$FIRMWARE_DIR" +rm -rf ./* +wget https://gitlab.com/kernel-firmware/linux-firmware/-/raw/main/xe/bmg_guc_70.bin +wget https://gitlab.com/kernel-firmware/linux-firmware/-/raw/main/xe/bmg_huc.bin +zstd -1 bmg_guc_70.bin -o bmg_guc_70.bin.zst +zstd -1 bmg_huc.bin -o bmg_huc.bin.zst + +if [ -d /lib/firmware/xe ]; then + cp *.zst /lib/firmware/xe +else + echo "[ERROR] /lib/firmware/xe does not exist. Ensure your system supports Xe firmware." 
+ exit 1 +fi + +update-initramfs -u +echo -e "GPU firmware updated successfully; please reboot to apply changes!" diff --git a/vllm/tools/platform/installation/install_kernel.sh b/vllm/tools/platform/installation/install_kernel.sh new file mode 100755 index 0000000..fd9df5f --- /dev/null +++ b/vllm/tools/platform/installation/install_kernel.sh @@ -0,0 +1,79 @@ +#!/bin/bash +set -euo pipefail + +trap 'echo -e "\033[1;31m[ERROR]\033[0m Command failed on line $LINENO: $BASH_COMMAND" >&2' ERR + +# === Require root === +if [[ "$EUID" -ne 0 ]]; then + echo "This script must be run as root. Please run with sudo or as root user." + exit 1 +fi + +if [ ! -d "./tools" ] || [ ! -d "./scripts" ]; then + echo -e "\033[1;31m[ERROR]\033[0m This script must be run from the root directory of the installation package." >&2 + echo "Expected directories: ./tools and ./scripts" + exit 1 +fi + +# === Config === +TARGET_VERSION="6.14.0-1006-intel" +SUBMENU_TITLE="Advanced options for Ubuntu" +MENUENTRY_TITLE="Ubuntu, with Linux $TARGET_VERSION" +DEFAULT_FILE="/etc/default/grub" +GRUB_CFG="/boot/grub/grub.cfg" + +# === Check running kernel === +CURRENT_VERSION="$(uname -r)" +echo "Current running kernel: $CURRENT_VERSION" +echo "Target kernel version: $TARGET_VERSION" + +if [[ "$CURRENT_VERSION" == "$TARGET_VERSION" ]]; then + echo "✅ Already running the target kernel. No changes needed." + exit 0 +fi + +# === Check if target kernel is installed === +if dpkg -s "linux-image-${TARGET_VERSION}" >/dev/null 2>&1; then + echo "Kernel ${TARGET_VERSION} is already installed, skipping..." +else + echo "⚠️ Target kernel is not installed. Installing..." + dpkg -i kernel/*.deb + echo "✅ Kernel $TARGET_VERSION installed successfully." +fi + +# === Check GRUB top-level menu for target kernel === +echo "🔍 Checking if current GRUB default kernel is '$TARGET_VERSION'..." 
+ +FOUND_IN_TOP=0 + +while IFS= read -r line || [[ -n "$line" ]]; do + clean_line="$(echo "$line" | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')" + + if [[ "$clean_line" =~ ^submenu ]]; then + break + fi + + if [[ "$clean_line" =~ ^menuentry[[:space:]]\'Ubuntu,[[:space:]]with[[:space:]]Linux[[:space:]]$TARGET_VERSION ]]; then + FOUND_IN_TOP=1 + echo "✅ Found target kernel in top-level GRUB menu. No update needed." + break + fi +done < "$GRUB_CFG" + +# === Set default GRUB entry if not in top menu === +if [[ "$FOUND_IN_TOP" -ne 1 ]]; then + echo "⚙️ Setting default GRUB entry to: $SUBMENU_TITLE > $MENUENTRY_TITLE" + grub-set-default "$SUBMENU_TITLE>$MENUENTRY_TITLE" + + # === Ensure GRUB uses 'saved' as default === + if grep -q '^GRUB_DEFAULT=' "$DEFAULT_FILE"; then + sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' "$DEFAULT_FILE" + else + echo 'GRUB_DEFAULT=saved' >> "$DEFAULT_FILE" + fi + + echo "🔁 Updating GRUB configuration..." + update-grub +fi + +echo "💡 You may now reboot to use the new kernel." diff --git a/vllm/tools/platform/installer.sh b/vllm/tools/platform/installer.sh new file mode 100755 index 0000000..0aab3f8 --- /dev/null +++ b/vllm/tools/platform/installer.sh @@ -0,0 +1,197 @@ +#!/bin/bash +# ============================================================================== +# Intel Multi-ARC Base Platform Offline Installer Script +# ------------------------------------------------------------------------------ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# Author: James Tang +# Contact: jun.tang@intel.com +# Created: 2025-07-27 +# ============================================================================== + +set -euo pipefail + +trap 'echo -e "\033[1;31m[ERROR]\033[0m Command failed on line $LINENO: $BASH_COMMAND" >&2' ERR + +# -------- Configuration -------- +TIMESTAMP=$(date +"%Y%m%d_%H%M%S") +LOGFILE="install_log_$TIMESTAMP.log" +WORK_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +SCRIPT_NAME=$(basename "$0") + +# -------- Color Output -------- +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +NC='\033[0m' # No Color + +log_info() { echo -e "${GREEN}[INFO]${NC} $1" | tee -a "$LOGFILE"; } +log_warn() { echo -e "${YELLOW}[WARN]${NC} $1" | tee -a "$LOGFILE"; } +log_error() { echo -e "${RED}[ERROR]${NC} $1" | tee -a "$LOGFILE" >&2; } + +# -------- Begin Logging -------- +echo "=== Installer Log: $TIMESTAMP ===" > "$LOGFILE" +log_info "Running script: $SCRIPT_NAME" +log_info "Working directory: $WORK_DIR" +log_info "Log file: $LOGFILE" + +# -------- Root Privileges -------- +if [[ "$EUID" -ne 0 ]]; then + log_error "This script must be run as root." + exit 1 +fi + +# -------- Script Location Check -------- +if [[ "$WORK_DIR" != "$(pwd)" ]]; then + log_error "Please run this script from its own directory: $WORK_DIR" + exit 1 +fi + +# -------- Docker Detection -------- +is_docker() { + if [ "${BUILD_ENV:-}" = "docker" ]; then return 0; fi + grep -qaE 'docker|kubepods|containerd' /proc/1/cgroup && return 0 + [[ "$(hostname)" =~ ^[0-9a-f]{12}$ ]] && return 0 + return 1 +} + +if is_docker; then + log_info "Detected Docker container environment." +else + log_info "Detected native host environment." +fi + +# -------- Unified .deb Installer -------- +install_deb_packages() { + local desc="$1" + shift + log_info "Installing $desc ..." 
+ dpkg -i "$@" 2>&1 | tee -a "$LOGFILE" +} + +# -------- oneAPI Installer -------- +install_oneapi() { + local installer="$1" + if [ -f "$installer" ]; then + log_info "Installing Intel oneAPI Base Toolkit..." + bash "$installer" -a -s --eula accept --install-dir=/opt/intel/oneapi \ + --components intel.oneapi.lin.dpcpp-ct:intel.oneapi.lin.dpcpp_dbg:intel.oneapi.lin.dpl:intel.oneapi.lin.tbb.devel:intel.oneapi.lin.ccl.devel:intel.oneapi.lin.dpcpp-cpp-compiler:intel.oneapi.lin.mkl.devel \ + 2>&1 | tee -a "$LOGFILE" + log_info "Intel oneAPI installed successfully." + else + log_error "oneAPI installer not found: $installer" + exit 1 + fi +} + +# -------- Install validated kernel (Host Only) -------- +if ! is_docker; then + log_info "Installing validated kernel..." + ./scripts/installation/install_kernel.sh 2>&1 | tee -a "$LOGFILE" +fi + +# -------- Disable IOMMU (Host Only) -------- +if ! is_docker; then + GRUB_FILE="/etc/default/grub" + if [ -f "$GRUB_FILE" ]; then + cp "$GRUB_FILE" "${GRUB_FILE}.bak" + sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT=.*/GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=off"/' "$GRUB_FILE" + update-grub 2>&1 | tee -a "$LOGFILE" + log_info "Disabled IOMMU in GRUB and updated configuration." + else + log_error "GRUB configuration not found at $GRUB_FILE." + exit 1 + fi +fi + +# -------- Disable Ubuntu Auto-Update (Host Only) -------- +if ! is_docker; then + log_info "Disabling Ubuntu auto-update to maintain a consistent environment." + ./scripts/config/disable_auto_update.sh 2>&1 | tee -a "$LOGFILE" +fi + +# -------- Disable AppArmor (Host Only) -------- +if ! is_docker; then + log_info "Disabling Ubuntu AppArmor to avoid unnecessary messages." + ./scripts/config/disable_apparmor.sh 2>&1 | tee -a "$LOGFILE" +fi + + +# -------- Install Firmware (Host Only) -------- +if ! is_docker; then + FIRMWARE_DIR="$WORK_DIR/firmware" + log_info "Installing GPU firmware from $FIRMWARE_DIR..."
+ + if [ -d "$FIRMWARE_DIR" ] && [ -d /lib/firmware/xe ]; then + cp "$FIRMWARE_DIR"/*.zst /lib/firmware/xe/ + update-initramfs -u 2>&1 | tee -a "$LOGFILE" + log_info "Firmware installed and initramfs updated." + else + log_error "Missing firmware source or target directory." + exit 1 + fi +fi + +cd "$WORK_DIR" + +# -------- Install Base Libraries -------- +install_deb_packages "base libraries" base/*.deb + +if ! is_docker; then + install_deb_packages "docker libraries" base/docker/*.deb +fi + +# -------- Install Graphics Drivers -------- +install_deb_packages "graphics base drivers" gfxdrv/base/*.deb +install_deb_packages "graphics GPGPU drivers" gfxdrv/gpgpu/*.deb + +if ! is_docker; then + install_deb_packages "graphics video drivers" gfxdrv/video/*.deb + install_deb_packages "graphics display drivers" gfxdrv/graphics/*.deb +fi + +# -------- Install Intel oneAPI Base Toolkit -------- +# Only install oneAPI in the native environment, since our Docker image is based on +# the oneAPI base image, which already has oneAPI installed +if ! is_docker; then + ONEAPI_DIR="/opt/intel/oneapi/2025.1" + ONEAPI_INSTALLER="$WORK_DIR/oneapi/intel-oneapi-base-toolkit-2025.1.3.7_offline.sh" + + if [ -d "$ONEAPI_DIR" ]; then + log_info "Intel oneAPI already installed at $ONEAPI_DIR. Skipping." + else + install_oneapi "$ONEAPI_INSTALLER" + fi +fi + +# -------- Install Evaluation Tools -------- +TOOLS_DIR=$WORK_DIR/tools +install_deb_packages "1ccl tool" "$TOOLS_DIR/1ccl/"*.deb || true +install_deb_packages "gemm tool" "$TOOLS_DIR/gemm/"*.deb || true +install_deb_packages "xpu-smi tool" "$TOOLS_DIR/xpu-smi/"*.deb || true + +cd "$WORK_DIR" + +# -------- Final Message -------- +log_info "Intel Multi-ARC base platform installation complete." + +if is_docker; then + log_info "Docker environment detected — reboot not required." +else + log_info "Please reboot the system to apply changes."
+fi + +echo -e "\n${GREEN}Tools installed:${NC} gemm / 1ccl / xpu-smi in /usr/bin" +echo -e "${GREEN}level-zero-tests:${NC} ./tools/level-zero-tests" +echo -e "${GREEN}Support scripts:${NC} ./scripts" +echo -e "${GREEN}Installation log:${NC} ./$LOGFILE"
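
The "Disable IOMMU" step in `installer.sh` overwrites the whole `GRUB_CMDLINE_LINUX_DEFAULT` value with `sed`, so any kernel parameters already configured on the host are dropped (the `.bak` copy is the only record of them). Below is a minimal sketch of an append-only alternative. The helper name `add_kernel_param` is hypothetical and not part of this change, and the sketch assumes the value is wrapped in double quotes, as in a stock Ubuntu `/etc/default/grub`.

```bash
#!/bin/bash
set -euo pipefail

# Hypothetical helper (not part of the installer): given one
# GRUB_CMDLINE_LINUX_DEFAULT="..." line, print the line with the
# parameter appended, unless it is already present.
add_kernel_param() {
    local grub_line="$1" param="$2"
    local current

    # Strip the variable name and the surrounding double quotes.
    current="${grub_line#GRUB_CMDLINE_LINUX_DEFAULT=\"}"
    current="${current%\"}"

    # Leave the line untouched when the parameter is already there.
    case " $current " in
        *" $param "*) printf '%s\n' "$grub_line"; return 0 ;;
    esac

    # Avoid a leading space when the existing value is empty.
    printf 'GRUB_CMDLINE_LINUX_DEFAULT="%s"\n' "${current:+$current }$param"
}

# Example: existing parameters are preserved.
add_kernel_param 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"' 'intel_iommu=off'
```

The helper only transforms a string, so it can be checked without touching `/etc/default/grub`. Wiring it into the installer would still require reading the current line from the file and writing the result back before `update-grub`.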