LLM Inference Performance Calculator

Performance Analysis

1. Introduction: Background & Motivation

The deployment of large-scale Mixture-of-Experts (MoE) models, such as DeepSeek-V3, represents a significant challenge in modern AI engineering. The design space for inference infrastructure is vast, involving complex trade-offs between Logical Architecture (Hyperparameters, Parallelism Strategies, Routing Algorithms) and Physical Hardware (Memory Bandwidth, Interconnect Topology, Power Constraints).

The Challenge

Engineers often face critical "What-If" questions that are expensive to test in production:

  • How does sequence length scaling impact the KV Cache memory wall? (See the sizing sketch after this list.)
  • Can we hide MoE All-to-All communication latency using DualPipe optimization?
  • What happens if we offload "Cold Experts" to system RAM or a neighbor node?
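
To make the first question concrete, the sketch below estimates the KV-cache footprint from model shape and sequence length. It assumes standard per-token K/V caching (MLA compresses this considerably); the interface and numbers are illustrative assumptions, not code from this repository.

    // Rough KV-cache sizing for standard attention; names and numbers are
    // illustrative, not taken from this repository.
    interface KvShape {
      numLayers: number;
      numKvHeads: number;      // KV heads (GQA groups), not query heads
      headDim: number;
      bytesPerElement: number; // 2 for FP16/BF16, 1 for FP8
    }

    // Per token, one K and one V vector are cached per layer per KV head.
    function kvBytesPerToken(s: KvShape): number {
      return 2 * s.numLayers * s.numKvHeads * s.headDim * s.bytesPerElement;
    }

    function kvCacheGiB(s: KvShape, seqLen: number, batch: number): number {
      return (kvBytesPerToken(s) * seqLen * batch) / 2 ** 30;
    }

    // A 70B-class dense model (80 layers, 8 KV heads, head dim 128, FP16)
    // at 32k context and batch 8 already needs ~80 GiB of KV cache alone.
    const shape: KvShape = { numLayers: 80, numKvHeads: 8, headDim: 128, bytesPerElement: 2 };
    console.log(kvCacheGiB(shape, 32_768, 8).toFixed(1), "GiB");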

The Solution

The LLM Inference Performance Calculator is a first-principles, interactive visualization tool designed to bridge the gap between logical model design and physical hardware constraints. Unlike simple calculators, this tool simulates the physics of inference, modeling latency, bandwidth saturation, and PCIe bottlenecks to provide real-time feedback on system performance. It allows Architects and Systems Engineers to explore the entire design space—from the chip level up to the cluster level—without needing access to physical hardware.
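
As a concrete flavor of what "simulating the physics" means, a minimal latency model compares compute time against memory-traffic time and takes the slower of the two (the roofline view). The function and hardware figures below are illustrative assumptions, not the tool's internal model.

    // Minimal roofline-style latency estimate: a kernel takes at least as long
    // as its FLOPs need on the compute units and its bytes need on HBM.
    // Hardware figures are nominal datasheet values; names are illustrative.
    interface Hardware {
      peakTflops: number; // dense BF16/FP16 TFLOPS
      hbmGBps: number;    // HBM bandwidth, GB/s
    }

    function kernelLatencyMs(flops: number, bytesMoved: number, hw: Hardware): number {
      const computeMs = (flops / (hw.peakTflops * 1e12)) * 1e3;
      const memoryMs = (bytesMoved / (hw.hbmGBps * 1e9)) * 1e3;
      return Math.max(computeMs, memoryMs); // the binding constraint wins
    }

    // Decode at small batch sizes moves far more bytes per FLOP than prefill,
    // which is why it usually lands on the memory-bound side of the roofline.
    const h100Like: Hardware = { peakTflops: 989, hbmGBps: 3350 };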


2. Key Features

🧠 Predefined Models & Presets

Instantly load industry-standard configurations to establish a baseline, then modify them to test custom hypotheses.

  • DeepSeek-V3 (671B MoE, MLA, Multi-Token Prediction)
  • Mixtral 8x7B (Sparse MoE, GQA)
  • Grok-1 (Large-scale MoE)
  • Qwen2.5-MoE (High-granularity experts)
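
As an example of what such a preset carries, the sketch below encodes a DeepSeek-V3-like entry. The values follow the published DeepSeek-V3 configuration; the field names and schema are illustrative assumptions, not this repository's actual preset format.

    // Illustrative preset shape; field names are assumptions, not the repo's schema.
    interface MoEPreset {
      name: string;
      totalParamsB: number;     // total parameters, in billions
      activeParamsB: number;    // parameters activated per token
      numLayers: number;
      numRoutedExperts: number; // N
      numActiveExperts: number; // K (top-k routed experts per token)
      numSharedExperts: number;
      attention: "MLA" | "GQA" | "MHA";
      multiTokenPrediction: boolean;
    }

    const deepseekV3: MoEPreset = {
      name: "DeepSeek-V3",
      totalParamsB: 671,
      activeParamsB: 37,
      numLayers: 61,
      numRoutedExperts: 256,
      numActiveExperts: 8,
      numSharedExperts: 1,
      attention: "MLA",
      multiTokenPrediction: true,
    };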

🛠️ Architecture & Pipeline Customization

Granular control over the logical inference pipeline. Differentiate between Prefill (Throughput-bound) and Decode (Latency-bound) stages with distinct parallelism strategies.

(Screenshot: Pipeline View)

  • Architecture Config: Adjust Layers, Expert Count (N), Active Experts (K), and Attention Types (MLA/GQA/MHA).
  • Parallelism Strategy: Configure Tensor Parallel (TP), Pipeline Parallel (PP), Sequence Parallel (SP), and Data Parallel (DP) independently for Prefill and Decode stages (see the sketch after this list).
  • Optimizations: Toggle Paged KV Cache, DualPipe (Compute-Comm Overlap), and Quantization (FP8/INT4).
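
Because the two stages have different binding constraints, a natural way to express this is one plan per stage with independent degrees, as sketched below. The types and example degrees are illustrative assumptions, not the tool's API.

    // Per-stage parallelism plan; the GPU count is the product TP x PP x DP.
    // Names and example degrees are illustrative, not this repository's API.
    interface ParallelismPlan {
      tensorParallel: number;    // TP: split each layer across GPUs
      pipelineParallel: number;  // PP: split layers into stages
      sequenceParallel: boolean; // SP: shard activations along the sequence dim
      dataParallel: number;      // DP: replicate the model and shard requests
    }

    interface StagePlans {
      prefill: ParallelismPlan; // throughput-bound: wide TP/SP pays off
      decode: ParallelismPlan;  // latency-bound: keep TP inside one NVLink domain
    }

    const gpusRequired = (p: ParallelismPlan) =>
      p.tensorParallel * p.pipelineParallel * p.dataParallel;

    const plan: StagePlans = {
      prefill: { tensorParallel: 8, pipelineParallel: 2, sequenceParallel: true,  dataParallel: 1 },
      decode:  { tensorParallel: 8, pipelineParallel: 1, sequenceParallel: false, dataParallel: 2 },
    };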

⚡ AI Infrastructure Configuration

Map your logical workload onto physical hardware topology to visualize bottlenecks in the "Physical View".

(Screenshot: Physical Architecture)

  • Compute: Select from NVIDIA H100, B200, A100, or generic SKUs. Configure Host CPUs (Sapphire Rapids, Emerald Rapids).
  • Networking: Toggle between InfiniBand and Ethernet (RoCE) scale-out fabrics. Adjust Scale-Up topology (NVLink V3/V4/V5).
  • Topology: Auto-calculate the required number of nodes based on memory capacity and pipeline depth.
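
A naive version of that auto-calculation takes whichever is larger: the node count needed to fit weights plus KV cache, or the node count implied by the parallelism degrees. This is an assumed heuristic for illustration, not necessarily the formula the tool uses.

    // Assumed node-sizing heuristic; inputs and names are illustrative.
    interface ClusterInput {
      modelMemGiB: number;      // weights after quantization
      kvCacheGiB: number;       // peak KV-cache footprint
      gpuMemGiB: number;        // 80 for H100, 192 for B200, ...
      gpusPerNode: number;      // typically 8
      tensorParallel: number;   // TP degree
      pipelineParallel: number; // PP degree
    }

    function requiredNodes(c: ClusterInput): number {
      // Capacity constraint: everything must fit in aggregate GPU memory.
      const byMemory = Math.ceil((c.modelMemGiB + c.kvCacheGiB) / (c.gpuMemGiB * c.gpusPerNode));
      // Topology constraint: one model replica needs TP x PP GPUs.
      const byTopology = Math.ceil((c.tensorParallel * c.pipelineParallel) / c.gpusPerNode);
      return Math.max(byMemory, byTopology);
    }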

🧪 Experimental Features: MemPool & NMC

Explore cutting-edge research concepts for next-generation inference systems.

(Screenshot: MemPool and NMC)

  • Memory Pooling (MemPool): Simulate Transparent Page Placement (TPP).
    • Define hierarchical storage tiers: VRAM $\to$ System RAM $\to$ Node Pool (NVMe) $\to$ Global Pool.
    • Test Predictive Prefetching vs. On-Demand paging policies.
    • Visualize the impact of "Expert Locality" on PCIe saturation.
  • Near Memory Computing (NMC):
    • Simulate offloading specific operations (Top-K Selection, Quantization, Sparse Attention) to NMC-enabled memory pool devices.
    • Evaluate latency reduction by processing data closer to where it lives.
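
The trade-off both features expose comes down to where the bytes live. The back-of-envelope sketch below prices fetching a cold expert from each tier; the bandwidth figures are nominal per-direction numbers and the tier names are illustrative, not identifiers from this repository.

    // Cost of pulling a cold expert from a slower tier; nominal bandwidths,
    // illustrative names.
    type Tier = "VRAM" | "SystemRAM" | "NodePoolNVMe" | "GlobalPool";

    const tierGBps: Record<Tier, number> = {
      VRAM: 3350,       // HBM3: already local, effectively free
      SystemRAM: 64,    // PCIe Gen5 x16, one direction
      NodePoolNVMe: 14, // a single Gen5 NVMe device
      GlobalPool: 50,   // 400 Gb/s scale-out fabric
    };

    function fetchLatencyMs(expertMB: number, tier: Tier): number {
      return (expertMB / 1024 / tierGBps[tier]) * 1e3;
    }

    // A DeepSeek-V3-scale routed expert is on the order of tens of MB in FP8;
    // ~44 MB over PCIe Gen5 costs roughly 0.7 ms, versus microseconds from HBM.
    // That gap is why Expert Locality and predictive prefetching matter.
    console.log(fetchLatencyMs(44, "SystemRAM").toFixed(2), "ms");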

3. Build Instructions

This project is built using React, TypeScript, Tailwind CSS, and Vite.

Prerequisites

  • Node.js (v18 or higher recommended)
  • npm or yarn

Installation Steps

  1. Clone the Repository

    git clone https://github.com/kevinyuan/llm-inference-perf-model.git
    cd llm-inference-perf-model
  2. Install Dependencies

    npm install
    # or
    yarn install
  3. Run Development Server

    Start the local development server with hot-reloading.

    npm run dev

    Open your browser to http://localhost:5173 (or the port shown in your terminal).

  4. Build for Production

    Generate static assets for deployment.

    npm run build

    The output will be in the dist/ directory.

License

MIT License. Free for educational and research use.
