The deployment of large-scale Mixture-of-Experts (MoE) models, such as DeepSeek-V3, represents a significant challenge in modern AI engineering. The design space for inference infrastructure is vast, involving complex trade-offs between Logical Architecture (Hyperparameters, Parallelism Strategies, Routing Algorithms) and Physical Hardware (Memory Bandwidth, Interconnect Topology, Power Constraints).
Engineers often face critical "What-If" questions that are expensive to test in production:
- How does sequence length scaling impact the KV Cache memory wall? (See the sizing sketch after this list.)
- Can we hide MoE All-to-All communication latency using DualPipe optimization?
- What happens if we offload "Cold Experts" to system RAM or a neighbor node?
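The first question above is mostly arithmetic: for standard attention, the KV cache grows linearly with sequence length and batch size. Below is a minimal TypeScript sketch of that scaling; the field names and example values are illustrative assumptions, not the calculator's internal schema.

```typescript
// Minimal KV-cache sizing sketch. All names and example values are illustrative.
interface KvCacheParams {
  layers: number;       // number of transformer layers
  kvHeads: number;      // KV heads (== query heads for MHA, fewer for GQA)
  headDim: number;      // per-head dimension
  bytesPerElem: number; // 2 for FP16/BF16, 1 for FP8
}

// Bytes of KV cache = 2 (K and V) * layers * kvHeads * headDim * seqLen * batch * bytes/elem
function kvCacheBytes(p: KvCacheParams, seqLen: number, batch: number): number {
  return 2 * p.layers * p.kvHeads * p.headDim * seqLen * batch * p.bytesPerElem;
}

// Example: a 61-layer GQA-style model at 128K context, batch 8, FP8 KV cache.
const bytes = kvCacheBytes({ layers: 61, kvHeads: 8, headDim: 128, bytesPerElem: 1 }, 128_000, 8);
console.log(`KV cache ≈ ${(bytes / 2 ** 30).toFixed(1)} GiB`); // grows linearly with seqLen and batch
```

MLA shrinks this footprint by caching a compressed latent per token instead of full per-head K/V, which is one reason attention type is a first-class knob in the architecture configuration below.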
The LLM Inference Performance Calculator is a first-principles, interactive visualization tool designed to bridge the gap between logical model design and physical hardware constraints. Unlike simple calculators, this tool simulates the physics of inference, modeling latency, bandwidth saturation, and PCIe bottlenecks to provide real-time feedback on system performance. It allows Architects and Systems Engineers to explore the entire design space—from the chip level up to the cluster level—without needing access to physical hardware.
Instantly load industry-standard configurations to establish a baseline, then modify them to test custom hypotheses.
- DeepSeek-V3 (671B MoE, MLA, Multi-Token Prediction)
- Mixtral 8x7B (Sparse MoE, GQA)
- Grok-1 (Large-Scale MoE)
- Qwen2.5-MoE (High-granularity experts)
Granular control over the logical inference pipeline. Differentiate between Prefill (Throughput-bound) and Decode (Latency-bound) stages with distinct parallelism strategies (see the configuration sketch after this list).
- Architecture Config: Adjust Layers, Expert Count (N), Active Experts (K), and Attention Types (MLA/GQA/MHA).
- Parallelism Strategy: Configure Tensor Parallel (TP), Pipeline Parallel (PP), Sequence Parallel (SP), and Data Parallel (DP) independently for Prefill and Decode stages.
- Optimizations: Toggle Paged KV Cache, DualPipe (Compute-Comm Overlap), and Quantization (FP8/INT4).
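As a rough illustration of how these knobs compose, the sketch below shows one possible TypeScript shape for a per-stage configuration. All field names and example values are assumptions for illustration, not the tool's actual schema.

```typescript
// Illustrative configuration shape: prefill and decode carry independent
// parallelism plans and optimization toggles. Not the tool's actual types.
type Attention = "MLA" | "GQA" | "MHA";

interface ParallelismPlan {
  tp: number; // tensor parallel degree
  pp: number; // pipeline parallel degree
  sp: number; // sequence parallel degree
  dp: number; // data parallel degree
}

interface StageConfig {
  parallelism: ParallelismPlan;
  pagedKvCache: boolean;
  dualPipe: boolean; // compute-communication overlap
  quantization: "FP8" | "INT4" | "none";
}

interface InferenceConfig {
  layers: number;
  expertsPerLayer: number; // N
  activeExperts: number;   // K (top-K routing)
  attention: Attention;
  prefill: StageConfig;    // throughput-bound stage
  decode: StageConfig;     // latency-bound stage
}

// Example: wide TP for prefill, deeper PP plus DualPipe overlap for decode.
const example: InferenceConfig = {
  layers: 61,
  expertsPerLayer: 256,
  activeExperts: 8,
  attention: "MLA",
  prefill: { parallelism: { tp: 8, pp: 2, sp: 8, dp: 1 }, pagedKvCache: true, dualPipe: false, quantization: "FP8" },
  decode:  { parallelism: { tp: 4, pp: 4, sp: 1, dp: 4 }, pagedKvCache: true, dualPipe: true,  quantization: "FP8" },
};
console.log(example.prefill.parallelism, example.decode.parallelism);
```

Keeping prefill and decode as separate objects is what allows, for instance, wide tensor parallelism during prefill while decode trades it for deeper pipelining with DualPipe overlap.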
Map your logical workload onto physical hardware topology to visualize bottlenecks in the "Physical View".
- Compute: Select from NVIDIA H100, B200, A100, or generic SKUs. Configure Host CPUs (Sapphire Rapids, Emerald Rapids).
- Networking: Toggle between InfiniBand and Ethernet (RoCE) scale-out fabrics. Adjust Scale-Up topology (NVLink V3/V4/V5).
- Topology: Auto-calculate the required number of nodes based on memory capacity and pipeline depth (a sketch of this estimate follows below).
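A minimal sketch of how such an estimate could work, assuming node count is driven by memory demand and then padded up to a multiple of the pipeline depth; the function name and rounding policy are illustrative, not the tool's exact logic.

```typescript
// Illustrative node-count estimate: memory-driven, padded to the pipeline depth.
function requiredNodes(
  modelBytes: number,    // weights after quantization
  kvCacheBytes: number,  // worst-case KV-cache footprint
  gpuMemBytes: number,   // usable HBM per GPU
  gpusPerNode: number,
  pipelineDepth: number, // PP degree; node count is padded to a multiple of it
): number {
  const perNodeBytes = gpuMemBytes * gpusPerNode;
  const byMemory = Math.ceil((modelBytes + kvCacheBytes) / perNodeBytes);
  // Pad so each pipeline stage maps onto a whole number of nodes.
  return Math.ceil(byMemory / pipelineDepth) * pipelineDepth;
}

// Example: ~700 GB of FP8 weights + 200 GB of KV cache on 8x80 GB nodes, PP = 4.
console.log(requiredNodes(700e9, 200e9, 80e9, 8, 4)); // -> 4
```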
Explore cutting-edge research concepts for next-generation inference systems.
- Memory Pooling (MemPool): Simulate Transparent Page Placement (TPP).
  - Define hierarchical storage tiers: VRAM $\to$ System RAM $\to$ Node Pool (NVMe) $\to$ Global Pool.
  - Test Predictive Prefetching vs. On-Demand paging policies.
  - Visualize the impact of "Expert Locality" on PCIe saturation (a tiering sketch follows this list).
- Near Memory Computing (NMC):
  - Simulate offloading specific operations (Top-K Selection, Quantization, Sparse Attention) to NMC-enabled memory pool devices.
  - Evaluate latency reduction by processing data closer to where it lives.
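As a rough sketch of how such a tier hierarchy can be modeled, the snippet below gives each tier an assumed capacity, bandwidth, and latency, then estimates the cost of paging in a cold expert. All names and figures are illustrative assumptions, not measured values or the tool's internal model.

```typescript
// Illustrative memory-tier model: fetching a cold expert pays the bandwidth
// and latency of whichever tier it currently lives in.
interface MemoryTier {
  name: "VRAM" | "SystemRAM" | "NodePool" | "GlobalPool";
  capacityGiB: number;
  bandwidthGBps: number; // assumed sustained read bandwidth
  latencyUs: number;     // assumed first-access latency
}

const tiers: MemoryTier[] = [
  { name: "VRAM",       capacityGiB: 80,    bandwidthGBps: 3350, latencyUs: 1 },   // HBM
  { name: "SystemRAM",  capacityGiB: 1024,  bandwidthGBps: 64,   latencyUs: 5 },   // reached over PCIe
  { name: "NodePool",   capacityGiB: 8192,  bandwidthGBps: 12,   latencyUs: 100 }, // NVMe
  { name: "GlobalPool", capacityGiB: 65536, bandwidthGBps: 25,   latencyUs: 500 }, // scale-out fabric
];

// Time to page one expert's weights in from a given tier.
function expertFetchMs(expertGiB: number, tier: MemoryTier): number {
  const transferMs = (expertGiB / tier.bandwidthGBps) * 1000; // treating GiB ≈ GB for a rough estimate
  return tier.latencyUs / 1000 + transferMs;
}

// Example: a 2.5 GiB expert resident in system RAM vs. the NVMe node pool.
console.log(expertFetchMs(2.5, tiers[1]).toFixed(1), "ms from system RAM");
console.log(expertFetchMs(2.5, tiers[2]).toFixed(1), "ms from NVMe node pool");
```

Comparisons like these are what make "Expert Locality" matter: routing patterns that keep hot experts in VRAM avoid paying these transfer costs on the decode critical path.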
This project is built using React, TypeScript, Tailwind CSS, and Vite.
- Node.js (v18 or higher recommended)
- npm or yarn
- Clone the Repository

  ```bash
  git clone https://github.com/kevinyuan/llm-inference-perf-model.git
  cd llm-inference-perf-model
  ```

- Install Dependencies

  ```bash
  npm install
  # or
  yarn install
  ```

- Run Development Server: Start the local development server with hot-reloading.

  ```bash
  npm run dev
  ```

  Open your browser to `http://localhost:5173` (or the port shown in your terminal).

- Build for Production: Generate static assets for deployment.

  ```bash
  npm run build
  ```

  The output will be in the `dist/` directory.
MIT License. Free for educational and research use.



