LLM Inference Performance Calculator

Performance Analysis

1. Introduction: Background & Motivation

The deployment of large-scale Mixture-of-Experts (MoE) models, such as DeepSeek-V3, represents a significant challenge in modern AI engineering. The design space for inference infrastructure is vast, involving complex trade-offs between Logical Architecture (Hyperparameters, Parallelism Strategies, Routing Algorithms) and Physical Hardware (Memory Bandwidth, Interconnect Topology, Power Constraints).

The Challenge

Engineers often face critical "What-If" questions that are expensive to test in production:

  • How does sequence length scaling impact the KV Cache memory wall? (See the sizing sketch after this list.)
  • Can we hide MoE All-to-All communication latency using DualPipe optimization?
  • What happens if we offload "Cold Experts" to system RAM or a neighbor node?
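
To make the first question concrete, the sketch below estimates the KV-cache footprint from model shape and sequence length. It assumes standard per-token K/V caching (MLA compresses this considerably); the interface and numbers are illustrative assumptions, not code from this repository.

    // Rough KV-cache sizing for standard attention; names and numbers are
    // illustrative, not taken from this repository.
    interface KvShape {
      numLayers: number;
      numKvHeads: number;      // KV heads (GQA groups), not query heads
      headDim: number;
      bytesPerElement: number; // 2 for FP16/BF16, 1 for FP8
    }

    // Per token, one K and one V vector are cached per layer per KV head.
    function kvBytesPerToken(s: KvShape): number {
      return 2 * s.numLayers * s.numKvHeads * s.headDim * s.bytesPerElement;
    }

    function kvCacheGiB(s: KvShape, seqLen: number, batch: number): number {
      return (kvBytesPerToken(s) * seqLen * batch) / 2 ** 30;
    }

    // A 70B-class dense model (80 layers, 8 KV heads, head dim 128, FP16)
    // at 32k context and batch 8 already needs ~80 GiB of KV cache alone.
    const shape: KvShape = { numLayers: 80, numKvHeads: 8, headDim: 128, bytesPerElement: 2 };
    console.log(kvCacheGiB(shape, 32_768, 8).toFixed(1), "GiB");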

The Solution

The LLM Inference Performance Calculator is a first-principles, interactive visualization tool designed to bridge the gap between logical model design and physical hardware constraints. Unlike simple calculators, this tool simulates the physics of inference, modeling latency, bandwidth saturation, and PCIe bottlenecks to provide real-time feedback on system performance. It allows Architects and Systems Engineers to explore the entire design space—from the chip level up to the cluster level—without needing access to physical hardware.
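
As a concrete flavor of what "simulating the physics" means, a minimal latency model compares compute time against memory-traffic time and takes the slower of the two (the roofline view). The function and hardware figures below are illustrative assumptions, not the tool's internal model.

    // Minimal roofline-style latency estimate: a kernel takes at least as long
    // as its FLOPs need on the compute units and its bytes need on HBM.
    // Hardware figures are nominal datasheet values; names are illustrative.
    interface Hardware {
      peakTflops: number; // dense BF16/FP16 TFLOPS
      hbmGBps: number;    // HBM bandwidth, GB/s
    }

    function kernelLatencyMs(flops: number, bytesMoved: number, hw: Hardware): number {
      const computeMs = (flops / (hw.peakTflops * 1e12)) * 1e3;
      const memoryMs = (bytesMoved / (hw.hbmGBps * 1e9)) * 1e3;
      return Math.max(computeMs, memoryMs); // the binding constraint wins
    }

    // Decode at small batch sizes moves far more bytes per FLOP than prefill,
    // which is why it usually lands on the memory-bound side of the roofline.
    const h100Like: Hardware = { peakTflops: 989, hbmGBps: 3350 };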


2. Key Features

🧠 Predefined Models & Presets

Instantly load industry-standard configurations to establish a baseline, then modify them to test custom hypotheses.

  • DeepSeek-V3 (671B MoE, MLA, Multi-Token Prediction)
  • Mixtral 8x7B (Sparse MoE, GQA)
  • Grok-1 (Large-scale MoE)
  • Qwen2.5-MoE (High-granularity experts)
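
As an example of what such a preset carries, the sketch below encodes a DeepSeek-V3-like entry. The values follow the published DeepSeek-V3 configuration; the field names and schema are illustrative assumptions, not this repository's actual preset format.

    // Illustrative preset shape; field names are assumptions, not the repo's schema.
    interface MoEPreset {
      name: string;
      totalParamsB: number;     // total parameters, in billions
      activeParamsB: number;    // parameters activated per token
      numLayers: number;
      numRoutedExperts: number; // N
      numActiveExperts: number; // K (top-k routed experts per token)
      numSharedExperts: number;
      attention: "MLA" | "GQA" | "MHA";
      multiTokenPrediction: boolean;
    }

    const deepseekV3: MoEPreset = {
      name: "DeepSeek-V3",
      totalParamsB: 671,
      activeParamsB: 37,
      numLayers: 61,
      numRoutedExperts: 256,
      numActiveExperts: 8,
      numSharedExperts: 1,
      attention: "MLA",
      multiTokenPrediction: true,
    };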

🛠️ Architecture & Pipeline Customization

Granular control over the logical inference pipeline. Differentiate between Prefill (Throughput-bound) and Decode (Latency-bound) stages with distinct parallelism strategies.

(Screenshot: Pipeline View)

  • Architecture Config: Adjust Layers, Expert Count (N), Active Experts (K), and Attention Types (MLA/GQA/MHA).
  • Parallelism Strategy: Configure Tensor Parallel (TP), Pipeline Parallel (PP), Sequence Parallel (SP), and Data Parallel (DP) independently for Prefill and Decode stages (see the sketch after this list).
  • Optimizations: Toggle Paged KV Cache, DualPipe (Compute-Comm Overlap), and Quantization (FP8/INT4).
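
Because the two stages have different binding constraints, a natural way to express this is one plan per stage with independent degrees, as sketched below. The types and example degrees are illustrative assumptions, not the tool's API.

    // Per-stage parallelism plan; the GPU count is the product TP x PP x DP.
    // Names and example degrees are illustrative, not this repository's API.
    interface ParallelismPlan {
      tensorParallel: number;    // TP: split each layer across GPUs
      pipelineParallel: number;  // PP: split layers into stages
      sequenceParallel: boolean; // SP: shard activations along the sequence dim
      dataParallel: number;      // DP: replicate the model and shard requests
    }

    interface StagePlans {
      prefill: ParallelismPlan; // throughput-bound: wide TP/SP pays off
      decode: ParallelismPlan;  // latency-bound: keep TP inside one NVLink domain
    }

    const gpusRequired = (p: ParallelismPlan) =>
      p.tensorParallel * p.pipelineParallel * p.dataParallel;

    const plan: StagePlans = {
      prefill: { tensorParallel: 8, pipelineParallel: 2, sequenceParallel: true,  dataParallel: 1 },
      decode:  { tensorParallel: 8, pipelineParallel: 1, sequenceParallel: false, dataParallel: 2 },
    };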

⚡ AI Infrastructure Configuration

Map your logical workload onto physical hardware topology to visualize bottlenecks in the "Physical View".

(Screenshot: Physical Architecture)

  • Compute: Select from NVIDIA H100, B200, A100, or generic SKUs. Configure Host CPUs (Sapphire Rapids, Emerald Rapids).
  • Networking: Toggle between InfiniBand and Ethernet (RoCE) scale-out fabrics. Adjust Scale-Up topology (NVLink V3/V4/V5).
  • Topology: Auto-calculate the required number of nodes based on memory capacity and pipeline depth.
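
A naive version of that auto-calculation takes whichever is larger: the node count needed to fit weights plus KV cache, or the node count implied by the parallelism degrees. This is an assumed heuristic for illustration, not necessarily the formula the tool uses.

    // Assumed node-sizing heuristic; inputs and names are illustrative.
    interface ClusterInput {
      modelMemGiB: number;      // weights after quantization
      kvCacheGiB: number;       // peak KV-cache footprint
      gpuMemGiB: number;        // 80 for H100, 192 for B200, ...
      gpusPerNode: number;      // typically 8
      tensorParallel: number;   // TP degree
      pipelineParallel: number; // PP degree
    }

    function requiredNodes(c: ClusterInput): number {
      // Capacity constraint: everything must fit in aggregate GPU memory.
      const byMemory = Math.ceil((c.modelMemGiB + c.kvCacheGiB) / (c.gpuMemGiB * c.gpusPerNode));
      // Topology constraint: one model replica needs TP x PP GPUs.
      const byTopology = Math.ceil((c.tensorParallel * c.pipelineParallel) / c.gpusPerNode);
      return Math.max(byMemory, byTopology);
    }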

🧪 Experimental Features: MemPool & NMC

Explore cutting-edge research concepts for next-generation inference systems.

(Screenshot: MemPool and NMC)

  • Memory Pooling (MemPool): Simulate Transparent Page Placement (TPP).
    • Define hierarchical storage tiers: VRAM $\to$ System RAM $\to$ Node Pool (NVMe) $\to$ Global Pool.
    • Test Predictive Prefetching vs. On-Demand paging policies.
    • Visualize the impact of "Expert Locality" on PCIe saturation.
  • Near Memory Computing (NMC):
    • Simulate offloading specific operations (Top-K Selection, Quantization, Sparse Attention) to NMC-enabled memory pool devices.
    • Evaluate latency reduction by processing data closer to where it lives.
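
The trade-off both features expose comes down to where the bytes live. The back-of-envelope sketch below prices fetching a cold expert from each tier; the bandwidth figures are nominal per-direction numbers and the tier names are illustrative, not identifiers from this repository.

    // Cost of pulling a cold expert from a slower tier; nominal bandwidths,
    // illustrative names.
    type Tier = "VRAM" | "SystemRAM" | "NodePoolNVMe" | "GlobalPool";

    const tierGBps: Record<Tier, number> = {
      VRAM: 3350,       // HBM3: already local, effectively free
      SystemRAM: 64,    // PCIe Gen5 x16, one direction
      NodePoolNVMe: 14, // a single Gen5 NVMe device
      GlobalPool: 50,   // 400 Gb/s scale-out fabric
    };

    function fetchLatencyMs(expertMB: number, tier: Tier): number {
      return (expertMB / 1024 / tierGBps[tier]) * 1e3;
    }

    // A DeepSeek-V3-scale routed expert is on the order of tens of MB in FP8;
    // ~44 MB over PCIe Gen5 costs roughly 0.7 ms, versus microseconds from HBM.
    // That gap is why Expert Locality and predictive prefetching matter.
    console.log(fetchLatencyMs(44, "SystemRAM").toFixed(2), "ms");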

3. Build Instructions

This project is built using React, TypeScript, Tailwind CSS, and Vite.

Prerequisites

  • Node.js (v18 or higher recommended)
  • npm or yarn

Installation Steps

  1. Clone the Repository

    git clone https://github.com/kevinyuan/llm-inference-perf-model.git
    cd llm-inference-perf-model
  2. Install Dependencies

    npm install
    # or
    yarn install
  3. Run Development Server

    Start the local development server with hot-reloading.

    npm run dev

    Open your browser to http://localhost:5173 (or the port shown in your terminal).

  4. Build for Production

    Generate static assets for deployment.

    npm run build

    The output will be in the dist/ directory.

License

MIT License. Free for educational and research use.
