# Part 4, Lab 7: Production Integration

**Time:** ~30 minutes

This lab applies what you have learned to production inference systems. It covers serving models with vLLM and compiling them with TensorRT-LLM for optimized inference.

## Learning Objectives

1. Deploy models with vLLM
2. Understand TensorRT-LLM optimization
3. Configure quantization in production
4. Benchmark and profile

---
## 1. vLLM Quick Start

vLLM provides high-throughput serving with PagedAttention.
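
To make the idea behind PagedAttention more concrete, here is a toy sketch of block-based KV-cache bookkeeping: each sequence maps its tokens onto fixed-size blocks taken from a shared pool, so memory is claimed on demand instead of being reserved for the maximum sequence length. The class `ToyBlockAllocator`, its methods, and the block size of 16 are illustrative assumptions; this is not vLLM's implementation, only the bookkeeping idea.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlockAllocator:
    """Toy KV-cache block table (hypothetical; not vLLM's actual code)."""
    num_blocks: int
    block_size: int = 16  # tokens per block (assumed value)
    free_blocks: list = field(default_factory=list)
    block_tables: dict = field(default_factory=dict)  # seq_id -> [block ids]
    seq_lens: dict = field(default_factory=dict)      # seq_id -> token count

    def __post_init__(self):
        # Every physical block starts out in the shared free pool.
        self.free_blocks = list(range(self.num_blocks))

    def append_token(self, seq_id: str) -> int:
        """Reserve space for one more token; return its physical block id."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # first token, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = ToyBlockAllocator(num_blocks=8)
for _ in range(20):   # 20 tokens need ceil(20 / 16) = 2 blocks
    alloc.append_token("req-A")
for _ in range(5):    # 5 tokens need 1 block
    alloc.append_token("req-B")
print(len(alloc.block_tables["req-A"]), len(alloc.block_tables["req-B"]))
alloc.free("req-A")   # req-A's blocks become available to other requests
print(len(alloc.free_blocks))
```

Because blocks are handed out one at a time, a short request never holds memory that a longer request could use, which is what lets a server pack more concurrent sequences into the same KV-cache budget.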

In [None]:
# Installation (uncomment to install)
# !pip install vllm

# Note: vLLM requires GPU. This cell demonstrates the API.
vllm_example = '''
from vllm import LLM, SamplingParams

# Load model with default settings
llm = LLM(model="meta-llama/Llama-2-7b-hf")

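# Or with quantization (load only one of the two; each LLM reserves GPU memory).
# AWQ and GPTQ need a checkpoint that was already quantized in that format.
llm_quantized = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",  # Options vary by vLLM version, e.g. awq, gptq, fp8
    dtype="float16",
    gpu_memory_utilization=0.9,
)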

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=256,
)

# Generate
prompts = [
    "Explain GPU memory hierarchy in one paragraph:",
    "What is FlashAttention?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Output: {output.outputs[0].text}")
    print()
'''

print("vLLM Usage Example:")
print(vllm_example)

---
## 2. vLLM Server Mode

vLLM can run as an OpenAI-compatible API server.

In [None]:
server_commands = '''
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \\
    --model meta-llama/Llama-2-7b-hf \\
    --tensor-parallel-size 1 \\
    --dtype float16 \\
    --port 8000

# With quantization (alternative to the command above; the checkpoint is pre-quantized with AWQ)
python -m vllm.entrypoints.openai.api_server \\
    --model TheBloke/Llama-2-7B-AWQ \\
    --quantization awq \\
    --port 8000

# Client usage (standard OpenAI API); "model" must match the --model the server was started with
curl http://localhost:8000/v1/completions \\
    -H "Content-Type: application/json" \\
    -d '{
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": "Hello, world!",
        "max_tokens": 100,
        "temperature": 0.7
    }'
'''

print("vLLM Server Commands:")
print(server_commands)
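
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the `curl` call above; the helper names `build_completion_request` and `complete` are hypothetical, and the response parsing assumes the standard OpenAI completions shape (`choices[0].text`). Building the request is separated from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(request, timeout=60):
    """Send the request and return the first completion's text.

    Needs a running server, so it is not called in this example.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


request = build_completion_request(
    base_url="http://localhost:8000",
    model="meta-llama/Llama-2-7b-hf",
    prompt="Hello, world!",
)
print(request.get_method(), request.full_url)
print(json.loads(request.data))
# With the server running: print(complete(request))
```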

---
## 3. TensorRT-LLM Overview

TensorRT-LLM provides maximum performance on NVIDIA GPUs through compilation.

In [None]:
trtllm_workflow = '''
# TensorRT-LLM Workflow:

# 1. Install TensorRT-LLM
pip install tensorrt-llm

# 2. Convert model to TensorRT-LLM format
# (convert_checkpoint.py lives in the TensorRT-LLM repo under examples/llama)
python convert_checkpoint.py \\
    --model_dir meta-llama/Llama-2-7b-hf \\
    --output_dir ./llama2-7b-trtllm \\
    --dtype float16

# With INT8 weight-only quantization and an INT8 KV cache
python convert_checkpoint.py \\
    --model_dir meta-llama/Llama-2-7b-hf \\
    --output_dir ./llama2-7b-int8 \\
    --dtype float16 \\
    --int8_kv_cache \\
    --use_weight_only \\
    --weight_only_precision int8

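# 3. Build TensorRT engine
# (flag names change between releases; confirm with trtllm-build --help)
trtllm-build \\
    --checkpoint_dir ./llama2-7b-trtllm \\
    --output_dir ./llama2-7b-engine \\
    --gemm_plugin float16 \\
    --max_batch_size 64 \\
    --max_input_len 2048 \\
    --max_output_len 512

# 4. Run inference (run.py is in the repo's examples directory)
python run.py \\
    --engine_dir ./llama2-7b-engine \\
    --tokenizer_dir meta-llama/Llama-2-7b-hf \\
    --max_output_len 128 \\
    --input_text "Hello, world!"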
'''

print("TensorRT-LLM Workflow:")
print(trtllm_workflow)

In [None]:
# TensorRT-LLM Python API example
trtllm_python = '''
from tensorrt_llm import LLM, SamplingParams

# Load compiled engine
llm = LLM(model="./llama2-7b-engine")

# Or directly from HuggingFace (auto-compile).
# This rebinds llm, so keep only one of the two loads in real code.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Generate with sampling
outputs = llm.generate(
    ["What is the capital of France?"],
    sampling_params=SamplingParams(
        temperature=0.7,
        top_p=0.95,
        max_tokens=100
    )
)

for output in outputs:
    print(output.outputs[0].text)
'''

print("TensorRT-LLM Python API:")
print(trtllm_python)

---
## 4. Quantization Configuration

Different quantization strategies for different use cases.

In [None]:
quantization_configs = {
    "Maximum Quality (FP16)": {
        "vllm": "--dtype float16",
        "trtllm": "--dtype float16",
        "use_case": "When accuracy is critical",
        "memory": "1x baseline"
    },
    "Balanced (FP8 KV Cache)": {
        "vllm": "--dtype float16 --kv-cache-dtype fp8",
        "trtllm": "--dtype float16 --fp8_kv_cache",
        "use_case": "H100, best quality/throughput ratio",
        "memory": "KV cache ~0.5x, weights unchanged"
    },
    "Weight INT8 (W8A16)": {
        # SqueezeLLM is a 3/4-bit scheme; for 8-bit weights, load an 8-bit GPTQ checkpoint
        "vllm": "--quantization gptq",
        "trtllm": "--use_weight_only --weight_only_precision int8",
        "use_case": "Memory-constrained, good quality",
        "memory": "~0.5x (weights only)"
    },
    "Weight INT4 (W4A16)": {
        "vllm": "--quantization awq",  # or gptq; needs a pre-quantized checkpoint
        "trtllm": "--weight_only_precision int4_awq",
        "use_case": "Maximum memory savings",
        "memory": "~0.3x (weights only)"
    },
    "Full INT8 (W8A8)": {
        # Support depends on the vLLM version and needs a pre-quantized W8A8 checkpoint
        "vllm": "pre-quantized W8A8 checkpoint (version-dependent)",
        # SmoothQuant flags of convert_checkpoint.py; requires calibration
        "trtllm": "--smoothquant 0.5 --per_token --per_channel --int8_kv_cache",
        "use_case": "Maximum throughput, calibration needed",
        "memory": "~0.5x"
    },
    "FP8 (W8A8 FP8)": {
        "vllm": "--quantization fp8 --kv-cache-dtype fp8",
        # Flags of the repo's quantize.py example script
        "trtllm": "quantize.py --qformat fp8 --kv_cache_dtype fp8",
        "use_case": "H100/Ada, near-FP16 quality, up to ~2x peak matmul throughput",
        "memory": "~0.5x"
    },
}

print("Quantization Configuration Guide:")
print("=" * 80)
for name, config in quantization_configs.items():
    print(f"\n{name}")
    print(f"  Use case: {config['use_case']}")
    print(f"  Memory:   {config['memory']}")
    print(f"  vLLM:     {config['vllm']}")
    print(f"  TRT-LLM:  {config['trtllm']}")
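
The memory ratios in this guide follow from simple arithmetic on bytes per value. The sketch below estimates weight storage and KV-cache size per token; the helper names are hypothetical, and the model shape (about 6.74 billion parameters, 32 layers, 32 KV heads, head dimension 128) is an assumed Llama-2-7B configuration. Quantization scales and zero-points are ignored, which is why 4-bit weights come out at 0.25x here rather than the ~0.3x quoted above.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring scales/zero-points."""
    return num_params * bits_per_weight / 8 / 1e9


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             bytes_per_value: int) -> int:
    """KV-cache bytes per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama-2-7B shape (not taken from the lab text).
NUM_PARAMS = 6.74e9
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{label}: {gb:.2f} GB weights ({bits / 16:.2f}x of FP16)")

for label, nbytes in [("FP16 KV", 2), ("FP8 KV", 1)]:
    per_token = kv_cache_bytes_per_token(32, 32, 128, nbytes)
    print(f"{label}: {per_token / 1024:.0f} KiB per token, "
          f"{per_token * 4096 / 2**30:.1f} GiB for 4096 tokens")
```

Under these assumptions the FP16 weights take about 13.5 GB, and a single 4096-token sequence needs 2 GiB of KV cache in FP16 or 1 GiB in FP8, which is why KV-cache quantization matters most for long contexts and large batches.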

---
## 5. Benchmarking

Key metrics for LLM inference performance.

In [None]:
benchmark_guide = '''
Key Metrics for LLM Inference:

1. Time to First Token (TTFT)
   - Latency from request to first output token
   - Critical for interactive applications
   - Dominated by prompt processing (prefill)

2. Time Per Output Token (TPOT)
   - Latency per generated token
   - Determines streaming speed
   - Often memory-bandwidth bound

3. Throughput (tokens/second)
   - Total tokens generated per second
   - Maximize with batching
   - Trade-off with latency

4. Memory Usage
   - Model weights + KV cache + activations
   - Limits batch size and context length
   - Quantization reduces requirements

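Benchmarking Commands:

# vLLM benchmark (script is in the vLLM repo's benchmarks directory;
# it sends requests to an already running server)
python benchmark_serving.py \\
    --model meta-llama/Llama-2-7b-hf \\
    --num-prompts 1000 \\
    --request-rate 10

# TensorRT-LLM benchmark (script names and flags differ across releases;
# check the benchmarks directory of your TensorRT-LLM version)
python benchmark.py \\
    --engine_dir ./llama2-7b-engine \\
    --batch_size 32 \\
    --input_len 128 \\
    --output_len 128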
'''

print(benchmark_guide)
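
The first three metrics can be computed from per-token arrival timestamps. The following is a minimal sketch under that assumption; `RequestTrace` and the three helper functions are hypothetical names, and the timestamps are synthetic rather than measured. TPOT is taken as the mean gap between output tokens after the first one, which is one common convention.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps (seconds) recorded by a hypothetical streaming client."""
    sent_at: float
    token_times: list  # arrival time of each output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: request sent until first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Mean time per output token after the first one."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("need at least two tokens to compute TPOT")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: list) -> float:
    """Output tokens per second across all requests in the window."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


# Synthetic example: two overlapping requests, five tokens each.
traces = [
    RequestTrace(sent_at=0.0, token_times=[0.25, 0.5, 0.75, 1.0, 1.25]),
    RequestTrace(sent_at=0.5, token_times=[1.0, 1.25, 1.5, 1.75, 2.0]),
]
for t in traces:
    print(f"TTFT={ttft(t):.2f}s  TPOT={tpot(t):.2f}s")
print(f"Throughput={throughput(traces):.1f} tokens/s")
```

Note that throughput is measured over the whole window, so overlapping requests raise it even though each request's TPOT stays the same; this is the batching trade-off described in the guide.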

---
## 6. Decision Framework

Choosing the right framework and configuration.

In [None]:
decision_tree = '''
Framework Selection Decision Tree:

1. What hardware?
   ├── NVIDIA GPU → Continue
   ├── AMD GPU → vLLM (ROCm support)
   └── CPU/Edge → llama.cpp

2. Priority?
   ├── Maximum throughput → TensorRT-LLM
   ├── Easy deployment → vLLM
   └── Research/flexibility → vLLM or HF Transformers

3. Model size vs GPU memory?
   ├── Fits in FP16 → Start with FP16
   ├── Tight fit → FP8 or INT8 KV cache
   └── Doesn't fit → Weight quantization (INT4/INT8)

4. Latency requirements?
   ├── Interactive (<100ms TTFT) → Small batch, high priority
   ├── Batch processing → Maximize batch size
   └── Streaming → Optimize TPOT

Recommended Starting Points:

- Development: vLLM with default settings
- Production (NVIDIA): TensorRT-LLM with FP8
- Memory-constrained: vLLM with AWQ INT4
- Multi-GPU: TensorRT-LLM with tensor parallelism (vLLM also supports it via --tensor-parallel-size)
'''

print(decision_tree)
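
Steps 1 to 3 of the decision tree can be written as two small functions, which is handy for a deployment script that picks defaults automatically. This is only a sketch: the function names, the string labels for hardware and priority, and the `tight_fraction` threshold of 0.8 for what counts as a "tight fit" are all assumptions, not part of the lab.

```python
def recommend_framework(hardware: str, priority: str = "easy_deployment") -> str:
    """Mirror steps 1 and 2 of the decision tree; labels are illustrative."""
    if hardware == "amd_gpu":
        return "vLLM (ROCm)"
    if hardware in ("cpu", "edge"):
        return "llama.cpp"
    if hardware != "nvidia_gpu":
        raise ValueError(f"unknown hardware: {hardware!r}")
    by_priority = {
        "max_throughput": "TensorRT-LLM",
        "easy_deployment": "vLLM",
        "research": "vLLM or HF Transformers",
    }
    if priority not in by_priority:
        raise ValueError(f"unknown priority: {priority!r}")
    return by_priority[priority]


def recommend_precision(weights_fp16_gb: float, gpu_memory_gb: float,
                        tight_fraction: float = 0.8) -> str:
    """Mirror step 3: compare FP16 weight size with available GPU memory.

    tight_fraction is an assumed cutoff: above it, little room is left
    for the KV cache and activations.
    """
    if weights_fp16_gb > gpu_memory_gb:
        return "weight quantization (INT4/INT8)"
    if weights_fp16_gb > tight_fraction * gpu_memory_gb:
        return "FP8 or INT8 KV cache"
    return "FP16"


print(recommend_framework("nvidia_gpu", "max_throughput"))
print(recommend_framework("amd_gpu"))
for gpu_gb in (24, 16, 12):
    print(f"{gpu_gb} GB GPU:", recommend_precision(13.5, gpu_gb))
```

Step 4 (latency requirements) is left out because it affects batching and scheduling settings rather than the choice of framework or precision.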

---
## Exercises

1. **Deploy vLLM**: Set up a vLLM server with an AWQ-quantized model
2. **Benchmark**: Compare throughput and latency of FP16 vs INT8 vs INT4
3. **Profile**: Use Nsight Systems to identify bottlenecks in your deployment

## Course Completion

Congratulations! You've completed the GPU Learning course. You now understand:

- GPU architecture and execution model
- Memory hierarchy and optimization
- Writing kernels with Triton
- Profiling and optimization techniques
- Attention mechanisms and FlashAttention
- Quantization formats and implementation
- Production deployment with vLLM and TensorRT-LLM

**Next steps**: Apply this knowledge to your own projects, contribute to open-source inference libraries, or dive deeper into specific areas like kernel development or model optimization.