# 🚀 vLLM vs TGI vs Ollama: Low-Latency Chat Benchmarking

[![OpenShift](https://img.shields.io/badge/Platform-OpenShift-red)](https://www.redhat.com/en/technologies/cloud-computing/openshift)
[![vLLM](https://img.shields.io/badge/Engine-vLLM-blue)](https://github.com/vllm-project/vllm)
[![TGI](https://img.shields.io/badge/Engine-TGI-orange)](https://github.com/huggingface/text-generation-inference)
[![Ollama](https://img.shields.io/badge/Engine-Ollama-green)](https://github.com/ollama/ollama)

---

## 🎯 Welcome to the Interactive Benchmarking Demo

This notebook runs a three-way comparison of vLLM, TGI, and Ollama on low-latency chat workloads, with the aim of demonstrating **vLLM's latency advantage** under identical conditions.

### What You'll Experience
- **⚡ Real-time performance metrics**: time to first token (TTFT), inter-token latency (ITL), and end-to-end (E2E) latency
- **📊 Interactive visualizations**: live charts and side-by-side comparisons
- **🔧 Configuration insights**: optimization techniques and best practices
- **🎮 Hands-on benchmarking**: run your own tests and see the results immediately
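
The three latency metrics listed above can be derived from per-token arrival timestamps. The following is a minimal sketch, not the notebook's own implementation: the function name `latency_metrics` and its arguments are hypothetical, and it assumes ITL is reported as the mean gap between consecutive tokens.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, mean ITL, and E2E latency (in seconds) for one streamed response.

    request_start: timestamp at which the request was sent (hypothetical input).
    token_times:   arrival timestamp of each streamed token, in order.
    """
    if not token_times:
        raise ValueError("token_times must contain at least one timestamp")

    # TTFT: delay between sending the request and receiving the first token
    ttft = token_times[0] - request_start

    # ITL: gaps between consecutive tokens, averaged (assumption: mean, not median)
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0

    # E2E: delay between sending the request and receiving the last token
    e2e = token_times[-1] - request_start

    return {"ttft": ttft, "itl": itl, "e2e": e2e}


# Illustrative timestamps only, not measured data
metrics = latency_metrics(0.0, [0.25, 0.5, 0.75, 1.0])
print(metrics)
```

In practice the timestamps would come from `time.perf_counter()` calls around a streaming HTTP response; the placeholder values here only exercise the arithmetic.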

### Target Metrics
| Metric | Target | vLLM Expected |
|--------|--------|---------------|
| **TTFT** | < 100 ms | ✅ ~50-80 ms |
| **P95 Latency** | < 1 second | ✅ ~300-600 ms |
| **Throughput** | 50+ concurrent users | ✅ 100+ concurrent users |
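
A check against the latency targets in the table might look like the sketch below. It rests on assumptions the table does not state: that the TTFT target applies to the mean TTFT, that "P95 Latency" refers to end-to-end latency, and that a nearest-rank percentile is acceptable. The names `check_targets` and `percentile_nearest_rank` are hypothetical, and the sample values are placeholders rather than benchmark results.

```python
import math
from statistics import mean

# Targets taken from the table above, expressed in milliseconds
TTFT_TARGET_MS = 100.0
P95_TARGET_MS = 1000.0


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_targets(ttft_ms: list[float], e2e_ms: list[float]) -> dict[str, object]:
    """Compare measured latencies with the TTFT and P95 targets."""
    mean_ttft = mean(ttft_ms)  # assumption: target applies to the mean
    p95_e2e = percentile_nearest_rank(e2e_ms, 95)  # assumption: P95 of E2E latency
    return {
        "mean_ttft_ms": mean_ttft,
        "p95_e2e_ms": p95_e2e,
        "ttft_ok": mean_ttft < TTFT_TARGET_MS,
        "p95_ok": p95_e2e < P95_TARGET_MS,
    }


# Placeholder samples chosen to exercise both outcomes, not measured data
report = check_targets([60.0, 70.0, 80.0], [300.0, 400.0, 500.0, 600.0, 1200.0])
print(report)
```

With only five samples the nearest-rank P95 is simply the slowest request, which is why a single outlier fails the check here; a real run would need far more samples for the percentile to be meaningful.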

---

**⏱️ Demo Duration:** ~15-20 minutes  
**🏗️ Infrastructure:** OpenShift with GPU nodes  
**🤖 Model:** Qwen/Qwen2.5-7B (standardized across all engines)


# 📋 Section 1: Introduction & Architecture

This section describes the benchmarking environment and explains what makes the comparison meaningful.


In [8]:
# Core imports and setup
import os
import sys
import time
import json
import asyncio
from datetime import datetime
from pathlib import Path

# Data processing
import pandas as pd
import numpy as np

# Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns

# HTTP clients
import requests
import httpx

# Kubernetes/OpenShift
try:
    from kubernetes import client, config
except ImportError:
    # Define placeholders so later cells can test for availability
    client = config = None
    print("⚠️ Kubernetes client not available - using fallback mode")

# Rich console output
from rich.console import Console
from rich.table import Table
from rich.progress import Progress
from rich.panel import Panel

# Initialize console
console = Console()

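# Project paths (assumes the notebook runs from a subdirectory of the project root)
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
RESULTS_DIR = PROJECT_ROOT / "results"

# Create the results directory, including any missing parents
RESULTS_DIR.mkdir(parents=True, exist_ok=True)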

console.print("[green]✅ Environment setup complete![/green]")
console.print(f"[blue]📁 Project root: {PROJECT_ROOT}[/blue]")
console.print(f"[blue]📊 Results will be saved to: {RESULTS_DIR}[/blue]")
