# vLLM Quickstart: High-Throughput Serving

This notebook demonstrates how to serve a small language model (SLM) with vLLM, a high-throughput, memory-efficient inference engine. Guide: [vLLM Deployment](https://slmhub.gitbook.io/slmhub/docs/deploy/quickstarts/vllm).

## 1. Install vLLM
vLLM's default pip package needs an NVIDIA GPU. A Colab T4 is enough for the small model used here.

In [None]:
!pip install vllm
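
Before loading a model, it can help to confirm which GPU the runtime provides. The sketch below is not part of vLLM. `pick_dtype` and `detect_compute_capability` are hypothetical helper names, and the code assumes PyTorch is available (it is installed as a vLLM dependency). It relies on the fact that bfloat16 needs CUDA compute capability 8.0 or higher, while a T4 reports 7.5.

```python
def pick_dtype(compute_capability):
    """Return a dtype string for a (major, minor) CUDA compute capability."""
    major, _minor = compute_capability
    # bfloat16 needs compute capability 8.0+ (Ampere or newer); a T4 reports 7.5.
    return "bfloat16" if major >= 8 else "float16"


def detect_compute_capability():
    """Return (major, minor) for GPU 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # assumption: installed alongside vLLM
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


capability = detect_compute_capability()
if capability is None:
    print("No CUDA GPU detected; switch the runtime to a GPU before continuing.")
else:
    print(f"Compute capability {capability}: suggested dtype is {pick_dtype(capability)}")
```

The returned string can be passed as the `dtype` argument when creating the engine.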

## 2. Offline Inference
Load the model and generate text directly in Python.

In [None]:
from vllm import LLM, SamplingParams

# Initialize model (Phi-3-mini is small and fast)
# dtype="half" requests float16, since a T4 does not support bfloat16
llm = LLM(model="microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True, dtype="half")

# Define prompts
prompts = [
    "Hello, my name is",
    "The future of AI is",
    "Write a short poem about coding."
]

# Sampling parameters
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=60)

# Generate
outputs = llm.generate(prompts, sampling_params)

# Print results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
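
The prompts above are sent as raw text, so an instruction-tuned model simply continues them. For instruction-style requests, wrapping the text in the model's chat format usually gives better answers. The sketch below assumes the chat format shown on the Phi-3 model card. `format_phi3_prompt` is a hypothetical helper, and the tokenizer's own chat template remains the authoritative source.

```python
def format_phi3_prompt(user_message, system_message=None):
    """Wrap a message in the Phi-3 chat format (assumed from the model card)."""
    parts = []
    if system_message:
        parts.append(f"<|system|>\n{system_message}<|end|>\n")
    parts.append(f"<|user|>\n{user_message}<|end|>\n")
    # Ending with the assistant tag asks the model to produce the reply next.
    parts.append("<|assistant|>\n")
    return "".join(parts)


print(repr(format_phi3_prompt("Write a short poem about coding.")))
```

The formatted strings can then replace the raw ones, for example `llm.generate([format_phi3_prompt(p) for p in prompts], sampling_params)`.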

## 3. Server Mode (OpenAI Compatible)
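Run vLLM as an OpenAI-compatible API server. Launching the server directly would block the Colab cell, so it is started as a background process. If the offline engine from section 2 is still loaded, restart the runtime first; otherwise both copies of the model compete for the same GPU memory and the server may fail to start.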

In [None]:
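# Start server in background
import subprocess
import sys
import time

command = [
    sys.executable, "-m", "vllm.entrypoints.openai.api_server",  # same interpreter that has vLLM installed
    "--model", "microsoft/Phi-3-mini-4k-instruct",
    "--trust-remote-code",
    "--dtype", "half",  # T4 has no bfloat16 support, so request float16 explicitly
    "--port", "8000"
]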

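# Discard stdout so an unread pipe cannot fill up and stall the server; keep stderr for diagnostics
process = subprocess.Popen(command, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
print("Starting vLLM server... (takes ~1-2 mins to load model)")
time.sleep(60)  # Fixed wait; increase it if the model is still loading when you query the API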
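
A fixed sleep can be too short on a cold start and wasteful on a warm one. A minimal sketch of an alternative is to poll the server until it answers. `wait_for_server` is a hypothetical helper built on the standard library, and the `/health` path in the usage note is an assumption about the server's readiness endpoint.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url, timeout=300.0, interval=2.0):
    """Poll `url` until it answers HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

For example, `wait_for_server("http://localhost:8000/health")` returns `True` once the server responds and `False` if it never comes up within the timeout.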

## 4. Query the API
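Use the standard OpenAI Python client to query the local vLLM server. The client requires an API key argument, but the server does not check it unless it was started with `--api-key`, so any placeholder string works.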

In [None]:
!pip install openai

In [None]:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")

try:
    completion = client.chat.completions.create(
        model="microsoft/Phi-3-mini-4k-instruct",
        messages=[
            {"role": "user", "content": "Explain vLLM in one sentence."}
        ]
    )
    print("Response:", completion.choices[0].message.content)
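except Exception as e:
    print("Server might still be loading or failed:", e)
    # Print logs only if the server process has exited; communicate() would
    # otherwise wait on the running server and raise TimeoutExpired
    if process.poll() is not None:
        _, err = process.communicate()
        print(err.decode() if err else "(no stderr captured)")
    else:
        print("Server process is still running; wait and retry.")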
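
"OpenAI compatible" means the server accepts the same HTTP requests as the OpenAI API, so the client library is optional. The sketch below builds the equivalent request with the standard library only. `build_chat_request` is a hypothetical helper, and the URL and model name are taken from the cells above.

```python
import json
import urllib.request


def build_chat_request(user_message,
                       model="microsoft/Phi-3-mini-4k-instruct",
                       base_url="http://localhost:8000/v1",
                       max_tokens=60):
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("Explain vLLM in one sentence.")
print(request.full_url)
print(request.data.decode("utf-8"))
```

Once the server is up, passing `request` to `urllib.request.urlopen` and parsing the body with `json.load` should give a response whose answer sits under `["choices"][0]["message"]["content"]`, mirroring the client call above.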