# vLLM Direct Inference Demo

This notebook demonstrates how to use vLLM for direct offline inference without Docker.

## ⚠️ Important Limitations

**This notebook is for individual experimentation only:**
- **Single User**: The model is tied to your Python kernel
- **Not Shared**: Other team members cannot access it
- **Resource Intensive**: Requires dedicated GPU memory per user
- **Not Recommended for Teams**: Use `vllm_demo.ipynb` with the Docker server for team collaboration

## Prerequisites

1. Install vLLM and dependencies:
   ```bash
   ./install.sh
   ```


In [None]:
import logging

from vllm import LLM, SamplingParams

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## Initialize vLLM for Direct Inference

We'll load the model directly into this Python process with vLLM's `LLM` class. For offline batch processing this avoids starting a separate server and sending requests over HTTP.


In [None]:
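# Configuration shared by the model constructor and the summary printout
MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
MAX_MODEL_LEN = 4096          # maximum context length (prompt + completion) in tokens
GPU_MEMORY_UTILIZATION = 0.8  # fraction of GPU memory vLLM may reserve

# Initialize vLLM for direct inference
print("Initializing vLLM model...")
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    max_model_len=MAX_MODEL_LEN,
    gpu_memory_utilization=GPU_MEMORY_UTILIZATION,
)

# Set up sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,  # higher values give more varied output
    top_p=0.9,        # nucleus sampling threshold
    max_tokens=1024,  # upper bound on generated tokens per completion
)

print("vLLM model loaded successfully!")
print(f"Model: {MODEL_NAME}")
print(f"Max model length: {MAX_MODEL_LEN} tokens")
print(f"GPU memory utilization: {GPU_MEMORY_UTILIZATION:.0%}")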
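
To get a feel for what `gpu_memory_utilization=0.8` means in practice, the arithmetic can be sketched as below. This is a rough estimate only: the 24 GiB GPU size is a hypothetical example, the 1.5 billion parameter count is read off the model name, and 2 bytes per parameter assumes 16-bit weights. The helper names are illustrative and not part of vLLM.

```python
def vllm_memory_budget_gib(total_gpu_gib: float, gpu_memory_utilization: float) -> float:
    """Memory (GiB) vLLM may reserve for weights, activations, and KV cache."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return total_gpu_gib * gpu_memory_utilization


def approx_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Rough weight footprint; 2 bytes per parameter assumes fp16/bf16 weights."""
    return num_params * bytes_per_param / 1024**3


# Hypothetical 24 GiB GPU; 1.5e9 parameters is an approximation from the model name.
budget = vllm_memory_budget_gib(24.0, 0.8)
weights = approx_weight_gib(1.5e9)
print(
    f"Budget: {budget:.1f} GiB, weights: ~{weights:.1f} GiB, "
    f"remaining for KV cache and activations: ~{budget - weights:.1f} GiB"
)
```

If the remaining figure is small on your GPU, lowering `max_model_len` is usually the first thing to try, since a shorter context needs less KV cache.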

## Single Completion Example

Generate a single completion using vLLM's direct inference.

In [None]:
# Generate a single completion using vLLM
prompt = "What is the role of proteins in biological systems?"

print(f"Prompt: {prompt}")
print("\nGenerating completion...")

# Generate completion using vLLM
outputs = llm.generate([prompt], sampling_params)
output = outputs[0]

# Extract the generated text and count the generated tokens (prompt tokens are not included)
completion_text = output.outputs[0].text
tokens_used = len(output.outputs[0].token_ids)

print(f"\nCompletion: {completion_text}")
print(f"\nTokens used: {tokens_used}")
print(f"Finish reason: {output.outputs[0].finish_reason}")
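
The finish reason printed above is worth checking: `"stop"` generally means the model ended the completion on its own, while `"length"` means a length limit such as `max_tokens` cut it off. A minimal sketch of such a check follows; `was_truncated` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for vLLM's completion objects so the snippet runs without a GPU.

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """True when generation stopped because it hit a length limit."""
    return completion.finish_reason == "length"


# Stand-ins that mimic the `text` and `finish_reason` attributes used above.
finished = SimpleNamespace(text="Proteins catalyze reactions.", finish_reason="stop")
cut_off = SimpleNamespace(text="Proteins are", finish_reason="length")

for completion in (finished, cut_off):
    status = "truncated" if was_truncated(completion) else "complete"
    print(f"{status}: {completion.text!r}")
```

With the real model, the same check would be `was_truncated(output.outputs[0])`. A truncated answer is a hint to raise `max_tokens` or shorten the prompt.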

## Batch Processing Example

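vLLM excels at batch processing. Passing several prompts to a single `generate` call lets the engine schedule them together on the GPU, which is typically much faster than generating them one at a time.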

In [None]:
# Generate completions for multiple prompts
prompts = [
    "Explain the process of DNA replication.",
    "What are the main functions of mitochondria?",
    "How do enzymes work in biological reactions?"
]

print("Generating batch completions...")
print(f"Processing {len(prompts)} prompts\n")

# Generate batch completions using vLLM
outputs = llm.generate(prompts, sampling_params)

# Display results
total_tokens = 0
for i, (prompt, output) in enumerate(zip(prompts, outputs)):
    print(f"\n--- Question {i+1} ---")
    print(f"Prompt: {prompt}")
    print(f"Answer: {output.outputs[0].text}")
    tokens_used = len(output.outputs[0].token_ids)
    total_tokens += tokens_used
    print(f"Tokens used: {tokens_used}")

print(f"\nTotal tokens used: {total_tokens}")
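
To see how much batching helps on your hardware, you can time the call and compute generated tokens per second. The sketch below assumes only the output shape used in this notebook (`output.outputs[0].token_ids`); `measure_throughput` and `fake_generate` are hypothetical names, and the fake generator exists so the example runs without loading a model.

```python
import time
from types import SimpleNamespace


def measure_throughput(generate_fn, prompts):
    """Time one batched call; return (outputs, generated_tokens, tokens_per_second)."""
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    generated = sum(len(o.outputs[0].token_ids) for o in outputs)
    tokens_per_second = generated / elapsed if elapsed > 0 else float("inf")
    return outputs, generated, tokens_per_second


def fake_generate(prompts):
    """Hypothetical stand-in that mimics the shape of vLLM's return value."""
    return [
        SimpleNamespace(outputs=[SimpleNamespace(text="...", token_ids=[0] * 5)])
        for _ in prompts
    ]


outputs, generated, rate = measure_throughput(fake_generate, ["a", "b", "c"])
print(f"{generated} tokens generated across {len(outputs)} prompts")
```

With the real model, pass `lambda p: llm.generate(p, sampling_params)` as `generate_fn`, then compare one batched call against a loop of single-prompt calls.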

## Chat Template Example

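Instruction-tuned models expect conversations to be formatted with their chat template. Here we apply the model's built-in template with the Hugging Face tokenizer and pass the resulting string to `llm.generate`.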


In [None]:
# Chat template example: format the messages with the tokenizer, then generate with vLLM
from transformers import AutoTokenizer

# Load the tokenizer to apply chat template
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

# Prepare messages for chat
messages = [
    {"role": "system", "content": "You are a poetic biology tutor. Use analogies and paint a pretty picture."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
]

# Apply chat template
chat_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

print("Chat Prompt:")
print(chat_prompt)
print("\nGenerating response...")

# Generate using vLLM
outputs = llm.generate([chat_prompt], sampling_params)
response = outputs[0].outputs[0].text

print("\nChat Response:")
print(f"Assistant: {response}")
print(f"Tokens used: {len(outputs[0].outputs[0].token_ids)}")
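
To make the printed chat prompt less mysterious, here is a simplified sketch of what a ChatML-style template does to the message list. It assumes the model uses ChatML-style `<|im_start|>` and `<|im_end|>` markers; the real template may differ in details (for example, default system prompts or tool handling), so `tokenizer.apply_chat_template` remains the source of truth. `render_chatml` is a hypothetical helper, not part of vLLM or transformers.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate a ChatML-style chat template for a list of role/content dicts."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {"role": "system", "content": "You are a poetic biology tutor."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
]
rendered = render_chatml(example_messages)
print(rendered)
```

Comparing this output with the `chat_prompt` printed by the notebook is a quick way to confirm which special tokens the model actually expects.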