# vLLM Docker Demo

This notebook demonstrates how to use the vLLM server with the OpenAI client.

## Prerequisites

1. Start the vLLM Docker container:
   ```bash
   ./start_vllm_docker.sh
   ```


In [None]:
import logging
from openai import OpenAI
import json

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

## Initialize the OpenAI Client

The client connects to the vLLM server running in Docker, which acts as a drop-in replacement for the hosted models that OpenAI exposes through its API.

The difference is that the vLLM server runs on your own AWS instance, so you can serve any open-weights model, and throughput can be much higher, particularly when requests are batched.


In [None]:
# Initialize the OpenAI client to connect to vLLM server
# This will connect to the Docker container running on localhost:8000
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require authentication
    base_url="http://localhost:8000/v1",
)

# Test the connection by listing available models
try:
    models = list(client.models.list())
except Exception as e:
    print(f"Error connecting to vLLM server: {e}")
    print("Make sure the server is running by executing `./start_vllm_docker.sh`")
    raise  # Stop here: the cells below need a working connection

if not models:
    raise RuntimeError("The vLLM server is reachable but reports no models")

print("vLLM server is up and running!")
print(f"Available models: {[model.id for model in models]}")

# Use the first served model for every example below
model = models[0].id
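
Loading the model weights can take a while after the container starts, so the first connection attempt may fail even when everything is configured correctly. One way to handle this is to retry the model listing a few times before giving up. The sketch below is a minimal illustration: `wait_until_ready`, its parameters, and the default retry values are hypothetical, not part of vLLM or the OpenAI client.

```python
import time
from typing import Callable, Sequence


def wait_until_ready(
    probe: Callable[[], Sequence],
    attempts: int = 30,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence:
    """Call `probe` until it returns a non-empty result or attempts run out.

    `probe` is any zero-argument callable that raises while the server is
    still starting, e.g. `lambda: list(client.models.list())`.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            result = probe()
            if result:
                return result
        except Exception as error:  # connection refused, timeouts, ...
            last_error = error
        if attempt < attempts:
            sleep(delay_s)
    raise RuntimeError(f"Server not ready after {attempts} attempts") from last_error
```

With the client defined above, this could be called as `models = wait_until_ready(lambda: list(client.models.list()))`.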

## Single Completion Example

You can use the local model as if it were served by OpenAI.

In [None]:
# Generate a single completion using the Completions API
prompt = "What is the role of proteins in biological systems?"

print(f"Prompt: {prompt}")
print("\nGenerating completion...")

# Generate completion
response = client.completions.create(
    model=model,  
    prompt=prompt,
    max_tokens=256,
    temperature=0.7
)

# Extract the text from the response
completion_text = response.choices[0].text

print(f"\nCompletion: {completion_text}")
print(f"\nUsage: {response.usage}")
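
The `usage` object printed above follows the OpenAI schema, with `prompt_tokens`, `completion_tokens`, and `total_tokens` fields. If you want a running total across several requests, a small accumulator is enough. The `UsageTally` class below is a hypothetical helper written for illustration, not something provided by vLLM or the OpenAI client; the example feeds it stand-in objects so that it runs without a server.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTally:
    """Running totals of token counts across requests (hypothetical helper)."""

    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    def add(self, usage) -> None:
        # `usage` is expected to expose the OpenAI-style fields; missing or
        # None values are counted as zero.
        self.requests += 1
        self.prompt_tokens += getattr(usage, "prompt_tokens", 0) or 0
        self.completion_tokens += getattr(usage, "completion_tokens", 0) or 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


# Stand-in usage objects; with a live server you would pass `response.usage`.
tally = UsageTally()
tally.add(SimpleNamespace(prompt_tokens=12, completion_tokens=256))
tally.add(SimpleNamespace(prompt_tokens=30, completion_tokens=None))
print(tally.requests, tally.total_tokens)
```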

## Batch Processing Example

Whenever possible, send multiple prompts in a single request. Throughput improves as the batch size grows, at the cost of more VRAM. With this setup you should be able to run hundreds or thousands of prompts at once.

In [None]:
# Generate completions for multiple prompts
prompts = [
    "Explain the process of DNA replication.",
    "What are the main functions of mitochondria?",
    "How do enzymes work in biological reactions?"
]

print("Generating batch completions...")
print(f"Processing {len(prompts)} prompts\n")

# Send all prompts in a single request; the response holds one choice per prompt
response = client.completions.create(
    model=model,
    prompt=prompts,
    max_tokens=200,
    temperature=0.8
)

# Display results
for i, (prompt, answer) in enumerate(zip(prompts, response.choices)):
    print(f"\n--- Question {i+1} ---")
    print(f"Prompt: {prompt}")
    print(f"Answer: {answer.text}")
print(f"Tokens used: {response.usage.total_tokens}")
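
For very large prompt lists, it may be more practical to split the list into fixed-size chunks and send one request per chunk. The helper below is a sketch of that idea; `chunked` and the chunk size are illustrative choices, and a suitable size depends on your model and hardware.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Example with placeholder prompts; each chunk would become one
# `client.completions.create(model=model, prompt=chunk, ...)` call.
example_prompts = [f"Question {i}" for i in range(7)]
for chunk in chunked(example_prompts, size=3):
    print(len(chunk), chunk[0])
```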

## Prompting with a System Prompt and Multi-Turn Conversations

You can either use the OpenAI Chat Completions API or format the conversation yourself with the Transformers `AutoTokenizer`.

Using the OpenAI Chat Completions API directly with a list of messages:


In [None]:
messages = [
    {"role": "system", "content": "You are a poetic biology tutor. Use analogies and paint a pretty picture."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
    {"role": "assistant", "content": "Like a chef making a meal, a plant uses sunlight, water, and air to create food."},
    {"role": "user", "content": "OK, please be a little more specific. I want to learn the molecular science behind it!"},
]

# Chat completion example
chat_response = client.chat.completions.create(
    model=model,
    messages=messages,
    max_tokens=500,
    temperature=0.7
)

print("Chat Response:")
print(f"Assistant: {chat_response.choices[0].message.content}")
print(f"\nUsage: {chat_response.usage}")

If you want to send batches of such conversations (which you should), you can use the `AutoTokenizer` class from the Transformers library to format each conversation.

It applies the model's chat template, inserting the special tokens the model expects, and renders the conversation as a single string.
See https://huggingface.co/docs/transformers/en/chat_templating

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model)

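# add_generation_prompt=True appends the tokens that open the assistant's turn,
# so the model writes a reply instead of continuing the user's message
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)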

response = client.completions.create(
    model=model,
    prompt=[prompt],
    max_tokens=200,
    temperature=0.8
)

print(f"Prompt: {prompt}")
print(f"Answer: {response.choices[0].text}")
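
To make the idea of a chat template concrete, the sketch below renders a conversation by hand in the ChatML layout, which some open-weights models use. This is an illustration only: the template for your model may differ, `to_chatml` is a hypothetical helper, and in practice the tokenizer's own `apply_chat_template` should be the source of truth.

```python
from typing import Dict, List


def to_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in ChatML layout (illustrative; real templates vary by model)."""
    parts = []
    for message in messages:
        # Each turn is wrapped in start/end markers, with the role on the first line
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant's turn so the model generates the reply
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo = [
    {"role": "system", "content": "You are a biology tutor."},
    {"role": "user", "content": "Explain photosynthesis in simple terms."},
]
print(to_chatml(demo))
```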

## Chat Completions with Tool Calling

If the vLLM server is started with the `--enable-auto-tool-choice` flag (together with a `--tool-call-parser` that matches the model), the model can generate tool calls on its own when it deems them appropriate.

See https://docs.vllm.ai/en/stable/features/tool_calling.html

In [None]:
def get_current_temperature(location: str, unit: str = "celsius"):
    """Get current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, State, Country".
        unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])

    Returns:
        the temperature, the location, and the unit in a dict
    """
    return {
        "temperature": 26.1,
        "location": location,
        "unit": unit,
    }


tool_functions = {"get_current_temperature": get_current_temperature}

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_temperature",
        "description": "Get the current temperature in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City, State, Country, e.g., 'San Francisco, CA, USA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

messages = [
    {"role": "user", "content": "What's the weather like in San Francisco, CA, USA?"}
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

message = response.choices[0].message
if message.tool_calls:
    # The model asked for a tool; run the first requested call locally
    tool_call = message.tool_calls[0].function
    print(f"Function called: {tool_call.name}")
    print(f"Arguments: {tool_call.arguments}")
    print(f"Result: {tool_functions[tool_call.name](**json.loads(tool_call.arguments))}")
else:
    # The model answered directly without requesting a tool
    print(f"Assistant: {message.content}")
print(f"\nUsage: {response.usage}")
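
Arguments produced by the model are not guaranteed to match the schema, so it can be worth checking them before calling the function. The sketch below checks only the `required` and `enum` keywords of a `parameters` object shaped like the one above; `check_tool_args` is a hypothetical helper, and a full JSON Schema validator would cover far more cases.

```python
from typing import Any, Dict, List


def check_tool_args(args: Dict[str, Any], parameters: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in `args`; an empty list means none found."""
    problems = []
    properties = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"{name} must be one of {allowed}, got {value!r}")
    return problems


# Same shape as the `parameters` entry of the tool definition above
example_parameters = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location", "unit"],
}
print(check_tool_args({"location": "Paris, France", "unit": "kelvin"}, example_parameters))
```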

In [None]:
messages.append(response.choices[0].message.model_dump())

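def get_function_by_name(name):
    # Look the tool up in the registry defined above; fail loudly on unknown names
    if name not in tool_functions:
        raise KeyError(f"Unknown tool: {name}")
    return tool_functions[name]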

if tool_calls := messages[-1].get("tool_calls", None):
    for tool_call in tool_calls:
        call_id: str = tool_call["id"]
        if fn_call := tool_call.get("function"):
            fn_name: str = fn_call["name"]
            fn_args: dict = json.loads(fn_call["arguments"])
        
            fn_res: str = json.dumps(get_function_by_name(fn_name)(**fn_args))

            messages.append({
                "role": "tool",
                "content": fn_res,
                "tool_call_id": call_id,
            })

print(messages)
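
The loop above assumes that every call names a known function and carries valid JSON arguments. A more defensive variant can report failures back to the model as the tool result instead of raising. The sketch below is one way to do this; `run_tool_calls`, the `registry` argument, and the error payload format are assumptions made for illustration.

```python
import json
from typing import Any, Callable, Dict, List


def run_tool_calls(
    assistant_message: Dict[str, Any],
    registry: Dict[str, Callable[..., Any]],
) -> List[Dict[str, str]]:
    """Execute each tool call and return the matching `role: tool` messages."""
    tool_messages = []
    for tool_call in assistant_message.get("tool_calls") or []:
        fn_call = tool_call.get("function") or {}
        name = fn_call.get("name")
        try:
            fn_args = json.loads(fn_call.get("arguments") or "{}")
            result = registry[name](**fn_args)
        except Exception as error:
            # Hand the failure to the model as the tool output (assumed format)
            result = {"error": f"{type(error).__name__}: {error}"}
        tool_messages.append({
            "role": "tool",
            "content": json.dumps(result),
            "tool_call_id": tool_call["id"],
        })
    return tool_messages


# Stand-in assistant message with one valid and one unknown call
fake_message = {
    "role": "assistant",
    "tool_calls": [
        {"id": "call_1", "function": {"name": "double", "arguments": '{"x": 2}'}},
        {"id": "call_2", "function": {"name": "missing", "arguments": "{}"}},
    ],
}
for tool_message in run_tool_calls(fake_message, {"double": lambda x: 2 * x}):
    print(tool_message)
```

With the conversation above, `messages.extend(run_tool_calls(messages[-1], tool_functions))` would take the place of the explicit loop.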

In [None]:
response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    temperature=0.7,
    top_p=0.8,
    max_tokens=512,
    extra_body={
        "repetition_penalty": 1.05,
    },
)

print(response.choices[0].message.content)

## Summary

This notebook demonstrates:

1. **OpenAI-Compatible Interface**: Using vLLM with the standard OpenAI Python client
2. **Client Initialization**: How to set up the OpenAI client to connect to a Docker-based vLLM server
3. **Single Completions**: Generating responses for individual prompts using the Completions API
4. **Batch Processing**: Efficiently processing multiple prompts at once
5. **Chat Completions**: Using the Chat Completions API for conversational interactions
6. **Tool Calling**: Using function calling capabilities with the Chat Completions API
7. **Biology Applications**: Using the model for biology-related questions
8. **Usage Statistics**: Monitoring token usage

The vLLM Docker setup, with its OpenAI-compatible interface, provides a scalable and familiar way to serve large language models for biology research and education applications. Because it works with the standard OpenAI client, it is easy to integrate with existing applications and frameworks.
