# 📓 The GenAI Revolution Cookbook

**Title:** How to Build a Local LLM API with Ollama and the OpenAI SDK

**Description:** Build a Local LLM API fast: keep OpenAI SDK code, change base_url and dummy key, stream with cancel, iterate offline.

---

*This Jupyter notebook contains executable code examples. Run the cells below to try out the code yourself!*



## Why Use Ollama for This Problem

Ollama provides a local, OpenAI-compatible API server that lets you iterate on GenAI applications without network round trips, API costs, or an internet connection (once a model is downloaded). For builders prototyping chat completions, streaming responses, and model benchmarks, Ollama works as a drop-in backend for the OpenAI SDK: swap the `base_url` and use a placeholder API key. This tutorial focuses on connecting the OpenAI Python SDK to a local Ollama server, implementing streaming with early cancellation, and measuring basic latency metrics.

**Prerequisites:** Ollama installed and running locally (`ollama serve`), Python 3.8+, and the OpenAI SDK (`openai>=1.30`). This guide assumes macOS or Linux; Windows users should use WSL or Docker. You'll need at least 8GB RAM for 7B models.

**What you'll build:** A Python script that points the OpenAI SDK to Ollama, streams a chat completion with early stop, and measures time-to-first-token and throughput for local model evaluation.

## Core Concepts for This Use Case

**OpenAI-compatible endpoint:** Ollama exposes `/v1/chat/completions` and `/v1/models` endpoints that mirror OpenAI's API contract, allowing SDK reuse without code changes beyond configuration.
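To make the shared contract concrete, the sketch below builds (but does not send) roughly the HTTP request that a chat completion boils down to. It uses only the standard library. `build_chat_request` is a hypothetical helper name, and the payload shows only a few common fields, so treat it as an illustration rather than a full description of what the SDK sends.

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, api_key="ollama"):
    """Build (without sending) a POST request for the chat completions endpoint."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The placeholder key travels as a bearer token
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = build_chat_request("http://localhost:11434/v1", "llama3.1:8b", "Hello")
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Only the host portion of the URL differs from a request aimed at a hosted endpoint, which is why changing `base_url` is enough.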

**Streaming and cancellation:** The SDK's `stream=True` returns an iterator; you can break early and call `.close()` to stop generation mid-response, saving compute and time during interactive dev loops.

**Local benchmarking:** Measuring latency (time-to-first-token, total duration) and throughput (tokens/sec) helps compare models and quantization levels on your hardware before committing to a production choice.
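The arithmetic behind these metrics can be sketched from three timestamps and a token count. `summarize_run` and its field names are hypothetical, chosen for illustration; the split between overall and decode throughput is one reasonable convention, not the only one.

```python
def summarize_run(start, first_token_at, end, completion_tokens):
    """Derive latency and throughput figures from three timestamps (seconds)."""
    ttft = first_token_at - start
    total = end - start
    decode_time = end - first_token_at

    # Overall rate includes the wait for the first token (prompt processing)
    overall = completion_tokens / total if total > 0 else 0.0

    # Decode rate counts only tokens produced after the first one arrived
    if decode_time > 0 and completion_tokens > 1:
        decode = (completion_tokens - 1) / decode_time
    else:
        decode = 0.0

    return {
        "ttft_s": ttft,
        "total_s": total,
        "overall_tok_per_s": overall,
        "decode_tok_per_s": decode,
    }

metrics = summarize_run(start=0.0, first_token_at=0.5, end=2.5, completion_tokens=101)
print(metrics)
```

With these sample numbers, the overall rate is about 40 tokens/sec while the decode rate is 50 tokens/sec; the gap is the time spent before the first token.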

## Install Dependencies

Run this cell first to install the OpenAI SDK and `requests` (used for the connection check):

In [None]:
!pip install --upgrade openai requests

## Set Up the OpenAI Client for Ollama

Configure the SDK to point to your local Ollama server. Use environment variables for flexibility across environments:

In [None]:
import os
from openai import OpenAI

# Point to local Ollama server; override via environment if needed
BASE_URL = os.getenv("OPENAI_BASE_URL", "http://localhost:11434/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "ollama")  # Placeholder for local use

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)

**Why this works:** Ollama doesn't enforce authentication locally, so any non-empty string satisfies the SDK's required `api_key` parameter.

## Verify the Connection

Check that Ollama is running and the model is available:

In [None]:
import requests

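# List available models via the OpenAI-compatible endpoint (BASE_URL already ends in /v1)
response = requests.get(f"{BASE_URL.rstrip('/')}/models", timeout=5)
response.raise_for_status()  # Surface HTTP errors instead of failing later on .json()
models = response.json()
print("Available models:", [m["id"] for m in models.get("data", [])])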

If this fails, ensure `ollama serve` is running in another terminal and you've pulled a model (e.g., `ollama pull llama3.1:8b`).
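A small check like the following can confirm that a specific model is present before you run the later cells. It is a sketch: `has_model` is a hypothetical helper, and `sample` is made-up data shaped like the `data`/`id` structure read by the cell above.

```python
def has_model(models_payload, name):
    """Return True if `name` appears among the ids in a models-list response."""
    ids = [m.get("id", "") for m in models_payload.get("data", [])]
    if ":" in name:
        # A full tag such as "llama3.1:8b" must match exactly
        return name in ids
    # A bare name such as "mistral" matches any tag of that model
    return any(model_id.split(":")[0] == name for model_id in ids)

# Made-up payload for illustration only
sample = {"object": "list", "data": [{"id": "llama3.1:8b"}, {"id": "mistral:7b-instruct"}]}
print(has_model(sample, "llama3.1:8b"))
print(has_model(sample, "mistral"))
print(has_model(sample, "qwen2.5"))
```

In the notebook you could pass the parsed `models` dictionary in place of `sample`.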

## Stream a Completion with Early Cancellation

Stream tokens as they're generated and stop after the first complete sentence:

In [None]:
def stream_with_cancel(client, model="llama3.1:8b"):
    """
    Stream a chat completion and cancel after the first sentence.
    
    Args:
        client: OpenAI client instance
        model: Model identifier (default: llama3.1:8b)
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Explain gradient descent in two lines."}],
        stream=True,
        temperature=0.2,
        max_tokens=80,
    )
    
    sentence_endings = (".", "!", "?")

    try:
        for chunk in stream:
            if not chunk.choices:
                continue  # Some chunks may carry no choices
            token = chunk.choices[0].delta.content or ""
            print(token, end="", flush=True)

            # Stop after the first sentence; ignore trailing whitespace in the token
            if token.rstrip().endswith(sentence_endings):
                break
    finally:
        stream.close()  # Always release the connection, even if an error occurs
    print("\n[Stream closed after first sentence or end of output]")

stream_with_cancel(client)

**Why cancel early:** During development, you often need just enough output to verify behavior, so stopping generation early saves time and compute.
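The stop condition can be separated from the SDK and exercised offline. The sketch below assumes a plain iterable of text pieces; `take_first_sentence` is a hypothetical helper, and the punctuation check is a naive heuristic (an abbreviation such as "e.g." arriving as one token would end the sentence early).

```python
def take_first_sentence(tokens, endings=(".", "!", "?")):
    """Yield text pieces until one ends a sentence, then stop."""
    for token in tokens:
        if not token:
            continue  # Skip empty pieces
        yield token
        # Ignore trailing whitespace so pieces like ". " still count
        if token.rstrip().endswith(endings):
            break

# Stand-in for streamed deltas; no server required
fake_stream = ["Gradient", " descent", " minimizes", " loss", ".", " It", " iterates", "."]
first = "".join(take_first_sentence(fake_stream))
print(first)
```

This helper only stops reading. With a real stream you would still call `stream.close()` afterwards so the connection is released.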

## Measure Latency and Throughput

Benchmark a model's performance on your hardware:

In [None]:
import time

def benchmark_completion(client, model, prompt):
    """
    Run a completion and measure duration and throughput.
    
    Args:
        client: OpenAI client instance
        model: Model identifier
        prompt: User prompt string
        
    Returns:
        Tuple of (duration_seconds, response_text)
    """
    messages = [
        {"role": "system", "content": "Be concise and concrete."},
        {"role": "user", "content": prompt}
    ]
    
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.3,
        max_tokens=256,
    )
    duration = time.perf_counter() - start
    
    text = response.choices[0].message.content or ""
    usage = getattr(response, "usage", None)
    
    if usage and usage.completion_tokens:
        tokens_per_sec = usage.completion_tokens / duration
        print(f"{model}: {duration:.2f}s, {usage.completion_tokens} tokens, {tokens_per_sec:.1f} tok/s")
    else:
        chars_per_sec = len(text) / duration
        print(f"{model}: {duration:.2f}s, ~{len(text)} chars, ~{chars_per_sec:.0f} char/s")
    
    return duration, text

# Example: benchmark a single model
benchmark_completion(
    client,
    "llama3.1:8b",
    "Give three bullet points on pros/cons of vector databases."
)

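**Key metrics:** This non-streaming call measures total duration and average tokens-per-second. The duration includes prompt processing (and model loading on a cold start), so the figure understates pure generation speed. Time-to-first-token (TTFT) requires streaming and is measured in the next section. Together, these metrics help you choose models and quantization levels that fit your latency budget.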

## Measure Streaming Metrics

Track time-to-first-token separately from total duration:

In [None]:
def measure_streaming(client, model):
    """
    Measure TTFT and total streaming duration.
    
    Args:
        client: OpenAI client instance
        model: Model identifier
    """
    start = time.perf_counter()
    first_token_time = None
    total_chars = 0
    
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize Raft consensus in 3 bullets."}],
        stream=True,
        max_tokens=180,
        temperature=0.2,
    )
    
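    for chunk in stream:
        if not chunk.choices:
            continue  # Skip chunks without choices
        token = chunk.choices[0].delta.content or ""
        if token and first_token_time is None:
            first_token_time = time.perf_counter() - start
        total_chars += len(token)
    
    duration = time.perf_counter() - start
    stream.close()
    
    # first_token_time stays None if the model produced no text
    ttft = f"{first_token_time:.2f}s" if first_token_time is not None else "n/a"
    print(f"{model}: TTFT={ttft}, total={duration:.2f}s, chars={total_chars}")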

measure_streaming(client, "llama3.1:8b")

**Why TTFT matters:** Low TTFT improves perceived responsiveness in chat UIs; total duration, together with output length, determines throughput.
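To check the timing logic without a running server, the same measurement can be written against any iterable with an injectable clock. This is a sketch: `time_stream` and the fake clock are hypothetical test scaffolding, not part of the SDK.

```python
import time

def time_stream(tokens, clock=time.perf_counter):
    """Consume an iterable of text pieces; return (ttft, total, text)."""
    start = clock()
    ttft = None
    parts = []
    for token in tokens:
        if token and ttft is None:
            ttft = clock() - start  # First non-empty piece marks TTFT
        parts.append(token)
    total = clock() - start
    return ttft, total, "".join(parts)

# Deterministic fake clock: every call advances time by 0.25 s
ticks = iter(i * 0.25 for i in range(100))
ttft, total, text = time_stream(
    ["", "Raft", " elects", " a leader."],
    clock=lambda: next(ticks),
)
print(ttft, total, text)
```

The empty first piece does not count toward TTFT, and `ttft` is `None` when no text arrives at all.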

## Compare Multiple Models

Pull additional models and benchmark them side-by-side:

In [None]:
# Pull the extra models first; the leading "!" runs the command from the notebook (drop it in a terminal)
!ollama pull mistral:7b-instruct
!ollama pull qwen2.5:7b-instruct

In [None]:
models = ["llama3.1:8b", "mistral:7b-instruct", "qwen2.5:7b-instruct"]
prompt = "Give three bullet points on pros/cons of vector databases."

for model in models:
    print(f"\n--- {model} ---")
    benchmark_completion(client, model, prompt)
    measure_streaming(client, model)

**Comparison tips:** Run each model 3 times and average results; first runs may be slower due to model loading.
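One way to apply this tip is a small wrapper that discards a warm-up run and summarizes the rest. This is a sketch: `repeat_timed` is a hypothetical helper, and the fake durations below stand in for real benchmark calls.

```python
import statistics

def repeat_timed(run_once, repeats=3, warmup=1):
    """Call `run_once` several times and summarize the durations it returns."""
    for _ in range(warmup):
        run_once()  # Discarded: absorbs one-time model loading cost
    durations = [run_once() for _ in range(repeats)]
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }

# Stand-in for real runs: a slow cold start followed by steady runs
fake_durations = iter([5.0, 1.0, 1.2, 1.1])
stats = repeat_timed(lambda: next(fake_durations))
print(stats)
```

With the notebook's own function, the callable could be `lambda: benchmark_completion(client, model, prompt)[0]`, since that function returns the duration as the first element of its tuple.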

## Run and Evaluate End-to-End

Combine all steps into a single workflow:

In [None]:
def full_workflow(client, model="llama3.1:8b"):
    """
    Complete workflow: verify connection, stream with cancel, benchmark.
    """
    print("1. Verifying connection...")
    response = requests.get(f"{BASE_URL.rstrip('/')}/models", timeout=5)
    print(f"   Models available: {len(response.json().get('data', []))}")
    
    print("\n2. Streaming with early cancel...")
    stream_with_cancel(client, model)
    
    print("\n3. Benchmarking completion...")
    benchmark_completion(client, model, "Explain CAP theorem in 2 sentences.")
    
    print("\n4. Measuring streaming metrics...")
    measure_streaming(client, model)
    
    print("\n✓ Workflow complete")

full_workflow(client)

## Limitations and Considerations

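**Unsupported OpenAI features:** Ollama implements only part of the OpenAI API. The Assistants API and image generation (DALL·E) are not available. Vision input, JSON mode, and tool calls work only with certain models and Ollama versions, so test them thoroughly against the model you plan to use.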

**Error handling:** Add timeouts and retries for production use:

In [None]:
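from openai import OpenAI, APIConnectionError, APITimeoutError, InternalServerError
import time

# max_retries=0 turns off the SDK's built-in retries so the loop below is the only retry layer
client = OpenAI(base_url=BASE_URL, api_key=API_KEY, timeout=30.0, max_retries=0)

def safe_completion(client, model, messages, retries=3):
    """Retry transient failures with exponential backoff; other errors propagate immediately."""
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except (APIConnectionError, APITimeoutError, InternalServerError):
            if attempt == retries - 1:
                raise  # Out of attempts: re-raise the last error
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, ...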

**Model tags:** Use instruct-tuned variants (e.g., `mistral:7b-instruct`) for chat; base models may produce poor conversational output.

**Hardware impact:** Lower-bit quantization (e.g., Q4 vs Q8) reduces memory use and usually speeds up generation at some cost in output quality; benchmark on your target hardware to find the right balance.
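A back-of-the-envelope estimate shows why quantization level matters for memory. The sketch below counts weights only, at a nominal number of bits per weight. Real quantized files store extra scaling data, and inference also needs memory for the KV cache and runtime, so actual usage is higher. `estimate_weight_memory_gb` is a hypothetical helper.

```python
def estimate_weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for bits in (4, 8, 16):
    print(f"7B at {bits}-bit: ~{estimate_weight_memory_gb(7, bits):.1f} GB")
```

By this estimate a 7B model needs roughly 3.5 GB for weights at 4 bits and 14 GB at 16 bits, which is why lower-bit variants are the practical choice on an 8GB machine.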

## Conclusion

You've configured the OpenAI SDK to use Ollama as a local inference server, implemented streaming with early cancellation, and built a basic benchmarking harness to measure latency and throughput. This setup accelerates your dev loop by eliminating cloud dependencies and API costs while maintaining SDK compatibility.

**Next steps:**
- Explore [Creating Custom Modelfiles in Ollama](/article/creating-custom-modelfiles-ollama) to standardize system prompts and parameters across your team
- Learn [Deploying Ollama for Team Access](/article/deploying-ollama-team-access) to share models over your local network with authentication