Skip to content
Ori Pekelman edited this page May 18, 2026 · 1 revision

Tep::Llm

A ruby-openai-shaped chat-completions client. Speaks the /v1/chat/completions wire shape — same one Ollama, OpenAI proper, vLLM, and the sibling toy project's tep_demo/openai_api.rb all serve — so backends are configuration, not code.

Sync and streaming variants both ship: chat(messages) for one-shot replies, chat_stream(messages, out) for an SSE-style live token feed into a Tep::Stream.

Scope

Feature v1
Synchronous chat(messages) yes
Streaming chat_stream(messages, out) yes (SSE)
OpenAI wire protocol over HTTP/1.0 yes
Bearer-token auth (OpenAI / OpenRouter) yes
Configurable base_url + model yes
Single system prompt yes
Multi-turn history (caller-managed) yes
Token usage stats no (advisory only)
Tool / function calling no
Vision / multimodal no
HTTPS / TLS no (front a proxy)

For TLS-terminating proxies (OpenAI direct, OpenRouter), front Tep with nginx/caddy that handles the outbound HTTPS. Plain HTTP to a local backend (Ollama, vLLM, toy) is the design point.

API

client = Tep::Llm.new("http://localhost:11434")   # Ollama default
client.set_model("llama3")
client.set_api_key("")                             # empty = unset
client.set_system_prompt("You are helpful.")      # optional

msgs = [
  Tep::Llm::Message.new("user", "What is 2+2?"),
]

reply = client.chat(msgs)
puts reply.content       # "4"
puts reply.stop_reason   # "stop" | "length" | "error"

Tep::Llm::Message.new(role, content)role is "system", "user", or "assistant". The optional system prompt set via set_system_prompt is prepended automatically; do not also push a "system" Message yourself.

Tep::Llm::Response:

Field Type Notes
content String The assistant reply text. "" on transport/parse failure.
role String Echoes the assistant role from the response.
stop_reason String "stop", "length", "error". Advisory; not load-bearing.

Backends

Tep::Llm.new("http://localhost:11434")    # Ollama
Tep::Llm.new("http://localhost:8080")     # toy/tep_demo/openai_api
Tep::Llm.new("https://api.openai.com")    # OpenAI proper (proxy via http://)

OpenAI direct requires an API key. Practically, run a TLS terminator (nginx, caddy) listening on plain HTTP locally and forwarding HTTPS to api.openai.com; point Tep::Llm at the local plaintext endpoint.

Streaming

set :scheduler, :scheduled   # streaming wants the scheduled server

class ChatStreamer < Tep::Streamer
  attr_accessor :messages
  def pump(out)
    client = Tep::Llm.new(ENV.fetch("CHAT_BACKEND"))
    client.set_model(ENV.fetch("CHAT_MODEL"))
    client.chat_stream(@messages, out)
    0
  end
end

post '/api/send' do
  history = load_history(req)
  s = ChatStreamer.new
  s.messages = history
  res.headers["Content-Type"] = "text/event-stream"
  stream s
end

The out_stream parameter is anything with write(String) -> Integer; the framework's Tep::Stream (from Tep::Streamer#pump) is the canonical caller. Each SSE delta lands as one data: {"content":"<delta>"}\n\n chunk. A final data: [DONE]\n\n marks end-of-stream after a stop / disconnect.

chat_stream returns the accumulated assistant content as a String so the caller can persist the full reply once the stream completes.

Multi-turn

History is the caller's job. Persist to Tep::SQLite, hold in session, do whatever fits the app:

def history_for(conversation_id)
  db = open_db
  db.prepare("SELECT role, content FROM messages WHERE conv_id = ? ORDER BY id")
  db.bind_int(1, conversation_id)
  out = []
  while db.step == "row"
    out.push(Tep::Llm::Message.new(db.col_str(0), db.col_str(1)))
  end
  db.finalize
  out
end

# Append the user turn, run the model, persist the reply.
history = history_for(conv_id)
history.push(Tep::Llm::Message.new("user", req.params["message"]))
reply = client.chat(history)
persist_message(conv_id, "user", req.params["message"])
persist_message(conv_id, "assistant", reply.content)

Error handling

Like Tep::Http, there are no exceptions. The Response always comes back. On transport failure (connect refused, timeout, malformed upstream response) or parse failure (no choices[0].message.content in the body):

reply = client.chat(msgs)
if reply.stop_reason == "error"
  res.set_status(503)
  return "upstream LLM unreachable"
end

Inspect reply.stop_reason == "error" for the catch-all. The framework will not retry or fall back to another model on its own — that's application policy. See examples/chatbot/ for a multi-backend compare pattern.

Cookbook

Bare one-shot

client = Tep::Llm.new("http://localhost:11434")
client.set_model("llama3")
reply = client.chat([Tep::Llm::Message.new("user", "summarise: " + body)])
res.headers["Content-Type"] = "text/plain"
reply.content

System prompt + JSON output coaxing

client.set_system_prompt(
  'You are a translator. Reply with a JSON object: ' \
  '{"translation": "<text>", "lang": "<iso>"}'
)
reply = client.chat([Tep::Llm::Message.new("user", "Hola")])
# `Tep::Json.get_str(reply.content, "translation")` to read it out.

Multi-backend fan-out (compare)

See examples/chatbot/app.rb for the full pattern: run the same prompt against N backends, return them side-by-side. Today the dispatch is sequential (the parallel fork fan-out via Tep::Parallel is blocked on matz/spinel#575).

Pitfalls

  • set_api_key("") is a no-op deletion. An empty key string leaves any prior Authorization header in place. To rotate, call set_api_key with the new value, or stand up a fresh client.
  • Content-Length on POST. The underlying Tep::Http sets it from body.length; the OpenAI JSON is built internally so this is handled. Don't add Content-Length yourself.
  • System prompt + an explicit system message in the array. The system prompt is prepended automatically. If you also push a Tep::Llm::Message.new("system", ...) into messages, the model sees two system turns. Pick one.
  • Streaming under the blocking server. chat_stream writes chunks as they arrive; under prefork-blocking, one slow stream holds the accepting worker. Use set :scheduler, :scheduled for any non-trivial streaming workload.

Reference

Clone this wiki locally