Llm

`Tep::Llm`

A ruby-openai-shaped chat-completions client. Speaks the /v1/chat/completions wire shape — same one Ollama, OpenAI proper, vLLM, and the sibling toy project's tep_demo/openai_api.rb all serve — so backends are configuration, not code.

Sync and streaming variants both ship: chat(messages) for one-shot replies, chat_stream(messages, out) for an SSE-style live token feed into a Tep::Stream.

Scope

Feature	v1
Synchronous `chat(messages)`	yes
Streaming `chat_stream(messages, out)`	yes (SSE)
OpenAI wire protocol over HTTP/1.0	yes
Bearer-token auth (OpenAI / OpenRouter)	yes
Configurable `base_url` + model	yes
Single system prompt	yes
Multi-turn history (caller-managed)	yes
Token usage stats	no (advisory only)
Tool / function calling	no
Vision / multimodal	no
HTTPS / TLS	no (front a proxy)

For TLS-terminating proxies (OpenAI direct, OpenRouter), front Tep with nginx/caddy that handles the outbound HTTPS. Plain HTTP to a local backend (Ollama, vLLM, toy) is the design point.

API

client = Tep::Llm.new("http://localhost:11434")   # Ollama default
client.set_model("llama3")
client.set_api_key("")                             # empty = unset
client.set_system_prompt("You are helpful.")      # optional

msgs = [
  Tep::Llm::Message.new("user", "What is 2+2?"),
]

reply = client.chat(msgs)
puts reply.content       # "4"
puts reply.stop_reason   # "stop" | "length" | "error"

Tep::Llm::Message.new(role, content) — role is "system", "user", or "assistant". The optional system prompt set via set_system_prompt is prepended automatically; do not also push a "system" Message yourself.

Tep::Llm::Response:

Field	Type	Notes
`content`	String	The assistant reply text. `""` on transport/parse failure.
`role`	String	Echoes the `assistant` role from the response.
`stop_reason`	String	`"stop"`, `"length"`, `"error"`. Advisory; not load-bearing.

Backends

Tep::Llm.new("http://localhost:11434")    # Ollama
Tep::Llm.new("http://localhost:8080")     # toy/tep_demo/openai_api
Tep::Llm.new("https://api.openai.com")    # OpenAI proper (proxy via http://)

OpenAI direct requires an API key. Practically, run a TLS terminator (nginx, caddy) listening on plain HTTP locally and forwarding HTTPS to api.openai.com; point Tep::Llm at the local plaintext endpoint.

Streaming

set :scheduler, :scheduled   # streaming wants the scheduled server

class ChatStreamer < Tep::Streamer
  attr_accessor :messages
  def pump(out)
    client = Tep::Llm.new(ENV.fetch("CHAT_BACKEND"))
    client.set_model(ENV.fetch("CHAT_MODEL"))
    client.chat_stream(@messages, out)
    0
  end
end

post '/api/send' do
  history = load_history(req)
  s = ChatStreamer.new
  s.messages = history
  res.headers["Content-Type"] = "text/event-stream"
  stream s
end

The out_stream parameter is anything with write(String) -> Integer; the framework's Tep::Stream (from Tep::Streamer#pump) is the canonical caller. Each SSE delta lands as one data: {"content":"<delta>"}\n\n chunk. A final data: [DONE]\n\n marks end-of-stream after a stop / disconnect.

chat_stream returns the accumulated assistant content as a String so the caller can persist the full reply once the stream completes.

Multi-turn

History is the caller's job. Persist to Tep::SQLite, hold in session, do whatever fits the app:

def history_for(conversation_id)
  db = open_db
  db.prepare("SELECT role, content FROM messages WHERE conv_id = ? ORDER BY id")
  db.bind_int(1, conversation_id)
  out = []
  while db.step == "row"
    out.push(Tep::Llm::Message.new(db.col_str(0), db.col_str(1)))
  end
  db.finalize
  out
end

# Append the user turn, run the model, persist the reply.
history = history_for(conv_id)
history.push(Tep::Llm::Message.new("user", req.params["message"]))
reply = client.chat(history)
persist_message(conv_id, "user", req.params["message"])
persist_message(conv_id, "assistant", reply.content)

Error handling

Like Tep::Http, there are no exceptions. The Response always comes back. On transport failure (connect refused, timeout, malformed upstream response) or parse failure (no choices[0].message.content in the body):

reply = client.chat(msgs)
if reply.stop_reason == "error"
  res.set_status(503)
  return "upstream LLM unreachable"
end

Inspect reply.stop_reason == "error" for the catch-all. The framework will not retry or fall back to another model on its own — that's application policy. See examples/chatbot/ for a multi-backend compare pattern.

Cookbook

Bare one-shot

client = Tep::Llm.new("http://localhost:11434")
client.set_model("llama3")
reply = client.chat([Tep::Llm::Message.new("user", "summarise: " + body)])
res.headers["Content-Type"] = "text/plain"
reply.content

System prompt + JSON output coaxing

client.set_system_prompt(
  'You are a translator. Reply with a JSON object: ' \
  '{"translation": "<text>", "lang": "<iso>"}'
)
reply = client.chat([Tep::Llm::Message.new("user", "Hola")])
# `Tep::Json.get_str(reply.content, "translation")` to read it out.

Multi-backend fan-out (compare)

See examples/chatbot/app.rb for the full pattern: run the same prompt against N backends, return them side-by-side. Today the dispatch is sequential (the parallel fork fan-out via Tep::Parallel is blocked on matz/spinel#575).

Pitfalls

set_api_key("") is a no-op deletion. An empty key string leaves any prior Authorization header in place. To rotate, call set_api_key with the new value, or stand up a fresh client.
Content-Length on POST. The underlying Tep::Http sets it from body.length; the OpenAI JSON is built internally so this is handled. Don't add Content-Length yourself.
System prompt + an explicit system message in the array. The system prompt is prepended automatically. If you also push a Tep::Llm::Message.new("system", ...) into messages, the model sees two system turns. Pick one.
Streaming under the blocking server. chat_stream writes chunks as they arrive; under prefork-blocking, one slow stream holds the accepting worker. Use set :scheduler, :scheduled for any non-trivial streaming workload.

Reference

API definition: lib/tep/llm.rb.
Test coverage: test/test_llm.rb.
Full example: examples/chatbot/app.rb — exercises sync chat, streaming chat_stream, multi-backend compare, session-backed multi-turn, all in one app.
toy/tep_demo: https://github.com/OriPekelman/toy/tree/main/tep_demo — the GPT-2 / DistilGPT2 backend used in dev to avoid hitting OpenAI for every test loop.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Llm

`Tep::Llm`

Scope

API

Backends

Streaming

Multi-turn

Error handling

Cookbook

Bare one-shot

System prompt + JSON output coaxing

Multi-backend fan-out (compare)

Pitfalls

Reference

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally