-
Notifications
You must be signed in to change notification settings - Fork 1
Llm
A ruby-openai-shaped chat-completions client. Speaks the
/v1/chat/completions wire shape — same one Ollama, OpenAI proper,
vLLM, and the sibling toy
project's tep_demo/openai_api.rb all serve — so backends are
configuration, not code.
Sync and streaming variants both ship: chat(messages) for one-shot
replies, chat_stream(messages, out) for an SSE-style live token
feed into a Tep::Stream.
| Feature | v1 |
|---|---|
Synchronous chat(messages)
|
yes |
Streaming chat_stream(messages, out)
|
yes (SSE) |
| OpenAI wire protocol over HTTP/1.0 | yes |
| Bearer-token auth (OpenAI / OpenRouter) | yes |
Configurable base_url + model |
yes |
| Single system prompt | yes |
| Multi-turn history (caller-managed) | yes |
| Token usage stats | no (advisory only) |
| Tool / function calling | no |
| Vision / multimodal | no |
| HTTPS / TLS | no (front a proxy) |
For TLS-terminating proxies (OpenAI direct, OpenRouter), front Tep with nginx/caddy that handles the outbound HTTPS. Plain HTTP to a local backend (Ollama, vLLM, toy) is the design point.
client = Tep::Llm.new("http://localhost:11434") # Ollama default
client.set_model("llama3")
client.set_api_key("") # empty = unset
client.set_system_prompt("You are helpful.") # optional
msgs = [
Tep::Llm::Message.new("user", "What is 2+2?"),
]
reply = client.chat(msgs)
puts reply.content # "4"
puts reply.stop_reason # "stop" | "length" | "error"Tep::Llm::Message.new(role, content) — role is "system",
"user", or "assistant". The optional system prompt set via
set_system_prompt is prepended automatically; do not also push
a "system" Message yourself.
Tep::Llm::Response:
| Field | Type | Notes |
|---|---|---|
content |
String | The assistant reply text. "" on transport/parse failure. |
role |
String | Echoes the assistant role from the response. |
stop_reason |
String |
"stop", "length", "error". Advisory; not load-bearing. |
Tep::Llm.new("http://localhost:11434") # Ollama
Tep::Llm.new("http://localhost:8080") # toy/tep_demo/openai_api
Tep::Llm.new("https://api.openai.com") # OpenAI proper (proxy via http://)OpenAI direct requires an API key. Practically, run a TLS terminator
(nginx, caddy) listening on plain HTTP locally and forwarding HTTPS
to api.openai.com; point Tep::Llm at the local plaintext
endpoint.
set :scheduler, :scheduled # streaming wants the scheduled server
class ChatStreamer < Tep::Streamer
attr_accessor :messages
def pump(out)
client = Tep::Llm.new(ENV.fetch("CHAT_BACKEND"))
client.set_model(ENV.fetch("CHAT_MODEL"))
client.chat_stream(@messages, out)
0
end
end
post '/api/send' do
history = load_history(req)
s = ChatStreamer.new
s.messages = history
res.headers["Content-Type"] = "text/event-stream"
stream s
endThe out_stream parameter is anything with write(String) -> Integer; the framework's Tep::Stream (from Tep::Streamer#pump)
is the canonical caller. Each SSE delta lands as one
data: {"content":"<delta>"}\n\n chunk. A final data: [DONE]\n\n
marks end-of-stream after a stop / disconnect.
chat_stream returns the accumulated assistant content as a String
so the caller can persist the full reply once the stream completes.
History is the caller's job. Persist to Tep::SQLite, hold in
session, do whatever fits the app:
def history_for(conversation_id)
db = open_db
db.prepare("SELECT role, content FROM messages WHERE conv_id = ? ORDER BY id")
db.bind_int(1, conversation_id)
out = []
while db.step == "row"
out.push(Tep::Llm::Message.new(db.col_str(0), db.col_str(1)))
end
db.finalize
out
end
# Append the user turn, run the model, persist the reply.
history = history_for(conv_id)
history.push(Tep::Llm::Message.new("user", req.params["message"]))
reply = client.chat(history)
persist_message(conv_id, "user", req.params["message"])
persist_message(conv_id, "assistant", reply.content)Like Tep::Http, there are no exceptions. The Response always comes
back. On transport failure (connect refused, timeout, malformed
upstream response) or parse failure (no choices[0].message.content
in the body):
reply = client.chat(msgs)
if reply.stop_reason == "error"
res.set_status(503)
return "upstream LLM unreachable"
endInspect reply.stop_reason == "error" for the catch-all. The framework
will not retry or fall back to another model on its own — that's
application policy. See examples/chatbot/ for a multi-backend
compare pattern.
client = Tep::Llm.new("http://localhost:11434")
client.set_model("llama3")
reply = client.chat([Tep::Llm::Message.new("user", "summarise: " + body)])
res.headers["Content-Type"] = "text/plain"
reply.contentclient.set_system_prompt(
'You are a translator. Reply with a JSON object: ' \
'{"translation": "<text>", "lang": "<iso>"}'
)
reply = client.chat([Tep::Llm::Message.new("user", "Hola")])
# `Tep::Json.get_str(reply.content, "translation")` to read it out.See examples/chatbot/app.rb for the full pattern: run the same
prompt against N backends, return them side-by-side. Today the
dispatch is sequential (the parallel fork fan-out via
Tep::Parallel is blocked on
matz/spinel#575).
-
set_api_key("")is a no-op deletion. An empty key string leaves any priorAuthorizationheader in place. To rotate, callset_api_keywith the new value, or stand up a fresh client. -
Content-Length on POST. The underlying
Tep::Httpsets it frombody.length; the OpenAI JSON is built internally so this is handled. Don't addContent-Lengthyourself. -
System prompt + an explicit
systemmessage in the array. The system prompt is prepended automatically. If you also push aTep::Llm::Message.new("system", ...)intomessages, the model sees two system turns. Pick one. -
Streaming under the blocking server.
chat_streamwrites chunks as they arrive; under prefork-blocking, one slow stream holds the accepting worker. Useset :scheduler, :scheduledfor any non-trivial streaming workload.
- API definition:
lib/tep/llm.rb. - Test coverage:
test/test_llm.rb. - Full example:
examples/chatbot/app.rb— exercises syncchat, streamingchat_stream, multi-backend compare, session-backed multi-turn, all in one app. - toy/tep_demo: https://github.com/OriPekelman/toy/tree/main/tep_demo — the GPT-2 / DistilGPT2 backend used in dev to avoid hitting OpenAI for every test loop.