A Retrieval-Augmented Classification system implemented in Go using:
- YZMA (hybridgroup/yzma) for local embedding generation via llama.cpp
- PostgreSQL + pgvector as the vector database for similarity search
- MCP (Model Context Protocol) for AI assistant integration
- OpenAI-Compatible API for seamless integration with LLM frameworks
Instead of fine-tuning a model for classification, this project uses a RAG approach to classify user queries into topic names. This allows for dynamic addition or modification of query-to-topic mappings without retraining.
For a given user query, the system creates an embedding, searches for the top match in the classifier table using cosine similarity, and returns the associated topic. If no match meets the threshold or the database is empty, it returns a configurable default_topic (default: "none").
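The threshold-and-fallback logic described above can be sketched in plain Go. This is an illustrative model only: the `classify` helper, the in-memory entry map, and the threshold value are assumptions for the sketch, not the project's actual API (which searches pgvector in PostgreSQL).

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// classify returns the topic of the best-scoring entry, or defaultTopic
// when no entry reaches the threshold (or there are no entries at all).
func classify(query []float64, entries map[string][]float64, threshold float64, defaultTopic string) (string, float64) {
	topic, score := defaultTopic, 0.0
	for t, emb := range entries {
		if s := cosine(query, emb); s > score {
			topic, score = t, s
		}
	}
	if score < threshold {
		return defaultTopic, score
	}
	return topic, score
}

func main() {
	entries := map[string][]float64{
		"billing":   {1, 0},
		"technical": {0, 1},
	}
	topic, score := classify([]float64{0.9, 0.1}, entries, 0.5, "none")
	fmt.Printf("Classified Topic: %s (Score: %.4f)\n", topic, score)
}
```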
- Local embedding generation using any GGUF embedding model
- Vector similarity search using PostgreSQL's `pgvector` and HNSW indexing
- OpenAI-compatible `/v1/chat/completions` endpoint for classification
- MCP server with stdio transport for integration with AI assistants (Claude, Amp, etc.)
- JSONL-based export and import for easy data portability
- Handcrafted CLI for efficient management of classification entries
- Go 1.25+
- PostgreSQL with the pgvector extension installed
- llama.cpp library — download or build, then set `YZMA_LIB`
- GGUF embedding model — e.g., T5Gemma 2-270M (recommended for 640-dim vectors)
```shell
# Using yzma CLI to download
go install github.com/hybridgroup/yzma/cmd/yzma@latest
yzma lib install

# Set the library path
export YZMA_LIB=/path/to/libllama.so    # Linux
export YZMA_LIB=/path/to/libllama.dylib # macOS
```

```shell
# Build the binary (no CGo required)
go build -o classifier .
```
```shell
# Run the classifier, both OpenAI API and MCP Server (stdio)
./classifier serve
```

```shell
./classifier add billing "How can I pay my bill?"
./classifier add billing "I have a question about my invoice"
./classifier add technical "My internet is not working"
./classifier add technical "I am experiencing slow speeds"
```

```shell
./classifier query "I want to pay my monthly subscription"
```

Output:

```
Classified Topic: billing (Score: 0.9123)
```
```shell
# Export all entries to a directory (grouped by topic)
./classifier export ./backup

# Import entries from a directory
./classifier import ./data
```

The classifier provides an endpoint at `http://localhost:8080/v1/chat/completions`. It takes the last message's content as the query and returns the classified topic as the assistant's message.
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "classifier",
    "messages": [{"role": "user", "content": "My router is blinking red"}]
  }'
```

`classifier` supports configuration via `config.yaml`, environment variables, and command-line flags.
```yaml
model: "./models/nomic-embed-text-v1.5.Q8_0.gguf"
lib_path: "/usr/local/lib/libllama.so"
database_url: "postgres://postgres@localhost:5432/postgres?sslmode=disable"
default_topic: "none"
context_size: 512
batch_size: 512
verbose: false
server:
  port: "8080"
  transport: "stdio"
```

| Variable | Description | Default |
|---|---|---|
| `CLASSIFIER_MODEL` | Path to GGUF embedding model | — |
| `YZMA_LIB` | Path to llama.cpp library | — |
| `CLASSIFIER_DATABASE_URL` | PostgreSQL connection string | `postgres://...` |
| `CLASSIFIER_DEFAULT_TOPIC` | Topic returned when no match found | `none` |
| `CLASSIFIER_CONTEXT_SIZE` | Context size for embeddings | `512` |
| `CLASSIFIER_BATCH_SIZE` | Batch size for processing | `512` |
| `CLASSIFIER_VERBOSE` | Enable verbose logging | `false` |
```
.
├── api.go        # OpenAI-compatible HTTP API
├── mcp_server.go # MCP server implementation
├── rag.go        # RAG core (PostgreSQL + pgvector + YZMA)
├── main.go       # CLI entry point and flag parsing
├── config.go     # Configuration management
├── command.go    # Handcrafted command registry
├── cmd_add.go    # Add entry command
├── cmd_query.go  # Query/Classify command
├── cmd_export.go # JSONL Export command
├── cmd_import.go # JSONL Import command
└── cmd_serve.go  # Start API and MCP servers
```
- Automatic Prefixing: To optimize performance for instruction-tuned models like T5Gemma, the system automatically prepends the `classification: ` prefix to any text before generating its embedding (if not already present). This ensures the model weights semantic features correctly for classification tasks while keeping your stored content clean.
- Embedding Generation: The text (with prefix) is tokenized and passed through the local GGUF model via YZMA/llama.cpp to produce a high-dimensional vector.
- Vector Storage: The original content, topic, and normalized embedding are stored in PostgreSQL.
- Similarity Search: For classification queries, the query text is embedded with the same prefix and compared against stored vectors using cosine similarity (`1 - (embedding <=> $1)`).
- Classification: The system returns the topic with the highest similarity score, or the `default_topic` if no matches are found.
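The automatic-prefixing step above can be sketched as a small idempotent helper. The exact prefix literal is assumed here to be `"classification: "`; check the source for the precise string the project uses.

```go
package main

import (
	"fmt"
	"strings"
)

// ensurePrefix prepends the classification prefix unless the text
// already starts with it, so stored content is never double-prefixed.
func ensurePrefix(text string) string {
	const p = "classification: " // assumed literal, see rag.go
	if strings.HasPrefix(text, p) {
		return text
	}
	return p + text
}

func main() {
	fmt.Println(ensurePrefix("My internet is not working"))
	fmt.Println(ensurePrefix("classification: already prefixed"))
}
```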
MIT