A minimal, dependency‑free C extension for Ruby that loads GGUF embedding models and computes text embeddings locally.
- Zero external dependencies – no TensorFlow, PyTorch, or ONNX runtime.
- Single‑file C extension – fast loading and local sentence embeddings.
- Runs BERT/GTE embedding models end to end – token embeddings, transformer layers, mean pooling, and L2 normalization.
- Supports the GGML tensor types declared by this extension – F32, F16, Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, and K-quants through Q8_K.
- Works entirely offline – your data never leaves your machine.
- Perfect for weekend projects, proof‑of‑concepts, or learning about embeddings.
Add this line to your application's Gemfile:

```ruby
gem 'mini_embed'
```

Then execute:

```sh
bundle install
```

Or install it globally:

```sh
gem install mini_embed
```

You will need:

- A POSIX system (Linux, macOS, BSD) – Windows via WSL2 works.
- A C compiler and make (for compiling the native extension).
- A GGUF embedding model file (see Where to get models).
```ruby
require 'mini_embed'

# Load a GGUF embedding model
model = MiniEmbed.new(model: '/path/to/gte-small.Q4_0.gguf')

# Get embedding as an array of floats (default)
embedding = model.embeddings(text: 'hello world')
puts embedding.size   # e.g. 384
puts embedding[0..4]  # e.g. [0.0123, -0.0456, ...]

# Or get the raw binary string (little‑endian 32‑bit floats)
binary = model.embeddings(text: 'hello world', type: :binary)
embedding_from_binary = binary.unpack('e*')
```

Note: The `type` parameter is optional – it defaults to `:vector`, which returns a Ruby `Array<Float>`. Use `type: :binary` to get the raw binary string (compatible with the original C extension).
You can also request L2 normalization for the fallback token-averaging path:

```ruby
model = MiniEmbed.new(model: '/path/to/model.gguf', normalize: :l2)
```

For supported BERT/GTE GGUF models, MiniEmbed already returns L2-normalized sentence embeddings to match llama.cpp embedding output.
For BERT/GTE-style GGUF embedding models, MiniEmbed uses the model's WordPiece vocabulary, adds CLS/SEP tokens, runs the transformer stack, mean-pools the sequence output, and L2-normalizes the result.
For non-BERT GGUF models, MiniEmbed falls back to pre-tokenization plus vocabulary/BPE lookup and averages token embedding rows. That fallback is useful for simple experiments, but it is not equivalent to running a full transformer model.
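The final two steps of the BERT/GTE path – mean pooling and L2 normalization – are easy to state in Ruby. This is a pure-Ruby sketch of the math, not the gem's actual C implementation:

```ruby
# Mean-pool per-token hidden states (seq_len x hidden_dim) into a single
# sentence vector, then L2-normalize it – the same post-processing the
# BERT/GTE path applies after the transformer stack.
def pool_and_normalize(token_vectors)
  dim    = token_vectors.first.size
  pooled = Array.new(dim, 0.0)
  token_vectors.each { |vec| dim.times { |i| pooled[i] += vec[i] } }
  pooled.map! { |x| x / token_vectors.size }

  norm = Math.sqrt(pooled.sum { |x| x * x })
  norm.zero? ? pooled : pooled.map { |x| x / norm }
end

# Example with two 3-dimensional token vectors:
# mean is [0.5, 0.5, 0.0]; after L2 normalization the nonzero
# entries become 1/sqrt(2).
pool_and_normalize([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```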
If you need a model/tokenizer family that is not covered by the current C path, you can:
- Pre‑tokenize in Ruby using the tokenizers gem and pass token IDs (not yet exposed in the C API, but easy to add).
- Run llama.cpp as a server and call its embeddings endpoint.
| Type ID | Format |
|---|---|
| 0 | F32 (float32) |
| 1 | F16 (float16) |
| 2 | Q4_0 |
| 3 | Q4_1 |
| 6 | Q5_0 |
| 7 | Q5_1 |
| 8 | Q8_0 |
| 9 | Q8_1 |
| 10 | Q2_K |
| 11 | Q3_K |
| 12 | Q4_K |
| 13 | Q5_K |
| 14 | Q6_K |
| 15 | Q8_K |
The extension validates tensor row alignment while loading the GGUF file and dequantizes rows as they are used. Q4_0 linear layers use a ggml-style optimized dot-product path; other quantized linear layers fall back to fully dequantized float rows.
MiniEmbed supports the tensor types listed above. It does not currently implement newer llama.cpp formats that are not declared in this extension, such as IQ, MXFP, or NVFP variants.
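To make the on-demand dequantization concrete, here is a pure-Ruby sketch of a Q4_0 block following ggml's standard layout (32 weights per block, one scale, 16 bytes of packed 4-bit quants). In the GGUF file the scale is stored as an f16; for clarity this sketch takes it as a Float. This illustrates the format, not the gem's actual C code:

```ruby
QK4_0 = 32  # weights per Q4_0 block

# Dequantize one Q4_0 block: a scale `d` plus 16 bytes of packed 4-bit
# quants. ggml stores weight j in the low nibble of byte j and weight
# j+16 in the high nibble; each quant is offset by 8 before scaling.
def dequantize_q4_0(d, quant_bytes)
  out = Array.new(QK4_0)
  quant_bytes.each_with_index do |byte, j|
    out[j]      = ((byte & 0x0F) - 8) * d  # low nibble  -> weights 0..15
    out[j + 16] = ((byte >> 4)   - 8) * d  # high nibble -> weights 16..31
  end
  out
end

# Example: scale 0.5, first byte packs quants 8 (low) and 9 (high),
# so weight 0 dequantizes to 0.0 and weight 16 to 0.5.
dequantize_q4_0(0.5, [0x98] + [0x88] * 15)
```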
Hugging Face offers many GGUF models, e.g.:
- gte-small
- all-MiniLM-L6-v2
You can convert any safetensors or PyTorch model using the convert‑hf‑to‑gguf.py script from llama.cpp.
For testing, we recommend the gte-small model (384 dimensions, ~30k vocabulary).
- Single‑threaded, blocking C code – embedding computation runs on the calling Ruby thread without yielding, so it blocks the interpreter until it finishes.
- No batching – only one text at a time.
- BERT/GTE support is intentionally narrow and only covers the tensor/tokenizer shapes implemented in the C extension.
- Model files are memory-mapped and tensor rows are dequantized on demand, but large GGUF files still consume address space and memory bandwidth.
- No GPU support – CPU only.
- Error handling is minimal – invalid models may crash the Ruby process.
If you need a robust, scalable solution, consider:
- Running llama.cpp as a server (./server -m model.gguf --embeddings) and calling its HTTP endpoint.
- Using a cloud embeddings API (OpenAI, Cohere, VoyageAI, etc.).
- Deploying a dedicated inference service with BentoML or Ray Serve.
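If you take the llama.cpp server route, the call needs only Ruby's standard library. A sketch under assumptions: the `/embedding` endpoint path, `content` payload key, and response shape follow llama.cpp's server API but can vary across versions, and the host/port are placeholders:

```ruby
require 'json'
require 'net/http'

# Build a POST request for a locally running llama.cpp server
# (started with: ./server -m model.gguf --embeddings).
def build_embedding_request(text, host: 'localhost', port: 8080)
  uri = URI("http://#{host}:#{port}/embedding")
  req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  req.body = JSON.generate(content: text)
  [uri, req]
end

# Sending the request (requires the server to be running):
# uri, req = build_embedding_request('hello world')
# res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
# embedding = JSON.parse(res.body)['embedding']
```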
Bug reports and pull requests are welcome on GitHub. To run the tests:

```sh
bundle exec rspec
```

The gem uses rake-compiler to build the extension. After making changes to the C source, run:

```sh
bundle exec rake compile
```

MIT License. See LICENSE.