Rust bindings for llama.cpp
The crate llama-cpp contains idiomatic Rust bindings for llama.cpp.
It offers a low-level synchronous API and a high-level asynchronous API.
A simple command line interface that also serves as an example is included in llama-cpp-cli. Try running:
cargo run -- chat -m path/to/model.gguf
llama-cpp-sys contains the low-level FFI bindings to llama.cpp. The llama.cpp source code is included as a git submodule. The build script builds llama.cpp, links to it statically, and generates the bindings with bindgen.
Make sure to pull the submodule:
git submodule update --init -- llama-cpp-sys/llama.cpp/
Both Tokio and async-std are supported. You choose which one is used by enabling one of the following features:
- runtime-async-std
- runtime-tokio
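For example, a minimal Cargo.toml sketch for the Tokio runtime might look like the following (the llama-cpp version is a placeholder, check crates.io for the current release; the tokio and futures dependencies are only needed to run the snippet further down):

[dependencies]
# placeholder version, check crates.io for the current release
llama-cpp = { version = "*", features = ["runtime-tokio"] }
# async executor and stream utilities used by the example below
tokio = { version = "1", features = ["full"] }
futures = "0.3"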
- Text Generation
- Embedding
- GPU support
- LoRA
- grammar sampling
- repetition penalties
- beam search
- classifier-free guidance
- logit bias
- tokio runtime
- async-std runtime
- API server
- Sliding context
- Prompt templates
Examples are located in examples/. They are standalone crates.
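Because the examples are plain Cargo crates, you can run one directly from its directory (the directory name here is a placeholder):

cd examples/<example-name>
cargo run

The snippet below sketches the high-level asynchronous API: load a model, create a session, feed a prompt into a sequence, and stream the generated text piece by piece.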
use std::io::{stdout, Write};
use futures::{pin_mut, TryStreamExt};

// ModelLoader, Session and Tokenize are exported by the llama-cpp crate.

// load the model asynchronously; `model_path` points to a GGUF file on disk
let model = ModelLoader::load(model_path, Default::default())
    .wait_for_model()
    .await?;

// print the prompt
let prompt = "The capital of France is";
print!("{}", prompt);
stdout().flush()?;

// create an inference session
let session = Session::from_model(model, Default::default());

// create a sequence and feed the prompt to it
let mut sequence = session.sequence();
sequence
    .push(Tokenize {
        text: prompt,
        add_bos: true,
        allow_special: false,
    })
    .await?;

// create a response stream from the sequence; the stream must be pinned
// before it can be polled with `TryStreamExt::try_next`
let stream = sequence.stream::<String>(Default::default());
pin_mut!(stream);

// stream the LLM output piece by piece
while let Some(piece) = stream.try_next().await? {
    print!("{piece}");
    stdout().flush()?;
}
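For completeness, here is a minimal sketch of how such a snippet could be wrapped into a runnable binary, assuming the runtime-tokio feature and the tokio dependency shown above (the argument handling is only an illustration):

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // path to a GGUF model file, e.g. passed as the first CLI argument
    let model_path = std::env::args().nth(1).expect("usage: <model.gguf>");

    // ... the generation snippet from above goes here ...

    Ok(())
}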