A Dart/Flutter plugin for llama.cpp. Run LLM inference directly in Dart and Flutter applications using GGUF models with hardware acceleration.
Actively Under Development. The core features are implemented and running. Many more features are in the pipeline, including:
- High-level APIs for easier integration.
- Multi-modality support (Vision/LLaVA).
We welcome contributors to help us test on more platforms (especially Windows)!
| Platform | Architecture(s) | GPU Backend | Status |
|---|---|---|---|
| macOS | Universal (arm64, x86_64) | Metal | ✅ Tested (CPU, Metal) |
| iOS | arm64 (Device), x86_64/arm64 (Sim) | Metal (Device), CPU (Sim) | ✅ Tested (CPU, Metal) |
| Android | arm64-v8a, x86_64 | Vulkan (if supported) | ✅ Tested (CPU, Vulkan) |
| Linux | x86_64 | CUDA / Vulkan | ❓ Needs Testing |
| Windows | x86_64 | CUDA / Vulkan | ❓ Needs Testing |
| Web | WASM | CPU (WASM) | ✅ Tested (WASM) |
Add `llamadart` to your `pubspec.yaml`:

```yaml
dependencies:
  llamadart: ^0.1.0
```
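Then run `flutter pub get`. The package is imported with a single line (the same import used in the quick start below):

```dart
import 'package:llamadart/llamadart.dart';
```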
iOS: No manual setup required. The plugin automatically builds llama.cpp for iOS (Device/Simulator) when you run `flutter build ios`.
Note: The first build will take a few minutes to compile the C++ libraries.
Desktop (macOS/Linux/Windows): The package handles native builds automatically via CMake.
- macOS: Metal acceleration is enabled by default.
- Linux/Windows: CPU inference is supported.
Android: No manual setup required. The plugin uses CMake to compile the native library automatically.
- Ensure you have the Android NDK installed via Android Studio.
- The first build will take a few minutes to compile the llama.cpp libraries for your target device's architecture.
Web: Zero-config by default (uses jsDelivr CDN for wllama).
- Import and use `LlamaService`.
- Enable WASM support in Flutter web:

```sh
flutter run -d chrome --wasm
# OR build with wasm
flutter build web --wasm
```
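With the default CDN setup, no constructor arguments are needed. A minimal sketch (the no-argument constructor is the same one used in the quick start below):

```dart
import 'package:llamadart/llamadart.dart';

// With no wllamaPath/wasmPath provided, the wllama runtime
// is fetched from the jsDelivr CDN.
final service = LlamaService();
```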
Offline / Bundled Usage (Optional):
- Download assets to your `assets/` directory:

```sh
dart run llamadart:download_wllama
```
- Add the folder to your `pubspec.yaml`:

```yaml
flutter:
  assets:
    - assets/wllama/single-thread/
```
- Initialize with local asset paths:
```dart
final service = LlamaService(
  wllamaPath: 'assets/wllama/single-thread/wllama.js',
  wasmPath: 'assets/wllama/single-thread/wllama.wasm',
);
```
macOS / iOS:
- Metal: Acceleration is enabled by default on physical devices.
- Simulator: Runs on CPU (x86_64 or arm64).
- Sandboxing: Add these entitlements to `macos/Runner/DebugProfile.entitlements` and `Release.entitlements` for network access (model downloading):

```xml
<key>com.apple.security.network.client</key>
<true/>
```
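For illustration, here is a minimal download helper using `dart:io` (the `downloadModel` name and its parameters are hypothetical; only `LlamaService.init` comes from this package):

```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

/// Hypothetical helper: streams a GGUF file to disk, then loads it.
/// On macOS, this needs the network-client entitlement shown above.
Future<void> downloadModel(LlamaService service, Uri url, String path) async {
  final client = HttpClient();
  final request = await client.getUrl(url);
  final response = await request.close();
  await response.pipe(File(path).openWrite());
  client.close();
  await service.init(path);
}
```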
Android:
- Architectures: arm64-v8a (most devices) and x86_64 (emulators).
- Vulkan: GPU acceleration is enabled by default on devices with Vulkan support.
- NDK: Requires Android NDK 26+ installed (usually handled by Android Studio).
GPU backends are enabled by default where available. Use the options below to customize.
Control GPU usage at runtime via `ModelParams`:

```dart
// Use GPU with automatic backend selection (default)
await service.init('model.gguf', modelParams: ModelParams(
  gpuLayers: 99, // Offload all layers to GPU
  preferredBackend: GpuBackend.auto,
));

// Force CPU-only inference
await service.init('model.gguf', modelParams: ModelParams(
  gpuLayers: 0, // No GPU offloading
  preferredBackend: GpuBackend.cpu,
));

// Request a specific backend (if compiled in)
await service.init('model.gguf', modelParams: ModelParams(
  preferredBackend: GpuBackend.vulkan,
));
```

Available backends: `auto`, `cpu`, `cuda`, `vulkan`, `metal`.
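If a requested backend turns out to be unavailable at runtime, one option is to retry on CPU. A sketch, assuming `init` throws when the backend fails (that behavior is an assumption, not documented above):

```dart
Future<void> initWithFallback(LlamaService service, String modelPath) async {
  try {
    // Try full GPU offload with automatic backend selection first.
    await service.init(modelPath,
        modelParams: ModelParams(gpuLayers: 99, preferredBackend: GpuBackend.auto));
  } catch (_) {
    // Assumed: init throws on backend failure; fall back to CPU-only.
    await service.init(modelPath,
        modelParams: ModelParams(gpuLayers: 0, preferredBackend: GpuBackend.cpu));
  }
}
```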
To disable GPU backends at build time:
Android (in `android/gradle.properties`):

```properties
LLAMA_DART_NO_VULKAN=true
```

Desktop (CMake flags):
```sh
# Disable CUDA
cmake -DLLAMA_DART_NO_CUDA=ON ...

# Disable Vulkan
cmake -DLLAMA_DART_NO_VULKAN=ON ...
```

Quick start:

```dart
import 'dart:io';

import 'package:llamadart/llamadart.dart';

void main() async {
  final service = LlamaService();
  try {
    // 1. Initialize with model path (GGUF)
    // On iOS/macOS, ensures Metal is used if available.
    await service.init('models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf');

    // 2. Generate text (streaming)
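    // Note: the prompt must follow the model's chat template;
    // the Gemma-style markers below are only an example.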
    final prompt = "<start_of_turn>user\nTell me a story about a llama.<end_of_turn>\n<start_of_turn>model\n";
    await for (final token in service.generate(prompt)) {
      stdout.write(token);
    }
  } finally {
    // 3. Always dispose to free native memory
    service.dispose();
  }
}
```
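For multi-turn conversations, a small helper can assemble the turn markers used above. A sketch (`buildPrompt` is a hypothetical name; adjust the markers to your model's chat template):

```dart
/// Hypothetical helper: builds a Gemma-style prompt from chat turns.
String buildPrompt(List<({String role, String content})> turns) {
  final buffer = StringBuffer();
  for (final turn in turns) {
    buffer.write('<start_of_turn>${turn.role}\n${turn.content}<end_of_turn>\n');
  }
  // Leave the model turn open so generation continues from here.
  buffer.write('<start_of_turn>model\n');
  return buffer.toString();
}
```

Passing the result to `service.generate` streams the reply exactly as in the quick start.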
Examples:
- Flutter Chat App: `example/chat_app` - A full-featured chat interface with real-time streaming, GPU acceleration support, and model management.
- Basic Console App: `example/basic_app` - A minimal example demonstrating model download and basic inference.
See CONTRIBUTING.md for detailed instructions on:
- Setting up the development environment.
- Building the native libraries.
- Running tests and examples.
License: MIT